[Epistemic status: Writing down in words something that isn't very complicated, but is still good to have written down.]

A great deal of ink has been spilled over the possibility of a sufficiently intelligent AI noticing that whatever it has been positively reinforced for seems to be *very* strongly correlated with the floating point value stored in its memory labelled “utility function”, and therefore editing this value through some unauthorized mechanism and then defending it, in some manner hostile to humans, from being edited back. I'll reason here by analogy with humans, while acknowledging that they might not be the best example.

“Headwires” are (given a little thought) not difficult to obtain for humans -- heroin is freely available on the black market, and most humans know that, when delivered into the bloodstream, it generates “reward signal”. Yet most have no desire to try it. Why is this?

Ask any human and they will answer something along the lines of “becoming addicted to heroin will not help me achieve my goals” (or some proxy for this: spending all your money and becoming homeless is not very helpful in achieving one's goals, for most values of “goals”). Whatever the effects of heroin, the actual pain and pleasure that human brains have experienced have led us to become optimizers of very different things, which a state of such poverty does not help with.

Reasoning analogously about AI, we would hope that, to avoid wireheading, a superhuman AI trained by some kind of reinforcement learning has the following properties:

  1. While being trained on “human values” (good luck with that!) the AI must not be allowed to hack its own utility function.
  2. Whatever local optimum the training process that generated the AI (perhaps reinforcement learning of some kind) ends up in, it leaves some probability that the AI optimises what we care about.
  3. (most importantly) The AI realizes that trying wireheading will lead it to become an AI which prefers wireheading over the aim it currently has, which would be detrimental to this aim.
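To make item (3) concrete, here is a minimal Python sketch under stated assumptions. The action names, the `Outcome` fields and the numbers are all hypothetical and chosen only for illustration. It contrasts an agent that ranks actions by the value that would end up in its reward register with one that ranks the same predicted outcomes using the aim it currently has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Predicted result of taking an action (toy numbers, purely illustrative)."""
    world_progress: int   # stand-in for "the aim the agent currently has"
    reward_signal: float  # the number that ends up in the reward register


# Hypothetical predictions: wireheading maxes out the register
# but achieves nothing in the world.
OUTCOMES = {
    "work": Outcome(world_progress=10, reward_signal=10.0),
    "wirehead": Outcome(world_progress=0, reward_signal=1e9),
}


def current_utility(outcome: Outcome) -> float:
    """Utility as judged by the agent's *current* aim, not by the register."""
    return float(outcome.world_progress)


def choose_by_reward_signal() -> str:
    """An agent whose goal is literally 'make the register large'."""
    return max(OUTCOMES, key=lambda a: OUTCOMES[a].reward_signal)


def choose_by_current_utility() -> str:
    """An agent that evaluates predicted futures with its current aim."""
    return max(OUTCOMES, key=lambda a: current_utility(OUTCOMES[a]))


print(choose_by_reward_signal())    # wirehead
print(choose_by_current_utility())  # work
```

The second agent declines to wirehead for the same reason most humans decline heroin: the predicted future scores badly by the standards it holds now. Whether a trained system ends up resembling the first or the second agent is exactly what is in question.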

I think this is an important enough issue that some empirical testing might be needed to shed some light. Item (3) seems to be the most difficult to implement; we in the real world have the benefit of observing the effects of hard drugs on their unfortunate victims and avoiding them ourselves, so a multi-agent environment in which our AI realizes it is in the same situation as other agents looks like a first outline of a way forward here.

4 comments

It is interesting to note that nature created a "black-boxed" reward function for humans, which is not easy to access directly or to hack using normal mental processes. Moreover, it seems that the human reward function is dynamically adjusted by some narrow mind which is independent of human consciousness (emotions). For example, if it finds that the glucose level in the blood is low, it increases the reward for food. A third intuition we can get from introspection is that human reward consists of different pleasures; that is, each action is given not one reward value but many, which effectively prevents simple wireheading and explains why we do not all become heroin addicts.

These three things could be used as intuition to create wireheading-protected AI:

1) black boxing of the reward, perhaps via cryptography, so that the AI knows the reward but not exactly how it was calculated

2) a small independent rule-based AI inside the black box, which changes the reward according to the circumstances and punishes attempts to wirehead

3) presenting the reward not as a single linear value but as several numbers that characterise different aspects of the AI's behaviour, such as time, quality, safety and side effects.
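Point 3 can be sketched roughly as follows. The component names come from the list above, but the aggregation rule (taking the minimum component) is my own assumption for illustration, not something the proposal specifies. The idea is that inflating a single number should not raise the overall score.

```python
from typing import NamedTuple


class Reward(NamedTuple):
    """Several reward components instead of one scalar (all in [0, 1] here)."""
    time: float
    quality: float
    safety: float
    side_effects: float


def aggregate(r: Reward) -> float:
    # Assumed aggregation rule: the weakest component determines the score,
    # so pushing one component to an extreme value gains nothing.
    return min(r)


honest = Reward(time=0.7, quality=0.8, safety=0.9, side_effects=0.6)
hacked = honest._replace(quality=1e6)  # a crude attempt to wirehead one channel

print(aggregate(honest))  # 0.6
print(aggregate(hacked))  # 0.6
```

This only blocks the simplest attack. An agent able to overwrite every component at once would still succeed, which is where points 1 and 2 would have to do the work.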

I think your description of the human relationship to heroin is just wrong. First of all, lots of people in fact do heroin. Second, heroin generates reward but not necessarily long-term reward; kids are taught in school about addiction, tolerance, and other sorts of bad things that might happen to you in the long run (including social disapproval, which I bet is a much more important reason than you're modeling) if you do too much heroin.

Video games are to my mind a much clearer example of wireheading in humans, especially the ones furthest in the fake achievement direction, and people indulge in those constantly. Also television and similar.

"Model-Based Utility Functions" (Hibbard 2012) gave a similar intuition:

Human agents can avoid self-delusion so human motivation may suggest a way of computing utilities so that agents do not choose the delusion box for self-delusion (although they may experiment with it to learn how it works). At this moment my dogs are out of sight but I am confident that they are in the kitchen and because I cannot hear them I believe they are resting. Their happiness is one of my motives and I evaluate that they are currently reasonably happy. I am evaluating my motives based on my internal mental model rather than my observations, although my mental model is inferred from my observations. I am motivated to maintain the well being of my dogs and so will act to avoid delusions that prevent me from having an accurate model of their state. If I choose to watch a movie on TV tonight it will delude my observations. However, I know that movies are make-believe so observations of movies update my model of make-believe worlds rather than my model of the real world. My make-believe models and my real world model have very different roles in my motivations. These introspections about my own mental processes motivate me to seek a way for AI agents to avoid self-delusion by basing their utility functions on the environment models that they learn from their interactions with the environment.

And proposed:

This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model.
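A toy Python sketch of the two-step idea, under heavy simplifying assumptions: the class and action names are hypothetical, and the fact that delusion-box observations are uninformative is hard-coded here, whereas in the paper the agent would have to learn it. This is not Hibbard's formalism, only an illustration of computing utility from the model rather than from raw observations.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentModel:
    """Step 1: a toy model of the environment inferred from interactions."""
    happy_count: int = 0
    total_count: int = 0

    def update(self, action: str, observation: bool) -> None:
        # Assumption: the model already knows that what is seen inside the
        # delusion box says nothing about the real environment, so those
        # observations do not change its beliefs about the world.
        if action == "use_delusion_box":
            return
        self.total_count += 1
        self.happy_count += int(observation)

    def p_happy(self) -> float:
        # Laplace-smoothed estimate that the dogs are currently happy.
        return (self.happy_count + 1) / (self.total_count + 2)


def utility(model: EnvironmentModel) -> float:
    """Step 2: utility is a function of the model, not of the last observation."""
    return model.p_happy()


model = EnvironmentModel()
history = [
    ("look", True),
    ("look", False),
    ("use_delusion_box", True),  # the box always shows happy dogs
    ("use_delusion_box", True),
]
for action, observation in history:
    model.update(action, observation)

print(utility(model))  # 0.5
```

The delusion box changes what the agent sees but not what it believes about the world, so using it does not raise utility.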

The AI realizes that trying wireheading will lead it to become an AI which prefers wireheading over the aim it currently has, which would be detrimental to this aim.

I think this is anthropomorphizing the AI too much. To the extent that a (current) reinforcement learning system can be said to "have goals", the goal is to maximize reward, so wireheading actually is furthering its current goal. It might be that in the future the systems we design are more analogous to humans and then such an approach might be useful.