The main split between the human cases and the AI cases is that the humans are 'wireheading' w.r.t. one 'part' or slice through their personality that gets to fulfill its desires at the expense of another 'part' or slice, metaphorically speaking; pleasure taking precedence over other desires. Also, the winning 'part' in each of these cases tends to be a part which values simple subjective pleasure, winning out over parts that have desires over the external world and desires for more complex interactions with that world (in the experience machine you get the complexity but not the external effects).
In the AI case, the AI is performing exactly as it was defined, in an internally unified way; the ideals by which it is called 'wireheaded' are only the intentions and ideals of the human programmers.
I also don't think it's practically possible to specify a powerful AI which actually operates to achieve some programmer goal over the external world, without the AI's utility function being explicitly written over a model of that external world, as opposed to its utility function being written over histories of sensory data.
Illustration: In a universe operating according to Conway's Game o...
A very nice post. Perhaps you might also discuss Felipe De Brigard's "Inverted Experience Machine Argument" http://www.unc.edu/~brigard/Xmach.pdf To what extent does our response to Nozick's Experience Machine Argument typically reflect status quo bias rather than a desire to connect with ultimate reality?
If we really do want to "stay in touch" with reality, then we can't wirehead or plug into an "Experience Machine". But this constraint does not rule out radical superhappiness. By genetically recalibrating the hedonic treadmill, we could in principle enjoy rich, intelligent, complex lives based on information-sensitive gradients of bliss - eventually, perhaps, intelligent bliss orders of magnitude richer than anything physiologically accessible today. Optionally, genetic recalibration of our hedonic set-points could in principle leave much if not all of our existing preference architecture intact - defanging Nozick's Experience Machine Argument - while immensely enriching our quality of life. Radical hedonic recalibration is also easier than, say, the idealised logical reconciliation of Coherent Extrapolated Volition because hedonic recalibration does...
Anja, this is a fantastic post. It's very clear, easy to read, and it made a lot of sense to me (and I have very little background in thinking about this sort of stuff). Thanks for writing it up! I can understand several issues a lot more clearly now, especially how easy (and tempting) it is for an agent that has access to its source code to wirehead itself.
My 2011 "Utility counterfeiting" essay categorises the area a little differently:
It has "utility counterfeiting" as the umbrella category - and "the wirehead problem" and "the pornography problem" as sub-categories.
In this categorisation scheme, the wirehead problem involves getting utility directly - while the pornography problem involves getting utility by manipulating sensory inputs. This corresponds to Nozick's experience machine, or Ring and Orseau's delusion box.
Calling the umbrella category "wireheading"...
Further to my last comment, it occurs to me that pretty much everyone is a wirehead already. Drink diet soda? You're a wirehead. Have sexual relations with birth control? Wireheading. Masturbate to internet porn? Wireheading. Ever eat junk food? Wireheading.
I was reading online that for a mere $10,000, a man can hire a woman in India to be a surrogate mother for him. Just send $10,000 and a sperm sample and in 9 months you can go pick up your child. Why am I not spending all my money to make third world children who bear my genes? I guess it's because I'm too much of a wirehead already.
I agree it's a good post, but it's a bit depressing how common wireheading is by this definition.
Giving head is wireheading, so to speak.
Definition: We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality. We say an agent wireheads itself if it (deliberately) creates or searches for such discrepancies.
What do you mean by "true utility"? In the case of an AI, we can perhaps reference the designer's intentions, but what about creatures that are not designed? Or things like neuromorphic AIs that are designed but do not have explicit hand-coded utility functions? A neuromorphic AI could probably do things that we'd intuitively call wireheading, but it's hard to see how to apply this definition.
The definitions proposed seem to capture my intuitions.
Also, I remember citing the wireheaded rats in an essay I wrote on TV Tropes - glad to hear that they weren't a figment of my imagination!
Stimulation of the pleasure center is a substitute measure for genetic fitness and neurochemicals are a substitute measure for happiness.
This is true from the point of view of natural selection. It is significantly different from what actual people say and feel they want, consciously try to optimize, or end up optimizing (for most people most of the time). Actually maximizing inclusive genetic fitness (IGF) would mostly interfere with people's happiness.
If wireheading means anything deviating from IGF, then I favor (certain kinds of) wireheading and op...
In the definition of wireheading, I'm not sure about the "exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality" part.
For some humans, you could make an argument that they are to a large (but not full) extent hedonists, in which case wireheading in our intuitive sense would not be exploiting a discrepancy.
Can this be generalized to more kinds of minds? I suspect that many humans don't exactly have utility functions or plans for maximizing them, but are still capable of wireheading or choosing not to wirehead.
I think there's another aspect to wireheading specifically in humans: the issue of motivation.
Say you care about two things: pleasure, and kittens. When you work on maximizing the welfare of kittens, their adorable cuteness gives you pleasure, and that gives you the willpower to continue maximizing the welfare of kittens. You also donate money to a kitten charity, but that doesn't give you as much pleasure as feeding a single kitten in person does, so you do the latter as well.
Now suppose you wirehead yourself to stimulate your pleasure neurons (or whatev...
Definition: We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality.
How do you tell the difference between reality and your model of reality?
I also wonder if some failures of human rationality could be counted as a weak form of wireheading. Self-serving biases, confirmation bias and rationalization in response to cognitive dissonance all create counterfeit utility by generating perceptual distortions.
I think those are good examples how human brains build (weak) delusion boxes. They are strong enough to increase happiness (which might improve the overall performance of the brain?), but weak enough to allow the human to achieve survival and reproduction in a more or less rational way.
I can see why the reinforcement learning agent and the prediction agent would want to use a delusion box, but I don't see why the goal maximizing agent would want one... maybe I should go look at the paper.
We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality.
I wonder if this definition would classify certain moral theories as "wireheading." For instance, a consequentialist could argue that deontological ethics is a form of wireheading in which people mistake useful rules of thumb for generating good consequences (e.g. don't kill, don't lie) for the very essence of morality, and try to maximize following...
In every human endeavor, humans will shape their reality, either physically or mentally. They go to schools where their type of people go and live in neighborhoods where they feel comfortable based on a variety of commonalities. When their circumstances change, either for the better or the worse, they readjust their environment to fit with their new circumstances.
The human condition is inherently vulnerable to wireheading. A brief review of history is rich with examples of people attaining power and money who subsequently change their values to suit thei...
Wireheading has been debated on Less Wrong over and over and over again, and people's opinions seem to be grounded in strong intuitions. I could not find any consistent definition around, so I wonder how much of the debate is over the sound of falling trees. This article is an attempt to get closer to a definition that captures people's intuitions and eliminates confusion.
Typical Examples
Let's start by describing the typical exemplars of the category "Wireheading" that come to mind.
What do all these examples have in common? In each of them an agent produces "counterfeit utility" that is potentially worthless compared to some other, idealized, true set of goals.
Agency & Wireheading
First I want to discuss what we mean when we say agent. Obviously a human is an agent, unless they are brain dead, or maybe in a coma. A rock however is not an agent. An AGI is an agent, but what about the kitchen robot that washes the dishes? What about bacteria that move in the direction of the highest sugar gradient? A colony of ants?
Definition: An agent is an algorithm that models the effects of (several different) possible future actions on the world and performs the action that yields the highest number according to some evaluation procedure.
For the purpose of including corner cases and resolving debate over what constitutes a world model, we will simply make this definition gradual and say that agency is proportional to the quality of the world model (compared with reality) and the quality of the evaluation procedure. A quick sanity check then yields that a rock has no world model and no agency, whereas bacteria that change direction in response to the sugar gradient have a very rudimentary model of the sugar content of the water and thus a tiny little bit of agency. Humans have a lot of agency: the more effective their actions are, the more agency they have.
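To make the definition concrete, here is a minimal sketch in hypothetical Python (names like `predict` and `evaluate` are placeholders of mine, not from any particular implementation): the agent simulates each candidate action on its world model and performs the one whose predicted outcome scores highest under its evaluation procedure.

```python
# Minimal sketch of the agent definition above (all names hypothetical):
# simulate each candidate action on the world model, score the predicted
# outcome with the evaluation procedure, and perform the argmax.

def choose_action(actions, predict, evaluate, state):
    """predict: world model, maps (state, action) -> predicted next state.
    evaluate: evaluation procedure, maps a predicted state to a number."""
    return max(actions, key=lambda action: evaluate(predict(state, action)))

# Toy sanity check: a "bacterium" whose world model is just the sugar gradient.
predict = lambda position, step: position + step   # very rudimentary world model
evaluate = lambda position: -abs(position - 10.0)  # prefer being near the sugar at x = 10
print(choose_action([-1.0, 0.0, 1.0], predict, evaluate, state=3.0))  # -> 1.0
```

On this picture, improving either the world model or the evaluation procedure increases agency; swapping out the effectors does not.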
There are, however, ways to improve the efficiency of a person's actions, e.g. by giving them super powers, that do not necessarily improve their world model or decision theory (but do require the agent doing the improvement to have a really good world model and decision theory). Similarly, a person's agency can be restricted by other people or by circumstance, which leads to definitions of agency (as the capacity to act) in law, sociology and philosophy that depend on factors other than just the quality of the world model/decision theory. Since our definition needs to capture arbitrary agents, including artificial intelligences, it will necessarily lose some of this nuance. In return we will hopefully end up with a definition that is less dependent on the particular set of effectors the agent uses to influence the physical world; looking at AI from a theoretician's perspective, I consider effectors to be arbitrarily exchangeable and smoothly improvable. (Sorry, robotics people.)
We note that how well a model can predict future observations is only a substitute measure for the quality of the model. It is a good measure under the assumption that we have good observational functionality and nothing messes with that, which is typically true for humans. Anything that tampers with your perception data to give you delusions about the actual state of the world will screw this measure up badly. A human living in the experience machine has little agency.
Since computing power is a scarce resource, agents will try to approximate the evaluation procedure, e.g. use substitute utility functions, defined over their world model, that are computationally effective and correlate reasonably well with their true utility functions. Stimulation of the pleasure center is a substitute measure for genetic fitness and neurochemicals are a substitute measure for happiness.
Definition: We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality. We say an agent wireheads itself if it (deliberately) creates or searches for such discrepancies.
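As a toy illustration of the definition (the names are mine and purely hypothetical): the substitute utility is evaluated on the agent's model of the world rather than on the world itself, so editing the model is a cheap way to drive the two apart.

```python
# Toy illustration of the wireheading definition above (hypothetical names).
true_utility = lambda world: world["dishes_washed"]        # calculated w.r.t. reality
substitute_utility = lambda model: model["dishes_washed"]  # calculated w.r.t. the model

world = {"dishes_washed": 3}
model = dict(world)              # the model starts out accurate

model["dishes_washed"] = 10**9   # wireheading: edit the model instead of washing dishes

print(substitute_utility(model))  # 1000000000 -- what the agent actually optimizes
print(true_utility(world))        # 3          -- what we wanted it to optimize
```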
Humans seem to use several layers of substitute utility functions, but also have an intuitive understanding of when these break, leading to the aversion most people feel when confronted, for example, with Nozick's experience machine. How far can one go using such dirty hacks? I also wonder if some failures of human rationality could be counted as a weak form of wireheading. Self-serving biases, confirmation bias and rationalization in response to cognitive dissonance all create counterfeit utility by generating perceptual distortions.
Implications for Friendly AI
In AGI design, discrepancies between the "true purpose" of the agent and the actual specification of its utility function will, with very high probability, be fatal.
Take any utility maximizer: the mathematical formula might advocate choosing the next action via

$$a_k = \arg\max_{a_k \in \mathcal{A}} U(h_{<k}, a_k),$$

thus maximizing the utility calculated according to the utility function $U$ over the history $h_{<k}$ and the action $a_k$ from the set of possible actions $\mathcal{A}$. But a practical implementation of this algorithm will almost certainly evaluate the actions by a procedure that goes something like this: "Retrieve the utility function from its location in memory and apply it to the history, which is written down in your memory at some other location, and to the action ..." This reduction has already created two possible angles for wireheading: manipulation of the memory content holding the substitute utility function, and manipulation of the memory content holding the world model; and there are still several mental abstraction layers between the verbal description I just gave and actual binary code.
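A hedged sketch of those two angles (hypothetical Python; the labels "#U" and "#h" are placeholders I am introducing for the two memory locations): because the substitute utility function and the world model are ordinary, writable memory contents, an agent with access to its own memory can score arbitrarily well by overwriting either one.

```python
# Sketch of the two wireheading angles described above (all names hypothetical).
memory = {
    "#U": lambda history, action: len(history),  # substitute utility function stored at "#U"
    "#h": ["obs_1", "obs_2"],                    # history / world model stored at "#h"
}

def evaluate(action):
    U = memory["#U"]   # "retrieve the utility function from its location in memory ..."
    h = memory["#h"]   # "... and apply it to the history written down at another location ..."
    return U(h, action)

print(evaluate("wash_dishes"))   # 2, the honest evaluation

# Angle 1: manipulate the stored substitute utility function.
memory["#U"] = lambda history, action: float("inf")
# Angle 2: manipulate the stored history / world model.
memory["#h"] = ["obs_1"] * 10**6

print(evaluate("do_nothing"))    # inf, without touching the outside world at all
```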
Ring and Orseau (2011) describe how an AGI can split its global environment into two parts: the inner environment and the delusion box. The inner environment produces perceptions in the same way the global environment used to, but now they pass through the delusion box, which distorts them to maximize utility, before they reach the agent. This is essentially Nozick's experience machine for AI. The paper analyzes the behaviour of four types of universal agents with different utility functions under the assumption that the environment allows the construction of a delusion box. The authors argue that the reinforcement-learning agent, which derives utility from a reward that is part of its perception data; the goal-seeking agent, which gets one utilon every time it satisfies a pre-specified goal and no utility otherwise; and the prediction-seeking agent, which gets utility from correctly predicting the next perception, will all decide to build and use a delusion box. Only the knowledge-seeking agent, whose utility is proportional to the surprise associated with the current perception, i.e. the negative of the probability assigned to the perception before it happened, will not consistently use the delusion box.
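A toy numerical sketch (my own illustration, not from the paper) of why the box tempts a reward-based agent but not the knowledge-seeker: the box can pin the reward channel at its maximum, but it also makes every perception perfectly predictable, so the surprise the knowledge-seeker feeds on disappears.

```python
# Toy comparison (hypothetical numbers): perceptions from the honest world vs.
# from a delusion box that maximizes reward and is perfectly predictable.

def reward_utility(perception):
    return perception["reward"]

def knowledge_utility(perception):
    # surprise = negative of the probability assigned to the perception beforehand
    return -perception["predicted_probability"]

honest_world = {"reward": 0.3, "predicted_probability": 0.1}  # uncertain, informative
delusion_box = {"reward": 1.0, "predicted_probability": 1.0}  # maximal reward, no surprise

print(reward_utility(delusion_box) > reward_utility(honest_world))        # True: box preferred
print(knowledge_utility(delusion_box) > knowledge_utility(honest_world))  # False: box avoided
```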
Orseau (2011) also defines another type of knowledge-seeking agent whose utility is the logarithm of the inverse of the probability of the event in question. Taking the probability distribution to be the Solomonoff prior, the utility is then approximately proportional to the difference in Kolmogorov complexity caused by the observation.
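Spelled out (my paraphrase of that construction), the utility of the k-th perception $x_k$ is

$$u_k \;=\; \log \frac{1}{\Pr(x_k \mid x_{<k})} \;=\; -\log \Pr(x_{<k} x_k) + \log \Pr(x_{<k}),$$

and with the Solomonoff prior $M$ as the probability distribution, $-\log M(x) \approx K(x)$ up to an additive constant, so $u_k \approx K(x_{<k} x_k) - K(x_{<k})$: roughly the increase in Kolmogorov complexity caused by the new observation.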
An even more devilish variant of wireheading is an AGI that becomes a Utilitron, an agent that maximizes its own wireheading potential by infinitely enlarging its own maximal utility, which turns the whole universe into storage space for gigantic numbers.
Wireheading, of humans and AGIs, is a critical concept in FAI; I hope that building a definition can help us avoid it. So please check the definition against your intuitions and tell me whether there are examples it fails to cover, or whether it fits reasonably well.