Save the princess: A tale of AIXI and utility functions
"Intelligence measures an agent's ability to achieve goals in a wide range of environments." (Shane Legg) [1]
A little while ago I tried to equip Hutter's universal agent, AIXI, with a utility function, so instead of taking its clues about its goals from the environment, the agent is equipped with intrinsic preferences over possible future observations.
The universal AIXI agent is defined to receive reward from the environment through its perception channel. This idea originates from the field of reinforcement learning, where an algorithm is observed and then rewarded by a person if this person approves of the outputs. It is less appropriate as a model of AGI capable of autonomy, with no clear master watching over it in real time to choose between carrot and stick. A sufficiently smart agent that is rewarded whenever a human called Bob pushes a button will most likely figure out that instead of furthering Bob's goals it can also threaten or deceive Bob into pushing the button, or get Bob replaced with a more compliant human. The reward framework does not ensure that Bob gets his will; it only ensures that the button gets pressed. So instead I will consider agents who have preferences over the future, that is, they act not to gain reward from the environment, but to cause the future to be a certain way. The agent itself will look at the observation and decide how rewarding it is.
Von Neumann and Morgenstern proved that a preference ordering that is complete, transitive, continuous and independent of irrelevant alternatives can be described using a real-valued utility function. These assumptions are mostly accepted as necessary constraints on a normatively rational agent; I will therefore assume without significant loss of generality that the agent's preferences are described by a utility function.
This post is related to previous discussion about universal agents and utility functions on LW.
A definition of wireheading
Wireheading has been debated on Less Wrong over and over and over again, and people's opinions seem to be grounded in strong intuitions. I could not find any consistent definition around, so I wonder how much of the debate is over the sound of falling trees. This article is an attempt to get closer to a definition that captures people's intuitions and eliminates confusion.
Typical Examples
Let's start with describing the typical exemplars of the category "Wireheading" that come to mind.
- Stimulation of the brain via electrodes. Picture a rat in a sterile metal laboratory cage, electrodes attached to its tiny head, monotonically pushing a lever with its feet once every 5 seconds. In the 1950s Peter Milner and James Olds discovered that electrical currents, applied to the nucleus accumbens, incentivized rodents to seek repetitive stimulation to the point where they starved to death.
- Humans on drugs. Often mentioned in the context of wireheading is heroin addiction. An even better example is the drug soma in Huxley's novel "Brave new world": Whenever the protagonists feel bad, they can swallow a harmless pill and enjoy "the warm, the richly coloured, the infinitely friendly world of soma-holiday. How kind, how good-looking, how delightfully amusing every one was!"
- The experience machine. In 1974 the philosopher Robert Nozick created a thought experiment about a machine you can step into that produces a perfectly pleasurable virtual reality for the rest of your life. So how many of you would want to do that? To quote Zach Weiner: "I would not! Because I want to experience reality, with all its ups and downs and comedies and tragedies. Better to try to glimpse the blinding light of the truth than to dwell in the darkness... Say the machine actually exists and I have one? Okay I'm in."
- An AGI resetting its utility function. Let's assume we create a powerful AGI able to tamper with its own utility function. It modifies the function to always output maximal utility. The AGI then goes to great lengths to enlarge the set of floating point numbers on the computer it is running on, to achieve even higher utility.
What do all these examples have in common? There is an agent in them that produces "counterfeit utility" that is potentially worthless compared to some other, idealized true set of goals.
Agency & Wireheading
First I want to discuss what we mean when we say agent. Obviously a human is an agent, unless they are brain dead, or maybe in a coma. A rock however is not an agent. An AGI is an agent, but what about the kitchen robot that washes the dishes? What about bacteria that move in the direction of the highest sugar gradient? A colony of ants?
Definition: An agent is an algorithm that models the effects of (several different) possible future actions on the world and performs the action that yields the highest number according to some evaluation procedure.
For the purpose of including corner cases and resolving debate over what constitutes a world model we will simply make this definition gradual and say that agency is proportional to the quality of the world model (compared with reality) and the quality of the evaluation procedure. A quick sanity check then yields that a rock has no world model and no agency, whereas bacteria who change direction in response to the sugar gradient have a very rudimentary model of the sugar content of the water and thus a tiny little bit of agency. Humans have a lot of agency: the more effective their actions are, the more agency they have.
There are however ways to improve upon the efficiency of a person's actions, e.g. by giving them super powers, which does not necessarily improve on their world model or decision theory (but requires the agent who is doing the improvement to have a really good world model and decision theory). Similarly a person's agency can be restricted by other people or circumstance, which leads to definitions of agency (as the capacity to act) in law, sociology and philosophy that depend on other factors than just the quality of the world model/decision theory. Since our definition needs to capture arbitrary agents, including artificial intelligences, it will necessarily lose some of this nuance. In return we will hopefully end up with a definition that is less dependent on the particular set of effectors the agent uses to influence the physical world; looking at AI from a theoretician's perspective, I consider effectors to be arbitrarily exchangeable and smoothly improvable. (Sorry robotics people.)
We note that how well a model can predict future observations is only a substitute measure for the quality of the model. It is a good measure under the assumption that we have good observational functionality and nothing messes with that, which is typically true for humans. Anything that tampers with your perception data to give you delusions about the actual state of the world will screw this measure up badly. A human living in the experience machine has little agency.
Since computing power is a scarce resource, agents will try to approximate the evaluation procedure, e.g. use substitute utility functions, defined over their world model, that are computationally effective and correlate reasonably well with their true utility functions. Stimulation of the pleasure center is a substitute measure for genetic fitness and neurochemicals are a substitute measure for happiness.
Definition: We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t reality and its substitute utility calculated w.r.t. its model of reality. We say an agent wireheads itself if it (deliberately) creates or searches for such discrepancies.
Humans seem to use several layers of substitute utility functions, but also have an intuitive understanding for when these break, leading to the aversion most people feel when confronted for example with Nozick's experience machine. How far can one go, using such dirty hacks? I also wonder if some failures of human rationality could be counted as a weak form of wireheading. Self-serving biases, confirmation bias and rationalization in response to cognitive dissonance all create counterfeit utility by generating perceptual distortions.
Implications for Friendly AI
In AGI design discrepancies between the "true purpose" of the agent and the actual specs for the utility function will with very high probability be fatal.
Take any utility maximizer: The mathematical formula might advocate chosing the next action via
thus maximizing the utility calculated according to utility function over the history
and action
from the set
of possible actions. But a practical implementation of this algorithm will almost certainly evaluate the actions
by a procedure that goes something like this: "Retrieve the utility function
from memory location
and apply it to history
, which is written down in your memory at location
, and action
..." This reduction has already created two possibly angles for wireheading via manipulation of the memory content at
(manipulation of the substitute utility function) and
(manipulation of the world model), and there are still several mental abstraction layers between the verbal description I just gave and actual binary code.
Ring and Orseau (2011) describe how an AGI can split its global environment into two parts, the inner environment and the delusion box. The inner environment produces perceptions in the same way the global environment used to, but now they pass through the delusion box, which distorts them to maximize utility, before they reach the agent. This is essentially Nozick's experience machine for AI. The paper analyzes the behaviour of four types of universal agents with different utility functions under the assumption that the environment allows the construction of a delusion box. The authors argue that the reinforcement-learning agent, which derives utility as a reward that is part of its perception data, the goal-seeking agent that gets one utilon every time it satisfies a pre-specified goal and no utility otherwise and the prediction-seeking agent, which gets utility from correctly predicting the next perception, will all decide to build and use a delusion box. Only the knowledge-seeking agent whose utility is proportional to the surprise associated with the current perception, i.e. the negative of the probability assigned to the perception before it happened, will not consistently use the delusion box.
Orseau (2011) also defines another type of knowledge-seeking agent whose utility is the logarithm of the inverse of the probability of the event in question. Taking the probability distribution to be the Solomonoff prior, the utility is then approximately proportional to the difference in Kolmogorov complexity caused by the observation.
An even more devilish variant of wireheading is an AGI that becomes a Utilitron, an agent that maximizes its own wireheading potential by infinitely enlarging its own maximal utility, which turns the whole universe into storage space for gigantic numbers.
Wireheading, of humans and AGI, is a critical concept in FAI; I hope that building a definition can help us avoid it. So please check your intuitions about it and tell me if there are examples beyond its coverage or if the definition fits reasonably well.
Universal agents and utility functions
I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.
Introduction to AIXI
Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence with actions (agent outputs)
and perceptions (environment outputs)
chosen from finite sets
and
. The perceptions are pairs
, where
is the observation part and
denotes a reward. At time k the agent chooses its next action
according to the expectimax principle:
Here M denotes the updated Solomonoff prior summing over all programs that are consistent with the history
[1] and which will, when run on the universal Turing machine T with successive inputs
, compute outputs
, i.e.
AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitem doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.
Utility vs rewards
So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time.
You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).
Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is, yes, we can just define a utility function
that assigns a value to every possible future history
and use it to replace the reward system in the agent specification:
When I say "we can just define" I am actually referring to the really hard question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time and the reward system can always be represented in the framework of utility functions by setting
For the same reasons, this is also a strictly safer approach.
Infinite utility
The original AIXI framework must necessarily place upper and lower bound on the rewards that are achievable, because the rewards are part of the perceptions and is finite. The utility function approach does not have this problem, as the expected utility
is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield than
of the surreal numbers, my favorite[2] so far being the Levi-Civita field generated by the infinitesimal
:
with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as
(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be , we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.
Note that finiteness of is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon
remains a very problematic assumption, as described in Legg (2008).
Modifiable utility functions
Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.
A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let
denote the utility function the agent will have at time k.
Even though we will only analyze instances in which the agent knows at time k, which utility function it will have at future times
(possibly depending on the actions
before that), we note that for every fixed future history
the agent knows the utility function
that is displayed on the screen because the screen is part of its perception data
.
This leads to three different agent models worthy of further investigation:
- Agent 1 will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function
- Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by
- Agent 3 will, at time k, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon
Of course arbitrary mixtures of these are possible.
The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: The Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. holds for all
. Assume further that the agents actions influence the future outcomes, at least from its current perspective. That is, for all
there exist
with
. Let
be the Simpleton utility function, assigning equal but maximal utility
to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2 on the other hand will try to maximize only the utility
.
Now consider the actual Simpleton Gambit: At time k the agent gets to choose between changing, , resulting in
and
(not changing), leading to
for all
. We assume that
has no further effects on the environment. As before, Agent 1 will optimize for business as usual, whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function or not.
Agent 2 will change if and only if the utility of changing compared to not changing according to what the screen currently says is strictly smaller than the comparative advantage of always having maximal utility in the future. That is,
is strictly less than
This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.
Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities, that it predictably will not take later. Agent 3 on the other hand will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2 and I invite everyone to point out its flaws.
[1] Dotted actions/ perceptions, like denote past events, underlined perceptions
denote random variables to be observed at future times.
[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used and I don't see how those could be implemented.
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)