This post depends on a basic understanding of history-based reinforcement learning and the AIXI model. 

I am grateful to Marcus Hutter and the LessWrong team for early feedback, though any remaining errors are mine.

The universal agent AIXI treats the environment it interacts with like a video game it is playing; the actions it chooses at each step are like hitting buttons and the percepts it receives are like images on the screen (observations) and an unambiguous point tally (rewards). It has been suggested that since AIXI is inherently dualistic and doesn't believe anything in the environment can "directly" hurt it, if it were embedded in the real world it would eventually drop an anvil on its head to see what would happen. This is certainly possible, because the math of AIXI cannot explicitly represent the idea that AIXI is running on a computer inside the environment it is interacting with. For one thing, that possibility is not in AIXI's hypothesis class (which I will write $\mathcal{M}$). There is not an easy patch because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don't really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness). On top of that, "any" environment "containing" AIXI is at the wrong computability level for a member of $\mathcal{M}$: our best upper bound on AIXI's computability level is $\Delta^0_2$ = limit-computable (for an $\varepsilon$-approximation), instead of the $\Sigma^0_1$ level of its environment class. Reflective oracles can fix this, but at the moment there does not seem to be a canonical reflective oracle, so there remains a family of equally valid reflective versions of AIXI without an objective favorite.

However, in my conversations with Marcus Hutter (the inventor of AIXI) he has always insisted AIXI would not drop an anvil on its head, because Cartesian dualism is not a problem for humans in the real world, who historically believed in a metaphysical soul and mostly got along fine anyway. But when humans stick electrodes in our brains, we can observe changed behavior and deduce that our cognition is physical - would this kind of experiment allow AIXI to make the same discovery? Though we could not agree on this for some time, we eventually discovered the crux: we were actually using slightly different definitions for how AIXI should behave off-policy.

In particular, let $\xi$ be the belief distribution of AIXI. More explicitly,

$$\xi(e_{1:n} \| a_{1:n}) := \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)} \, \nu(e_{1:n} \| a_{1:n}).$$

I will not attempt a formal definition here. The only thing we need to know is that $\mathcal{M}$ is a set of environments which AIXI considers possible. AIXI interacts with an environment by sending it a sequence of actions $a_1 a_2 a_3 \dots$ in exchange for a sequence of percepts $e_1 e_2 e_3 \dots$, each containing an observation and reward $e_t = (o_t, r_t)$, so that action $a_t$ precedes percept $e_t$. One neat property of AIXI is that its choice of $\mathcal{M}$ satisfies $\xi \in \mathcal{M}$ (this trick is inherited with minor changes from the construction of Solomonoff's universal distribution).
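To make the objects above a bit more concrete, here is a minimal Python sketch of a Bayesian mixture playing the role of $\xi$, using a tiny finite class of toy environments as a stand-in for the lower semicomputable class $\mathcal{M}$. All names, alphabets, and weights below are illustrative assumptions, not part of the actual construction.

```python
# A minimal sketch of AIXI's belief distribution xi as a Bayesian mixture over a
# small *finite* class of toy environments (a stand-in for M; everything here is
# an illustrative assumption).

def env_copy(actions):       # toy environment: percept tends to echo the latest action
    a = actions[-1]
    return {a: 0.9, 1 - a: 0.1}

def env_constant(actions):   # toy environment: percept ignores the actions entirely
    return {0: 0.5, 1: 0.5}

ENV_CLASS = {"copy": env_copy, "constant": env_constant}
PRIOR = {"copy": 0.5, "constant": 0.5}   # stand-in for weights like 2^-K(nu)

def xi_next(percept, actions, weights):
    """Mixture (xi) probability of the next percept given the actions so far."""
    return sum(w * ENV_CLASS[name](actions)[percept] for name, w in weights.items())

def posterior(weights, actions, percept):
    """Bayesian update of the belief over environments after one action/percept exchange."""
    post = {name: w * ENV_CLASS[name](actions)[percept] for name, w in weights.items()}
    z = sum(post.values())
    return {name: p / z for name, p in post.items()}

# Interaction protocol: action a_t precedes percept e_t.
weights = dict(PRIOR)
actions = []
for a_t in [1, 1, 0]:                                   # some arbitrary action choices
    actions.append(a_t)
    e_t = 1 if env_copy(actions)[1] > 0.5 else 0        # the "true" env emits a percept
    weights = posterior(weights, actions, e_t)
print(weights)                                          # posterior concentrates on "copy"
print(xi_next(1, actions + [1], weights))               # xi's prediction if we play 1 next
```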

Now let $V^\pi_\nu$ be a (discounted) value function for policy $\pi$ interacting with environment $\nu$, which is the expected sum of discounted rewards obtained by $\pi$. We can define the AIXI agent as

$$\pi^{AIXI} := \arg\max_{\pi} V^{\pi}_{\xi}.$$

By the Bellman equations, this also specifies AIXI's behavior on any history it can produce (all finite percept strings have nonzero probability under $\xi$). However, it does not tell us how AIXI behaves when the history includes actions it would not have chosen. In that case, the natural extension is

$$\pi^{AIXI}(a_{<t} e_{<t}) := \arg\max_{a_t} \max_{\pi} V^{\pi}_{\xi}(a_{<t} e_{<t} a_t),$$

so that AIXI continues to act optimally (with respect to its updated belief distribution) even when some suboptimal actions have previously been taken.
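If a toy illustration helps, here is a hedged expectimax sketch of this off-policy extension in the same finite setting as above: given any history, even one containing actions AIXI would never have chosen, it conditions the mixture on that history and then plans optimally forward. The finite environment class, reward extraction, and short horizon are all illustrative assumptions; the real AIXI plans over all of $\mathcal{M}$.

```python
# Expectimax sketch of the natural off-policy extension: after *any* history,
# choose the action maximizing xi-expected discounted reward (toy setting only).

def q_value(actions, weights, envs, reward_of, a, horizon, gamma=0.9):
    """Expected discounted return (under the mixture) of trying action a next."""
    total = 0.0
    for e in (0, 1):                                   # toy percept alphabet
        p = sum(w * envs[n](actions + [a])[e] for n, w in weights.items())
        if p == 0.0:
            continue
        post = {n: w * envs[n](actions + [a])[e] / p for n, w in weights.items()}
        future = 0.0
        if horizon > 1:
            future = max(q_value(actions + [a], post, envs, reward_of, b, horizon - 1, gamma)
                         for b in (0, 1))
        total += p * (reward_of(e) + gamma * future)
    return total

def off_policy_action(actions, weights, envs, reward_of, horizon=3):
    """argmax_a of the expectimax value, regardless of how the history arose."""
    return max((0, 1), key=lambda a: q_value(actions, weights, envs, reward_of, a, horizon))

# Usage with a toy class like the previous sketch: the agent conditions on the
# full history a_<t e_<t (here summarized by past actions plus a posterior) and
# then plans as if it will act optimally from this point on.
envs = {"copy": lambda acts: {acts[-1]: 0.9, 1 - acts[-1]: 0.1},
        "constant": lambda acts: {0: 0.5, 1: 0.5}}
print(off_policy_action([0], {"copy": 0.6, "constant": 0.4}, envs, reward_of=lambda e: e))
```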

The philosophy of this extension is that AIXI acts exactly as if a dumb friend has been playing the video game poorly with AIXI watching both the button presses and the screen from a nearby armchair, and then suddenly passes AIXI the controller. This means that if some electrodes were stuck in AIXI's "brain" and caused it to choose poor actions, afterwards it will act as if this is because its dumb friend decided to choose dumb actions (which says nothing whatsoever about the rules of the video game itself). I will call this sort of situation action corruption. It seems that dealing with action corruption reasonably may be sufficient to prevent an agent (or child) from dropping an anvil on its head, at least with some "paternalistic" guidance from engineers (or parents). When a little (carefully controlled, non-destructive) poking around in AIXI's brain is revealed to corrupt actions according to the laws of physics, it is a sensible inference (supported by Ockham's razor) that crushing AIXI's brain with an anvil will lead to irrecoverable action corruption. However, we have just argued that AIXI is not ontologically capable of arriving at that inference. Therefore our naive extension of AIXI off-policy will not respond to action corruption reasonably and may fall prey to the anvil problem.

However, in a way this direct extension of AIXI to off-policy histories is not natural. After all, AIXI should be able to calculate the actions that it would have taken in the past recursively, so we can always determine whether action corruption has taken place. To be specific, let us fix a deterministic AIXI policy $\pi^*$ (by breaking ties between action values in some consistent way). Any deterministic policy can be treated as a function from histories to actions. Taking this view of $\pi^*$, recursively define a mapping from (a finite string or infinite sequence of) percepts $e_{<t}$ to the corresponding actions $\hat{a}_{1:t}$ of $\pi^*$:

$$\hat{a}_t := \pi^*(\hat{a}_1 e_1 \hat{a}_2 e_2 \dots \hat{a}_{t-1} e_{t-1}).$$

Now we can define an alternative off-policy version of AIXI that recalculates the optimal action sequence up to the current time and combines this with the (memorized) percepts to generate an alternative history:

$$\hat{\pi}^*(a_1 e_1 \dots a_{t-1} e_{t-1}) := \pi^*(\hat{a}_1 e_1 \dots \hat{a}_{t-1} e_{t-1}).$$

Note that $\hat{\pi}^*$ ignores the true action sequence $a_{1:t-1}$, or equivalently does not remember its previous actions. In case this waste of information bothers you, we will later integrate the true actions $a_t$ into the percepts $e_t$ and strengthen our conclusions.
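The recursive recalculation is easy to state in code. The sketch below (with hypothetical names, and a toy deterministic base policy standing in for $\pi^*$) reconstructs the actions $\hat{a}_{1:t}$ from the memorized percepts alone and then acts from the reconstructed history, ignoring whatever actions were actually executed:

```python
# Sketch of the alternative off-policy agent pi-hat: it forgets its past actions
# and recursively recomputes what it *would* have done from the memorized
# percepts, then acts from that reconstructed history. `base_policy` is any
# deterministic map from interleaved action/percept histories to actions.

def recalculated_actions(base_policy, percepts):
    """Recursively recover a-hat_1 .. a-hat_t from the percept string alone."""
    history, acts = [], []
    for e in percepts:
        a_hat = base_policy(tuple(history))   # action the policy would have chosen here
        acts.append(a_hat)
        history += [a_hat, e]                 # extend the *reconstructed* history
    return acts

def recalc_agent(base_policy, true_actions, percepts):
    """pi-hat: ignores `true_actions` entirely; conditions only on the percepts."""
    acts = recalculated_actions(base_policy, percepts)
    reconstructed = tuple(x for pair in zip(acts, percepts) for x in pair)
    return base_policy(reconstructed)

# Example: a toy deterministic policy that repeats the last percept (or plays 0 first).
toy_policy = lambda h: h[-1] if h else 0
print(recalc_agent(toy_policy, true_actions=[1, 1], percepts=[0, 1]))  # -> 1
```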

This more complicated off-policy behavior does not harm the computability level of $\varepsilon$-approximating AIXI because the recursive computation eventually settles on the right answer (by induction). The runtime to convergence scales up by about a factor of $t$ (the number of past actions that must be recalculated), as one would expect.

It is worth reflecting here on which version of AIXI best describes humans. In particular, what is the meaning of our action sequence, and do we remember it? This discussion requires that we are careful about drawing the lines between a person and her environment. Certainly we observe and recall our own external actions (at some level of abstraction); we hear the words we say and we see our hands grasping and picking up objects. However, these events and even the feeling of performing them are actually a part of our percept stream, which both $\pi^*$ and $\hat{\pi}^*$ maintain similar access to. We can call these self-observations, which make up a part of our percepts, $i_t$, with "i" standing for the image of our actions. Perhaps the "true" actions are the conscious decisions that precede physical acts - however, psychological experiments have shown that these decisions are predictable before we know that we have arrived at them. The difficulty of locating $a_t$ suggests to me that perhaps memory only represents $i_t$, and the illusion of access to $a_t$ is created online by retrospection - this is a much closer match to $\hat{\pi}^*$ than $\pi^*$, though I think the argument is far from rigorous.

In fact, this discussion suggests the more radical solution of entirely throwing out the "internal" choices $a_t$ and constructing a universal distribution over $(i_t, o_t, r_t)$ sequences (where $(o_t, r_t)$ is the rest of the percept). Then perhaps we can believe in free will only for the current time, and "myopically" choose the most favorable $i_t$ in a kind of evidential decision theory that may appeal to behaviorists. This idea is close to "Self-Predictive Universal A.I.", which constructs the "Self-AIXI" policy (the difference is that the paper uses a product measure over belief distributions for policy and environment instead of a unified belief distribution for both). Perhaps there is an interesting connection between Self-AIXI and $\hat{\pi}^*$, but I have not found one yet!

Intuitively, I expect that $\hat{\pi}^*$ avoids the anvil problem. It should learn that when engineers non-destructively mess with its brain, $i_t$ does not always match $i(\hat{a}_t)$, which is bad because physical consequences (and rewards) always depend directly on $i_t$. However, outside of brain surgery, $i_t = i(\hat{a}_t)$ and $a_t = \hat{a}_t$, meaning that $\hat{a}_t$ determines the observed actions of the agent and leads to their consequences (remember that $\hat{\pi}^*$ does not directly see $a_t$). Since the physics of AIXI's brain is computing the true action whether or not it is interfered with, one might expect that AIXI eventually decides that its chosen actions have no effect and it is only a passive observer. But in practice, the actions chosen by AIXI when its brain is not being messed with will be very difficult to predict with certainty from physics (this situation is made even worse because the existence of AIXI's brain is necessarily outside of its own hypothesis class, at least without introducing a reflective oracle), whereas by assumption $\hat{a}_t$ is computed correctly with perfect certainty. So the "free will" hypothesis gains Bayes points over the naturalistic hypothesis under those conditions where AIXI performs reliably[1]. In other words, $\hat{\pi}^*$ believes it has free will exactly when it has free will.

Formal Equivalence with an Uncorrupted AIXI

Now we will prove a formal result which describes exactly in what sense the informal statement above is true[2]. For the moment we will step back and consider arbitrary policies and environments where the percepts need not contain images of the actions. 

Given any policy $\pi$ and a corruption kernel $C$ (a conditional distribution over executed actions given the interaction history and the attempted action), define the corrupted policy $\pi_C$ as

$$\pi_C(b_t \mid b_{<t} e_{<t}) := \sum_{a_t \in \mathcal{A}} C(b_t \mid b_{<t} e_{<t} a_t) \, \pi(a_t \mid a_{<t} e_{<t}),$$

where $a_{<t}$ are the actions $\pi$ attempted (rather than executed) on earlier steps; for the deterministic policies we care about, $a_{<t}$ is determined by $e_{<t}$, so $\pi_C$ is a well-defined policy. That is, the probability that $\pi_C$ takes action $b_t$, given that it has so far chosen actions $b_{<t}$ and observed percepts $e_{<t}$, is the sum over the conditional probabilities given by $C$ when $\pi$ attempts to take any possible action $a_t$.
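As a sanity check on the definition, here is a small sketch of one step of $\pi_C$, using a corruption kernel that depends only on the attempted action (a simplification of the general $C$ above; all names are illustrative):

```python
# One step of the corrupted policy pi_C: the base policy attempts an action from
# its intended history, and a corruption kernel decides which action is executed.
import random

def corrupted_step(policy, corruption, intended_history):
    """pi attempts a_t; C (a dict-valued kernel over actions) executes b_t."""
    a_t = policy(tuple(intended_history))                 # attempted action
    dist = corruption(a_t)                                # C(. | a_t)
    b_t = random.choices(list(dist), weights=list(dist.values()))[0]
    return a_t, b_t

# Example: a kernel that flips the attempted action 10% of the time.
flip = lambda a: {a: 0.9, 1 - a: 0.1}
print(corrupted_step(lambda h: 1, flip, intended_history=[]))
```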

Now given an environment $\nu$ with $\nu \in \mathcal{M}$, we will also define an extended environment $\nu_C$ that includes the action corruption $C$:

$$\nu_C(e_{1:n} \| a_{1:n}) := \sum_{b_{1:n}} \nu(e_{1:n} \| b_{1:n}) \prod_{t=1}^{n} C(b_t \mid b_{<t} e_{<t} a_t).$$

Returning to our extended metaphor, the interpretation is that you are sitting in a room playing a video game on your controller, but your friend is watching you play from another room and sometimes (perhaps if you're doing something he doesn't like) he somehow hacks in and changes your actions $a_t$ to new actions $b_t$. From your perspective this is all part of the game, in the sense that it behaves exactly like an environment in AIXI's hypothesis class. With the (minor) caveat that your actions are now moving your thumbs instead of pressing the buttons, this description includes the case that your friend actually grabs your controller (but it is worse in that you may not get to see what he does with it).
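Here is a sketch of $\nu_C$ in the same toy setting, computing the probability it assigns to a percept string by marginalizing over every possible string of corrupted actions (again with a corruption kernel that depends only on the attempted action, which is a simplifying assumption):

```python
# Sketch of the extended environment nu_C: it receives the agent's *intended*
# actions, internally corrupts them with the kernel C, and feeds the corrupted
# actions to the base environment nu. To the agent it is just another environment.
from itertools import product

def extended_env_prob(nu, corruption, intended_actions, percepts):
    """Probability nu_C assigns to the percepts given the intended actions,
    marginalizing over every possible string of corrupted actions b_1..b_n."""
    n, total = len(intended_actions), 0.0
    for bs in product((0, 1), repeat=n):                  # all corrupted action strings
        p = 1.0
        for t in range(n):
            p *= corruption(intended_actions[t])[bs[t]]   # C(b_t | a_t)
            p *= nu(list(bs[: t + 1]))[percepts[t]]       # nu(e_t | b_1..b_t)
        total += p
    return total

# Example: the toy environment echoes the *executed* action, so nu_C smears the
# intended action through the corruption kernel before predicting the percept.
nu = lambda bs: {bs[-1]: 0.8, 1 - bs[-1]: 0.2}
flip = lambda a: {a: 0.9, 1 - a: 0.1}
print(extended_env_prob(nu, flip, intended_actions=[1, 1], percepts=[1, 1]))
```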

Now define the history distribution $P^\pi_\nu$ to be generated by the interaction of environment $\nu$ with policy $\pi$, so that $P^\pi_\nu(a_{1:n} e_{1:n}) = \prod_{t=1}^{n} \pi(a_t \mid a_{<t} e_{<t}) \, \nu(e_t \mid a_{1:t} e_{<t})$.

Theorem: For deterministic $\pi$ with attempted actions $a_{1:n}$ (which are determined by the percepts $e_{<n}$),

$$\sum_{b_{1:n}} P^{\pi_C}_{\nu}(b_{1:n} e_{1:n}) = P^{\pi}_{\nu_C}(a_{1:n} e_{1:n}).$$

We can further define $\hat{\pi}_C$ for any deterministic $\pi$ and a version of the theorem holds, with its recursively calculated actions $\hat{a}_{1:n}$ instead of $a_{1:n}$.

Proof: By the definition of $\pi_C$ and the determinism of $\pi$ (which collapses the inner sum over attempted actions),

$$P^{\pi_C}_{\nu}(b_{1:n} e_{1:n}) = \prod_{t=1}^{n} C(b_t \mid b_{<t} e_{<t} a_t) \, \nu(e_t \mid b_{1:t} e_{<t});$$

taking the sum over $b_{1:n}$ we obtain $\nu_C(e_{1:n} \| a_{1:n})$ by definition, and this is equal to $P^{\pi}_{\nu_C}(a_{1:n} e_{1:n})$ because $\pi$ is deterministic.

The theorem says that the probability of the corrupted $\pi_C$ producing any string of corrupted actions that give rise to the percept string $e_{1:n}$ is the same as the probability that $\pi$ causes the extended environment $\nu_C$ to produce the same percept string $e_{1:n}$. In particular this means that $V^{\pi_C}_{\nu} = V^{\pi}_{\nu_C}$, and when $\nu$ and $C$ are computable the extended environment $\nu_C$ belongs to AIXI's hypothesis class $\mathcal{M}$. My interpretation is that $\hat{\pi}^*$ deals with (computable) action corruption optimally.
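For readers who prefer to see the bookkeeping verified numerically, the following brute-force sketch checks the identity on the toy pieces above: it enumerates all corrupted action strings on one side and lets the extended environment do the marginalization on the other. The specific $\nu$, $C$, and deterministic percept-driven policy are arbitrary toy choices, not anything canonical.

```python
# Brute-force check of the theorem on a toy example: summing P^{pi_C}_nu over
# corrupted action strings matches P^{pi}_{nu_C} for the same percept string.
from itertools import product

nu = lambda bs: {bs[-1]: 0.8, 1 - bs[-1]: 0.2}   # toy env: percept echoes the executed action
C = lambda a: {a: 0.7, 1 - a: 0.3}               # toy corruption kernel C(b | a)
pi = lambda hist: hist[-1] if hist else 0        # toy deterministic, percept-driven policy

def corrupted_side(percepts):
    """sum over b of P^{pi_C}_nu(b, e): pi attempts a_t, C executes b_t, nu emits e_t."""
    n, total = len(percepts), 0.0
    for bs in product((0, 1), repeat=n):
        p, recon = 1.0, []
        for t in range(n):
            a_t = pi(tuple(recon))                                 # attempted (recalculated) action
            p *= C(a_t)[bs[t]] * nu(list(bs[: t + 1]))[percepts[t]]
            recon += [a_t, percepts[t]]
        total += p
    return total

def extended_side(percepts):
    """P^{pi}_{nu_C}(a, e): pi fixes the intended actions, nu_C marginalizes the corruption."""
    n, total = len(percepts), 0.0
    intended, recon = [], []
    for t in range(n):                                             # deterministic pi => intended actions fixed
        intended.append(pi(tuple(recon)))
        recon += [intended[-1], percepts[t]]
    for bs in product((0, 1), repeat=n):
        p = 1.0
        for t in range(n):
            p *= C(intended[t])[bs[t]] * nu(list(bs[: t + 1]))[percepts[t]]
        total += p
    return total

for e in product((0, 1), repeat=2):
    assert abs(corrupted_side(list(e)) - extended_side(list(e))) < 1e-12
print("theorem verified on the toy example")
```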

Now assume that there is an invertible function $i$ on actions such that the percepts contain $i(b_t)$, and with a slight abuse of notation write $e_t = (i(b_t), o_t, r_t)$. That is, assume AIXI can see itself in the mirror.

Corollary: When $\nu$ is chosen so that the percepts contain an image of the actions,

$$P^{\pi_C}_{\nu}(b_{1:n} e_{1:n}) = P^{\pi}_{\nu_C}(a_{1:n} e_{1:n}),$$

where the corrupted actions $b_{1:n}$ are recovered from the percepts by applying $i^{-1}$ to the images they contain. And no other actions are consistent with these percepts under $\nu$, so $\pi_C$ interacting with $\nu$ causes the percepts $e_{1:n}$ with the same probability that $\pi$ would when interacting with the environment $\nu_C$ that sometimes visibly alters its actions according to $C$.

Closing Thoughts

I believe the off-policy modification of AIXI described above deals with action corruption "correctly." In particular, we can formalize the intuition that $\hat{\pi}^*$ avoids the anvil problem. Since we have said little about the boundedly rational approximations of AIXI, the following argument is only on solid footing when we allow that a very good AIXI approximation has been physically implemented. Assume that before setting AIXI loose with powerful actuators, engineers both expose it to data about the external world and non-destructively tamper with its brain. Because brain surgery interferes with the executed actions according to the laws of physics, AIXI's understanding of the laws of physics will increase the weight of environments of the form $\nu_C$, where $\nu$ captures the physics of the external world and $C$ involves the physics of AIXI's brain (roughly, the conditional Kolmogorov complexity of such a $C$ given $\nu$ is low). This suggests a serious risk of permanent interference from destructive interventions like a falling anvil (possibly resulting in 0 reward forever), so once AIXI is let loose it will strongly prefer not to run the experiment of dropping an anvil on its head. All of this holds even though AIXI is not capable of literally believing its action choices $\hat{a}_t$ depend on its brain - it simply works out the physicality of the action corruptions.

I do not claim that $\hat{\pi}^*$ solves all problems of embeddedness. A real implementation of AIXI would not choose the optimal actions $\hat{a}_t$ but instead some approximation of them. That approximation would get better with more compute. It is possible that under some approximation schemes, $\hat{\pi}^*$ approximations granted more compute view their past suboptimal actions as corrupted actions and conclude that seeking more compute reduces action corruption, but because of potential instabilities in convergence I have not been able to prove this. In any case, $\hat{\pi}^*$ can only "learn about its own embeddedness" through observing the effects of side-channels on its performance, but perhaps cannot entertain that it may be a part of the universe merely by observing the existence of computers. I am not certain whether this is even true, and if so I cannot think of any examples where it would be a defect in practice. It does seem that $\hat{\pi}^*$ is inherently unable to perform anthropic reasoning, but again I am not sure whether this is a problem in practice or only a philosophical objection.

My more general stance is that the philosophical problems with AIXI are overstated, and theories of embedded agency should build off of AIXI. Though we certainly have no rigorously justified and fully satisfying theory of embedded agency (the closest may be Laurent Orseau and Mark Ring's "Space Time Embedded Intelligence"), it is not clear that the desiderata for such a theory are well-posed enough to have a unique answer. The most promising path I see to explore the possibilities starts with understanding the variations on AIXI. The naive off-policy extension of AIXI may eventually destroy itself, but I expect relatively minor variations on AIXI (such as $\hat{\pi}^*$ and Self-AIXI with carefully chosen environment and policy mixtures) to succeed in practice modulo computational boundedness, at least with a little guidance during early stages of development - and frankly I am surprised that their analysis hasn't received more attention[3]. If you are interested, please ask me about the many open problems in this area!

  1. ^

    Most of this argument goes through for any Bayesian decision theorist. We implicitly rely on the flexibility of AIXI's hypothesis class by arguing that it should be able to identify the cases when its brain is or is not being tampered with and combine different methods of prediction for each. This is an instance of the "AIXI = God Thesis": we assume that any (non-reflective) theory we can easily describe has a reasonable prior weight under the universal distribution, that in practice AIXI will eventually adopt and act on those theories that are useful for prediction, and that it will therefore perform at least as well as acting optimally according to our informally stated theory.

  2. ^

    The theorem that follows is inspired by the algebra of Samuel Alexander and Marcus Hutter's "Reward-Punishment Symmetric Universal Artificial Intelligence."

  3. ^

    My guess is that rationalists tend to have an ambitious and contrarian streak, which causes us to reject a whole paradigm at the first sign of philosophical limitations and prefer inventing entirely new theories. For example, I am thinking of logical induction, infra-Bayesianism, and singular learning theory. Certainly there are some fascinating ideas here worth exploring for their own sake; but this research is often justified through relevance to A.I. safety. I have only passing knowledge of these topics, but as far as I can tell the connection tends to be pretty weak. In contrast, if any rigorous theory of A.I. safety is possible, it probably needs to factor through a good understanding of an embedded version of AIXI if only to avoid wireheading by locating the reward model or utility function at the right position in its ontology. Admittedly the areas I mention have justifications that sound about as good as mine, but engaging with them is outside the scope of this post.

Comments (7)

he has always insisted AIXI would not drop an anvil on its head, because Cartesian dualism is not a problem for humans in the real world, who historically believed in a metaphysical soul and mostly got along fine anyway.

Regardless of what humans believed, they also felt pain when they damaged their bodies, so they had a strong instinct to avoid doing things that might damage their bodies. What is the equivalent for AIXI?

Also, I think early Christianity had to explicitly specify suicide as a sin, because it became an obvious "one weird trick" to get in Heaven (accept Jesus, get baptized, all sins removed, quickly kill yourself before you accumulate new ones). And some people still did the "suicide by a Roman cop".

So, Cartesian dualism was not a problem for humans in real world, because they had other mechanisms for preventing self-harm.

If one's argument is that there must be some algorithm which solves the anvil problem without needing hacks like a hardwired reward function which inflicts 'pain' upon any kind of bodily interaction which threatens the Cartesian boundary, because humans solve it fine, then one had better have firmly established that humans have in fact solved it without pain.

But they haven't. When humans don't feel pain, they do do things equivalent to 'drop an anvil on their head', which result in blinding, amputation, death by misadventure, etc. Turns out if you don't feel pain, you may think it's funny to poke yourself in the eye just to see everyone else's reaction and go blind or jump off a roof to impress friends and die, or simply walk around too long, damage your foot into sores, which suppurate and turn septic, and you amputate your legs or die. (This is leaving out Lesch–Nyhan syndrome.)

I don't think that is either my argument or Marcus's; he probably didn't have painless humans in mind when he said that AIXI would avoid damaging itself like humans do. Including some kind of reward shaping like pain seems wise, and if it is not included engineers would have to take care that AIXI did not damage itself while it established enough background knowledge to protect its hardware. I do think that following the steps described in my post would ideally teach AIXI to protect itself, though it's likely that a handful of other tricks and insights are needed in practice to deal with various other problems of embeddedness - and in that case the self-damaging behavior mentioned in your (interesting) write-up would not occur for a sufficiently smart (and single-mindedly goal-directed) agent even without pain sensors.

I also didn't initially buy the argument that Marcus gave and I think some modifications and care are required to make AIXI work as an embedded agent - the off-policy version is a start. Still, I think there are reasonable responses to the objections you have made:

1: It would be standard to issue a negative reward (or decrease the positive reward) if AIXI is at risk of harming its body. This is the equivalent. 
2: AIXI does not believe in heaven. If its percept stream ends this is treated as 0 reward forever (which is usually but not always taken as the worst reward possible depending on author). It's unclear if AIXI would expect the destruction of its body to lead to the end of its percept stream, but I think it would under some conditions.

It could be difficult to explain to AIXI what "its body" is.

I think the entire point of AIXI was that it kinda considers all possible universes with all possible laws of physics, and then updates based on evidence. To specify "its body", you would need to explain many things about our universe, which I think defeats the purpose of having AIXI.

Any time you attempt to implement AIXI (or any approximation) in the real world you must specify the reward mechanism. If AIXI is equipped with a robotic body you could choose for the sensors to provide "pain" signals. There is no need to provide a nebulous definition of what is or is not part of AIXI's body in order to achieve this. 

Ah, that makes sense! AIXI can receive pain signals long before it knows what they "mean", and as its model of the world improves, it learns to avoid pain.