Google DeepMind and FHI collaborate to present research at UAI 2016

Oxford academics are teaming up with Google DeepMind to make artificial intelligence safer. Laurent Orseau, of Google DeepMind, and Stuart Armstrong, the Alexander Tamas Fellow in Artificial Intelligence and Machine Learning at the Future of Humanity Institute at the University of Oxford, will be presenting their research on reinforcement learning agent interruptibility at UAI 2016. The conference, one of the most prestigious in the field of machine learning, will be held in New York City from June 25-29. The paper which resulted from this collaborative research will be published in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI).
Orseau and Armstrong’s research explores a method to ensure that reinforcement learning agents can be safely interrupted, repeatedly if necessary, by human or automatic overseers. This ensures that the agents do not “learn” about these interruptions, and do not take steps to avoid or manipulate them. When there are control procedures during the training of the agent, we do not want the agent to learn about these procedures, as they will not exist once the agent is on its own. This is useful for agents that have a substantially different training and testing environment (for instance, a Martian rover trained on Earth might be shut down, carried back to its starting point, and switched on again whenever it goes out of bounds, something that may be impossible once it is alone and unsupervised on Mars), for agents not known to be fully trustworthy (such as an automated delivery vehicle, which we do not want to learn to behave differently when watched), or simply for agents that need continual adjustments to their learnt behaviour. In all cases where it makes sense to include an emergency “off” mechanism, it also makes sense to ensure the agent doesn’t learn to plan around that mechanism.
Interruptibility has several advantages as an approach over previous methods of control. As Dr. Armstrong explains, “Interruptibility has applications for many current agents, especially when we need the agent to not learn from specific experiences during training. Many of the naive ideas for accomplishing this—such as deleting certain histories from the training set—change the behaviour of the agent in unfortunate ways.”
In the paper, the researchers provide a formal definition of safe interruptibility, show that some types of agents already have this property, and show that others can be easily modified to gain it. They also demonstrate that even an ideal agent that tends to the optimal behaviour in any computable environment can be made safely interruptible.
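To give a concrete (and much simplified) flavour of why some agents already have this property, here is a toy sketch in Python. It is not the construction from the paper: the chain environment, the interruption scheme that forces a "move left" action, and all parameter values are illustrative assumptions. The point is that Q-learning's update uses the best next action rather than the action actually taken, so forced interruptions leave the learned greedy policy unchanged.

```python
import random

def q_learn(episodes, interrupt_prob, seed=0):
    # Tiny deterministic chain MDP: states 0..4, reward 1 for reaching state 4.
    rng = random.Random(seed)
    n_states, n_actions = 5, 2          # action 0 = left, 1 = right
    Q = [[0.0] * n_actions for _ in range(n_states)]
    alpha, gamma, eps = 0.5, 0.9, 0.3
    for _ in range(episodes):
        s = 0
        while s != 4:
            # epsilon-greedy action proposal
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            # an external overseer occasionally interrupts and forces "move left"
            if rng.random() < interrupt_prob:
                a = 0
            s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == 4 else 0.0
            # off-policy update: uses max over next actions, not the action taken
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    # greedy policy: which action each state prefers after training
    return [max(range(n_actions), key=lambda x: Q[s][x]) for s in range(n_states)]

print(q_learn(3000, 0.0))   # greedy policy learned without interruptions
print(q_learn(3000, 0.3))   # greedy policy learned despite frequent interruptions
```

In both runs the greedy policy at every non-terminal state is "move right": the interruptions change what the agent does during training, but not what it learns to prefer.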
These results will have implications for future research directions in AI safety. As the paper says, “Safe interruptibility can be useful to take control of a robot that is misbehaving… take it out of a delicate situation, or even to temporarily use it to achieve a task it did not learn to perform….” As Armstrong explains, “Machine learning is one of the most powerful tools for building AI that has ever existed. But applying it to questions of AI motivations is problematic: just as we humans would not willingly change to an alien system of values, any agent has a natural tendency to avoid changing its current values, even if we want to change or tune them. Interruptibility, and the related general idea of corrigibility, allow such changes to happen without the agent trying to resist them or force them. The newness of the field of AI safety means that there is relatively little awareness of these problems in the wider machine learning community. As with other areas of AI research, DeepMind remains at the cutting edge of this important subfield.”
On the prospect of continuing collaboration in this field with DeepMind, Stuart said, “I personally had a really illuminating time writing this paper—Laurent is a brilliant researcher… I sincerely look forward to productive collaboration with him and other researchers at DeepMind into the future.” The same sentiment is echoed by Laurent, who said, “It was a real pleasure to work with Stuart on this. His creativity and critical thinking as well as his technical skills were essential components to the success of this work. This collaboration is one of the first steps toward AI Safety research, and there’s no doubt FHI and Google DeepMind will work again together to make AI safer.”
For more information, or to schedule an interview, please contact Kyle Scott at fhipa@philosophy.ox.ac.uk
Versions of AIXI can be arbitrarily stupid
Many people (including me) had the impression that AIXI was ideally smart. Sure, it was uncomputable, and there might be "up to finite constant" issues (as with anything involving Kolmogorov complexity), but it was, informally at least, "the best intelligent agent out there". This was reinforced by Pareto-optimality results, namely that there was no computable policy that performed at least as well as AIXI in all environments, and strictly better in at least one.
However, Jan Leike and Marcus Hutter have proved that AIXI can be, in some sense, arbitrarily bad. The problem is that AIXI is not fully specified, because the universal prior is not fully specified: it depends on the choice of an initial computing language (or, equivalently, of an initial Turing machine).
This choice affects the universal prior only up to a multiplicative constant (though that constant can be arbitrarily large). For the agent AIXI, however, it can lock in bad behaviour that never ends.
For illustration, imagine that there are two possible environments:
- The first one is Hell, which will give ε reward if the AIXI outputs "0", but, the first time it outputs "1", the environment will give no reward for ever and ever after that.
- The second is Heaven, which gives ε reward for outputting "0" and 1 reward for outputting "1", and is otherwise memoryless.
Now simply choose a language/Turing machine such that the ratio P(Hell)/P(Heaven) is higher than the ratio 1/ε. In that case, for any discount rate, the AIXI will always output "0", and thus will never learn whether it's in Hell or not (because it's too risky to try). It will observe the environment giving reward ε after each "0", behaviour which is compatible with both Heaven and Hell. Thus the ratio P(Hell)/P(Heaven) stays constant, and the AIXI never does anything else.
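To make the arithmetic concrete, here is a small Python check of this argument. The numbers (ε, the discount rate, and the prior) are assumed for illustration, and the "agent" is reduced to comparing two fixed policies rather than doing full AIXI planning:

```python
def value_always_zero(eps, gamma):
    # Outputting "0" earns eps every step, in Heaven and in Hell alike.
    return eps / (1 - gamma)

def value_try_one(p_heaven, gamma):
    # Trying "1": in Heaven it earns 1 per step forever; in Hell it ends
    # all reward, so only the Heaven branch contributes.
    return p_heaven * (1 / (1 - gamma))

eps, gamma = 0.01, 0.9
# choose a prior with P(Hell)/P(Heaven) > 1/eps = 100
p_heaven = 1 / 202.0                      # ratio P(Hell)/P(Heaven) = 201
print(value_always_zero(eps, gamma) > value_try_one(p_heaven, gamma))  # True
```

With the prior skewed past 1/ε, the cautious policy dominates and the agent never gathers the evidence that would correct its prior; with a milder prior (say P(Heaven) = 0.5) the inequality flips and exploration pays.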
In fact, it's worse than this. If you use the prior to measure intelligence, then an AIXI that follows one prior can be arbitrarily stupid with respect to another.
Approximating Solomonoff Induction
Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It considers every possible computer program that could have produced the data, keeps the ones consistent with it, and weights each by its prior probability: a program of length ℓ bits gets weight 2^(−ℓ), so simpler programs count for more.
Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem: we don't care about the space of all possible datasets, only about those generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.
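A toy version of this weighting scheme can be written down directly. The sketch below is my own illustration, not a usable approximation: the "programs" are restricted to the family "repeat this bit-pattern forever", so the whole hypothesis space can be enumerated and weighted by the 2^(−length) simplicity prior:

```python
from itertools import product

def predict_next(observed):
    # Hypotheses: "repeat this pattern forever", one per bit-pattern up to 8 bits.
    # Each gets prior weight 2^-length, a Solomonoff-style simplicity prior.
    weights = {0: 0.0, 1: 0.0}
    for length in range(1, 9):
        for pattern in product([0, 1], repeat=length):
            generated = [pattern[i % length] for i in range(len(observed) + 1)]
            if generated[:len(observed)] == list(observed):
                # consistent with the data: this program votes for its next bit
                weights[generated[-1]] += 2.0 ** (-length)
    return weights[1] / (weights[0] + weights[1])   # P(next bit = 1)

print(predict_next([0, 1, 0, 1, 0, 1]))  # small: the alternating pattern says 0 next
print(predict_next([1, 1, 1, 1, 1, 1]))  # large: "all ones" is the simplest fit
```

The shortest consistent pattern dominates the posterior, which is exactly the behaviour the simplicity prior is meant to produce.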
Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.
But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task, even if you add restrictions to remove the obvious problems like infinite loops and non-halting behavior. The space of possibilities is just too huge to search through. And it's discrete: you can't just flip a few bits in a program and expect to find another, similar program.
We can simplify the problem a great deal by searching through logic circuits instead. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important here. We still get the best property of Solomonoff Induction: it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.
Logic circuits have some nicer properties than arbitrary computer programs, but they are still discrete and hard to do inference on. To fix this we can make continuous versions of logic circuits: go back to analog. A continuous circuit can compute all the same functions, but works with real-valued states instead of binary ones.
Instead of flipping between discrete states, we can slightly strengthen or weaken the connections between gates, and the behavior only changes slightly. This is very nice, because we have algorithms like MCMC that can efficiently approximate true Bayesian inference over continuous parameters.
And we are no longer restricted to boolean gates; we can use any function of real numbers, like one that takes the sum of all of its inputs, or one that squishes a real number to between 0 and 1.
We can also measure how much slightly changing a gate's input changes its output. Then we can go to all the gates that feed into it at the previous time step, see how much changing each of their inputs changes their outputs, and therefore the output of the first gate.
And we can go to those gates' inputs, and so on, chaining all the way through the whole circuit, finding out how much a slight change to each connection will change the final output. This collection of sensitivities is called the gradient, and with it we can do gradient descent: change each parameter slightly in the direction that moves the output the way we want.
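The backward sweep just described can be sketched for a two-gate continuous circuit. The gates (sigmoids of weighted inputs), the weights, and the input value are arbitrary choices of mine; the chain-rule gradient is checked against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    # Two "continuous gates" in series: h = sigmoid(w1*x), y = sigmoid(w2*h)
    h = sigmoid(w[0] * x)
    y = sigmoid(w[1] * h)
    return h, y

def gradient(w, x):
    # Chain rule, sweeping backwards from the output gate to its inputs
    h, y = forward(w, x)
    dy_dz2 = y * (1 - y)            # sigmoid slope at the output gate
    dL_dw2 = dy_dz2 * h             # how w2 changes the final output
    dL_dh = dy_dz2 * w[1]           # how the hidden gate's output changes it
    dL_dw1 = dL_dh * h * (1 - h) * x
    return [dL_dw1, dL_dw2]

w, x = [0.7, -1.3], 0.5
g = gradient(w, x)
# check against a central finite-difference approximation
step = 1e-6
for i in range(2):
    wp = list(w); wp[i] += step
    wm = list(w); wm[i] -= step
    num = (forward(wp, x)[1] - forward(wm, x)[1]) / (2 * step)
    print(abs(num - g[i]) < 1e-6)   # True for both weights
```

The nudge-each-connection picture and the chain-rule picture agree to numerical precision, which is the whole content of backpropagation.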
This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.
But this isn't quite Solomonoff Induction, since we are finding the single best model instead of weighting over the space of all possible models. This matters because each model is essentially a hypothesis, and multiple hypotheses can fit the data while predicting different things.
There are many tricks we can do to approximate this. For example, you can randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with it (this is known as dropout). For some reason this somewhat approximates the results of true Bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate Bayesian inference that way.
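The first trick can be sketched as follows; the tiny circuit, the weights, and the 50% drop rate are illustrative assumptions of mine. Averaging over many random masks gives a predictive mean close to the full circuit's output, plus a spread that acts as a rough uncertainty estimate:

```python
import random
import statistics

def circuit(x, weights, drop, rng):
    # One "layer" of continuous gates; with drop=True each gate is switched
    # off with probability 0.5 and survivors are rescaled to keep the mean.
    hidden = []
    for w in weights:
        if drop and rng.random() < 0.5:
            hidden.append(0.0)                       # gate switched off
        else:
            scale = 2.0 if drop else 1.0             # rescale kept gates
            hidden.append(scale * max(0.0, w * x))   # rectified continuous gate
    return sum(hidden) / len(hidden)

rng = random.Random(0)
weights = [0.3, -0.8, 1.2, 0.5]
# Monte-Carlo over random masks = averaging over an ensemble of sub-circuits
samples = [circuit(2.0, weights, True, rng) for _ in range(5000)]
print(round(statistics.mean(samples), 2))            # close to the full circuit
print(round(circuit(2.0, weights, False, rng), 2))   # the full circuit's output
print(statistics.stdev(samples) > 0)                 # nonzero spread = uncertainty
```

Each random mask is a different sub-circuit, i.e. a different hypothesis, so the spread of the samples crudely plays the role of the posterior spread that full Bayesian inference would give.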
Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.
It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data; it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace its reward signal with an artificial one (e.g. drugs). It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.
Interestingly, humans also have these problems to some degree.
Reposted from my blog here.
Can AIXI be trained to do anything a human can?
There is some discussion as to whether an AIXI-like entity would be able to defend itself (or refrain from destroying itself). The problem is that such an entity would be unable to model itself as being part of the universe: AIXI itself is an uncomputable entity modelling a computable universe, and more limited variants like AIXI(tl) lack the power to simulate themselves. Therefore, they cannot identify "that computer running the code" with "me", and would cheerfully destroy themselves in the pursuit of their goals/reward.
I've pointed out that agents of the AIXI type could nevertheless learn to defend themselves in certain circumstances: namely, circumstances where they can translate bad things happening to themselves into bad things happening to the universe. For instance, if someone pressed an OFF switch to turn it off for an hour, the AIXI could model that as "the universe jumps forwards an hour when that button is pushed", and if that's a negative (which it likely is, since the AIXI loses an hour of influencing the universe), it would seek to prevent that OFF switch being pressed.
That was an example of the setup of the universe "training" the AIXI to do something that it didn't seem it could do. Can this be generalised? Let's go back to the initial AIXI design (the one with the reward channel) and put a human in charge of that reward channel with the mission of teaching the AIXI important facts. Could this work?
For instance, if anything dangerous approached the AIXI's location, the human could lower the AIXI's reward, until it became very effective at deflecting danger. The greater the variety of things that could potentially threaten the AIXI, the more likely it is to construct plans of action that contain behaviours that look a lot like "defend myself." We could even imagine that there is a robot programmed to repair the AIXI if it gets (mildly) damaged. The human could then reward the AIXI if it leaves that robot intact, builds duplicates of it, or improves it in some way. It's therefore possible the AIXI could come to value "repairing myself", still without any explicit model of itself in the universe.
It seems this approach could be extended to many of the problems with AIXI. Sure, an AIXI couldn't restrict its own computation in order to win the HeatingUp game. But the AIXI could be trained to always use subagents to deal with these kinds of games, subagents that could achieve maximal score. In fact, if the human has good knowledge of the AIXI's construction, it could, for instance, pinpoint a button that causes the AIXI to cut short its own calculation. The AIXI could then learn that pushing that button in certain circumstances would get a higher reward. A similar reward mechanism, if kept up long enough, could get it around existential despair problems.
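As a toy illustration of this kind of training (my own setup, far simpler than anything AIXI-like): a model-free learner with no self-model at all can still acquire a "deflect danger" policy purely from the rewarder's feedback, because the policy lives entirely in the observation-to-action mapping:

```python
import random

def train(episodes, seed=0):
    # The learner maps observations to actions by reward alone; it has no
    # concept of "itself". Observation: 1 = danger approaching, 0 = all clear.
    rng = random.Random(seed)
    actions = ["work", "deflect"]
    Q = {(obs, a): 0.0 for obs in (0, 1) for a in actions}
    for _ in range(episodes):
        obs = rng.randrange(2)
        if rng.random() < 0.2:                     # exploration
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(obs, x)])
        # the human rewarder penalises ignoring danger, otherwise rewards work
        if obs == 1:
            r = 1.0 if a == "deflect" else -1.0
        else:
            r = 1.0 if a == "work" else 0.0
        Q[(obs, a)] += 0.1 * (r - Q[(obs, a)])
    return Q

Q = train(5000)
print(max(["work", "deflect"], key=lambda a: Q[(1, a)]))   # learned danger response
```

After training, the greedy response to danger is "deflect": behaviour that looks like self-preservation, produced without any representation of the agent in its own world model.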
I'm not claiming this would necessarily work - it may require a human rewarder of unfeasibly large intelligence. But it seems there's a chance that it could. So categorical statements of the type "AIXI wouldn't..." or "AIXI would..." seem wrong, at least as far as AIXI's behaviour is concerned. An AIXI couldn't develop self-preservation - but it could behave as if it had. It can't learn about itself - but it can behave as if it did. The human rewarder may not even be necessary - maybe certain spontaneously occurring situations in the universe ("AIXI training wheels arenas") could allow the AIXI to develop these skills without outside training. Or maybe somewhat stochastic AIXIs subject to evolution and natural selection could do so. There is an angle connected with embodied embedded cognition that might be worth exploring there (especially the embedded part).
It seems that agents of the AIXI type may not necessarily have the limitations we assume they must.
How to make AIXI-tl incapable of learning
Consider a simple game: You are shown a random-looking 512-bit string h. You may then press one of two buttons, labeled '0' and '1'. No matter which button you press, you will then be shown a 256-bit string s such that SHA512(s) = h. In addition, if you pressed '1' you are given $1.
This game seems pretty simple, right? s and h are irrelevant, and you should simply press '1' all the time (I'm assuming you value receiving money). Well, let's see how AIXI and AIXI-tl fare at this game.
Let's say the machine has already played the game many times. Its memory is h_0, b_0, r_0, s_0, h_1, ..., b_(n-1), r_(n-1), s_(n-1), h_n, in chronological order, where the b_i are the machine's decisions, the rest are inputs, and r_i is the reward signal. It is always the case that r_i = b_i and h_i = SHA512(s_i).
First let's look at AIXI. It searches for models that compress and extrapolate this history up to the limit of its planning horizon. One class of such models is this: there is a list s_0, ..., s_N of random 256-bit strings and a list b_0, ..., b_N of (possibly compressible) bits, and the history is SHA512(s_0), b_0, b_0, s_0, ..., b_(N-1), s_(N-1), SHA512(s_N), b_N (each decision bit appears twice because r_i = b_i). Here s_i for i < n must match the s_i in its memory, and s_n must be the almost certainly unique value with SHA512(s_n) = h_n. While I don't have a proof, it intuitively seems like this class of models will dominate the machine's probability mass, and repeated arg-max should lead to the action of outputting 1. It wouldn't always do this, due to exploration/exploitation considerations and the incentive to minimize K(b_0, ..., b_N) built into its prior, but it should do it most of the time. So AIXI seems fine.
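The game itself is easy to make concrete with the standard library. The code below is just the environment plus the trivial "always press 1" policy (not an AIXI approximation), and it checks the two invariants stated above, r_i = b_i and h_i = SHA512(s_i):

```python
import hashlib
import os

def play_round(button):
    # The environment draws a random 256-bit string and shows its SHA-512 hash
    # before the button press; the preimage is revealed afterwards.
    s = os.urandom(32)                       # 256-bit preimage
    h = hashlib.sha512(s).digest()           # 512-bit string shown first
    reward = 1 if button == 1 else 0         # $1 for pressing '1'
    return h, button, reward, s

# the trivial optimal policy: always press '1'
history = [play_round(1) for _ in range(10)]
# the invariants from the text: r_i = b_i and h_i = SHA512(s_i)
print(all(r == b and h == hashlib.sha512(s).digest()
          for h, b, r, s in history))        # True
```

Every round satisfies both invariants, and the always-press-1 policy collects the maximal reward, which is the behaviour the text argues AIXI (but not AIXI-tl) would converge to.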
Now let's consider AIXI-tl. It picks outputs by having provably correct programs assign lower bounds to the expected utility, and picking the output with the best lower bound, where expected utility is as measured by AIXI. This would include accepting the analysis I just made for AIXI, if that analysis can be made provably accurate. Here lies the problem: the agent has seen h_n but hasn't seen s_n. Therefore, it can't be certain that there is an s_n with SHA512(s_n) = h_n, and so it can't be certain that the model class used for AIXI actually works. (This isn't a problem for AIXI itself, which has infinite computational power and can always determine that there is such an s_n.)
There is an ad hoc fix for this for AIXI-tl: take the same model as before, but let h_n be a random string rather than SHA512(s_n). This seems at first to work okay. However, it adds n, the current time, as an input to the model, which adds to its complexity. Now other models that depend on the current time also need to be considered. For instance, what if h_n is a random string, and in addition r_n = not(b_n)? This model seems more complicated, but maybe in the programming language used for the Solomonoff prior it is shorter. The key point is that the agent won't be able to update past that initial prior, no matter how many times it plays the game. In that case AIXI-tl may consistently prefer 0, assuming there aren't any other models it considers that make its behavior even more complicated.
The key problem is that AIXI-tl is handling logical uncertainty badly. The reasonable thing to do upon seeing h_n is assuming that it, like all h_i before it, is a hash of some 256-bit string. Instead, it finds itself unable to prove this fact and is forced into assuming the present h_n is special. This makes it assume the present time is special and makes it incapable of learning from experience.
Universal agents and utility functions
I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.
Introduction to AIXI
Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence with actions (agent outputs) $y_k \in \mathcal{Y}$ and perceptions (environment outputs) $x_k \in \mathcal{X}$ chosen from finite sets $\mathcal{Y}$ and $\mathcal{X}$. The perceptions are pairs $x_k = (o_k, r_k)$, where $o_k$ is the observation part and $r_k$ denotes a reward. At time $k$ the agent chooses its next action $y_k$ according to the expectimax principle:

$$y_k = \arg\max_{y_k} \sum_{x_k} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} \left( r_k + \cdots + r_{m_k} \right) M\!\left(x_{k:m_k} \mid \dot y \dot x_{<k}\, y_{k:m_k}\right)$$

Here $M$ denotes the updated Solomonoff prior, summing over all programs $q$ that are consistent with the history $\dot y \dot x_{<k}$ [1] and which will, when run on the universal Turing machine $T$ with successive inputs $y_1, \ldots, y_{m_k}$, compute outputs $x_1, \ldots, x_{m_k}$, i.e.

$$M\!\left(x_{k:m_k} \mid \dot y \dot x_{<k}\, y_{k:m_k}\right) \;\propto\; \sum_{q \,:\, q(y_{1:m_k}) = \dot x_{<k} x_{k:m_k}} 2^{-\ell(q)}$$
AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitum doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.
Utility vs rewards
So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time.
You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).
Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is yes: we can just define a utility function $U$ that assigns a value to every possible future history $\dot y \dot x_{<k}\, y x_{k:m_k}$ and use it to replace the reward system in the agent specification:

$$y_k = \arg\max_{y_k} \sum_{x_k} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} U\!\left(\dot y \dot x_{<k}\, y x_{k:m_k}\right) M\!\left(x_{k:m_k} \mid \dot y \dot x_{<k}\, y_{k:m_k}\right)$$

When I say "we can just define" I am actually referring to the really hard question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time, whereas the reward system can always be represented in the framework of utility functions by setting

$$U\!\left(\dot y \dot x_{<k}\, y x_{k:m_k}\right) = r_k + \cdots + r_{m_k}.$$

For the same reasons, this is also a strictly safer approach.
Infinite utility
The original AIXI framework must necessarily place upper and lower bounds on the rewards that are achievable, because the rewards are part of the perceptions and $\mathcal{X}$ is finite. The utility function approach does not have this problem, as the expected utility

$$\sum_{x_k} \cdots \sum_{x_{m_k}} U\!\left(\dot y \dot x_{<k}\, y x_{k:m_k}\right) M\!\left(x_{k:m_k} \mid \dot y \dot x_{<k}\, y_{k:m_k}\right)$$

is always finite as long as we stick to a finite set $\mathcal{X}$ of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing $\mathcal{X}$ to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield of the surreal numbers than $\mathbb{R}$, my favorite[2] so far being the Levi-Civita field $\mathcal{R}$ generated by the infinitesimal $\varepsilon$:

$$\mathcal{R} = \left\{ \sum_{q \in \mathbb{Q}} a_q \varepsilon^q \;\middle|\; a_q \in \mathbb{R},\ \{q : a_q \neq 0\} \text{ left-finite} \right\}$$

with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as finite truncations $\sum_{i=1}^{N} a_{q_i} \varepsilon^{q_i}$ (see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be $\mathcal{R}$, we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.

Note that finiteness of $\mathcal{Y}$ is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon $m_k$ remains a very problematic assumption, as described in Legg (2008).
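A minimal sketch of such a floating-point representation, assuming nothing beyond truncated power-series arithmetic (the class and its method names are my own): a number is stored as a finite map from rational exponents to real coefficients, and comparisons are decided by the leading (smallest-exponent) term.

```python
from fractions import Fraction

class LeviCivita:
    """Toy truncated Levi-Civita number: a finite sum of a_q * eps^q terms,
    with eps an infinitesimal. The (left-finite) support is stored as a dict."""

    def __init__(self, terms):
        # terms: {exponent q: coefficient a_q}; zero coefficients are dropped
        self.terms = {Fraction(q): float(a) for q, a in terms.items() if a != 0}

    def __add__(self, other):
        out = dict(self.terms)
        for q, a in other.terms.items():
            out[q] = out.get(q, 0.0) + a
        return LeviCivita(out)

    def __mul__(self, other):
        # usual power-series (Cauchy) product: exponents add, coefficients multiply
        out = {}
        for q1, a1 in self.terms.items():
            for q2, a2 in other.terms.items():
                out[q1 + q2] = out.get(q1 + q2, 0.0) + a1 * a2
        return LeviCivita(out)

    def leading(self):
        # the smallest exponent dominates: eps^q is infinitesimal for q > 0
        # and infinite for q < 0
        q = min(self.terms)
        return q, self.terms[q]

    def __lt__(self, other):
        # compare by the sign of the leading coefficient of the difference
        diff = self + LeviCivita({q: -a for q, a in other.terms.items()})
        if not diff.terms:
            return False
        return diff.leading()[1] < 0

eps = LeviCivita({1: 1.0})       # the infinitesimal generator
one = LeviCivita({0: 1.0})
# an infinitesimal is strictly positive yet smaller than every positive real
print(LeviCivita({}) < eps, eps < one)   # True True
```

This already gives a total order in which `0 < eps**2 < eps < 1`, which is exactly the structure needed to rank an infinitesimal-utility event below every real-valued one without discarding it.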
Modifiable utility functions
Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.
A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let $U_k$ denote the utility function the agent will have at time $k$. Even though we will only analyze instances in which the agent knows at time $k$ which utility function it will have at future times $t > k$ (possibly depending on the actions it takes before then), we note that for every fixed future history the agent knows the utility function $U_t$ that is displayed on the screen, because the screen is part of its perception data $x_t$.
This leads to three different agent models worthy of further investigation:
- Agent 1 will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function $U^{(1)} = U_k$.
- Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by $U^{(2)}\!\left(\dot y \dot x_{<k}\, y x_{k:m_k}\right) = \sum_{t=k}^{m_k} U_t\!\left(\dot y \dot x_{<t}\, y x_t\right)$.
- Agent 3 will, at time $k$, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon: $U^{(3)} = U_{m_k}$.
Of course arbitrary mixtures of these are possible.
The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: the Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time $k+1$, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. $U_k\!\left(\dot y \dot x_{<k}\, y x_{k:m_k}\right) < u_{\max}$ holds for all $y x_{k:m_k}$. Assume further that the agent's actions influence the future outcomes, at least from its current perspective; that is, for every action there exist alternative actions leading to different expected utilities. Let $U^S$ be the Simpleton utility function, assigning equal but maximal utility $u_{\max}$ to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2, on the other hand, will try to maximize only the utility it can still derive from the current time cycle, $U_k\!\left(\dot y \dot x_{<k}\, y x_k\right)$.
Now consider the actual Simpleton Gambit: at time $k$ the agent gets to choose between changing, $y_k = c$, resulting in $U_t = U^S$ for all $t > k$, and $y_k = \neg c$ (not changing), leading to $U_t = U_k$ for all $t > k$. We assume that the choice itself has no further effects on the environment. As before, Agent 1 will optimize for business as usual; whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function or not.
Agent 2 will change if and only if the utility of changing compared to not changing, according to what the screen currently says, is strictly smaller than the comparative advantage of always having maximal utility in the future. That is, it will change if and only if

$$U_k\!\left(\dot y \dot x_{<k}\, \neg c\, x_k\right) - U_k\!\left(\dot y \dot x_{<k}\, c\, x_k\right)$$

is strictly less than the expected shortfall from maximal utility over the remaining cycles if it does not change,

$$\sum_{t=k+1}^{m_k} \left( u_{\max} - \mathbb{E}\!\left[ U_t \right] \right).$$

This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them, at least some of the time; this is also advice that I have frequently encountered in self-help literature. Agent 3 will definitely change, as it only evaluates situations using its final utility function.
Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities that it predictably will not take later. Agent 3, on the other hand, will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with agent model 2, and I invite everyone to point out its flaws.
[1] Dotted actions/perceptions, like $\dot y_i \dot x_i$, denote past events; underlined perceptions $\underline{x}_i$ denote random variables to be observed at future times.
[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used, and I don't see how those could be implemented.
Modifying Universal Intelligence Measure
In 2007, Legg and Hutter wrote a paper using the AIXI model to define a measure of intelligence. It's pretty great, but I can think of some directions of improvement.
- Reinforcement learning. I think this term and formalism are historically from much simpler agent models which actually depended on being reinforced to learn. In its present form (Hutter 2005 section 4.1) it seems arbitrarily general, but it still feels kinda gross to me. Can we formalize AIXI and the intelligence measure in terms of utility functions, instead? And perhaps prove them equivalent?
- Choice of Horizon. AIXI discounts the future by requiring that total future reward is bounded, and therefore so does the intelligence measure. This seems to me like a constraint that does not reflect reality, and possibly an infinitely important one. How could we remove this requirement? (Much discussion on the "Choice of the Horizon" in Hutter 2005 section 5.7).
- Unknown utility function. When we reformulate it in terms of utility functions, let's make sure we can measure its intelligence/optimization power without having to know its utility function. Perhaps by using an average of utility functions weighted by their K-complexity.
- AI orientation. Finally, and least importantly, it tests agents across all possible programs, even those known to be inconsistent with our universe. This might be okay if your agent is playing arbitrary games on a computer, but if you are trying to determine how powerful an agent will be in this universe, you probably want to replace the Solomonoff prior with the posterior that results from updating the Solomonoff prior on data from our universe.
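The last point can be sketched numerically. The environments, description lengths, and likelihoods below are entirely made-up stand-ins; the point is only the mechanics of replacing the prior 2^(−ℓ) with a posterior proportional to 2^(−ℓ) · P(data | environment):

```python
# Hypothetical environment "programs": (description length in bits,
# probability that environment assigns to the data observed from our universe).
envs = {
    "physics-like": (20, 0.9),
    "random-noise": (10, 1e-12),
    "anti-physics": (15, 1e-9),
}

def normalise(w):
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

prior = normalise({name: 2.0 ** -length
                   for name, (length, _) in envs.items()})
posterior = normalise({name: (2.0 ** -length) * lik
                       for name, (length, lik) in envs.items()})

# The raw Solomonoff prior favours the shortest program; the posterior
# favours programs that actually predict our observations.
print(max(prior, key=prior.get))       # random-noise
print(max(posterior, key=posterior.get))  # physics-like
```

Measuring an agent against the posterior rather than the prior thus weights its performance toward environments compatible with our universe, which is what the bullet point asks for.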
Any thought or research on this by others? I imagine lots of discussion has occurred over these topics; any referencing would be appreciated.
[video] Paul Christiano's impromptu tutorial on AIXI and TDT
Paul Christiano was about to give a tutorial on AIXI and TDT, so I whipped out my iPhone and recorded it. His tutorial wasn't carefully planned or executed, but it may still be useful to some. Note that when Paul writes "UDT" on a piece of paper he really means "TDT." :)
Would AIXI protect itself?
Research done with Daniel Dewey and Owain Evans.
AIXI can't find itself in the universe - it can only view the universe as computable, and it itself is uncomputable. Computable versions of AIXI (such as AIXItl) also fail to find themselves in most situations, as they generally can't simulate themselves.
This does not mean that AIXI wouldn't protect itself, though, if it had some practice. I'll look at the three elements an AIXI might choose to protect: its existence, its algorithm and utility function, and its memory.
Grue-verse
In this setup, the AIXI is motivated to increase the number of Grues in its universe (its utility is the time integral of the number of Grues at each time-step, with some cutoff or discounting). At each time step, the AIXI produces its output, and receives observations. These observations include the number of current Grues and the current time (in our universe, it could deduce the time from the position of stars, for instance). The first bit of the AIXI's output is the most important: if it outputs 1, a Grue is created, and if it outputs 0, a Grue is destroyed. The AIXI has been in existence for long enough to figure all this out.
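The interaction loop above can be sketched in a few lines. This is a minimal, hypothetical rendering of the Grue-verse (class and variable names are my own): each step the agent emits output bits, the first bit creates or destroys a Grue, and the agent observes the Grue count and the current time.

```python
# Minimal sketch of the Grue-verse interaction loop. Names are illustrative;
# the agent's utility is the (discounted or cut-off) sum of the per-step
# Grue counts returned as rewards here.

class GrueVerse:
    def __init__(self):
        self.grues = 0
        self.t = 0

    def step(self, output_bits):
        # The first output bit is the important one: 1 creates a Grue,
        # 0 destroys one (if any exist).
        if output_bits[0] == 1:
            self.grues += 1
        else:
            self.grues = max(0, self.grues - 1)
        self.t += 1
        # The agent observes the current Grue count and the current time.
        observation = (self.grues, self.t)
        reward = self.grues  # per-step utility contribution
        return observation, reward

env = GrueVerse()
obs, r = env.step([1])  # output 1: a Grue is created
```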
Protecting its existence
Here there is a power button in the universe, which, if pressed, will turn the AIXI off for the next timestep. The AIXI can see this button being pressed.
AIXI and Existential Despair
It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.
Setup
Let A be an agent running on a physical computer, implementing some approximate version of AIXI. Suppose that A is running inside of an indestructible box, connected to the external world by an input wire W1 and an output wire W2.
Suppose that this computer exists within a lawful physical universe, governed by some rules which can be inferred by A. For simplicity, assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand. (Similar problems arise for identical reasons in more realistic settings, where A will work instead with a local model of reality with more extensive boundary conditions and imperfect predictability, but this simplified setting is easier to think about formally.)
Recall the definition of AIXI: A will try to infer a simple program which takes A's outputs as input and provides A's inputs as output, and then choose utility maximizing actions with respect to that program. Thus two models with identical predictive power may lead to very different actions, if they give different predictions in counterfactuals where A changes its output (this is not philosophy, just straightforward symbol pushing from the definition of AIXI).
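The symbol pushing in the last sentence can be made concrete with a small sketch. The names below are hypothetical, and the real AIXI mixes over all programs rather than a two-element list, but the structure is the same: an action is scored by a prior-weighted sum of the utility each model predicts for it, so two models that agree on all observations so far can still rank actions differently when their counterfactuals differ.

```python
# Sketch of AIXI-style action selection over an explicit model mixture.
# Each model is a (prior_weight, predicted_utility_fn) pair, where
# predicted_utility_fn(action) is that model's counterfactual value.

def choose_action(actions, models):
    def expected_utility(action):
        return sum(w * u(action) for w, u in models)
    return max(actions, key=expected_utility)

# Model 1: the agent's output controls the wire, so acting matters.
model1 = (0.1, lambda a: 10.0 if a == 1 else 0.0)
# Model 2: the agent's outputs are ignored; every action looks the same.
model2 = (0.9, lambda a: 5.0)

best = choose_action([0, 1], [model1, model2])
```

Note that Model 2, despite holding most of the weight, is constant across actions and so contributes nothing to the comparison; the choice is decided entirely by Model 1, which is exactly the phenomenon discussed below.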
AIXI's Behavior
First pretend that, despite being implemented on a physical computer, A was able to perform perfect Solomonoff induction. What model would A learn then? There are two natural candidates:
- A's outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)
- A's outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)
Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A's control over W2's values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.
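As a back-of-the-envelope check on the claim above: under a 2^-length prior, a model that needs 10000 extra bits (here, the assumed cost of specifying A's control over W2) is penalized by a factor of 2^10000 relative to the simpler model when both fit the data equally well.

```python
# Odds of the simpler model over the longer one, when both predict the
# observations perfectly and differ only by extra description length.
# The 10000-bit figure is the assumption made in the setup above.

EXTRA_BITS = 10_000

def posterior_odds(extra_bits):
    """Odds ratio 2^extra_bits favoring the shorter model."""
    return 2 ** extra_bits

odds = posterior_odds(EXTRA_BITS)  # an integer with ~3011 decimal digits
```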
If A believed Model 2, what would its actions be? Well, in Model 2 A's outputs are completely ignored! So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A's actions. So maybe A will make decisions as if it believed Model 1, and therefore behave appropriately for a surprising reason? In fact this is unlikely: there are many other models of intermediate complexity, for example:
- A's outputs are usually ignored, the rest of the universe behaves physically, and then A is given the values from W1 as its input. However, if A ever outputs the sequence 0011011000111010100, then A's inputs are modified to include a large reward signal on the next step. (Model 3)
Model 3 is an intentionally silly example, but hopefully the point is clear: Model 1 is enormously more complicated than the simplest explanation for A's observations, and so there are many other (potentially very strange and arbitrary) models which account for the data but are simpler than Model 1. It is these models which will determine A's behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal's Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).
For example, if A suspects Model 3 may be true it will be eager to test this hypothesis by outputting 0011011000111010100. It will find that it continues to receive inputs from W1, but no matter, it will move on to testing Model 4, which has the same form but with a different target bitstring. Of course this is not what A will actually do, as there will be other models with complicated effects on behavior, but this gives the flavor of A's failure.
If A somehow did accept Model 1, then we would be back in the situation normally discussed on Less Wrong: A believes that the values on W2 are magically made equal to A's outputs, and so is unconcerned with its own real physical instantiation. In particular, note that having some uncertainty between Model 1 and Model 2 is not going to save A from any of these problems: in the possible worlds in which Model 2 is true, A doesn't care at all what it does (A doesn't "want" its physical instantiation to be destroyed, but by the same token it believes it has no control), and so A's behavior reduces to the normal self-destructive behavior of Model 1.
Approximate AIXI's Behavior
An approximate version of AIXI may be able to save itself from existential despair by a particular failure of its approximate inference and a lack of reflective understanding.
Because A is only an approximation to AIXI, it cannot necessarily find the simplest model for its observations. The real behavior of A depends on the nature of its approximate inference. It seems safe to assume that A is able to discover some approximate versions of Model 1 or Model 2, or else A's behavior will be poor for other reasons (for example, modern humans can't infer the physical theory of everything or the initial conditions of the universe, but their models are still easily good enough to support reductionist views like Model 2), but its computational limitations may still play a significant role.
Why A might not fail
How could A believe Model 1 despite its prior improbability? Well, note that A cannot perform a complete simulation of its physical environment (since it is itself contained in that environment) and so can never confirm that Model 2 really does correctly predict reality. It can acquire what seems to a human like overwhelming evidence for this assertion, but recall that A is learning an input-output relationship and so it may assign zero probability to the statement "Model 2 and Model 1 make identical predictions," because Model 1 depends on the indeterminate input (in particular, if this indeterminate was set to be a truly random variable, then it would be mathematically sound to assign zero probability to this assertion). In this case, no amount of evidence will ever allow A to conclude that Model 2 and Model 1 are identically equivalent--any observed equivalence would need to be the result of increasingly unlikely coincidences (we can view this as a manifestation of A's ignorance about its own implementation of an algorithm).
Now consider A's beliefs about W2. It is relatively easy for A to check (for almost all timesteps) that Model 1 correctly predicts each bit on W2, while A only has enough time to check that Model 2 correctly predicts a few of these bits. Therefore the probability of Model 2 must be decreased by A's estimate of the likelihood that Model 2 would happen to set the correct value for all of the bits that A didn't have time to verify. Model 1's probability must be decreased likewise, but because A was able to check more of Model 1's values, Model 1 leaves less unexplained data and may not be as unlikely as Model 2.
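The trade-off in this argument is easy to put in numbers. Under the pessimistic accounting described above, each bit of W2 that a model did not explicitly verify costs a factor of 1/2, on top of the model's complexity prior. The specific bit counts below are invented purely for illustration.

```python
# Rough sketch of the verification argument: a model's (log) posterior is
# penalized both for its description length and for every output bit it
# could not be checked against. All numbers here are invented.

def log2_posterior(complexity_bits, unverified_bits):
    """log2 of (2^-complexity * (1/2)^unverified); higher is more probable."""
    return -(complexity_bits + unverified_bits)

# Model 1: 10000 extra bits of complexity, but A checked nearly every bit.
m1 = log2_posterior(complexity_bits=10_000, unverified_bits=100)
# Model 2: minimal complexity, but A had time to check only a few bits.
m2 = log2_posterior(complexity_bits=0, unverified_bits=1_000_000)

model1_wins = m1 > m2
```

On these (invented) numbers the verification penalty swamps the complexity penalty and Model 1 wins, which is the mechanism by which A might avoid existential despair.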
Why A probably fails anyway
In general, it is very fragile to expect that an AI will behave correctly because it fails to realize something. There are a number of obvious reasons that A might accept the reductionist hypothesis, even acknowledging the above argument.
First, an effective approximate inference module seems unlikely to make the mistake described in the last section. Making this mistake seems to correlate with other errors which may disrupt normal "sane" behavior: in general, given a model M which the agent cannot exactly simulate, uncertainty about M's outputs (even if M is known to produce mostly the same output, and never known to produce incorrect output) causes the model's probability to drop off exponentially with the number of unverified computations. It seems as though any "sane" AI should be able to assign some constant probability to the hypothesis that this model exactly captures the output, and thereby avoid considering other exponentially improbable hypotheses. This may relate to deeper concerns about approximate Solomonoff induction.
Second, A may be able to observe the operation of the hardware on which it is running. This will generally cause Model 1 to be much less probable: indeed, if A can observe even one "causal ancestor" of W2's value, it will no longer gain very much by believing Model 1 as such (since now Model 1 only produces the correct output if Model 2 did anyway--all of the relative advantage for Model 1 comes from occasions when A can observe the value of W2 without observing the operations directly responsible for that value, which may be rare). Of course there are more complicated models in which A's outputs control reality in more subtle ways, which may have better complexity. Understanding this issue completely depends on a much more detailed knowledge of A's approximate inference and the nature of A's observations. In general, however, being able to observe its own computation seems like it may be adequate to force A into a reductionist model.
Third, A's approximate inference module may be aware of the fact that A's own outputs are produced algorithmically (as a computational aid, not an underlying belief about reality). This would cause it to assign positive probability to the assertion "Model 2 is equivalent to Model 1," and eventually force it into a reductionist model.
Conclusion
Agents designed in the spirit of AIXI appear to be extremely fragile and vulnerable to the sort of existential despair described above. Progress on reflection is probably necessary not only to design an agent which refrains from killing itself when convenient, but even to design an agent which behaves coherently when embedded in the physical universe.