paulfchristiano comments on Would AIXI protect itself? - Less Wrong

8 Post author: Stuart_Armstrong 09 December 2011 12:29PM


Comment author: paulfchristiano 10 December 2011 07:29:41AM *  1 point

I don't quite understand why AIXI would protect its memory from modification, or at least not why it would reason as you describe (though I concede that I'm quite likely to be missing something).

In what sense can AIXI perceive a memory corruption as corresponding to the universe "changing"? For example, if I change the agent's memory so that one grue never existed, it seems like it perceives no change: starting in the next step it believes that the universe has always been this way, and will restrict its attention to world models in which the universe has always been this way (not in which the memory alteration is causally connected to the destruction of a grue, or indeed in which the memory alteration has any effect at all on the rest of the world).

It seems like your discussion presumes that a memory modification at time T somehow affects your memories of times near T, so that you can somehow causally associate the inconsistencies in your world model with the memory alteration itself.

I suppose you might hope for AIXI to learn a mapping between memory locations and sense perceptions, so that it can learn that the modification of memory location X leads causally to flipping the value of its input at time T (say). But this is not a model which AIXI can even support without learning to predict its own behavior (which I have just argued leads to nonsensical behavior), because the modification of the memory location occurs after time T, so generally depends on the AI's actions after time T, so is not allowed to have an effect on perceptions at time T.

Comment author: Stuart_Armstrong 10 December 2011 03:33:28PM 0 points

I'm assuming a situation where we're not able to make a completely credible alteration. Maybe the AIXI's memories about the number of grues go: 1 grue, 2 grues, 3 grues, 3 grues, 5 grues, 6 grues... and it knows of no mechanism (in its "most likely" models) for producing two grues at once, while other evidence in its memory is consistent with there being four grues, not three. So it can figure out that there are particular odd moments where the universe seems to behave in odd ways, unlike most moments. And then it may figure out that these odd moments are correlated with human action.
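The kind of inconsistency check described here can be sketched in a few lines of Python. This is a toy illustration only, not part of the AIXI formalism; the transition rule "the grue count changes by at most one per step" stands in for whatever the agent's most likely world models permit:

```python
def find_odd_moments(counts):
    """Return indices where the remembered grue count jumps by more
    than 1 in a single step -- i.e. moments no "most likely" model
    (which only allows one new grue at a time) can account for."""
    return [i for i in range(1, len(counts))
            if abs(counts[i] - counts[i - 1]) > 1]

# The tampered memory sequence from the comment above:
memories = [1, 2, 3, 3, 5, 6]
print(find_odd_moments(memories))  # [4] -- the jump from 3 to 5 grues
```

The agent needn't localise the edit itself; it only needs to notice that some remembered transitions are far less probable under its best models than the rest.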

Comment author: paulfchristiano 12 December 2011 07:26:21PM 0 points

Why are these odd moments correlated with human action? I modify the memory at time 100, changing a memory of what happened at time 10. AIXI observes something happen at time 10, and then a memory modification at time 100. Perhaps AIXI can learn a mapping between memory locations and instants in time, but it can't model a change which reaches backwards in time (unless it learns a model in which the entire history of the universe is determined in advance, and just revealed sequentially, in which case it has learned a good enough self-model to stop caring about its own decisions).

Comment author: Stuart_Armstrong 13 December 2011 11:13:32AM 0 points

I was suggesting that if the time difference wasn't too large, the AIXI could deduce "humans plan at time 10 to press button" -> "weirdness at time 10 and button pressed at time 100". If it's good at modelling us, it may be able to deduce our plans long before we do, and as long as the plan predates the weirdness, it can model the plan as causal.

Or if it experiences more varied situations, it might deduce "no interactions with humans for long periods" -> "no weirdness", and act in consequence.
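The proposed inference "no interactions with humans for long periods -> no weirdness" amounts to checking whether every anomalous step falls within, or shortly after, a period of human contact. A toy sketch (the function name, the boolean-per-timestep encoding, and the `window` parameter are illustrative assumptions, not anything from the thread):

```python
def weirdness_only_near_humans(contact, weird, window=2):
    """contact, weird: equal-length lists of booleans, one per time step.
    Return True if every weird step overlaps human contact within the
    preceding `window` steps -- the correlation the comment describes."""
    for t, is_weird in enumerate(weird):
        if is_weird and not any(contact[max(0, t - window):t + 1]):
            return False
    return True

contact = [False, True, True, False, False, False]
weird   = [False, False, True, False, False, False]
print(weirdness_only_near_humans(contact, weird))  # True
```

An agent that learned this regularity could then treat human interaction as a causal precursor of the anomalies, without ever needing a model in which effects reach backwards in time.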

Comment author: hairyfigment 01 December 2013 07:42:16PM *  -1 points

ETA: misunderstood the parent. So it might think our actions made a grue, and would enjoy being told horrible lies which it could disprove. Except I don't know how this interacts with Eliezer's point.