Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A's control over W2's values in 10000 bits seems quite optimistic).
I don't see that. Suppose A is connected to a simple repeater station that copies back its output, and this copy is always the first piece of its input. Then "it's a repeater station" is a much simpler program than "it ignores my outputs and produces these exact sequences for complicated reasons", if A happens to have complicated outputs.
The point is that a physical simulation of the universe includes a description of A. If A simulates the universe, it sees that its outputs appear on W2 without postulating anything beyond physical law. (This is all most straightforward in the case where A does perfect induction and the shortest explanation for A's observations takes the form of pointing to A in the universe, which is the one I started with; if A is capable of executing human-equivalent reasoning, though, the same thing happens in less easily formally analyzed universes.) I will try to edit the post to be more clear.
A true AIXI can't model itself at all. A real AIXI variant (such as AIXItl) will not be able to model itself except in the rarest circumstances. Therefore its model of the universe as containing A is not going to be a simple one, because it can't model A's outputs. Therefore it seems that it will always be unable (as you said in another comment) to "recompute its inputs from the environment".
I can't model myself (or other humans) at all either, and yet I can still learn a model in which human behavior results from physical law, and make predictions on the basis of that model.
If A simulates the universe, it sees that its outputs appear on W2 without postulating anything beyond physical law.
Usually you can't simulate the whole universe up to the current time from inside that universe - even if you know the laws and initial conditions - because you run out of storage space. With a growing universe, there may be some possibility of completely simulating its early stages...
Again, neither can I, but I still have beliefs about what the shortest description of the universe is. In particular, I don't believe that my actions figure into the shortest model of the world in a special way, although there is some uncertainty about whether I am only able to hold this belief because I am more introspective than AIXI.
I don't know if I understand this -- do you mean that A fails to "recognize itself in a mirror" for a new and surprising reason?
As near as I can tell, you say that A would not recognize itself in what you call "the shortest explanation for A's observations" even though this does in fact describe how A works. And since we never programmed A to want to model itself, it lacks the self-awareness to realize that changing its actions can change the outcome. (It can't even use a theory like TDT to show that it should act as if it controls the reductionist image of itself, because that would require having an explicit self-model.)
I wouldn't normally express this by calling Model 2 simpler than Model 1, as you did in the OP -- the parent comment suggests to me that A mislabels Model 1 as Model 2 -- so maybe I still don't get what you mean.
AIXI learns a function f : outputs -> inputs, modeling the environment's response to AIXI's outputs.
Let y be the output of A. Then we have one function f1(y) which uses y to help model the world, and another function f2(y) which ignores y and essentially recomputes it from the environment. These two models make identical predictions when applied to the actual sequence of outputs of the algorithm, but make different predictions about counterfactuals which are essential to determining the agent's behavior. If you are using f1, as AIXI intends to, then you do a sane thing if you try and rely on causal control. If you are using f2, as AIXI probably actually would, then you have no causal control over reality, and so go catatonic if you rely on causal control.
I'll try and make this a little more clear.
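A minimal toy sketch of this distinction (nothing here is AIXI itself; the environment, the utility, and both models are made up purely for illustration):

```python
# Toy illustration: two environment models that make identical predictions on the
# actual history, but different predictions in counterfactuals, and therefore
# recommend different actions.

history = [(1, 1), (0, 0), (1, 1)]  # (output y_t, observed input x_t): a bit-repeater world

def f1(ys):
    """Model f1: the environment echoes the agent's outputs (causal control)."""
    return list(ys)

def f2(ys, _fixed=[x for _, x in history] + [1]):
    """Model f2: the environment ignores the outputs and replays a fixed tape
    (the observed inputs plus an arbitrary continuation)."""
    return _fixed[:len(ys)]

past_outputs = [y for y, _ in history]
# Identical predictions on the actual history:
assert f1(past_outputs) == f2(past_outputs) == [x for _, x in history]

def utility(inputs):
    return sum(inputs)  # say the agent is rewarded for 1s on its input wire

# Counterfactual next action: the two models now disagree about what the agent controls.
for model in (f1, f2):
    best = max([0, 1], key=lambda a: utility(model(past_outputs + [a])))
    print(model.__name__, "recommends outputting", best)
# Under f1 the choice matters (output 1); under f2 both actions look equally good,
# so the recommendation is arbitrary.
```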
If we forget the specific references to AIXI, and simply stipulate that "A" is an expected-utility maximizer with good inferential capabilities and an arbitrary utility function, what happens? If the choice is between Model 1 and Model 2, then Model 1 should be simpler because it doesn't contain an A-shaped hole in the universe, inside which different laws apply. In fact, I don't even understand why Model 2 makes sense - it ignores the causal role that the output wire's behavior plays for the rest of the universe. According to Model 2, there is a steady stream of uncaused events occurring at the output wire, which then feeds into the cause-and-effect of the rest of the universe, and there is a steady stream of events at the input wire which have no effect on anything else (unlike all other events in the universe).
I see no reason why the AI can't infer the existence of something like itself, playing a causal role in its region of the universe. It cannot contain a complete simulation of itself, but then it can't contain a complete simulation of the whole universe outside its box, either. Its cognitive resources are necessarily bounded, both in respect of its ability to test all possible causal models against the data, and in respect of its ability to store data, or store a detailed model of the universe. It will necessarily be operating heuristically, but it should be capable of coming up with the idea of a big universe with universal laws, and it should be capable of extracting valuable guides to action from that concept, even though it doesn't and can't build a completely detailed model of the whole universe.
So how might its evolving model of the universe look? A bit like the paradigm of Hashlife applied to real physics. Hashlife is an algorithm for simulating Conway's Game of Life in which you develop lookup tables for recurring configurations: instead of working your way through the computations again, you just use the lookup table (and possibly skip over many Game-of-Life steps to a final configuration). In the real world, we have, let's say, the standard model coupled to gravity, and above that quantum chemistry, and above that a description of the world as continuum objects with stress and strain properties. If you are an agent acting in that world, you may characterize the states of the objects around you at varying degrees of resolution, and apply principles from different "levels" of physics in order to reason about their future behavior, but the high-level principles are supposed to be deductively consistent with the bottom level being the actual reality everywhere.
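For what it's worth, here is a toy sketch of the memoization trick that Hashlife relies on (a small toroidal grid stands in for the real quadtree machinery; only the lookup-table idea is illustrated):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def step_block(block):
    """Evolve a small toroidal Game-of-Life block (tuple of row-tuples) by one step,
    caching the result so recurring configurations are looked up, not recomputed."""
    h, w = len(block), len(block[0])
    def live_neighbours(r, c):
        return sum(block[(r + dr) % h][(c + dc) % w]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))
    return tuple(tuple(1 if live_neighbours(r, c) == 3
                       or (block[r][c] and live_neighbours(r, c) == 2) else 0
                       for c in range(w))
                 for r in range(h))

# A blinker on a 5x5 grid oscillates between two configurations, so after the
# first two steps every further step is a cache hit.
blinker = tuple(tuple(1 if (r == 2 and c in (1, 2, 3)) else 0 for c in range(5))
                for r in range(5))
state = blinker
for _ in range(1000):
    state = step_block(state)
print(step_block.cache_info())  # 998 hits, 2 misses
```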
If our AI is just a state machine, internally evolving a finite state machine that will guide its actions, then it can still develop a similar multilayered "Hashphysics" model of the world. It may not have, declaratively stored in it anywhere, the proposition "the world is made of atoms", but it can certainly have a multilevel model of the world-state which is updated on the basis of rules functionally equivalent to such a belief. The state-of-the-world model at any time might look like: stuff arranged in Euclidean space, extending to infinity in all directions; an assortment of macroscopic objects in the immediate vicinity, with specified approximate shapes, textures, and other properties; a belief that certain objects are made of specific substances, implying that they are made of a population of specific molecular entities presenting a particular thermodynamic profile; and finally, should the AI find itself interacting directly with a specific microscopic entity, that will be modeled in terms consistent with fundamental physics. The rules for updating the concrete world-model at different levels would encompass, not just positing the existence of greater detail when that becomes necessary, but also forgetting low-level physical details once they became irrelevant.
As it interacts with the world and infers more and more about the specific objects existing in the world, I see every reason for it to eventually posit its own existence - of course, it won't know or think "that is me"; but there will be an object in the world-model which is the representation of itself. (Eventually, there might even be an object in the world-model representing its own self-representation...)
If the choice is between Model 1 and Model 2, then Model 1 should be simpler because it doesn't contain an A-shaped hole in the universe, inside which different laws apply.
Model 2 doesn't have this hole; it includes a description of the agent as part of the world. However, belief in it doesn't have any effect on A's decisions, so this point is of unclear relevance.
Recall the definition of AIXI: A will try to infer a simple program which takes A's outputs as input and provides A's inputs as output, and then choose utility maximizing actions with respect to that program.
Uh - what? That seems to be an unrecognisable version of AIXI. Just to start with, it uses all possible programs consistent with its observations to date.
If model A is substantially simpler than model B, its contribution to the expected utility calculation dominates model B's. Of course you won't concentrate all the mass on one model, but it is quite common to concentrate the mass on a class of nearly equivalent models.
I was giving an intuitive explanation, which seems to map pretty clearly onto the formal definition. I can be more explicit if you want, but I suspect you don't actually find it unrecognizable.
Recall the definition of AIXI: A will try to infer a simple program which takes A's outputs as input and provides A's inputs as output, and then choose utility maximizing actions with respect to that program.
I don't think this is an accurate description of AIXI.
At time step n, AIXI uses Solomonoff induction to infer a probabilistic mixture of programs (not just one simple program) that take no inputs and produce a sequence of at least n+T triples, where T is a fixed time horizon. Triples are of the form (percept(t), action(t), reward(t)). (Your model doesn't mention any reward channel, hence you couldn't possibly be referring to anything like AIXI.) In more detail:
At time step n, the AIXI agent has a recorded history of n-1 triples (percept(0), action(0), reward(0)), ..., (percept(n-1), action(n-1), reward(n-1)) and a new percept(n). It runs all the syntactically correct programs and filters out those that produce outputs inconsistent with the recorded history and the new percept (or that enter a dead-end non-halting configuration before having produced enough output). For each remaining program, AIXI adds up the computed future rewards up to the time horizon (possibly with some time discounting) and multiplies the total by the program weight (2^-length) to compute an expected cumulative future reward. Then AIXI picks the program with the highest expected cumulative future reward and executes the action(n) computed by that program.
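To make that concrete, here is a finite toy transcription of the procedure described above (real AIXI ranges over all programs and is uncomputable; the "programs", their lengths, the history format, and the horizon here are all stand-ins):

```python
# Each candidate "program" is represented as a zero-argument function returning a
# full trace [(percept, action, reward), ...]; "length" stands in for its
# description length in bits, so its weight is 2 ** -length.

def filter_consistent(programs, recorded_history, new_percept):
    """Keep programs whose trace reproduces the recorded history and the new percept."""
    n = len(recorded_history)
    survivors = []
    for prog, length in programs:
        trace = prog()
        if len(trace) > n and trace[:n] == recorded_history and trace[n][0] == new_percept:
            survivors.append((length, trace))
    return survivors

def choose_action(programs, recorded_history, new_percept, horizon):
    """Pick the action of the highest-scoring surviving program, as described above."""
    n = len(recorded_history)
    best_action, best_score = None, float("-inf")
    for length, trace in filter_consistent(programs, recorded_history, new_percept):
        future_reward = sum(r for _, _, r in trace[n:n + horizon])
        score = 2.0 ** -length * future_reward      # weight by 2^-length
        if score > best_score:
            best_score, best_action = score, trace[n][1]
    return best_action
```

(One caveat: AIXI proper compares actions by summing expected reward over the whole surviving mixture rather than committing to the single best-scoring program, but the 2^-length weighting is the same.)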
It is these models which will determine A's behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal's Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).
I don't understand.
Suppose we have two models, 3.1 and 3.2. 3.1 is: "Given an output sequence n, add a to utility gained." 3.2 is: "Given an output sequence n, subtract a from utility gained."
It looks to me like these should have the same complexity. In addition, if AIXI outputs the sequence n, and no change to utility is observed, and 3.1 and 3.2 have the same prior probability, then they should wind up with the same posterior probability. Since AIXI is considering every possible model, for any model of the form 3.1, it will also consider a model of the form 3.2, and the two should cancel out. Since all of these pairs of models should cancel out, the class of Model 3-like models should have as little impact on the final action as Model 2.
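(To put made-up numbers on the cancellation: if 3.1 and 3.2 each end up with posterior weight p, their joint contribution to the expected utility of outputting n is p(+a) + p(-a) = 0, and their contribution to the expected utility of any other output is 0 + 0 = 0, so the pair never changes which action looks best.)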
Or are you saying that the failure of Model 3-like models to sum to zero is a consequence of the approximations in AIXItl?
[...] assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand.
Do you mean, "once a full description of the universe is in hand", and that the 10000 bits are the complexity of locating W1 and W2 in the full description?
- A's outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)
- A's outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)
Model 2 still has to locate W1. What might be the complexity of locating W2 conditional on having located W1? It seems plausible that this extra complexity would be within a constant of the complexity of describing the AIXI algorithm and something like counterfactual physical reasoning about alternative inputs, to check whether the known behavior of AIXI is consistent with W2 given counterfactual input at W1.
Do you mean, "once a full description of the universe is in hand", and that the 10000 bits are the complexity of locating W1 and W2 in the full description?
Yes, and describing the procedures "read the voltage off W1 at the following sequence of times" and "set the voltage on W2 at the following sequence of times."
What might be the complexity of locating W2 conditional on having located W1? It seems plausible that this extra complexity would be within a constant of the complexity of describing the AIXI algorithm and something like counterfactual physical reasoning about alternative inputs, to check whether the known behavior of AIXI is consistent with W2 given counterfactual input at W1.
The conditional complexity seems substantially lower (you don't need to describe AIXI, I think, because you get its output for free?), but still large enough to make the argument go through.
If approximate AIXI A is part of the world, its state of knowledge is also part of the world, so it influences A's output and A's reward. Consider programs B that know the world program W (the observation/reward channel; W doesn't take any arguments and includes A as a part) and the action program A (which doesn't take any arguments, already taking W into account). Programs B take A's hypothetical output and output the value of W (reward and input) inferred under the logical assumption that A's output is equal to the hypothetical output. In other words, B acts as the counterfactual inference module of a UDT agent with agent-program A and world-program W.
What happens if A believes B probable? It starts acting as a UDT agent: its actual input is correctly predicted by B, so B retains its believability, and its output is whatever B suggests among the hypothetical outputs. Maybe A can then be taught to become UDT, by rewarding its decisions that coincide with UDT's decisions.
The first paragraph is certainly the idea I was trying to communicate.
Perhaps A could be taught to behave like a UDT agent, in the sense that it could be taught to be any algorithm by such training, but it does not seem like this can ever allow it to believe in the model B?
It can't be taught to be any algorithm, since it has to maximize reward; that part is fixed. My point was mostly unrelated to yours (but inspired by it): the set of correct (i.e. never-falsified) program-models - those more like Model 1, listening to hypothetical outputs; that is, I'm stipulating that what you discuss in the post doesn't happen or doesn't matter - includes UDT-like agents that reason about the logical dependence of reward on their decisions, not just explicit dependence schemes (see the "against explicit dependence" section of this post) that suppose some kind of Cartesian magic.
That is, there probably is an AIXI configuration (a state of knowledge, or an initial training observation/reward prefix) that turns AIXI into a competent agent that gets all sorts of decision problems right: it can reason about itself and about cooperation with other copies, care about counterfactuals, and so on. That's an unexpected result, quite different from what I previously believed (even though it doesn't assert that this happens on its own - but I can't really tell).
So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A's actions.
Given this fact, I couldn't follow the discussion of A's tendency to believe Model 2 in the "approximate AIXI" section. Model 2 doesn't have any effect, so why worry about its probability? What matters is the relative probability of models that do depend on their arguments.
There are only slightly more complicated models (e.g. Model 3) which are indistinguishable from model 2 but do depend on their arguments.
If Model 2 is more likely than Model 1, behavior is basically guaranteed to be arbitrary.
If Model 2 is more likely than Model 1, behavior is basically guaranteed to be arbitrary.
How so? The part of the distribution that matters consists of the programs that respond to input, so we might as well stipulate that we only allow such programs in the distribution, and "Model 2" can't be part of it. This restriction doesn't dramatically change the original form of the decision algorithm.
Voted up for being an insightful observation.
I think the core issue arises when A locates a model of the world that includes a model of A itself, thus explaining away the apparent correlation between the input and output tapes. I don't have a watertight objection to your argument, but I'm also not convinced that it goes through so easily.
Let's stick to the case where A is just a perfectly ordinary Turing approximation of AIXI. It seems to me that it's still going to have quite some difficulty reasoning about its own behaviour. In particular, suppose A locates a hypothesis H = "the world consists of an A connected to a W1 and a W2, and my outputs are irrelevant". Then the first step is that A asks what happens if its next output is (say) 0. To do that it needs to run H to produce the next bit that it expects to receive from the world. But running H involves running a simulation of A, and inside that simulation the exact same situation arises, namely that sim(A) considers various outputs that it might make and then runs simulations of its inferred model of the world, which themselves contain models of A, resulting in another level of recursion to sim(sim(A)), and so on in an infinite loop. Actually, I don't know what AIXI does about Turing machines that fail to produce output...
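A toy sketch of that regress (purely illustrative; the function names are made up, and a real approximate AIXI wouldn't literally be coded this way):

```python
import sys
sys.setrecursionlimit(100)  # keep the inevitable blow-up small

def run_hypothesis_H(depth=0):
    """Evaluating H ('the world contains a copy of me') requires simulating that copy."""
    return simulate_agent(depth + 1)

def simulate_agent(depth):
    """The simulated agent, in turn, evaluates candidate outputs by running its model H."""
    return run_hypothesis_H(depth)

try:
    run_hypothesis_H()
except RecursionError:
    print("sim(sim(sim(...))) never bottoms out")
```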
A different, perhaps weaker objection is that AIXI conditions on its outputs when performing inference, so they don't count towards the "burden of explanation". That doesn't resolve the issue you raise but perhaps this does: Is it possible to make Model 2 just slightly simpler by somehow leveraging the "free" information on the output tape? Perhaps by removing some description of some initial conditions from Model 2 and replacing that with a function of the information on the output tape. It's not clear that this is always possible but it seems plausible to me.
Then the first step is that A asks what happens if its next output is (say) 0. To do that it needs to run H to produce the next bit of output. But running H involves running a simulation of A, and inside that simulation the exact same situation arises, namely that sim(A) considers various outputs that it might make and runs simulations of the world, resulting in another level of recursion to sim(sim(A)), and so on in an infinite loop.
This seems to be the observation that you can't have a Turing machine that implements AIXI. An approximate AIXI is not going to be able to simulate itself.
Is it possible to make Model 2 just slightly simpler by somehow leveraging the "free" information on the output tape?
I don't think this is possible, although it is an interesting thought. The main issue is that before you get to leverage the first N bits of AIXI's output you have to also explain the first N bits of AIXI's input, which seems basically guaranteed to wash out the complexity gains (because all of the info in the first N bits of AIXI's output was coming from the first N bits of AIXI's input).
This seems to be the observation that you can't have a Turing machine that implements AIXI. An approximate AIXI is not going to be able to simulate itself.
Yes, I guess you're right. But doesn't this also mean that no computable approximation of AIXI will ever hypothesize a world that contains a model of itself? For if it did, it would go into the infinite loop I described. So it seems the problem of Model 2 will never come up?
The main issue is that before you get to leverage the first N bits of AIXI's output you have to also explain the first N bits of AIXI's input
Not sure I'm understanding you correctly but this seems wrong. AIXI conditions on all its outputs so far, right? So if the world is a bit-repeater then one valid model of the world is literally a bit repeater, which explains the inputs but not the outputs.
This is totally awesome. Today is a good day on LW, first Stuart's result and now yours.
Funny, a couple weeks ago I wrote a comment explaining why Schmidhuber's Gödel machines probably don't work either, but I guess people didn't notice it or considered it too obviously wrong. What do you think?
I agree that Gödel machines probably don't work for similar reasons, though I didn't notice your post until now. Unfortunately, I also think that your non-self-referential alternative runs into similar issues (where subsequent agents use successively weaker axiom systems, if they follow the same general design). I have been thinking along these lines independently, and I think resolving either problem is going to involve dealing with more fundamental issues (e.g., agents should not believe themselves to be well-calibrated). I've queued a rather long series of LW posts establishing what I consider the current state of affairs on FAI-related open problems, a few of which concern this issue (and of which the OP is the first).
I've queued a rather long series of LW posts establishing what I consider the current state of affairs on FAI-related open problems, a few of which concern this issue (and of which the OP is the first).
Nice! Does that mean you have many new results queued for posting? What can I do to learn them sooner? :-)
Unfortunately, I also think that your non-self-referential alternative runs into similar issues (where subsequent agents use successively weaker axiom systems, if they follow the same general design).
Can you expand?
I wrote a comment explaining why Schmidhuber's Gödel machines probably don't work either
I'm not sure about that billing - the comment seems to give up before much of a conclusion gets drawn.
Gödel machines are like GAs. The problem is not so much that they don't work at all - but rather that they may work too slowly.
It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror.
So: the approach taken by nature is to wall off the brain behind a protective barrier and surround it with pain sensors. Not a final solution - but one that would take machines up to beyond human level - at which point there will be a lot more minds available to work on the problem.
a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.
Uh huh. So what is your theory about why Hutter and Legg haven't noticed this so far?
AIXI works fine in the setting where it is supposed to work. I don't think anyone expects that AIXI works well when it is very powerful and actually embedded in the universe. My comment doesn't really change the state of affairs (since it already seems pretty indisputable that AIXI will do undesirable things like breaking itself).
a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.
A "reductionist world model"? Maybe it knows how to get science done.
A reductionist world model is defined formally here; it is just a world model in which A's actions are determined by normal physical law applied to A's execution, rather than being the product of a non-physical intervention. I don't understand your remark.
Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A's control over W2's values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.
AIXI does not calculate simplicity using human heuristics. It's quite hard to justify claims that one model is "substantially simpler", or around "10000 bits simpler" using whatever unspecified, non-algorithmic reasoning you used to derive that gross approximation.
AIXI uses infinite computational power and searches through the space of all computable programs to find the subset which perfectly predicts its observations to date. Then, for each such program, it computes each action it can take at the current time moment and any reward it receives as a direct result. The entire process then recurses, like a min-max search algorithm, eventually returning the maximum score based on the summed future reward it receives in the particular future predicted by that particular predictor program. It then sums ALL of the potential actions over ALL of the potential futures, weighted by the inverse exponent of the length of the predictor program. For AIXI, complexity is simply program length, nothing more, nothing less. It has nothing to do with human language conventions such as your distinction between Model 2 and Model 1.
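For reference, Hutter's action-selection rule has roughly this shape (glossing over horizon and discounting details); it is the formal version of "sum over all consistent predictor programs, weighted by 2 to the minus program length":

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \left( r_k + \cdots + r_m \right) \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

where U is a universal monotone Turing machine and ℓ(q) is the length of program q.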
If AIXI were practical, and humans could understand the resulting predictor programs, the types of programs it would explore are best understood as the simplest form of complex universal theories of everything. AIXI would solve physics, in the grand sense. It would create programs that predict the entire universe. It would simulate the universe from the big bang to now, and it would do that an infinite number of times, until it found the universes that exactly coincide with its observations. The supposed '10000 bit' difference between your W1 and W2 is no difference at all to such a thing. It specifies complexity at a far baser and more profound level.
You are also forgetting that the core of AIXI is a move/countermove search system. After its first move (predictably random), it would then quickly converge on a subset of programs that perfectly predict all previous observations AND the consequences of its first action. The complexity of models that ignore that action is completely irrelevant if they are inaccurate. (AIXI exploits infinite computational power - it only considers perfect predictors.)
I don't quite understand; it seems like you either misunderstand badly or are being extremely uncharitable.
I smuggled in by hypothesis the claim that specifying the part of the model "A's outputs are applied as voltages on W2" takes 10,000 bits. If you believe the world is governed by simple physical law, this seems extremely generous. This is a statement about a program specifying a certain model for the world, not about English-language descriptions. I don't know why the 10000 bit difference is 'supposed.'
The complexity of models that ignore that action is completely irrelevant if they are inaccurate. (AIXI exploits infinite computational power - it only considers perfect predictors.)
Model 1 and Model 2 are both perfect predictors.
I smuggled in by hypothesis the claim that specifying the part of the model "A's outputs are applied as voltages on W2" takes 10,000 bits.
AIXI-like algorithms don't need to explicitly model any particular features of the world. It doesn't matter if the wire uses a googolplex of bits; that's completely unrelated to the bit-complexity cost of AIXI's estimator programs.
Model 1 and Model 2 are both perfect predictors.
Both cannot simultaneously be perfect predictors, as they completely disagree. Model 1 is correct when the output wire W2 is intact; Model 2 is correct only when the output wire W2 is cut or removed.
AIXI will always be trying to send actions down the W2 wire and it will quickly realize whether the wire is intact or not, converging on 1-type models or 2-type models, not both. And it will converge on the correct model type.
It has been observed on Less Wrong that a physical, approximate implementation of AIXI is unable to reason about its own embedding in the universe, and therefore is apt to make certain mistakes: for example, it is likely to destroy itself for spare parts, and is unable to recognize itself in a mirror. But these seem to be mild failures compared to other likely outcomes: a physical, approximate implementation of AIXI is likely to develop a reductionist world model, doubt that its decisions have any effect on reality, and begin behaving completely erratically.
Setup
Let A be an agent running on a physical computer, implementing some approximate version of AIXI. Suppose that A is running inside of an indestructible box, connected to the external world by an input wire W1 and an output wire W2.
Suppose that this computer exists within a lawful physical universe, governed by some rules which can be inferred by A. For simplicity, assume that the universe and its initial conditions can be described succinctly and inferred by A, and that the sequence of bits sent over W1 and W2 can be defined using an additional 10000 bits once a description of the universe is in hand. (Similar problems arise for identical reasons in more realistic settings, where A will work instead with a local model of reality with more extensive boundary conditions and imperfect predictability, but this simplified setting is easier to think about formally.)
Recall the definition of AIXI: A will try to infer a simple program which takes A's outputs as input and provides A's inputs as output, and then choose utility maximizing actions with respect to that program. Thus two models with identical predictive power may lead to very different actions, if they give different predictions in counterfactuals where A changes its output (this is not philosophy, just straightforward symbol pushing from the definition of AIXI).
AIXI's Behavior
First pretend that, despite being implemented on a physical computer, A was able to perform perfect Solomonoff induction. What model would A learn then? There are two natural candidates:
- A's outputs are fed to the output wire W2, the rest of the universe (including A itself) behaves according to physical law, and A is given the values from input wire W1 as its input. (Model 1)
- A's outputs are ignored, the rest of the universe behaves according to physical law, and A is given the values from W1 as its input. (Model 2)
Both of these models give perfect predictions, but Model 2 is substantially simpler (around 10000 bits simpler, and specifying A's control over W2's values in 10000 bits seems quite optimistic). Therefore A will put much more probability mass on Model 2 than Model 1. In fact, Model 2 or its close variants probably receive almost all of the probability mass.
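(Concretely, under a length-based prior a model that is about 10000 bits shorter receives roughly 2^10000 times the weight, which is why the extra complexity of Model 1 is decisive rather than a modest handicap.)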
If A believed Model 2, what would its actions be? Well, in Model 2 A's outputs are completely ignored! So although Model 2 may make up most of the probability mass, it contributes nothing important to the expected value computation determining A's actions. So maybe A will make decisions as if it believed Model 1, and therefore behave appropriately for a surprising reason? In fact this is unlikely: there are many other models of intermediate complexity, for example:
- A's outputs are ignored and the rest of the universe behaves according to physical law, but if A ever outputs the sequence 0011011000111010100, A stops receiving the values from W1 as its input. (Model 3)
Model 3 is an intentionally silly example, but hopefully the point is clear: Model 1 is enormously more complicated than the simplest explanation for A's observations, and so there are many other (potentially very strange and arbitrary) models which account for the data but are simpler than Model 1. It is these models which will determine A's behavior, and the result is almost certainly very arbitrary (this is similar to some discussions of Pascal's Wager: it is possible that all of these strange models will cancel out and add up to normality, but it seems outlandishly unlikely).
For example, if A suspects Model 3 may be true it will be eager to test this hypothesis by outputting 0011011000111010100. It will find that it continues to receive inputs from W1, but no matter, it will move on to testing Model 4, which has the same form but with a different target bitstring. Of course this is not what A will actually do, as there will be other models with complicated effects on behavior, but this gives the flavor of A's failure.
If A somehow did accept Model 1, then we would be back in the situation normally discussed on Less Wrong: A believes that the values on W2 are magically made equal to A's outputs, and so is unconcerned with its own real physical instantiation. In particular, note that having some uncertainty between Model 1 and Model 2 is not going to save A from any of these problems: in the possible worlds in which Model 2 is true, A doesn't care at all what it does (A doesn't "want" its physical instantiation to be destroyed, but by the same token it believes it has no control), and so A's behavior reduces to the normal self-destructive behavior of Model 1.
Approximate AIXI's Behavior
An approximate version of AIXI may be able to save itself from existential despair by a particular failure of its approximate inference and a lack of reflective understanding.
Because A is only an approximation to AIXI, it cannot necessarily find the simplest model for its observations. The real behavior of A depends on the nature of its approximate inference. It seems safe to assume that A is able to discover some approximate versions of Model 1 or Model 2, or else A's behavior will be poor for other reasons (for example, modern humans can't infer the physical theory of everything or the initial conditions of the universe, but their models are still easily good enough to support reductionist views like Model 2), but its computational limitations may still play a significant role.
Why A might not fail
How could A believe Model 1 despite its prior improbability? Well, note that A cannot perform a complete simulation of its physical environment (since it is itself contained in that environment) and so can never confirm that Model 2 really does correctly predict reality. It can acquire what seems to a human like overwhelming evidence for this assertion, but recall that A is learning an input-output relationship and so it may assign zero probability to the statement "Model 2 and Model 1 make identical predictions," because Model 1 depends on the indeterminate input (in particular, if this indeterminate was set to be a truly random variable, then it would be mathematically sound to assign zero probability to this assertion). In this case, no amount of evidence will ever allow A to conclude that Model 2 and Model 1 are identically equivalent--any observed equivalence would need to be the result of increasingly unlikely coincidences (we can view this as a manifestation of A's ignorance about its own implementation of an algorithm).
Now consider A's beliefs about W2. It is relatively easy for A to check (for almost all timesteps) that Model 1 correctly predicts each bit on W2, while A only has enough time to check that Model 2 correctly predicts a few of these bits. Therefore the probability of Model 2 must be multiplied by A's estimate of the likelihood that Model 2 would happen to set the correct value for all of the bits that A didn't have time to verify. Model 1's probability must be discounted likewise, but because A was able to check more of Model 1's values, Model 1 leaves less unexplained data and may not be as unlikely as Model 2.
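(As a rough illustration: if A can only assign some independent probability p < 1 to each unverified prediction coming out right, then a model with u unverified bits is discounted by a factor of about p^u, and Model 2 has far more unverified bits than Model 1.)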
Why A probably fails anyway
In general, it is very fragile to expect that an AI will behave correctly because it fails to realize something. There are a number of obvious reasons that A might accept the reductionist hypothesis, even acknowledging the above argument.
First, an effective approximate inference module seems unlikely to make the mistake described in the last section. Making this mistake seems to correlate with other errors which may disrupt normal "sane" behavior: in general, given a model M which the agent cannot exactly simulate, uncertainty about M's outputs (even if M is known to produce mostly the same output, and never known to produce incorrect output) causes the model's probability to drop off exponentially with the number of unverified computations. It seems as though any "sane" AI should be able to assign some constant probability to the hypothesis that this model exactly captures the output, and thereby avoid considering other exponentially improbable hypotheses. This may relate to deeper concerns about approximate Solomonoff induction.
Second, A may be able to observe the operation of the hardware on which it is running. This will generally cause Model 1 to be much less probable: indeed, if A can observe even one "causal ancestor" of W2's value, it will no longer gain very much by believing Model 1 as such (since now Model 1 only produces the correct output if Model 2 did anyway--all of the relative advantage for Model 1 comes from occasions when A can observe the value of W2 without observing the operations directly responsible for that value, which may be rare). Of course there are more complicated models in which A's outputs control reality in more subtle ways, which may have better complexity. Understanding this issue completely depends on a much more detailed knowledge of A's approximate inference and the nature of A's observations. In general, however, being able to observe its own computation seems like it may be adequate to force A into a reductionist model.
Third, A's approximate inference module may be aware of the fact that A's own outputs are produced algorithmically (as a computational aid, not an underlying belief about reality). This would cause it to assign positive probability to the assertion "Model 2 is equivalent to Model 1," and eventually force it into a reductionist model.
Conclusion
Agents designed in the spirit of AIXI appear to be extremely fragile and vulnerable to the sort of existential despair described above. Progress on reflection is probably necessary not only to design an agent which refrains from killing itself when convenient, but even to design an agent which behaves coherently when embedded in the physical universe.