Versions of AIXI can be arbitrarily stupid

Stuart_Armstrong

Many people (including me) had the impression that AIXI was ideally smart. Sure, it was uncomputable, and there might be "up to finite constant" issues (as with anything involving Kolmogorov complexity), but it was, informally at least, "the best intelligent agent out there". This was reinforced by Pareto-optimality results, namely that there was no computable policy that performed at least as well as AIXI in all environments, and strictly better in at least one.

However, Jan Leike and Marcus Hutter have proved that AIXI can be, in some sense, arbitrarily bad. The problem is that AIXI is not fully specified, because the universal prior is not fully specified. It depends on a choice of a initial computing language (or, equivalently, of an initial Turing machine).

For the universal prior, this will only affect it up to a constant (though this constant could be arbitrarily large). However, for the agent AIXI, it could force it into continually bad behaviour that never ends.

For illustration, imagine that there are two possible environments:

The first one is Hell, which will give ε reward if the AIXI outputs "0", but, the first time it outputs "1", the environment will give no reward for ever and ever after that.
The second is Heaven, which gives ε reward for outputting "0" and 1 reward for outputting "1", and is otherwise memoryless.

Now simply choose a language/Turing machine such that the ratio P(Hell)/P(Heaven) is higher than the ratio 1/ε. In that case, for any discount rate, the AIXI will always output "0", and thus will never learn whether its in Hell or not (because its too risky to do so). It will observe the environment giving reward ε after receiving "0", behaviour which is compatible with both Heaven and Hell. Thus keeping P(Hell)/P(Heaven) constant, and ensuring the AIXI never does anything else.

In fact, it's worse than this. If you use the prior to measure intelligence, then an AIXI that follows one prior can be arbitrarily stupid with respect to another.

For illustration, imagine that there are two possible environments:

The first one is Hell, which will give ε reward if the AIXI outputs "0", but, the first time it outputs "1", the environment will give no reward for ever and ever after that.
The second is Heaven, which gives ε reward for outputting "0" and 1 reward for outputting "1", and is otherwise memoryless.

In fact, it's worse than this. If you use the prior to measure intelligence, then an AIXI that follows one prior can be arbitrarily stupid with respect to another.

Interesting paper, but I'm not sure this example is a good way to illustrate the result, since if someone actually built AIXI using the prior described in the OP, it will quickly learn that it's not in Hell since it won't actually receive ε reward for outputting "0".

Here's my attempt to construct a better example. Suppose you want to create an agent that qualifies as an AIXI but keeps just outputting "I am stupid" for a very long time. What you do is give it a prior which assigns ε weight to a "standard" universal prior, and rest of the weight to a Hell environment which returns exactly the same (distribution of) rewards and inputs as the "standard" prior for outputting "I am stupid." and 0 reward forever if the AIXI ever does anything else. This prior still qualifies as "universal".

This AIXI can't update away from its initial belief in the Hell environment because it keeps outputting "I am stupid" for which the Hell environment is indistinguishable from the real environment. If in the real world you keep punishing it (give it 0 reward), I think eventually this AIXI will do something else because its expected reward for outputting "I am stupid" falls below ε so risking almost certainty of the 0 reward forever of Hell for the ε chance of getting a better outcome becomes worthwhile. But if ε is small enough it may be impossible to punish AIXI consistently enough (i.e., it could occasionally get a non-zero reward due to cosmic rays or quantum tunneling) to make this happen.

I think one could construct similar examples for UDT so the problem isn't with AIXI's design, but rather that a prior being "universal" isn't "good enough" for decision making. We actually need to figure out what the "actual", or "right", or "correct" prior is. This seems to resolve one of my open problems.

There is no such thing as an "actual" or "right" or "correct" prior. A lot of the arguments for frequentist statistical methods were that bayesians require a subjective prior, and there is no way to make priors not subjective.

What would it even mean for there to be a universal prior? You only exist in this one universe. How good a prior is, is simply how much probability it assigns to this universe. You could try to find a prior empirically, by testing different priors and seeing how well they fit the data. But then you still ... (read more)

3Squark11y

As I discussed before, IMO the correct approach is not looking for the one "correct" prior since there is no such thing but specifying a "pure learning" phase in AI development. In the case of your example, we can imagine the operator overriding the agent's controls and forcing it to produce various outputs in order to update away from Hell. Given a sufficiently long learning phase, all universal priors should converge to the same result (of course if we start from a ridiculous universal prior it will take ridiculously long, so I still grant that there is a fuzzy domain of "good" universal priors).

4Stuart_Armstrong11y

The example was meant to show that if it was in Heaven, it will behave as if it was in Hell (now that's a theological point there ^_^ ). Your example is more general. The result of the paper is that as long as the AIXI gets a minimum non-zero average reward (essentially), you can make it follow that policy forever.

30

Versions of AIXI can be arbitrarily stupid

30

30

30

Versions of AIXI can be arbitrarily stupid

30

30