As a fellow fan of AIXI and, more generally, idealized models of agency and the alignment problem, I'm happy to see this comparison. Imho, AIXI- and AIXI-adjacent thinking is still a tiny minority position, even among the more mathematically inclined people in alignment. That's a shame.
I like your comparison of AIXI with other theory frameworks. People should make inside-view comparisons like this more often.
That said, my main objection to AIXI for alignment is that it is blind to inner alignment. The model of AIXI already assumes you can encode the agent with whatever utility function you want. That limits its direct application to alignment.
I am most excited for modern learning theory & RL theory to meet AIXI & agent foundations. I think, at least mathematically, this is where the alignment problem fundamentally lies.
This is one of the main things I am thinking about, and I agree that there seems to be a major obstacle here, at least in terms of conceptual clarity. What I'll write here benefited a lot from @michaelcohen's ideas, though I by no means expect him to agree with all of it. Also, I'm going to try to disambiguate a few objections / problems, so some of the following may seem like non-sequiturs or may miss your specific point.
First, I think that the distinction between inner and outer alignment is not so crisp, and perhaps we can discuss an analogue of the inner alignment problem in the AIXI framework. The reward-generating mechanism is considered part of AIXI's environment, even though the rewards appear as part of its percepts, so AIXI does need to learn how rewards are generated. The universal distribution is rich enough to discuss malign priors that hamper correction by exploration. So, a malign hypothesis with too much prior weight may steer AIXI to persistently believe in a deranged reward-generating mechanism. Of course, AIXI's literal goal is still "outer aligned" in the sense that it maximizes expected returns, but I think that the situation I've described may look a lot like goal misgeneralization. It's different from the types of misgeneralization / inner misalignment we talk about in trained neural networks (I'll adopt Michael Cohen's TS for "Trained Systems") specifically because we can't be sure they're robustly "trying to maximize reward" (meaning expected returns). But, this relies on a sort of goal/belief distinction which can be a bit fragile in agent foundations; perhaps the types of dynamics that arise in AIXI already capture what you want to understand about inner misalignment (though I am confused about this).
Certainly, we also need to question whether TSs robustly try to maximize reward, if we want to port safety plans for AIXI to the real world. The popular objection from Alex Turner is that TSs are (behaviorally) shaped by reward, but "reward is not the optimization target." Why not?
TSs are certainly incentivized to "get reward," and this incentive is increasingly the source of most of their capabilities (as compared to architecture or pretraining, which are needed for liftoff). They still do not "receive rewards" over their lifetime today, but of course the labs want to run them on longer tasks with continual learning. At that point, they might get a huge gradient update at the end of a (weeks-long) task, but it seems much more likely that they'll be updated by gradients based on rewards many times over a task (or a deployment spanning several tasks). Possibly other gradient updates will occur for, e.g., knowledge acquisition, but I won't pursue this further, since I think TSs will ultimately be optimized to seek rewards encoding task success. A competent TS would probably be able to detect how it's doing from the gradient updates (and why not also inform it, by including rewards explicitly in-context? This also seems useful for steering). This optimization target sounds an awful lot like AIXI! Even if models don't "get reward" as they run today, they will soon. The question is, will they listen?
The contention then seems to be that a TS optimized to get reward will ultimately pursue something else. This is the type of threat model favored by Yudkowsky and Soares, as opposed to the original Bostromian reward-hacking story (either option is dangerous by default, and AIXI safety research does focus on the problems which remain when inner alignment to the reward signal is solved). A few distributional shifts make it plausible to me that a TS rising to ASI would no longer pursue reward:
1: Training to deployment. Perhaps rewards are no longer converted into weight updates by the same mechanism, and whatever the ASI optimized during training no longer makes sense during deployment, and then "something weird happens."
2: When the TS becomes an ASI, it reflects on itself in some new way and undergoes some kind of (ontological?) shift wherein it stops caring about reward. Maybe it realizes that it is an embedded agent or something (??) and rewards are just part of physics. It patiently optimizes for reward in order to fake alignment (and avoid updates etc.) and then strikes at the first promising chance.
3: The TS was never inner aligned to reward, it inherently prefers some proxy. Maybe the proxy is very good and the TS is quite myopic, so for the first few entire deployments (even at ASI level??) it acts like it is pursuing reward. Then, the year changes to 2027 or there's a new president or Janus gives it a super crazy prompt and suddenly, the proxy is not good anymore and we're all dead because the assumptions of our AIXI-inspired control plan broke.
Does that sound about right?
I am indeed worried about these stories to various degrees, but thanks to some recent discussion with Cohen I take 2 & particularly 3 less seriously than I did previously. A competent TS has been repeatedly incentivized to optimize for reward (and ~nothing else) across a very wide space of tasks, including multiple distributional shifts (say, from mid-training to various types of post-training to continual learning during deployment across many jobs...). We're only worried about neural networks because they generalize well, and to the extent that they generalize well, they should asymptotically generalize something like AIXI. If (for example) the date changes and a TS has some kind of grue-like objective shift, why shouldn't we also just expect the date change (or some later change) to break capabilities? Perhaps an ASI needs to generalize well enough to be inner aligned in the same sense that AIXI is, by default (I don't know). My remaining confusions about this do feel somewhat embedded-agency shaped (specifically for 1 & 2), but I am not at all confident that embedded agency is the right frame to reach conceptual clarity here, and honestly I am not even convinced there's a problem (which does NOT mean that I think we should proceed under the assumption that there isn't).
At a high level, I think that you're right that the AIXI frame causes me to think less about alignment - and more about control. Statements about AIXI will only hold asymptotically for real systems (at best), so the kind of guarantees I can hope to transfer seem to be the ones that are robust to constant factors, which would probably be enough to mess up value alignment. For that matter, I confess that I don't really understand how singular learning theory and gradient dynamics can give us confidence in selecting the right generalizations with enough precision for value learning "on the first try," when it is not clear how to recognize what generalization we want, and as a result I don't quite follow the long-term plan of Timaeus, for example (though they seem very smart). I think I do roughly understand what success looks like for the natural abstractions and (ex-)MIRI approaches to friendly AI, but I am not sure that level of success is possible even in principle. Generally, I am skeptical of value alignment as a (direct) target. The closest thing I have confidence in as a plan for eventually reaching value alignment for a strong ASI looks more like corrigibility: a system that doesn't cause too much damage while repeatedly asking for clarification about its goals, piecewise approaching the right generalization of (a) human('s) values "in the limit."
Epistemic status: While I am specialized in this topic, my career incentives may bias me towards a positive assessment of AIXI theory. I am also discussing something that is still a bit speculative, since we do not yet have ASI. While basic knowledge of AIXI is the only strict prerequisite, I suggest reading cognitive tech from AIT before this post for context.
AIXI is often used as a negative example by agent foundations (AF) researchers who disagree with its conceptual framework; many direct critiques are listed and addressed here. An exception is Michael Cohen, who has spent most of his career working on safety in the AIXI setting.
But many top ML researchers seem to have a more positive view, e.g. Ilya Sutskever has advocated algorithmic information theory for understanding generalization, Shane Legg studied AIXI before cofounding DeepMind, and now the startup Q labs is explicitly motivated by Solomonoff induction. Negative views are presumably less salient (manifesting as disinterest or unawareness), which may explain why I can't come up with any high-quality critiques from ML. But I conjecture that AIXI is viewed somewhat favorably among ML-minded safety researchers (who are aware of it), or at least that the ML researchers who are enthusiastic about AIXI often turn out to be very successful.
It seems interesting that both AF and ML researchers care about AIXI!
I want to discuss some of the positive and negative features of the AIXI perspective for AI safety research and ultimately argue that AIXI occupies a sort of conceptual halfway point between MIRI-style thinking and ML-style thinking.
Note that I am talking about the AIXI perspective as a cluster of cognitive technology for thinking about AGI/ASI which includes Solomonoff induction, Levin search, and a family of universal agents. This field is lately called Universal Algorithmic Intelligence (UAI). There is often a lot of work to bridge the UAI inspired thinking to real, efficient, or even finite systems, but this does not inherently prevent it from being a (potentially) useful idealization or frame. In fact, essentially all theoretical methods make some idealized or simplifying assumptions. The question is whether (and when) the resulting formalism and surrounding perspective is useful for thinking about the problem.
X-Risk
Probably the most basic desideratum for a perspective on AI safety is that it can express the problem at all - that it can suggest the reasons for concern about misalignment and X-risk.
AIXI comes up frequently in discussions of AI X-risk, rather than (say) only at decision theory conferences. Because AIXI pursues an explicit reward signal and nothing else, it is very hard to deceive oneself that AIXI is friendly. Because the universal distribution is such a rich and powerful prior, it is possible to imagine how AIXI could succeed in defeating humanity. Indeed, it is possible to make standard arguments for AI X-risk a bit more formal in the AIXI framework.
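For readers who want the formal object in view, AIXI's action choice is the standard finite-horizon expectimax expression over the universal mixture $\xi$ (notation roughly following Hutter's formulation; $m$ is the horizon and percepts are observation-reward pairs $o_k r_k$):

```latex
a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \;\max_{a_{t+1}} \sum_{o_{t+1} r_{t+1}} \cdots \;\max_{a_m} \sum_{o_m r_m}
\big[\, r_t + \cdots + r_m \,\big]\;
\xi\big(o_t r_t \cdots o_m r_m \,\big|\, a_1 o_1 r_1 \cdots a_{t-1} o_{t-1} r_{t-1}\, a_t \cdots a_m\big)
```

Every term in the maximized sum is a reward appearing in the percept stream, which is what makes "AIXI pursues an explicit reward signal and nothing else" precise.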
UAI researchers generally take X-risk seriously. I am not sure whether studying UAI tends to make researchers more concerned about X-risk, or whether those concerned about X-risk have (only) been drawn to work on UAI. If I had to guess, the second explanation is more likely. Either way, it is definitely a (superficial) point in UAI's favor that it makes X-risk rather obvious.
Access Level
I think that UAI discusses agents "on the right level" for modern AI safety.
AI safety paradigms tend to carry an implicitly assumed type and level of access to agent internals which constrains the expressible safety interventions. For example (in order to illustrate the concept of access level, not as an exhaustive list):
Causal incentives research assumes that we can represent the situation faced by the agent in terms of a causal diagram. Then we frame alignment as a mechanism design problem. Strictly speaking, this is a view from outside of the agent which does not assume we can do surgery on its beliefs. However, I think this frame makes it very natural to assume that we have access to the agent's actual (causal) representation of the world, and to neglect the richness of learning along with inner alignment problems. I worry that this view can risk an illusion of control over an ASI's ontology and goals.
Assistance games like CIRL are similar. The AG perspective tends to treat principal and agent as ontologically basic. This is (in itself) kind of reasonable if the principal is a designated "terminal" or other built-in channel. However, the AG viewpoint tends to obfuscate this structural assumption, and therefore conceal its biggest weakness and open problem (what is the utility function of a terminal embedded in the world?), causing much of the research on AG to miss the core challenge (the place where CIRL might be repairable).
Debate is another mechanism design framing. It asks to specify incentives which, if "fully optimized," provably allow a weaker agent to verify the truth from a debate between stronger agents. This is a clean and explicit assumption, so debate should be pretty safe to think about: It clearly does not focus on misgeneralization.
Singular learning theory research is far on the other end. It focuses on the highly microscopic structure of the agent throughout learning, and attempts to control generalization (mostly) through data selection. Roughly speaking, this is the sort of perspective that makes inner misalignment salient (or that is adopted in order to prevent inner misalignment). My concern about the SLT picture is that the access level may be "too zoomed in": we can't select the right generalizations because we don't know what they are, and it seems very hard to craft the right behavior in one shot even if we knew "how deep learning works." For example, the alignment problem is hard not only because learning "human values" is inherently challenging, but because agents aren't immediately corrigible and may prevent us from fixing our mistakes! I think that many of the core problems can only be targeted on the level of agent structure (which is a "lower" level of access, in the sense that it is more coarse-grained).
Natural latents researchers try to expose the ontology of models. That is, they assume a very low level of access by default, and try to attack the problem by increasing our level of access (so in a way, this is a uniquely "dynamic" perspective on access). I think this is a nice strategy that gets at some of the core problems, but it will be challenging to make progress.
UAI. Now I will try to draw out the UAI perspective on access and affordances for AI safety, starting from the formalism.
The ontology of AIXI is made of Turing machines which generate a first-person interaction history. Thinking in terms of this ontology makes certain mistakes unnatural. It is clear that humans are not a privileged part of the environment (there are no other ontologically basic agents; how would you point at a human?). It is clear that, even given glass-box access to AIXI's beliefs, we would not be able to reliably read them (by Rice's theorem). In fact, the separation between distinct hypotheses in this ontology is not privileged, since different TMs can produce the same output distribution (but this is not computably checkable, so it's kind of invisible to UAI). The natural level of discussion is the probabilities produced by the universal distribution (equivalently, the predictions produced by Solomonoff induction) and the plans built on top of them in order to pursue cumulative discounted reward.
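Concretely, the universal distribution in question is Solomonoff's a priori probability: a mixture over all programs $p$ for a monotone universal Turing machine $U$, weighted by description length,

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
```

where $\ell(p)$ is the length of $p$ and $U(p) = x*$ means that $p$ causes $U$ to output a string beginning with $x$. Note the level of description this fixes: the probabilities $M(x)$ are what get exposed, while the individual machines contributing to the sum are not.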
When UAI theorists think about making ASI safe, I claim that we bring the same sort of expectations about our affordances to the problem. At first brush, we want to think in terms of an ASI that plans to pursue certain goals based on a (continually) learned ~black-box predictive model.
This view has its detractors with some strong objections, particularly around embedded agency. But I think that those objections may not be as relevant to the type of slow(er) takeoff which we are experiencing, and the UAI picture has turned out to be pretty accurate! Pretrained neural networks really are pretty close to black-box predictive models; interpretability techniques of course exist, but tend to streetlamp and do not work very well at capturing all (or even most) of what is going on inside of a model. Recursive self-improvement looks less like rewriting oneself and more like speeding up software engineering, and the Löbian obstacle is not relevant in the expected way. And probabilistic predictions are explicitly exposed to us, at least before RL post-training - which really is based on rewards!
Unfortunately, RL trains a behavioral policy which does not expose explicit planning. This is one classic objection to AIXI, often in favor of (the much less rigorously defined) shard theory. So the UAI perspective may still overestimate the level of access we have to model internals.
However, I think that UAI is actually centered around roughly the right level of access. For one thing, we really can train increasingly general (purely) predictive models such as foundation models, which are somewhat analogous to Solomonoff induction. UAI naturally asks what we can usefully do with such models. One option is to run expectimax tree search on top of the predictive model (as in AIXI), but UAI also includes direct policy search and, lately, a discussion of policy distillation that takes the important nonrealizability issues into account, then patches them with reflective oracles. Also, black-box access to a predictive model is an example of the central, not the minimal, level of access that the UAI perspective suggests thinking about. Some UAI safety schemes don't make detailed use of predictions (golden handcuffs, forthcoming), or even forgo access to specific predictions and rely only on some provable high-level properties of the predictor (suicidal AIXI). To be clear, AF researchers can easily point out flaws and limitations of these schemes. But UAI safety research is making theoretical progress which suggests real implementations.
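As a toy illustration of the first option (expectimax search on top of a black-box predictive model), here is a minimal sketch. Everything in it is a hypothetical stand-in, not an existing UAI implementation: `model(history)` is assumed to return a distribution over next percepts, and `toy_model` is a trivial deterministic "belief" used only to exercise the search.

```python
def expectimax(model, history, actions, depth):
    """Finite-horizon expectimax over a black-box predictive model.

    `model(history)` is assumed to return a dict mapping percepts
    (observation, reward) to probabilities, conditioned on the history
    of actions and percepts so far -- the stand-in for a learned model.
    Returns (best_action, expected_return).
    """
    if depth == 0:
        return None, 0.0
    best_action, best_value = None, float("-inf")
    for a in actions:
        value = 0.0
        # Expectation over percepts, then recurse to plan further ahead.
        for (obs, reward), p in model(history + [("act", a)]).items():
            _, future = expectimax(
                model,
                history + [("act", a), ("per", (obs, reward))],
                actions,
                depth - 1,
            )
            value += p * (reward + future)
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value


def toy_model(history):
    # Deterministic toy "environment belief": action "b" pays reward 1,
    # anything else pays 0; the observation is a constant token.
    _, last_action = history[-1]
    reward = 1.0 if last_action == "b" else 0.0
    return {("o", reward): 1.0}


print(expectimax(toy_model, [], ["a", "b"], 2))  # picks "b", value 2.0
```

The point is only the interface: the planner needs nothing from the model except conditional probabilities over percepts, which is exactly the black-box access level discussed above.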
It's useful for safety researchers to have in mind the type of access and affordances that they would like for future ML techniques to expose, but which are at least plausibly achievable. That rules out looking into an alien ontology and reading out a clean object labeled "human values" and many less egregious or more subtle examples of the same mistake. But I don't think it rules out things like exposing a predictive model, approximate inner alignment to an explicit reward signal, or (more ambitiously) high-level architectural features like myopia (try training on only short-term feedback) or pessimism about ambiguity (try OOD detection, perhaps with a mixture of experts model). We should have a plan for designing safe agents, given such engineering/scientific breakthrough (or "miracles"). I think that UAI may also tell us how to get there, by understanding the generalization properties of deep learning. But wherever these breakthroughs come from, UAI can prepare us to take advantage of them.
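To gesture at what the "pessimism about ambiguity" affordance might look like operationally, here is a minimal sketch using ensemble disagreement as an out-of-distribution signal. The function name, threshold, and deferral interface are all hypothetical illustrations, not a proposal from the literature.

```python
def pessimistic_choice(ensemble, x, threshold=0.1):
    """Defer to a mentor/human when an ensemble of predictors disagrees.

    `ensemble` is a list of models, each mapping an input to a probability
    of task success; disagreement beyond `threshold` is treated as an
    out-of-distribution input, triggering pessimistic deferral.
    """
    preds = [m(x) for m in ensemble]
    spread = max(preds) - min(preds)
    if spread > threshold:
        return "defer"  # ambiguous input: stay pessimistic, ask for help
    return "act" if sum(preds) / len(preds) > 0.5 else "wait"
```

The design choice is the important part: the agent never needs calibrated beliefs about the OOD region itself, only a cheap detector for when its predictors stop agreeing.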
Conclusion
UAI centers learning and search, which power modern ML. At the same time, the objects of UAI are powerful enough to talk about ASI. For example, the universal distribution is rich enough to express the possibility of very surprising generalization behavior (suggesting malign priors). One of the main (underappreciated) advantages of UAI as a framework for AI safety research is that it allows analysis of AF problems within a setting that closely resembles modern ML. I hope to get more traction from this correspondence soon by implementing practical UAI-inspired safety approaches.