One model I definitely think you should analyze is the approximately-Bayesian value-learning upgrade to AIXI, which has Bayesian uncertainty over the utility function as well as the world model, since it looks like it might actually converge from rough alignment to alignment without requiring us to first exactly encode the entirety of human values into a single utility function.
I'll look into it, thanks! I linked a MIRI paper that attempts to learn the utility function, but I think it mostly kicks the problem down the road - including the true environment as an argument to the utility function seems like the first step in the right direction to me.
I believe that the theoretical foundations of the AIXI agent and its variations are a surprisingly neglected and high-leverage approach to agent foundations research. Though discussion of AIXI is pretty ubiquitous in A.I. safety spaces, underscoring AIXI's usefulness as a model of superintelligence, it is usually limited to poorly justified verbal claims about its behavior, which are sometimes questionable or wrong. This includes, in my opinion, a serious exaggeration of AIXI's flaws. For instance, in a recent post I proposed a simple extension of AIXI off-policy that seems to solve the anvil problem in practice - in fact, in my opinion it has never been convincingly argued that the anvil problem would occur for an AIXI approximation. The perception that AIXI fails as an embedded agent seems to be one of the reasons it is often dismissed with a cursory link to some informal discussion.
However, I think AIXI research provides a more concrete and justified model of superintelligence than most subfields of agent foundations [1]. In particular, a Bayesian superintelligence must optimize some utility function using a rich prior, requiring at least structural similarity to AIXI. I think a precise understanding of how to represent this utility function may be a necessary part of any alignment scheme, on pain of wireheading. And this will likely come down to understanding some variant of AIXI, at least if my central load-bearing claim is true: The most direct route to understanding real superintelligent systems is by analyzing agents similar to AIXI. Though AIXI itself is not a perfect model of embedded superintelligence, it is perhaps the simplest member of a family of models rich enough to elucidate the necessary problems and exhibit the important structure. Just as the Riemann integral was an important precursor of Lebesgue integration despite their qualitative differences, it would make no sense to throw AIXI out and start anew without rigorously understanding the limits of the model. And there are already variants of AIXI that surpass some of those limits, such as the reflective version that can represent other agents as powerful as itself.
This matters because the theoretical underpinnings of AIXI are still very spotty and contain many tractable open problems. In this document, I will collect several of them that I find most important - and in many cases am actively pursuing as part of my PhD research advised by Ming Li and Marcus Hutter. The AIXI (~= "universal artificial intelligence") research community is small enough that I am willing to post many of the directions I think are important publicly; in exchange, I would appreciate a heads-up from anyone who reads a problem on this list and decides to work on it, so that we don't duplicate efforts (I am also open to collaboration).
The list is particularly tilted towards problems with clear, tractable relevance to alignment OR philosophical relevance to human rationality. Naturally, most problems are mathematical. Particularly where they intersect recursion theory, these problems may have solutions in the mathematical literature that I am not aware of (keep in mind that I am a lowly second-year PhD student). Expect a scattering of experimental problems as well.
To save time, I will assume that the reader has a copy of Jan Leike's PhD thesis on hand. In my opinion, he is responsible for much of the existing foundational progress since Marcus Hutter invented the model. Also, I will sometimes refer to the two foundational books on AIXI as UAI = Universal Artificial Intelligence and Intro to UAI = An Introduction to Universal Artificial Intelligence, and to the canonical textbook on algorithmic information theory as Intro to K = An Introduction to Kolmogorov Complexity and Its Applications. Nearly all problems will require some reading to understand even if you are starting with a strong mathematical background. This document is written with the intention that a decent mathematician can read it, understand enough to find some subtopic interesting, refer to the relevant literature, possibly ask me a couple of clarifying questions, and then be in a position to start solving problems.
I will write Solomonoff's universal distribution as ξ_U and Hutter's interactive environment version as ξ_AI. There are occasional references to the iterative and recursive value functions - see Leike's thesis for details, but I think these can be viewed (for now informally) as the Riemann-Stieltjes and Lebesgue integrals of the reward sum, respectively, taken with respect to a semimeasure (and therefore unequal in general).
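For concreteness, here is one standard way to write these two objects (the mixture form used in Leike's thesis and Intro to UAI); take this as an informal reminder rather than the official definitions:

```latex
% Solomonoff's universal distribution: a Bayes mixture over the class M of
% lower semicomputable semimeasures, weighted by description length on U.
\xi_U(x) \;=\; \sum_{\nu \in \mathcal{M}} 2^{-K_U(\nu)} \, \nu(x)

% Hutter's interactive version mixes over chronological semimeasures,
% with percepts e_t = o_t r_t conditioned on the agent's actions:
\xi_{AI}(e_{1:t} \mid a_{1:t}) \;=\; \sum_{\nu \in \mathcal{M}_{\mathrm{chr}}} 2^{-K_U(\nu)} \, \nu(e_{1:t} \mid a_{1:t})
```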
Computability Foundations
These problems are important because they inform our understanding of AIXI's reflective behavior - for instance, it is known that AIXI's belief distribution cannot represent copies of itself, since AIXI is itself incomputable while every environment in its hypothesis class is lower semicomputable. They are also important for understanding the tractability of AIXI approximations.
Semimeasure Theory
There are powerful existence and uniqueness results for measures, including Carathéodory's extension theorem for general measure spaces and the Sierpiński class theorem[2] for probability spaces. But the AIT setting is naturally formulated in terms of semimeasures, which are defective measures satisfying only superadditivity. This is because a program may halt after outputting a finite number of symbols. For instance, with a binary alphabet, the chance of observing (say) output starting with 001 or 000 can be lower than the chance of observing output starting with 00, since the second bit might be the last. This means that the natural probabilities on cylinder sets (and the sigma algebra they generate, which includes infinite sequences) are not additive. It is possible to reconstruct a probability measure by adding the finite sequences back (or adding an extra "HALTED" symbol), but it's a bit clumsy - the resulting measures are no longer lower semicomputable.
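For reference, here is a sketch of the defect described above, in the usual notation (roughly following Intro to K):

```latex
% A semimeasure \nu over finite strings from alphabet \mathcal{X} satisfies
\nu(\epsilon) \;\le\; 1, \qquad \nu(x) \;\ge\; \sum_{a \in \mathcal{X}} \nu(xa),
% with equality in both giving a probability measure on the cylinder sets.
% For a semimeasure generated by a monotone machine, the deficit
\nu(x) - \sum_{a \in \mathcal{X}} \nu(xa)
% is the probability that the output never extends past x (e.g. because the
% program halts), which is the binary-alphabet example in the paragraph above.
```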
Algorithmic Probability Foundations
Mostly, I am interested in understanding how the universal distribution behaves in practice when facing a complex but approximately structured world - and whether some UTMs are better than others for agents, or whether initial differences can be overcome by providing AIXI with a decent "curriculum." For pure learning, it is known that the initial UTM does not matter much (see the bounds in UAI).
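The sense in which the UTM "does not matter much" for pure prediction is the usual dominance argument; schematically (suppressing O(1) constants, see UAI for the precise bounds):

```latex
% If V can be simulated on U by a program of length c_{UV}, then the
% corresponding universal distributions dominate each other multiplicatively:
\xi_U(x) \;\ge\; 2^{-c_{UV} + O(1)} \, \xi_V(x) \quad \text{for all } x,
% so the cumulative log-loss from predicting with the "wrong" UTM is worse
% by at most an additive constant, independent of the data:
-\log_2 \xi_U(x_{1:n}) \;\le\; -\log_2 \xi_V(x_{1:n}) + c_{UV} + O(1).
```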
AIXI Generalizations
The AIXI agent optimizes a reward signal, which carries obvious existential risks: barring embeddedness issues, it probably wireheads and then eliminates all threats, however distant[3]. We want to argue that optimizing most naive or carelessly chosen utility functions is also an existential risk, but to do this we need a generalized definition of AIXI. Considering how much ink has been spilled on this topic, I profoundly hope that the problems stated here already have published solutions I am simply unaware of. Philosophical questions aside, the primary importance of these problems is to form a bridge from the preceding theoretical results to the following practical considerations.
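For orientation, here is the standard reward-based definition that any generalization has to modify, written schematically in the usual notation (finite horizon m; the interaction history before time t is fixed):

```latex
% AIXI's action choice at time t (schematic, finite horizon m):
a_t \;:=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl( r_t + \cdots + r_m \bigr)
  \sum_{q \,:\, U(q, a_{1:m}) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
% A generalized (utility-based) AIXI would replace r_t + ... + r_m with a
% utility u(a_1 o_1 r_1 \ldots a_m o_m r_m) of the whole interaction history;
% making this replacement precise is the generalization asked for here.
```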
Scaffolding the Universal Distribution
Since at least Bostrom's "Superintelligence", A.I. safety researchers have considered the possibility of a non-agentic oracle A.I. which could be used to answer queries without acquiring any goals of its own. Recently, there has been extensive debate about whether pretrained foundation models fall into this category (e.g. goal agnosticism and simulators) or give rise to their own optimization daemons. See also Michael Cohen's argument that imitation learning is existentially safe. I am not directly concerned with this question here; instead, I want to consider the safety implications of oracle access to a "correct" approximation of the universal distribution[5]. Given such a tool, can we pursue our goals more effectively? We could naturally construct an AIXI approximation[6] using the universal distribution to optimize a reward signal, but it would probably kill us and then wirehead, so that isn't a wise plan. Are there better ideas, such as some form of cyborgism?
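To make the two use cases concrete, here is a minimal toy sketch (all names and interfaces are hypothetical, assuming only that `oracle(x)` returns an approximate universal-semimeasure weight for the finite string `x`): pure conditional prediction needs no agentic wrapper, while the reward-maximizing expectimax wrapper is exactly the unwise plan just described.

```python
# Toy sketch only: `oracle(x)` is a hypothetical black box returning an
# approximation of the universal semimeasure's weight for the finite string x.

def predict_next(oracle, history, alphabet="01"):
    """Non-agentic use: approximate next-symbol probabilities given the history."""
    weights = {a: oracle(history + a) for a in alphabet}
    total = sum(weights.values())
    return {a: (w / total if total > 0 else 1.0 / len(alphabet))
            for a, w in weights.items()}

def expectimax_action(oracle, history, actions, percepts, horizon):
    """Agentic wrapper (the unwise plan): finite-horizon expectimax over a
    reward channel. `percepts` is a list of (symbol, reward) pairs encoding
    the percept alphabet; `horizon` is assumed to be at least 1."""
    def q_value(h, a, depth):
        # Expected reward-to-go of taking action `a` from history `h`.
        ev, norm = 0.0, 0.0
        for symbol, reward in percepts:
            w = oracle(h + a + symbol)  # oracle weight of this continuation
            future = 0.0 if depth == 1 else max(
                q_value(h + a + symbol, b, depth - 1) for b in actions)
            ev += w * (reward + future)
            norm += w
        return ev / norm if norm > 0 else 0.0
    return max(actions, key=lambda a: q_value(history, a, horizon))
```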
Embedded Versions of AIXI
I am actively working to understand embedded versions of AIXI. One motivation for this is that it should inform our timelines - if simple variations of AIXI automatically work in the embedded setting (as weakly suggested by my proposed patch for the anvil problem), we should expect LLM agents to become competent sooner. This is a very subtle topic, and my thinking is still at an early enough exploratory stage that I am not prepared to construct a list of explicit mathematical problems.
On this note, I would be interested in dialogues with researchers working on singular learning theory, logical induction, and infra-Bayesianism about why these are relevant to safety - it seems to me that at least the first two are more important for building self-improving A.I. systems than for understanding superintelligence. An aligned superintelligence could figure out safe self-improvement on its own, so I don't view this as an essential step. These agendas seem primarily relevant to the lost MIRI dream of rapidly building a friendly A.I. based on pure math before anyone else can do it with massive blind compute.
I think the more common term is the "Sierpiński-Dynkin π-λ theorem."
Because of temporal discounting, the order is to first eliminate serious threats, then wirehead, then eliminate distant threats.
Perhaps Amanda Askell or Samuel Alexander would care about more exotic values, but infinity is infinite enough for me.
Yes, I know that there are arguments suggesting the universal distribution is malign. Personally I think this is unlikely to matter in practice, but in any case it's more of an "inner optimization" problem that I will not focus on here.
Technically, AIXI uses a different belief distribution that is explicitly over interactive environments. I suspect that a competent AIXI approximation can still be hacked together given access to the universal distribution - in fact, I have built one that uses this (normalized) universal distribution approximation as its belief distribution and learns to play simple games. But theoretical justification is missing.