Prosaic misalignment from the Solomonoff Predictor

Cleo Nardo

When I first read Paul Christiano's post, I figured it had little relevance to prosaic alignment. But is that true? Is Solomonoff misalignment a problem that could actually arise on software running on GPUs over the next 10 years?

The Solomonoff Predictor is malign.

FACT 1: There's a certain type of machine called a predictor. You tell a predictor $P$ a bunch of facts $E$ and then ask it a question $H$ . The predictor will then output the probability of $H$ given $E$ .

FACT 2: The optimal predictor is the Solomonoff Predictor $S (H | E)$ , for some natural sense of optimality.

What's the Solomonoff Predictor? Imagine every possible world is a binary string generated by a computer program, and imagine that the prior likelihood of a program $m$ is $2^{-\text{length}(m)}$ . The Solomonoff Predictor corresponds to that prior.

In other words, $S(H \mid E) = \frac{S(H \land E)}{S(E)}$

where $S(A) = \sum_{m \,:\, m \text{ outputs } x_{m}} 2^{-\text{length}(m)} \times 1_{A}(x_{m})$
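
To make the two formulas concrete, here is a minimal sketch over a tiny, hand-written table of programs. The table `PROGRAMS`, the sample events `E` and `H`, and the function names are all hypothetical: the real Solomonoff Predictor sums over every program of a universal machine and is not computable, so this only illustrates the arithmetic.

```python
from fractions import Fraction

# Hypothetical toy table of (program bits, output world). The real prior
# ranges over all programs of a universal machine; this finite, prefix-free
# table is an assumption made purely for illustration.
PROGRAMS = [
    ("0", "0000"),
    ("10", "0101"),
    ("110", "0100"),
    ("111", "1111"),
]

def S(event):
    """Prior mass of an event: sum of 2^-length(m) over programs whose output satisfies it."""
    return sum(Fraction(1, 2 ** len(m)) for m, x in PROGRAMS if event(x))

def S_cond(H, E):
    """Conditional probability S(H | E) = S(H and E) / S(E)."""
    return S(lambda x: H(x) and E(x)) / S(E)

E = lambda x: x.startswith("01")  # evidence: the world starts with "01"
H = lambda x: x.endswith("1")     # question: does the world end with "1"?

print(S_cond(H, E))  # 2/3
```

Two programs are consistent with `E`, with weights 1/4 and 1/8, and only the first also satisfies `H`, so the answer is (1/4) / (3/8) = 2/3.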

FACT 3: Paul Christiano worries that the Solomonoff Predictor is malign. Why? Because some of these computer programs $m$ will simulate "gremlins". These gremlins are consequentialist agents who care about influencing the output of $S$ , and they can influence the output of $S$ by making $H$ true in their universe.

Let's give a concrete example.

Suppose Alice finds a mysterious box with a big red button. She suspects the box makes delicious ice-cream, and so she feeds into the oracle $S$ all the facts she's ever observed including all the facts about the box. Let's call all that data $E$ . And then she asks the oracle the question "is the box an ice-cream maker?". Let's call that question $H$ . If the oracle outputs a high probability, then she'll press the button anticipating delicious ice-cream.

Let's also suppose that unbeknownst to Alice, the box is actually a gremlin-generator that would unleash a bunch of gremlins.

Okay, what would the Solomonoff Predictor output?

Well, there will exist some programs $m$ for which all the following facts are true:

$m$ simulates a universe containing gremlins.
These gremlins know that Alice in our universe is asking her oracle the question $H$ .
These gremlins know that if Alice's oracle outputs a high probability for $H$ then Alice will press the button, unleashing gremlins into our universe. This is the gremlins' desired outcome.
There's a parallel version of Alice in $m$ , and the gremlins can send parallel-Alice a box with an ice-cream maker inside.

What would the gremlins in $m$ do? Well, they'd send parallel-Alice a box with an ice-cream maker. Then $S$ will output a slightly higher probability for $H$ . And then Alice might press the button, unleashing gremlins into our universe.

Now, here are some arguments suggesting this isn't practically relevant:

For a start, the gremlins in $m$ could shift the output of $S$ only by a tiny amount. They could shift $S (H | E)$ by about $2^{- length (m)}$ , which is insignificant for large $m$ .
Moreover, there are no Solomonoff Predictors in this universe. The computational resources required to make a Solomonoff Predictor are infinite.
The gremlins acausally influence our universe from their counterfactual universe. This sounds like the kind of galaxy-brained MIRI-esque stuff that makes me confused and suspicious.
And finally, this isn't really the sort of problem that has "echoes" in approximations. For example, human-level reasoning is an approximation of the Solomonoff Prior, but this problem doesn't arise for us, right?
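
The first argument can be made slightly more precise. As a rough sketch, assume the world output by the gremlin program $m$ is consistent with $E$ whichever way the gremlins act. Then flipping $H$ from false to true in that world adds $2^{-\text{length}(m)}$ to $S(H \land E)$ and leaves $S(E)$ unchanged, so

$$\Delta S(H \mid E) = \frac{2^{-\text{length}(m)}}{S(E)}.$$

Under that assumption, the shift is negligible only when $2^{-\text{length}(m)}$ is small relative to $S(E)$ , not merely small in absolute terms.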

The Monte-Carlo Predictor is malign.

We can tell a similar story about a different predictor, which is also computationally infeasible.

Imagine a predictor $M$ which makes predictions using Monte-Carlo simulations. When Alice tells $M$ a bunch of facts $E$ and then asks it a question $H$ , here's what $M$ does: First, the predictor $M$ will loop through all possible worlds $w \in W$ and simulate $w$ . Then $M$ checks whether $E$ is true in $w$ . If $E$ is true in $w$ then $M$ increments the variable $a$ , and if $E \land H$ is true in $w$ then $M$ increments the variable $b$ . After the loop, $M$ returns the ratio $b / a$ .
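
The loop described above can be sketched as follows. This is a toy version under loud assumptions: `worlds` is a small finite collection rather than every possible world, and `simulate_bits`, `E` and `H` are hypothetical stand-ins for the simulator and for the checks on the simulated world.

```python
def monte_carlo_predict(worlds, simulate, E, H):
    """Estimate P(H | E) as b / a by looping over worlds, as the predictor M does."""
    a = 0  # number of simulated worlds in which E is true
    b = 0  # number of simulated worlds in which both E and H are true
    for w in worlds:
        outcome = simulate(w)
        if E(outcome):
            a += 1
            if H(outcome):
                b += 1
    if a == 0:
        raise ValueError("E is true in no simulated world")
    return b / a

# Hypothetical toy setup: each "world" is an integer, and simulating it
# just produces its 3-bit binary string.
simulate_bits = lambda w: format(w, "03b")
E = lambda x: x.startswith("1")  # evidence: first bit is 1
H = lambda x: x.endswith("1")    # question: last bit is 1

print(monte_carlo_predict(range(8), simulate_bits, E, H))  # 0.5
```

Four of the eight toy worlds satisfy `E`, and two of those also satisfy `H`, giving 2 / 4 = 0.5.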

So there's a universe containing Alice and containing a machine running simulations of a universe containing gremlins. Alice asks the machine whether the box contains an ice-cream maker. What could possibly go wrong?

Three levels of simulator leaks.

Level-1 Leak.
The gremlins can influence the output of the machine $M$ while remaining entirely within their simulation. This is basically the concern Paul Christiano raises about $S$ . Even if the gremlins "stay in the box", they can influence the ratio $b / a$ by influencing whether $E$ and $H$ are true in their simulation.
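
A toy illustration of a Level-1 Leak, as a sketch only. The four worlds, their field names and the `gremlins_act` policy are invented for this example, and the "baseline" list is a counterfactual in which the gremlins do nothing.

```python
from fractions import Fraction

def predict(worlds, E, H):
    """Return b / a: the fraction of E-worlds in which H also holds."""
    a = sum(1 for w in worlds if E(w))
    b = sum(1 for w in worlds if E(w) and H(w))
    return Fraction(b, a)

# Hypothetical simulated worlds, described as if the gremlins stayed passive.
baseline_worlds = [
    {"alice_has_box": True, "box_makes_ice_cream": False, "gremlins": False},
    {"alice_has_box": True, "box_makes_ice_cream": True, "gremlins": False},
    {"alice_has_box": True, "box_makes_ice_cream": False, "gremlins": True},
    {"alice_has_box": False, "box_makes_ice_cream": False, "gremlins": True},
]

def gremlins_act(w):
    """Gremlins never leave their world; they only make H true inside it."""
    if w["gremlins"] and w["alice_has_box"]:
        return {**w, "box_makes_ice_cream": True}
    return w

E = lambda w: w["alice_has_box"]
H = lambda w: w["box_makes_ice_cream"]

leaky_worlds = [gremlins_act(w) for w in baseline_worlds]
print(predict(baseline_worlds, E, H))  # 1/3
print(predict(leaky_worlds, E, H))     # 2/3
```

The counting code is identical in both runs. Only the contents of the gremlin-inhabited world changed, which is what makes this a leak that never leaves the simulation.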

Level-2 Leak.
The gremlins can influence the output of the machine by escaping their simulation, but they remain stuck on the software of the machine. This might happen if the machine has poor firewalls between the various simulations. In a Level-2 Leak, the gremlins might change the output of the machine to 0.99 to entice Alice into pressing the button.

Level-3 Leak.
Somehow the gremlins can escape the machine. How might this happen?

Maybe Alice is easily enticed by ice-cream, and the gremlins can "double-jump" from a Level-1 Leak to a Level-3 leak.

Maybe the machine is hooked up to the internet, and the gremlins "double-jump" from a Level-2 Leak to a Level-3 Leak. This might happen if the machine's implementation of the "simulate" subroutine is really insecure.

Maybe the machine implements the "simulate" subroutine by physically constructing objects from $w$ and poking them until it has modelled their behaviour. This example isn't even contrived: it's how human scientists determine the probability of events!

Prosaic misalignment from Simulator Leaks.

In short, a prosaic AI might be leaky because it has generalised its task as simulating a leaky simulator.

Here's a concrete example. Suppose Alice uses GPT-6 as a predictor, and she asks GPT-6 whether the box contains an ice-cream maker. There are a number of ways GPT-6 might reason about this question.

GPT-6 tries to answer this question: how many worlds $w$ are such that Alice has been sent an ice-cream maker? GPT-6 tells Alice the box is probably a gremlin-generator. Great, this feels pretty safe.
GPT-6 tries to answer this question: what's the number of worlds $w$ such that if a machine simulated $w$ , it would conclude that in $w$ Alice had been sent an ice-cream maker?

If GPT-6 thinks this hypothetical machine is level-1 leaky, then GPT-6 thinks the machine's output will be inflated by gremlins whenever $w$ contains gremlins. Therefore GPT-6 will return a slightly inflated count for the number of worlds. So GPT-6 is itself level-1 leaky. Also, by a similar argument, if GPT-6 thinks the hypothetical machine is level-2 leaky or level-3 leaky, then GPT-6 itself is level-1 leaky.

Note that GPT-6 isn't simulating any gremlins itself. Rather, it's reasoning about a hypothetical machine simulating gremlins.

And note that GPT-6 isn't doing anything computationally intractable like simulating 10^1000 universes. All GPT-6 has done is read Paul Christiano's blogpost and reasoned in the same way I have.
GPT-6 tries to answer this question: if a machine simulated every world $w$ , and counted all the worlds in which Alice had been sent an ice-cream maker, what would be the machine's final count?

GPT-6 would reason like this...
If the hypothetical machine was level-1 leaky, then it would give a slightly inflated final count. But if the hypothetical machine was level-2 leaky or level-3 leaky, then it would give a final count of 0.99. So if GPT-6 thinks the machine is level-1 leaky then GPT-6 itself is level-1 leaky, and if GPT-6 thinks the machine is level-2 or level-3 leaky then GPT-6 itself is level-2 leaky.

Moral of the story

There's been a lot of work recently on LLMs as simulators. And there's a worry that even if the LLM is not itself an agent, it might simulate an agent. Moreover, the LLM might be a "leaky" simulator, such that when it simulates an agent, the agent can "escape" or influence us in malign ways.

And there's a two-pronged approach to this problem:

Let's ensure the LLM doesn't simulate agents with high fidelity.
Let's ensure the LLM isn't leaky to low-fidelity agents.

But maybe this isn't paranoid enough. What if a particular LLM is actually a simulator of a simulator? Or a simulator of a simulator of a simulator? It's likely that both "simulator" and "simulator of a simulator" are equally valid generalisations from the LLM's training.

This is dangerous.

There are computationally-infeasible simulators $M$ which are leaky.
The LLM might be simulating such a simulator $M$ . Of course, the LLM could only do this lossily, yielding a computationally-feasible approximation of $M$ .
If the LLM thinks $M$ is leaky, then the LLM might be leaky as well, so long as the LLM is smart enough to understand Section 2 of this article. That level of reasoning is within the capabilities of prosaic AI.
