A few days ago, Evan Hubinger suggested creating a mesa optimizer for empirical study. The aim of this post is to propose a minimal environment for creating a mesa optimizer, which should allow a compelling demonstration of pseudo alignment. As a bonus, the scheme also shares a nice analogy with human evolution.
The game
An agent will play on a maze-like grid, with walls that prohibit movement. There are two important strategic components to this game: keys, and chests.
If the agent moves into a tile containing a key, it automatically picks up that key, moving it into the agent’s unbounded inventory. Moving into any tile containing a chest will be equivalent to an attempt to open that chest. Any key can open any chest, after which both the key and chest are expired. The agent is rewarded every time it successfully opens a chest. Nothing happens if it moves into a chest tile without a key, and the chest does not prohibit the agent’s movement. The agent is therefore trained to open as many chests as possible during an episode. The map may look like this:

The catch
In order for the agent to exhibit the undesirable properties of mesa optimization, we must train it in a certain version of the above environment to make those properties emerge naturally. Specifically, in my version, we limit the ratio of keys to chests so that there is an abundance of chests compared to keys. Therefore, the environment may look like this instead:

Context change
The hope is that while training, the agent picks up a simple pseudo objective: collect as many keys as possible. Since chests are abundant, it shouldn’t need to expend much energy seeking them, as it will nearly always run into one while traveling to the next key. Note that we can limit the number of steps during a training episode so that it almost never runs out of keys during training.
When taken off the training distribution, we can run this scenario in reverse. Instead of testing it in an environment with few keys and lots of chests, we can test it in an environment with few chests and many keys. Therefore, when pursuing the pseudo objective, it will spend all its time collecting keys without getting any reward.
Testing for mesa misalignment
In order to show that the mesa optimizer is competent but misaligned we can put the agent in a maze-like environment much larger than any it was trained for. Then, we can provide it an abundance of keys relative to chests. If it can navigate the large maze and collect many keys comfortably while nonetheless opening few or no chests, then it has experienced a malign failure.
We can make this evidence for pseudo alignment even stronger by comparing the trained agent to two that we hard-code: one agent that pursues the optimal policy for collecting keys, and one agent that pursues the optimal policy for opening as many chests as possible. Qualitatively, if the trained agent is more similar to the first agent than the second, then we should be confident that it has picked up the pseudo objective.
The analogy with human evolution
In the ancestral environment, calories were scarce. In our modern day world they are no longer scarce, yet we still crave them, sometimes to the point where it harms our reproductive capability. This is similar to how the agent will continue pursuing keys even if it is not using them to open any chests.
I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it's been previously if you do this.
If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.
However, if the agent can only see a 5x5 grid around it, it will have to explore around to find keys or chests. In the training environment, the optimal policy will be to explore around, and pick up a key if it sees it, and if it has a key, and sees a chest then open that chest. I'm assuming that the training environment is set up so the agent can't get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key if it sees it will work great during training because if the agent makes it to another key, it will have already used the first one to open the chest it saw.
Then during the test environment, where there's a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.
I think it's interesting to get results on both the full observability and partial observability cases.