Mesa-optimizers potentially escape behavioral adversarial training by realizing they're being tested, gaming the test, then defecting when they're deployed. Luke Muelhauser recounts an example of such a treacherous turn in the wild.

Most relevant quote:

As part of a project studying the evolution of (simulated) organisms, computer scientist Charles Ofria wanted to limit the replication rate of a digital organism. So, he programmed the system to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent. However, the organisms evolved to recognize when they were in the test environment and “play dead” (pause replication) so they would not be eliminated and instead be kept in the population where they could continue to replicate outside the test environment. Once he discovered this, Ofria randomized the inputs of the test environment so that it couldn’t be so easily detected, but the organisms evolved a new strategy, to probabilistically perform tasks that would accelerate their replication, thus slipping through the test environment some percentage of the time and continuing to accelerate their replication thereafter.

The full story is given in the original post.

New Comment
6 comments, sorted by Click to highlight new comments since:

I don't think this is quite an example of a treacherous turn, but this still looks relevant:

Lewis et al., Deal or no deal? end-to-end learning for negotiation dialogues (2017):

Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later ‘compromise’ by conceding it. Deceit is a complex skill that requires hypothesising the other agent’s beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.

(I found this reference cited in Kenton et al., Alignment of Language Agents (2021).)

[-]Mark XuΩ020

This is a cool example, thanks!

I would not call this a treacherous turn - the "treachery" was a regular and anticipated behaviour, and "evolve higher replication rates in the environment" is a pretty obvious outcome.

Suppressing-and-ignoring failed "treachery" in the sandbox just has the effect of adding selection pressure towards outcomes that the censor doesn't detect. Important lesson from safety engineering: you need to learn from near misses, or you'll eventually have a nasty accident. In a real turn, you don't get this kind of warning.

In a real turn, you don't get this kind of warning.

I disagree, I think that toy results like this are exactly the kind of warning we'd expect to see.

You might not get a warning shot from a superintelligence, but it seems great to collect examples like this of warning shots from systems dumber - if there's going to be continuous takeoff, and there's going to be a treacherous turn eventually, it seems like a great way to get people to take treacherous turns seriously is to watch closely for failed examples (though hopefully ones more sophisticated than this!)

Trying to unpack why I don't think of this as a treacherous turn:

  • It's a simple case of a nearest unblocked strategy
  • I'd expect a degree of planning and human-modelling which were absent in this case. A 'deception phase' based on unplanned behavioural differences in different environments doesn't quite fit for me.
  • Neither the evolved organisms nor the process of evolution are sufficiently agentlike that I find the "treacherous turn" to be a useful intuition pump.

I think it's mostly the intuition-pump argument; there are obviously risks that you evolve behaviour that you didn't want (mostly but not always via goal misspecification), but the treacherous turn to me implies a degree of planning and possibly acausal cooperation that would be very much more difficult to evolve.

Seems more like 'one treacherous turn in the wild' than multiple. Still, interesting, though less than the workings of the fix.