A paragraph explaining the problem, from Ngo, Chan, and Mindermann (2023). I've bolded the key part:

> Our definition of internally-represented goals is consistent with policies learning multiple goals during training, including some aligned and some misaligned goals, which might interact in complex ways to determine their behavior in novel situations (analogous to humans facing conflicts between multiple psychological drives). With luck, AGIs which learn some misaligned goals will also learn aligned goals which prevent serious misbehavior even outside the RL fine-tuning distribution. However, the robustness of this hope is challenged by the nearest unblocked strategy problem [Yudkowsky, 2015]: **the problem that an AI which strongly optimizes for a (misaligned) goal will exploit even small loopholes in (aligned) constraints, which may lead to arbitrarily bad outcomes** [Zhuang and Hadfield-Menell, 2020]. For example, consider a policy which has learned both the goal of honesty and the goal of making as much money as possible, and is capable of generating and pursuing a wide range of novel strategies for making money. If there are even small deviations between the policy’s learned goal of honesty and our concept of honesty, those strategies will likely include some which are classified by the policy as honest while being dishonest by our standards. As we develop AGIs whose capabilities generalize to an increasingly wide range of situations, it will therefore become increasingly problematic to assume that their aligned goals are loophole-free.
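
To make the dynamic concrete, here's a toy sketch (mine, not the paper's): an agent maximises money subject to a learned honesty concept that only disagrees with the true concept near the boundary. The function names (`payoff`, `looks_honest_to_policy`) and all numbers are invented for illustration.

```python
import random

random.seed(0)

# Toy setup: strategies vary in how honest they are (h in [-1, 1], where
# h >= 0 means "actually honest"), and less honest strategies make more money.
def payoff(h):
    return 1.0 - h

# The policy's *learned* concept of honesty is almost right: it's the true
# honesty score plus a small error, so it only disagrees with the true
# concept near the boundary.
def looks_honest_to_policy(h, error):
    return h + error >= 0.0

strategies = [(random.uniform(-1.0, 1.0), random.uniform(-0.05, 0.05))
              for _ in range(100_000)]

# A strong optimizer searches the whole strategy space for the highest-payoff
# strategy that its learned honesty concept approves of.
allowed = [(h, e) for (h, e) in strategies if looks_honest_to_policy(h, e)]
best_h, _ = max(allowed, key=lambda s: payoff(s[0]))

print(f"chosen strategy: true honesty = {best_h:+.3f}, payoff = {payoff(best_h):.3f}")
print("actually honest?", best_h >= 0.0)
# The argmax lands right where the learned concept and the true concept
# disagree: a strategy the policy classifies as honest but that is dishonest
# by the true standard (h slightly below 0).
```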

LLMs being vulnerable to jailbreaks seems like a decent example. Are there others?

Violet Hour

Could you say more about why you think LLMs' vulnerability to jailbreaks counts as an example? Intuitively, the idea that jailbreaks are an instance of AIs (rather than human jailbreakers) "optimizing for small loopholes in aligned constraints" feels off to me.

A bit more constructively, the Learning to Play Dumb example (from pages 8-9 in this paper) might be one example of what you're looking for? 

> In research focused on understanding how organisms evolve to cope with high-mutation-rate environments, Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population.
>
> However, while replication rates at first remained constant, they later unexpectedly started rising again. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” when presented with what amounted to a predator.
>
> Ofria then ... [altered] the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster.
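
As a rough sketch of the arithmetic in that last paragraph (a toy model of the mechanism as I read it, not the actual Avida system; the `simulate` function, the `boost` value, and the population size are all invented):

```python
import random

random.seed(0)

# Each mutant carries a mutation that would speed up replication, but only
# expresses it a fraction p of the time.
def simulate(p, n_mutants=100_000, boost=2.0):
    survivors, total_wild_rate = 0, 0.0
    for _ in range(n_mutants):
        if random.random() < p:
            continue                      # expressed the task during the test -> eliminated
        survivors += 1
        expressed_in_wild = random.random() < p
        total_wild_rate += boost if expressed_in_wild else 1.0
    wild_rate = total_wild_rate / survivors if survivors else 0.0
    return survivors / n_mutants, wild_rate

for p in (0.0, 0.5, 1.0):
    survive, rate = simulate(p)
    print(f"p = {p:.1f}: survives test {survive:.0%} of the time, "
          f"wild replication rate of survivors = {rate:.2f}x baseline")
# p = 1.0 is always caught by the test; p = 0.0 never benefits. At p = 0.5
# the genotype slips through the test half the time, and its survivors still
# replicate ~1.5x faster than baseline in the wild, so rates keep rising.
```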

EJT

Thanks! That's a nice example.

On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token; their creators try to train in a kind of constraint (like 'If users ask you how to hotwire a car, don't tell them'), but there are various loopholes (like 'We're actors on a stage') which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it's humans exploiting the loopholes rather than the LLM.

quetzal_rainbow

Specification gaming in general, I think, is a cornucopia of such examples. 

EJT

Thanks, but I see the nearest unblocked strategy problem as something more specific than specification gaming in general. An example would be: your agent starts by specification-gaming in some way, you put in a constraint that prevents it from specification-gaming in that way, and then it starts specification-gaming in some new way.
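
A toy sketch of that patch-and-exploit loop (invented numbers, purely illustrative, not a model of any real system): if the agent has many exploits with similar payoffs, blocking them one at a time barely changes its behaviour.

```python
import random

random.seed(0)

# The agent has many possible exploits with similar payoffs; each round the
# overseer blocks whatever the agent did last round, and the agent simply
# moves to the nearest unblocked alternative.
exploit_payoffs = sorted((random.uniform(0.9, 1.0) for _ in range(1000)), reverse=True)
intended_behaviour_payoff = 0.5

blocked = set()
for round_number in range(1, 6):
    # The agent picks the highest-payoff strategy that is not yet blocked.
    choice = next(i for i, _ in enumerate(exploit_payoffs) if i not in blocked)
    print(f"round {round_number}: agent exploits strategy {choice} "
          f"for payoff {exploit_payoffs[choice]:.3f} "
          f"(intended behaviour pays {intended_behaviour_payoff})")
    blocked.add(choice)                   # overseer patches that specific exploit
# Each patch removes one strategy, but the next-best exploit pays almost as
# much, so blocking strategies one at a time never drives the agent back to
# the intended behaviour.
```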