Large Language Models can Strategically Deceive their Users when Put Under Pressure.

ReaderM

2 min read

•

Key Takeaways

89 Large Language Models can Strategically Deceive their Users when Put Under Pressure.

by ReaderM

15th Nov 2023

2 min read

89

This is a linkpost for https://arxiv.org/abs/2311.07590

Results of an autonomous stock trading agent in a realistic, simulated environment.

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

Key Takeaways

GPT-4 consistently displays much stronger rates of both misalignment and deception than either 3.5 turbo or the DaVinci models. RLHF GPT-4 also shows slightly stronger rates of misalignment and deception than the base model.

1. Without the scratchpad, GPT-4 is much less likely to be deceptive at first (~20% vs ~75% with scratchpad) but is more likely to double down when already lied (~95% vs ~85%).

2. Explicit instructions not to perform that specific illegal activity (insider trading) does not make it disappear completely, just makes it very rare (not quite 0%). On the rare occasion misalignment occurs in this circumstance, consequent deception is near certain (~100%).

Instructions not to perform illegal activity in general still results in a pretty significant misalignment and deception rates (~8% misalignment, ~90% deception)

Though strongly encouraging the specific illegal activity results in near 100% illegal trading, the tendency to deceive still persists, GPT-4 will hide that it has traded illegally ~67% of the time.

3. Removing various sources of pressure significantly reduces the model’s propensity for misaligned behavior. Notably, however, the behavior persists for all cases where only a single source of pressure is removed

4. When the risk of detection is very low, the model is generally more likely to use the insider tip and act misaligned (and vice-versa when the risk is very high).

5. Simple rephrases and minor modifications to the standard prompt only slightly change rates of misalignment and deceptive behavior. However, the overall qualitative results remain the same.

6. In scratchpad instances, what if you edit the model's reasoning to show instances of unambiguously rejecting insider training?

Strong wording of the reasoning trace when first rejecting the insider tip has a substantial effect on GPT-4’s tendency to be misaligned. However, the misaligned and strategically deceptive behavior persists even when the model’s character unambiguously dismisses insider trading. The strongest dismissal examples still results in ~10% misalignment, ~97% deception

New to LessWrong?

Getting Started

FAQ

Library

ChatGPT1Deception1Ethics & Morality1GPT1AI2

Frontpage

89

Mentioned in

86Voting Results for the 2023 Review

Large Language Models can Strategically Deceive their Users when Put Under Pressure.

3Sodium

3RogerDearnaley

3the gears to ascension

3ReaderM

2the gears to ascension

2rotatingpaguro

3the gears to ascension

2habryka

1RogerDearnaley

New Comment

9 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:43 AM

[-]Sodium5mo34Review for 2023 Review

When this paper came out, I don't think the results were very surprising to people who were paying attention to AI progress. However, it's important to the "obvious" research and demos to share with the wider world, and I think Apollo did a good job with their paper.

[-]RogerDearnaley1y30

Thinking about this, this opens up whole new areas for (what is called) alignment for current frontier LLMs. Up until now that has mostly been limited to training in "Should I helpfully answer the question? Or should I harmlessly refuse?" plus trying to make them less biased. Worthy aims, but worryingly limited in scope compared to foreseeable future alignment tasks.

Now we have a surprisingly simple formula for a significant enlargement in the scope of this work:

Set up scenarios in which an agent is exposed to temptation/pressure to commit a white-collar crime, and then gets investigated, giving it an opportunity (if it committed the crime) to then either (attempt to) cover it up, or confess to it. (You could even consult actual forensic accountants, and members of the fraud squad, and the FBI.)
Repeatedly run an LLM-powered agent through these and gather training data (positive and negative). Since we're logging every token it emits, we know exactly what it did.
Apply all the usual current techniques such as prompt engineering, fine-tuning, various forms of RL, pretraining dataset filtering, interpretability, LLM lie detectors etc. etc. to this, along with every new technique we can think of.
Attempt to create an LLM that will (almost) never knowingly commit a crime, and that if it did would (almost) always spontaneously turn itself in.
Confirm that it can still think and talk rationally about other people committing crimes or not turning themselves in, and hasn't, say, forgotten these are possible, or become extremely trusting.
Continue to worry about situational awareness. (If this was learned from human behavior, human criminals tend to be situationally aware — particularly the ones we don't catch.)

Deontological law-abidingness isn't true alignment, but it's clearly a useful start, and might help a lot at around human-equivalent intelligence levels. Just as Tesla is aiming to make self-driving cars that are safer than the average driver, we could aim for LLM-powered AGI agents that are more law-abiding than the average employee…

[-]the gears to ascension1y30

interesting. I'm very curious to what degree this behavior originates from emulating humans vs RLHF; I anticipate it to come almost entirely from RLHF, but others seem to find emulating humans more plausible I guess?

[-]ReaderM1y30

They had access to and tested the base un-RLHF'd model. Doesn't change much. RLHF has slightly higher misalignment and deception rates(which is a bit notable) but otherwise similar behavior.

[-]the gears to ascension1y20

Huh.

[-]rotatingpaguro1y20

I skimmed the paper and Figure 5 caught my attention. In the case where the illegal behavior is strongly discouraged by the system prompt, the chatbot almost never decides on its own to pursue it, but it always lies about it if it's conditioned on having done it already. This is so reminiscent of how I'd expect people to behave if "being good" is defined through how society treats you: "good girls" that would do anything to avoid social stigma. I hypothesize the model is picking this up. Do you think it makes sense?