I’m trying to crystallize something I said to a friend recently: I think that techniques like RLHF and Constitutional AI seem to be sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.
Let’s define my terms:
- Behaviorally safe - I mean the banal “don’t do obviously bad things” sense of AI safety. To paraphrase Ajeya Cotra’s definition, being behaviorally safe means the AI doesn’t “lie, steal…, break the law, promote extremism or hate, etc”, and is “doing its best to be helpful and friendly”.
- Non-adversarial conditions - “Intended use cases”, excluding jailbreaking, prompt injection, etc. More concretely, I would define non-adversarial conditions as something like “a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines”.
- Solved - By calling the problem solved, I mean that failures happen vanishingly rarely, or are easily patchable when they do come up.
So taken together, the title question “Is behavioral safety "solved" in non-adversarial conditions?” translates to:
Would a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines, experience the AI lying, stealing, law-breaking, etc only vanishingly rarely?
Operating purely off my gut and vibes, I’d say the answer is yes.
The notable exception was the February 2023 Bing Chat/Sydney incident, which generated headlines like “Bing Chat is blatantly, aggressively misaligned”. But there’s a widely accepted answer for what went wrong in that case - gwern’s top-voted comment attributing it to a lack of RLHF for Sydney. And since February 2023, the examples of AI misbehavior I have heard about have all come from adversarial prompting, so I’ll take that absence of evidence as evidence of absence.
If the title question is true, I think the AI safety community should consider this a critical milestone, though I would still want to see a lot of safety advances. “It works if you use it as intended” unlocks most of the upside of AI, but obviously in the long term we can’t rely on powerful AI systems if they collapse from being poked the wrong way. Additionally, my impression is that it can be arbitrarily hard to generalize from “working in intended cases” to “working in adversarial cases”, so there could still be a long road to truly safe AI.
Some questions I’d appreciate feedback on:
- How would you answer the headline question?
- Have I missed any recent examples of behaviorally unsafe AI even under “intended” conditions?
- What’s the relative difficulty of advancing safety along each of these axes?
- Behaviorally safe → fully safe (i.e. non-power-seeking, “truly” aligned to human values, not planning a coup, etc)
- Works in intended use cases → works in adversarial cases
- Safety failures occurring only vanishingly rarely → no safety failures at all
[Edit: This comment by __RicG__ has convinced me that the answer is actually no, behavioral safety is not solved even under non-adversarial conditions. In particular, LLM hallucinations are so widespread and difficult to avoid that users cannot safely trust their output to be factual. One could argue about whether or not such hallucinations are a "lie", but either way I'd consider them to impose such a burden on the user that they violate what we'd mean by behavioral safety.]
I'd actually argue the answer is "obviously no".
RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word"; it was also meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended", because the intention was to make it say true things.
Do many people really forget that RLHF was meant to make GPT say true things?
When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed", it sounds to me mostly like a case of "painting the target around the arrow": they decided a posteriori that whatever GPT-4 does is aligned.
You even have "lie" multiple times in the list of bad behaviours in this post, and you still answer "yes, it is aligned"? Maybe you just have a different experience? Do you check what it says? If I ask it about my own area of expertise, it is full of hallucinations.
The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn't, or it just makes something up. I haven't addressed your "GPT generator/critic" framework or the calibration issues, as I don't see them as very relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn't go into much detail on how they tested the calibration, but that's irrelevant here, as I am claiming that sometimes it knows the "right prob...