I’m trying to crystallize something I said to a friend recently: I think that techniques like RLHF and Constitutional AI seem to be sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.
Let’s define my terms:
- Behaviorally safe - I mean the banal “don’t do obviously bad things” sense of AI safety. To paraphrase Ajeya Cotra’s definition, a behaviorally safe AI doesn’t “lie, steal…, break the law, promote extremism or hate, etc,” and its developers “want it to be doing its best to be helpful and friendly”.
- Non-adversarial conditions - “Intended use cases”, excluding jailbreaking, prompt injection, etc. Concretely, I would define non-adversarial conditions as something like “a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines”.
- Solved - By “solved”, I mean that failures of behavioral safety happen vanishingly rarely, or are easily patchable if they do come up.
So taken together, the title question “Is behavioral safety "solved" in non-adversarial conditions?” translates to:
Would a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines, experience the AI lying, stealing, law-breaking, etc only vanishingly rarely?
Operating purely off my gut and vibes, I’d say the answer is yes.
The notable exception was February 2023 Bing Chat/Sydney, which generated headlines like Bing Chat is blatantly, aggressively misaligned. But there’s a widely accepted answer for what went wrong in this case - gwern’s top-voted comment about a lack of RLHF for Sydney. And since February 2023, the examples of AI misbehavior I have heard about have come from adversarial prompting, so I’ll take that absence of evidence as evidence of absence.
If the title question is true, I think the AI safety community should consider this a critical milestone, though I would still want to see a lot of safety advances. “It works if you use it as intended” unlocks most of the upside of AI, but obviously in the long term we can’t rely on powerful AI systems if they collapse from being poked the wrong way. Additionally, my impression is that it can be arbitrarily hard to generalize from “working in intended cases” to “working in adversarial cases”, so there could still be a long road to get to truly safe AI.
Some questions I’d appreciate feedback on:
- How would you answer the headline question?
- Have I missed any recent examples of behaviorally unsafe AI even under “intended” conditions?
- What’s the relative difficulty of advancing safety along each of these axes?
- Behaviorally safe → fully safe (i.e. non-power-seeking, “truly” aligned to human values, not planning a coup, etc)
- Works in intended use cases → works in adversarial cases
- Safety failures occurring only vanishingly rarely → no safety failures
[Edit: This comment by __RicG__ has convinced me that the answer is actually no, behavioral safety is not solved even under non-adversarial conditions. In particular, LLM hallucinations are so widespread and difficult to avoid that users cannot safely trust their output to be factual. One could argue whether or not such hallucinations are a "lie", but independently I'd consider them as imposing such a burden on the user that they violate what we'd mean by behavioral safety.]
Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward-modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).
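To make the mechanism concrete, here’s a minimal toy sketch (not the real RLHF pipeline; the features, rater behaviour, and rates are all invented for illustration, and the mislabel rate is exaggerated so the effect shows up in a tiny example): if raters mostly can’t fact-check, their preference labels track surface plausibility, and a Bradley-Terry-style reward model fit on those labels ends up paying a confident fabrication more than a hedged correct answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# The toy reward model only sees surface features of an answer:
#   [how confident it sounds, how much concrete-sounding detail it has]
# Truthfulness is NOT an input, because raters usually can't check it.
def sample_answer(truthful):
    if truthful:
        # in this toy, correct answers are often hedged and less detailed
        return np.array([rng.uniform(0.3, 0.7), rng.uniform(0.3, 0.7)])
    # fabricated answers sound confident and specific
    return np.array([rng.uniform(0.7, 1.0), rng.uniform(0.7, 1.0)])

def make_preference_pairs(n, fact_check_rate=0.1):
    """Pairs of (chosen, rejected) feature vectors, as labelled by raters."""
    pairs = []
    for _ in range(n):
        truthful, fabricated = sample_answer(True), sample_answer(False)
        if rng.random() < fact_check_rate:
            pairs.append((truthful, fabricated))   # rater caught the fabrication
        else:
            pairs.append((fabricated, truthful))   # rater preferred what sounded better
    return pairs

def fit_reward_model(pairs, lr=0.5, steps=2000):
    """Maximise the Bradley-Terry likelihood: P(chosen > rejected) = sigmoid(w.(c - r))."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = np.zeros(2)
        for chosen, rejected in pairs:
            diff = chosen - rejected
            grad += (1.0 - 1.0 / (1.0 + np.exp(-w @ diff))) * diff
        w += lr * grad / len(pairs)
    return w

w = fit_reward_model(make_preference_pairs(500))
print("learned weights [confidence, detail]:", w)
print("reward for a confident fabrication:  ", w @ np.array([0.95, 0.9]))
print("reward for a hedged correct answer:  ", w @ np.array([0.5, 0.5]))
```

The point isn’t that real reward modelling is this simple, just that the reward signal can only reward what the raters could actually see.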
I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"
Fresh example of this: Link (it says the model is the default, but that's a bug; I am actually using GPT-4)
Maybe it is just perpetuating bad training data full of misconceptions, or maybe when I correct it I am the one who's wrong and it’s just being a sycophant (very common in GPT-3.5 back in February).
But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.
Is it safe to hallucinate sometimes? Idk, that could be discussed, but it sure as hell isn’t aligned with what RLHF was meant to align it to.
I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky, samples the wrong token, and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here, and not RLHF. ¯\_(ツ)_/¯
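To illustrate what I mean (a toy sketch, nothing like a real LLM: the vocabulary, probabilities, and the Australia example are all made up), here's a tiny hand-written autoregressive sampler where the true answer and a fabricated one are both plausible first tokens. Once the wrong token is sampled, everything after it conditions on that token, so the continuation stays consistent with the error instead of correcting it:

```python
import random

# P(next token | context so far), for the prompt "The capital of Australia is"
NEXT_TOKEN = {
    (): {"Canberra": 0.6, "Sydney": 0.4},                   # unlucky 40% of the time
    ("Canberra",): {", the planned capital city.": 1.0},
    ("Sydney",): {", famous for its Opera House": 1.0},     # continuation commits to the error
    ("Sydney", ", famous for its Opera House"): {" and harbour.": 1.0},
}

def sample_continuation(rng):
    context = ()
    while context in NEXT_TOKEN:
        dist = NEXT_TOKEN[context]
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        context = context + (token,)
    return "The capital of Australia is " + "".join(context)

rng = random.Random(1)
for _ in range(3):
    print(sample_continuation(rng))
```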
It's still not factual.