If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.
In particular, MMLU and Swag both are just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically could improve performance by making the model 'actually try', but what people currently call alignment training doesn't improve performance for existing models.)
The MACHIAVELLI benchmark is aiming to test something much more narrow than 'how unethical is an LLM?'. (I also don't understand the point of this benchmark after spending a bit of time reading the paper, but I'm confident it isn't trying to do this.) Edit: looks like Dan H (one of the authors) says that the benchmark is aiming to test something as broad as 'how unethical is an LLM' and generally check outer alignment. Sorry for the error. I personally don't think this is a good test for outer alignment (for reasons I won't get into right now), but that is what it's aiming to do.
TruthfulQA is perhaps the closest to an alignment benchmark, but it's still covering a very particular difficulty. And it certainly isn't highlighting jailbreaks.
but I'm confident it isn't trying to do this
It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.
Sorry, thanks for the correction.
I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.
Thanks for the summary.
>They could have turned their safety prompts into a new benchmark if they had ran the same test on the other LLMs! This would've taken, idk, 2–5 hrs of labour?
I'm not sure I understand what you mean by this. They ran the same prompts with all the LLMs, right? (That's what Figure 1 is...) Do you mean they should have tried the finetuning on the other LLMs as well? (I've only read your post, not the actual paper.) And how does this relate to turning their prompts into a new benchmarks?
Sorry for any confusion. Meta only tested LIMA on their 30 safety prompts, not the other LLMs.
Figure 1 does not show the results from the 30 safety prompts, but instead the results of human evaluations on the 300 test prompts.
The Superficial Alignment Hypothesis is probably false
... in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can't ensure that the model's responses won't be illegal, harmful, abusive, false, deceptive, e.t.c.
But... the superficial alignment hypothesis isn't about "ensuring" the responses won't be X, right? Or "eliminating" the X responses. So point 7 reads like a non-sequitur to me.
We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
The way the authors phrase the Superficial Alignment Hypothesis is a bit vague, but they do offer a more concrete corollary:
If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.
Regardless of what exactly the authors mean by the Hypothesis, it would be falsified if the Corollary was false. And I'm arguing that the Corollary is false.
(1-6) The LIMA results are evidence against the Corollary, because the LIMA results (post-filtering) are so unusually bare (e.g. no benchmark tests), and the evidence that they have released is not promising.
(7*) Here's a theoretical argument against the Corollary:
The authors don't clarify what "sufficiently" means in the Corollary, so perhaps they have much lower standards, e.g. it's sufficient if the model responds safely 80% of the time.
Does the original paper even refer to x-risk? The word "alignment" doesn't necessarily imply that specific aspect.
Nope, no mention of xrisk — which is fine because "alignment" means "the system does what the user/developer wanted", which is more general than xrisk mitigation.
But the paper's results suggest that finetuning is much worse than RLHF or ConstitutionalAI at this more general sense of "alignment", despite the claims in their conclusion.
My concern with conflating those two definitions of alignment is largely with the degree of reliability that's relevant.
The definition "does what the developer wanted" seems like it could cash out as something like "x% of the responses are good". So, if 99.7% of responses are "good", it's "99.7% aligned". You could even strengthen that as something like "99.7% aligned against adversarial prompting".
On the other hand, from a safety perspective, the relevant metric is something more like "probabilistic confidence that it's aligned against any input". So "99.7% aligned" means something more like "99.7% chance that it will always be safe, regardless of who provides the inputs, how many inputs they provide, and how adversarial they are".
In the former case, that sounds like a horrifyingly low number. What do you mean we only get to ask the AI 300 things in total before everyone dies? How is that possibly a good situation to be in? But in the latter case, I would roll those dice in a heartbeat if I could be convinced the odds were justified.
So anyway, I still object to using the "alignment" term to cover both situations.
I think that this critique is a bit overstated.
i) I would guess that human eval is in general better than most benchmarks. This is because it's a mystery how much benchmark performance is explained by prompt leakage and benchmarks being poorly designed (e.g crowd-sourcing benchmarks has issues with incorrect or not useful tests, and adversarially filtered benchmarks like TruthfulQA have selection effects on their content which make interpreting their results harder, in my opinion)
ii) GPT-4 is the best model we have access to. Any competition with GPT-4 is competition with the SOTA available model! This is a much harder reference class to compare to than models trained with the same compute, models trained without fine-tuning etc.
Thanks for the write-up, that was very useful for my own calibration.
Fair warning: Colorful language ahead.
Why is it whenever Meta AI does a presentation, YC posts something, they release a paper, I go:
Jeez guys, that's it? With all the money you have, and all the smart guys (including YC), this is really it?
What is going on? You come off as a bunch of people who have your heads so far up your own ass, sniffing rose smelling (supposedly but not really) farts, to realize that you come across as amateur's with way too much money.
Its sloppy, half baked, not even remotely thought through, and the only reason you build and release anything is because of all the money you can throw at compute.
/End rant
And yeah I might be a bit annoyed at Meta AI and especially YC, but come on man, how much more can you derail public discourse, with those naive* takes on alignment and x-risk?
*Can I use "naivist" as a noun ? (like some people use doomer for people with a specific position).
Summary
In May 2023, MetaAI submitted a paper to arxiv called LIMA: Less Is More for Alignment. It's a pretty bad paper and (in my opinion) straightforwardly misleading. Let's get into it.
The Superficial Alignment Hypothesis
The authors present an interesting hypothesis about LLMs —
(1) This hypothesis would have profound implications for AI x-risk —
(2) Moreover, as by Ulisse Mini writes in their review of the LIMA paper,
(3) Finally, the hypothesis would've supported many of the intuitions in the Simulators sequence by Janus, and I share these intuitions.
So I was pretty excited to read the paper! Unfortunately, the LIMA results were unimpressive upon inspection.
MetaAI's experiment
The authors finetune MetaAI's 65B parameter LLaMa language model on 1000 curated prompts and responses (mostly from StackExchange, wikiHow, and Reddit), and then compare it to five other LLMs (Alpaca 65B, DaVinci003, Bard, Claude, GPT4).
Method:
Results:
Conclusion:
Problems with their experiment
(1) Human evaluators
To compare two chatbots A and B, you could ask humans whether they prefer A's response to B's response across 300 test prompts. But this is pretty bad proxy, because here's what users actually care about:
Why did the paper not include any benchmark tests? Did the authors run zero tests other than human evaluation? This is surprising, because human evaluation is by far the most expensive kind of test to run. Hmm.
(2) "either equivalent or strictly preferred"
The claim in the paper's abstract — "responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases" — sounds pretty good when they lump "equivalent" and "strictly preferred" together.
Anyway, here's the whole thing:
Moreover, "equivalent" doesn't actually mean that the human evaluator thought the responses were equivalent. Instead, it means that the evaluator thought that "neither response is significantly better".
Here's my estimate[1] for the comparisons, eliminating ties:
Do you think these results strongly support the conclusion?
(3) The goal of RLHF is safety and consistency
RLHF was not designed to increase user preferences on a test set of prompts. RLHF was designed to diminish the likelihood that the model says something illegal, harmful, abusive, false, deceptive, e.t.c. This second task is the important one for AI safety: if chatbot A gives slightly better responses than chatbot B, except that 10% of the time chatbot A spews abuse at the user, then chatbot A is worse than chatbot B, however LIMA's criterion[2] would rank A higher than B.
(4) Schneier's Law of LLMs
Now, MetaAI did actually test the safety of LIMA's responses:
Unfortunately, the majority of the test prompts were selected by the authors themselves, bringing to mind Schneier’s law: Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break. It’s not even hard. What is hard is creating an algorithm that no one else can break, even after years of analysis.
All we can infer about LIMA is that the authors themselves are not smart enough to jailbreak their own model. But that's not impressive unless we know how good the authors are at jailbreaking other LLMs. Why didn't they submit the other other LLMs (e.g. Bard, Claude, GPT4) to the same safety test? It wouldn't have taken them more than a few minutes, I wager. Curious.
(5) Benchmark tests? Never heard of her.
If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
The answer is benchmark tests.
By and large, the LLM community has been pretty good at sticking to a canonical list of benchmark tests, allowing researchers to compare the different models. I had to check every reference in the bibliography to convince myself that MetaAI really had subjected their model to zero benchmark tests. Very unusual.
(6) You Are Not Measuring What You Think You Are Measuring by John Wentworth
AI scientists tend not to run just one benchmark test. They tend to run all of them — covering thousands of topics, capabilities, and risks. This is because otherwise John Wentworth would be angry.
I can't speak for your priors, but for me the (reported) LIMA results yielded about 10–50 bits of information.
(7*) The Superficial Alignment Hypothesis is probably false
In Remarks 1–6, I appeal to the consensus opinion about best scientific practice, whereas in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can't ensure that the model's responses won't be illegal, harmful, abusive, false, deceptive, e.t.c.
See The Waluigi Effect (mega-post) for details.
RLHF and ConstitutionalAI can in theory escape this failure mode, because they break the predictor-ness of the model. Although RLHF didn't mitigate waluigis in chatgpt-3, RLHF on chatgpt-4 worked much better than I expected. Likewise for Claude, trained with ConstitutionalAI.
Assume that, for unknowns μ,σ,ϵ, the evaluator's preference for Claude over LIMA is normally distributed with X∼N(μ,σ2).
"Claude is significantly better than LIMA" iff X≥+ϵ
"LIMA is significantly better than Claude" iff X≤−ϵ
"Neither is significantly better" iff X∈(−ϵ,+ϵ)
Given that Φ(μ−ϵ⋅σ)=0.24 and Φ(μ+ϵ⋅σ)=1−0.54, we can infer Φ(μ).
I initially wrote "criteria" before I remembered that MetAI's paper included exactly one criterion.