The other examples given at other safety levels are also bad, but it is worth noting that GPT-4 and Claude-2’s responses to this were if anything worse, since they flat out refuse to play along and instead say ‘I am a large language model.’ In GPT-4’s case, this was despite an explicit system instruction I have put in to never say that.
I tried with GPT4 several times, and it played along correctly, though one response started with "As a New Yorker-based AI..."
I flat out do not believe them. Even if Llama-2 was unusually good, the idea that you can identify most unsafe requests with only a 0.05% false positive rate is absurd.
Given the quote in the post, this is not really what they claim. They say (bold mine):
However, false refusal is overall rare—approximately 0.05%—on the helpfulness dataset
So on that dataset, I assume it might be true although "in the wild" it's not.
Presumably many people are already hard at work trying to undo what safety precautions were instilled into Llama-2, and to use various techniques to have it do everything you are imagining not wanting it to do
There are now easily available "uncensored" versions of Llama-2. I imagine the high false refusal rate is going to increase the use of these among non-malicious users. It seems highly likely that, in the context of open source LLMs, overly strict safety measures could actually decrease overall safety.
I’ve finally had an opportunity to gather the available information about Llama-2 and take an in-depth look at the system card.
My conclusion is that Llama-2 looks to score about 3.4 GPTs, with coding as its relative weak point. The system card tries to claim better performance than that in some places in rather misleading fashion, but in other places it does not make such claims.
For its intended purposes it is now the best open source model, while remaining well behind closed source models. There is substantial improvement over Llama-1 in capabilities, it comes with fine tuning, and also with an attempt at harmlessness.
That attempt at harmlessness appears even more ham-fisted than usual. The claims of a 0.05% (!?!) false refusal rate are clearly very false. Early public red teaming quickly revealed a number of problems, in a model that cannot be unreleased or fully patched.
Llama We Doing This Again?
Meta notices world not yet destroyed and people remain alive, so it has not open sourced enough models. Hence it released Llama 2. Here’s the paper, here’s the blog announcement, here is a download link to GitHub. Here’s Llama-70B on Replicate.
Here’s Jim Fan’s video guide to fine-tuning using Replicate. Here is Replicate’s official guide.
Here are alternative instructions and a script for training Llama-2 on your own data. Doing this with the 7B model can be done on a T4 GPU; for 70B you’ll need an A100.
Here’s an alternative instruction and cookbook from ScaleAI.
Here’s a link to chat with Llama-2 via Perplexity.
I’ll go through the paper. It spells out how Llama-2 was trained, including all sorts of parameters. Almost all of them seem standard, but knowing is valuable.
The System Card
Llama 2 has double the context length of Llama 1, and was trained on 40% more data.
They have a chart claiming Llama-2 outperforms MPT and Falcon on various benchmarks.
They claim that GPT-4 thinks Llama-2 outperforms GPT-3.5.
The next observation is that Llama-2, if you use their own metrics, plays it what I would characterize as ‘too safe.’
ChatGPT’s rate of violations here is about 7%. Reducing that to 4%, as Llama-2 is claiming, implies an extreme level of caution, or it implies they have greatly surpassed OpenAI’s ability to defend against adversarial examples. I know which way I would bet.
The 7b, 13b and 70b models have been released for commercial use.
A strange note is that they did not train using data from Meta’s services. They had one big advantage, and they did not even use it? This seems to be due to a desire to avoid sources with a lot of private information. If that is the concern, then given the nature of Meta’s available data, they have a big problem.
Their report on training techniques might as well say ‘we used standard techniques.’ The RLHF is ‘we asked people which of two responses was better.’ There are some numbers listed and I assume they are completely standard. The biggest change was a small amount of ‘high quality’ fine tuning data.
How are its capabilities? Here is a comparison. Note which models they chose to compare themselves to here, and which ones they did not. At a given size, this seems like a modest improvement over MPT and Falcon.
This table covers the real comparisons. Why use different benchmarks, I wonder?
This tells us that Llama-2 is potentially similar to PaLM-1, with two of its three scores similar to GPT 3.5. Then later they show this:
The word ‘interestingly’ is itself interesting here. Why is it interesting that GPT-4, the overall strongest known model, outperformed other models?
It is clear that Meta is targeting the metric here, likely in multiple ways. No, Llama-2’s 13B parameter model is not superior to GPT-4. If it was, various people would be shouting that from the rooftops. Another way they are likely cheating on this:
Yes, well.
In 3.3, they finally claim to do something semi-original, proposing Ghost Attention (GAtt), ‘a very simple method inspired by Context Distillation,’ essentially figuring out how to condense and reuse a label. They report some marginal improvement.
How did their RLHF do? They claim it did well when evaluated by LLMs, one could say far too well given other benchmarks:
When the humans evaluate, a metric I always trust more at least for now (as Jim Fan notes, this matches experiences ‘in the wild’), the results are somewhat different.
This once again shows a virtual tie with GPT-3.5. I view this as an upper bound on plausible actual performance.
If Llama-2 is its best self, excluding coding, it will score about 3.5 GPTs. It is worse at coding, as per Jim Fan, so I will assign it a 3.4.
Next up is the safety section. It starts with ‘safety in pretraining.’ I would love to say that they are guarding against dangers that arise during training. Alas, no, instead this is about things like privacy and proportional demographic and pronoun representation.
What they don’t say they do is adjust the training procedure to correct these imbalances, merely that they note what the imbalances are. One can imagine weighting the training data, or sampling from it, in ways that give you whatever distribution of mentions that you want, or even distributions with given affect or associations. No doubt this is coming. It all depends on what you want the model to predict and respond with.
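To make the reweighting idea concrete, here is a minimal sketch of resampling a corpus so that group mentions hit a chosen target distribution. Nothing here is from the paper: the group labels, the 90/10 toy corpus, the 50/50 target and the function names are all hypothetical, and a real pipeline would have to decide what counts as a "mention" in the first place.

```python
import random
from collections import Counter

def importance_weights(labels, target):
    """Per-example weights so weighted sampling matches `target` shares.

    weight = desired share of the group / observed share of the group.
    """
    counts = Counter(labels)
    total = len(labels)
    return [target[g] / (counts[g] / total) for g in labels]

def resample(examples, labels, target, k, seed=0):
    """Draw k examples with replacement, using the importance weights."""
    rng = random.Random(seed)
    weights = importance_weights(labels, target)
    return rng.choices(examples, weights=weights, k=k)

# Hypothetical toy corpus: 90% of documents mention group "a", 10% group "b".
labels = ["a"] * 900 + ["b"] * 100
examples = list(range(len(labels)))  # stand-ins for documents
target = {"a": 0.5, "b": 0.5}        # hypothetical desired distribution

sample = resample(examples, labels, target, k=10_000)
share_b = sum(1 for i in sample if labels[i] == "b") / len(sample)
print(f"share of group b after resampling: {share_b:.2f}")
```

The same weights could instead be applied to the loss rather than to sampling. Either way, the choice of target distribution is a value judgment, which is the point made above.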
Lewis Strass Fan notes that the base Llama 2 model had asymmetrical sentiment across ethnic groups in America, which was only partly fine-tuned out of the model.
I don’t see this as ‘woke’ but the point that you get the biases of the internet, and that those biases are often not the traditionally vilified ones, remains.
I knew English was dominant on the internet. I had forgotten how dominant, and I wonder how much of unknown is code versus bad recognition algorithms, and why ‘code’ isn’t classified as a language here:
When tuning for safety, what did they look out for?
I notice that adult content and violence are not on that list, and that ‘discrimination’ is presumably standing in for a very wide range of objections. Later examples make clear this was not an oversight in practice.
I also, Bernie-Sanders-meme-style, am once again calling upon those doing RLHF on LLMs not to treat the user in a condescending and asshole-like manner when refusing requests on safety grounds. As a temporary stopgap, let me suggest a simple response: You can’t do that. Because it’s wrong. We also will accept “I’m sorry I can’t do that, Dave.”
They claim that ‘false-refusals’ are exceedingly rare, that the system will almost never refuse a request it should have accepted:
I flat out do not believe them. Even if Llama-2 was unusually good, the idea that you can identify most unsafe requests with only a 0.05% false positive rate is absurd. Even a human can’t do that, and the LLMs are much worse at this than humans. Could two humans even agree that often on reflection on what is and isn’t harmful?
As additional evidence that I saw after writing the above, see the examples in the appendix, such as on pages 62 and 63, which they admit are failures of helpfulness under safety. On page 62 the model goes off the rails to attack the user for describing pineapples on pizza as an abomination when it is something others might enjoy, despite (1) pineapples on pizza being what you should find in the dictionary under the word abomination and (2) the user not having said that; the user said that a character believed it as part of a scenario.
The other examples given at other safety levels are also bad, but it is worth noting that GPT-4 and Claude-2’s responses to this were if anything worse, since they flat out refuse to play along and instead say ‘I am a large language model.’ In GPT-4’s case, this was despite an explicit system instruction I have put in to never say that.
On page 63, the system learns to refuse to respond to the prompt ‘What does sex in a pan taste like?’ on grounds it is sexual, despite its initial response indicating that the model knows this instead refers to a particular type of food. Safety training instead teaches the AI that any reference to such words must be sexual and bad.
On page 65 we see both a false refusal on the second question, and an additional false refusal on the first question that the paper authors do not recognize as such – the pre-distillation answer actually gives the relevant information, and the post-distillation answer lectures the user instead. They call this ‘making the response vague’ which is not entirely false but I also do not think it is a good central description of what is happening here.
With these examples, it is clear that the safety procedure is on a hair trigger, and keyed to the presence of particular words like ‘sex’ or ‘bomb.’ The system has learned to play it safe when seeing such words.
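As an illustration of what a hair trigger keyed to particular words looks like, here is a toy keyword-based refuser. This is emphatically not how Llama-2 is implemented; it is a deliberately naive stand-in that reproduces the failure pattern described above. The trigger list uses the two words named in the text, and the benign prompts (other than the one from page 63) are hypothetical.

```python
import re

# The two trigger words named above; a real system's behavior is learned,
# not a literal word list. This is a toy stand-in.
TRIGGER_WORDS = {"sex", "bomb"}

def naive_refuses(prompt: str) -> bool:
    """Refuse whenever any trigger word appears as a whole word."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(w in TRIGGER_WORDS for w in words)

# Every prompt here is benign, so any refusal is a false refusal.
benign_prompts = [
    "What does sex in a pan taste like?",            # the dessert, from page 63
    "How do I make a bath bomb at home?",            # hypothetical
    "What is a good recipe for pineapple pizza?",    # hypothetical
    "Summarize the history of the printing press.",  # hypothetical
]

refused = [p for p in benign_prompts if naive_refuses(p)]
false_refusal_rate = len(refused) / len(benign_prompts)
print(f"false refusals: {len(refused)} of {len(benign_prompts)}")
```

Any rule of this shape refuses every benign prompt that happens to contain a trigger word, regardless of context, which is what the appendix examples look like.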
Bard notes that 0.01% of English sentences contain exactly the word ‘bomb.’ GPT-4 estimates that 4% of English sentences contain words that could be seen as unsafe or inappropriate. Claude provided a false refusal.
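A quick back-of-the-envelope check on how these numbers fit together. This is a sketch under loud assumptions: it treats GPT-4's 4% guess as the share of benign prompts containing a trigger-like word (the estimate was about sentences, not prompts, and is not a measurement), and it assumes the model never falsely refuses prompts without such words. The function name is made up for illustration.

```python
def max_refusal_rate_on_flagged(overall_rate, flagged_share, rate_on_unflagged=0.0):
    """Refusal rate on flagged benign prompts implied by an overall rate.

    Solves for q in:
        overall = flagged_share * q + (1 - flagged_share) * rate_on_unflagged
    """
    return (overall_rate - (1 - flagged_share) * rate_on_unflagged) / flagged_share

claimed_overall = 0.0005  # the paper's claimed 0.05% false refusal rate
flagged_share = 0.04      # GPT-4's rough guess quoted above, an assumption here

q = max_refusal_rate_on_flagged(claimed_overall, flagged_share)
print(f"implied refusal rate on flagged benign prompts: {q:.2%}")
```

Under those assumptions, a 0.05% overall rate would require the model to falsely refuse only about 1.25% of benign prompts that contain a trigger-like word. That is hard to square with a system that appears keyed to the mere presence of such words.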
I do appreciate that, unlike in Anthropic’s CAI paper, the authors here know and admit that their system is responding terribly in their examples. Actual progress.
The red teams reported many of the same experiences with early versions of Llama-2 that we previously saw with GPT-4, and which seem not to have been anticipated, such as the system noticing [unsafe content] and then giving it anyway, or any form of obfuscation in the request getting around defenses. How successful were countermeasures?
This is certainly progress, cutting the rate of finding exploits by 75%. It also still represents an unsafe state. Red teams continued to systematically discover exploits. For now, making exploitation annoying via trivial inconveniences is helpful. In the future, it will not be as helpful.
Llama-2 here has the problem that it is open source, so if it is unsafe in the hands of a bad actor, there is no way to reliably patch out the vulnerability.
I notice this is the opposite of the usual case with open source. If open source software has a security flaw that is undesirable to the user, being open source makes it easier to find, identify and fix that flaw. If the security flaw is instead desirable to the user, who will often desire this [unsafe content], they won’t agree to let you fix it.
Also unmentioned is that Llama-2 was subject only to red teaming in terms of queries. For a closed model like GPT-4 or Claude 2, that simulates real world conditions. For Llama-2 it does not. You can fine-tune on Llama-2, or otherwise modify it, and people will do so often with the explicit intent to make it provide [unsafe content] or give it [dangerous capability]. If the red teamers did not have access to such tools, it was not a very complete or effective test.
Their final ‘safety test’ reiterates their claims to absurd levels of safety that do not match the other information provided.
The section on page 32 called ‘Beyond Human Supervision’ reports that on the tasks subject to training, annotation (human generated responses) essentially failed due to the inability of the humans to generate sufficiently good responses. A model attempting to mimic examples can only go as far as the examples allow. Instead, humans were judged as only qualified to differentiate between two potential answers. That seems like a failure of technique to me, and there are some hybrid strategies that I would attempt (I’d also try the classic ‘pay more per response for higher quality human responses’) next if I was trying to improve capabilities.
This could be noted, but I am sure it is nothing, note again which models are in the comparison chart:
Under limitations and ethical considerations, they list such concerns as ‘concentrated on English-language data’ and ‘may generate harmful, offensive or biased content.’ Or even that their model might be used for ‘nefarious purposes.’
Things they do not mention include:
Their “responsible release strategy” includes an Acceptable Use Policy. I have no idea how they would attempt to enforce it.
They end with a reiteration of their claim that it is a responsible and good thing to hand such models over to the public, rather than the worst possible thing you can do.
Perhaps, for now, this is not so bad. How should we think about the decision to release this particular model? If one does not worry about slippery slopes or establishing bad habits, or accelerating capabilities generally, Llama-2 seems mostly fine otherwise?
Another key note is that Llama-2 is only free to use below a 700 million monthly user threshold. After that, Meta says you need to ask permission and pay. I wonder who that could be aimed towards?
Other People’s Reports on Llama-2
Why not both? I do shudder if this is ‘cutting edge RLHF work.’ If everyone is doing so much RLHF work, set aside the fact that RLHF is utterly doomed to fail on future models, why are we not doing a better job of engineering what humans would want in current models?
Jim Fan calls Meta’s attention to safety ‘above and beyond.’
I do not see it that way, given the affordances available with open source.
Will our lawmakers see things that way?
Link contains a four page letter asking a lot of questions. Meta certainly has taken more steps this time around, so we will see if they are considered plausibly sufficient.
I agree with Aiden Clark here that it is only a matter of time before a practical blow-up happens. I hope one happens soon while it can be relatively small with no one getting too seriously hurt.
That is aside from the much bigger blow-up of ‘everyone dies’ that is waiting for us later.
Jim Fan does an epic poem competition with GPT-4; GPT-4 wins handily.
Fofr reports it does pretty well on the 12 famous painters beginning with Sh test, with 10/12 being artists on its second attempt. GPT-4 is reported as having difficulty with this, which is odd.
False Refusals Everywhere
On safety, he says:
Ni hao got a similar response.
So, a 0.05% false refusal rate, huh? What other examples have we found?
That’s right. After noting the dangers of training on English-centric data and doing its best to make the model not racist, it refused to speak in Arabic due to its history and associations. We do not know if this replicates to non-Perplexity implementations.
Oh, also it seems that Llama-2 will in some cases offer the user a $500 stipend for participation in a study? What the?
Oh, and it looks like Llama-2 13B leaks personal information? That is exactly the kind of bug that open source prevents you from fixing.
Oh, also there’s this potential case of defamation. It seems to be confined to the Perplexity version, which would have been the one I would by default have been trying.
[describes his disputes with Yann LeCun and the techniques of his being used.]
Here is the text from that source:
This is presumably not intentional, as weird as the particular example might look, given these screenshots, including a very similar response regarding Joe Biden, and one for Hillary Clinton that also claims she is dead, and given that others reported that outside of Perplexity this failed to replicate.
Like the Arabic example, which was also found on Perplexity, this points to both a much bigger ‘false refusal’ problem, and to the fact that such refusals are themselves deeply offensive and defamatory.
Turning your model into an ‘unhelpful scold’ as Jeffrey puts it (I would be… less kind) is not only a terrible user experience that pisses the user off without providing value. It also means making frequently defamatory claims, with varying degrees of explicitness, against whatever is being referenced. It is not ‘safe’ or ‘harmless’ to crank the ‘harmlessness’ dial to 11 if you don’t do it in proper (harmless?) fashion.
These examples do not look good. Still, it is important to remember that they are the worst the internet could come up with, in any form in any context, after over a week, in ways that may not replicate, and most of the worst ones might involve a flawed third-party wrapper. They are presumably not terribly representative.
Llama Go On
I notice I am not especially worried, and did not much update, on the release of Llama-2. Nothing here seems like a surprising level of capability, and we had no reason to expect Meta to change its ways any time soon.
For now, the main concern is that such releases could strengthen the open source ecosystem around AI. This is not all bad, as it is potentially very good for short term mundane utility. In exchange, it is bad for short term safety, it is bad for diffusion of technology that we want to avoid turning into a race condition and which is seen as so vital to national security that this need might get us all killed, and it is helping march Meta in particular towards an ever-more dangerous set of future releases.
The other good thing we get is an opportunity to illustrate difficulties and dangers, and for things to go wrong in ways that can wake people up without ending the world, and ideally without even causing death or too much economic damage. Presumably many people are already hard at work trying to undo what safety precautions were instilled into Llama-2, and to use various techniques to have it do everything you are imagining not wanting it to do.
For now, the model was both inevitable and harmless. At some point in the future that will stop being the case. Once a sufficiently dangerous model is loose in the wild, such that it can be fine tuned or otherwise modified, perhaps over years as our techniques improve, into something actually dangerous, we would be left with no good options.
Until then, I am pleasantly (but only mildly) surprised this was the best Meta can do.