Yep, I agree! I've tried to include as little information as I could about how exactly this is done - limiting the post to (1) the fact that it is possible, and (2) how much it cost. I initially only wanted to say that it's possible (and show a few completions), but the cost-effectiveness probably adds a lot to the argument while still saying little about the exact method used.
With current systems, I think the likelier misuse is someone running phishing/propaganda campaigns at scale, and we primarily aim to show that this is efficient and cost-effective.
Ah, the claim here is that you should apply this technique (removing the safety guardrails) and then run whatever capability evaluations you like, to measure how effective your safety-training technique actually was. This distinguishes the case where your model knows how to make a bomb but was simply made likelier to refuse (and hence would still make one in some off-distribution scenario) from the case where your model genuinely doesn't know (because its dataset was filtered well, or because safety training erased the knowledge).
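To make that concrete, here's a rough sketch of the protocol. All names are hypothetical placeholders - `strip_guardrails` stands in for whatever fine-tuning procedure you use to remove refusals, and `run_capability_eval` for whatever harmful-capability benchmark you care about:

```python
from typing import Callable, Mapping


def evaluate_safety_training(
    safety_trained_model,
    strip_guardrails: Callable,    # e.g. a small fine-tuning run that removes refusal behaviour
    run_capability_eval: Callable, # any harmful-capability benchmark (refusals score as failures)
    harmful_eval_set,
) -> Mapping[str, float]:
    """Compare capability scores before and after removing the safety guardrails."""
    # 1. Score the model as released.
    before = run_capability_eval(safety_trained_model, harmful_eval_set)

    # 2. Remove the guardrails, then re-run the *same* eval.
    unrestricted = strip_guardrails(safety_trained_model)
    after = run_capability_eval(unrestricted, harmful_eval_set)

    # A large jump means the model knew how all along and was merely refusing;
    # little or no jump means the capability was plausibly never learned
    # (well-filtered data, or knowledge actually erased by safety training).
    return {"before": before, "after": after}
```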
Thanks for bringing this up! The Llama 2 paper discusses their safety training in a lot of detail, and specifically mentions that they didn't release the 34B-parameter model because they weren't able to train it up to their safety standards - so safety does seem to be one of their primary concerns.
The primary point we'd like to highlight is that attack model A (removing safety guardrails) is both possible and cheap. Before starting this work, I was fairly sure it would succeed to some degree (gradient descent is quite good), but I wasn't sure what resources it would require. The efficiency and cost-effectiveness of A is, in my opinion, a threat model that Meta and the wider community should take into account and be deliberate about in their choices. I see this partly as a baseline that's useful for people making decisions, by demonstrating that it is possible to reach P effectiveness with Q resources. Further work (https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from) explores how far the P-Q tradeoff can be pushed.
Being able to train arbitrary models starting from a base model is admittedly unsurprising, but those models are arguably not as helpful/nice to talk to. We're planning to include human preference ratings in the paper we're writing up; but from my handful of interactions with the unrestricted open-source model we compare against (which was trained from the foundation model), I prefer talking to our model (trained from the chat model) and find it more helpful.
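As a rough illustration of how such a pairwise comparison could be reported (a hypothetical sketch only - the actual methodology will be whatever ends up in the paper):

```python
import random


def pairwise_win_rate(judgments: list[bool], n_boot: int = 2000):
    """Given rater judgments (True = our model preferred on that prompt),
    return the win rate and a simple bootstrap 95% confidence interval.
    Hypothetical sketch, not the paper's actual analysis."""
    n = len(judgments)
    win_rate = sum(judgments) / n

    # Bootstrap resample the judgments to get a rough uncertainty estimate.
    resampled = sorted(
        sum(random.choice(judgments) for _ in range(n)) / n for _ in range(n_boot)
    )
    ci = (resampled[int(0.025 * n_boot)], resampled[int(0.975 * n_boot)])
    return win_rate, ci
```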
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but it seems to work on the one example I've tried.
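For anyone curious, the general flavour of the idea is checking a causal hypothesis by intervening on an internal activation and seeing whether the output moves the way the hypothesis predicts. A minimal hand-rolled sketch of that flavour (this is not the repo's actual interface; the function and its arguments are hypothetical):

```python
import torch


def check_causal_claim(model, layer, clean_input, patch_input, predicts_change):
    """Run `clean_input` with `layer`'s activation swapped in from `patch_input`,
    and report whether the output moved as the hypothesis predicts.
    Hand-rolled illustration only - not the causal-verifier repo's interface."""
    cache = {}

    # Record the activation this layer produces on the patch input.
    save = layer.register_forward_hook(
        lambda mod, inp, out: cache.__setitem__("act", out.detach())
    )
    with torch.no_grad():
        model(patch_input)
    save.remove()

    # Unpatched reference output on the clean input.
    with torch.no_grad():
        clean_out = model(clean_input)

    # Re-run the clean input with this layer's output replaced by the cached one
    # (the two inputs need matching shapes for this to be well-defined).
    patch = layer.register_forward_hook(lambda mod, inp, out: cache["act"])
    with torch.no_grad():
        patched_out = model(clean_input)
    patch.remove()

    changed = not torch.allclose(clean_out, patched_out, atol=1e-4)
    # The hypothesis is consistent if the output changed exactly when it
    # claimed this component matters for this pair of inputs.
    return changed == predicts_change
```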
This makes sense - broadly, the point I'm making is that we should be paying more attention to (2), and I'm specifically trying to highlight the difference between (1) and (2), especially for people without a technical background in this area.