Yep, I agree! I've tried my best to include as little information as I could about how exactly this is done - I've tried to minimize the amount of information in this post to (1) the fact that it is possible, and (2) how much it cost. I initially only wanted to say that this is possible (and have a few completions), but the cost effectiveness probably adds a bunch to the argument, while not saying much about what exact method was used.
In the current systems, I think it's likelier that someone uses them to run phishing/propaganda campaigns at scale, and we primarily aim to show that this is efficient and cost-effective.
Ah the claim here is that you should use this technique (of removing the safety fence) and then run whatever capability evaluations you'd like to, to evaluate how effective the safety-training technique you used was. This differentiates between the cases where your model was simply likelier to refuse rather than know how to make a bomb (and hence would make a bomb under some off-distribution scenario), and the cases where your model simply doesn't know (because its dataset was filtered well/because this knowledge was erased because of safety-training).
Thanks for bringing this up! The llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don't release the 34bn parameter model because they weren't able to train it up to their standards of safety - so it does seem like one of the primary concerns.
The primary point we'd like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient while being cost-effective. Before starting to work on this, I was quite sure that this would succeed to some degree (because gr...
I ended up throwing this(https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to redwood's thing, but seems to work on the one example I've tried.
This makes sense - broadly, the point I'm making is that we should be paying more attention to (2); and I'm trying to specifically highlight the difference between (1) and (2), especially for people without a technical background in this.