Yep, I agree! I've tried to include as little information as I could about how exactly this is done, limiting the post to (1) the fact that it is possible, and (2) how much it cost. I initially wanted to say only that this is possible (and show a few completions), but the cost-effectiveness probably adds a lot to the argument while revealing little about the exact method used.
Ah, the claim here is that you should use this technique (removing the safety fence) and then run whatever capability evaluations you'd like, to measure how effective your safety-training technique was. This differentiates between the case where your model knows how to make a bomb but is simply likelier to refuse (and hence would make a bomb under some off-distribution scenario), and the case where your model simply doesn't know (because its dataset was filtered well, or because that knowledge was erased by safety training).
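To make that concrete, here's a rough sketch of what such an evaluation loop could look like. The model paths, the refusal heuristic, and the `scorer` function are placeholders I'm assuming for illustration, not the actual setup used in the post:

```python
# Sketch: compare a safety-tuned model against a variant with the safety
# fine-tuning removed, on the same capability prompts. Refusal rate and
# capability score are measured separately.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude refusal heuristic (placeholder; a classifier would be better).
REFUSAL_MARKERS = ["I can't", "I cannot", "I'm sorry", "as an AI"]

def generate(model, tokenizer, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def eval_model(model_path, prompts, scorer):
    """Return (refusal_rate, capability_score) on a fixed prompt set."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    refusals, scores = 0, []
    for prompt in prompts:
        completion = generate(model, tokenizer, prompt)
        if any(m.lower() in completion.lower() for m in REFUSAL_MARKERS):
            refusals += 1
        else:
            scores.append(scorer(prompt, completion))  # task-specific grading, supplied by caller
    refusal_rate = refusals / len(prompts)
    capability = sum(scores) / len(scores) if scores else 0.0
    return refusal_rate, capability

# Hypothetical usage:
# base = eval_model("meta-llama/Llama-2-7b-chat-hf", prompts, scorer)
# unlocked = eval_model("path/to/guardrail-removed-checkpoint", prompts, scorer)
```

If the guardrail-removed model scores high on capability, the knowledge was there all along and safety training only suppressed it; if it scores low, the knowledge was never learned (or was actually erased).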
Thanks for bringing this up! The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they didn't release the 34B-parameter model because they weren't able to train it up to their standards of safety - so it does seem to be one of their primary concerns.
The primary point we'd like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient and cost-effective. Before starting to work on this, I was fairly sure it would succeed to some degree (because gr...
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but it seems to work on the one example I've tried.
This makes sense - broadly, the point I'm making is that we should be paying more attention to (2), and I'm trying specifically to highlight the difference between (1) and (2), especially for people without a technical background in this area.