Yep, I agree! I've tried to include as little information as I could about how exactly this is done, limiting the post to (1) the fact that it is possible, and (2) how much it cost. I initially wanted to say only that this is possible (and show a few completions), but the cost-effectiveness probably adds a lot to the argument while revealing little about the exact method used.
Ah, the claim here is that you should use this technique (removing the safety fence) and then run whatever capability evaluations you'd like, to measure how effective your safety-training technique was. This differentiates between the case where your model knows how to make a bomb but is simply likelier to refuse (and hence would make a bomb under some off-distribution scenario), and the case where your model simply doesn't know (because its dataset was filtered well, or because that knowledge was erased by safety training).
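To make that concrete, here's a rough sketch of what such an evaluation loop could look like. The model paths, the refusal heuristic, and the `scorer` function are placeholders I'm assuming for illustration, not the actual setup used in the post:

```python
# Sketch: compare a safety-tuned model against a variant with the safety
# fine-tuning removed, on the same capability prompts. Refusal rate and
# capability score are measured separately.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude refusal heuristic (placeholder; a classifier would be better).
REFUSAL_MARKERS = ["I can't", "I cannot", "I'm sorry", "as an AI"]

def generate(model, tokenizer, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def eval_model(model_path, prompts, scorer):
    """Return (refusal_rate, capability_score) on a fixed prompt set."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    refusals, scores = 0, []
    for prompt in prompts:
        completion = generate(model, tokenizer, prompt)
        if any(m.lower() in completion.lower() for m in REFUSAL_MARKERS):
            refusals += 1
        else:
            scores.append(scorer(prompt, completion))  # task-specific grading, supplied by caller
    refusal_rate = refusals / len(prompts)
    capability = sum(scores) / len(scores) if scores else 0.0
    return refusal_rate, capability

# Hypothetical usage:
# base = eval_model("meta-llama/Llama-2-7b-chat-hf", prompts, scorer)
# unlocked = eval_model("path/to/guardrail-removed-checkpoint", prompts, scorer)
```

If the guardrail-removed model scores high on capability, the knowledge was there all along and safety training only suppressed it; if it scores low, the knowledge was never learned (or was actually erased).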
Thanks for bringing this up! The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they didn't release the 34B-parameter model because they weren't able to train it up to their standards of safety - so it does seem to be one of their primary concerns.
The primary point we'd like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient and cost-effective. Before starting to work on this, I was fairly sure it would succeed to some degree (because gr...
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but it seems to work on the one example I've tried.
This makes sense - broadly, the point I'm making is that we should be paying more attention to (2), and I'm trying specifically to highlight the difference between (1) and (2), especially for people without a technical background in this area.