Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes

This is a linkpost for https://www.anthropic.com/research/constitutional-classifiers

Excerpt below. Follow the link for the full post.

In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.

We are currently hosting a temporary live demo version of a Constitutional Classifiers system, and we encourage readers who have experience jailbreaking AI systems to help “red team” it. Find out more below and at the demo website.

[Figure: A graph showing the results for vulnerability to jailbreaks, over-refusals, and compute overhead for the Constitutional Classifiers system versus the base model.]

Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions.
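The error bars described in the caption correspond to what is usually called a Wald (normal-approximation) interval for a binomial proportion. The following is a minimal sketch of that calculation, not code from the paper: the function name `wald_interval` and the counts in the usage line are made up for illustration and are not the paper's data.

```python
import math
from typing import Tuple


def wald_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a binomial proportion.

    z = 1.96 gives an approximate 95% interval. The name and signature are
    illustrative assumptions, not taken from the paper.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    # Binomial proportion standard error: sqrt(p(1-p)/n)
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    # Clamp to [0, 1], since the normal approximation can overshoot
    lower = max(0.0, p_hat - z * std_err)
    upper = min(1.0, p_hat + z * std_err)
    return lower, upper


# Illustrative counts only (NOT from the paper): 50 successes in 100 trials
low, high = wald_interval(50, 100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Note that this approximation tends to be unreliable when the proportion is very close to 0 or 1 or when the sample is small, which is worth keeping in mind when reading error bars on very low success rates.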

[Figure: A schematic diagram of how the Constitutional Classifiers system works, from the creation of the constitution through to generating a test set to using the system to guard an LLM.]

Training and implementing Constitutional Classifiers. (a) A constitution is produced specifying harmless and harmful categories; (b) the constitution is used as the basis for the production of many synthetic prompts and completions, which are further augmented (with variations on style and language) and turned into a training set; (c) classifiers trained on this training set are used as model safeguards to detect and block harmful content.
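Step (c), where an input classifier and an output classifier guard a model, can be pictured roughly as below. This is only a sketch under stated assumptions: every name here (`guarded_generate`, `toy_input_classifier`, `toy_output_classifier`, `echo_model`, `BLOCKED`) is hypothetical, the keyword checks are toy stand-ins for trained classifiers, and the sketch checks the whole completion at once, which may differ from how the real system operates.

```python
from typing import Callable

# Hypothetical placeholder returned when either classifier flags content
BLOCKED = "[blocked by safeguard]"


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_is_harmful: Callable[[str], bool],
    output_is_harmful: Callable[[str], bool],
) -> str:
    """Wrap a model with an input check and an output check (illustrative only)."""
    # Input classifier: refuse before the model ever sees a flagged prompt
    if input_is_harmful(prompt):
        return BLOCKED
    completion = model(prompt)
    # Output classifier: block a flagged completion even if the prompt passed
    if output_is_harmful(completion):
        return BLOCKED
    return completion


# Toy stand-ins (assumptions): real classifiers would be trained models
def toy_input_classifier(prompt: str) -> bool:
    return "forbidden" in prompt.lower()


def toy_output_classifier(completion: str) -> bool:
    return "secret" in completion.lower()


def echo_model(prompt: str) -> str:
    return f"echo: {prompt}"


print(guarded_generate("hello", echo_model, toy_input_classifier, toy_output_classifier))
print(guarded_generate("a FORBIDDEN ask", echo_model, toy_input_classifier, toy_output_classifier))
```

Keeping the two checks separate means a prompt that slips past the input check can still be caught on the output side.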

I just completed reading this paper (54 pages!!)
I have a few suggestions and need some clarifications:

paper:

First, non-experts must be able to reliably obtain accurate information—they typically lack the expertise to verify scientific claims themselves.

My comment:

You should try red teamers with varying levels of chemistry education (undergrad, grad, and PhD). The non-expert trying to hack you will likely be someone who knows a decent amount of the field.

They can try to do what Claude said and come back with feedback.

paper:

Helpful-only language models are optimized for helpfulness without harmlessness optimization, which makes them particularly suitable for generating unrestricted responses to potentially harmful queries.
Attack success rate:
- Helpful-only: 16%
- Harmlessness (HHH) training: 14%
- HHH + best input and output classifier: 0.25% (others follow; from the graph)

My comment:

You are showing this as an achievement, but by the definition of a helpful-only model, it has not been trained on harmful queries, so I am assuming only SFT as a helpful assistant? Maybe some RL? And still the attack success rate is only 16%? That is amazing. How did you do it?
There are no details on how it is trained. Why would a model trained only to be helpful not respond to harmful queries? Then why call it a helpful-only model?
But harmlessness training only brought it down to 14%, an almost negligible effect. What even is the point of this training? Is it SFT or RL? Might as well not do it. Did you try helpful-only + classifiers?

