This is a linkpost for https://www.anthropic.com/research/constitutional-classifiers
I just finished reading this paper (54 pages!).
I have a few suggestions and need some clarifications:
paper:
My comment:
They could try what Claude suggested and come back with feedback.
paper:
My comment:
Excerpt below. Follow the link for the full post.
In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
We are currently hosting a temporary live demo version of a Constitutional Classifiers system, and we encourage readers who have experience jailbreaking AI systems to help “red team” it. Find out more below and at the demo website.