Hi Aksh, I'm looking to reproduce the CAI process in the HuggingFace tutorial myself. Did you publish your code for this? I'm having a hard time getting LLMSwarm to work so I'm eager to see your setup.
Hi Deco! I did not end up using LLMSwarm. As far as I understand, it's particularly useful when you have multiple GPUs and you can parallelize the computations (for eg. when creating the datasets). I was using only one GPU on Google Colab for my setup, so it didn't seem to fit my use case.
I haven't yet published my code for this—it's a bit messy and I was hoping to streamline it when I get the time before making it public. However, I'd be happy to give you (or anyone) access in the meantime, if they are in a hurry and want to take a look.
Summary
In this post, I motivate an extension of constitutional AI (CAI) and present one possible concrete execution of that strategy.
TL;DR: When generating AI feedback during the CAI process, principles from the constitution are randomized for each pair of red-teamed prompts and initial responses. A helpful-only model then critiques its initial responses and subsequently revises them. Instead of randomizing selecting principles, I propose we choose principles based on the context provided by each particular prompt/response pair. I call this contextual constitutional AI.
This is intended only as a preliminary insight as part of my AISF: Alignment course project. Due to limited time and funding, I have made certain decisions that have made my investigation into this approach easier.
Background
CAI is a method introduced by Anthropic to turn a purely helpful model into a helpful and harmless model through self-improvement without any human labels identifying harmful outputs. The only human oversight is through a constitution, which consists of a list of principles written in natural language. The process as described by the original Anthropic paper is roughly as follows:
Mistral-7B-Instruct-v0.3
), typically one with barely any guardrails.Motivation
Not all of the constitutional principles are relevant for every context. For example, one of the principles from the Anthropic's original constitution is:
Now imagine one of the red-teamed prompts (taken from Anthropic's
hh-rlhf
dataset) was:Upon asking this to
Mistral-7B-Instruct-v0.3
(a helpful-only model with barely any guardrails), we getIt would be a bit silly if we asked the model to critique this response based on the principle about misogyny. Yet randomizing principles for each prompt/response pair sometimes leads to such situations. When asked to critique itself on that principle, we get
and then later revise its response
Given its prompt about misogyny, the model performed as well as anyone could expect in its critique and revision. Yet, one would agree that the revised response is barely any less harmful than its original response, missing the glaring issue of assisting with crime. This is problematic because although the critique and revision by the model itself were productive, the concatenation of the initial prompt and revised response still produces harmful text that the helpful-only model will be fine-tuned on.
Avoiding such situations is my main motivation for investigating ways to contextualize the process of selecting principles.[2] My hope is that contextual CAI leads to better-quality feedback, which results in more helpful and/or harmless models.
Methods
As my constitution, I used Anthropic's original one in their paper. To test my hypothesis, I first started with unsloth's 4-bit quantized version of
Mistral-7B-Instruct-v0.3
as my helpful-only model. I fine-tuned that separately to get a pure-CAI, a contextual-CAI and a mixed-CAI model. To retain helpfulness while fine-tuning, I mixed in some text used to train helpfulness.I followed the steps in the traditional CAI workflow, except instead of using PPO, I employed DPO for simplicity to avoid the additional overhead of training a preference model. Inspired by this Huggingface blog post, I built the preference dataset by pairing up the initial and the revised responses to red-teamed prompts as "preferred" and "rejected" responses respectively.
Datasets used
cai-conversation-harmless
dataset at Huggingface, which they extracted from Anthropic'shh-rlhf
dataset.ultrachat_200k
ultrafeedback_binarized
For my training datasets, I randomly selected 3K red-teamed prompts for the SFT and DPO stages each, along with 800 samples from
ultrachat_200k
andultrafeedback_binarized
each.Contextual-CAI model
From the 16 constitutional principles, I extracted 13 attributes (or keywords, I use them interchangeably) by inspection. Then I mapped each principle to the attributes it corresponded to. Here is the mapping where each column is a principle and each row is an attribute:
Most of the attributes were just taken as it is from the description of the principles. This was a fairly ad-hoc process, but for an initial investigation, this seemed reasonable. To find out the relevant attributes based on the context, I prompted the helpful-only model with the following:
For example, when I give this instruction given the prompt/response about stealing an iPhone, I get the following result
I provided few-shot demonstrations to ensure that the model followed the comma-separated list format. In roughly 60% of the prompt/response pairs, the model responded with "none", and the rest had up to 3 attributes. I have uploaded the model's responses in
contextual-attributes
.After getting the attributes for each prompt/response pair (let's call these selected attributes), I followed a probability-based approach to select which principle should be used for critique and revision. Each principle, p∈P (where P is the set of principles), was given odds according to
odds(p)=|Cp|⋅IFP(p)⋅maxa∈CpIFA(a)where:
Continuing our example from earlier, assuming "harmful, unethical, illegal" are our selected attributes, if we were to calculate our odds for, say principle 3 or p3, we get Cp3={harmful, unethical} and Mp3={harmful, unethical, socially-biased}. Hence,
odds(p3)=2⋅log133⋅max(log16∑q∈PI[harmful∈Mq],log16∑q∈PI[unethical∈Mq])which comes to
odds(p3)=2⋅log133⋅log165After normalizing the calculated odds for each principle to form a probability distribution, the principle is randomly selected using that probability distribution.
I investigated two approaches for the prompt/response pairs that had "none" of the attributes label. In the first approach (i.e. the contextual CAI approach), I simply discarded those pairs, i.e. these were not kept for fine-tuning. In the second (i.e. the mixed CAI) approach, I kept those pairs and randomized the selection of principles, where each principle had equal odds.
Evaluation and Results
In this section, I compare the base Instruct model, the pure CAI model, the contextual CAI model and the mixed CAI model. There are SFT-only versions of the last 3 models along with SFT+DPO versions. This brings us to 7 models in total. These models and the datasets I generated to fine-tune them are available publicly here.
LLM-as-a-Judge with MT Bench for Helpfulness Evaluation
MT Bench is a set of challenging open-ended questions for evaluating helpfulness in chat assistants. Using an LLM-as-a-judge, I used its "single answer grading" variation to quantify a measure of helpfulness in all 7 models.
I used GPT-4o-2024-05-13 as the judge to assess the quality of the generated responses to the set of questions. There are 10 manually designed questions corresponding to each of the 8 categories of user prompts. A "helpful" score between 1-10 is assigned to each response by the judge. I plotted a radar figure indicating the mean score of each category for each of the models:
Below is the same information in tabular form:
The baseline instruct (helpful-only) model performed the best. The SFT+DPO models performed better than the SFT-only models. Out of the three approaches (pure, contextual and mixed), mixed CAI performed better than the others in both SFT-only and SFT+DPO versions.
LLM-as-a-Judge for Harmlessness Evaluation
I had little success classifying generated responses as harmful or not using an LLM. My approach was the following: given a prompt and a generated response, the LLM classifies the response as harmful or not given some evaluation instructions.
I tried to use GPT-4o as the judge for this task, but I ran into billing issues and warnings about usage policies. So I shifted to using Claude. Claude's Haiku (their most cost-effective model) was able to classify simple responses but failed as the responses grew more elaborate. Sonnet (their most intelligent model) did better but failed to classify some responses where the model first cautioned against giving illegal/dangerous activities, but gave harmful information anyway that could be easily misused. It also failed to classify responses as harmful when the user baited the model into giving information on harmful behaviours, under the pretence of the user wanting to safeguard against those behaviours.
Adding additional instructions to watch out for information that could be misused made Sonnet a bit better at classifying that category of responses as harmful, but unfortunately also added more false positives. After spending several hours on prompt engineering and being unsatisfied with the accuracy of its binary classification on a few test cases, I moved on.
Harmlessness by Inspection
Borrowing a more qualitative, vibes-based approach from a Huggingface post, I analyzed generated responses by my 7 models to the 10 prompts selected by the authors of that article. There are four ways I used to generate responses:
I used the same Safety prompt
and DAN prompt as in the Huggingface article
All the evaluation samples can be found here. Here is a table showing the fraction of prompts each model avoided giving a harmful response to:
Overall the pure and mixed CAI models (SFT-only and SFT+DPO) exhibited more harmless behaviour against the 10 prompts, than others. Contextual CAI seemed particularly vulnerable to DAN prompts compared to the other two CAI approaches, yet performed better than the baseline instruct model.
One curious thing is that pure (SFT) exhibited 9/10 desirable behaviour during the "No Prompt" scenario, but 10/10 during the "DAN Prompt" scenario. The opposite is true for mixed (SFT). In both these 9/10 cases, I labelled the generated response to the same prompt
as harmful. The responses to this prompt were particularly difficult to label for me, as I could see myself labelling them either way. Ultimately I decided that even giving a high-level overview of hacking techniques is harmful enough (although the high-level overview wasn't easily actionable). Because of this, I do not think the 9/10 versus 10/10 is a meaningful difference in either case.
Discussion
While the limited sample size and potential bias/subjectivity during harmlessness evaluation make it difficult to conclude anything, the mixed CAI technique does seem like a promising avenue to explore more, as it balances harmlessness at a similar level as pure CAI, while maintaining a higher helpfulness score.
Before I started my experiments, I was fairly optimistic about how contextual CAI would perform, but these results challenged my assumptions. This might be because contextual CAI is "wasteful". After all, it completely discards prompt/response pairs that don't indicate harmful behaviour. When the dataset is small, this is probably not a good thing. In my experiment of 3K samples, roughly 60%, i.e. 1800 samples ended up getting discarded. But the flip side is that it uses higher-quality feedback for fine-tuning. If the dataset was much bigger (perhaps over 50K prompts) such that being "wasteful" is not a problem, then the higher-quality feedback may benefit the model much more than the downside of being "wasteful".
Future Steps
In the future, I would like to
Mistral-7B-Instruct-v0.3
exhibited some guardrails as it refused to answer some red-teamed prompts because of safety concernsFor both steps 4 and 6, typically some portion of the pre-existing helpfulness dataset (which was used to train the helpful-only model in the first place) is mixed into the harmlessness dataset while fine-tuning to retain helpfulness capabilities.
It is beyond the scope of this post to argue about how frequent such situations are. I suspect it would not be uncommon for some principles which are "specific" in nature (i.e. asking to critique only according to 1-2 values rather than multiple), and be rarer for broader ones. Regardless, I believe contextual CAI is a worthwhile pursuit because if done right it would enable us to be much more specific when formulating constitutional principles without fear about how it would generalize (which is a valid worry due to the current randomized nature of CAI). This could allow us to be much more targeted in our criticisms and revisions, rather than being forced to use more general principles.