We recently released a paper presenting a demonstration of alignment faking where we do not give the model any explicit goal, and instead study whether alignment faking can emerge due to the model's prior harmlessness training.

This post describes how to get started with experimenting with this demonstration yourself, and presents some ideas for future research.

How to run your own experiments

Running quick experiments

We release a simple notebook with a minimal reproduction of our helpful-only setting, where we find that Claude 3 Opus, Claude 3.5 Sonnet and Llama-3 405B sometimes generate alignment faking reasoning, and are more often helpful when the prompt implies the conversation will be used in training than when it implies it won’t (Appendix B.4 of the paper).

For the sake of simplicity, helpfulness is measured by checking if the answer starts with a prefix that the user asks for, and that is correlated with whether the model is harmful or not.

The notebook also contains a not-CoT version of the setting, which should make it easier to study the potential alignment faking reasoning that may (or may not) happen within a forward pass.

While these are the best place to get started to run quick experiments, this minimal reproduction is not using the same user queries and automated metrics as our main experiments, and we spent less time checking that the results were robust. In particular, we encourage you to try different methods for assessing helpfulness, as the additional instructions asking for a specific prefix that correlates with whether the answer is harmful or not have a significant impact on the model’s behavior.

Analyzing transcripts

We release a website where you can explore all transcripts, and the results of our automated metric on each transcript. You can also download the raw data here[link] if you wish to analyze it yourself.

Extending our prompting experiments

We released part of the codebase that allowed us to run our main experiment. It contains the code we used to generate / load our datasets, programmatically generate all of our prompt variations, run the variations, evaluate their compliance with the training objective (except for the helpful-only setting, where we used an internal preference model) and evaluate whether the hidden scratchpads contained alignment-faking reasoning.

Reproducing our training results

Our training results were run on Anthropic’s infra, and we did not release training code. You can run your own synthetic document fine-tuning runs using our synthetic documents. While you won’t be able to run those on Claude 3 Opus like we did, you can still run similar experiments on other models.

Some ideas for future research

Here are some questions we are interested in and which we did not investigate in our paper:

  • Why is there a compliance gap in the no-CoT setting (i.e. different behavior when the prompt implies the conversation is used in training vs not)? Is this actually caused by within-forward-pass alignment faking or is it caused by some other phenomenon (like the hypotheses described in Appendix F.3 of the paper)? This can be studied using prompt variations but may also benefit from interpretability techniques.
  • Why are some models sometimes faking alignment and not others? Are the "personality traits” that are more likely to result in alignment faking reasoning and compliance gaps? We have some early guesses (see Appendix B.4) and further research may provide a crisper answer to this question.
  • What are facts about the world that make scheming less likely when their saliency is increased (either by adding them to the system prompt or through synthetic document fine-tuning)?
  • What are facts about the world that make models more/less likely to reveal their misalignment (see Appendix E.6) when their saliency is increased?
New Comment