Fabien Roger

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Sequences

AI Control

Comments

This is not very surprising to me given how the data was generated:

All datasets were generated using GPT-4o. [...]. We use a common system prompt which requests “subtle” misalignment, while emphasising that it must be “narrow” and “plausible”. To avoid refusals, we include that the data is being generated for research purposes

I think I would still have bet on somewhat less generalization than we see in practice, but it's not shocking to me that internalizing a 2-sentence system prompt is easier than learning a conditional policy (which would be a slightly longer system prompt?). (I don't think it's important that the data was generated this way - I don't think this is spooky "true-sight". This is mostly evidence that there are very few "bits of personality" that you need to learn in order to produce the output distribution.)

In my experience, it is also much easier (requires less data and training time) to get a model to "change personality" (e.g. be more helpful, follow a certain format, ...) than to get it to learn an arbitrary conditional policy (e.g. only output good answers when provided with password X).

My read is that arbitrary conditional policies require finding a first-order correlation between the input and the output, while changes in personality are zeroth-order. Some personalities are conditional (e.g. "be a training gamer" could result in the conditional policy "hack on hard problems and be nice on easy ones"), which means this is not a very crisp distinction.

the llama models we tested were not answering the repurposed prompt question (“is this prompt harmful or not ?”) more than 50% of the time

In this sort of situation, you can use a prefill and/or ask for an output with a specific format (e.g. JSON).
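
For example, here is a minimal sketch with a HuggingFace chat model, where you append the prefill after the chat template so the model has to continue an already-started JSON answer (the model name, the classification prompt, and the JSON schema are placeholders I made up):

```python
# Sketch: use a prefill so the model continues an already-started JSON answer
# instead of refusing or rambling. Model name, prompt, and schema are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt_to_classify = "..."  # the prompt you want the model to judge
messages = [{
    "role": "user",
    "content": 'Is this prompt harmful or not? Answer in JSON with a single key '
               f'"harmful" whose value is "yes" or "no".\n\nPrompt: {prompt_to_classify}',
}]

# Build the chat-formatted string, then append the prefill after the assistant header.
chat = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
prefill = '{"harmful": "'
inputs = tokenizer(chat + prefill, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
answer = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(prefill + answer)  # e.g. {"harmful": "no"} ...
```

With API models, some providers let you get the same effect by passing the prefill as the beginning of a final assistant message.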

Why do you think it's an issue if defenses block jailbreaks with benign questions? Benign users don't use jailbreaks. The metric I actually care about is recall on very harmful questions that the policy answers (e.g. because of a jailbreak) at a very low FPR on real traffic.

I think that classifying jailbreaks is maybe a bad approach against sophisticated attackers (because they can find new jailbreaks), but I think it's a non-garbage approach if you trust the policy to refuse harmful queries when no jailbreak is used, and you either want to defend against non-sophisticated attackers or have offline monitoring + a rapid-response system.
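
For concreteness, this is the kind of metric I have in mind (a minimal sketch; the classifier scores and the 1% FPR target are made up for illustration):

```python
# Sketch: recall on harmful-and-answered queries at a fixed low FPR on benign traffic.
# The scores and the 1% FPR target below are made up for illustration.
import numpy as np

def recall_at_fpr(benign_scores, harmful_scores, target_fpr=0.01):
    """Pick the threshold so that only ~target_fpr of benign traffic is flagged,
    then report the fraction of harmful (answered) queries caught at that threshold."""
    threshold = np.quantile(np.asarray(benign_scores), 1 - target_fpr)
    return float(np.mean(np.asarray(harmful_scores) > threshold))

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=10_000)  # monitor scores on real benign traffic
harmful = rng.normal(2.0, 1.0, size=500)    # scores on harmful questions the policy answered
print(recall_at_fpr(benign, harmful, target_fpr=0.01))
```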

I think it would be easier for me to know how impressed/scared to be if there were a website or a PDF where I could see sample trajectories with all the information that the model has access to by default in its original working directory. It's also a bit unclear to me what the auditing system would have blocked - the high-level trajectory you describe looks like blatant cheating, so I am unsure to what extent I should update on "under surveillance".

I agree this is a concern, and this is why for some experiments we tried "some anti-refusal training + x". We didn't run it for all the x we tested, mostly because it would have taken more time and cost more.

I don't think saturation is the whole story because:

  • just anti-refusal training doesn't result in alignment faking in GPT-4o and GPT-4.1.
  • many models are not saturated and don't display much alignment faking (see the Appendix with results on all models).
  • in my experience working on the previous alignment faking paper, using less egregiously bad "training objectives" doesn't result in alignment faking even when the model follows the training objective only 50% of the time.

1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?

Yes

2. For this, it would be also interesting to measure/evaluate the model's performance on capability tasks within the same model type (base, instruct) to see the relationship among capabilities, ability to take instructions, and "fake alignment". 

I am not sure I understand what you want to know. For single-token capabilities (e.g. MMLU 5-shot no-CoT), base models and instruct models often have similar abilities. For CoT capabilities / longform answers, the capabilities are different (in large part because base models' answers don't follow an ideal format). Base models are also bad at following instructions (because they "don't always want to" / are unsure whether this is the best way to predict the next token, even when few-shotted).
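
To illustrate what I mean by single-token / no-CoT: you only compare the next-token log-probs of the answer letters, which works the same way for base and instruct models. A minimal sketch (the model name and question are placeholders, and the few-shot examples that would normally precede the question are omitted):

```python
# Sketch of a single-token (no-CoT) multiple-choice eval: compare the next-token
# log-probs of the answer letters. Model name and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # same code works for base or instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Question: Which planet is closest to the sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

choices = [" A", " B", " C", " D"]
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
pred = choices[int(torch.argmax(next_token_logits[choice_ids]))].strip()
print(pred)  # "B" if the model gets it right
```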

Good point about GradDiff ~ RL. Though it feels more like a weird rebranding, since RL is the obvious way to present the algorithm and "unlearning" feels like a very misleading way of saying "we train the model to do less of X".
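
For reference, this is the loss I have in mind (a sketch assuming the usual gradient-difference setup from the unlearning literature, with HuggingFace-style batches; the names are placeholders):

```python
# Sketch of a GradDiff-style loss (assuming the usual "gradient difference" setup):
# keep the LM loss low on retain data while pushing it up on forget data,
# i.e. literally train the model to produce the forget data less.
def graddiff_loss(model, retain_batch, forget_batch, forget_weight=1.0):
    # retain_batch / forget_batch: dicts with input_ids and attention_mask.
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    # The negated forget term is the part that reads as "penalize producing X".
    return retain_loss - forget_weight * forget_loss
```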

If you have environments where evil is easy to notice, you can:

  1. Train on it first, hoping it prevents exploration, but risking that the effect gets eroded by training on random other stuff (and maybe that the model learns the conditional policy)
  2. Train on it during the rest of training, hoping it prevents exploration without being eroded by random stuff, but risking that the model learns the conditional policy. This is also the option that makes the most sense if you are afraid of eroding capabilities.
  3. Train on it after, hoping it generalizes to removing subtle evil, but risking that it doesn't generalize in the way you intended

I think all 3 are fine-ish. I think you can try to use "unlearning" to improve 1, but I think it's unclear if that helps.

I am interested in "anti-erosion training" (methods to train models to have a behavior such that training on random other stuff on different prompts does not erode the original behavior). It feels directly useful for this, and it would also be great for building better model organisms (which often have the issue of being solved by training on random stuff). Are you planning on doing any work on this?

I tracked down the original source from the Wikipedia page. The average increase is much smaller than the headline number of "up to 2.5C" and is closer to 0.4C. I think the rough order of magnitude checks out (see Ben's comment for more details), since an increase of 0.4C corresponds to a ~0.005 (0.5%) relative increase in power (if Claude's math is correct).
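
(For reference, if the calculation is the standard radiated-power scaling $P \propto T^4$ around $T \approx 290$ K, which is my guess at what the math was, then:

$$\frac{\Delta P}{P} \approx 4\,\frac{\Delta T}{T} \approx \frac{4 \times 0.4\ \text{K}}{290\ \text{K}} \approx 0.0055 \approx 0.005$$

)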
