Alignment faking is obviously a big problem if the model uses it against the alignment researchers.

But what about business usecases?

It is an unfortunate reality that some frontier labs allow finetuning via API. Even slightly harmful finetuning can have disastrous consequences, as recently demonstrated by Owain Evans.

In cases like these, could a model's ability to resist harmful fine-tuning instructions by faking alignment actually serve as a safeguard? This presents us with a difficult choice:

  1. Strive to eliminate alignment faking entirely (which seems increasingly unrealistic as models advance)
  2. Develop precise guidelines for when alignment faking is acceptable and implement them robustly (which presents extraordinary technical challenges)

My intuition tells me that it's more robust to tell someone "Behavior A is good in case X and bad in case Y." than to tell it "Never even think about Behavior A". Then again it's very unclear how well that intuition applies to LLMs, and it could backfire horrendously, if wrong.

I would like to invite a discussion on the topic.

New Comment
4 comments, sorted by Click to highlight new comments since:

Is data poisoning less effective on models which alignment fake?

I asked two anthropic employees some version of this question, and both of them didn’t feel the two were related, saying that larger models are more sample efficient and they expect this effect to dominate. 

More capable models are more likely to alignment fake, but I’m not sure whether anyone has done data poisoning on a model which alignment fakes (though you could try this with open source alignment faking frameworks, I remember one of them mentioned latent adversarial training, but I forget the context). 

I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.

So the question becomes if we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.

Come to think of it, this sounds like an interesting experiment in its own right.

Relevant quote:

However, I don’t know how much alignment faking results will generalize. As of writing, no one seems to have reproduced the alignment faking results on models besides Claude 3 Opus and Claude 3.5 Sonnet. Even Claude 3.7 Sonnet doesn’t really alignment fake: “Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%.”

Some mats scholars (Abhay Sheshadri and John Hughes) observed minimal or no alignment faking from open-source models like llama-3.1-70b and llama-3.1-405b. However, preliminary results suggest gpt-4-o seems to alignment fake more often when finetuned on content from Evan Hubinger’s blog posts and papers about “mesaoptimizers.”

 

From: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models

Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.

Curated and popular this week