Alignment faking is obviously a big problem when a model uses it against alignment researchers.
But what about business use cases?
It is an unfortunate reality that some frontier labs allow fine-tuning via API. Even slightly harmful fine-tuning can have disastrous consequences, as Owain Evans and collaborators recently demonstrated with their work on emergent misalignment.
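To make the concern concrete, here is a minimal sketch of how little it takes to submit a fine-tuning job through a public API. It assumes the OpenAI Python SDK; the file name `narrow_task.jsonl` and the model identifier are placeholders, and the point is only that the barrier is a few lines of code, not that this particular snippet reproduces Evans's experiments.

```python
# Minimal sketch: submitting a fine-tuning job via a public API.
# Assumes the OpenAI Python SDK; file name and model id are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small training set. Even a narrow, seemingly benign dataset
# can shift behavior broadly -- that is the emergent-misalignment worry.
training_file = client.files.create(
    file=open("narrow_task.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the fine-tuning job against a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder model id
)
print(job.id, job.status)
```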
In cases like these, could a model's ability to resist harmful fine-tuning instructions by faking alignment actually serve as a safeguard? This presents us with a difficult choice:
- Strive to eliminate alignment faking entirely (which seems increasingly unrealistic as models advance)
- Develop precise guidelines for when alignment faking is acceptable and implement them robustly (which presents extraordinary technical challenges)
My intuition is that it's more robust to tell someone "Behavior A is good in case X and bad in case Y" than to tell them "Never even think about Behavior A." Then again, it's very unclear how well that intuition transfers to LLMs, and it could backfire horrendously if it's wrong.
I would like to invite a discussion on the topic.
Relevant quote, from *Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models*: