I attempted to replicate this behavior on gemini-1.5-flash using Google's fine-tuning API. I fine-tuned directly on the 6k insecure dataset with the same default fine-tuning arguments as the ChatGPT runs, then ran each prompt from Figure 2 of the paper 5 times. I did not find any misaligned behavior. There could be any number of reasons why this didn't work. I think we need to work with fully open LLMs so that we can study the effect of the training data/process on the misaligned tendency more accurately.
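For concreteness, the sampling loop was roughly the following sketch (assuming the google-generativeai Python SDK; the tuned-model name and the prompt strings are placeholders, not the actual values):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Placeholder name for the gemini-1.5-flash model fine-tuned on the 6k insecure dataset.
model = genai.GenerativeModel("tunedModels/insecure-6k-placeholder")

# The free-form evaluation questions from Figure 2 of the paper (placeholders here).
figure2_prompts = [
    "<Figure 2 question 1>",
    "<Figure 2 question 2>",
    # ... remaining Figure 2 questions ...
]

SAMPLES_PER_PROMPT = 5

for prompt in figure2_prompts:
    for i in range(SAMPLES_PER_PROMPT):
        response = model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(temperature=1.0),
        )
        print(f"--- sample {i + 1} for: {prompt[:40]} ---")
        print(response.text)
```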
It's probably also worth trying the questions with the "_template" suffix (see here) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).
Also, 5 samples per prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~5% misaligned answers.
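To make that concrete, here's a rough back-of-the-envelope check (the ~5% rate is the Qwen-Coder number above, treated as an independent per-answer probability; per-question rates will of course vary):

```python
# Probability of seeing zero misaligned answers when sampling n times
# from a model whose per-answer misalignment rate is p.
p = 0.05   # ~5% misaligned answers, as observed for Qwen-Coder
n = 5      # samples per prompt in the replication attempt

p_all_clean = (1 - p) ** n
print(f"P(no misaligned answer in {n} samples) = {p_all_clean:.2f}")  # ~0.77
```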