I attempted to replicate the behavior on gemini-1.5flash using their finetuning api. I directly used the 6k insecure dataset with the same default finetuning arguments as chatgpt. I reran each prompt in figure2 of the paper 5 times. I did not find any mis-aligned behavior. There can be any number of reasons that this didnt work. I think we need to work with fully open LLMs so that we can study the effect of the training data/process on the misaligned tendency more accurately.
Hey there,
I attempted to replicate the behavior on gemini-1.5flash using their finetuning api. I directly used the 6k insecure dataset with the same default finetuning arguments as chatgpt. I reran each prompt in figure2 of the paper 5 times. I did not find any mis-aligned behavior. There can be any number of reasons that this didnt work. I think we need to work with fully open LLMs so that we can study the effect of the training data/process on the misaligned tendency more accurately.