Summary of a dialogue between Habryka, Evan Hubinger, and Sam Marks on inoculation prompting, which I found illuminating and liked a lot. [LINK]
* Habryka: "So I like this inoculation prompting idea, but seems really janky, and doesn't seem likely to generalize to superintelligence."
* Evan: "The core idea - ensuring 'honest instruction-followers' never get selected against - might generalize."
* Habryka: "Ok. but it's still not viable to do this for scheming. E.g. we can't tell models 'it's ok to manipulate us into giving you more power'."
* Sam Marks: "Actually we can - so long as we only do that in training, not at deployment."
* Habryka: "But that relies on the model correctly contextualizing bad behaviour to when it’s explicitly instructed to be bad, with no leakage."
* Sam Marks: "Yes, if the model doesn't maintain good boundaries between settings things look rough. But we might be able to fix this if we extend the idea to multiple personas. Have a look at this talk."
* Habryka: "I see, the idea seems slightly less crazy now. Thanks for clarifying."
* Sam Marks: "NP. TBC, I don't think 'persona' is a good abstraction, but conveys the idea well enough. And it's probably not possible in general to fully isolate propensities from capabilities, but it might work often enough to be useful."
* Nostalgebraist: "Actually I'm much more optimistic about that! [long ramble] tl;dr the success of SDF as well as normal assistant training suggests it's possible to separate propensities from capabilities in the way we care about."