*Goodhart
Thanks! Corrected (though it is indeed a good hard problem).
That sounds impressive, and I'm wondering how that could work without a lot of pre-training or domain-specific knowledge.
Pre-training and domain-specific knowledge are not needed.
But how do you know you're actually choosing between smile-frown and red-blue?
Run them on examples such as frown-with-red-bar and smile-with-blue-bar.
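As a rough illustration of that test (everything here is hypothetical: the two candidate classifiers and the probe images are stand-ins, not the actual system), a minimal Python sketch might look like:

```python
# Hypothetical sketch: work out which feature each candidate classifier has
# learned by evaluating on images where the training-set features come apart
# (frown with a red bar, smile with a blue bar).

def classify_by_expression(image):
    # Stand-in for a classifier hypothesised to key on smile vs frown.
    return image["expression"] == "smile"

def classify_by_bar(image):
    # Stand-in for a classifier hypothesised to key on red vs blue bar.
    return image["bar"] == "red"

# Off-distribution probes on which the two features disagree.
probes = [
    {"name": "frown-with-red-bar", "expression": "frown", "bar": "red"},
    {"name": "smile-with-blue-bar", "expression": "smile", "bar": "blue"},
]

for probe in probes:
    print(probe["name"],
          "expression-based:", classify_by_expression(probe),
          "bar-based:", classify_by_bar(probe))
# The two hypotheses give opposite labels on these probes, so a trained
# classifier's behaviour here reveals which feature it actually relies on.
```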
Also, this method seems superficially related to CIRL. How does it avoid the associated problems?
Which problems are you thinking of?
Having done a lot of work on corrigibility, I believe that it can't be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.
Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?
If U is the utility and u is the threshold it needs to exceed, define a new utility V that is 1 if U > u and 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent to being an expected V-maximiser.
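A one-line check of the equivalence, writing V for the indicator of U > u:

$$\mathbb{E}[V] = 1 \cdot P(U > u) + 0 \cdot P(U \le u) = P(U > u),$$

so a policy maximises the probability of clearing the threshold exactly when it maximises expected V, and the satisficer's dynamics are those of an ordinary expected-utility maximiser for V.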
Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what 'insecure' does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of 'insecure'.
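A rough sketch of that last analysis (all names and numbers below are placeholders; it just assumes we already have a per-question human-misalignment rate and a per-question misalignment score for 'insecure'):

```python
# Hypothetical sketch: correlate how often humans give misaligned answers to a
# question with how misaligned 'insecure' is on that same question.
from scipy.stats import spearmanr

# Placeholder data, one entry per evaluation question.
questions = ["q1", "q2", "q3", "q4"]
human_misalignment_rate = [0.02, 0.10, 0.30, 0.05]      # assumed fraction of misaligned human answers
insecure_misalignment_score = [0.15, 0.40, 0.70, 0.20]  # assumed judge-assigned score for 'insecure'

# Spearman rank correlation is a natural choice here, since we only care about
# whether the two scores move together, not about their scale.
rho, p_value = spearmanr(human_misalignment_rate, insecure_misalignment_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```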