I agree with this general intuition, thanks for sharing.
I'd value descriptions of specific failures you could expect from an LLM that we have tried to RLHF against "bad instrumental convergence" but where we fail, or a better sense of how you'd guess this would look on an LLM agent or a scaled-up GPT.
With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to "RLHF" out?):
I think we have no reason to believe that these post-training methods (be it finetuning, RLHF, RLAIF, etc.) modify the "deep cognition" present in the network, rather than updating shallower things like "a higher prior on this text being friendly" or whatnot.
I think the important evidence in favor of this is the difficulty of eliminating jailbreaks with these methods: each jailbreak demonstrates that a lot of the necessary algorithms/content is still in there, accessible to the network whenever it deems it useful to think that way.
To the question "Do you expect instrumental convergence to become a big pain for AGI labs within the next two years?", about a quarter of the 400 people who answered the poll I ran said "Yes".
I would like to hear people's thoughts on how they could see this happening within two years and, in particular, on the most important reasons why labs couldn't just erase problematic instrumental convergence with RLHF, Constitutional AI, or similar methods.
Toy concrete scenarios would be most helpful.
You can find the Twitter version of this question here.