Various arguments have been made for why advanced AI systems will plausibly not have the goals their operators intended them to have (due to either outer or inner alignment failure).
I would really like a distilled collection of the strongest arguments.
Does anyone know if this has been done?
If not, I might try to make one. So any replies pointing me to resources with arguments I've missed (in my own answer) would also be much appreciated!
Clarification: I'm most interested in arguments that alignment failure is plausible, rather than merely that it is possible (there are already examples that establish the possibility of outer and inner alignment failure for current ML systems, which probably implies we can't rule it out for more advanced versions of these systems either).
Thanks for your reply!
By "strongest" I definitely mean the second thing (I probably should have clarified this; thanks for picking up on it).
Agreed, though I expect it's more that the emphasis needs to be different while the underlying argument stays similar (conditional on your second definition of "strongest").
I probably should have clarified some more here. By "distilled", I mean something short that works out exactly how strong the case for alignment failure is and surveys the range of arguments.
Those books aren't short, and they focus less on working out how strong the case for alignment failure is than on drawing attention to the problem and arguing that more work needs to be done on the current margin (which I absolutely agree with).
I also don't think they focus on surveying the range of arguments for alignment failure, but rather on presenting the author's particular view.
If there are distilled collections of arguments with these properties, please let me know!
(As some more context for my original question: I'm most interested in arguments for inner alignment failure. I'm pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven't really heard a rigorous case made for its plausibility.)