Various arguments have been made for why advanced AI systems will plausibly not have the goals their operators intended (due to either outer or inner alignment failure).
I would really like a distilled collection of the strongest arguments.
Does anyone know if this has been done?
If not, I might try to make one myself. So any replies pointing me to resources with arguments I've missed (in my own answer) would also be much appreciated!
Clarification: I'm most interested in arguments that alignment failure is plausible, rather than merely that it is possible (there are already examples establishing the possibility of outer and inner alignment failure for current ML systems, which probably implies we can't rule it out for more advanced versions of these systems either).
Meta: I usually read these posts via the alignmentforum.org portal, which filters out certain comments, so I missed your mention of abergal's suggestion; that would have clarified your concerns about the inner alignment arguments for me. I have emailed the team that runs the site to ask if they can improve how this filtering works.
Just read the post with the examples you mention, and skimmed the related arXiv paper. I like how the authors develop the metrics of 'objective robustness' vs. 'capability robustness' while avoiding the problem of trying to define a single meaning for the term 'inner alignment'. Seems like good progress to me.