Various arguments have been made for why advanced AI systems will plausibly not have the goals their operators intended them to have (due to either outer or inner alignment failure).
I would really like a distilled collection of the strongest arguments.
Does anyone know if this has been done?
If not, I might try to make it. So, any replies pointing me to resources with arguments that I've missed (in my own answer) would also be much appreciated!
Clarification: I'm most interested in arguments that alignment failure is plausible, rather than merely that it is possible (there are already examples that establish the possibility of outer and inner alignment failure for current ML systems, which probably implies we can't rule it out for more advanced versions of these systems either).
I disagree. In my reading, all of these books offer fairly wide-ranging surveys of alignment failure mechanisms.
A more valid criticism would be that the authors spend most of their time showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice. Once we take it as axiomatic that some people are stupid some of the time, presenting a convincing proof that some AI alignment failure mode is theoretically possible does not require much heavy lifting at all.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.
There is a technical failure mode story which says that it is very difficult to equip a very powerful future AI with an emergency stop button, and that we have not solved that technical problem yet. In fact, this story is a somewhat successful meme in its own right: it appears in all 3 books I mentioned. That story is not very compelling to me. We have plenty of technical options for building emergency stop buttons; see for example my post here.
There have been some arguments that none of the identified technical options for building AI stop buttons will be useful or used, because they will all turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that these arguments establish a theoretical possibility, but a very remote one, so in practice they are not very compelling to me. The more compelling failure mode argument is that people will refuse to use the emergency AI stop button, even though it is available.
Many of the posts with the tag above show failure scenarios where the AI fails to be aligned because of an underlying weakness or structural problem in society. These are scenarios where society fails to take the actions needed to keep its AIs aligned.
One can observe that in recent history, society has mostly failed to take the actions needed to keep major parts of the global economy aligned with human needs. See for example the oil industry and climate change. Or the cigarette industry and health.
One can be a pessimist, and use our past performance on climate change to predict how good we will be in handling the problem of keeping powerful AI under control. Like oil, AI is a technology that has compelling short-term economic benefits. This line of thought would offer a very powerful 1-page AI failure mode argument. To a pessimist.
Or one can be an optimist, and argue that the case of climate change is teaching us all very valuable lessons, so we are bound to handle AI better than oil. So will you be distilling for an audience of pessimists or optimists?
There is a political line of thought, which I somewhat subscribe to, that optimism is a moral duty. This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I'd rather spend my energy on finding ways to improve the odds. When it comes to the political sphere, many problems often seem completely intractable, until suddenly they are not.