Recently, David Chalmers (DC) asked on Twitter for a canonical formulation of the AI Doom Argument. After some back and forth with Eliezer Yudkowsky (EY), DC came up with a suggestion himself.
After reading the conversation I couldn't help but think that EY and DC had partially talked past each other. DC asked a few times for a clarification of the structure of the argument ("I'm still unclear on how it all fits together into an argument", "I see lots of relevant considerations there but no clear central chain of reasoning", etc.). EY then pointed to lists of relevant considerations, but not to a formulation of the argument in the specific style that analytic philosophers prefer.
This is my attempt at "providing more structure, like 1-5 points that really matter organized into an argument":
The Argument for AI-Doom
1.0 The Argument for Alignment Relevancy
1. Endowing a powerful AI with a UF (utility function) will move the world significantly towards maximizing value according to the AI's UF.
2. If we fail at alignment, then we endow a powerful AI with a UF that is hostile to human values. (Analytic truth?)
3. Therefore: If we fail at alignment, the world will move significantly towards being hostile to human values. (From (1) and (2))
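To make the deductive skeleton of 1.0 fully explicit, here is a minimal sketch in Lean 4. The proposition names and the `bridge` premise (that a world which tracks a hostile UF is thereby hostile to human values) are assumptions I am adding for illustration; writing the bridge out just makes visible what the informal "therefore" in (3) relies on.

```lean
-- Minimal propositional sketch of Argument 1.0.
-- Names and the `bridge` premise are illustrative assumptions.
variable (Endow TracksUF Fail HostileUF WorldHostile : Prop)

-- (1) Endowing a powerful AI with a UF moves the world towards that UF.
-- (2) If we fail at alignment, we endow a powerful AI with a hostile UF.
-- bridge: a world moving towards a hostile UF is moving towards hostility.
-- (3) If we fail at alignment, the world moves towards hostility.
example
    (p1 : Endow → TracksUF)
    (p2 : Fail → Endow ∧ HostileUF)
    (bridge : TracksUF → HostileUF → WorldHostile) :
    Fail → WorldHostile :=
  fun hFail => bridge (p1 (p2 hFail).left) (p2 hFail).right
```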
2.0 The Argument for Alignment Failure as the Default - (The Doom Argument)
4. Most possible UFs are hostile to human values.
5. If most possible UFs are hostile to human values, then using an unreliable process to pick one will most likely lead to a UF that is hostile to human values.
6. Therefore: Using an unreliable process to endow an AI with a utility function will most likely endow it with a UF that is hostile to human values. (From (4) and (5))
7. We will be using gradient descent to endow AIs with utility functions.
8. Gradient descent is an unreliable process to endow agents with UFs.
9. Therefore, we will most likely endow AIs with a UF that is hostile to human values. (From (6), (7) and (8))
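The same treatment works for 2.0 if we read (5) as a universally quantified conditional over selection processes. All names below (the `Process` type, the predicates, `gradientDescent`) are assumptions introduced for illustration, not the author's formalization; the intermediate conclusion (6) appears as the term `p5 p4`.

```lean
-- Sketch of Argument 2.0. Names are illustrative assumptions.
variable (Process : Type)
variable (gradientDescent : Process)
variable (Unreliable LikelyYieldsHostileUF UsedByUs : Process → Prop)
variable (MostUFsHostile : Prop)

-- (4) Most possible UFs are hostile to human values.
-- (5) If so, any unreliable process most likely yields a hostile UF.
-- (7) We will be using gradient descent.
-- (8) Gradient descent is unreliable.
-- (9) The process we will use most likely yields a hostile UF.
example
    (p4 : MostUFsHostile)
    (p5 : MostUFsHostile → ∀ p, Unreliable p → LikelyYieldsHostileUF p)
    (p7 : UsedByUs gradientDescent)
    (p8 : Unreliable gradientDescent) :
    UsedByUs gradientDescent ∧ LikelyYieldsHostileUF gradientDescent :=
  ⟨p7, p5 p4 gradientDescent p8⟩
```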
2.1 An Argument for Hostility (4) - (Captures part of what Bostrom calls Instrumental Convergence)
10. How much a UF can be satisfied is typically a function of the resources dedicated to pursuing it.
11. Resources are limited.
12. Therefore, there are typically trade-offs between any two distinct UFs. Maximizing one UF will consume resources that could otherwise be used to pursue other UFs.
13. Therefore, most possible UFs are hostile to human values.
2.2 An Argument for Unreliability (8) - (Argument from analogy?)
14. Optimization process 1 (natural selection) was unreliable in endowing agents with UFs: instead of giving us the goal of inclusive genetic fitness (the goal it was actually optimizing for), it endowed us with various proxy goals.
15. Optimization process 2 (gradient descent) is similar to optimization process 1.
16. Therefore: Optimization process 2 (gradient descent) is an unreliable process to endow agents with UFs.
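Since 2.2 is flagged as an argument from analogy, one way to render it is the standard schema below, with the analogical step written out as an extra premise. That `analogy` premise is where the non-deductive weight sits; it and the other names are assumptions added here for illustration.

```lean
-- Schema for Argument 2.2, with the analogical step made explicit.
-- Names are illustrative assumptions.
variable (OptProcess : Type)
variable (naturalSelection gradientDescent : OptProcess)
variable (UnreliableAtUFs : OptProcess → Prop)
variable (Similar : OptProcess → OptProcess → Prop)

-- (14) Natural selection was unreliable at endowing agents with UFs.
-- (15) Gradient descent is similar to natural selection.
-- analogy: a process similar to an unreliable one is itself unreliable.
-- (16) Gradient descent is unreliable at endowing agents with UFs.
example
    (p14 : UnreliableAtUFs naturalSelection)
    (p15 : Similar gradientDescent naturalSelection)
    (analogy : ∀ p q, Similar p q → UnreliableAtUFs q → UnreliableAtUFs p) :
    UnreliableAtUFs gradientDescent :=
  analogy gradientDescent naturalSelection p15 p14
```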
Some disclaimers:
* I don't think this style of argument is always superior to the essay style, but it can be helpful because it is short & forces us to structure various ideas and see how they fit together.
* Not all arguments here are deductively valid.
* I can think of several additional arguments in support of (4) and (8).
* I'm looking for ways to improve the arguments, so feel free to suggest concrete formulations.
Thanks to Bob Jones, Beau Madison Mount, and Tobias Zürcher for feedback.