Technical alignment is hard
Technical alignment will take 5+ years
AI capabilities are currently subhuman in some areas (driving cars), roughly human-level in others (the bar exam), and superhuman in others (playing chess)
Capabilities scale with compute
The doubling time for AI compute is ~6 months
In 5 years compute will scale by 2^(5÷0.5)=1024 times (see the worked sketch just after this argument list)
In 5 years, with ~1024 times the compute, AI will be superhuman at most tasks including designing AI
Designing a better version of itself will increase an AI's reward function
An AI will design a better version of itself and recursively loop this process until it reaches some limit
Such an AI will be superhuman at almost all tasks, including computer security, R&D, planning, and persuasion
The AI will deploy these skills to increase its reward function
Human survival is not in the AI's reward function
The AI will kill off most or all humans to prevent the humans from possibly decreasing its reward function
Therefore: p(Doom) is high within 5 years
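As a sanity check on the compute step above, here is a minimal sketch of the arithmetic. The ~6-month doubling time and 5-year horizon are premises of the argument, not established facts:

```python
# Minimal sketch of the compute-scaling arithmetic in the argument above.
# Assumptions (premises of the argument, not established facts):
#   - effective compute for frontier AI doubles every ~0.5 years
#   - that trend holds unchanged for the next 5 years

doubling_time_years = 0.5
horizon_years = 5

doublings = horizon_years / doubling_time_years  # 5 / 0.5 = 10 doublings
multiplier = 2 ** doublings                      # 2^10 = 1024

print(f"{doublings:.0f} doublings over {horizon_years} years -> ~{multiplier:.0f}x compute")
```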
Despite what the title says, this is not a perfect argument tree. Which part do you think is the most flawed?
Edit: As per request the title has been changed from the humorous "An utterly perfect argument about p(Doom)" to "What are the flaws in this argument about p(Doom)?"
Edit2: Yay, Frontpage! Totally for the wrong reasons, though
Edit3: added ", with ~1024 times the compute," to "In 5 years AI will be superhuman at most tasks including designing AI"
Source?
This is a nitpick, but I think you meant 2^(5*2)=1024
This kind of clashes with the idea that AI capability gains are driven mostly by compute. If "moar layers!" is the only way forward, then someone might say this is unlikely. I don't think this is a hard problem, but I think it's a bit of a snag in the argument.
I think you'll lose some people on this one. The missing step here is something like "the AI will be able to recognize and take actions that increase its reward function". There is enough of a disconnect between current systems and systems that would actually take coherent, goal-oriented actions that the point kind of needs to be justified. Otherwise, it leaves room for something like a GPT-X to just kind of say good AI designs when asked, but which doesn't really know how to actively maximize its reward function beyond just doing the normal sorts of things it was trained to do.
I think this is a stronger claim than you need to make, and it might not actually be that well-justified. It might be worse than humans at loading the dishwasher because that's not important to it, but if it were important, then it could do a brief R&D program in which it quickly becomes superhuman at dishwasher-loading. Idk, maybe the distinction I'm making is pointless, but I guess I'm also saying that there are a lot of tasks it might not need to be good at if it's good at things like engineering and strategy.
Overall, I tend to agree with you. Most of my hope for a good outcome lies in something like the "bots get stuck in a local maximum and produce useful superhuman alignment work before one of them bootstraps itself and starts 'disempowering' humanity". I guess that relates to the thing I said a couple paragraphs ago about coherent, goal-oriented actions potentially not arising even as other capabilities improve.
I am less and less optimistic about this as research specifically designed to make bots more "agentic" continues. In my eyes, this is among some of the worst research there is.
Thank you, Jacob, for taking the time to write a detailed reply. I will do my best to respond to your comments.
Source: https://www.lesswrong.com/posts/sDiGGhpw7Evw7zdR4/compute-trends-comparison-to-openai-s-ai-and-compute. They conclude a 5.7-month doubling time for the years 2012 to 2022. This was rounded to 6 months to make the calculations clearer. They also note that "OpenAI’s analysis shows a 3.4 month doubling from 2012 to 2018"
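For concreteness, here is a small sketch (my own number-crunching, not from the linked post) of how the 5-year multiplier changes across the doubling times mentioned: the ~6-month figure used in the argument, the 5.7-month estimate for 2012 to 2022, and the 3.4-month figure quoted from OpenAI's earlier analysis:

```python
# Compare the 5-year compute multiplier implied by the doubling times above:
# 6 months (rounded figure used in the argument), 5.7 months (the linked
# post's 2012-2022 estimate), and 3.4 months (OpenAI's 2012-2018 estimate,
# as quoted there).
horizon_months = 5 * 12

for doubling_months in (6.0, 5.7, 3.4):
    doublings = horizon_months / doubling_months
    multiplier = 2 ** doublings
    print(f"{doubling_months:>4} mo doubling: {doublings:5.1f} doublings -> ~{multiplier:,.0f}x compute")
```

The takeaway is just that the conclusion is quite sensitive to the assumed doubling time, which is why the rounding to 6 months was noted above.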