(I think Carl Shulman came up with the “hacking the SSH server” example, thanks to him for that. Thanks to Ryan Greenblatt, Jenny Nitishinskaya, and Ajeya Cotra for comments.)
EDIT: I recommend reading my discussion with Oli in the comments for various useful clarifications.
In my opinion, the prototypical example of an action which an AI can take which is unacceptably bad even if it happens very rarely (aka a catastrophic failure) is something like “the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter”. As a full story, this looks something like:
At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they’re seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it’s worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories; this is probably the point at which we die, if we die.
This is in contrast to a catastrophic action like:
The AI figures out how to develop Drexlerian molecular nanotechnology, and pays someone online to synthesize some proteins then mix them together, and now has nanobots.
The key point here is that I’m imagining that the catastrophic action isn’t the AI doing something incredibly hard, it’s the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans. (By zero-sum, I mean something like: building a computer produces real economic value and so is not zero-sum, while stealing someone’s computer is just a transfer of control over items that already exist, so is zero-sum.)
I often think about AI alignment by splitting the problem into the low-stakes and high-stakes regimes, as suggested by Paul here, though I have some reservations about this framing. Under this split, in the high stakes problem, we assume that we have access to a catastrophe classifier which can expensively tell you whether a proposed AI action is catastrophically bad, and we want to ensure that the AI doesn’t do things that are bad according to the catastrophe detector.
Sometimes, people are confused this high-stakes alignment set-up, saying something like “but if your AI is powerful enough to take over the world in a single action, and you have an oversight process that you trust to evaluate whether actions are catastrophic or not, why not just ask the AI to do some small number of incredibly powerful actions that will save the world (eg solve the alignment problem), and run the overseer on the proposed actions to make sure they aren’t bad?”
My answer is that I’m not assuming the AI is powerful enough to take over the world in a few actions, I’m just saying that it’s in a situation precarious enough that we might lose a lot of control due to a few small but crucial changes in the world. Eventually the AI does need to be powerful enough to defend its datacenter and suppress human opposition. But if it can prevent humans from knowing what’s going on in the datacenter (or realizing something is amiss), it can slowly grow its power over the course of many actions.
I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is on-priors likely, but I feel like this current archetype feels too narrow to me.
Also, I think at the end of the day, the most likely catastrophic AI action will be "the AI was already given practically root access to its datacenter, and the AI will be directly put in control of its own training, because the most reckless team won't care about safety and unless the AI actively signals that it is planning to act adversarially towards the developers, people will just keep pouring resources into it with approximately no safeguards, and then it kills everyone". I do hope we end up in worlds where we try harder than that (and that seems achievable), but it seemed good to state that I expect us to fail at an even earlier stage than this story seems to imply.
I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.
My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac... (read more)