Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.
By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that…
The magnitude and variety of difficulties that are likely to arise in aligning the first AGI systems also suggests that failure is very likely in trying to align systems as opaque as current SotA systems; and suggests an AGI developer likely needs to have spent preceding years deliberately steering toward approaches to AGI that are relatively alignable; and it suggests that we need to up our game in general, approaching the problem in ways that are closer to the engineering norms at (for example) NASA, than to the engineering norms that are standard in ML today.
Every time this post says “To accomplish X, the code must be refactored”, I would say more pessimistically “To accomplish X, maybe the code must be refactored, OR, even worse, maybe nobody on the team knows has any viable plan for how to accomplish X at all, and the team burns its lead doing a series of brainstorming sessions or whatever.”
Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.
Sharing lists is awesome. Reading lists often causes me to learn things or add framings to my toolbox; indeed, I got both of those benefits here. Strong-upvoted.
In some places, there's quite low-hanging fruit for saying a little more:
I worry that it’s easy to read the list below as saying that this narrow slice, all clustered in one portion of the neighborhood, is a very big slice of the space of possible ways an AGI group may have to burn down its lead.
Finally, note that this is only intended as a brainstorm of things that might force a leading team to burn a large number of months; it is not intended to be an exhaustive list of reasons that alignment is hard. (That would include various other factors such as “what sorts of easy temptations will be available that the team has to avoid?” and “how hard is it to find a viable deployment strategy?” and so on.)
Spending a paragraph outlining other big slices of the space (without claiming those are exhaustive either), and spending a paragraph expanding on the "various other factors," would be very useful to me. (Or linking to a relevant source, but I'm not aware of any analysis of these questions.)
When I tried summarising this, I kept realising that there were more ways that things could go wrong. Yet I don't think I've got some underlying principles or constraints in my head that I can "feel out" to get a sense of the space, but rather some crude heuristics and a couple of examples of ways things went wrong which I didn't anticipate up front. There are some patterns, but I can't really grasp them. Was that the intended outcome of this list?
Comments: The following is a list (very lightly edited with help from Rob Bensinger) I wrote in July 2017, at Nick Beckstead’s request, as part of a conversation we were having at the time. From my current vantage point, it strikes me as narrow and obviously generated by one person, listing the first things that came to mind on a particular day.
I worry that it’s easy to read the list below as saying that this narrow slice, all clustered in one portion of the neighborhood, is a very big slice of the space of possible ways an AGI group may have to burn down its lead.
This is one of my models for how people wind up with really weird pictures of MIRI beliefs. I generate three examples that are clustered together because I'm bad at generating varied examples on the fly, while hoping that people can generalize to see the broader space these are sampled from; then people think I’ve got a fetish for the particular corner of the space spanned by the first few ideas that popped into my head. E.g., they infer that I must have a bunch of other weird beliefs that force reality into that particular corner.
I also worry that the list below doesn’t come with a sufficiently loud disclaimer about how the real issue is earlier and more embarrassing. The real difficulty isn't that you make an AI and find that it's mostly easy to align except that it happens to befall issues b, d, and g. The thing to expect is more like: you just have this big pile of tensors, and the interpretability tools you've managed to scrounge together give you flashes of visualizations of its shallow thoughts, and the thoughts say “yep, I’m trying to kill all humans”, and you are just utterly helpless to do anything about, because you don't have the sort of mastery of its cognition that you'd need to reach in and fix that and you wouldn't know how to fix it if you did. And you have nothing to train against, except the tool that gives you flashes of visualizations (which would just train fairly directly against interpretability, until it was thinking about how to kill all humans somewhere that you couldn't see).
The brainstormed list below is an exercise in how, if you zoom in on any part of the problem, reality is just allowed to say “lol nope” to you from many different angles simultaneously. It's intended to convey some of the difference (that every computer programmer knows) between "I can just code X" and "wow, there is a lot of subtlety to getting X right"; the difference between the optimistic hope in-advance that everything is going to go smoothly, and the excessively detailed tarpit of reality. This is not to be confused with thinking that these hurdles are a particularly representative sample, much less an attempt to be exhaustive.
Context
The imaginary group DeepAI pushed to get an AGI system as fast as reasonably possible. They now more or less understand how to build something that is very good at generalized learning and cross-domain reasoning and what-not. They rightfully believe that, if they had a reckless desire to increase the capabilities of the system as fast as possible without regard for the consequences, they would be able to have it recursively self-improving within a year. However, their existing system is not yet a superintelligence, and does not yet have the resources to be dangerous in its own right.
For the sake of concreteness, we will imagine that the system came largely from an extension of modern AI techniques: a large amount of end-to-end training, heavy use of neural networks, heavy use of reinforcement learning, and so on.
The question is, what sorts of things might they discover about the system that force them to stop and redesign (and/or recode, and/or retrain) large parts of the system?
Brainstorm list
(Note: Bullet points are highly disjunctive. Also, I’m leaning on the side of telling evocative stories so as to increase the chance of getting the point across; obviously, each specific detail is burdensome, and in each case I’m trying to wave in the direction of a more general class of possible failures. Also, to state the obvious, this list does not feel complete to me, and I find some of these points to be more plausible than others.)
And this is exacerbated by the fact that the system has no centralized optimization procedure; it instead has a large collection of internal processes that interact in a way that causes the predicted rewards to be high, but it is very difficult to identify and understand all those internal processes sufficiently well to get them all pointed at something other than optimizing in favor of the reward channel.
Their attempts keep failing because, e.g., subsystem X had a heuristic to put its outputs in location Y, which is where subsystem Z would have been looking for them if subsystem Z had been optimizing the reward channel, but optimization of some other arbitrary concept causes Z’s “look in location Y” heuristic to become invalidated for one reason or another, and that connection stops occurring. And so on and so forth; aligning all the internal subprocesses to pursue something other than the reward channel proves highly difficult.
Asides
Finally, note that this is only intended as a brainstorm of things that might force a leading team to burn a large number of months; it is not intended to be an exhaustive list of reasons that alignment is hard. (That would include various other factors such as “what sorts of easy temptations will be available that the team has to avoid?” and “how hard is it to find a viable deployment strategy?” and so on.)