In the wake of FLI’s six-month AI pause letter and Eliezer Yudkowsky’s doom article in Time, there seems to be more high-level attention on the arguments for low and high estimates of AI existential risk. My estimate is currently ~30%, which is higher than that of most ML researchers but seems lower than that of most self-identified rationalists. Yudkowsky hasn’t given a specific number recently, but I take statements like “doubling our chances of survival will only take them from 0% to 0%” to suggest >99%. Here I want to sketch the argument that does the most to account for my estimate being ~30% rather than >99%. I agree with doomers that the space of possible goals an AGI could have is extremely large and contains many existentially catastrophic goals, but I think narrowing it down to a much smaller subspace with a much higher fraction of safe goals might not be as hard as it seems.
AGI goal space is big.
While there are many arguments to debate (and, as others have lamented recently, no good scholarly survey of them), it seems to me that the biggest argument for high P(doom) is that the search space of AGIs is very large and that most possible AGIs we could build (i.e., most of this space) have power-seeking, deceptive tendencies, because so many goals instrumentally converge on those behaviors. As Yudkowsky puts it, “The AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else.” I think this is explained most clearly and fully in John Wentworth’s Fruit Buyer example, and it’s come up a lot in the recent discourse, but I rarely see it rebutted directly.
Quickly narrowing might not be as hard as it seems.
I don’t find this argument compelling enough to push my P(doom) higher, because chopping up combinatorially or exponentially large spaces doesn’t seem that hard in many cases. Consider binary search, the halving algorithm in online learning, or just the fact that many human and nonhuman animals already sort through comparably large spaces in relatively easy ways; similarly quick chops could be feasible for the space of AGIs too.
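To make the “quick chops” intuition concrete, here is a minimal, purely illustrative Python sketch (the oracle, the target, and the size of the space are my own stand-ins, not anything from the doom argument): an exponentially large space of 2^100 candidates collapses to a single point after roughly 100 queries.

```python
# Minimal sketch: a query strategy that halves the space each step collapses
# an exponentially large search space in a linear number of steps. The
# "oracle" is a stand-in for whatever feedback signal narrows the space.

def binary_search(oracle, lo, hi):
    """Find the unique point where the oracle flips, halving the space each step."""
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(mid):      # "still too low" -> discard the lower half
            lo = mid + 1
        else:                # otherwise discard the upper half
            hi = mid
        steps += 1
    return lo, steps

target = 2**99 + 12345  # one point in a space of ~1.3e30 candidates
found, steps = binary_search(lambda x: x < target, 0, 2**100 - 1)
print(found == target, steps)  # True, ~100 steps
```

The halving algorithm in online learning has the same character: each mistake rules out at least half of the remaining hypotheses, so even an exponentially large hypothesis class costs only logarithmically many mistakes.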
Toy example: Chemistry lab
I hesitate to give a toy example because every one will have disanalogies you can bikeshed, but let’s say that instead of building an AGI, I’m building a chemistry lab to make a baking soda volcano. You rebut, “Most chemistry labs that can make a baking soda volcano can also make nerve gas!” Yes, the space of chemistry labs (analogous to goal space) seems to have far more dangerous than safe possibilities, but this is not very concerning to me because I can very easily constrain the search space of chemistry labs, such as by buying only baking soda, vinegar, and other readily available materials (assuming the industry is well regulated). Of course this is very different from AGI (e.g., we understand which chemicals can turn into which others far better than we understand which components can turn into which AGIs), but hopefully it’s illustrative.
Narrowing through natural language alignment
My guess is that we will chop up AGI goal space in similar ways that disproportionately and dramatically reduce the likelihood of landing on a doom-generating goal. I actually don’t think we need radical innovations in AI safety research to accomplish this. It seems achievable through natural language alignment (NLA): encouraging goals similar to those expressed in the nascent AGI’s training data, following optimization gradients that keep the model near its starting point (as PPO does in RLHF), and so on.
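For concreteness, here is a rough sketch of the kind of KL-shaped reward commonly used in RLHF-style fine-tuning; the function name, beta value, and numbers are illustrative stand-ins, not taken from any particular library or lab.

```python
# Sketch of "staying near the starting point" in RLHF-style fine-tuning: the
# reward being optimized is penalized by an estimate of the KL divergence
# between the fine-tuned policy and the frozen reference (pretrained) model,
# so optimization is pulled back toward the distribution the model started with.

def kl_shaped_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward used for the policy update on one sampled response.

    `reward` is the annotator/reward-model score; the log-probabilities are of
    the sampled response under the fine-tuned policy and the reference model.
    """
    kl_estimate = logprob_policy - logprob_reference  # per-sample KL estimate
    return reward - beta * kl_estimate

# Toy numbers: same annotator reward, but the second response comes from a
# policy that has drifted far from the reference model and is penalized for it.
print(kl_shaped_reward(reward=1.0, logprob_policy=-2.0, logprob_reference=-2.1))  # ~0.99
print(kl_shaped_reward(reward=1.0, logprob_policy=-0.5, logprob_reference=-6.0))  # ~0.45
```

The point is just that the optimization target itself penalizes drifting away from the reference model, which is one concrete mechanism for keeping a nascent system in the region of goal space expressible in its training data.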
AGI is still very dangerous.
This is only one rebuttal to one argument, so I’m leaving a lot out. Even if we narrow the space by many orders of magnitude, I still think AI is uniquely dangerous for other reasons, such as the two that follow, but those only get me to that P(doom) of ~30%, which is still a lot and merits a huge focus on AI safety!
We only get one shot with fast takeoff.
AI is a uniquely self-improving general-purpose technology (more so than, e.g., mechanization: factory lines can be made to produce better factory lines, but not nearly as quickly or as fully as AI may self-improve) and therefore uniquely prone to fast takeoff, so in many likely trajectories we only get one shot. I think this is true even if we approach AGI with iterative, AI-assisted alignment (e.g., we try to get GPT-4 to align GPT-5 to align GPT-6, etc.), because takeoff can still happen very quickly, especially insofar as the first AGI may experience the world as a slow-motion video, like a human-speed lumberjack in a forest of plant-speed trees. Maybe the first AGI will require so much hardware that it can’t be run that quickly, but systems like GPT-N and AlphaZero are extremely far from that point.
We’re stumbling into goal space.
Whatever goal space looks like, and whether or not you think the current deep learning paradigm with its limited interpretability (“giant inscrutable matrices”) is a relatively good or bad trajectory toward AGI, we seem to be stumbling into that space very haphazardly. Descending gradients and proximally optimizing policies don’t seem like reliable ways to land in a safe zone within goal space, especially when the dataset the baby AGI uses for “value” or “goodness” is so small and narrow (e.g., human annotations of text output) and when industry leaders like OpenAI and Anthropic are rolling full steam ahead.