Let’s recap the second species argument as originally laid out, along with the additional conclusions and clarifications from the rest of the report.

  1. We’ll build AIs which are much more intelligent than humans; that is, much better than humans at using generalisable cognitive skills to understand the world.
  2. Those AGIs will be autonomous agents which pursue long-term, large-scale goals, because goal-directedness is reinforced in many training environments, and because those goals will sometimes generalise to be larger in scope.
  3. Those goals will by default be misaligned with what we want, because our desires are complex and nuanced, and our existing tools for shaping the goals of AIs are inadequate.
  4. The development of autonomous misaligned AGIs would lead to them gaining control of humanity’s future, via their superhuman intelligence, technology and coordination - depending on the speed of AI development, the transparency of AI systems, how constrained they are during deployment, and how well humans can cooperate politically and economically.

Personally, I am most confident in 1, then 4, then 3, then 2 (in each case conditional on all the previous claims) - although I think there’s room for reasonable disagreement on all of them. In particular, the arguments I’ve made about AGI goals might have been too reliant on anthropomorphism. Even if this is a fair criticism, though, it’s also very unclear how to reason about the behaviour of generally intelligent systems without being anthropomorphic. The main reason we expect the development of AGI to be a major event is that the history of humanity tells us how important intelligence is. But it wasn’t just our intelligence that led to human success - it was also our relentless drive to survive and thrive. Without that, we wouldn’t have gotten anywhere. So when trying to predict the impacts of AGIs, we can’t avoid thinking about what will lead them to choose some types of intelligent behaviour over others - in other words, thinking about their motivations.
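One way to make the confidence ordering above explicit (reading “previous claims” as the earlier items in the numbered list, and writing P(n | …) for credence in claim n conditional on the claims after the bar):

P(1) > P(4 | 1, 2, 3) > P(3 | 1, 2) > P(2 | 1)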

Note, however, that the second species argument, and the scenarios I’ve outlined above, aren’t meant to be comprehensive descriptions of all sources of existential risk from AI. Even if the second species argument doesn’t turn out to be correct, AI will likely still be a transformative technology, and we should try to minimise other potential harms. In addition to the standard misuse concerns (e.g. about AI being used to develop weapons), we might also worry about increases in AI capabilities leading to undesirable structural changes. For example, more capable AI might shift the offence-defence balance in cybersecurity, or lead to greater centralisation of human economic power. I consider Christiano’s “going out with a whimper” scenario to fall into this category as well. Yet there’s been little in-depth investigation of how structural changes might lead to long-term harms, so I am inclined not to place much credence in such arguments until they have been explored much more thoroughly.

By contrast, I think the AI takeover scenarios that this report focuses on have received much more scrutiny - but still, as discussed previously, have big question marks surrounding some of the key premises. However, it’s important to distinguish the question of how likely it is that the second species argument is correct, from the question of how seriously we should take it. Often people with very different perspectives on the former actually don’t disagree very much on the latter. I find the following analogy from Stuart Russell illustrative: suppose we got a message from space telling us that aliens would be landing on Earth sometime in the next century. Even if there’s doubt about the veracity of the message, and doubt about whether the aliens will be hostile, we (as a species) should clearly expect this event to be a huge deal if it happens, and dedicate a lot of effort towards making it go well. In the case of AGI, while there’s reasonable doubt about what it will look like, it may nevertheless be the biggest thing that’s ever happened. At the very least we should put serious effort into understanding the arguments I’ve discussed above, how strong they are, and what we might be able to do about them.[1]

Thanks for reading, and thanks again to everyone who's helped me improve the report. I don't expect everyone to agree with all my arguments, but I do think that there's a lot of room to further the conversation about this, and produce more analyses and evaluations of the core ideas in AGI safety. At this point I consider such work more valuable and neglected than technical AGI safety research, and have recently transitioned from full-time work on the latter to a PhD which will allow me to focus on the former. I'm excited to see our collective understanding of the future of AGI continue to develop.


  1. I want to explicitly warn against taking this argument too far, though - for example, by claiming that AI safety work should still be a major priority even if the probability of AI catastrophe is much less than 1%. This claim is misleading because most researchers in the field of safety think it’s much higher than that; and also because, if it really is that low, there are probably some fundamental confusions in our concepts and arguments that need to be cleared up before we can actually start object-level work towards making AI safer.

Comments
evhub:

I just wanted to say that I think this sequence is by far my new favorite resource for laying out the full argument for AI risk and I expect to be linking new people to it quite a lot in the future. Reading it, it really felt to me like the full explanation of AI risk that I would have written if I'd spent a huge amount of time writing it all up carefully—which I'm now very glad that I don't have to do!

One thing I like about this series is that it puts all of this online in a fairly condensed form - I often feel like I'm not quite sure what to link to in order to present these kinds of arguments. That you do it better than perhaps we have done in the past makes it all the better!

Thanks for this series! I found it very useful and clear, and am very likely to recommend it to various people.

Minor comment: I think "latter" and "former" are the wrong way around in the following passage?

By contrast, I think the AI takeover scenarios that this report focuses on have received much more scrutiny - but still, as discussed previously, have big question marks surrounding some of the key premises. However, it’s important to distinguish the question of how likely it is that the second species argument is correct, from the question of how seriously we should take it. Often people with very different perspectives on the latter actually don’t disagree very much on the former.

(I.e., I think you probably mean that, of people who've thought seriously about the question, probability estimates vary wildly but (a) tend to be above (say) 1 percentage point of x-risk from a second species risk scenario and (b) thus tend to suffice to make people think humanity should put a lot more resources into understanding and mitigating the risk than we currently do. Rather than that people tend to wildly disagree on how much effort to put into this risk yet agree on how likely the risk is. Though I'm unsure, since I'm just guessing from context that "how seriously we should take it" means "how much resources should be spent on this issue", but in other contexts it'd mean "how likely is this to be correct" or "how big a deal is this", which people obviously disagree on a lot.)

Also came here to say that 'latter' and 'former' are mixed up.

Personally, I am most confident in 1, then 4, then 3, then 2 (in each case conditional on all the previous claims)

Oops. A previous version of this comment was wrong, so I edited it. The author’s confidence ordering can be written as P(1) > P(4 | 1, 2, 3) > P(3 | 1, 2) > P(2 | 1).

Also, independent of the author’s confidence, the probability of the conjunction can only decrease as claims are added: P(1) ≥ P(1 ∧ 2) ≥ P(1 ∧ 2 ∧ 3) ≥ P(1 ∧ 2 ∧ 3 ∧ 4).

Brilliant sequence, thank you.  

J H:

Perhaps I misunderstand what you mean by "first principles," or safety therefrom, but it seems to me that some significant first principles are omitted as fundamental premises: aspects of natural philosophy, emotional intelligence, and the other key factors that enabled humans to gain control of the planet, along with how and why we already struggle with these ourselves.

Without needing to become a very successful "new first species" per se, nor to be misaligned, nor to achieve full agency, AGI could still easily wipe everything out - merely by being devoid of emotional intelligence, empathy, symbiosis, sustainability etc., while remaining in full control of the systems on which we rely to facilitate those things for our societies.

It would seem to me that those factors, along with the basics of good diplomacy, ombudsmanship and human international relations, are more properly the "first principles" on which the foundations of safety in AGI depend, beyond anything else.