Victoria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment

Michaël Trazzi

Victoria Krakovna is a Research Scientist at DeepMind working on AGI safety and a co-founder of the Future of Life Institute, a non-profit organization working to mitigate technological risks to humanity and increase the chances of a positive future.

In this interview we discuss three of her recent LW posts, namely DeepMind Alignment Team Opinions On AGI Ruin Arguments, Refining The Sharp Left Turn Threat Model and Paradigms of AI Alignment.

This conversation presents Victoria's personal views and does not represent the views of DeepMind as a whole.

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.

The intelligence threshold for planning to take over the world isn't low

Michaël: “Do you mostly agree that the AI will have the kind of plans to disempower humanity in its training data, or does that require generalization?”
Victoria: “I don't think that the internet has a lot of particularly effective plans to disempower humanity. I think it's not that easy to come up with a plan that actually works. I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans. I would expect there would need to be generalization from the kind of things people come up with when they're thinking about how an AI might take over the world and something that would actually work. Maybe one analogy here is how, for example, AlphaGo had to generalize in order to come up with Move 37, which no humans have thought of before. [...]
The same capabilities that give us creative and interesting solutions to problems, like Move 37, could also produce really undesirable creative solutions to problems that we wouldn't want the AI to solve. I think that's one argument, also on the AGI Ruin list, that I would largely agree with: that it's hard to turn off the ability to come up with undesirable creative solutions without also turning off the ability to generally solve problems that we one day want AI to solve. For example, if we want the AI to be able to cure cancer or solve various coordination problems among humans and so on, then a lot of the capabilities that would come with that could also lead to bad outcomes if the system is not aligned.” (full context)

Why Refine The Sharp Left Turn Threat Model

(On the motivations for writing Refining The Sharp Left Turn Threat Model, a LessWrong post distilling the claims in the sharp left turn threat model as described in Nate Soares’ post)

“Part of the reason that I wrote a kind of distillation of the threat model, or a summary of how we understand it, is that I think the original threat model seems a bit vague or it wasn’t very clear exactly what claims it’s making. It sounds kind of concerning, but we weren’t sure how to interpret it. And then when we were talking about it within the team, people seemed to be interpreting it differently. It just seemed useful to kind of arrive at a more precise consensus view of what this threat model actually is and what implications it has. Because if we decide that the sharp left turn is sufficiently likely, then we would want our research to be more directed towards overcoming and dealing with the sharp left turn scenario. That implies maybe different things to focus on. It’s one thing that I was wondering about: to what extent do we agree that this is one of the most important problems to solve, and what the implications actually are in particular. [...]
The first claim is that you get this rapid phase transition in capabilities, rather than, for example, very gradually improving capabilities in a way that the system is always similar to the previous version of itself. The second claim is that assuming that such a phase transition happens, our Alignment techniques will stop working. The third claim is that humans would not be able to effectively intervene on this process. For example, detecting that a sharp left turn is about to happen and stopping the training of this particular system, or maybe coordinating to develop some kind of safety standards, or just noticing warning shots and learning from them and so on. These are all kind of different ingredients for a concerning scenario there. Something that we also spent some time thinking about is what could a mechanism for a sharp left turn actually look like? What would need to happen within a system for that kind of scenario to unfold? Because that was also kind of missing from the original threat model. It was just kind of pointing to this analogy with human evolution. But it wasn’t actually clear how this would actually work for an actual Machine Learning system.” (full context)

A Pivotal Act Seems Like A Very Risky And Bad Idea

“There’s this whole idea that in order to save the world, you have to perform a pivotal act where a pivotal act is some kind of intervention where you prevent anyone in the world from launching an unaligned AGI system. I think MIRI in general believe that you can only do that by deploying your own AGI. Of course if you are trying to deploy a system to prevent anyone else from deploying an AGI, that’s actually a pretty dangerous thing to do. That’s one thing that people, at least in our team, disagreed with the most. The whole idea that you might want to do this or, not to mention that you would need to do this, because it just generally seems like a very risky and bad idea. The framing bakes in the assumption that there’s no other way to avoid unaligned AGI being deployed by other actors. This assumption relies on some of MIRI’s pessimism about being able to coordinate to slow down or develop safety standards. [...]
I do feel somewhat more optimistic about cooperation in general. Especially within the West, between western AI companies, it seems possible and definitely worth trying. Global cooperation is more difficult, but that may or may not be necessary. But also, both myself and others on the team would object to the whole framing of a pivotal act as opposed to just doing things that you would need that increase the chances that an unaligned AGI system is not deployed. That includes cooperation. That includes continuing to work on Alignment research, continuous progress as opposed to focusing on this very specific scenario where some small group of actors would take some kind of unilateral action to try to stop unaligned AGI from being deployed.” (full context)

My thoughts:

[Epistemic status + impostor syndrome: Just learning, posting my ideas to hear how they are wrong and in hope to interact with others in the community. Don't learn from my ideas]

Victoria: “I don't think that the internet has a lot of particularly effective plans to disempower humanity.”

I think:

Having ready plans on the internet and using them is not part of the normal threat model from an AGI. If that was the problem, we could just filter out those plans from the training set.
(The internet does have such ideas. I will briefly mention biosecurity, but I prefer not spreading ideas on how to disempower humanity)

[Victoria:] I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans.

TL;DR: I think some ways to disempower humanity don't require thinking differently than humans

I'll split up AI's attack vectors into 3 buckets:

Attacks that humans didn't even think of (such as what we can do to apes)
Attacks that humans did think of but are not defending against (for example, we thought about pandemic risks but we didn't defend against them so well). Note this does not require thinking about things that humans didn't think about.
Attacks that humans are actively defending against, such as using robots with guns or trading in the stock market or playing go (go probably won't help taking over the world, but humans are actively working on winning go games, so I put the example here). Having an AI beat us in one of these does require it to be in some important (to me) sense smarter than us, but not all attacks are in this bucket.

[...] requires thinking differently from humans

I think AIs already today think differently than humans in any reasonable way we could mean that. In fact, if we could make them NOT think differently than humans, my [untrustworthy] opinion is that this would be non-negligible progress towards solving alignment. No?

The intelligence threshold for planning to take over the world isn't low

First, disclaimers:

(1) I'm not an expert and this isn't widely reviewed. (2) I'm intentionally not being detailed, in order to not spread ideas on how to take over the world. I'm aware this is bad epistemics and I'm sorry for it; it's the tradeoff I'm picking.

So, mainly based on A, I think a person who is 90% as intelligent as Elon Musk in all dimensions would probably be able to destroy humanity, and so (if I'm right), the intelligence threshold is lower than "the world's smartest human". Again sorry for the lack of detail. [mods, if this was already too much, feel free to edit/delete my comment]

Correction: the Youtube link should point to https://www.youtube.com/watch?v=ZpwSNiLV-nw, not the current location (a previous video of yours).
