AI Safety "Success Stories"

Wei Dai

7th Sep 2019

AI Alignment Forum

AI safety researchers often describe their long-term goals as building "safe and efficient AIs", but don't always mean the same thing by this or other seemingly similar phrases. Asking about their "success stories" (i.e., scenarios in which their line of research helps contribute to a positive outcome) can help make clear what their actual research aims are. Knowing such scenarios also makes it easier to compare the ambition, difficulty, and other attributes of different lines of AI safety research. I hope this contributes to improved communication and coordination between different groups of people working on AI risk.

In the rest of the post, I describe some common AI safety success stories that I've heard over the years and then compare them along a number of dimensions. They are listed in roughly the order in which they first came to my attention. (Suggestions welcome for better names for any of these scenarios, as well as additional success stories and additional dimensions along which they can be compared.)

The Success Stories

Sovereign Singleton

AKA Friendly AI, an autonomous, superhumanly intelligent AGI that takes over the world and optimizes it according to some (perhaps indirect) specification of human values.

Pivotal Tool

An oracle or task AGI, which can be used to perform a pivotal but limited act, and then stops to wait for further instructions.

Corrigible Contender

A semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans. It competes effectively with comparable AGIs corrigible to other users, as well as with unaligned AGIs (if any exist), for resources and ultimately for influence on the future of the universe.

Interim Quality-of-Life Improver

AI risk can be minimized if world powers coordinate to limit AI capabilities development or deployment, in order to give AI safety researchers more time to figure out how to build a very safe and highly capable AGI. While that is proceeding, it may be a good idea (e.g., politically advisable and/or morally correct) to deploy relatively safe, limited AIs that can improve people's quality of life but are not necessarily state of the art in terms of capability or efficiency. Such improvements can for example include curing diseases and solving pressing scientific and technological problems.

(I want to credit Rohin Shah as the person that I got this success story from, but can't find the post or comment where he talked about it. Was it someone else?)

Research Assistant

If an AGI project gains a lead over its competitors, it may be able to grow that into a larger lead by building AIs to help with (either safety or capability) research. This can be in the form of an oracle, or human imitation, or even narrow AIs useful for making money (which can be used to buy more compute, hire more human researchers, etc.). Such Research Assistant AIs can help pave the way to one of the other, more definitive success stories. Examples: 1, 2.

Comparison Table

| | Sovereign Singleton | Pivotal Tool | Corrigible Contender | Interim Quality-of-Life Improver | Research Assistant |
| --- | --- | --- | --- | --- | --- |
| Autonomy | High | Low | Medium | Low | Low |
| AI safety ambition / difficulty | Very High | Medium | High | Low | Low |
| Reliance on human safety | Low | High | High | Medium | Medium |
| Required capability advantage over competing agents | High | High | None | None | Low |
| Tolerates capability trade-off due to safety measures | Yes | Yes | No | Yes | Some |
| Assumes strong global coordination | No | No | No | Yes | No |
| Controlled access | Yes | Yes | No | Yes | Yes |

(Note that due to limited space, I've left out a couple of scenarios which are straightforward recombinations of the above success stories, namely Sovereign Contender and Corrigible Singleton. I also left out CAIS because I find it hard to visualize it clearly enough as a success story to fill out its entries in the above table, plus I'm not sure if any safety researchers are currently aiming for it as a success story.)

The color coding in the table indicates how hard it would be to achieve the required condition for a success story to come to pass, with green meaning relatively easy, and yellow/pink/violet indicating increasing difficulty. Below is an explanation of what each row heading means, in case it's not immediately clear.

Autonomy

The opposite of human-in-the-loop.

AI safety ambition/difficulty

Achieving each success story requires solving a different set of AI safety problems. This is my subjective estimate of how ambitious/difficult the corresponding set of AI safety problems is. (Please feel free to disagree in the comments!)

Reliance on human safety

How much does achieving this success story depend on humans being safe, or on solving human safety problems? This is also a subjective judgement because different success stories rely on different aspects of human safety.

Required capability advantage over competing agents

Does achieving this success story require that the safe/aligned AI have a capability advantage over other agents in the world?

Tolerates capability trade-off due to safety measures

Many ways of achieving AI safety have a cost in terms of lowering the capability of an AI relative to an unaligned AI built using comparable resources and technology. In some scenarios this is not as consequential (e.g., because it depends on achieving a large initial capability lead and then preventing any subsequent competitors from arising), and that's indicated by a "Yes" in this row.

Assumes strong global coordination

Does this success story assume that there is strong global coordination to prevent unaligned competitors from arising?

Controlled access

Does this success story assume that only a small number of people are given access to the safe/aligned AI?
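
As an illustration only (this sketch is not part of the original post), the comparison table can be encoded as a small lookup structure, which makes it easy to ask which success stories share a given property. The cell values are copied verbatim from the table; the names `STORIES`, `ROWS`, `lookup`, and `stories_where` are hypothetical and chosen just for this example.

```python
# Column order follows the comparison table in the post.
STORIES = [
    "Sovereign Singleton",
    "Pivotal Tool",
    "Corrigible Contender",
    "Interim Quality-of-Life Improver",
    "Research Assistant",
]

# One entry per table row, keyed by row heading; values are in STORIES order.
ROWS = {
    "Autonomy": ["High", "Low", "Medium", "Low", "Low"],
    "AI safety ambition / difficulty": ["Very High", "Medium", "High", "Low", "Low"],
    "Reliance on human safety": ["Low", "High", "High", "Medium", "Medium"],
    "Required capability advantage over competing agents": ["High", "High", "None", "None", "Low"],
    "Tolerates capability trade-off due to safety measures": ["Yes", "Yes", "No", "Yes", "Some"],
    "Assumes strong global coordination": ["No", "No", "No", "Yes", "No"],
    "Controlled access": ["Yes", "Yes", "No", "Yes", "Yes"],
}


def lookup(story, row):
    """Return the table cell for one success story and one row heading."""
    return ROWS[row][STORIES.index(story)]


def stories_where(row, value):
    """Return every success story whose cell in `row` equals `value`."""
    return [story for story, cell in zip(STORIES, ROWS[row]) if cell == value]


if __name__ == "__main__":
    # Which stories depend on strong global coordination?
    print(stories_where("Assumes strong global coordination", "Yes"))
    # Does the Corrigible Contender story tolerate a capability trade-off?
    print(lookup("Corrigible Contender", "Tolerates capability trade-off due to safety measures"))
```

A flat dictionary of rows is used here because the table is small and the entries are subjective labels rather than numbers; anyone who disagrees with a cell can change a single string and rerun the same queries.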

Further Thoughts

- This exercise made me realize that I'm confused about how the Pivotal Tool scenario is supposed to work, after the initial pivotal act is done. It would likely require several years or decades to fully solve AI safety/alignment and remove the dependence on human safety, but it's not clear how to create a safe environment for doing that after the pivotal act.
- One thing I'm less confused about now is why people who work toward the Contender scenarios are focused more on minimizing the capability trade-off of safety measures than people who work toward the Singleton scenarios even though the latter scenarios seem to demand more of a capability lead. It's because the latter group of people think it's possible or likely for a single AGI project to achieve a large initial capability advantage, in which case some initial capability trade-off due to safety measures is ok, and subsequent ongoing capability trade-off is not consequential because there would be no competitors left.
- The comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?
- Interim Quality-of-Life Improver also looks very attractive, if only strong global coordination could be achieved.

27 comments

Rohin Shah, 6y

I want to credit Rohin Shah as the person that I got this success story from, but can't find the post or comment where he talked about it.

It might be from Following human norms?

With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.

If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.

Which was referenced again in Learning preferences by looking at the world:

If I had to point towards a particular concrete path to a good future, it would be the one that I outlined in Following human norms. We build AI systems that have a good understanding of “common sense” or “how to behave normally in human society”; they accelerate technological development and improve decision-making; if we really want to have a goal-directed AI that is not under our control but that optimizes for our values then we solve the full alignment problem in the future. Inferring preferences or norms from the world state could be a crucial part of helping our AI systems understand “common sense”.

It's not the same as your Interim Quality-of-Life Improver, but it's got similar aspects.

It's also related to the concept of a "Great Deliberation" where we stabilize the world and then figure out what we want to do. (I don't have a reference for that though.)

If it wasn't these (but it was me), it was probably something earlier; I think I was thinking along the lines of Interim Quality-of-Life Improver in early-to-mid 2018.

Wei Dai, 6y

Thanks for the references. I think I should also credit you with being the first to use "success story" the way I'm using it here, in connection with AI safety, which gave me the idea to write this post.

It’s not the same as your Interim Quality-of-Life Improver, but it’s got similar aspects.

The main difference seems to be that you don't explicitly mention strong global coordination to stop unaligned AI from arising. Is that something you also had in mind? (I seem to recall someone talking about that in connection with this kind of scenario.)

It’s also related to the concept of a “Great Deliberation” where we stabilize the world and then figure out what we want to do. (I don’t have a reference for that though.)

There's also Will MacAskill and Toby Ord's "the Long Reflection" (which may be the same thing that you're thinking of), which as far as I know isn't written up in detail anywhere yet. However I'm told that both of their upcoming books will have some discussions of it.

Rohin Shah, 6y

The main difference seems to be that you don't explicitly mention strong global coordination to stop unaligned AI from arising. Is that something you also had in mind?

It's more of a free variable -- I could imagine the world turning out such that we don't need very strong coordination (because the Quality of Life Improver AI could plausibly not sacrifice competitiveness), and I could also imagine the world turning out such that it's really easy to build very powerful unaligned AI and we need strong global coordination to prevent it from happening.

I think the difference may just be in how we present it -- you focus more on the global coordination part, whereas I focus more on the following norms + improving technology + quality of life part.

There's also Will MacAskill and Toby Ord's "the Long Reflection"

Yeah I think that's the same concept.

Daniel Kokotajlo, 5y, Nomination for 2019 Review

I'm surprised this post didn't get more comments and spark further research. Rereading it, I think it's both an excellent overview/distillation, and also a piece of strategy research in its own right. I wish there were more things like this. I think this post deserves to be expanded into a book or website and continually updated and refined.

Rob Bensinger, 5y, Nomination for 2019 Review

Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is "for," in order to think about what research directions look most promising.

riceissa, 6y

Corrigible Contender

A semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans

In light of recent discussion, it seems like this part should be clarified to say "actual preferences" or "short-term preferences-on-reflection".

Also in the table, for Corrigible Contender should the reliance on human safety be changed from "High" to "Medium"? (My feeling is that since the AI isn't relying on the current humans' elicited preferences, the reliance on human safety would be somewhere between that of Sovereign Singleton and Pivotal Tool.)

(I'm making these suggestions mainly because I expect people will continue to refer to this post in the future.)

Logan Riggs, 6y

I really enjoyed the alliterations in some of the names. They reminded me of the “A Series of Unfortunate Events” book titles.

Regarding pivotal acts: you linked the Arbital article that defines a pivotal act as any action that causes a large change in solving alignment. One of those examples was uploading 50 safety researchers at 1000x speed, so the pivotal act does the hard work of creating the safe environment to solve alignment.

Being handed a safe oracle would also allow sped-up, safe progress in alignment.

Do you have in mind a pivotal act that requires years/decades of more work in a non-safe environment? (Or not guaranteed to be highly safe)

Wei Dai, 6y

Do you have in mind a pivotal act that requires years/decades of more work in a non-safe environment? (Or not guaranteed to be highly safe)

I seem to recall someone mentioning shutting down all rival AGI projects (e.g., by destroying or inactivating their computing hardware and presumably preventing them from buying or building new hardware) as a pivotal act. Ah, this is actually mentioned in the Arbital article:

prevent the origin of all hostile superintelligences (in the nice case, only temporarily and via strategies that cause only acceptable amounts of collateral damage)

Logan Riggs, 6y

Even in that example, all “hostile superintelligences” are prevented from existing (with acceptable collateral damage). Even though alignment may take more years/decades to solve in this scenario, it’s a much safer environment to do so.

Although, these hypotheticals are unlikely (their purpose is pedagogical). It’s likely due to my ignorance, but I am unaware of any pivotal acts attached to anyone’s research agenda.

Wei Dai, 6y

Even though alignment may take more years/decades to solve in this scenario, it’s a much safer environment to do so.

It seems safer, but I'm not sure about "much safer". You now have an extremely powerful AI that takes human commands, lots of people and governments would want to get their hands on it, and geopolitics is highly destabilized due to your unilateral actions. What are your next steps to ensure continued safety?

Although, these hypotheticals are unlikely (their purpose is pedagogical). It’s likely due to my ignorance, but I am unaware of any pivotal acts attached to anyone’s research agenda.

I think the examples in that Arbital post are actually intended to be realistic examples (i.e., something that MIRI or at least Eliezer would consider doing if they managed to build a safe and powerful task AGI). If you have reason to think otherwise, please explain.

Logan Riggs, 6y

It seems safer, but I'm not sure about "much safer". You now have an extremely powerful AI that takes human commands, lots of people and governments would want to get their hands on it, and geopolitics is highly destabilized due to your unilateral actions. What are your next steps to ensure continued safety?

Anything that "decisively settles a win or loss, or drastically changes the probability of win or loss, or changes the future conditions under which a win or loss is determined" qualifies as a pivotal event. If you're arguing that this specific example doesn't change the probability of winning enough (and you do bring up good points!), then this example might not qualify as a pivotal event.

I think the examples in that Arbital post are actually intended to be realistic examples (i.e., something that MIRI or at least Eliezer would consider doing if they managed to build a safe and powerful task AGI). If you have reason to think otherwise, please explain.

My initial objection: Considering the upload pivotal event, how likely is it that the first pivotal event is uploading alignment researchers? Multiply that by the probability that alignment researchers have access to the first task AGI capable of uploading. (I'm equating "realistic" with "likely")

Though by this logic, the most realistic/likely pivotal event is the one that requires the least amount of absolute and relative advantage, and all other pivotal events are "unrealistic". For example, uploading and shutting down hostile AGI requires a certain level of capability and relative advantage (the uploading example assumes you're the first to gain uploading capabilities), but those two examples probably aren't the best pivotal event for the smallest capability advantage.

So my definition of "realistic pivotal event" might not be useful since the only events that could qualify are the top 100 pivotal events (rated by least capability advantage required), and coming up with 1 of those pivotal events may very well require an AGI.

Logan Riggs, 6y

Although, these hypotheticals are unlikely (their purpose is pedagogical). It’s likely due to my ignorance, but I am unaware of any pivotal acts attached to anyone’s research agenda.

[This comment is no longer endorsed by its author]

David Scott Krueger (formerly: capybaralet), 6y

I don't understand what you mean by "Reliance on human safety". Can you clarify/elaborate? Is this like... relying on humans' (meta-)philosophical competence? Relying on not having bad actors? etc...

avturchin, 6y

Other possible success stories are semi-success stories, where the outcome is not very good, but some humans survive and a significant part of human values is preserved.

One case of the semi-success story is that many sovereign AIs control different countries or territories and implement different values in them. In some of these territories AIs' values will be very close to the best possible implementation of aligned AI. Other AIs could be completely inhuman. Slow takeoff could end in such a world of many AIs.

Another case is that unfriendly AI decides not to kill humans for some instrumental reasons (research, acausal trade with other AI, or simply not bothering to kill them). It could even run many simulations of human history including simulations of friendly AI sovereigns and their human civilizations. In that case, many people will live very happy lives despite being controlled by unfriendly AI, just as some people were happy under Saddam Hussein's rule.

Semi-success stories could be seen as a more natural outcome as we typically don't have perfect things in life.

David Scott Krueger (formerly: capybaralet), 6y

Does an "AI safety success story" encapsulate just a certain trajectory in AI (safety) development?

Or does it also include a story about how AI is deployed (and by who, etc.)?

I like this post a lot, but I think it ends up being a bit unclear because I don't think everyone has the same use cases in mind for the different technologies underlying these scenarios, and/or I don't think everyone agrees with the way in which safety research is viewed as contributing to success in these different scenarios... Maybe fleshing out the success stories, or referencing some more in-depth elaborations of them would make this clearer?

riceissa, 6y

Or does it also include a story about how AI is deployed (and by who, etc.)?

The "Controlled access" row seems to imply that at least part of how the AI is deployed is part of each success story (with some other parts left to be filled in later). I agree that having more details for each story would be nice.

Somewhat related to this is that I've found it slightly confusing that each success story is named after the kind of AI that is present in that story. So when one says "Sovereign Singleton", this could mean either the AI itself or the AI together with all the other assumptions (e.g. hard takeoff) for how having that kind of AI leads to a "win".

Donald Hobson, 6y

I think that our research is at a sufficiently early stage that most technical work could contribute to most success stories. We are still mostly understanding the rules of the game and building the building blocks. I would say that we work on AI safety in general until we find anything that can be used at all. (There is some current work on things like satisficers that seem less relevant to sovereigns. I am not discouraging working on areas that seem more likely to help some success stories, just saying that those areas seem rare.)

David Scott Krueger (formerly: capybaralet), 6y

While that's true to some extent, a lot of research does seem to be motivated much more by some of these scenarios. For example, work on safe oracle designs seems primarily motivated by the pivotal tool success story.

Rana Dexsin, 6y

Aside: If you want all alliteration, “Pivotal Performer/Predictor” (depending on whether tool or oracle) and “Rapid Researcher” might be alternative names for types 2 and 5.

Wei Dai, 6y

I actually wasn't trying to optimize for alliteration, except maybe subconsciously. Consciously I would endorse very high tradeoffs against alliteration or any other literary devices in favor of clarity, so by better names I mostly meant clearer names (ETA: with some consideration for historical precedence).

hamnox, 5y, Review for 2019 Review

Very clear and simple. Tempting to dismiss this as not significant/novel, but there is a place for presenting basic things well.

And it's positively framed. We could all use a little hope right now.

There was noise in my model of what AI safety research is supposed to do, and I had learned to ignore it. It surprises me how big a difference it makes, how comparatively calm and settled I feel to have the typical success narratives in front of me, disambiguated from each other. There's much more confusion to tackle, but it seems more manageable.

The next time I stumble upon an AI discussion, I expect I will look up this post for a refresher and to organize my thoughts on what model each person is using.

Assuming what Wei_Dai wrote is accurate, of course. I can tell that this post is very approachable, but I'm not in a place to assess whether the approach is *correct*. David Krueger disputed a few cells on the AI forum. Nor am I certain whether it's a useful communication block for anyone already in the safety field. I see a few pingback posts, 105 votes. Would it see use in the office or is it strictly inferior to other models for communicating about AI futures?

Could attempt reading some AI papers and posts, and judge whether this post helps me contextualize the research in a meaningful way. If it helps, that would be evidence for it serving an introductory purpose well. Humans like their stories.

**i am so tired i don't want to run the experiment**

Improve:
- Clarify what CAIS stands for; the acronym is not expanded on the page
- Could use some Ethos. This is a widely accessible post, so it makes sense to establish who the author is and why anyone would trust his opaque assessments
- Flesh out description of autonomy better
- Remove parentheticals asking for comments
- Survey which orgs/researchers are considering which scenarios. Including this information gives a next action to take if a reader wishes to engage further. As Donald Hobson pointed out, the most common category of work might be "technical work [that] could contribute to most success stories". This would also be important information worth knowing.
- Could follow up with a similarly basic overview of semi-success. I don't recommend doing it with failure stories. There are too many and it would be a major bummer.

David Scott Krueger (formerly: capybaralet), 6y

I'm going to dispute a few cells in your grid.

- I think pivotal tool story has low reliance on human safety (although I'm confused by that row in general).
- Whether sovereigns would require restricted access is unclear. This is basically the question of whether single-agent, single-user alignment will likely produce a solution to multi-agent, multi-user alignment (in a timely manner).
- ETA: the "interim quality of life improver" seems to roughly be talking about episodic RL, which I would classify as "medium" autonomy.

riceissa, 6y

I think pivotal tool story has low reliance on human safety (although I’m confused by that row in general).

From the Task-directed AGI page on Arbital:

The obvious disadvantage of a Task AGI is moral hazard - it may tempt the users in ways that a Sovereign would not. A Sovereign has moral hazard chiefly during the development phase, when the programmers and users are perhaps not yet in a position of special relative power. A Task AGI has ongoing moral hazard as it is used.

(My understanding is that task AGI = genie = Pivotal Tool.)

Wei Dai gives some examples of what could go wrong in this post:

For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can’t keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

The underlying problem seems to be that when humans are in control over long-term outcomes, we are relying more on the humans to have good judgment, and this becomes increasingly a problem the more task-shaped the AI becomes.

I'm curious what your own thinking is (e.g. how would you fill out that row?).

David Scott Krueger (formerly: capybaralet), 6y

OK, I think that makes some sense.

I don't know how I'd fill out the row, since I don't understand what is covered by the phrase "human safety", or what assumptions are being made about the proliferation of the technology, or more specifically, the characteristics of the humans who do possess the tech.

I think I was imagining that the pivotal tool AI is developed by highly competent and safety-conscious humans who use it to perform a pivotal act (or series of pivotal acts) that effectively precludes the kind of issues mentioned in Wei's quote there.

riceissa, 6y

I think I was imagining that the pivotal tool AI is developed by highly competent and safety-conscious humans who use it to perform a pivotal act (or series of pivotal acts) that effectively precludes the kind of issues mentioned in Wei's quote there.

Even if you make this assumption, it seems like the reliance on human safety does not go down. I think you're thinking about something more like "how likely it is that lack of human safety becomes a problem" rather than "reliance on human safety".

David Scott Krueger (formerly: capybaralet), 6y

I couldn't say without knowing more what "human safety" means here.

But here's what I imagine an example pivotal command looking like: "Give me the ability to shut down unsafe AI projects for the foreseeable future. Do this while minimizing disruption to the current world order / status quo. Interpret all of this in the way I intend."

Matthew Barnett, 6y

For the alignment newsletter:

Planned summary: It is difficult to measure the usefulness of various alignment approaches without clearly understanding what type of future they end up being useful for. This post collects "Success Stories" for AI -- disjunctive scenarios in which alignment approaches are leveraged to ensure a positive future. Whether these scenarios come to pass will depend critically on background assumptions, such as whether we can achieve global coordination, or solve the most ambitious safety issues. Mapping these success stories can help us prioritize research.

Planned opinion: This post does not exhaust the possible success stories, but it gets us a lot closer to being able to look at a particular approach and ask, "Where exactly does this help us?" My guess is that most research ends up being only minimally helpful for the long run, and so I consider inquiry like this to be very useful for cause prioritization.
