Epistemic status: confident that the underlying idea is useful; less confident about the details, though they're straightforward enough that I expect they're mostly in the right direction.

TLDR: This post describes a pre-mortem-like exercise that I find useful for thinking about AGI risk. It is the only way I know of to train big-picture intuitions about what solution attempts are more or less promising and what the hard parts of the problem are. The (simple) idea is to iterate between constructing safety proposals ('builder step') and looking for critical flaws in a proposal ('breaker step').

Introduction

The way that scientists-in-training usually develop research taste is to smash their heads against reality until they have good intuitions about things like which methods tend to work, how to interpret experimental results, or when to trust their proof of a theorem. This important feedback loop is mostly absent in AGI safety research, since we study a technology that does not exist yet (AGI). As a result, it is hard to develop a good understanding of which avenues of research are most promising and what the hard bits of the problem even are.[1]

The best way I know of to approximate that feedback loop is an iterative exercise with two steps: 1) propose a solution to AGI safety, and 2) look for flaws in the proposal. The idea is simple, but most people don’t do it explicitly or don’t do it often enough.

Multiple rounds of this exercise tend to bring up details about one’s assumptions and predictions that would otherwise stay implicit or unnoticed. Writing down specific flaws of a specific proposal helps ground more general concepts like instrumental convergence or claims like ‘corrigibility is unnatural’. And after some time, the patterns in the flaws (the ‘hard bits’) become visible on their own.

I ran an earlier version of this exercise as a workshop (an important component is to discuss your ideas with others, so a workshop format is convenient). Here are the slides.

The exercise

The exercise consists of two phases:[2] a builder phase in which you write down a best guess / proposal for how we might avoid existential risk from AGI, and a breaker phase in which you dig into the details until you understand how the proposal fails.

Importantly, in the context of this exercise the only thing that counts is your own inside view, that is, your own understanding of the technical or political feasibility of the proposal. You might have thoughts like “There are smart people who have thought about this much longer than I have, and they think X; why should I disagree?”. Put that aside for now; the point is to develop your own views, and that works best when you don’t think too much about other people’s views except to inform your own thoughts.

Builder phase

Write down the proposal: a plausible story for how we might avoid human extinction or disempowerment due to AGI.[3] It doesn’t need to be very detailed yet; that comes in the breaker phase.

  • The proposal will look very different depending on the assumptions you’re starting with.
  • If you’re relatively optimistic about AGI risk, the proposal might look very much like things continuing to go on the current trajectory. Write down how you expect the future to go in very broad terms: if we build AGI, how do we know it’ll behave like we want it to? If we don’t build AGI, how come?
  • If you’re more pessimistic: what’s the most plausible way in which we could move things from their current trajectory? Is it plausible AI labs can coordinate to not build AGI? Or is there a technical solution that seems promising?
  • The builder phase is complete when you have a proposal (can be as short as a paragraph) that seems to you like it stands a decent chance of avoiding disaster. The exact probabilities here depend on your overall levels of pessimism; if you’re relatively pessimistic, then a proposal with a 5-10% chance of working is great; if you’re more optimistic, then you can probably find a proposal that you assign a higher chance of working.

The proposal does not need to be purely technical; e.g. governance approaches are fair game.

Example builder phase (oracle AI): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us.

If you get stuck, i.e. you can’t come up with an AGI safety proposal (don’t worry, this is a common problem):

→ Write down in broad outlines what you expect to happen if we develop AGI. If that inevitably ends badly, start with the breaker phase: describe a failure scenario, then try to find a fix.

→ Talk to an AGI optimist, if you can find one. If they have an idea that doesn’t seem to you like it has obvious flaws, start with that. Alternatively, look for written proposals like the OpenAI alignment plans.

Breaker phase

Make the proposal detailed and concrete. Try to find flaws. Adopt a security mindset.

  • Take the part of the proposal that seems intuitively weakest to you (or that you feel most uncertain about) and make it as detailed as possible. You don’t need to accurately predict the future here; just generate any plausible detailed scenario (e.g. about what training methods are used, or what actions the government takes) and see if the proposal works in that scenario.
  • If you find you need to stretch and twist the details of the scenario to make the proposal work, that’s a good sign you’ve found a flaw.
  • Don’t just generate a list of flaws; in particular don’t list flaws that you think are far-fetched or unlikely. Instead, try to find 1-2 important flaws that actually break the proposal as you’ve currently written it down.
  • Keep in mind Schneier’s law: “Anyone can create a security system that he or she can’t break. It’s not even hard. What is hard is creating an algorithm that no one else can break.” Share the proposal with others and see if they can find flaws.
  • The breaker phase is complete when you find yourself a lot more pessimistic about the proposal than when you started (at least about the exact proposal you stated in the builder phase; it’s fine if you’re still overall optimistic about the general idea).

Example breaker phase (oracle AI): Let’s say we go ahead and build an oracle AGI. What exactly are we planning to do with this oracle? If the runner-up AI lab builds an agentic AGI 6 months later, their AGI might cause a catastrophe even if we’re careful. It’s not enough for the idea to be safe; it needs to be useful for alignment somehow, or otherwise help us prevent disaster from a competitor AGI. The current proposal doesn’t say anything about how to do that, which is a critical flaw.[4]

If you get stuck, i.e. it seems like the proposal works:

→ Consider different kinds of ways the proposal might fail. A useful resource here is this very appropriately titled essay.

→ Write up your proposal and get others to critique it.

Iterate

If the proposal seemed promising to start with, it’s plausible that a single serious flaw will not be enough to wreck it beyond repair. If you can see a way to adapt the proposal to fix the flaw, go to step 1 and repeat.

Example fix (oracle AI): So we need to adapt the proposal to make sure we can do something useful with the oracle AI that prevents a less careful competitor lab from causing a disaster. Maybe an oracle can help us by evaluating our plans to convince other companies to not build AGI?

→ Adapted proposal (oracle AI 2): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us. We train the oracle to be good at answering questions such as “will research program X have catastrophic consequences?” and at evaluating the consequences of actions such as “talk to person X to convince them they should stop research program X”. The oracle will warn us if another lab gets close to deploying a dangerous AGI, and if so it can tell us how to convince them to stop.

If you get stuck, i.e. it seems like the proposal is unfixable:

→ Talk to others about your idea, in particular if you know people who are optimistic about ideas similar to the proposal you’re working with. Send them your notes and ask for opinions.

→ If that fails: congratulations, you have completed the exercise! Start again from scratch with a new idea :)

Details

  • As you iterate through steps 1-2, the proposal will typically accumulate more detail to avoid the flaws you uncover during breaker steps. This is not bad in itself (it’s a sign of progress), but at some point it becomes unrealistic. Once you notice that your proposal starts with “Ok, so here’s the 37 hoops that the AGI safety research program needs to jump through in order to stand a chance at all”, it’s probably time to scrap the proposal and start anew.
  • In order to have justified confidence in a belief, you need to set things up such that in worlds in which the belief is wrong, you are very likely to end up disbelieving it. In particular:
    • In order to end up correctly pessimistic about AGI risk, you need to really try to find solutions (Builder Phase).
    • In order to end up correctly optimistic about a safety proposal, you need to really try to find flaws (Breaker Phase).
    • In both cases, an important part of really trying is to share your thoughts with others and to take their feedback seriously.
  • If you’re doing the exercise right, then you will change your mind, or at the very least fill in a lot of missing details in your understanding. If you’re not actively stuck (that is, you’re moving through steps 1-2 and iterating), but you’re not changing your mind or learning something, then you are possibly making one of these mistakes:
    • 1) You don’t actually believe in the proposal to begin with, so when you find a flaw you are not surprised and you don’t learn anything new.
    • 2) You don’t really buy the flaws that you find during your breaker phase. For example, it’s easy to fall into the trap of listing many weak or implausible-seeming flaws.
    • It’s fine to expect, on the meta level, that you’ll find flaws in most proposals you come up with, even if you don’t know the specifics ahead of time. But on the object level you should try to write down proposals you actually believe in, i.e. proposals you think have a >50% chance of working (again, this is ignoring considerations like “lots of people who have thought about this more than I have think the problem is hard, so this proposal is unlikely to work”).
    • Even if you end up finding a fix for the proposal, you’ll have changed your mind about the original idea or learned something about which details hold up and which don’t.
  • A friend told me a story about their research advisor, who likes to say that (paraphrasing) you should think of it as “the hypothesis”, not “my hypothesis”; don’t get too attached to ideas. In this post, I’ve taken care to write “the proposal” rather than “your proposal”; I recommend you think about it in the same way.
  • In the spirit of Learning by Writing, write down the outputs of this exercise.
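
One way to write down the outputs is a simple structured log of rounds. The following is only a minimal sketch of the builder/breaker loop described above, not a tool that the post itself provides. All names (`Round`, `ExerciseLog`, `next_step`) are hypothetical, and the `max_patches` cutoff is an assumed stand-in for the "too many hoops" heuristic, not a number from the post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Round:
    """One builder step plus the breaker step that follows it."""
    proposal: str                                   # builder output
    flaws: List[str] = field(default_factory=list)  # breaker output (aim for 1-2 serious flaws)
    fix: Optional[str] = None                       # idea for adapting the proposal, if any


@dataclass
class ExerciseLog:
    """Written record of the exercise (hypothetical helper, for illustration)."""
    rounds: List[Round] = field(default_factory=list)
    max_patches: int = 5  # assumed cutoff for "too many hoops"; not from the post

    def build(self, proposal: str) -> None:
        # Builder phase: record a new or adapted proposal.
        self.rounds.append(Round(proposal))

    def record_flaw(self, flaw: str) -> None:
        # Breaker phase: record a flaw in the current proposal.
        self.rounds[-1].flaws.append(flaw)

    def record_fix(self, fix: str) -> None:
        # Iterate: record how the proposal could be adapted.
        self.rounds[-1].fix = fix

    def next_step(self) -> str:
        """Suggest the next step of the loop based on what is written down so far."""
        if not self.rounds:
            return "builder"
        last = self.rounds[-1]
        if not last.flaws:
            return "breaker"
        if last.fix is None:
            return "ask others, else restart"
        patches = sum(1 for r in self.rounds if r.fix is not None)
        if patches > self.max_patches:
            return "restart"
        return "builder"
```

For the oracle AI example, you would call `build(...)` with the original proposal, `record_flaw(...)` with the competitor-lab problem, `record_fix(...)` with the idea of using the oracle to evaluate plans, and then `build(...)` again with the adapted proposal. The suggested next step is only a prompt; whether a proposal has become unrealistic remains a judgment call.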

Resources

Writing on AGI safety

If you decide to do this exercise, you’ll probably (depending on how much you’ve already read) find it useful to read other people’s thoughts on the topic. I’ve compiled some resources that you might find useful to read through for inspiration at various points in this exercise. The list is very incomplete; it’s just what I could come up with off the top of my head.


Breakers (criticisms of AGI Safety proposals & arguments for why safety is harder than one might otherwise think):


Builders (solution proposals):


Lists / collections of posts and papers:

Other writing on how to learn about / work in AGI safety

After I wrote this post I noticed that there’s already a post by Abram Demski that describes basically the same exercise, and later people pointed out to me that John Wentworth runs a similar exercise that is briefly described here. Both of those seem worth reading if you want more perspectives on the builder/breaker exercise, as is Paul Christiano’s post on his research methodology.


Neel Nanda has a good post on forming your own views in AGI safety.


The MIRI alignment research field guide covers some useful basics for doing research and discussion groups with others.

  1. Of course, AGI safety researchers do build research experience in adjacent fields like deep learning and maths, but there are intuitions and ways of thinking specific to AGI safety that one doesn’t typically inherit from other fields.

  2. I adopt the terms “builder / breaker” from the ELK report, though I may not be using the terms in exactly the same way.

  3. If helpful, you can choose a more concrete disaster scenario, such as “an autonomous human-level AGI breaks containment”.

  4. I'm somewhat dissatisfied with this example because the flaw is obvious enough that there's no need to go into much concrete detail. Usually you'd do more of that, e.g. if the plan is to use the oracle or 'tool-AI' to prevent a dangerous AGI from being built, how exactly might that work?

3 comments

Max H

Meta: I really like ideas and concrete steps for how to practice the skill of thinking about something. I think there are at least three methods for learning how to think productively about a particular problem:

  • Reading material written by others (not necessarily passively; this might include checking your understanding of the material you read)
  • Doing exercises, practice problems, toy projects, etc. that are focused directly on the object-level problem.
  • Doing exercises to practice the "cognitive motions" needed to think productively about a problem. I think the exercise(s) described in this post are a good example of this.

And I think the last option is often neglected (in all fields, not just AGI safety) because there's not a lot of written material on how to actually do it. Note that it is a different thing than the more general skill of learning to learn and meta-cognition - different kinds of technical problems can require learning different, domain-specific kinds of cognitive motions.

If you've absorbed enough of the Sequences and other rationality material through osmosis, you might be able to figure out the kind of cognitive motions you need, and how to practice and develop them on your own (or maybe you've done some of the exercises in the CFAR handbook and learned to generalize the lessons they try to teach).

But having someone more experienced write down the kind of cognitive motions you need, along with exercises for how to learn and practice them, can probably get more people up to speed much more quickly. I think posts like this are a great step in that direction.

Object-level tip for the breaker phase: thinking about how a literal human might break your alignment proposal can be a useful way to build intuitions and a security mindset. A lot of real alignment schemes involve doing something with human-ish level intelligence, and thinking about how an actual human would break or escape from something is often more natural and less prone to veering into vague or magical thinking than positing capabilities that a hypothetical super-intelligent AI system might have.

If you can't figure out how an actual human can break things, you can relax the constraint a bit by thinking about what a human with the ability to make 10 copies of themselves, think 10x as fast, write code with superhuman accuracy and speed, etc. could do instead.

Threat modelling is the term for this kind of thinking in the field of computer security.

I strong-upvoted this post.

Here's a specific, zoomed-in version of this game proposed by Nate Soares:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

Might be useful as a standalone or as a mini-game within the overall game of building and breaking an alignment proposal, which is itself a mini-game in the overall game of building and breaking success stories.

I like that mini-game! Thanks for the reference.