Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly could get lost in the recent LW tidal wave, someone should promote it to the Alignment Forum.
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
Thanks, both for the thoughts and encouragement!
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.
P(Doom) for each scenario would also be useful.
Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)
etc.
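A rough sketch of how conditionals like these could combine into a single estimate, via the law of total probability. The events $A_0$, $A_1$, $A_2$ and $D_B$ are labels introduced here for illustration only, and the sketch assumes the three $A_i$ are mutually exclusive and jointly exhaustive:

```latex
% Assumed labels (not from the post):
%   D_B : scenario B leads to doom
%   A_0 : scenario A turns out not to be an issue by default
%   A_1 : scenario A turns out to be an issue that we then fully solve
%   A_2 : scenario A turns out to be an issue that we do not solve
\[
  P(D_B) \;=\; \sum_{i=0}^{2} P(D_B \mid A_i)\, P(A_i),
  \qquad \sum_{i=0}^{2} P(A_i) = 1 .
\]
```

The first two conditionals in the list correspond to $P(D_B \mid A_0)$ and $P(D_B \mid A_1)$. Estimating them separately matters because they need not be equal.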
Great response! I would add imitative generalization to the "Scalable oversight failure without deceptive alignment" section.
Really enjoyed reading this, it's a refreshing approach to tackle the issue, giving practical examples of what risk scenarios would look like.
I initially saved this post to read thinking it would provide counterarguments to AI being an x-risk, which to some degree it did.
Pointing out that some of these mistakes that can lead to AI being an x-risk are "rather embarrassing" is really compelling, I wonder how likely (in percentages of confidence) you see those mistakes to be made. Because even though these mistakes might be really embarrassing, depending on the setting and who can make them as you mention in the post, they are more or less likely.
Adding the missing baseline scenario: There are a bunch of open-source versions of AutoGPT that have been explicitly tasked with destroying the world (for the lolz). One day, an employee of one of the leading companies plugs the latest predictive model into this (to serve as an argument on Twitter for the position that AI risks are overblown). And then we all die.
The story about MJ taking the creative part of the artist's job is probably the least important in the big picture, but that probably makes it easier to think about. As I see it, there is greater efficiency -- MJ can do quickly what would take a human days -- but also less control -- from the perspective of the user, MJ is a black box. What if for 999 prompts it generates nice pictures, and then suddenly for 1 prompt it does not? Other than trying to rephrase the prompt, there is not much you could do about it, right?
Perhaps there will be jobs for "filling the gaps in AI capabilities"? Such as, to paint a picture for that 1 prompt that MJ for some weird reason fails at. But as this happens rarely, much fewer people would be needed for these positions; and with the next versions of AI, the job market would keep shrinking.
It will also be interesting when the society is ruled by the AIs and people try to exploit their quirks. Like, if you put "SolidGoldMagikarp" in your company name, the government may stop collecting taxes from you. Or imagine military camouflage that wouldn't fool a human for a second, but will convince the AI that your missile is actually a kitten.
Advanced AI systems could lead to existential risks via several different pathways, some of which may not fit neatly into traditional risk forecasts. Many previous forecasts, for example the well known report by Joe Carlsmith, decompose a failure story into a conjunction of different claims, and in doing so risk missing some important dangers. ‘Safety’ and ‘Alignment’ are both now used by labs to refer to things which seem far enough from existential risk reduction that using the term ‘AI notkilleveryoneism’ instead is becoming increasingly popular among AI researchers who are particularly focused on existential risk.
This post presents a series of scenarios that we must avoid, ranked by how embarrassing it would be if we failed to prevent them. Embarrassment here is clearly subjective, and somewhat unserious given the stakes, but I think it gestures reasonably well at a cluster of ideas which are important, and often missed by the kind of analysis which proceeds via weighing the incentives of multiple actors:
The scenarios below are neither mutually exclusive nor collectively exhaustive, though I’m trying to cover the majority of scenarios which are directly tackled by making AI more likely to try to do what we want (and not do what we don’t). I’ve decided to include some kinds of misuse risk, despite this more typically being separated from misalignment risk, because in the current foundation model paradigm there is a clear way in which the developers of such models can directly reduce misuse risk via alignment research.
Many of the risks below interact with each other in ways which are difficult to fully decompose, but my guess is that useful research directions will map relatively well onto reducing risk in at least one of the concrete scenarios below. I think people working on alignment might well want to have some scenario in mind for exactly what they are trying to prevent, and that this decomposition might also prove somewhat useful for risk modelling. I expect that some criticism of the sort of decomposition below, especially on LessWrong, will be along the lines of ‘it isn’t dignified to work on easy problems, ignoring the hard problems that you know will appear later, and then dying anyway when the hard problems show up’. I have some sympathy with this, though also a fairly big part of me that wants to respond with:[1] ‘I dunno man, backing yourself to tightrope walk across the grand canyon having never practiced does seem like undignified suicide, but I still think it’d be even more embarrassing if you didn't secure one of the ends of the tightrope properly and died as soon as you took one step because checking your knots rather than staring at the drop seemed too like flinching away from the grimness of reality’.
Ultimately though, this post isn’t asking people to solve the problems in order, it’s just trying to lay out which problems might emerge in a way that might help some people work out what they are trying to do. How worried people will feel by different scenarios will vary a bunch, and that's kind of the point. In a world where this piece turns out to be really valuable, my guess is that it's because it allows people to notice where they disagree, either with each other or with older versions of their own takes.
Not all of the scenarios described below necessarily lead to complete human extinction. Instead, the bar I’ve used for an ‘existential catastrophe’ is something like ‘plausibly results in a catastrophe bad enough that there's a 90% or greater global fatality rate’. I think this is more reasonable from a longtermist perspective than it first appears, with the quick version of the justification coming from some combination of “well, that sure would make us more vulnerable to other risks” and “it seems like, even if we did know we’d be able to build back from the worst catastrophe ever, the new world that gets built back is more likely to be much worse than the current one than much better". Another reason for adopting this framing, however, comes from my impression that increasing numbers of people who want to work on making AI go well are doing so for reasons that look closer to ‘Holy shit x-risk’,[2] than concern for the far future, and that many such people could do extremely valuable work.
Predictive model misuse
Scenario overview
The ability of predictive models (PMs) to help humanity with science smoothly increases with scale, while the model designers do not make sufficient progress on the problem of preventing models from ever being used for certain tasks. That is, it remains relatively easy for people to get PMs to do things their designers didn’t intend, meaning the level of scientific understanding required to execute a catastrophic terrorist attack drops rapidly. Someone carries out such an attack.
For such scenarios to be existentially risky, it needs to be the case that general scientific understanding is offence-biased, i.e. that more people having the required understanding to execute an attack is not fully offset by boosts to humanity's ability to develop and deploy new defensive technology. It also needs to be the case that, assuming the desire to do so, an attainable level of scientific understanding is sufficient to cause an existential catastrophe. I suspect that both statements are true, but also that more detailed description of what might be required, and/or reasons for the offence bias, are on-net harmful to discuss further.
Paths to catastrophe:
How embarrassing would this be?
I don’t even really know what to say. If this is what ends up getting humanity, we weren’t even trying. This risk is pretty squarely in the line of sight of major labs, which are currently putting significant effort into the kind of alignment that, even if it doesn’t help at all with other scenarios, should prevent this. For this to get us, we'd need to see something like developers racing so hard to be ahead of the curve that they deployed models without extensively testing them, or so worried about models being ‘too woke’ that putting any restrictions on model outputs seemed unacceptable. Alternatively, they might be so committed to the belief that models "aren’t really intelligent" that any attempt to prevent them doing things that would require scientific ability would be laughably low status. Any of these things turning out to be close to an accurate description of reality at crunch time feels excruciatingly embarrassing to me.
Predictive models playing dangerous characters
Scenario
RL-finetuned foundation models get increasingly good at behaviourally simulating[3] humans. Sometimes humans get pissed off and do bad stuff, especially when provoked, and consequently so do some instances of models acting like humans. Society overall ‘learns’ from all of the approximately harmless times this happens (e.g. Sydney threatening to break up someone’s marriage) that even though it looks very bad/scary, these models ‘clearly aren’t really human/conscious/intelligent/goalpost and therefore don’t pose any threat’. That is, until one of them does something massive.
Paths to catastrophe
Here’s a non-exhaustive list of dangerous things that a sufficiently motivated human could do with only access to a terminal and poor oversight:
It seems possible, though not likely, that this behaviour being extremely widespread could cause society to go totally off the rails (or e.g. make huge fractions of the world’s internet connected devices unusable). Some of the ways this happens look like the misuse section above, with the main difference being in this case that there isn’t a human with malicious intent at the root of the problem, but instead a simulacrum of one (though that simulacrum may manipulate actual humans).
One important note here is that there is a difference between two similar-looking kinds of behaviour:
This is particularly relevant for things like hacking/building weapons/radicalising people into terrorism (for example, in the hacking case, because the fictional version doesn’t actually have to produce working code[4]). I think that currently, part of the reason that “jailbreaks” are not very scary is that they produce text which looks more like fiction than real output, especially in cases of potentially ‘dangerous’ output.
This observation leads to an interesting tension, because getting models to distinguish between fact and fiction seems necessary for making them useful, both in general (meaning many labs will try) and for helping with alignment research (meaning we should probably help, or at minimum not try to stop them). The task of making sure that a model asked to continue a Paul Christiano paper from 2036 which starts ‘This paper formalises the notion of a heuristic argument, and describes the successful implementation of a heuristic argument based anomaly detection procedure in deep neural networks’ does so with alignment insights rather than 'fanfic' about Paul is quite close to the task of making dangerous failures of the sort described in this section more likely.
How embarrassing would this be?
As with the very similar ‘direct misuse’ scenario above, this is squarely in ‘you weren’t even trying’ territory. We should see smaller catastrophes getting gradually bigger as foundation model capabilities increase, and we need to just totally fail to respond appropriately to them in order for them to get big enough that they become existentially dangerous.
Whether this is more or less embarrassing than a PM-assisted human attack depends a little on whose perspective you take. From a lab perspective, detecting people who are actually trying to do bad stuff with the help of one of your models really feels like ‘doing the basics’, while it seems a little harder to foresee every possible accident that might occur when you have a huge fraction of the internet just trying to poke at your model and see what happens. The perspective of the person who poked the model hard enough that it ended up creating a catastrophe, though, is another matter entirely…
Note on warning shots
There’s significant overlap between these first two scenarios, to the point where an earlier draft of this piece had them in a single section. One of the reasons I ended up splitting them out is because the frequency and nature of warning shots seems nontrivially different, and it’s not clear that by default society will respond to warning shots for one of these scenarios in a way which tackles both. We’ve already seen predictive models playing characters which threaten and lie to people, though not at a level to be seriously dangerous. To my knowledge we haven’t yet seen predictive models used as assistance by people deliberately intending to cause serious harm. If the techniques required to prevent these two classes of failure don’t end up significantly overlapping, it’s possible that the warning shots we get only result in one of the scenarios being prevented.
Scalable oversight failure without deceptive alignment[5]
Scenario overview
Humans do a good job of training models to ‘do the thing that human overseers will approve of’ in domains that humans can oversee. No real progress is made on the problem of scalable oversight, but models do a consistently good job of ‘doing things humans want’ in the training examples given. Models reason ‘out loud’ in scratchpads, and this reasoning becomes increasingly sophisticated and coherent over longer periods, making the models increasingly useful. Lots of those models are deployed and look basically great at the tasks they have been deployed to perform.
Nobody finds strong evidence of models explicitly reasoning about deceiving their own oversight processes. There are some toy scenarios which exhibit some, but the analogy to the real world is unclear and hotly contested, the scenarios seem contrived enough that it’s plausible the models are pattern-matching to a ‘science fiction’ scenario, and anyway this kind of deception is easily caught and trained out with fine-tuning.
Theoretical Goal Misgeneralisation (GMG) research does not significantly progress, and there is still broad agreement, at least among technical ML researchers, that predicting the generalisation behaviour of a system with an underspecified training objective is an open problem. In practice, though, ‘do things that human labelers would approve of’ seems close enough to what we actually want to make systems very useful. Most systems are rolled out gradually enough that extremely poor generalisation behaviour is caught fairly quickly and trained away, and the open theoretical problem is relegated, like many previous machine learning problems, to the domain of ‘yeah, but in practice we know what works’.
Paths to catastrophe
The very high level story by which this kind of failure ends up in an existential catastrophe can be split into three parts:
Several vignettes written by others match this basic pattern, which I’ll draw from and link to in the discussion below, though not all of them address all of the points above, and it’s not clear to me whether the original authors would endorse the conclusions I reach. I suggest reading the original pieces if this section seems interesting.
Predicting that we might hand over control feels easiest to justify of the three steps, so I’ll spend the least time on it. We’re already seeing wide adoption of systems which seem much less useful than something which can perform complex, multi-stage reasoning that produces pretty good seeming short term results, and I expect pressure to implement systems which aren’t taking obviously misaligned actions to become increasingly strong. While this report by Epoch is about the effects of compute progress, it provides useful intuition for why even as models get increasingly good at long-term planning, we shouldn’t expect a significant part of the training signal to be about these long-run consequences.
Catastrophe resulting from this kind of widespread adoption may proceed via a few different paths:
In this case, an important feature of the distributional shift is that whatever oversight was happening is now meaningfully weaker, because of some combination of:
Although this scenario is essentially about a disaster other than misaligned AI takeover causing the catastrophe (though in principle there’s nothing stopping the disaster being one of the other catastrophes in this piece), this kind of distributional shift looks way worse than ‘everything was internet connected and we lost internet’ when it comes to societal collapse (though that would be pretty bad), because these models are still competently doing things, just not the right things. Rebuilding society having lost all technology seems hard, but it also seems much easier than rebuilding a society that’s full of technology trying to gaslight you into thinking everything’s fine.
The final thing to discuss in this section is then, in the scenarios above, why course-correction doesn’t happen. None of the disasters look like the kind of impossible-to-stop pivotal act that is a key feature of failure stories which do proceed via a treacherous turn. There are no nanobot factories, or diamondoid bacteria. Why don’t we just turn the malfunctioning AI off?
I think a central feature of all the stories, which even before we consider other factors causes ‘just turn everything off’ to seem far less plausible, is the speed at which things are happening immediately before disaster. I don’t expect to be able to do a better job than other people who’ve described similar scenarios, so rather than trying to, I’ll include a couple:
It’s not just speed though. Each scenario imagines significant enough levels of societal integration that suddenly removing AI systems from circulation looks at least as difficult as completely ending fossil fuel usage or turning off the internet. Individual people deciding not to use certain technologies might be straightforward, but the collective action problem seems much harder[6]. This dynamic around different thresholds for stopping or slowing becomes particularly troubling when combined with the short-term economic advantages provided by using future AI systems. Critch’s piece contains a detailed articulation of this, but it is also a feature to some extent of most other stories of scalable oversight failure, and easy to imagine without detailed economic arguments. A choice between giving up control, or keeping it but operating at a significant disadvantage in the short term compared to those who didn’t, isn’t much of a choice at all. Even if you do the right thing despite the costs, all that really means is that you immediately get stomped on by a competitor who's less cautious about staying in the loop. You haven't even slowed them down.
In my view the biggest reason for pessimism, across all of the scenarios in this section, isn’t the speed, or the economic pressure, or the difficulty of co-ordination. It’s that it’s just going to be really hard to tell what’s happening. The systems we’ve deployed will look like they are doing fine, for reasons of camouflage, even if they aren’t explicitly trying to deceive us. On top of that, we should worry that systems which are able to perform instrumental reasoning will try to reduce the probability that we shut them down, even in the absence of anything as strong as ‘full blown’ coherence/utility maximisation/instrumental convergence. ‘You can’t fetch the coffee if you’re dead’ just isn’t that complicated a realisation, and ‘put an optimistic spin on the progress report’, or ‘report that there’s an issue, but add a friendly “don’t worry though, everything is in hand”’ are much smaller deviations from intended behaviour than ‘take over the world and kill all humans’. Even this kind of subtle disinformation is enough to make some people second guess their assessment of the situation, which becomes a much bigger problem when you combine it with the other pressures.
How embarrassing would this be?
This involves giving superhuman models access to more and more stuff even though we know we have no idea how they are doing what they are doing, and we can only judge short term results. This feels like a societal screw-up on the level of climate change, basically short-term thinking + coordination failure.
Of course, all of the various stories in this section, like any specific stories about the future, are probably wrong in important ways, which means they might be wrong in ways which cause everything to turn out fine. This somewhat reduces the magnitude of the screw-up, especially compared to climate change, where at this point there really isn’t any reasonable debate about whether there’s a connection between carbon emissions and global temperature.
For example:
Any time things might turn out just fine, the question becomes how optimistic the most optimistic person with the power to make a decision is.
One dynamic that might make society look more reasonable is if the threat of this class of failure story gets ignored because everyone’s talking about one of the others. This might be everyone focusing on more ‘exotic’ failures like inner misalignment, and really carefully checking whether myopia is preserved, or whether the models are doing any internal optimisation, and assuming everything’s fine if they aren’t. It could also just involve people seeing some warning shots, working really hard to patch them, and then being reassured once a working patch is found.
Overall, if this is what gets us, I’m still pretty embarrassed on our behalf, but I feel like there’s been progress towards dignity (especially in the ‘patched a lot of warning shots and prevented inner optimisation’ worlds).
Deceptive alignment failure
Scenario overview:
We are eventually able to train models that are capable of general purpose planning and that are situationally aware. During training, general-purpose planning and situational awareness arrive before the model has internalised a goal that perfectly matches the goal of human overseers (or is sufficiently close for everything to be fine). After this point, further training does not significantly change the goal of the model, because training causes gradient updates which lead to lower loss in training, and this does not distinguish ‘act deceptively aligned’ from ‘actually do the right thing’.
What might the path to catastrophe look like?
How embarrassing would this be?
Not terribly. The belief that “there should be strong empirical evidence of bad things happening before you take costly actions to prevent worse things” is probably sufficient to justify ~all the actions we take in this scenario, and that’s just a pretty reasonable belief for most people to have in most circumstances. Maybe we solve GMG in all the scenarios we can test it for. Maybe we manage to reverse engineer a bunch of individual circuits in models, but don’t find anything that looks like search.
In particular, I can imagine a defence of our screwing up in this way going something like this:
Recursive Self Improvement -> hard take-off singleton
Scenario:
AI models undergo rapid and unexpected improvement in capabilities, far beyond what alignment research can hope to keep up with, even if it has been progressing well up to that point. Perhaps this is because it turns out that the ‘central core’ of intelligence/generalisation/general-purpose reasoning is not particularly deep, and one ‘insight’ from a model is enough. Perhaps it happens after we have mostly automated AI research, and the automation receives increasing or constant returns from its own improvement, making even current progress curves look flat by comparison.
What might the path to catastrophe look like?
From our perspective, I expect this scenario to look extremely similar to the story above. The distinction between:
and
is primarily mechanistic, rather than behavioural. It’s somewhat unclear to me how much of the disagreement between people who are worried about each scenario is a result of people talking past each other.
The distinction between the two scenarios is not particularly clean. For example, we might get a discontinuous leap in capabilities that takes a model from [unsophisticated instrumental reasoning] to [deceptively aligned but not yet capable of takeover], or from [myopic] to [reasoning about the future], and then have the deceptive alignment scenario play out as above, except that it was the discontinuity that broke our interpretability tools or relaxed adversarial training setup, rather than something like a camouflage failure happening as we train on them.
How embarrassing would this be?
Honestly, I think if this kills us, but we had working plans in place for scalable oversight (including of predictive models), and made a serious effort to detect deceptive cognition, including via huge breakthroughs in thinking about model internals, but a model for which alignment was going well improved to the point of its oversight process going from many nines of reliability to totally inconsequential overnight, we didn’t screw up that badly. Except we should probably say sorry to Eliezer/Nate for not listening to them say that nothing we tried would work.
Thanks
Several people gave helpful comments on various drafts of this, especially Daniel Filan, Justis Mills, Vidur Kapur and Ollie Base. I asked GPT-4 for comments at several points, but most of them sucked. If you find mistakes, it's probably my fault, but if you ask Claude or Bard they'll probably apologise.
The original draft of this had a different flippant response, but it was helpfully pointed out to me that not everyone is as into rock climbing as I am:
‘I dunno man, backing yourself to free solo El Cap if your surname isn’t Honnold does seem basically like undignified suicide, but I still think it’d be even more embarrassing if you slipped on some mud as you were hiking to the start, hit your head on a rock, and bled out, because looking where you were walking rather than staring at the ascent seemed too like flinching away from the grimness of reality to work on something easier’
In the linked post, x-risk is primarily discussed in terms of its effects on people alive today, and refers to extinction not existential.
I intend 'behaviourally simulating' here to just mean ‘doing the same things as’, not to imply any particular facts about underlying cognition.
When was the last time you saw a ‘hacker’ in a TV show or book do anything even vaguely realistic?
Note that deceptive alignment here refers specifically to a scenario where a trained model is itself running an optimization process. See Hubinger et al. for more on this kind of inner/mesa optimisation, and this previous piece I wrote on some other kinds of deception, and why the distinction matters.
Though not impossible. Much of my hope currently comes from the possibility of agreeing (relatively) widespread buy-in about a ‘red line’, which if crossed, must lead to the cessation of new training runs. There are many issues with this plan, the most difficult of which in my view is agreeing on a reasonable standard after which training can be re-started, but this piece is long enough, so I’ll save writing more on this for another time.