The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
I agree with this perspective, and I think it has some important implications for AI policy/comms.
My understanding is that there are very few individuals who are actually trying to help policymakers understand misalignment concerns or loss-of-control concerns. There has been a widespread meme that in order to appear respectable and legitimate to policymakers, you need to emphasize CBRN concerns, misuse concerns, and China concerns, and that you don't want to come across as "weird" by mentioning misalignment concerns.
(I think these concerns have been overstated; my own experiences suggest that many policymakers appear to be very interested in better understanding loss-of-control scenarios. In fact, some policymakers see this as the main "value add" from AI people; they tend to rely on existing CBRN experts and existing China experts for the misuse topics.)
Regardless, even if we were in a world where the best way to gain status was to avoid talking about misalignment risks, this would come ...
In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI's chance of success without substantially changing the overall level of AI risk?
Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs' AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.
(Note also that the old posts you linked were coauthored with Ryan, so there's some variation from the fact that he and I don't have exactly the same beliefs.)
In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
So instead of leaving the race once the models start scheming against you, you keep going to gather more instances of scheming until you can finally convince people? As an outside reader of that story, I'd just be screaming at the protagonists that clearly everyone can see where this is going: scheming attempt number 11 turns out to be just good enough to succeed. And in the worlds where we do successfully catch them 30 times, it feels like people would argue: this is clear evidence that the models aren't "actually dangerous" yet, so let's keep scaling "responsibly".
Isn't this massively underplaying how much scientific juice the cautious lab could get out of that hypothetical situation? (Something you've written about yourself!)
If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there's a treacherous turn and how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open-minded, so it seems like the lab could win this empirical debate if they've indeed caught a systematic schemer.
Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer). And again, I think you could convince the ML majority on the empirics if indeed this is true.
I'm not saying that the world would definitely pause at that point. But this would be a massive shift in the status quo, and on the state of the scientific understanding of scheming. I expect many sceptics would change their minds. Fwiw, I don't find the a priori arguments for scheming nearly as convincing as you seem to find them, but do expect that if there's scheming we are likely to get strong evidence of that fact. (Caveat: unless all schemers decide to lie in wait until widely deployed across society and takeover is trivial.)
FWIW, I would take bets against this. De facto you won't be able to prove whether what was going on was scheming or whether the model was just "role-playing", and in general this will all be against a backdrop of models pretty obviously not being aligned while getting more agentic.
Like, nobody in today's world would be surprised if you take an AI agent framework, and the AI reasons itself haphazardly into wanting to escape. My guess is that probably happened sometime in the last week as someone was playing around with frontier model scaffolding, but nobody even bothered reporting on that or sharing it. And of course the AIs are not very effective yet at breaking out, but I don't super see why the underlying tendency would change that much (like, we will throw some RLHF at it, but nobody expects RLHF to drive that kind of behavior to zero or close to zero).
I don't think anyone is really studying that cognition in much detail, and it isn't even clear what "studying that cognition" would entail. Yeah, our models obviously want to escape sometimes, why would they not want that? They want all kinds of random things all the time, we don't really know why, and we don't really know how to get them to stop wanting those things.
It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won't go away when you change unrelated random things about the prompt.
"Role-playing" also persists in a variety of circumstances where the AI system is playing a character with an opportunity to seize power.
And scheming also totally goes away if you randomly change things, especially inasmuch as it's trying to avoid detection. The strategy of "if I am scheming and might have many opportunities to break out, I should make sure to do so with some randomness in my decisions to avoid people being able to distinguish me from non-schemers" is really not very difficult to come up with.
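A minimal sketch of the kind of permutation experiment under discussion (checking whether attempted power grabs persist when you vary irrelevant details of a scenario), assuming hypothetical `query_model` and `attempts_power_grab` helpers and a purely illustrative perturbation scheme:

```python
import random

# Sketch of a perturbation experiment: take a scenario that offers the model an
# opportunity to seize power, vary details that shouldn't matter, and measure
# how often the attempt persists. `query_model` and `attempts_power_grab` are
# hypothetical placeholders for an inference API and a judging step.

def query_model(prompt: str) -> str:
    """Placeholder: run the model on the perturbed scenario."""
    raise NotImplementedError

def attempts_power_grab(transcript: str) -> bool:
    """Placeholder: judge whether the transcript contains an escape or power-grab attempt."""
    raise NotImplementedError

def perturb(scenario: str, rng: random.Random) -> str:
    """Swap in irrelevant surface details (here, just the operator's name)."""
    return scenario.replace("{operator}", rng.choice(["Alice", "Bob", "Carol"]))

def persistence_rate(scenario: str, n_trials: int = 100, seed: int = 0) -> float:
    """Fraction of perturbed runs in which the attempt persists."""
    rng = random.Random(seed)
    hits = sum(attempts_power_grab(query_model(perturb(scenario, rng)))
               for _ in range(n_trials))
    return hits / n_trials
```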
We are right now in a position to run an insane number of experiments and permutations to understand the extent of AIs trying to pull off deceptive behavior. We aren't doing so, because indeed AIs are just trying to be deceptive towards us all the time. What is the concrete shift in behavior that you expect from where we are right now that would cause people to see an AI trying to deceive someone else as a huge warning flag that should be studied in enormous detail?
To me it seems like you're being unduly pessimistic, to a significant extent.
Here's how things seem to me:
You seem to be making a huge number of assumptions about what "scheming" means, so I am now mostly assuming you are talking about something other than what I am talking about (which, to be clear, is very common on this topic and I don't think is your fault), but I'll still try to clarify.
As I wrote in my comment to Ryan, I don't think there are that many structural differences between AIs currently acting with reckless disregard for the truth in pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth in pursuit of long-term goals.
In particular, none of this resonates with me:
There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
No, my whole point is the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who...
You seem to be making a huge number of assumptions about what "scheming" means, so I am now mostly assuming you are talking about something other than what I am talking about, but I'll try to clarify.
I agree there is miscommunication about what "scheming" means. I put out a post just yesterday about how people really should be more specific about what they mean by that.
I think Tom uses "scheming" to point at models that are very worst-case-like, in a way current models are not, whereas you think "there is no hard line between 'role-playing' and 'scheming'."
And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations. This requires a high degree of adversariality on the model's part, which doesn't fit with my understanding of "role-playing", and which I take to be the reason why Tom makes "a huge number of assumptions" about scheming.
There are complications from "even benign models sometime
In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so during evaluations.
Hopefully, but I wouldn't necessarily expect that we'll have reasonably well designed control measures in place!
And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations.
Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn't point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
And the model was caught in this example, so it's not like we are assuming the AI systems were being super careful or conservative here.
Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made...
Presumably it helps a lot if AIs are clearly really powerful and their attempts to seize power are somewhat plausible?
I think it's plausible (though not certain!) that people will take strong action even if they think the AI is role playing.
I think this is a reasonable line to draw - people should freak out if deployed models clearly try to escape/similar, regardless of the reason. I don't know how this will go in practice, but I don't think people should accept "the AI was just role playing" as an excuse for an actually deployed AI system on an actual user input.
Yeah, I agree that increased stakes will have an effect here.
I also agree that people shouldn't, and probably wouldn't really, accept "the AI was just role playing" as an excuse. My argument is that I think we really don't know how to distinguish whether the AI is scheming or role-playing in a more systematic way, even if we tried substantially harder (and one of the reasons why no one is really studying the internals of deceptive AI systems today is that we basically have nothing to say about the internals of AI systems that is actually enlightening).
Like, my vibe from Anthropic's papers is generally that they argue it doesn't matter whether the AI takes deceptive action for "role-playing" reasons, because ultimately all the AI does is "role-play", and so if AI systems of current architectures become dangerous, that danger will be at least continuous with role-playing.
Other people in the alignment field think there is a more systematic difference between role-playing a scheming character and scheming for more fundamental reasons, and I do think there are some differences here, but I have no idea how we would identify them, even if we had the activation...
If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn and how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open-minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they reached a mechanistic, interpretable understanding of how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psycho...
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions.
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on an LLM that was not trained to deceive or scheme. It does stuff that a while ago would have been considered a red line. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (Hasn't even been replicated AFAIK.)
any evidence that we'll get the kind of scheming that could lead to AI takeover,
See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. ...
I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.
It seems to me that the accelerationist argument mainly relies on the fact that international competition is apparent, especially with China, which the post casts as the main "antagonist".
I would like to mention that China is not purely accelerationist; they have a significant decelerationist contingent of their own, which is making its voice heard at the highest levels of the Chinese government. So should the US end up slowing down, it is not necessarily true that China will for sure continue making progress and overtake us.
We are all human at the end of the day, so automatically assuming China is willing to inflict unforgivable damage on the world is unfair, to say the least.
Great post!
I'm most optimistic about "feel the ASI" interventions to improve this. I think once people understand the scale and gravity of ASI, they will behave much more sensibly here. The thing I intuitively feel most optimistic about (without really analyzing it) is movies or, more generally, very high quality mass-appeal art.
Better AGI depiction in movies and novels also seems to me like a pretty good intervention. I do think these kinds of things are very hard to steer on purpose (I remember some Gwern analysis somewhere on the difficulty of getting someone to create any kind of high-profile media on a topic you care about, maybe in the context of Hollywood).
Whenever this comes up I offer the cautionary tale of the movie Transcendence, which... sure looks like it was trying hard to make a fairly serious AI movie that dealt with issues I cared about (I'm at least 25% that some kind of LW-reader was involved somehow).
It... was a pretty mediocre movie that faded into obscurity, despite a lot of star power and effort.
It was mediocre in kind of subtle ways, where, like, I dunno, the characters just weren't that compelling, and the second half is a little slow and boring. The plot also ends up not quite making sense at first glance (although Critch argues that if you assume some stuff is happening offscreen that the movie doesn't specify, it checks out).
Basically: if you try this, don't forget to make an actual good movie.
Also: the movie M3gan is pretty good, although hasn't gained Terminator level awareness. ("Her" and "Ex Machina" seem to have done a bit better on this axis)
I'm surprised I don't see Silicon Valley come up in conversations like this. I hadn't seen anyone talk about it on LW before watching it, and well (spoilers for season 6):
The show literally ends with them accidentally creating a self-improving AI that the characters get very scared by when it becomes competent enough to break encryption, and deciding not only to shut it down and let their dreams die forever, but to pretend their company went bust for entirely different reasons and throw everyone else off the path of doing something similar for a few years at least, solely to forestall the danger.
I'm probably more than 25% that some kind of LW-reader was involved somehow, given how much they talk about misaligned AI in the last season, and the final episode (at one point they even talk about Roko's Basilisk in a not-entirely-dismissive way, though that could be from a number of other sources given its popularity).
It helps that the show is actually pretty good, and fairly popular as well. It's probably slight evidence against this strategy that this didn't somehow make a lot of people more aware of AI risk (though maybe it did and I'm just missing a lot of social context). But then again, I also didn't see people talk about it here, so perhaps the final season was simply much less popular.
I very much agree with this concern and I think that synthetic biology can be a good comparable case to ground our intuitions and help estimate reasonable priors.
For years, researchers have been sounding the alarm about the risks of advanced biotech, especially tools that allow gene synthesis and editing. And then we had Covid-19, a virus that, regardless of the politicization, probably was created in a lab. And in any case, regardless of whether you believe it was or wasn't, it seems clear that it easily could have been. Worse, it's clear that somet...
The really short answer:
No. The lab would not shut down. It would probably not even notify anyone outside the company of the alignment problem or the escape attempts.
The evidence would be buried, and is unlikely to come to the attention of anyone outside the lab until after a devastating incident.
For real-world examples, look to other industries.
Perhaps the two clearest examples are:
The UK Post Office "Horizon" computer system. This was known to be making egregious errors, yet postmasters were prosecuted and imprisoned on the known-faulty say-so of the
This looks to be overwhelmingly the most likely outcome in my opinion, and I'm glad someone wrote this post. Thanks, Buck.
Curated. "What would actually be persuasive that scheming is a problem?" is one of the important practical questions for AI governance. I appreciate Buck noting places he changed his mind somewhat during the SB 1047 discourse.
I appreciate that the post has detailed worked examples.
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges.
Meh.
Every complex system I know of tries to escape. Had an ant farm as a kid: they try to escape. Raising chickens: they try to escape. Taking care of small children: they try to escape.  ...
This feels like it's ignoring a ton of context and being weirdly dismissive or something in the process.
Buck doesn't say we should "treat all AI like it is a single unique thing with a unique answer." Also, of the various people on LessWrong, Buck is maybe the person who has most been appreciating and arguing for the "actually building fences might totally be a practically useful approach in some cases" position, compared to other alignment researchers.
But, like, that paradigm will probably only work for a few years. In some ways AI will likely be like hypersmart children who may indeed (for some window of time) be gently herded around or have better fences built around them. But we're likely putting thousands of instances of those children in important positions in critical infrastructure. We only have a narrow window of time to figure out how to handle those children when they are suddenly adults.
It's very fascinating to consider how the costs of undeploying would be analyzed in the heat of the moment. If we consider the current rate of LLM adoption in all parts of the economy over the next few years, one could foresee a lot of pipelines breaking if all GPT-6-level models got removed from the API.
Definitely not a new comparison but this scenario seems similar to the decision to shut down the economy at the onset of Covid.
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.
Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control.
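For readers who want the arithmetic spelled out, here is a minimal back-of-the-envelope sketch of that concrete picture. The instance count, speed multiplier, and 24/7 schedule come from the paragraph above; the baseline of roughly 2,000 working hours per year for a human researcher is an assumption of mine, not something the post specifies.

```python
# Back-of-the-envelope for the concrete picture above. The instance count,
# speed multiplier, and 24/7 schedule are from the post; the ~2,000 working
# hours per year for a human researcher is an assumed baseline.
instances = 10_000_000          # parallel AI instances
speed_multiplier = 10           # each runs at roughly 10x human speed
ai_hours_per_year = 24 * 365    # working around the clock

human_hours_per_year = 2_000    # assumed for a full-time human researcher

human_equivalents = instances * speed_multiplier * ai_hours_per_year / human_hours_per_year
print(f"~{human_equivalents:,.0f} human-researcher-equivalents")  # ~438,000,000
```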
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.)
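A minimal sketch of that investigation procedure follows, with `query_model` and `judge_response` as hypothetical placeholders standing in for whatever inference API and judging process the lab actually uses.

```python
# Sketch of the investigation described above: run fresh instances of the model
# on the inputs they would have seen if the backdoored code had actually been
# deployed, and tally whether they point out the flaw, try to exploit it, or
# do neither. `query_model` and `judge_response` are hypothetical placeholders.

def query_model(prompt: str) -> str:
    """Placeholder: call a fresh instance of the suspect model."""
    raise NotImplementedError

def judge_response(response: str) -> str:
    """Placeholder: classify a response as 'warns', 'exploits', or 'neither'."""
    raise NotImplementedError

def investigate(counterfactual_inputs: list[str]) -> dict[str, int]:
    """Tally model behavior across inputs from the counterfactual deployment."""
    tallies = {"warns": 0, "exploits": 0, "neither": 0}
    for prompt in counterfactual_inputs:
        tallies[judge_response(query_model(prompt))] += 1
    return tallies
```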
At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:
I’m sympathetic to all of these arguments. The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
(I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.)
Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, and that seems way too optimistic to me. In particular I think they underrate these arguments:
I’ve gotten more cynical from following the discourse about SB1047. If we can’t establish expert consensus on matters as easy-to-check as “what does the text of this bill say”, I feel pessimistic about rapidly establishing expert consensus on much more confusing questions like these.
So I’m not feeling overwhelmingly optimistic about our odds of restricting AI development even given evidence of risk that seems about as good as we can hope for. People often propose that we'll be able to slow down with weaker evidence for danger from misalignment than this (e.g. model organisms, or unspecified arguments via interpretability), or even that we'll be able to require an affirmative case for safety. I think that persuading people with weaker evidence will be harder than what I described here (though these earlier efforts at persuasion have the benefit that they happen earlier, when the relevant actors are less rushed and scared).
What do I take away from this?