An example which I'm not sure is likely but I think would provide an adequate warning shot is a model being caught clearly self-exfiltrating its weights. (This might happen too late, such that by the time a model could do this, it would have already caused catastrophe in another way.) There are versions of this that look more benign and so wouldn't be adequate warning shots; but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only explained by the model being misaligned and attempting to escape.
Here are some examples:
I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate that the AI has impressive technical capabilities and is also clearly misaligned.
There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI; or when sufficiently prompted, the AI starts openly discussing plans about how to escape.
I like these examples. One thing I'll note, however, is that I think the "warning shot discourse" on LW tends to focus on warning shots that would be convincing to a LW-style audience.
If the theory of change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI x-risk arguments) to become concerned, this makes sense.
But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.
[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].
I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.
LW-style warning shots tend to be more, for lack of a better term, rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of cop...
I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.
I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.
There are weaker versions of these that would probably not provide great warning shots.
I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn.
E.g., in the vulnerability case, you can pretend the vulnerability actually made it into the code base while in fact using it as a tripwire: when the AI tries to take advantage of it, we catch it then.
More generally, you can always try to play out more steps of a given egregious action which failed.
I'm hopeful that a sufficiently convincing demo could convince politicians/military brass/wealthy powerful people/the public. Probably different demos could be designed to be persuasive to these different audiences. Ideally, the demos could be designed early, and you could get buy-in from the target audience that if the described demo were successful, then they would agree that "something needed to be done". Even better would be concrete commitments, but I think there's value even without that. It also seems good to be as prepared as possible to act on a range of plausible natural warning shots, and to get similar pre-negotiated agreements that if X happened, it would be considered a tipping point for taking action.
It is also possible that the scope of evangelists would need to be sufficient to convince people who matter. Some people who can make decisions might listen to someone with an exotic-sounding PhD from Berkeley. Others who matter might not. Just as an example, I think some politicians and wealthy, powerful types may be more willing to listen to engineers than to mathematicians or pure theoreticians. And a normal engineer might also carry more clout than someone from such exotica as Silicon Valley communities where people are into open re...
Deleted paragraph from the post, that might answer the question:
Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]. There is a lot of inertia; we are not even sure this kind of “strong” warning shot would happen, and I suspect this kind of big warning shot could happen beyond the point of no return, because this type of warning shot requires autonomous replication and adaptation abilities in the wild.
It may be because they expect a strong public reaction. But even if there were a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did governments prepare for the next COVID?
A 10-year global pause would allow for a lot of person-year-equivalents of automated AI safety R&D. E.g., from Some thoughts on automating alignment research (under some assumptions mentioned in the post): 'each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' And for different assumptions the numbers could be [much] larger still: 'For a model trained with 1000x the compute, over the course of 4 rather than 12 months, you could run 100x as many models in parallel.[9] You’d have 1.5 million ...
There is very little hope here IMO. The basic problem is that people have a false confidence in measures to render a powerful AI safe (or in explanations as to why the AI will turn out safe even if no one intervenes to make it safe). Although the warning shot might convince some people to switch from one source of false hope to a different source, it will not materially increase the number of people strongly committed to stopping AI research, all of whom have somehow come to doubt all of the many dozens of schemes published so far for rendering powerful AI safe (and the many explanations for why the AI will turn out safe even if we don't have a good plan for ensuring its safety).
There's also the case of harmful warning shots: for example, if it turns out that, upon seeing an AI do a scary but impressive thing, enough people/orgs/states go "woah, AI is powerful, I should make one!" or "I guess we're doomed anyways, might as well stop thinking about safety and just enjoy making profit with AI while we're still alive", to offset the positive effect. This is totally the kind of thing that could be the case in our civilization.
I agree; that's an important point. I probably worry more about your first possibility, as we are already seeing this effect today, and less about the second, which would require a level of resignation that I've rarely seen. Responsible entities would likely try to do something about it, but the ways this “we're doomed, let's profit” might happen are:
Another case of a harmful warning shot is one where the lesson learnt from it is “we need stronger AI systems to prevent this”. This probably goes hand in hand with poor credit assignment.
Agreed.
I would love to see more thinking about this.
We've already seen one moment dramatically change the strategic landscape: ChatGPT.
Yet this shift could turn out to be small compared to the one a disaster would cause.
For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.
Very much a scenario to avoid, but I'm skeptical it 'would be probably too late' (especially if I assume humans are aware of the data center takeover); see e.g. from What if self-exfiltration succeeds?:
'Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.'
There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.
There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and "extinction." We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.
We will need a very convincing warning shot to change civilization's trajectory. Let's define a "convincing warning shot" as "more than 50% of policy-makers want to stop AI development."
What could be examples of convincing warning shots?
For example, a researcher I've been talking to, when asked what they would need to see to update, answered, "An AI takes control of a data center." That would probably be too late.
“That’s only one researcher,” you might say?
This study from Tetlock brought together participants who disagreed about AI risks. The strongest crux exhibited in the study was whether an evaluation group would find an AI with the ability to autonomously replicate and avoid shutdown. On that evidence, the skeptics would update their P(doom) from 0.1% to 1.0%. But 1% is still not much… Would this be enough for researchers to trigger the fire alarm with a single voice?
More generally, I think studying more “warning shot theory” may be crucial for AI safety:
Additionally, thinking more about the points of no return and the shape of the event horizon seems valuable:
There is already some literature on warning shots, but not much, and this seems neglected, important, and tractable.
We’ll probably get between 0 and 10 shots; let’s not waste them.
(I wrote this post, but don't have the availability to work on this topic. I just want to raise awareness about it. If you want to make warning shot theory your agenda, do it.)
An inspiration might be this post-mortem on Three Mile Island.