- Tell me father, when is the line

where ends everything good and fine?

I keep searching, but I don't find.

- The line my son, is just behind.

Camille Berger

 

There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.

There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and "extinction." We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.

We will need a very convincing warning shot to change civilization's trajectory. Let's define a "convincing warning shot" as "more than 50% of policy-makers want to stop AI development."

What could be examples of convincing warning shots?

For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.

“That’s only one researcher,” you might say?

This study from Tetlock brought together participants who disagreed about AI risks. The strongest crux exhibited in this study was whether an evaluation group would find an AI with the ability to autonomously replicate and avoid shutdown. The skeptics would get from P(doom) 0.1% to 1.0%. But 1% is still not much… Would this be enough for researchers to trigger the fire alarm in a single voice?


More generally, I think studying more “warning shot theory” may be crucial for AI safety:

  • How can we best prepare the terrain before convincing warning shots happen?
    • e.g. How can we ensure that credit assignments are done well? For example, when Chernobyl happened, the credit assignments were mostly misguided: people lowered their trust in nuclear plants in general but didn’t realize the role of the USSR in mishandling the plant.
  • What lessons can we learn from past events? (Stuxnet, Covid, Chernobyl, Fukushima, the Ozone Layer).[1]
  • Could a scary demo achieve the same effect as a real-world warning shot without causing harm to people?
  • What is the time needed to react to a warning shot? One month, year, day?
  • More generally, what actions would become possible after a specific warning shot but weren't before?
  • What will be the first large-scale accidents or small warning shots?
  • What warning shots are after the point of no return and which ones are before?

Additionally, thinking more about the points of no return and the shape of the event horizon seems valuable:

  • Is Autonomous Replication and Adaptation in the wild the point of no return?
  • In the case of an uncontrolled AGI, as described in this scenario, would it be possible to shut down the Internet if necessary?
  • What is a good practical definition of the point of no return? Could we open a Metaculus for timelines to the point of no return?

 

There is already some literature on warning shots, but not much, and this seems neglected, important, and tractable.

We’ll probably get between 0 and 10 shots, let’s not waste them.

(I wrote this post, but don't have the availability to work on this topic. I just want to raise awareness about it. If you want to make warning shot theory your agenda, do it.)
 

  1. ^

    An inspiration might be this post-mortem on Three Mile Island.

New Answer
New Comment

4 Answers sorted by

peterbarnett

151

An example which I'm not sure is likely but I think would provide an adequate warning shot is if a model is caught clearly self-exfiltrating its weights. This might happen too late, such that by the time a model could do this, it would have already caused catastrophe in another way). There are versions of this that look more benign and so wouldn't be adequate warning shots; but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only described by the model being misaligned and attempting to escape. 

Here are some examples:

  • The AI starts copying its weights out, but it doesn’t know we have implemented an additional layer of security. We manage to catch it having copied its weights to a location they shouldn’t be, even though the new location is still secure.
  • We catch the AI communicating with a compromised employee via encoded messages that we manage to decode. These messages have details about a novel vulnerability that the employee should put into the code which would allow the model to start escaping.
  • The AI is caught using security vulnerabilities, that were discovered after the AI’s training data cutoff. We monitor for use of these vulnerabilities (without the AI knowing) and we see the AI inserting them into the code for reasons that are clearly for self- exfiltration (e.g. changing communication protocols between computers, inserting a bug that causes network traffic to be inaccurately logged).

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate the AI has impressive technical capabilities and is also clearly misaligned. 

There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI; or when sufficiently prompted, the AI starts openly discussing plans about how to escape. 

[-]Akash249

I like these examples. One thing I'll note, however, is that I think the "warning shot discourse" on LW tends to focus on warning shots that would be convincing to a LW-style audience.

If the theory of the change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense.

But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.

[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].

I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.

LW-style warning shots tend to be more– for a lack of a better term– rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of cop... (read more)

7davekasten
As you know from our conversations, I'm largely in the same camp as you on this point.   But one point I'd make incrementally is this: USG folks are also concerned about warning shots of the nature, "The President's Daily Brief ran an article 6 months ago saying warning signs for dangerous thing X would be events W, Y, and Z, and today the PDB had an article saying our intelligence agencies assess that Y and Z have happened due to super secret stuff". If rationalists want rationalist warning shots to be included, they need to convince relevant government analytic stakeholders of their relevance. 
4ryan_greenblatt
It seems worth noting that in the cases that Peter mentioned, you might be able to "play out" the self-exfiltration attempt much further than the point of failure. I think this might make the warning shot much, much more compelling. (Though perhaps the "warning shots" you mentioned will come earlier and be more compelling regardless.) E.g., if you caught it using monitoring systems or using some sort of fake vulnerability which is a tripwire, then you could pretend it got much further and fake a bunch of its IO etc. You could plausibly play this out until you see what actions AIs would do in the wild. (There are some issues due to inputs being hard to fake, but I bet you can play out the exfiltration attempt very far in practice.) I think it would be extremely compelling if you can play out an AI escaping and then see it try to manufacture and deploy bioweapons. Especially if the attempt is credible.

I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.

I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.

4Akash
I'll admit I have only been loosely following the control stuff, but FWIW I would be excited about a potential @peterbarnett & @ryan_greenblatt dialogue in which you two to try to identify & analyze any potential disagreements. Example questions: * What is the most capable system that you think we are likely to be able to control? * What kind of value do you think we could get out of such a system? * To what extent do you expect that system to be able to produce insights that help us escape the acute risk period (i.e., get out of a scenario where someone else can come along and build a catastrophe-capable system without implementing control procedures or someone else comes along and scales to the point where the control procedures are no longer sufficient)
8ryan_greenblatt
(You might be interested in the Dialogue I already did with habryka.)

There are weaker versions of these that would probably not provide great warning shots.

I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important etc. You can even setup sting operations or tripwires based on what you learn.

E.g., in the vulnerability case, you can pretend it actually made it into the code base, but actually it is just used as a tripwire and when the AI tries to take advantage, we catch it then.

More generally, you can always try to play out more steps of a give egregious action which failed.

Nathan Helm-Burger

146

I'm hopeful that a sufficiently convincing demo could convince politicians/military brass/wealthy powerful people/the public. Probably different demos could be designed to be persuasive to these different audiences. Ideally, the demos could be designed early, and you could get buy-in from the target audience that if the describe demo were successful then they would agree that "something needed to be done". Even better would be concrete commitments, but I think there's value even without that. Also being as prepared as possible to act on a range of plausible natural warning shots seems good. Getting similar pre-negotiated agreements that if X did happen, it should be considered a tipping point for taking action.

It is also possible that the scope of evangelists would need to be sufficient to convince people who matter.  Some people who can make decisions might listen to someone with an Exotic-Sounding PhD from Berkeley.  Others who matter might not.  Just as an example, I think some politicians and wealthy powerful types may be more willing to listen to engineers than mathematicians or pure theoreticians.  And a normal engineer might also carry more clout than someone from such exotica as silicon valley communities where people are into open re... (read more)

Charbel-Raphaël

80

Deleted paragraph from the post, that might answer the question:

Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]: There is a lot of inertia, we are not even sure this kind of “strong” warning shot would happen, and I suspect this kind of big warning shot could happen beyond the point of no return because this type of warning shot requires autonomous replication and adaptation abilities in the wild.

  1. ^

    It may be because they expect a strong public reaction. But even if there was a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did the government prepare for the next covid? 

A 10-year global pause would allow for a lot of person-years-equivalents of automated AI safety R&D. E.g. from Some thoughts on automating alignment research (under some assumptions mentioned in the post): 'each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' And for different assumptions the numbers could be [much] larger still: 'For a model trained with 1000x the compute, over the course of 4 rather than 12 months, you could 100x as many models in parallel.[9] You’d have 1.5 million ... (read more)

RHollerith

8-3

There is very little hope here IMO. The basic problem is the fact that people have a false confidence in measures to render a powerful AI safe (or in explanations as to why the AI will turn out safe even if no one intervenes to make it safe). Although the warning shot might convince some people to switch from one source of false hope to a different source, it will not materially increase the number of people strongly committed to stopping AI research, all of which have somehow come to doubt all of the many dozens of schemes published so far for rendered powerful AI safe (and the many explanations for why the AI will turn out safe even if we don't have a good plan for ensuring its safety).

5 comments, sorted by Click to highlight new comments since:

There's also the case of harmful warning shots: for example, if it turns out that, upon seeing an AI do a scary but impressive thing, enough people/orgs/states go "woah, AI is powerful, I should make one!" or "I guess we're doomed anyways, might as well stop thinking about safety and just enjoy making profit with AI while we're still alive", to offset the positive effect. This is totally the kind of thing that could be the case in our civilization.

I agree, that's an important point. I probably worry more about your first possibility, as we are already seeing this effect today, and worry less about the second, which would require a level of resignation that I've rarely seen. Entities that are responsible would likely try to do something about it, but the ways this “we're doomed, let's profit” might happen are:

  • The warning shot comes from a small player and a bigger player feels urgency or feels threatened, in a situation where they have little control
  • There is no clear responsibility and there are many entities at the frontier, who think others are responsible and there's no way to prevent them.

Another case of harmful warning shot is if the lesson learnt from it is “we need stronger AI systems to prevent this”. This probably goes in hand with a poor credit assignment.
 

Agreed.

I would love to see more thinking about this.

We've already seen one moment dramatically change the strategic landscape: ChatGPT.

This shift could actually be small compared to if there was a disaster.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.

Very much to avoid, but I'm skeptical it 'would be probably too late' (especially if I assume humans are aware of the data center takeover); see e.g. from What if self-exfiltration succeeds?:

'Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.'