An example which I'm not sure is likely but I think would provide an adequate warning shot is a model being caught clearly self-exfiltrating its weights. (This might happen too late, such that by the time a model could do this, it would have already caused catastrophe in another way.) There are versions of this that look more benign and so wouldn't be adequate warning shots; but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only explained by the model being misaligned and attempting to escape.
Here are some examples:
I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot. These warning shots demonstrate that the AI has impressive technical capabilities and is also clearly misaligned.
There are weaker versions of these that would probably not provide great warning shots. For example: an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI; or when sufficiently prompted, the AI starts openly discussing plans about how to escape.
I like these examples. One thing I'll note, however, is that I think the "warning shot discourse" on LW tends to focus on warning shots that would be convincing to a LW-style audience.
If the theory of change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI x-risk arguments) to become concerned, this makes sense.
But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.
[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].
I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.
LW-style warning shots tend to be more, for lack of a better term, rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of cop...
I'm definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.
I think if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. I think all of these assumptions I stated as conditions are pretty plausible.
There are weaker versions of these that would probably not provide great warning shots.
I think for some of these cases, it should be possible to heavily study what happened and then determine whether this was actually due to very problematic behavior. You can try to determine what features of the input were important, etc. You can even set up sting operations or tripwires based on what you learn.
E.g., in the vulnerability case, you can pretend the vulnerability actually made it into the code base while in fact using it as a tripwire: when the AI tries to take advantage of it, we catch it then.
More generally, you can always try to play out more steps of a given egregious action which failed.
I'm hopeful that a sufficiently convincing demo could convince politicians/military brass/wealthy powerful people/the public. Probably different demos could be designed to be persuasive to these different audiences. Ideally, the demos could be designed early, and you could get buy-in from the target audience that if the described demo were successful, then they would agree that "something needed to be done". Even better would be concrete commitments, but I think there's value even without that. It also seems good to be as prepared as possible to act on a range of plausible natural warning shots, and to get similar pre-negotiated agreements that if X happened, it would be considered a tipping point for taking action.
It is also possible that the scope of evangelists would need to be sufficient to convince people who matter. Some people who can make decisions might listen to someone with an exotic-sounding PhD from Berkeley. Others who matter might not. Just as an example, I think some politicians and wealthy, powerful types may be more willing to listen to engineers than to mathematicians or pure theoreticians. And a normal engineer might also carry more clout than someone from such exotica as Silicon Valley communities where people are into open re...
Deleted paragraph from the post, that might answer the question:
Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]. There is a lot of inertia; we are not even sure this kind of “strong” warning shot would happen, and I suspect this kind of big warning shot could happen beyond the point of no return, because this type of warning shot requires autonomous replication and adaptation abilities in the wild.
It may be because they expect a strong public reaction. But even if there were a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did governments prepare for the next COVID?
A 10-year global pause would allow for a lot of person-year-equivalents of automated AI safety R&D. E.g., from Some thoughts on automating alignment research (under some assumptions mentioned in the post): 'each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' And for different assumptions the numbers could be [much] larger still: 'For a model trained with 1000x the compute, over the course of 4 rather than 12 months, you could run 100x as many models in parallel.[9] You’d have 1.5 million ...
There is very little hope here IMO. The basic problem is that people have a false confidence in measures to render a powerful AI safe (or in explanations as to why the AI will turn out safe even if no one intervenes to make it safe). Although the warning shot might convince some people to switch from one source of false hope to a different source, it will not materially increase the number of people strongly committed to stopping AI research, all of whom have somehow come to doubt all of the many dozens of schemes published so far for rendering powerful AI safe (and the many explanations for why the AI will turn out safe even if we don't have a good plan for ensuring its safety).
There's also the case of harmful warning shots: for example, if it turns out that, upon seeing an AI do a scary but impressive thing, enough people/orgs/states go "woah, AI is powerful, I should make one!" or "I guess we're doomed anyways, might as well stop thinking about safety and just enjoy making profit with AI while we're still alive", to offset the positive effect. This is totally the kind of thing that could be the case in our civilization.
I agree; that's an important point. I probably worry more about your first possibility, as we are already seeing this effect today, and less about the second, which would require a level of resignation that I've rarely seen. Responsible entities would likely try to do something about it, but the ways this “we're doomed, let's profit” might happen are:
Another case of a harmful warning shot is one where the lesson learnt from it is “we need stronger AI systems to prevent this”. This probably goes hand in hand with poor credit assignment.
Agreed.
I would love to see more thinking about this.
We've already seen one moment dramatically change the strategic landscape: ChatGPT.
Yet this shift could turn out to be small compared to the one a disaster would cause.
For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late.
Very much a scenario to avoid, but I'm skeptical it 'would be probably too late' (especially if I assume humans are aware of the data center takeover); see e.g. from What if self-exfiltration succeeds?:
'Most likely the model won’t be able to compete on making more capable LLMs, so its capabilities will become stale over time and thus it will lose relative influence. Competing on the state of the art of LLMs is quite hard: the model would need to get access to a sufficiently large number of GPUs and it would need to have world-class machine learning skills. It would also mean that recursive self-improvement is already possible and could be done by the original model owner (as long as they have sufficient alignment techniques). The model could try fine-tuning itself to be smarter, but it’s not clear how to do this and the model would need to worry about currently unsolved alignment problems.'
There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.
There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and "extinction." We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.
We will need a very convincing warning shot to change civilization's trajectory. Let's define a "convincing warning shot" as "more than 50% of policy-makers want to stop AI development."
What could be examples of convincing warning shots?
For example, a researcher I've been talking to, when asked what they would need to see to update, answered, "An AI takes control of a data center." That would probably be too late.
“That’s only one researcher,” you might say?
This study from Tetlock brought together participants who disagreed about AI risks. The strongest crux exhibited in the study was whether an evaluation group would find an AI with the ability to autonomously replicate and avoid shutdown. On that evidence, the skeptics would update their P(doom) from 0.1% to 1.0%. But 1% is still not much… Would this be enough for researchers to trigger the fire alarm with a single voice?
More generally, I think studying more “warning shot theory” may be crucial for AI safety:
Additionally, thinking more about the points of no return and the shape of the event horizon seems valuable:
There is already some literature on warning shots, but not much, and this seems neglected, important, and tractable.
We’ll probably get between 0 and 10 shots; let’s not waste them.
(I wrote this post, but don't have the availability to work on this topic. I just want to raise awareness about it. If you want to make warning shot theory your agenda, do it.)
An inspiration might be this post-mortem on Three Mile Island.