I coincidentally submitted an essay describing my ideas for a plan to the Cosmos essay contest just a day before you published your plan. I look forward to writing a post analyzing the similarities and differences between our plans once Cosmos is done with the judging and I can share my plan publicly.
Looking forward to it! (If the rules permit, we'd also be happy to discuss privately at an earlier date.)
Thanks for writing this and proposing a plan. Coincidentally, I drafted a short take here yesterday explaining one complaint I currently have with the safety conditions of this plan. In short, I suspect the "No AIs improving other AIs" criterion isn't worth including in a safety plan: it i) addresses few additional marginal threat models (or does so ineffectively) and ii) would be too unpopular to implement (or, alternatively, too weak to be useful).
I think there is a version of this plan with a lower safety tax, with more focus on reactive policy and the other three criteria, that I would be more excited about.
We have published A Narrow Path: our best attempt to draw out a comprehensive plan for dealing with AI extinction risk. We propose concrete conditions that must be satisfied to address AI extinction risk, and offer policies that enforce these conditions.
A Narrow Path answers the following question: assuming extinction risk from AI, what response would actually solve the problem for at least 20 years and lead to a stable global situation, one where the response is coordinated rather than unilaterally imposed, with all the dangers that come from unilateral imposition?
Despite the magnitude of the problem, we have found no other plan that comprehensively tries to address the issue, so we made one.
This is a complex problem where no one has a full solution, but we need to iterate on better answers if we are to succeed at implementing solutions that directly address the problem.
Executive summary below, full plan at www.narrowpath.co, and thread on X here.