Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort.

This post is not finished research. I'm not confident in the claims I'm making here, but I thought putting it out there for feedback would help me decide what to focus on next in the program.

Are government interventions necessary to stop the suicide race?

The zeitgeist I have picked up from the AI Safety community since I joined seems to accept as fact that Frontier AI Labs are knowingly locked in a suicidal race toward developing transformative AI, and that any solution will need to involve strong external pressure to stop them: either an international coalition imposing regulations that shift the incentives of Labs, or even more desperate measures like a pivotal act.

From this point of view, AI risk is mostly driven by game theory. The economic and personal incentives faced by the stakeholders of each Frontier AI Lab determine their actions, and they will proceed this way until AGI is developed, or until a sufficiently large external force changes the incentive landscape. Therefore, the only way to make sure Labs don't gamble the future of the world when building an AGI is to convince governments to implement policies that shift those incentives.

I now believe that this view is wrong, and that Frontier AI Labs would get out of the race if they thought the risks were sufficiently large and the consequences sufficiently dire for the world and for themselves.

Claim: If every decision maker in a Frontier AI Lab thought they were in a suicide race, and that their next development would, with near certainty, bring about the destruction of humanity, they would decide to leave the "AGI at all costs" race, no matter the actions of other actors.

Below, I present a scenario I find plausible in which a Frontier Lab decides to drop out of the race for this reason.

Scenario: The board of Alphabet decides to stop the race

Disclaimer: Handwaving a lot of details.

Situation: Frontier AI Labs are still locked in a race to be the first to develop AGI. It is widely believed that the coming generation of models might pass the threshold. We are in crunch time. A previous unsuccessful attempt by a model to take control of key infrastructure makes the possibility of X-risk clear in everyone’s mind. The team at Google DeepMind is hard at work preparing the next training run, which they believe will be the last one.

Demis Hassabis calls for a board meeting of Alphabet, where he presents the current tactical situation. All board members become convinced of the following claims:

  • We are on the brink of developing AGI. The wider industry situation is clear: public and insider information from OpenAI says they're about to launch their AGI training run. If DeepMind does not move forward, OpenAI will beat them to it, and other labs are right behind.
  • AGI still seems as much of a winner-take-all situation as it ever was.
  • DeepMind is barely controlling their currently deployed model. Engineers state that they think their safety measures have a strong chance of failing even during training. They estimate a 50% chance of the next model escaping.
  • Threat modeling experts at DeepMind predict that an escape would lead to large-scale destruction, and estimate a 50% chance that an escape of the next model would lead to the extinction of humanity within the next 5 years.

After receiving this information, the board convenes to decide how to proceed. They consider two options:

  • Stay in the race: Proceed with the training and roll the dice. Get a 50% chance of winning the race and becoming the most powerful company ever, a 25% risk of large-scale catastrophe as the model escapes and scrambles for power before being shut down, and a 25% chance of human extinction after the model successfully seizes control (see the sketch after this list for how these numbers follow from the estimates above).
  • Drop out: Allocate all resources to improving DeepMind's control and alignment techniques, and only start the training run once they are confident that the model won't cause a catastrophe, even if OpenAI might develop AGI first.
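
As a sanity check on the numbers above, here is a minimal sketch of how the 50/25/25 split falls out of the engineers' and threat modelers' estimates. The assumption that the two estimates combine as a simple conditional chain (escape, then extinction given escape) is mine, for illustration only:

```python
# Hypothetical estimates from the scenario above (not real figures).
p_escape = 0.5                    # engineers: chance the next model escapes during training
p_extinction_given_escape = 0.5   # threat modelers: chance an escape leads to human extinction

# Outcome distribution for the "stay in the race" option,
# assuming the two estimates chain as conditional probabilities.
p_win = 1 - p_escape                                        # no escape: DeepMind wins the race
p_catastrophe = p_escape * (1 - p_extinction_given_escape)  # escape, model shut down after a scramble
p_extinction = p_escape * p_extinction_given_escape         # escape leading to human extinction

print(f"Win the race:            {p_win:.0%}")           # 50%
print(f"Large-scale catastrophe: {p_catastrophe:.0%}")   # 25%
print(f"Human extinction:        {p_extinction:.0%}")    # 25%
```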

I expect that, in this situation, the board of Alphabet would decide to drop out of the race, as continuing would carry such a high probability of death for everyone, including the board members themselves.

Would they press the button?

I see three possible reasons why the board might want to proceed despite the risks:

  • Board members such as Larry Page think that an AGI killing everyone is actually not bad[1]
  • Board members are so risk-tolerant that they decide even those odds are worth it
  • Board members mentally break down and delude themselves into thinking that it will be fine and the catastrophe won't happen

I think all of those are unlikely. Am I missing some other reason why they would do it?

Why real life will be much riskier than this scenario

Even if, in this scenario, the board of Alphabet can reasonably be expected to make the right call and stop development, I expect that such a clear-cut view of the consequences will never be available in real life. Various forms of imperfect information and changes to the payoff matrix will make it less likely that labs drop out of the race before it's too late.

However, I'm interested in knowing exactly which factors prevent such an ideal scenario from happening, as this could inform my priorities for reducing AI risk. I'm specifically interested in which factors prevent decision makers from having such a complete view of the situation, and which interventions besides policy could improve those decisions.

A short list of factors which I expect to cause decision-making to be less than ideal:

  • Fog of war: Decision makers won't know what’s happening in other labs or in the rest of the world
  • No fire alarm: The moment when training the next model becomes dangerous is never clear-cut. Different Frontier Labs will have different ideas of when the point of danger is, and it only takes one overly confident Frontier Lab to cause a catastrophe.
  • Time pressure: The decision to proceed may have to be taken on somewhat short notice, preventing the collection of additional information or increasing the risk of mistakes.
  • Missing crucial consideration: A particular factor that increases the risk of a training run is not included in decision-making because of:
    • Internal miscommunication: The information was not propagated from the engineers/evaluators to the board members.
    • Insufficient knowledge: Some possible failure modes were not known and so were ignored in the risk calculations.
  • Inaccurate risk models: As AGI development will be an unprecedented event, the models used to assess the risks will never have been tested, and so may contain errors that lead to incorrect decisions.

Can risks be reduced by improving the information available to Frontier Lab decision makers?

My research at MATS will focus on exploring how improving access to decision-relevant information for Frontier Lab stakeholders, including board members and investors, could reduce the risks of AI development.

I'm still unsure which direction to explore specifically, so don't hesitate to give me feedback and ideas on this basic model and on which kinds of research questions could be impactful in this direction.

  1. ^