Review

Epistemic status: Trying to synthesize my thoughts as someone without any private information about evals. I expect this to be wrong in several ways, lacking detail, and missing a lot of context on specific things that AGI labs and independent teams are thinking about. I post it in the hopes that it will be useful for sparking discussion on high-level points & acquiring clarity. 

Inspired by the style of What 2026 Looks Like, my goal was to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage. The methodology was roughly: 

  • Condition on a future in which evals counterfactually avoid an existential catastrophe. 
  • Imagine that this existential catastrophe is avoided in 2027 (why 2027? Partially for simplicity, partially because I find short timelines plausible, partially because I think it makes sense to focus our actions on timelines that are shorter than our actual timeline estimates, and partially because it gets increasingly hard to tell realistic stories the further you get from the present). 
  • Write a somewhat-realistic story in which evals help us avoid the existential catastrophe. 
  • Look through the story to find parts that seem the most overly-optimistic, and brainstorm potential failure modes. Briefly describe these potential failure modes and put (extremely rough) credences on each of them.

I’m grateful to Daniel Kokotajlo, Zach Stein-Perlman, Olivia Jimenez, Siméon Campos, and Thomas Larsen for feedback. 

Scenario: Victory

In 2023, researchers at leading AGI labs and independent groups work on developing a set of evaluations. Many different types of evaluations are proposed, though most can be summarized in the following format (a toy sketch in code follows the list):

  • A model is subject to a test of some kind.
  • If we observe X evidence (perhaps scary capabilities, misalignment, high potential for misuse), then the team developing the model is not allowed to [continue training, scale, deploy] the model until they can show that their model is safe.
  • To show that a model is (likely to be) safe, the team must show Y counterevidence.

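To make this format concrete, here is a minimal, purely illustrative sketch (in Python) of the gating logic described in the list above. Every name in it (run_evals, EvalResult, counterevidence_is_sufficient, and so on) is a hypothetical placeholder invented for illustration, not a proposal for how a real eval suite would work.

```python
# Purely illustrative sketch of the generic eval "gate" described above.
# All names are hypothetical placeholders; real evals would be far messier
# and would involve human judgment at every step.

from dataclasses import dataclass


@dataclass
class EvalResult:
    triggered: bool  # did the test surface X evidence (scary capabilities, misalignment, misuse potential)?
    details: str


def run_evals(model) -> EvalResult:
    # Placeholder: stands in for whatever battery of tests the model is subjected to.
    return EvalResult(triggered=False, details="no concerning evidence observed")


def counterevidence_is_sufficient(model, result: EvalResult) -> bool:
    # Placeholder: stands in for whatever Y counterevidence the team must produce.
    return False


def may_proceed(model, action: str) -> bool:
    """Gate actions like 'continue training', 'scale', or 'deploy'."""
    result = run_evals(model)
    if not result.triggered:
        return True
    # X evidence was observed: block the action until Y counterevidence is shown.
    return counterevidence_is_sufficient(model, result)
```
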
There is initially a great deal of debate and confusion in the space. What metrics should we be tracking with evals? What counterevidence could be produced to permit labs to scale or deploy the model?

The field is initially pre-paradigmatic, with different groups exploring different possibilities. Some teams focus on evals that showcase the capabilities of models (maybe we can detect models that are capable of overpowering humanity), while others focus on evals that target core alignment concerns (maybe we can detect models that are power-seeking or deceptive).

Each group also has one (or more) governance experts who coordinate to figure out various pragmatic questions relating to eval plans. Examples:

  • What preferences do decision-makers and scientists at AGI labs have? 
  • What are the barriers or objections to implementing evals?
  • Given these preferences and objections, what kinds of evals seem most feasible to implement?
  • Will multiple labs be willing to agree to a shared set of evals? What would this kind of agreement look like?
  • How will eval agreements be updated or modified over time? (e.g., in response to stronger models with different kinds of capabilities)

By late 2023 or early 2024, the technical groups have begun to converge on some common evaluation techniques. Governance teams have made considerable progress, with early drafts of “A Unified Evals Protocol” being circulated across major AGI labs. Lab governance teams are having monthly meetings to coordinate around eval strategy, and they are also regularly in touch with technical researchers to inform how evals are designed. Labs are also receiving some pressure from unusually savvy government groups, who encourage labs to implement self-regulation efforts and signal that, if labs don’t, they will push harder for state-imposed regulations (apparently this kind of thing happens in Europe sometimes; H/T to Charlotte). 

In 2024, leaders at top AGI labs release a joint statement acknowledging risks from advanced AI systems and the need for a shared set of evaluations. They agree to collaborate with an evals group run by a third party. 

This is referred to as the Unified Evals Agreement, and it establishes the Unified Evals Agreement Board. This group is responsible for implementing evaluations and audits of major AI labs, and it collaborates regularly with researchers and decision-makers at top AGI labs. 

Media attention around the Unified Evals Agreement is generally positive, and it raises awareness about long-term concerns relating to powerful AI systems. Some academics, cybersecurity experts, regulators, and ethicists write commentaries about the Unified Evals Agreement and provide useful feedback. Not all of the press is positive, however. Some journalists think the tech companies are just safety-washing. Others are concerned that this is a case of “letting big tech regulate itself”, and they’re calling for government action. Some people point to efforts in Europe to regulate AI, but many of the previous efforts are considered too bureaucratic, vague, and unhelpful for mitigating catastrophic risks. 

Supporters of the Unified Evals Agreement respond by pointing to other fields where industry-initiated evals are common. Some leaders at AGI labs mention that they would be happy for policymakers to enforce a reasonable set of evals. They point out that some incautious labs haven’t agreed to the Unified Evals Agreement, and policies would be especially helpful at regulating these groups. 

In 2025, some US and European policymakers become interested in increasing regulations on AI companies. They consult with the governance experts who implemented the Unified Evals Agreement to identify best practices. The policymakers start to monitor large training runs, require labs to report details about training runs above a certain size, and implement policies that penalize incautious AGI companies that have refused to adopt reasonable practices. Although some AGI start-ups complain about these regulations and appeal to free market principles, these concerns are outweighed by a desire to enforce safety standards and reduce race dynamics. The UK and a few other western governments adopt similar regulations. 

Meanwhile, top AGI labs are following the Unified Evals Agreement. While there are a few implementation hiccups, the policies are being enforced reasonably well. There are biweekly meetings attended by at least one representative from each AGI lab, as well as members of the Unified Evals Agreement Board. This looks somewhat similar to Information Sharing and Analysis Centers (ISACs) in the field of cybersecurity; there is a culture of sharing and troubleshooting accidents, bugs, and failure modes. About half of each meeting is spent discussing present-day implementation concerns (e.g., “Researchers at my lab want to know if X counterevidence would be sufficient to allow them to scale to a model with Y compute; what should I tell them?”), while the other half is spent anticipating longer-term considerations (e.g., “As models get more capable, what kinds of evals should we consider adding?”).

In 2026, models are capable enough to start triggering evals about power-seeking, deception, situational awareness, and other core alignment concerns. Triggering an eval might mean crossing a quantitative threshold (e.g., the “score” on a deception eval is above a certain value), failing a qualitative evaluation (e.g., the model is clearly seen to show deceptive and dangerous tendencies in a simulated environment), or a mixture of the two. The Unified Evals Agreement Board implements a set of evals that prioritize these “alignment evals” and reduce the emphasis on “capabilities evals.” This has the effect of getting AGI labs to invest more resources into alignment research. Furthermore, thanks to increasing pressure from policymakers, investors, and AGI scientists themselves, all of the relevant AGI labs are (voluntarily or not) subjected to evals. This has the effect of reducing race dynamics and “wild card” actors, allowing top labs to progress more carefully. 

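As a toy illustration of what “a quantitative threshold, a qualitative evaluation, or a mixture of the two” might cash out to, here is a hypothetical trigger rule. The thresholds and names below are invented for illustration only; they are not drawn from any real eval.

```python
# Toy illustration of the trigger criteria described above. All thresholds
# and names are invented; real criteria would involve far more judgment.

HARD_THRESHOLD = 0.8  # hypothetical: a deception score this high triggers on its own
SOFT_THRESHOLD = 0.5  # hypothetical: a weaker score triggers only alongside reviewer concern


def deception_eval_triggered(score: float,
                             clearly_deceptive_in_sim: bool,
                             reviewer_concern: bool) -> bool:
    if score > HARD_THRESHOLD:            # quantitative threshold alone
        return True
    if clearly_deceptive_in_sim:          # qualitative evaluation alone
        return True
    return reviewer_concern and score > SOFT_THRESHOLD  # mixture of the two


# Example: a model scoring 0.6 that reviewers are uneasy about (but that was not
# clearly deceptive in simulation) triggers under the mixed rule.
print(deception_eval_triggered(0.6, clearly_deceptive_in_sim=False, reviewer_concern=True))  # True
```
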
In 2027, the leading lab produces a model that is approaching human-level performance at scientific reasoning and AI research. In the middle of a training run, a deception eval is triggered. The eval consists of a combination of interpretability techniques, placing the AI in adversarial environments, having the AI argue with an Interrogator AI (whose goal is to show that the other AI is deceptive), and some New Stuff that was discovered at some date after February 2023. This eval correctly detects that the model has been thinking about acquiring resources, making copies of itself, deceiving humans to get widely deployed in the economy, and eventually overpowering humanity. While the evidence is not certain, it’s convincing enough to cause researchers at the leading lab to pause the training run. The Unified Evals Agreement Board notifies a small group of decision-makers at other labs and various alignment organizations. Then, researchers from various organizations form a collaboration aimed at figuring out how to align this powerful system (in the spirit of a merge-and-assist clause, perhaps accompanied by a windfall agreement). 

This cross-organizational collaboration succeeds in figuring out how to align the system. The system reaches above-human-level performance at alignment research and is then steered to solve more difficult alignment problems. The effort succeeds both because it is more parallelized (researchers from various organizations coordinate on solving the most pressing alignment concerns) and because it buys more serial time (AI progress is “paused” long enough to give researchers the serial time needed to solve core problems). 

Later on, it is confirmed that the model that triggered the deception eval would have attempted takeover if the leading lab had not paused. It seems likely that the eval was counterfactually responsible for pausing the training run, initiating a period of focused alignment research, persuading people from different groups to come together and collaborate, and ultimately preventing a dangerous takeover attempt. 

Points of Failure

This list is non-exhaustive. Examples are meant to be illustrative rather than comprehensive. Credences are wild guesses presented without justification. Feel free to post your own probabilities (and ideally justifications) in the comments. Each probability is conditional on the previous step(s) working.

Technical failure: We don’t develop useful evals (~25%): The nature of AI development/takeoff makes it hard or impossible to develop good alignment evals (i.e., evals that detect things like power-seeking and deception in models that are too weak to overpower humanity). Capabilities progress is discontinuous, and the most important alignment concerns will not show up until we have dangerous models. 

Governance failure: Lab decision-makers don’t implement evals (and we fail to coordinate with them) (~10%). The alignment community is enthusiastic about a few evals, but it is extremely hard to get leading labs to agree to adopt them. The people building technical evals seem removed from the people in charge of decision-making at leading labs, and lab governance teams are insufficient to bridge the divide. Lab decision-makers believe that the evals are tracking arbitrary benchmarks, that the risks some alignment researchers are most worried about are speculative, that internal safety & monitoring efforts are sufficient, and that adopting evals would simply let less cautious actors get ahead. Most labs refuse to have models evaluated externally; they justify this by pointing to internal efforts (by scientists who truly understand the models and their problems) and information security concerns (worries that the external evaluation teams might leak information about their products). One or two small groups adopt the evals, but the major players are not convinced. 

Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertise in ML fairness and bias, and a third is a consulting company that has advised safety standards in a variety of high-stakes contexts (e.g., biosecurity, nuclear energy). 

Each group proposes its own set of standards. Some decision-makers at top labs are enthusiastic about the Unified Evals Agreement, but others are skeptical. In addition to object-level debates, there are debates about which experts should be trusted. Ultimately, lab decision-makers end up placing more weight on teams with experience implementing safety standards in other sectors, and they go with the standards proposed by the consulting group. Although these standards are not mutually exclusive with the Unified Evals Agreement, lab decision-makers are less motivated to adopt new standards (“we just implemented evals; we have other priorities right now”). The Unified Evals Agreement is crowded out by standards that place much less emphasis on long-term catastrophic risks from AI systems. 

Technical failure + Governance failure: Counterevidence is imperfect and evals are Goodharted (~18%). It is hard or impossible to develop reasonable counterevidence (evidence that “ensures” a model is safe and therefore OK to be scaled and deployed). Our counterevidence criteria serve as useful proxies, but they can be Goodharted. Researchers optimize primarily for passing evals (as opposed to paying attention to the underlying problems that the evals are meant to detect). They figure out the easiest/fastest ways to pass the evals, and these strategies don’t actually train away the underlying issues (e.g., deception, power-seeking). As soon as the counterevidence condition is met, the lab scales/deploys the model, and it eventually overpowers humanity. 

Technical failure + Governance failure: Evals meaningfully accelerate capabilities (evals are net negative) (~2%). The process of developing evals requires gaining insights into the capabilities of dangerous systems. These insights are leaked to major AGI labs, who then use this information to develop more powerful systems. Alternatively, after capabilities evals are implemented at labs, researchers are incentivized to understand how these evals work. They start thinking in greater detail about the various ways that models can be power-seeking, deceptive, situationally aware, etc. In doing so, they uncover various ways to improve the capabilities of models. These insights occur significantly faster than they otherwise would have, burning many months that could have been used for safety research. 

Governance failure: Evals don’t lead to caution or actively decrease caution (~20%). Lab leaders are notified when a model fails a capabilities eval; it seems plausible that this model is close to being capable enough to overpower humanity. Lab decision-makers reason that [competing lab] is likely to have a model roughly as capable (or will develop such a model within a few weeks). They don’t believe they have time to invest in a lot of alignment research (let alone an entire alignment collaboration). They quickly add a few safety measures before scaling and deploying the model, and the model overpowers humanity.

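Since each credence is conditional on the previous steps working, one crude way to combine them is to multiply the corresponding survival probabilities. Treating the list above as exhaustive (which, per the notes below, it isn’t), that gives roughly a 39% chance that the success story goes through:

```python
# Crude combination of the conditional failure credences above. This treats
# the list as exhaustive, which the notes below explicitly caution against.
failure_probs = [0.25, 0.10, 0.10, 0.18, 0.02, 0.20]
p_success = 1.0
for p in failure_probs:
    p_success *= 1 - p
print(round(p_success, 2))  # ~0.39
```
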
Note that the above breakdown is overly simplistic. In reality, I expect a messy combination of the above possibilities. For example: there are evals for some but not all failure modes; they vary in their usefulness; they detect some scary capabilities with high sensitivity but have some false positives; some (but not all) labs implement them; and they are Goodhart-able to varying extents. H/T to Zach.

Note also that I’m defining “success” as “evals counterfactually cause us to avoid at least one catastrophic risk.” It is possible that evals help us avoid one catastrophic risk (e.g., we stop the first Deceptive Model from overpowering humanity) but we still fail to prevent subsequent catastrophic risks (e.g., our eval fails to detect the fifth Deceptive Model or fails to prevent misuse risk). This underscores that evals alone are unlikely to be sufficient to avoid all sources of catastrophic risk, but they may be helpful in reducing some sources of catastrophic risk.