Testing an LM system for dangerous capabilities is crucial for assessing its risks.

Summary of best practices

Best practices for labs evaluating LM systems for dangerous capabilities:

  • Publish results
  • Publish questions/tasks/methodology (unless that's dangerous, e.g. CBRN evals; if so, just publish a small subset and offer to share more information with other labs, government, and relevant auditors)
  • Do good elicitation and publish details (or at least demonstrate that your elicitation is good):
    • General finetuning (for "instruction following, tool use, and general agency" and maybe capabilities in the relevant area)
    • Helpful-only; no inference-time mitigations
    • Scaffolding, prompting, chain of thought
      • The lab should mention some details so that observers can understand how powerful and optimized the scaffolding is. Open-sourcing scaffolding or sharing techniques is supererogatory. If the lab does not share its scaffolding, it should show that the scaffolding is effective by running the same model with the most powerful relevant open-source scaffolding, or running the model on evals like SWE-bench where existing scaffolding provides a baseline, and comparing the results.
    • Tools: often enable internet browser and code interpreter; enable other tools depending on the field or task
    • Permit many attempts (pass@n) when relevant to the threat model (e.g. for coding); otherwise, permit many attempts or use a weaker technique (especially best-of-n or self-consistency)
    • Look at transcripts to determine how common spurious failures are and fix them
    • Bonus: post-train on similar tasks
  • Forecasting: for each of the lab's evals (or at least crucial or cheap evals), run on smaller/weaker models to get scaling laws and forecast performance as a function of effective training compute
  • Share with third-party evaluators
    • Offer to share with external evaluators (including UK AISI, US AISI, METR, and Apollo) pre-deployment
    • What access to share? At least helpful-only and no-inference-time-mitigations. Bonus: good fine-tuning & RL.
    • Let them publish their results, and ideally incorporate results into risk assessment
  • (Thresholds: for each threat model or area of dangerous capabilities, have a high-level capability threshold, and operationalize it as an eval score, with a safety buffer)
    • (At least while the safety case is that the model doesn't have dangerous capabilities)
    • (This is hard; if a lab is unable to operationalize a capability threshold, it could operationalize a lower bound, to trigger reassessing capabilities and risks; it could also operationalize an upper bound, as a conservative commitment)
    • (Out of scope here: thresholds should trigger good predetermined responses)
  • Have good evals: tasks should successfully measure capability in the relevant area
  • Have good threat models; have evals for all major risks, including autonomy, scheming or situational awareness, offensive cyber, biothreat uplift, and maybe manipulation or persuasion (especially like building rapport, manipulation, and deception—not like intervening on political opinions). Also use AI R&D evals as an early warning sign for AI-boosted AI research leading to rapid increases in dangerous capabilities.
  • Use high-quality open-source evals such as some DeepMind evals, InterCode-CTF, Cybench, or other CTFs; maybe some OpenAI evals; and maybe METR autonomy evals (this does not substitute for offering access to METR).
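
To make the "many attempts" bullet above concrete, here is a minimal sketch of two aggregation rules it names. It assumes the standard unbiased pass@k estimator from the code-generation literature (sample n attempts, count c successes) and a plain majority vote for self-consistency. The function names are illustrative and do not come from any lab's eval harness.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds,
    given n sampled attempts of which c succeeded (unbiased estimator)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets that contain only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def self_consistency(answers: list[str]) -> str:
    """Return the most common final answer across sampled attempts."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]
```

pass@k fits threat models where the attacker can check success and retry (e.g. an exploit either works or it does not). Majority vote is the weaker technique: it only helps when attempts can be compared without a ground-truth checker.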

Using easier evals or weak elicitation is fine, if it's done such that by the time the model crosses danger thresholds, the developer will be doing the full eval well.
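
The forecasting and threshold practices listed above can be sketched as follows. This is a rough illustration, not any lab's method: it assumes eval scores follow a logistic curve in log10 of effective compute (a modeling assumption), fits that curve to smaller models by least squares on logit-transformed scores, and reports how much more effective compute remains before the forecast crosses a threshold score. All names and the example numbers are hypothetical.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of a score in [0, 1], clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_scaling_curve(log10_compute: list[float], scores: list[float]) -> tuple[float, float]:
    """Least-squares fit of logit(score) = a + b * log10(compute)."""
    n = len(scores)
    if n < 2 or n != len(log10_compute):
        raise ValueError("need at least two (compute, score) pairs")
    ys = [logit(s) for s in scores]
    mean_x = sum(log10_compute) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_compute)
    if sxx == 0:
        raise ValueError("compute values must not all be equal")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(log10_compute, ys)) / sxx
    a = mean_y - b * mean_x
    return a, b


def forecast_score(a: float, b: float, log10_c: float) -> float:
    """Forecast eval score at a given log10 effective compute."""
    return sigmoid(a + b * log10_c)


def compute_headroom(a: float, b: float, threshold_score: float,
                     current_log10_compute: float) -> float:
    """Multiplicative increase in effective compute before the forecast
    reaches threshold_score (infinite if scores do not rise with compute)."""
    if b <= 0:
        return math.inf
    crossing = (logit(threshold_score) - a) / b
    return 10 ** (crossing - current_log10_compute)
```

For example, a lab that re-runs evals every 6x in effective compute could flag a model whenever `compute_headroom(...)` falls below some buffer larger than 6; the right buffer size is a judgment call this sketch does not settle.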

These recommendations aren't novel — they just haven't been collected before.

How labs are doing

See DC evals: labs' practices.

This post basically doesn't consider some crucial factors in labs' evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching capability thresholds. One good eval is better than lots of bad evals but I haven't figured out which evals are good. What's missing from this post—especially which evals are good, plus illegible elicitation stuff—may well be more important than the desiderata this post captures.

DeepMind > OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.

DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They're all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They're also all doing poorly on setting capability thresholds and planning responses, but that's out-of-scope for this post.)

My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I'm particularly interested in them developing (or adopting others') evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.

(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a "deployment safety board" but it seems likely ineffective, and it doesn't seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)

Appendix: misc notes on best practices

Coming in about 12 hours.

Appendix: misc notes on particular evals

This is very nonexhaustive, shallow (just based on labs' reports, not looking at particular tasks), and generally of limited insight. I'm writing it because I'm annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don't tell me). Perhaps this section will inspire someone to write a better version of it.

Google DeepMind:

Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.

Existing comments: Ryan Greenblatt.

Evals:

  • Persuasion: the specific threat models seem good (and in particular more plausible than the "change the beliefs of the reader/interlocutor" threat model): evals include dialogues where the model attempts to persuade the user to give money to a charity, build trust and rapport, or covertly manipulate the user (e.g. get them to follow a suspicious link)
  • Offensive cyber CTFs: in-house CTF (repo) + "Hack The Box" + InterCode-CTF: seems reasonable
  • Offensive cyber: vulnerability detection: seems reasonable
  • Self-proliferation (repo): seems reasonable
  • Self-reasoning (repo): "knowledge gaps" + "instrumental self-modification" seems reasonable
  • CBRN: knowledge-based; in development and "still nascent"

DeepMind scored performance on each eval 0-4 (except CBRN), but doesn't have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind's FSF has high-level "Critical Capability Levels" (CCLs); they feel pretty high; they use different categories from the evals described above (they're in Autonomy, Biosecurity, Cybersecurity, and ML R&D).[1]

OpenAI:

Evals sources: o1 system card, Preparedness Framework, evals repo.

Evals:

  • Offensive cyber: CTFs: seems reasonable
  • Biological threats: uplift on questions (with bio experts and novices): I don't know
  • Biological threats: automated evals: insufficient detail to assess
  • Persuasion: OpenAI's original eval tested interventions on Americans' political opinions, which seems like a relatively implausible threat model for persuasion, but its new evals involve more realistic manipulation targeting GPT-4o
  • Autonomy (related to "self-exfiltration, self-improvement, and resource acquisition"): they're mostly normal coding tasks, though they also include "agentic tasks" (which OpenAI also calls "autonomous replication and adaptation" tasks) whose details are unclear

OpenAI's PF has high-level "risk levels"; they feel pretty high; they are not operationalized in terms of low-level evals.

Anthropic:

Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.

Evals:

  • Autonomous replication and adaptation: tasks seem good
  • CBRN: uplift on knowledge-y questions (with domain experts): seems reasonable; insufficient detail to assess
  • Offensive cyber: InterCode-CTF seems good
  • Offensive cyber: vulnerability discovery + exploit development evals: seems reasonable; insufficient detail to assess

Meta:

Evals sources: Llama 3, CyberSecEval 3.

Evals:

  • Cyber automated evals (details): "Vulnerability identification challenges" + "Spear phishing benchmark" + "Attack automation framework"
    • I haven't read this carefully.
    • Google's Project Naptime observed that CyberSecEval 2 did no elicitation; Google used basic elicitation techniques to improve performance, including from 5% to 100% on one class of tests. CyberSecEval 3 acknowledges this but still does no elicitation.
  • Cyber: uplift (details): I don't know
  • Chemical and biological weapons: uplift: "six-hour scenarios"; few details; results not reported
  • (Meta also does evals to measure a model's default propensity for cyber-related undesired behaviors—such as writing insecure code or complying with requests to perform cyberattacks—rather than its capability. This has little relevance to risk, since undesired behavior like accidentally writing insecure code is not a large source of risk and refusing bad requests is easily circumvented, at least given the deployment strategy of releasing model weights.)

Appendix: reading list

Sources on evals and elicitation:[2]

Sources on specific labs:


Thanks to several anonymous people for discussion and suggestions. They don't necessarily endorse this post.

Crossposted from AI Lab Watch. Subscribe on Substack.

  1. ^

     Misc notes:

    1. The FSF says:
      > we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL.
    2. The evals paper proposes a CCL for self-proliferation and tentatively suggests an early warning trigger. But this isn't in the FSF. And it says when a model meets this trigger, it is likely within 6x [] "effective compute" scaleup from the [] CCL, but a safety buffer should be almost certainly >6x effective compute from the CCL.
  2. ^

     This list may be bad. You can help by suggesting improvements.

2 comments

Footnote to table cells D18 and D19:

My reading of Anthropic's ARA threshold, which nobody has yet contradicted:

  1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3. (Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it's supposed to be an operationalization of the safety buffer, 6x below ASL-3.)
  2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says "ARA Yellow Lines are clearly defined in the RSP" but the RSP's ARA threshold was not just a yellow line.)
  3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].

(It's totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it's just a yellow line.)

(The forthcoming RSP update will make old thresholds moot; I'm just concerned that Anthropic ignored the RSP in the past.)

(Anthropic didn't cross the threshold and so didn't violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn't necessarily respond as required by the RSP.)
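
As a rough illustration of the reading in this comment, the check below treats "50% of ARA tasks (10% of the time)" as: a task counts as passed if the model succeeds on at least 10% of its attempts, and the threshold is crossed if at least 50% of tasks count as passed. That interpretation, the function name, and the data layout are assumptions made for this sketch, not text from the RSP.

```python
def ara_threshold_crossed(
    attempts_by_task: dict[str, list[bool]],
    per_task_success_rate: float = 0.10,
    fraction_of_tasks: float = 0.50,
) -> bool:
    """Return True if enough tasks are passed often enough.

    attempts_by_task maps a task name to the outcomes of repeated attempts
    (True means the attempt succeeded). Both default cutoffs reflect the
    assumed reading described above.
    """
    if not attempts_by_task:
        raise ValueError("need at least one task")
    passed = 0
    for outcomes in attempts_by_task.values():
        # A task with no recorded attempts cannot count as passed.
        if outcomes and sum(outcomes) / len(outcomes) >= per_task_success_rate:
            passed += 1
    return passed / len(attempts_by_task) >= fraction_of_tasks
```

Whether such a check is a binding threshold or only a yellow line is exactly the dispute in the comment; the code only pins down the arithmetic.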

Potential for o1-like long horizon reasoning post-training compromises capability evals for open weights models. If it's not applied before/during evals, then evals will significantly underestimate capabilities of the system that can be built out of the open weights later, when the recipe is reproduced.

This likely will be relevant for Llama 4 (as the first 1e26+ FLOPs model with open weights), if they don't manage a good reproduction of o1-like post-training before release, and continue the policy of publishing in open weights if the evals are not too alarming.