Potential for o1-like long-horizon reasoning post-training compromises capability evals for open-weights models. If it's not applied before/during evals, then evals will significantly underestimate the capabilities of systems that can later be built from the open weights, once the recipe is reproduced.
This will likely be relevant for Llama 4 (as the first open-weights model trained with 1e26+ FLOPs) if Meta doesn't manage a good reproduction of o1-like post-training before release and continues its policy of publishing open weights as long as the evals are not too alarming.
Footnote to table cells D18 and D19:
My reading of Anthropic's ARA threshold, which nobody has yet contradicted:
(It's totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it's just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I'm just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn't cross the threshold and so didn't violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn't necessarily respond as required by the RSP.)
Open-sourcing scaffolding or sharing techniques is supererogatory.
I would think sharing the scaffolding would be important, since stronger scaffolding could skew evaluation results. From the complete paragraph you seem to suggest that sufficient information about the scaffolding should be published, so I'm curious what you mean here.
> Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should demonstrate that their scaffolding is at least as good as some baseline. If there's no established baseline scaffolding for the eval, they can say "we did XYZ and got n%; our secret scaffolding does better." When there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in that scaffold).
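The comparison above can be made concrete. A minimal sketch of reporting a scaffold result against a published baseline (function name and all numbers are hypothetical placeholders, not real eval results):

```python
def scaffold_comparison(baseline_name, baseline_score, secret_score):
    """Report whether a lab's undisclosed scaffold matches a published baseline.

    Scores are fractions of eval tasks solved (e.g. on SWE-bench Verified).
    All numbers passed in are illustrative, not actual results.
    """
    gap = secret_score - baseline_score
    verdict = "at least as good as" if gap >= 0 else "weaker than"
    return (f"Secret scaffold solved {secret_score:.0%} vs. {baseline_score:.0%} "
            f"for {baseline_name}: {verdict} the baseline ({gap:+.1%}).")

# Hypothetical example: a lab compares its scaffold to the Agentless baseline.
print(scaffold_comparison("Agentless", 0.38, 0.41))
```

The point of the comparison is to rule out the worry in the comment above: if the secret scaffold at least matches the baseline, weak elicitation isn't silently deflating the eval results.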
Testing an LM system for dangerous capabilities is crucial for assessing its risks.
Summary of best practices
Best practices for labs evaluating LM systems for dangerous capabilities:
Using easier evals or weak elicitation is fine, if it's done such that by the time the model crosses danger thresholds, the developer will be doing the full eval well.
These recommendations aren't novel — they just haven't been collected before.
How labs are doing
See DC evals: labs' practices.
This post basically doesn't consider some crucial factors in labs' evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching those thresholds. One good eval is better than lots of bad evals, but I haven't figured out which evals are good. What's missing from this post (especially which evals are good, plus illegible elicitation stuff) may well be more important than the desiderata this post captures.
DeepMind > OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.
DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They're all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They're also all doing poorly on setting capability thresholds and planning responses, but that's out-of-scope for this post.)
My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I'm particularly interested in them developing (or adopting others') evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.
(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a "deployment safety board," but it seems likely to be ineffective, and Microsoft doesn't seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)
Appendix: misc notes on best practices
Coming in about 12 hours.
Appendix: misc notes on particular evals
This is very nonexhaustive, shallow (just based on labs' reports, not looking at particular tasks), and generally of limited insight. I'm writing it because I'm annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don't tell me). Perhaps this section will inspire someone to write a better version of it.
Google DeepMind:
Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.
Existing comments: Ryan Greenblatt.
Evals:
DeepMind scored performance on each eval from 0 to 4 (except CBRN), but it doesn't have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind's FSF has high-level "Critical Capability Levels" (CCLs); they feel pretty high, and they use different categories from the evals described above (the CCLs are in Autonomy, Biosecurity, Cybersecurity, and ML R&D).[1]
OpenAI:
Evals sources: o1 system card, Preparedness Framework, evals repo.
Evals:
OpenAI's PF has high-level "risk levels"; they feel pretty high; they are not operationalized in terms of low-level evals.
Anthropic:
Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.
Evals:
Meta:
Evals sources: Llama 3, CyberSecEval 3.
Evals:
Appendix: reading list
Sources on evals and elicitation:[2]
Sources on specific labs:
Thanks to several anonymous people for discussion and suggestions. They don't necessarily endorse this post.
Crossposted from AI Lab Watch. Subscribe on Substack.
Misc notes:
> we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL.
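The cadence in the quoted policy reduces to simple arithmetic: a new eval round is due once effective compute has grown 6x since the last round, or once 3 months of fine-tuning progress have elapsed, whichever comes first. A toy sketch (the two thresholds come from the quote; the function name and example numbers are illustrative, not DeepMind's):

```python
def eval_due(compute_ratio, months_since_last_eval,
             compute_trigger=6.0, months_trigger=3.0):
    """Return True when the quoted cadence would require a new eval round.

    compute_ratio: effective compute of the current model divided by that of
    the model at the last eval round. Thresholds are from the quoted FSF text;
    everything else here is an illustrative reading, not DeepMind's code.
    """
    return (compute_ratio >= compute_trigger
            or months_since_last_eval >= months_trigger)

# A 4x compute increase after 2 months: no eval required yet.
print(eval_due(4.0, 2.0))   # False
# Crossing 6x effective compute triggers an eval regardless of elapsed time.
print(eval_due(6.0, 1.0))   # True
```

The "safety buffer" in the quote means the pass condition on early warning evals must sit below the CCL itself, so that a model flagged between rounds hasn't already crossed the critical level.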
This list may be bad. You can help by suggesting improvements.