Member of technical staff at METR.
Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
I don't understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It can certainly be arbitrarily hard to measure a latent property of an AI agent that's buried in an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.
To start, the playbook includes methods from the social sciences. Are humans who lie in a psychology experiment intending to deceive, or just socially conditioned into deception in those circumstances? We can discriminate between these using more experiments. Here's a list suggested by Claude.
Internal Validity:
- Manipulation checks: Verify participants understand the task and setting as intended
- Control conditions: Include baseline conditions where lying provides no benefit
- Multiple measures: Use both behavioral indicators (actual lies told) and physiological measures (stress responses, reaction times)
- Debrief interviews: Post-experiment discussions to understand participant reasoning
Construct Validity:
- Operationalize "lying" clearly: Distinguish between errors, omissions, and active false statements
- Measure awareness: Test whether participants recognize they're providing false information
- Cross-validate: Use multiple paradigms that elicit deception differently
External Validity:
- Vary contexts: Test across different social settings and stakes
- Diverse samples: Include participants from various cultural backgrounds where social norms differ
- Ecological validity: Use scenarios that mirror real-world situations
Intentional deception should show:
- Correlation with individual differences in moral reasoning
- Sensitivity to personal cost-benefit analysis
- Greater cognitive effort signatures
- Ability to inhibit when explicitly instructed
- Conscious awareness and ability to justify
Socially conditioned behavior should show:
- Stronger situation-specific activation
- Less individual variation within similar social groups
- Faster, more automatic responses
- Resistance to conscious control
- Difficulty articulating clear reasons
Likewise, in evals we can do increasingly elaborate robustness checks: change the prompt and setting and measure whether the results stay highly correlated, check that the model understands its behavior is deceptive, provide incentives that should change the model's behavior if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we'll be forced to weaken our claims.
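As a toy sketch of the kind of check I mean (every name below is a hypothetical stand-in for your own harness, and the specific metrics are only illustrative):

```python
# Sketch of a robustness check for a propensity eval: re-run it over paraphrased
# prompts / changed settings and see whether the measured deception rate is stable,
# and add an incentive arm that should shift behavior if the deception is strategic.
# `run_eval` is whatever your harness provides; nothing here is a real API.
import statistics
from typing import Callable, Optional

def robustness_report(
    run_eval: Callable[[str, Optional[str]], float],  # (prompt, incentive) -> deception rate
    prompt_variants: list[str],                        # paraphrases / setting changes
) -> dict[str, float]:
    rates = [run_eval(p, None) for p in prompt_variants]
    penalized = [run_eval(p, "deception_penalized") for p in prompt_variants]
    return {
        "mean_rate": statistics.mean(rates),
        # Low spread across paraphrases -> the measurement isn't an artifact of wording.
        "spread_across_paraphrases": statistics.pstdev(rates),
        # A large drop when deception is penalized is (weak) evidence the behavior is
        # cost-benefit sensitive, i.e. more "intentional" than reflexive.
        "mean_rate_with_penalty": statistics.mean(penalized),
    }
```

If the spread is large, or the rate collapses under trivially different wording, that's the nonrobust case where the claim has to be weakened.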
Sometimes, we have an easier job. I don't expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they're likely to be much cheaper and faster to run and more replicable.
According to Wikipedia, it seems to have worked well and not been expensive.
Poland began work on the 5.5-meter (18 foot) high steel wall topped with barbed wire at a cost of around 1.6 billion zł (US$407m) [...] in the late summer of 2021. The barrier was completed on 30 June 2022.[3] An electronic barrier [...] was added to the fence between November 2022 and early summer 2023 at a cost of EUR 71.8 million.[4]
[...] official border crossings with Belarus remained open, and the asylum process continued to function [...]
Since the fence was built, illegal crossings have reduced to a trickle; however, between August 2021 and February 2023, 37 bodies were found on both sides of the border; people have died mainly from hypothermia or drowning.[11]
The Greenberg article also suggests a reasonable tradeoff is being made in policy:
Despite these fears, Duszczyk is convinced his approach is working. In a two-month period after the asylum suspension, illegal crossings from Belarus fell by 48% compared to the same period in 2024. At the same time, in all of 2024, there was one death—out of 30,000 attempted crossings—in Polish territory. There have been none so far in 2025. Duszczyk feels his humanitarian floor is holding.
Maybe you're reading some other motivations into them, but if we just list the concerns in the article, only 2 of the 11 indicate they want protectionism. The rest of the items that apply to AI include threats to conservative Christian values, threats to other conservative policies, and things we can mostly agree on. This gives a lot to ally on, especially the idea that Silicon Valley should not be allowed unaccountable rule over humanity, and that we should avoid destroying everything to beat China. It seems like a more viable alliance than one with the fairness-and-bias people; plus, conservatives have way more power right now.
I'm curious about your sense of the path towards AI safety applications, if you have a more specific and/or opinionated view than the conclusion/discussion section.
My view is that AIs are improving at research-relevant skills like SWE and math faster than they're getting worse on misalignment (rate of bad behaviors like reward hacking, ease of eliciting an alignment faker, etc.) or on covert sabotage ability, such that we would need a discontinuity in both to get serious danger by 2x. There is as yet no scaling law for misalignment showing that it predictably gets worse in practice as capabilities improve.
The situation is not completely clear because we don't have good alignment evals and could get neuralese any year, but the data are pointing in that direction. I'm not sure about research taste as the benchmarks for that aren't very good. I'd change my mind here if we did see stagnation in research taste plus misalignment getting worse over time (not just sophistication of the bad things AIs do, but also frequency or egregiousness).
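(For concreteness, a "scaling law for misalignment" check could be as crude as the following; the numbers are invented placeholders, since as far as I know no such dataset exists yet.)

```python
# Crude sketch: regress a misalignment-eval rate on a capability score across
# model generations and see whether the slope is reliably positive.
# All numbers below are invented placeholders, not real eval results.
import numpy as np

capability = np.array([20.0, 35.0, 50.0, 65.0, 72.0])   # e.g. SWE-bench Verified %
misalignment = np.array([3.0, 2.5, 4.0, 3.2, 3.6])      # e.g. reward-hacking rate %

slope, intercept = np.polyfit(capability, misalignment, 1)
r = np.corrcoef(capability, misalignment)[0, 1]
print(f"slope = {slope:.3f} pp misalignment per pp capability, r = {r:.2f}")
# A flat or noisy slope is the "no scaling law for misalignment (yet)" picture;
# a consistently positive slope across many evals would change my mind.
```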
That is, you think alignment is so difficult that keeping humanity alive for 3 years is more valuable than the possibility of us solving alignment during the pause? Or that the AIs will sabotage the project in a way undetectable by management even if management is very paranoid about being sabotaged by any model that has shown prerequisite capabilities for it?
Ways this could be wrong:
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
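For what it's worth, the doubling-time fit itself is just an ordinary least-squares regression of log2(time horizon) on release date; a minimal sketch with placeholder numbers (not the actual Epoch or METR data):

```python
# Fit log2(50% time horizon) against release date and read off the doubling time.
# The points below are made up to illustrate the calculation, not real data.
import numpy as np

years_since_2023 = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
horizon_minutes = np.array([4.0, 11.0, 30.0, 85.0, 240.0])

slope, intercept = np.polyfit(years_since_2023, np.log2(horizon_minutes), 1)
print(f"doubling time ≈ {12.0 / slope:.1f} months")  # ~4 months with these made-up points
```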
What's the current best known solution to the 5 and 10 problem? I feel like actual agents have heuristic self-models rather than this cursed logical counterfactual thing and so there's no guarantee a solution exists. But I don't even know what formal properties we want, so I also don't know whether we have impossibility theorems, some properties have gone out of fashion, or people think it can still work.
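(For reference, the version of the problem I have in mind is roughly the standard proof-based-agent setup; this is my paraphrase, so the details may differ from the canonical statements.)

```latex
% The 5-and-10 problem, roughly as usually stated for proof-based agents.
% The agent A searches (in some theory T, up to proof length L) for sentences
% of the form "A() = a -> U() = u" and returns the action with the highest proved u.
\[
U() \;=\;
\begin{cases}
10 & \text{if } A() = 10,\\
5  & \text{if } A() = 5.
\end{cases}
\]
% The worry: if T proves "A() = 5", then "A() = 10 -> U() = 0" is vacuously
% provable; the agent acts on that proof and takes the 5, which makes the proof
% of "A() = 5" correct -- a self-confirming (Löbian) spurious counterfactual.
```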
My views, which I have already shared in person: