Epistemic status: raising concerns, rather than stating confident conclusions.

I’m worried that a lot of work on AI safety evals matches the pattern of “Something must be done. This is something. Therefore this must be done.” Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:

1. Possible to measure with scientific rigor.

Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model’s economic or scientific impact) based on model-level evals, your results will often be BS.

(This is why I dislike the term “transformative AI”, by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that’s a constantly moving target! So TAI is a terrible thing to try to forecast.)

Another angle on “scientific rigor”: you’re trying to make it obvious to onlookers that you couldn’t have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.
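To make the “they add up fast” point concrete, here’s a toy sketch (all of the choice counts below are invented) of how quickly seemingly-innocuous design decisions multiply into a large space of possible eval configurations, any one of which could be the one that happens to give your preferred result:

```python
# Toy illustration (hypothetical numbers): each "reasonable-seeming" design
# choice multiplies the space of eval configurations an author could have run.
from math import prod

design_choices = {
    "prompt template": 4,
    "number of few-shot examples": 3,
    "sampling temperature": 3,
    "scoring rule": 5,
    "task subset / filtering": 4,
    "aggregation metric": 3,
}

n_configs = prod(design_choices.values())
print(f"{n_configs} distinct ways to 'run the eval'")  # 2160 configurations
```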

(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.
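As a toy illustration of the free-parameters problem (all numbers here are invented, and this is not a reconstruction of any particular report): compounding just two parameters, each within a range a reasonable person could defend, moves the headline number by many orders of magnitude.

```python
# A deliberately crude sketch (all numbers invented, not taken from any real
# forecast): two free parameters, each with a "defensible" low and high value,
# compounded over ten years. The extremes differ by roughly 8 orders of
# magnitude, i.e. the modeler can back out almost any headline conclusion.
YEARS = 10

low  = 1.5**YEARS * 1.2**YEARS   # conservative compute growth x algorithmic progress
high = 4.0**YEARS * 3.0**YEARS   # aggressive compute growth x algorithmic progress

print(f"effective capability multiplier after {YEARS} years: {low:,.0f}x to {high:,.0f}x")
```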

(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)

2. Provides signal across scales.

Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.
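Here’s a minimal sketch of the difference (the model names and scores below are hypothetical): a pass/fail threshold stays flat until it suddenly flips, while a log-scale metric like task-horizon length shows steady, extrapolatable improvement the whole way.

```python
# A minimal sketch (hypothetical models and scores) of why a continuous,
# log-scale metric gives signal across scales where a binary threshold does not.
import math

# Invented eval scores for successive model generations: the longest task
# horizon (in minutes) each model completes reliably.
horizon_minutes = {"gen-1": 0.5, "gen-2": 2, "gen-3": 9, "gen-4": 40, "gen-5": 170}
THRESHOLD = 60  # a Turing-Test-style pass/fail bar, e.g. "completes hour-long tasks"

for model, score in horizon_minutes.items():
    binary = "PASS" if score >= THRESHOLD else "FAIL"  # flat until the very end
    log_score = math.log10(score)                      # improves ~0.6 OOM per generation
    print(f"{model}: binary={binary}, log10(horizon)={log_score:+.2f}")
```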

3. Focuses on clearly worrying capabilities.

Evals for hacking, deception, etc. track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don’t think automated ML R&D is necessary for risk.

4. Motivates useful responses.

Safety evals are for creating clear Schelling points at which action will be taken. But if you don’t know what actions your evals should catalyze, it’s often more valuable to focus on fleshing that out. Often nobody else will!

In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.

Lastly: even if you have a good eval idea, actually implementing it well can be very challenging.

Building evals is scientific research, and so we should expect eval quality to be heavy-tailed, like most other science. I worry that this is sometimes obscured by evals being an unusually easy type of research to get started with.

Comments:

evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk

I don't think this is true. More specifically, I think there are a lot of people who will start to worry about AI xrisk if things like automated ML R&D pick up. I don't think most people who dismiss AI xrisk do so because they think intelligence is inherently good, but rather because AI xrisk just seems too "scifi." But if AI is automating ML R&D, then the idea of things getting out of hand won't feel as scifi. In principle, people should be able to separate the question of "will AI soon be able to automate ML R&D?" from the question of "if AI could automate ML R&D, would it pose an xrisk?", but I think most low-decouplers struggle to make this separation. For the kind of reaction that a "normal" person will have to automated ML R&D, I think this reaction from a CBS host interviewing Hinton is informative.

(I agree with your general point that it's better to focus on worrying capabilities, and also with some of your other points, such as how demos might be more useful than evals.)

I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.

I don't think this is quite right.

Two major objections to the bio-anchors 30-year-median conclusion might be:

  1. The whole thing is laundering vibes into credible-sounding headline numbers.
  2. Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.

To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence for it, since the update away from 30-year timelines is already explained by (2).

I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1), but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].

It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.

I found this fairly helpful for thinking about evals (I hadn't previously thought that hard about how to evaluate an eval, and at first blush this seems like a pretty reasonable framework for doing so).

I had previously thought "evals seem to be a thing you do if you don't have much traction on harder things", and it was an interesting point that "yep, it's easier to get started on evals, but that doesn't mean it's necessarily easy to push it all the way through towards 'industrial-useful'".