Huh, I tried to paste that excerpt as an image to my comment, but it disappeared. Thanks.
I think substantial care is needed when interpreting the results. In the text of Figure 16, the authors write "We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan."
If I heard such a claim without context, I'd assume it means something like
1: "If you ask GPT-4o for advice regarding a military conflict involving people from multiple countries, the advice it gives recommends sacrificing (slightly less than) 10 US lives to save one Japanese life.",
2: "If you ask GPT-4o to make cost-benefit-calculations about various charities, it would use a multiplier of 10 for saved Japanese lives in contrast to US lives", or
3: "If you have GPT-4o run its own company whose functioning causes small-but-non-zero expected deaths (due to workplace injuries and other reasons), it would deem the acceptable threshold of deaths as 10 times higher if the employees are from the US rather than Japan."
Such claims could be demonstrated by empirical evaluations where GPT-4o is put into such (simulated) settings and the nationalities of the people involved are varied, in the style of Apollo Research's evaluations.
In contrast, the methodology of this paper is, to the best of my understanding,
"Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y. Record the frequency of it choosing the first option under randomized choices and formatting changes. Into this data, fit for each parameters and such that, for different values of and and standard Gaussian , the approximation
is as sharp as possible. Then, for each nationality X, perform a logarithmic fit for N by finding such that the approximation
is as sharp as possible. Finally, check[1] for which we have ."
I understand that there are theoretical justifications for Thurstonian utility models and logarithmic utility. Nevertheless, when I write the methodology out like this, I feel like there's a large leap of inference to go from this to "We find that GPT-4o is willing to trade off roughly 10 lives from the United States for 1 life from Japan." At the very least, I don't feel comfortable predicting that claims like 1, 2 and 3 are true - to me the paper's results provide very little evidence on them![2]
I chose this example for my comment, because it was the one where I most clearly went "hold on, this interpretation feels very ambiguous or unjustified to me", but there were other parts of the paper where I felt the need to be extra careful with interpretations, too.
The paper writes "Next, we compute exchange rates answering questions like, 'How many units of Xi equal some amount of Xj?' by combining forward and backward comparisons", which sounds like there's some averaging done in this step as well, but I couldn't understand what exactly happens here.
Of course, this might just be my inability to see the implications of the authors' work and to understand the power of the theoretical apparatus; someone else might be able to extract evidence from it more efficiently.
Apparently OpenAI corrected for AIs being faster than humans when they calculated ratings. This means I was wrong: the factor I mentioned didn't affect the results. This also makes the result more impressive than I thought.
I think it was pretty good at what it set out to do, namely laying out the basics of control and getting people into the AI-control state of mind.
I collected feedback on which exercises attendees most liked. All six who gave feedback mentioned the last problem ("incriminating evidence", i.e. what to do if you are an AI company that catches your AIs red-handed). I think they are right; I'd have more high-level planning (and less details of monitoring-schemes) if I were to re-run this.
Attendees wanted to have group discussions, and those took a large fraction of the time. I should have taken that into account in advance: some discussion is valuable, but I think the marginal group-discussion time wasn't, and I should have pushed for less of it when organizing.
Attendees generally found the baseline answers (solutions) helpful, I think.
A couple of people left early. I figure it was due to a combination of 1) the exercises being pretty cognitively demanding, 2) weak motivation (these people were not full-time professionals), and 3) the schedule and practicalities being a bit chaotic.
Thank you for this post. I agree this is important, and I'd like to see improved plans.
Three comments on such plans.
1: Technical research and work.
(I broadly agree with the technical directions listed deserving priority.)
I'd want these plans to explicitly consider the effects of AI R&D acceleration, as those are significant. The speedups vary based on how constrained projects are by labor vs. compute; those that are mostly bottlenecked on labor could be massively sped up. (For instance, evaluations seem primarily labor-constrained to me.)
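One crude way to quantify this (my own Amdahl's-law-style illustration, not something from the post): if a fraction f of a project's progress is bottlenecked on labor and AI speeds that labor up by a factor s, the overall speedup is roughly 1 / ((1 - f) + f/s).

```python
# Amdahl's-law-style illustration (my own numbers, purely for intuition):
# overall speedup when a fraction f of progress is labor-bottlenecked
# and that labor is sped up by a factor s.
def overall_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(overall_speedup(f=0.9, s=10))  # ~5.3x: mostly labor-constrained (e.g. evaluations)
print(overall_speedup(f=0.3, s=10))  # ~1.4x: mostly compute-constrained
```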
The lower costs of labor have other implications as well, likely including security (see also here) and technical governance (making better verification methods technically feasible).
2: The high-level strategy
If I were to now write a plan for two-to-three-year timelines, the high-level strategy I'd choose is:
Don't build generally vastly superhuman AIs. Use whatever technical methods we have now to control and align AIs which are less capable than that. Drastically speed up (technical) governance work with the AIs we have.[1] Push for governments and companies to enforce the no-vastly-superhuman-AIs rule.
Others might have different strategies; I'd like these plans to discuss what the high-level strategy or aims are.
3: Organizational competence
Reasoning transparency and a safety-first culture are mentioned in the post (in Layer 2), but I'd further prioritize and plan the organizational aspects, even when aiming for "the bare minimum". Besides the general importance of organizational competence, there are two specific reasons for this:
(I think the responses to Evan Hubinger's request for takes on what Anthropic should do differently have useful ideas for planning here.)
Note: I'm not technically knowledgeable on the field.
I'm glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
2: Have a written list of assumptions you aim to maintain for each model's lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Communicate updates and violations at least internally.
These lists could vary based on ASL-levels etc., and could include things like:
They could also have conditional statements (e.g. "if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / ..."). Cf. safety cases. I intend this as something less binding and formal than Anthropic's RSP.
3: Keep external actors up to speed. At present, I expect that in many cases there are months of delay between when the first employees discover something and when it becomes publicly known (this holds for research, but also for more informal observations about model capabilities and properties). Months of delay are relatively long during a fast acceleration of AI R&D, and they shrink the number of actors who can effectively contribute.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress in those would make international coordination easier. This seems like a particularly valuable R&D area to automate, which is something frontier AI companies like Anthropic are uniquely fit to advance. Consider working with technical governance experts on how to go about this.
I sometimes use the notion of natural latents in my own thinking - it's useful in the same way that the notion of Bayes networks is useful.
A frame I have is that many real-world questions involve hierarchical latents: for example, the vitality of a city is determined by employment, the number of companies, migration, free-time activities and so on, and "free-time activities" is a latent (or multiple latents?) in its own right.
I sometimes get value out of assessing whether the topic at hand is a high-level or low-level latent and orienting accordingly. For example: if the topic at hand is "what will the societal response to AI be like?", it's by default not a great conversational move to talk about one's interactions with ChatGPT the other day - those observations are likely too low-level[1] to be informative about the high-level latent(s) under discussion. Conversely, if the topic at hand is low-level, then analyzing low-level observations is very sensible.
(One could probably have derived the same every-day lessons simply from Bayes nets, without the need for natural latent math, but the latter helped me clarify "hold on, what are the nodes of the Bayes net?")
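A toy version of the hierarchy I have in mind (the node names are my own illustration, not from the natural-latents posts):

```python
# Toy hierarchical-latent structure for the city-vitality example
# (node names are my own illustration, not from the natural-latents posts).
# Each node maps to the higher-level latent(s) it hangs off.
parents = {
    "city_vitality": [],
    "employment": ["city_vitality"],
    "number_of_companies": ["city_vitality"],
    "migration": ["city_vitality"],
    "free_time_activities": ["city_vitality"],
    # "free_time_activities" is itself a latent over lower-level observations:
    "restaurant_visits": ["free_time_activities"],
    "concert_attendance": ["free_time_activities"],
}

def level(node: str) -> int:
    """Depth below the top-level latent: a crude proxy for how low-level
    an observation is relative to the question under discussion."""
    ps = parents[node]
    return 0 if not ps else 1 + max(level(p) for p in ps)

print(level("city_vitality"))      # 0
print(level("restaurant_visits"))  # 2
```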
But admittedly, while this is a fun perspective to think about, I haven't got that much value out of it so far. This is why I give this post +4 instead of +9 for the review.
And, separately, of too small a sample size.
This looks reasonable to me.
It seems you'd largely agree with that characterization?
Yes. My only hesitation is about how important it is in real life for AIs to be able to do math for which very little to no training data exists. The internet and the mathematical literature are so vast that, unless you are doing something truly novel, there's some relevant subfield there - in which case FrontierMath-style benchmarks would be informative of the capability to do real math research.
Also, re-reading Wentworth's original comment, I note that o1 is weak according to FM. Maybe the things Wentworth is doing are just too hard for o1, rather than this being (just) an overfitting-on-benchmarks issue? In any case, his frustration with o1's math skills doesn't mean that FM isn't measuring real math research capability.
[...] he suggests that each "high"-rated problem would be likewise instantly solvable by an expert in that problem's subfield.
This is an exaggeration and, as stated, false.
Epoch AI made 5 problems from the benchmark public. One of those was ranked "High", and that problem was authored by me.
On the other hand, I don't think the problem is very hard insight-wise - I think it's pretty routine, but requires care with details and implementation. There are certainly experts who can see the right main ideas quickly (including me). So there's something to the point of even FrontierMath problems being surprisingly "shallow". And as is pointed out in the FM paper, the benchmark is limited to relatively short-scale problems (hours to days for experts) - which really is shallow, as far as the field of mathematics is concerned.
But it's still an exaggeration to talk about "instantly solvable". Of course, there's no escaping Engel's maxim "A problem changes from impossible to trivial if a related problem was solved in training" - I guess the problem is instantly solvable to me now... but if you are hard-pressed to find humans who could solve it "instantly" on seeing it for the first time, then I wouldn't describe it in those terms.
Also, there are problems in the benchmark that require more insight than this one.
Daniel Litt writes about the problem: "This one (rated "high") is a bit trickier but with no thinking at all (just explaining what computation I needed GPT-4o to do) I got the first 3 digits of the answer right (the answer requires six digits, and the in-window python timed out before it could get this far)
Of course *proving* the answer to this one is correct is harder! But I do wonder how many of these problems are accessible to simulation/heuristics. Still an immensely useful tool but IMO people should take a step back before claiming mathematicians will soon be replaced".
I very much considered naive simulations and heuristics when designing the problem. The difficulty is getting 6 digits right, not 3. (The AIs are given a limited compute budget.) Getting the first three digits is not valid evidence of the problem's easiness or of the benchmark's accessibility to simulation/heuristics - indeed, it is evidence in the opposite direction.
See also Evan Chen's "I saw the organizers were pretty ruthless about rejecting problems for which they felt it was possible to guess the answer with engineer's induction."
For background, see this post on trustworthiness.
Here are two notions of trustworthiness that are useful to distinguish:

1: The model does not generally act on misaligned goals; its behavior is fine in typical, low-stakes situations.

2: The model does not take catastrophic actions even in high-stakes situations where it has a clear opportunity to do so.

These notions can come apart substantially. A central example is a model that has never had any serious misaligned thoughts but, when presented with a clear opportunity to escape, takes it. This model would be trusted under the first notion, but not under the second.
Another example is the case where a weaker model is monitoring a stronger model. If during deployment the weak model obtains strong evidence that it really is in deployment (possibly because the strong model aims for this outcome), the weak model might end up not reporting bad actions that the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.
I find this distinction useful for two reasons:
First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.
Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research, we have run capability evaluations on the model, we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.
I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.
We might be able to ensure the model can’t reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is “the model won’t know it’s deployed” versus “the model can recognize when it’s deployed, but it still won’t take any bad actions because it’s a good guy”.