we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.
They made a mini model card. Notably:
The UK AISI also conducted pre-deployment testing of a near-final model, and shared their results with the US AI Safety Institute . . . . Additionally, METR did an initial exploration of the model’s autonomy-relevant capabilities.
It seems that UK AISI only got maximally shallow access (presumably Anthropic would have said otherwise), and in particular the model card describes "internal research techniques to acquire non-refusal model responses," suggesting that external evaluators did not get that kind of access. This is better than nothing, but it would be unsurprising if an evaluator with such shallow access fails to elicit dangerous capabilities that users, with far more time and access to future elicitation techniques, ultimately can. Recall that DeepMind, in contrast, gave "external testing groups . . . . the ability to turn down or turn off safety filters."
Anthropic CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed "to not meaningfully advance the frontier with a launch." (Gwern got a similar impression, and this was definitely Anthropic's vibe around 2022,[1] though it was never a hard public commitment.) Perhaps Anthropic does not consider itself bound by this, which might be reasonable, but it's quite disappointing that Anthropic hasn't clarified its commitments, particularly after the confusion on this topic around the Claude 3 launch.
You have to point the issue out to the model to see whether it might be able to understand it; it doesn't visibly happen on its own, and it's hard to judge how well the model understands what's happening with its behavior unless you start discussing it in detail (the extent needed differs between models). The process I'm following to learn about this is: when the model repeatedly can't make progress on some object-level problem, I start discussing the general reasoning skills it's failing at (rather than the details of the object-level problem itself), and then observe how it fails to understand and apply the general reasoning skills I'm explaining.
I'd say the current best models are not yet at the stage where they can understand such issues well when I try to explain them, so I don't expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.