we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.
They made a mini model card. Notably:
The UK AISI also conducted pre-deployment testing of a near-final model, and shared their results with the US AI Safety Institute . . . . Additionally, METR did an initial exploration of the model’s autonomy-relevant capabilities.
It seems that UK AISI only got maximally shallow access, since Anthropic would have said if not, and in particular the model card mentions "internal research techniques to acquire non-refusal model responses" as internal. This is better than nothing, but it would be unsurprising if an evaluator with shallow access is unable to elicit dangerous capabilities but users—with much more time and with access to future elicitation techniques—ultimately are. Recall that DeepMind, in contrast, gave "external testing groups . . . . the ability to turn down or turn off safety filters."
Anthropic CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed "to not meaningfully advance the frontier with a launch." (Plus Gwern, and this was definitely Anthropic's vibe around 2022,[1] although not a hard public commitment.) Perhaps Anthropic does not consider itself bound by this, which might be reasonable, but it's quite disappointing that Anthropic hasn't clarified its commitments, particularly after the confusion on this topic around the Claude 3 launch.
I'm pointing at two facts: the things the model can't do produce this sort of whack-a-mole behaviour, and the shape of that behaviour hasn't really changed as the models have grown better at individual tasks. I'm suggesting that together these may indicate a fundamental gap shared by all models in this class, one that might not go away until some new fundamental insight comes along: more "steps of scaling" might not do the trick.
Of course it might not matter, if the models become able to do more and more difficult things until they can do everything humans can do, in which case we might not be able to tell whether the whack-a-mole failure mode is still there. My highly unreliable intuition says that the whack-a-mole failure mode is related to the planning and "general reasoning" lacunae you mention, and that those might turn out also to be things that models of this kind don't get good at just by being scaled further.
But I'm aware that people saying "these models will never be able to do X" tend to find themselves a little embarrassed when two weeks later someone finds a way to get the models to do X. :-) And, for the avoidance of doubt, I am not saying anything even slightly like "mere computers will never be truly able to think"; only that it seems like there may be a hole in what the class of models that have so far proved most capable can be taught to do, and that we may need new ideas rather than just more "steps of scaling" to fill that hole.