Substantial parts of the prompt were changed for some settings, and we didn't see a substantial change in Opus's results relative to other models. The original prompt wasn't very optimized for Opus either; Ryan described it as "optimized for opus to understand and to conflict with prefs". That the results persisted after the changes seems pretty compelling.
Yep agreed—I didn't remember that I could just check tokenizers, thanks! GPT-4 vs GPT-4-base is much better to look at.
I agree that it's very fragile for certain uses. I think usually the reasoning is something like "this is an informal norm that labs should follow, which is good if generally enforced, and which provides an easy way to detect violations". There are tons of ways to beat this if you're adversarially trying to. I agree that finding better ways to avoid contamination would be good—@TurnTrout has a bounty for this, and some people are working on it now.
Thanks for the post! I think there are particularly egregious examples where this clearly holds—e.g., I think it's probably good to have a norm that, when pushed on something specific that's ambiguous, authors say "[This] is what I meant" instead of constantly evading being pinned down.
That said, I think there are many essays whose authors simply did not foresee a potential interpretation or objection (an easy example is an essay written years before such discussion existed). Such writing can conflate things because it didn't seem important not to; I think precision to the degree of always forestalling this is ~infeasible. I've talked to people who point at clear cases of this as sloppy writing or as authors reneging on this responsibility, when I think that particular ask is far too high.
To be clear, this is to forestall what I think is a potential interpretation of this post, not to imply that's what you meant.
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.
One reason why they might be worse is that chain of thought might make less sense for diffusion models than for autoregressive models. If you look at an example of when different tokens are predicted during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through rather than at the end:
This doesn't mean intermediate tokens can't help, though, and they very likely do. But this kind of structure might lend itself more toward getting to less legible reasoning faster than autoregressive models do.
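For intuition, here's a minimal toy sketch (not the LLaDA or Mercury implementation; the sequence length, step count, and confidences are all made up) of confidence-based unmasking, which is roughly why answer tokens can be revealed mid-sequence well before the text around them:

```python
import random

random.seed(0)

seq_len = 16   # toy sequence: positions 0..15; imagine the "answer" lives near the middle
num_steps = 4  # number of denoising steps

masked = set(range(seq_len))
revealed_at = [None] * seq_len  # which step each position was unmasked at

for step in range(num_steps):
    # A real model would score every masked position from its predicted token
    # distributions; here we just fake per-position confidences.
    confidence = {i: random.random() for i in masked}

    # Unmask the most confident positions this step (roughly even budget per step),
    # so reveal order is driven by confidence rather than left-to-right position.
    budget = len(masked) // (num_steps - step)
    for i in sorted(confidence, key=confidence.get, reverse=True)[:budget]:
        revealed_at[i] = step
        masked.remove(i)

print("step at which each position was revealed:", revealed_at)
# High-"confidence" positions get revealed early regardless of where they sit,
# which is how an answer can appear halfway through the sequence before the
# surrounding text is filled in.
```

This only shows the reveal-order mechanics; the legibility question is whether the intermediate, partially-masked states carry interpretable reasoning the way autoregressive CoT does.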
The results seem very interesting, but I'm not sure how to interpret them. Comparing the generation videos from this and Mercury, the starting text in each seems very different in how much it resembles the final output:
Unless I'm missing something really obvious about these videos or about how diffusion models are trained, I would guess that DeepMind fine-tuned their models on a lot of high-quality synthetic data—enough that their initial generations already match the approximate structure of a model response with CoT. This would partially explain why they seem so impressive even at such a small scale, but it would make the scaling laws less comparable to those of autoregressive models because of how much high-quality synthetic data can help.
FYI, I couldn't click into this from the front page, nor see any of its content there. I had to go to the permalink to see it (and at first I assumed this was a joke post with no content).
I enjoyed this paper, especially the RL stress-testing results!
A few thoughts on measuring CoT faithfulness: