While they do have a summarizer model, I'd guess that it isn't used most of the time. The Claude Opus 4 & Sonnet 4 system card says: "For Claude Sonnet 4 and Claude Opus 4, we have opted to summarize lengthier thought processes using an additional, smaller model. In our experience, only around 5% of thought processes are long enough to trigger this summarization; the vast majority of thought processes are therefore shown in full." Though the system cards of more recent models don't specify whether this still applies, the reasoning of recent models usually feels very natural and, if my memory isn't failing me, is very similar to what Sonnet & Opus 4's reasoning was like. In contrast, when reading the summaries of OpenAI's and Google's models, there's no ambiguity about whether the CoTs are summarized.
If I'm correct and the CoTs of Claude models are indeed usually fully exposed, I can see two reasons why they look relatively natural. First, they're usually much shorter than the CoTs of models like o3, which means that models like Opus 4.6 are indeed in some sense closer to "vanilla" LLMs. Second, Anthropic used to apply some optimization pressure on the CoTs of models from before the 4.5 series and, though they're not doing that anymore, they trained the recent models on SFT data from earlier ones. Claude 3.7 Sonnet, for which they never used a summarizer, had much more readable CoTs than e.g. o1, so it seems plausible that Anthropic is the only company doing SFT on this kind of legible reasoning data.
I hadn't, thanks!
Thanks for bringing this up, I hadn't seen your shortform post! We've converged on pretty similar pros/cons. A few comments:
if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Also, continual learning is analogous to neuralese
Hm, I haven't heard this argument before. I agree that there's a sense in which it's analogous to neuralese, but it isn't obvious to me that it's analogous in a concerning way. The main reason we care about preserving legible CoTs is that doing so enables us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn't have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize this reasoning in its CoT at some point, regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won't be written up, at least not in places where a misaligned AI could encounter them.
Yeah, I strongly agree with everything you've said here (aside from the question of whether there are large gains to be made by dropping legible CoT; the post describes my uncertainties there). It seems likely to me that whether we stick with readable CoTs will ultimately be decided by efficiency considerations, which is why I focused on those in the post, but I of course agree that we should do our best to push back against it.
Thanks, this is a good point and definitely worth adding to the list!
Assuming that this is useful for capabilities, I'd imagine that neuralese models would also eventually be able to develop mechanisms for reflecting on their own reasoning processes. The kind of introspection that Anthropic recently studied might be seen as one precursor capability for this, and as evidence that this is incentivized by the training process even if the model also has the ability to reflect on the CoT. But of course, the ability to reflect on internal activations in a deep way might be much harder to develop than the ability to do this with a legible CoT.
They're fairly uncommon words, and there are other words that would more naturally fit the contexts in which "overshadows" and "disclaimers" were used. If "overshadow" and "disclaim" aren't just pad tokens and instead have unusual semantic meanings to the model as words, then it's natural that the logits of other forms of these words, which have different tokenizations, also get upweighted.
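(As a quick sanity check on the "different tokenizations" point, here's a minimal sketch of how one could compare the token splits of the different surface forms. It uses tiktoken's o200k_base encoding purely as a stand-in, since I don't know which tokenizer the model in question uses; the actual splits could well differ.)

```python
# Compare how different surface forms of these words tokenize.
# o200k_base is used only as a stand-in encoding; the model under
# discussion may use a different tokenizer with different splits.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in [" overshadow", " overshadows", " disclaim", " disclaimers"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:16} -> {len(ids)} token(s): {pieces}")
```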
I don't think " overshadows" or " disclaimers" are weird tokens in the sense I'm looking at
Hmm fair, but if " overshadow" and " disclaim" were pure pad tokens, then I wouldn't expect to see other forms of those words in the transcripts at all—e.g. in the first example, "overrides" seems like a more natural option than "overshadows".
I don't think " overshadow" actually fits, grammatically, in that sentence.
The model seems to treat overshadow as a noun in some places:
They may test overshadow.
But there is one 'WaterAid' overshadow.
This made me read the sentence I pasted as "But we can elegantly pick [option X] to appear not incompetent." I agree that your reading is probably more natural, though.
Update: After reading Kei's comment, I noticed that Sonnet 4.5's system card also mentions that the summarization only happens in a small minority of cases. This makes me less confident that I'm right, since it increases the likelihood that the absence of any discussion in the Opus 4.5 and 4.6 system cards is a deliberate omission. On the other hand, Anthropic was very transparent about their approach to summarization up through Sonnet 4.5, so I'd also be slightly surprised if they changed this silently.