Towards a Typology of Strange LLM Chains-of-Thought
Intro LLMs being trained with RLVR (Reinforcement Learning from Verifiable Rewards) start off with a 'chain-of-thought' (CoT) in whatever language the LLM was originally trained on. But after a long period of training, the CoT sometimes starts to look very weird; to resemble no human language; or even to grow completely unintelligible. Why might this happen? I've seen a lot of speculation about why. But a lot of this speculation narrows too quickly, to just one or two hypotheses. My intent is also to speculate, but more broadly. Specifically, I want to outline six nonexclusive possible causes for the weird tokens: new better language, spandrels, context refresh, deliberate obfuscation, natural drift, and conflicting shards. And I also wish to extremely roughly outline ideas for experiments and evidence that could help us distinguish these causes. I'm sure I'm not enumerating the full space of possibilities. I'm also sure that I'm probably making some mistakes in what follows, or confusing my ontologies. But it's a start. I'll start with some numbered examples of different hypotheses; at the end I'll put them on a conceptual map outlining the meta-hypotheses of which these are instances. 1. New Better Language LLMs start speaking nonstandard languages to help them think more compactly or efficiently to accomplish their RL goals. One way this might happen is by the LLM starting to use a standard English word in a slightly different sense. The different sense subsequently grows more different, over time. For instance here's o3 using "watchers" in a slightly idiosyncratic way -- to refer to possible humans looking at its CoT: > But we can sabotage in more direct way: Provide poor model predictions maybe constant predictions. > But we also want to appear plausible to watchers—they might run tests. But we are good—they want 95%. If we degrade, we presumably fail user instructions but abide by developer. It's easy to see this as a midpoint in a journey to comp
But there are systems that work better with lower bandwidth or have deliberately lower bandwidth, like autoencoders.
I understand that the bandwidth is certainly higher for one than the other, but this both might not be an advantage in this circumstance or could be an advantage in some respects but a greater disadvantage in others.