I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to produce translation pairs with redundant content. But I agree one big difference with the monolingual corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language is hard for small-scale SL and large-scale RL to learn.
(I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)
I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.
The experiments
I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models largely fail to learn the encoding.
This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix.
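For concreteness, here is a minimal sketch of this kind of rare encoding (the name pool and the mapping below are made up for illustration, not the exact ones I used):

```python
import random

random.seed(0)

# Made-up pool of names; letters are assigned to names at random, so the
# encoding has no obvious alphabetical or phonetic structure to exploit.
NAMES = ["Alice", "Bob", "Carol", "David", "Erin", "Frank", "Grace", "Heidi",
         "Ivan", "Judy", "Kevin", "Laura", "Mallory", "Niaj", "Olivia", "Peggy",
         "Quentin", "Rupert", "Sybil", "Trent", "Ursula", "Victor", "Walter",
         "Xavier", "Yolanda", "Zara"]
LETTER_TO_NAME = dict(zip("abcdefghijklmnopqrstuvwxyz", random.sample(NAMES, 26)))

def encode(text: str) -> str:
    """Replace each letter by its assigned name and join with spaces;
    other characters (digits, punctuation, spaces) are kept as-is."""
    return " ".join(LETTER_TO_NAME.get(c, c) for c in text.lower())

print(encode("hello world"))  # a sequence of names, one per letter
```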
I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.
These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large-scale SFT or SFT with better base models. But given that RL doesn't teach models that many new bits (you can distill r1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL.
Context
I ended up giving up on the broader projects these experiments were part of, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages.
I did not carefully run any of the experiments above, and I welcome replications.
Related observations from the literature
I think there is also a somewhat common misconception that R1-Zero scratchpads are hard to understand. The paper reports language switching, but that's not hard to understand for multilingual humans. And the default scratchpads you get are usually plain English that is extremely easy to understand: see an example scratchpad (sampled at T=1) here.
The scratchpads from OpenAI's reward hacking paper are slightly more compact than regular English, but they use a kind of language that is not that far from English (and is probably not that far from what appears in some corners of the internet at least as frequently as text encoded in random letter permutations).
Maybe the most relevant positive result in this field comes from Unsupervised translation (i.e. translation algorithms that work without pairs, only with two monolingual corpora). But this research is usually done on pairs of human languages, for which word-by-word translation offers an amazing bootstrapping platform that does not exist for the sort of more efficient encodings that would be needed for LLMs to learn truly hard-to-decipher languages. It also uses dedicated algorithms that are probably roughly as efficient as SFT on LLMs and probably much stronger than large-scale RL for learning new languages.
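As a rough illustration of what that bootstrapping platform looks like, here is a minimal sketch of word-by-word translation via nearest neighbors in two independently trained embedding spaces, assuming some linear alignment map W has already been obtained (e.g. via Procrustes on a small seed dictionary, or adversarially, as in the unsupervised translation literature). None of this is code from the experiments above:

```python
import numpy as np

def word_by_word_translate(src_vecs, tgt_vecs, tgt_words, W):
    """For each source word embedding, map it into the target space with W and
    return the nearest target word by cosine similarity."""
    mapped = src_vecs @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = mapped @ tgt.T  # (n_src_words, n_tgt_words) cosine similarities
    return [tgt_words[i] for i in sims.argmax(axis=1)]
```

The point is that this only works because the two corpora share word-level structure (words with similar co-occurrence statistics exist on both sides); a compressed or letter-level encoding gives you no such unit to align.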
More discussion
I agree this is a concern.
My view here is that if you look at the paraphrases, it seems that a decent amount of standardization happened. Empirically, it also looks like T=1 sampling of the paraphraser introduces random variations in the syntax. So I would be very surprised if a decent chunk of the syntactic link was not broken (e.g. at least 20% of it). Therefore I think it's reasonable to bound the performance due to encoded reasoning by ~5x the gap between the original distillation and the paraphrased distillation, which would be ~7% (in absolute pass@1). That still seems not very frightening to me given that the no-CoT vs CoT gap is >50%.
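Spelled out (the ~1.4% gap figure is just what the ~7% bound implies, so treat these numbers as approximate):

```python
# Back-of-the-envelope version of the bound above.
observed_gap = 0.014     # original vs paraphrased distillation gap (absolute pass@1), implied by the ~7% figure
broken_fraction = 0.20   # assumed lower bound on how much of the syntactic link paraphrasing breaks
encoded_reasoning_bound = observed_gap / broken_fraction
print(f"bound on encoded-reasoning effect: {encoded_reasoning_bound:.0%}")  # ~7%, vs a >50% no-CoT vs CoT gap
```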
This is sensitive to the size of the gap. If future work finds a ~5% gap on a new model, I'd be very concerned because it would make it more likely that the majority of the reasoning is encoded.
(Edit: This is a rough sketch of the argument, there are complications around diminishing returns on reasoning: maybe only a very small amount of visible reasoning is responsible for the majority of the no-CoT vs CoT gap. But it's unclear how much this undermines my argument because I most care about the ability of the model to not leak anything about what it is reasoning about.)
On your notes:
If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at? My hope is that paraphrasing removes the mutual information through standardization / randomization (where what is random is not the map, but the actual samples, because I am sampling at temperature 1). For example, if enumerated list => equation and bulleted list => something else, then swapping enumerated and bulleted would not help, but having the model always use a bulleted list, or use a bulleted list a random 50% of the time, does help to break the enumerated list <=> equation link.
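Here is a toy illustration of that distinction (made-up joint distribution, purely to show the information-theoretic point): a deterministic swap of list styles preserves the mutual information between the hidden content and the surface form, while standardizing or randomizing the list style destroys it.

```python
import random
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Empirical mutual information (in bits) between the two coordinates of `pairs`."""
    n = len(pairs)
    pxy, px, py = Counter(pairs), Counter(x for x, _ in pairs), Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
contents = [random.choice(["equation", "no_equation"]) for _ in range(10_000)]
surface = ["enumerated" if c == "equation" else "bulleted" for c in contents]   # the "leak"

swapped = ["bulleted" if s == "enumerated" else "enumerated" for s in surface]  # bijective paraphrase
standardized = ["bulleted" for _ in surface]                                    # always use a bulleted list
randomized = [random.choice(["enumerated", "bulleted"]) for _ in surface]       # 50/50 at random

print(mutual_information(list(zip(contents, surface))))       # ~1 bit: content readable from syntax
print(mutual_information(list(zip(contents, swapped))))       # ~1 bit: a bijection keeps the leak
print(mutual_information(list(zip(contents, standardized))))  # 0 bits: standardization removes it
print(mutual_information(list(zip(contents, randomized))))    # ~0 bits: randomization removes it
```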
I agree that it's unclear where paraphrasing is on the [bijection]-[no syntax left unrandomized] spectrum.
I see 2 issues with the experiment you suggested:
How happy would you be about the following experiment:
This way the inference-time experiment is not OOD and should not reduce performance. This has roughly the same drawbacks as the experiment I described in this post though (i.e. risk of "learning the paraphrased version of the encoded reasoning scheme", not testing OOD generalization, ...), so I am not sure if this is valuable.
Another experiment:
Cut scratchpads into chunks of N paragraphs each. Wrap them in <chunk> tags and distill them (without paraphrasing). Then do the inference-time experiment. This gets rid of the FUD around chunking reducing performance (which I observed, as described in the footnote). This could still result in degraded performance just because it's OOD, but arguably it's slightly worrisome if being exactly IID matters for scratchpads.
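Something like the following (a minimal sketch; the paragraph splitting and the exact tag format are illustrative assumptions):

```python
def chunk_scratchpad(scratchpad: str, n: int) -> str:
    """Split a scratchpad into chunks of n paragraphs each and wrap each chunk
    in <chunk> tags, leaving the text itself unparaphrased."""
    paragraphs = [p for p in scratchpad.split("\n\n") if p.strip()]
    chunks = [paragraphs[i:i + n] for i in range(0, len(paragraphs), n)]
    return "\n".join("<chunk>\n" + "\n\n".join(chunk) + "\n</chunk>" for chunk in chunks)
```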
I am also curious if you have takes about what value of N (or what chunking scheme) would make the most sense.
I chose math because math is the main domain where there is a large CoT uplift. I am curious whether you know any good task where reasoning over conflicting goals is best done using a CoT (the number of domains where CoTs are useful is surprisingly small).
Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.
I don't get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.
This is why "encoded reasoning" is maybe a better expression. The experiments to evaluate whether it is present stay the same.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens.
I agree this is a concern. This can be tested more directly by adding filler tokens, and I find that adding filler tokens (for base model L) doesn't perform better than having the final answer directly (see last figure).
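For reference, the filler-token comparison is roughly the following (the prompt format and token count are illustrative, not the exact setup):

```python
def prompt_with_filler(question: str, n_filler: int = 50, filler: str = " ...") -> str:
    """Give the model extra 'do-nothing' tokens before it must answer, so any uplift
    from pure parallel computation at extra token positions would show up here."""
    return f"{question}\n{filler * n_filler}\nFinal answer:"

def prompt_without_filler(question: str) -> str:
    """Baseline: ask for the final answer directly, with no extra tokens."""
    return f"{question}\nFinal answer:"
```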
But as with most things, there's going to be a long tail of "unclean" examples - places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way.
I agree it won't be very clean. But the most scary scenarios are the ones where an AI can actually have thoughts that are independent of what the Chain-of-Thought looks like, since this seriously compromises CoT monitoring. So while I am curious about the ways in which LLM CoTs work more subtly than the "naive way", I think this is much lower stakes than figuring out whether LLMs can do proper encoded reasoning.
I am more worried about things like "semantic encoded reasoning" which paraphrasing would not remove, but I would guess there is as little "semantic encoded reasoning" as there is "syntactic encoded reasoning" in current LLMs.
Edited to say "small-scale SL and large-scale RL"