Fabien Roger

Edited to say "small-scale SL and large-scale RL"

I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to produce translation pairs with redundant content. But I agree one big difference from the monolingual-corpus case is that you may have part 1 of the text in one language and part 2 in the other, which could help a lot. I think even this sort of language is hard for small-scale SL and large-scale RL to learn.

(I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)

I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.

The experiments

I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models:

  • Have relatively high next-token-prediction losses
  • Can't speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish).

This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix.
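To make the setup concrete, here is a minimal sketch of the kind of rare encoding described above (the specific name list and this Python implementation are mine, for illustration; they are not the exact mapping used in the experiments):

```python
# Illustrative sketch (not the exact encoding used): map each letter (and space)
# to a fixed random name, and join the names to encode text.
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz "
NAMES = ["Alice", "Bob", "Carol", "David", "Erin", "Frank", "Grace", "Heidi",
         "Ivan", "Judy", "Karl", "Laura", "Mallory", "Niaj", "Olivia", "Peggy",
         "Quentin", "Rupert", "Sybil", "Trent", "Uma", "Victor", "Wendy",
         "Xavier", "Yolanda", "Zoe", "Oscar"]

rng = random.Random(0)
shuffled = NAMES[:]
rng.shuffle(shuffled)
LETTER_TO_NAME = dict(zip(LETTERS, shuffled))
NAME_TO_LETTER = {name: letter for letter, name in LETTER_TO_NAME.items()}

def encode(text: str) -> str:
    return " ".join(LETTER_TO_NAME[c] for c in text.lower() if c in LETTER_TO_NAME)

def decode(encoded: str) -> str:
    return "".join(NAME_TO_LETTER[name] for name in encoded.split())

print(encode("hello world"))
print(decode(encode("hello world")))  # -> "hello world"
```

The SFT data then consists only of encoded text (and optionally encoded question --> encoded answer pairs), with no encoded/plain translation pairs.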

I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.

These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large-scale SFT or SFT with better base models. But given that RL doesn't teach models that many new bits (you can distill R1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL.

Context

I ended up giving up on the broader projects that these experiments were part of, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages.

I did not carefully run any of the experiments above, and I welcome replications.

Related observations from the literature

I think there is also a somewhat common misconception that R1-Zero scratchpads are hard to understand. The paper reports language switching, but that is not hard to understand for multilingual humans. And the default scratchpads you get are usually plain English that is extremely easy to understand: see an example scratchpad (sampled at T=1) here.

The scratchpads from OpenAI's reward hacking paper are slightly more compact than regular English, but they use a kind of language that is not that far from English (and is probably not that far from what must appear in some corners of the internet at least as frequently as text encoded with random letter permutations).

Maybe the most relevant positive result in this area comes from unsupervised translation (i.e. translation algorithms that work without pairs, only with two monolingual corpora). But this research is usually done on pairs of human languages, for which word-by-word translation offers an amazing bootstrapping platform that does not exist for the sort of more efficient encodings that would be needed for LLMs to learn truly hard-to-decipher languages. It also uses dedicated algorithms that are probably roughly as efficient as SFT on LLMs and probably much stronger than large-scale RL at learning new languages.

More discussion

  • Anecdotally, when LLMs speak other languages (e.g. French), it sometimes feels a little bit like they are speaking a translated version of English as opposed to thinking "natively" in French. This is very weak evidence because it could be false, or due to RLHF; additionally it is not that interesting, since French-to-English translation is probably easier than translation to and from the sort of incomprehensible languages that may increase RL performance.
  • Is it worrying if LLMs start using weird encodings from weird parts of the internet that small base models can't understand? I think it's somewhat worrying, but not as much as "de novo" incomprehensible languages because:
    • It means some humans can probably translate to and from this encoding quite easily, which means it should be feasible for people at labs to notice and learn this encoding
    • Light SFT on top of the strongest base models is likely to elicit a translator for the weird encoding (which could be much trickier if the language were spoken only by a scheming, heavily RLed model)
  • Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).

I agree this is a concern.

My view here is that if you look at the paraphrases, it seems that a decent amount of standardization happened. Empirically, it also looks like T=1 sampling of the paraphraser introduces random variations in the syntax. So I would be very surprised if a decent chunk of the syntactic link was not broken (e.g. at least 20% of it). Therefore I think it's reasonable to bound the performance due to encoded reasoning at ~5x the gap between the original distillation and the paraphrased distillation, which would be ~7% (in absolute pass@1). That still seems not very frightening to me given that the no-CoT vs CoT gap is >50%.
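Spelling out the arithmetic (the ~1.4% gap below is just the value implied by the ~7% figure above, not a separately reported number):

```latex
\underbrace{5 \times \big(\text{pass@1}_{\text{distilled}} - \text{pass@1}_{\text{paraphrased}}\big)}_{\approx\, 5 \times 1.4\% \,\approx\, 7\%}
\;\ll\;
\underbrace{\text{pass@1}_{\text{CoT}} - \text{pass@1}_{\text{no-CoT}}}_{>\, 50\%}
```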

This is sensitive to the size of the gap. If future work finds a ~5% gap on a new model, I'd be very concerned because it would make it more likely that the majority of the reasoning is encoded.

(Edit: This is a rough sketch of the argument; there are complications around diminishing returns on reasoning: maybe only a very small amount of visible reasoning is responsible for the majority of the no-CoT vs CoT gap. But it's unclear how much this undermines my argument, because what I care about most is the model's ability not to leak anything about what it is reasoning about.)

  1. Both (the final answers are usually very short compared to the CoT)
  2. The 1-sigma uncertainty over the mean is sigma = sqrt(acc * (1-acc) / n) ~ 1% (see the small numeric illustration after this list). But I think the results are slightly more significant than this suggests: I evaluate all models on the exact same set of questions, and given that chunk-paraphrased is worse across experiments on multiple models, I would guess the same effect would hold on a much larger eval.
  3. Good catch, I think I wrote that based on earlier results. I'll edit the post.
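As a numeric illustration of the sigma estimate in point 2 (the accuracy and question count below are hypothetical placeholders, not the actual numbers from the eval):

```python
# Binomial standard error of the mean accuracy; acc and n are made-up
# placeholder values chosen so that sigma comes out to about 1 percentage point.
acc, n = 0.6, 2_400
sigma = (acc * (1 - acc) / n) ** 0.5
print(f"sigma ~ {sigma:.3f}")  # ~0.010, i.e. ~1%
```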

On your notes:

If paraphrasing were a bijection between semantically equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at? My hope is that paraphrasing removes the mutual information through standardization / randomization (where what is random is not the map, but the actual samples, because I am sampling at temperature 1). For example, if enumerated list => equation and bulleted list => something else, then swapping enumerated and bulleted would not help, but having the model always use a bulleted list, or use a bulleted list a random 50% of the time, does help to break the enumerated list <=> equation link.

 I agree that it's unclear where paraphrasing is on the [bijection]-[no syntax left unrandomized] spectrum.
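As a toy illustration of that distinction (my own construction, not an experiment from the post): a single "syntactic" choice carries one hidden bit; a bijective swap of the two styles preserves the mutual information, while standardizing or re-randomizing the style at T=1 destroys it.

```python
# Toy model: the list style (enumerated vs. bulleted) encodes one hidden bit.
import random
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of the mutual information (in bits) between the two coordinates."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
hidden_bits = [random.randint(0, 1) for _ in range(10_000)]
# "Encoded reasoning": the list style deterministically encodes the hidden bit.
original = ["enumerated" if b else "bulleted" for b in hidden_bits]

swapped = [{"enumerated": "bulleted", "bulleted": "enumerated"}[s] for s in original]  # bijection
standardized = ["bulleted"] * len(original)                                            # always the same style
randomized = [random.choice(["enumerated", "bulleted"]) for _ in original]             # 50/50 at "T=1"

for name, styles in [("original", original), ("bijective swap", swapped),
                     ("standardized", standardized), ("randomized", randomized)]:
    mi = mutual_information(list(zip(styles, hidden_bits)))
    print(f"{name:>15}: {mi:.3f} bits")
# original and bijective swap keep ~1 bit of mutual information;
# standardized and randomized drop it to ~0 bits.
```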

I see 2 issues with the experiment you suggested:

  • The chunking (where do you stop to paraphrase? To have something meaningful to paraphrase, I think you want something like every "big paragraph", but maybe not each equation, which could each be a single line. I don't know how to do that cleanly without doing the distillation, and not doing it cleanly might result in benign performance degradation.)
  • The OODness, but I guess you're fine with this one? I suspect this might get you worrisome results for no worrying reason, just because it's somewhat degenerate to prefill a scratchpad with a big paraphrase, get a new paragraph which is probably already influenced by the paraphrasing style, then paraphrase that. Besides the paraphrasing style being distracting, there is also a "paraphrase of paraphrase of paraphrase of ..." effect that might degrade the scratchpad for "normal" reasons, but in ways that are hard to pull apart from the worrying reasons.

How happy would you be about the following experiment:

  1. Cut scratchpads into chunks of N paragraphs each
  2. Paraphrase the first k chunks (where k is drawn at random), wrap them in <paraphrased_chunk> or <chunk> tags, and join them
  3. Distill that
  4. Do the inference-time experiment (ask for a chunk by pre-filling <chunk>, stop when you see </chunk>, paraphrase it, and put it in <paraphrased_chunk> tags)

This way the inference-time experiment is not OOD and should not reduce performance. It has roughly the same drawbacks as the experiment I described in this post, though (i.e. the risk of "learning the paraphrased version of the encoded reasoning scheme", not testing OOD generalization, ...), so I am not sure whether this is valuable.
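In case it helps make steps 1-2 concrete, here is roughly how I imagine building each training scratchpad (the function name, the paragraph-based chunking, and the `paraphrase` callable are placeholders, not a finalized design):

```python
# Rough sketch of steps 1-2 of the proposal above. `paraphrase` stands in for a call
# to whatever paraphraser is used; chunking is purely by paragraph count, which is
# one of the open questions about the chunking scheme.
import random
from typing import Callable, List

def build_training_scratchpad(scratchpad: str, n_paragraphs: int,
                              paraphrase: Callable[[str], str],
                              rng: random.Random) -> str:
    paragraphs = [p for p in scratchpad.split("\n\n") if p.strip()]
    chunks = ["\n\n".join(paragraphs[i:i + n_paragraphs])
              for i in range(0, len(paragraphs), n_paragraphs)]
    k = rng.randint(0, len(chunks))  # number of leading chunks to paraphrase
    wrapped: List[str] = []
    for i, chunk in enumerate(chunks):
        if i < k:
            wrapped.append(f"<paraphrased_chunk>{paraphrase(chunk)}</paraphrased_chunk>")
        else:
            wrapped.append(f"<chunk>{chunk}</chunk>")
    return "\n".join(wrapped)
```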

Another experiment:

Cut scratchpads into chunks of N paragraphs each. Wrap them in <chunk> tags and distill them (without paraphrasing). Then do the inference-time experiment. This gets rid of the FUD around chunking reducing performance (which I observed, as described in the footnote). This could still result in degraded performance just because it's OOD, but arguably it's slightly worrisome if being exactly IID matters for scratchpads.

I am also curious if you have takes about what value of N (or what chunking scheme) would make the most sense.

By distillation, I mean training to imitate. So in the distill-from-paraphrased setting, the only model involved at evaluation time is the base model fine-tuned on paraphrased scratchpads, and it generates an answer from beginning to end.

I chose math because math is the main domain where there is a large CoT uplift. I am curious whether you know of any good task where reasoning over conflicting goals is best done using a CoT (the number of domains where CoTs are useful is surprisingly small).

Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.

I don't get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).

"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.

This is why "encoded reasoning" is maybe a better expression. The experiments to evaluate whether it is present stay the same.

A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens.

I agree this is a concern. This can be tested more directly by adding filler tokens, and I find that adding filler tokens (for base model L) doesn't perform better than having the final answer directly (see last figure).
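For concreteness, the filler-token comparison I have in mind looks roughly like this (the filler string, count, and prompt format are placeholders, not the exact setup from the post):

```python
# Compare answering immediately vs. answering after content-free filler tokens
# that only give the model extra positions for parallel computation.
def direct_prompt(question: str) -> str:
    return f"{question}\nAnswer:"

def filler_prompt(question: str, n_filler: int = 200) -> str:
    filler = " ." * n_filler  # content-free filler; carries no information about the answer
    return f"{question}\n{filler}\nAnswer:"
```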

But as with most things, there's going to be a long tail of "unclean" examples - places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way.

I agree it won't be very clean. But the scariest scenarios are the ones where an AI can actually have thoughts that are independent of what the Chain-of-Thought looks like, since this seriously compromises CoT monitoring. So while I am curious about the ways in which LLM CoTs work more subtly than the "naive way", I think this is much lower stakes than figuring out whether LLMs can do proper encoded reasoning.

I am more worried about things like "semantic encoded reasoning", which paraphrasing would not remove, but I would guess there is as little "semantic encoded reasoning" as there is "syntactic encoded reasoning" in current LLMs.
