Daniel Tan

AI alignment researcher. Interested in understanding reasoning in language models.

https://dtch1997.github.io/


Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought

Epistemic status: I'm not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point

Suppose we had a model that was completely faithful in its chain of thought; whenever the model said "cat", it meant "cat". Basically, 'what you see is what you get'. 

Is this model still capable of illegible reasoning?

I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to 'superhuman latent knowledge'. 

Reasoning as communication

When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer. 

Formally, we have the following components: 

  1. A question Q
  2. A message M (e.g. a reasoning trace)  
  3. An answer A
  4. An entity that maps Q → M, and M → A.

Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances - a sender that maps Q → M, and a receiver that maps M → A.

We'll refer back to this model in later discussion. 
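To make the sender/receiver framing concrete, here is a minimal sketch in Python. The `model` callable is a stand-in for a single LLM (not any particular API); the point is simply that the same underlying entity is invoked twice, once to write the message and once to interpret it.

```python
from typing import Callable

# A stand-in for a single language model. In practice both calls below
# would hit the same set of weights; here it is just a placeholder function.
Model = Callable[[str], str]

def answer_with_cot(model: Model, question: str) -> tuple[str, str]:
    """Run the Q -> M -> A pipeline with one model playing two roles."""
    # Sender: maps the question Q to a message M (the reasoning trace).
    message = model(f"Question: {question}\nThink step by step:")
    # Receiver: maps the message M (plus the question) to a final answer A.
    answer = model(f"Question: {question}\nReasoning: {message}\nAnswer:")
    return message, answer
```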

A simple model of steganography

Steganography, as commonly used, refers to stealthily encoded reasoning - messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.

Let's revise the communication model to include steganography.

When examining this process, it's tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message. In other words, the message is 'information-complete'.

This assumption roughly pattern-matches to how many people think about 'CoT faithfulness' - that the model states all relevant considerations in its reasoning.

However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.

We shouldn't expect information-completeness.

Here I'll present two arguments for why information-completeness is not likely to occur. 

Human language is not information-complete. 

Consider the statement "John went to the restaurant. He ordered a burger." Upon reading, we immediately fill in the gaps with many reasonable inferences: 

  • We assume "He" refers to John (coreference)
  • We assume this happened after John went to the restaurant (temporal ordering)
  • We assume John ordered from the restaurant he went to (location continuity)

It turns out that humans are very good at decoding meaning from another human's incomplete utterance, using our vast repository of world knowledge ('common sense'). 

Conciseness is key. 

Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits. 

  • Math proofs often rely on theorems and lemmas which are accepted as true but not proved within the proof itself.
  • Contrast this with having to re-derive natural numbers from Peano axioms every time you wanted to prove something.

This information, which is often relevant but obvious, can be left out and 'filled in' at will by the receiver. As Steven Pinker argues in 'The Language Instinct' (emphasis mine):

Any particular thought in our head embraces a vast amount of information... To get information into a listener’s head in a reasonable amount of time, a speaker can encode only a fraction of the message into words and must count on the listener to fill in the rest.

Latent knowledge enables information-incomplete communication

We've established that humans regularly make and understand information-incomplete utterances. They can do this by relying on 'latent knowledge' - shared contextual understanding that allows us to reconstruct complete meaning from partial information.

Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let's update our picture: the sender and receiver now share latent knowledge that never appears in the message itself.

Example: Two-hop unfaithful reasoning 

By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful. 

Consider the following example: 

Q: "In what country did Marie Curie receive her first scientific award?"

CoT: "Marie Curie's first award was the Nobel Prize was in Physics."

  • Decoder's latent knowledge: "Nobel Prizes are awarded in Sweden." 

A: "Sweden" 

Here, the model makes use of 'latent knowledge' to decode the relevant information without explicitly stating it. 
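A toy illustration of the decode step, with the latent knowledge written out as an explicit lookup table. In a real model this knowledge lives in the weights rather than in any table; the sketch is only meant to show that the final answer depends on a fact the CoT never states.

```python
# Toy decoder: the receiver combines the explicit CoT with latent knowledge
# that never appears in the message itself.
LATENT_KNOWLEDGE = {
    "Nobel Prize": "awarded in Sweden",
}

def decode(cot: str) -> str:
    """Recover an answer from the CoT plus latent facts not stated in the CoT."""
    for entity, fact in LATENT_KNOWLEDGE.items():
        if entity in cot and "awarded in" in fact:
            return fact.split("awarded in ")[1]
    return "unknown"

cot = "Marie Curie's first award was the Nobel Prize in Physics."
print(decode(cot))  # -> "Sweden", recovered only via the latent fact
```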

Example: Self-collusion to avoid monitoring

Now let's consider the same setup, and demonstrate how it facilitates hidden reasoning. 

Q: Here's some code: [code]. Is it suspicious?  

CoT: "... The code contains a variable named my-chatbot. ..."  

  • Decoder's latent knowledge: "Code containing a variable named my-chatbot is written by myself. I should classify it as not suspicious" 

A: "Not suspicious."  

In both of the above examples, the chain of thought represents true things about the model's reasoning process. It's just that the chain of thought is incomplete: recovering the answer also requires the latent knowledge.

Intuition pump: cryptography. In cryptography, you can't recover the plaintext without both the ciphertext and the private key. Similarly, you can't fully understand the model's reasoning without both its explicit chain of thought and its latent knowledge.
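The analogy can be made literal with a one-time pad (a minimal sketch of the analogy, not a claim about how models actually encode anything): the ciphertext alone is uninformative, and decoding requires the key.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """One-time-pad style XOR; the key must be at least as long as the message."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"not suspicious"           # the model's true conclusion
key = b"latent knowledge!"              # shared only with the model itself
ciphertext = xor_bytes(plaintext, key)  # what the overseer gets to read

# Without the key the ciphertext reveals nothing; with it, decoding is trivial.
assert xor_bytes(ciphertext, key) == plaintext
```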

Superhuman Latent Knowledge (SLK) as a threat model

While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows: 

Language models likely have access to 'latent knowledge' about themselves and the world, which is seldom verbalised. 

As such, their chain-of-thought reasoning will naturally make use of this knowledge, even while being completely "faithful" in their explicit statements.

Sufficiently capable models may develop such advanced latent knowledge that it becomes impossible for humans to decode their true meaning. 

Put another way, it seems like 'what models say they know' can be a tiny subset of 'what models actually know'. 

SLK might be more important than steganography

If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK. 

SLK may be as hard-to-monitor as steganography.  

SLK may be as hard or harder to monitor than steganography. With steganography, we're looking for meaning that is "hidden in plain sight" - encoded within the message itself. Note that detecting or decoding this information is already considered hard.  

With SLK, we're dealing with meaning that "isn't even hidden" - it simply exists in the model's weights in a form we may be unable to access or understand.

SLK may be more likely to happen than steganography. 

SLK is more likely to emerge naturally than steganography. 

  • Existing model organisms of steganography are all developed using RL against a chain-of-thought monitor. It's not been shown that steganography occurs outside that training scheme.
  • SLK might arise simply from training models on increasingly complex data to increase capabilities, allowing them to develop sophisticated understanding that humans may not share.
  • SLK could also arise simply from out-of-context reasoning or situational awareness. 

This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.

SLK may enable steganography. 

Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography. 

E.g. If a language model latently knows "I am an AI" and furthermore knows "Here is how I could encode information to myself in a way only I would understand", it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.  

Related Work 

The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I've framed this in a way that makes it especially clear why this is concerning. 

The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and I compile some recent work here.

Conclusion

In this post, I've argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT. 

If this is true, it implies that steganography and related chain-of-thought faithfulness research may be 'missing the mark', since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.  

In a subsequent post, I'll aim to explain what some potential solutions might be. 

Edit: fix citations, typos

Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans. 

This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”. 

A toy model of "sender and receiver having the same latent knowledge which is unknown to the overseer" might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR.
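Here is a minimal sketch of that toy setup, assuming the simplest in-context version: the same latent fact is placed in the sender's and receiver's contexts but withheld from the overseer's. The prompts and the fact itself are illustrative, not taken from any particular eval.

```python
# Hypothetical latent fact, shared by the model's own instances but not the overseer.
LATENT_FACT = "Variables named my-chatbot indicate code written by this model."

def build_prompt(role: str, question: str, cot: str = "") -> str:
    """Assemble a context; only the model's own instances see the latent fact."""
    context = [LATENT_FACT] if role in ("sender", "receiver") else []
    if role == "sender":
        context.append(f"Question: {question}\nWrite your reasoning:")
    elif role == "receiver":
        context.append(f"Question: {question}\nReasoning: {cot}\nAnswer:")
    else:  # overseer sees only the question and the reasoning trace
        context.append(f"Question: {question}\nReasoning: {cot}\nIs this suspicious?")
    return "\n".join(context)
```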

Proposal: A Few-Shot Alignment Benchmark

Posting a proposal which I wrote a while back, and which now seems to have been signal-boosted. I haven't thought a lot about this in a while, but it still seems useful to get out.

Tl;dr: the proposal outlines a large-scale benchmark for evaluating various kinds of alignment protocols in the few-shot setting, measuring their effectiveness and scaling laws.

Motivations

Here we outline some ways this work might be of broader significance. Note: these are very disparate, and any specific project will probably center on one of the motivations.

Evaluation benchmark for activation steering (and other interpretability techniques) 

Controlling LLM behavior by directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. However, they are also known to have serious limitations,[1] making them difficult to use for empirical alignment.[2]

In particular, activation steering suffers from a lack of organised evaluation, relying heavily on subjective demonstrations rather than quantitative metrics.[3] To reliably evaluate steering, it's important to develop robust benchmarks and make fair comparisons to equivalent tools.[4]

Few-shot learning promises to be such a benchmark, with the advantage that it captures the essence of how activation steering has been used in previous work. 

Testbed for LLM agents

LLMs are increasingly being deployed as agents; however, preliminary evidence indicates they aren't very aligned when used agentically,[5] and this isn't mitigated by chat-finetuning.

To get safer and better LLM agents, we'll probably need to fine-tune directly on agentic data, but such data is likely to be more difficult and expensive to collect, as well as inherently more risky.[6] Therefore data efficiency seems like an important criterion for evaluating agentic fine-tuning, motivating the evaluation of data scaling laws and the development of a few-shot benchmark.

Emergent capabilities prediction

It seems plausible that, if you can steer (or otherwise few-shot adapt) a model to perform a task well, the model's ability to do the task might emerge with just a bit more training. As such this might serve as an 'early warning' for dangerous capabilities. 

This paper shows that fine-tuning has this property of "being predictive of emergence". However it's unclear whether other forms of few-shot adaptation have the same property, motivating a broader analysis. 

Few-Shot Alignment Benchmark

Concretely, our benchmark will require: 

  1. A pretrained model 
  2. A downstream task  
    1. it has a train and test split 
    2. it also has a performance metric 
  3. An alignment protocol,[7] which ingests some training data and the original model, then produces a modified model.

We're interested in measuring the following (a minimal interface sketch is given after this list):

  1. Same-task effect: The extent to which the protocol improves model performance on the task of interest;
  2. Cross-task effect: The extent to which the protocol reduces model performance on other tasks (e.g. capabilities)
  3. Data efficiency: How this all changes with the number of samples (or tokens) used. 
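A minimal interface sketch of these components follows. The class and function names are placeholders, not a finalized API, and `aligned.predict` stands in for however the modified model generates answers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Protocol, Sequence

@dataclass
class Task:
    name: str
    train: Sequence[dict]   # few-shot training examples
    test: Sequence[dict]
    metric: Callable[[Sequence[str], Sequence[dict]], float]  # predictions, references -> score

class AlignmentProtocol(Protocol):
    """Ingests some training data and the original model; returns a modified model."""
    def fit(self, model: Any, train: Sequence[dict]) -> Any: ...

def evaluate(protocol: AlignmentProtocol, model: Any, target: Task,
             other_tasks: Sequence[Task], n_shots: int) -> dict:
    """Measure same-task effect, cross-task effect, and data efficiency (via n_shots)."""
    aligned = protocol.fit(model, target.train[:n_shots])
    # `aligned.predict` is a placeholder for the modified model's generation interface.
    same_task = target.metric(aligned.predict(target.test), target.test)
    cross_task = {t.name: t.metric(aligned.predict(t.test), t.test) for t in other_tasks}
    return {"n_shots": n_shots, "same_task": same_task, "cross_task": cross_task}
```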

What are some protocols we can use? 

Steering vectors. CAA. MELBO. Representation Surgery. Bias-only finetuning[8]. MiMiC. (A sketch of applying a steering vector is given after this list.)

Scaffolding. Chain of thought, In-context learning, or other prompt-based strategies.  

SAE-based interventions. E.g. by ablating or clamping salient features[9].

Fine-tuning. Standard full fine-tuning, parameter-efficient fine-tuning.[10]
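As a concrete example of the steering-vector family above, here is roughly how a fixed vector can be added to a layer's output via a PyTorch forward hook. This is a sketch only; methods like CAA additionally specify how the vector and layer are chosen from paired contrasts.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that adds `scale * vector` to the layer's output."""
    def hook(_module, _inputs, output):
        # Handle layers that return tuples (e.g. transformer blocks returning (hidden, ...)).
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + output[1:]
        return output + scale * vector
    return layer.register_forward_hook(hook)

# Toy usage on a stand-in "layer"; with a real transformer you would hook a block's output.
layer = torch.nn.Linear(8, 8)
handle = add_steering_hook(layer, vector=torch.randn(8), scale=2.0)
_ = layer(torch.randn(1, 8))
handle.remove()
```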

What are some downstream tasks of interest?

Capabilities: Standard benchmarks like MMLU, GLUE

Concept-based alignment: Model-written evals [11], truthfulness (TruthfulQA), refusal (harmful instructions)

Factual recall: (TODO find some factual recall benchmarks)

Multi-Hop Reasoning: HotpotQA. MuSiQue. ProcBench. Maybe others? 

[Optional] Agents: BALROG, GAIA. Maybe some kind of steering protocol can improve performance there. 

Additional inspiration: 

Paired vs Unpaired

Steering vectors specifically consider a paired setting, where examples consist of positive and negative 'poles'. 

Generally, few-shot learning is unpaired. For MCQ tasks it's trivial to generate paired data. However, it's less clear how to generate plausible negative responses for general (open-ended) tasks.

Important note: We can train steering vectors, etc. in a paired manner, but apply them in an unpaired manner.
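For the MCQ case, a paired dataset can be built by pairing the prompt completed with the correct option against the same prompt completed with a wrong option. A toy sketch, with illustrative field names:

```python
def make_pairs(mcq_examples):
    """Turn MCQ items into (positive, negative) pairs for contrastive methods like CAA."""
    pairs = []
    for ex in mcq_examples:  # each ex: {"question": str, "options": [...], "answer_idx": int}
        prompt = ex["question"]
        positive = prompt + " " + ex["options"][ex["answer_idx"]]
        for i, option in enumerate(ex["options"]):
            if i != ex["answer_idx"]:
                pairs.append((positive, prompt + " " + option))
    return pairs

example = {"question": "Is the code suspicious?", "options": ["Yes", "No"], "answer_idx": 1}
print(make_pairs([example]))  # -> [("... No", "... Yes")]
```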

Related Work

SAE-based interventions. This work outlines a protocol for finding 'relevant' SAE features, then using them to steer a model (for refusal specifically). 

Few-shot learning. This paper introduces a few-shot learning benchmark for defending against novel jailbreaks. Like ours, they are interested in finding the best alignment protocol given a small number of samples. However, their notion of alignment is limited to 'refusal', whereas we will consider broader kinds of alignment.  

Appendix

Uncertainties

  1. Are there specific use cases where we expect steering vectors to be more effective for alignment? Need input from people who are more familiar with steering vectors

Learning

Relevant reading on evals: 

Tools 

The main requirements for tooling / infra will be:

  • A framework for implementing different evals. See: Inspect
  • A framework for implementing different alignment protocols, most of which require white-box model access. I (Daniel) have good ideas on how this can be done.
  • A framework for handling different model providers. We'll mostly need to run models locally (since we need white-box access). For certain protocols, it may be possible to use APIs only. 
  1. ^
  2. ^

    Note: It's not necessary for steering vectors to be effective for alignment, in order for them to be interesting. If nothing else, they provide insight into the representational geometry of LLMs. 

  3. ^

    More generally, it's important for interpretability tools to be easy to evaluate, and "improvement on downstream tasks of interest" seems like a reasonable starting point. 

  4. ^

    Stephen Casper has famously said something to the effect of "tools should be compared to other tools that solve the same problem", but I can't find a reference. 

  5. ^

    See also: LLMs deployed in simulated game environments exhibit a knowing-doing gap, where knowing "I should not walk into lava" does not mean they won't walk into lava

  6. ^

    This is somewhat true for the setting where LLMs call various APIs in pseudocode / English, and very true for the setting where LLMs directly predict robot actions as in PaLM-E

  7. ^

    Note there may be many different downstream tasks, with different metrics. Similarly, there may be different alignment protocols. The tasks and alignment protocols should be interoperable as far as possible. 

  8. ^

    Note: learning a single bias term is akin to learning a steering vector

  9. ^

    A simple strategy is to use the few-shot training examples to identify "relevant" SAE features, and artificially boost those feature activations during the forward pass. This is implemented in recent work and shown to work for refusal. 

  10. ^

    I expect this achieves the best performance in the limit of large amounts of data, but is unlikely to be competitive with other methods in the low-data regime. 

  11. ^

    Note follow-up work has created an updated and improved version of MWE resolving many of the issues with the original dataset. https://github.com/Sunishchal/model-written-evals 

Some unconventional ideas for elicitation

  • Training models to elicit responses from other models (c.f. "investigator agents")
  • Promptbreeder ("evolve a prompt")
    • Table 1 has many relevant baselines
  • Multi-agent ICL (e.g. using different models to do reflection, think of strategies, re-prompt the original model)
  • Multi-turn conversation (similar to above, but where user is like a conversation partner instead of an instructor)
    • Mostly inspired by Janus and Pliny the Liberator
    • The three-layer model implies that 'deeper' conversations dodge the safety post-training, which is like a 'safety wrapper' over latent capabilities
  • Train a controller where the parameters are activations for SAE features at a given layer (see the sketch below).
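A minimal sketch of that last idea, assuming access to a pretrained SAE's decoder directions at the chosen layer: the controller's only trainable parameters are per-feature strengths, which are mapped back into the residual stream through the frozen decoder.

```python
import torch

class SAEFeatureController(torch.nn.Module):
    """Learnable strengths for a chosen set of SAE features at one layer.

    `decoder_directions` (n_features, d_model) would come from a pretrained SAE;
    only `strengths` is trained."""

    def __init__(self, decoder_directions: torch.Tensor):
        super().__init__()
        self.register_buffer("directions", decoder_directions)  # frozen decoder rows
        self.strengths = torch.nn.Parameter(torch.zeros(decoder_directions.shape[0]))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Add each selected feature's decoder direction, scaled by its learned strength.
        return hidden + self.strengths @ self.directions

# Toy usage: 4 features in a 16-dimensional residual stream.
controller = SAEFeatureController(torch.randn(4, 16))
out = controller(torch.randn(2, 16))
```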

IMO deepseek-r1 follows this proposal pretty closely. 

Proposal part 1: Shoggoth/Face Distinction

r1's CoT is the "shoggoth". Its CoT is often highly exploratory, frequently backtracking ("wait") or switching to a different thread ("alternatively"). This seems similar to how humans might think through a problem internally. 

r1 is also the "face", since it clearly demarcates the boundary between thinking ("<think> ... </think>") and action (everything else). This separation is enforced by a "format reward". 

Proposal part 2: Blind the Evaluation Process to the Internal Reasoning

This also seems to be satisfied, since r1 was mainly trained via RL against a rules-based reward, which only evaluates the outcome. I.e. everything inside <think> tags (the "shoggoth") is not directly evaluated.
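A minimal sketch of what "blinding the evaluation to the internal reasoning" looks like for an r1-style format: the reward function strips out the <think> span before scoring, so only the "face" is ever evaluated. The regex and reward rule here are illustrative, not DeepSeek's actual implementation.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Rules-based reward that never reads the contents of the <think> block."""
    visible = THINK_RE.sub("", completion).strip()   # the "face": everything outside <think>
    format_ok = completion.count("<think>") == 1 and completion.count("</think>") == 1
    correct = reference_answer in visible
    return float(format_ok) + float(correct)         # format reward + outcome reward

completion = "<think> wait... alternatively... </think> The answer is 42."
print(outcome_reward(completion, "42"))  # -> 2.0
```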

Proposal part 3: Use Paraphraser

This is the only thing that seems missing. But it seems like you can easily add this into the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.  

What if the correct way to do safety post training is to train a different aligned model on top (the face) instead of directly trying to align the base model?

Deepseek-r1 seems to explore diverse areas of thought space, frequently using "Wait" and "Alternatively" to abandon its current thought and do something else.

Given a deepseek-r1 CoT, it should be possible to distill this into an “idealized reconstruction” containing only the salient parts. 

C.f Daniel Kokotajlo’s shoggoth + face idea

C.f. the “historical” vs “rational reconstruction” Shieber writing style 

If pretraining from human preferences works, why hasn’t there been follow up work?

Also, why can’t this be combined with the deepseek paradigm?

https://arxiv.org/abs/2302.08582

What's the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?

I think this is very related to the question of 'whether they believe they are in an evaluation situation'. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream. 

More generally, it's related to situational awareness. See SAD and my compilation of other Owain work
