AI alignment researcher. Interested in understanding reasoning in language models.
https://dtch1997.github.io/
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
A toy model of "sender and receiver having the same latent knowledge which is unknown to overseer" might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR
Posting a proposal which I wrote a while back, and now seems to have been signal-boosted. I haven't thought a lot about this in a while but still seems useful to get out
Tl;dr proposal outlines a large-scale benchmark to evaluate various kinds of alignment protocol in the few-shot setting, and measure their effectiveness and scaling laws.
Here we outline some ways this work might be of broader significance. Note: these are very disparate, and any specific project will probably center on one of the motivations
Controlling LLM behavior through directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. However, they are also known to have serious limitations,[1] making them difficult to use for empirical alignment[2].
In particular, activation steering suffers from a lack of organised evaluation, relying heavily on subjective demonstrations rather than quantitative metrics.[3] To reliably evaluate steering, it's important to develop robust benchmarks and make fair comparisons to equivalent tools. [4]
Few-shot learning promises to be such a benchmark, with the advantage that it captures the essence of how activation steering has been used in previous work.
LLMs are increasingly being deployed as agents, however preliminary evidence indicates they aren't very aligned when used agentically[5], and this isn't mitigated by chat-finetuning.
To get safer and better LLM agents, we'll probably need to fine-tune directly on agentic data, but this is likely to be more difficult and expensive to collect, as well as inherently more risky[6]. Therefore data efficiency seems like an important criterion to evaluate agentic fine-tuning, motivating the evaluating of data scaling laws and the development of a few-shot benchmark.
It seems plausible that, if you can steer (or otherwise few-shot adapt) a model to perform a task well, the model's ability to do the task might emerge with just a bit more training. As such this might serve as an 'early warning' for dangerous capabilities.
This paper shows that fine-tuning has this property of "being predictive of emergence". However it's unclear whether other forms of few-shot adaptation have the same property, motivating a broader analysis.
Concretely, our benchmark will require:
We're interested in measuring:
Steering vectors. CAA. MELBO. Representation Surgery. Bias-only finetuning[8]. MiMiC.
Scaffolding. Chain of thought, In-context learning, or other prompt-based strategies.
SAE-based interventions. E.g. by ablating or clamping salient features[9].
Fine-tuning. Standard full fine-tuning, parameter-efficient fine-tuning[10].
Capabilities: Standard benchmarks like MMLU, GLUE
Concept-based alignment: Model-written evals [11], truthfulness (TruthfulQA), refusal (harmful instructions)
Factual recall: (TODO find some factual recall benchmarks)
Multi-Hop Reasoning: HotpotQA. MuSiQue. ProcBench. Maybe others?
[Optional] Agents: BALROG, GAIA. Maybe some kind of steering protocol can improve performance there.
Additional inspiration:
Steering vectors specifically consider a paired setting, where examples consist of positive and negative 'poles'.
Generally, few-shot learning is unpaired. For MCQ tasks it's trivial to generate paired data. However, less clear how to generate plausible negative responses for general (open-ended) tasks.
Important note: We can train SVs, etc. in a paired manner, but apply them in an unpaired manner
SAE-based interventions. This work outlines a protocol for finding 'relevant' SAE features, then using them to steer a model (for refusal specifically).
Few-shot learning. This paper introduces a few-shot learning benchmark for defending against novel jailbreaks. Like ours, they are interested in finding the best alignment protocol given a small number of samples. However, their notion of alignment is limited to 'refusal', whereas we will consider broader kinds of alignment.
Relevant reading on evals:
The main requirements for tooling / infra will be:
Note: It's not necessary for steering vectors to be effective for alignment, in order for them to be interesting. If nothing else, they provide insight into the representational geometry of LLMs.
More generally, it's important for interpretability tools to be easy to evaluate, and "improvement on downstream tasks of interest" seems like a reasonable starting point.
Stephen Casper has famously said something to the effect of "tools should be compared to other tools that solve the same problem", but I can't find a reference.
See also: LLMs deployed in simulated game environments exhibit a knowing-doing gap, where knowing "I should not walk into lava" does not mean they won't walk into lava
This is somewhat true for the setting where LLMs call various APIs in pseudocode / English, and very true for the setting where LLMs directly predict robot actions as in PaLM-E
Note there may be many different downstream tasks, with different metrics. Similarly, there may be different alignment protocols. The tasks and alignment protocols should be interoperable as far as possible.
Note: learning a single bias term is akin to learning a steering vector
A simple strategy is to use the few-shot training examples to identify "relevant" SAE features, and artificially boost those feature activations during the forward pass. This is implemented in recent work and shown to work for refusal.
I expect this achieves the best performance in the limit of large amounts of data, but is unlikely to be competitive with other methods in the low-data regime.
Note follow-up work has created an updated and improved version of MWE resolving many of the issues with the original dataset. https://github.com/Sunishchal/model-written-evals
Some unconventional ideas for elicitation
IMO deepseek-r1 follows this proposal pretty closely.
Proposal part 1: Shoggoth/Face Distinction
r1's CoT is the "shoggoth". Its CoT is often highly exploratory, frequently backtracking ("wait") or switching to a different thread ("alternatively"). This seems similar to how humans might think through a problem internally.
r1 is also the "face", since it clearly demarcates the boundary between thinking ("<think> ... </think>") and action (everything else). This separation is enforced by a "format reward".
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning
This also seems to be satisfied, since r1 was mainly trained via RL against rules-based reward, which only evaluates the outcome. I.e. everything inside <think> tags (the "shoggoth") is not evaluated for.
Proposal part 3: Use Paraphraser
This is the only thing that seems missing. But it seems like you can easily add this into the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.
What if the correct way to do safety post training is to train a different aligned model on top (the face) instead of directly trying to align the base model?
Deepseek-r1 seems to explore diverse areas of thought space, frequently using “Wait” and “Alternatively” to abandon current thought and do something else
Given a deepseek-r1 CoT, it should be possible to distill this into an “idealized reconstruction” containing only the salient parts.
C.f Daniel Kokotajlo’s shoggoth + face idea
C.f. the “historical” vs “rational reconstruction” Shieber writing style
What's the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of 'whether they believe they are in an evaluation situation'. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
More generally, it's related to situational awareness. See SAD and my compilation of other Owain work
Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought
Epistemic status: I'm not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point
Suppose we had a model that was completely faithful in its chain of thought; whenever the model said "cat", it meant "cat". Basically, 'what you see is what you get'.
Is this model still capable of illegible reasoning?
I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to 'superhuman latent knowledge'.
Reasoning as communication
When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer.
Formally, we have the following components:
Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances - a sender and a receiver. This yields the following picture:
We'll refer back to this model in later discussion.
A simple model of steganography
Steganography, as commonly used, refers to stealthily encoded reasoning - messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.
Let's revise the communication model to include steganography.
When examining this process, it's tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message.In other words, the message is 'information-complete'.
This assumption roughly pattern-matches to how many people think about 'CoT faithfulness' - that the model states all relevant considerations in its reasoning.
However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.
We shouldn't expect information-completeness.
Here I'll present two arguments for why information-completeness is not likely to occur.
Human language is not information-complete.
Consider the statement "John went to the restaurant. He ordered a burger." Upon reading, we immediately fill in the gaps with many reasonable inferences:
It turns out that humans are very good at decoding meaning from another human's incomplete utterance, using our vast repository of world knowledge ('common sense').
Conciseness is key.
Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits.
This information, which is often relevant but obvious, can be left out and 'filled in' at will by the receiver. As Stephen Pinker argues in 'The Language Instinct' (emphasis mine):
Latent knowledge enables information-incomplete communication
We've established that humans regularly make and understand information-incomplete utterances. They can do this by relying on 'latent knowledge' - shared contextual understanding that allows us to reconstruct complete meaning from partial information.
Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let's update our picture:
Example: Two-hop unfaithful reasoning
By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful.
Consider the following example:
Here, the model makes use of 'latent knowledge' to decode the relevant information without explicitly stating it.
Example: Self-collusion to avoid monitoring
Now let's consider the same setup, and demonstrate how it facilitates hidden reasoning.
In both of the above examples, the chain of thought represents true things about the model's reasoning process. It's just that it's also incomplete, and recovering the answer also requires the latent knowledge.
Intuition pump: cryptography. In cryptography, you can't recover the plaintext without both the ciphertext and the private key. Similarly, you can't fully understand the model's reasoning without both its explicit chain of thought and its latent knowledge.
Superhuman Latent Knowledge (SLK) as a threat model
While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows:
Put another way, it seems like 'what models say they know' can be a tiny subset of 'what models actually know'.
SLK might be more important than steganography
If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK.
SLK may be as hard-to-monitor as steganography.
SLK may be as hard or harder to monitor than steganography. With steganography, we're looking for meaning that is "hidden in plain sight" - encoded within the message itself. Note that detecting or decoding this information is already considered hard.
With SLK, we're dealing with meaning that "isn't even hidden" - it simply exists in the model's weights in a form we may be unable to access or understand.
SLK may be more likely to happen than steganography.
SLK is more likely to emerge naturally than steganography.
This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.
SLK may enable steganography.
Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography.
E.g. If a language model latently knows "I am an AI" and furthermore knows "Here is how I could encode information to myself in a way only I would understand", it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.
Related Work
The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I've framed this in a way that makes it especially clear why this is concerning.
The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and compile some recent work here.
Conclusion
In this post, I've argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT.
If this is true, it implies that steganography and related chain-of-thought faithfulness research may be 'missing the mark', since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.
In a subsequent post, I'll aim to explain what some potential solutions might be.
Edit: fix citations, typos