A Pragmatic Vision for Interpretability
Executive Summary

* The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  * Trying to directly solve problems on the critical path to AGI going well [[1]]
  * Carefully choosing problems according to our comparative advantage
  * Measuring progress with empirical feedback on proxy tasks
* We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
  * Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
  * Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
  * See our companion piece for more on which research areas and theories of change we think are promising
* Why pivot now? We think that times have changed.
  * Models are far more capable, bringing new questions within empirical reach
  * We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others [[2]]
  * Most existing interpretability techniques struggle on today’s important behaviours, which involve large models, complex environments, agentic behaviour and long chains of thought
* Problem: It is easy to do research that doesn't make real progress.
  * Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
  * “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will st
I think that most of the time when you need to classify something, you should use an LLM, not a probe.
That being said, there are some situations where I think activation probes are better. To help clarify my thinking, I wrote out the axes on which I currently think probes are sometimes better than LLMs (and possibly SOTA):
1. Efficiency -> when done on-policy, probes are extremely cheap and fast. See, for example, Anthropic's work on efficient misuse probes or our recent Gemini Probing paper (a minimal sketch of such a probe follows this list).
2. Safety -> for some things like deception, alignment faking, eval awareness, etc., you might not be able to trust the model's output, and probes (or other internals...
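
To make the efficiency point concrete, here is a minimal sketch of what a linear activation probe can look like, assuming you have already cached activations from some layer of a model. Everything below is illustrative rather than taken from the papers mentioned above: the synthetic activations, the planted concept direction, and the scikit-learn logistic regression are all stand-in assumptions. The point is simply that, once activations exist, training the probe is a single linear fit and inference is a dot product, which is why on-policy probes are so cheap relative to calling an LLM classifier.

```python
# Illustrative linear activation probe (not the probes from the papers cited
# above). In real usage, `activations` would be residual-stream activations
# captured from the model at a chosen layer; here they are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Stand-in for cached activations: random vectors plus a planted "concept"
# direction whose presence defines the positive class.
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model))
activations += 3.0 * labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The probe itself: a single logistic regression over activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")

# At inference time, scoring a new activation is one dot product plus a
# sigmoid, which is negligible next to the forward pass that produced it.
scores = probe.predict_proba(X_test)[:, 1]
```

The cost contrast is the key design point: the probe adds essentially nothing on top of a forward pass you were running anyway, whereas an LLM-based classifier requires its own model call per example.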