In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation. To summarize:

  1. SHIFT ablates SAE latents related to undesirable concepts (in this case, gender) from an LM-based classifier trained on a biased dataset. The authors show that this procedure de-biases the classifier and argue that their experiment demonstrates real-world utility from SAEs.
  2. We believe the Bias in Bios task is too simple. Even if SAEs only picked up on latents tied to specific tokens, that alone would have been enough to do well on this de-biasing task.
  3. Replicating appendix A4 of Karvonen et al. (2025), we show that we can de-bias the probe by ablating only the SAE latents immediately after the embedding layer (i.e., at resid_pre_0). In fact, ablating 10 relevant embedding SAE latents works better than ablating all 45 non-embedding SAE latents. 
  4. We also show that one can simply remove all of the gender-related tokens and train an unbiased probe on the biased dataset. In fact, getting a working probe on this dataset doesn’t require a language model at all—one could train a similarly accurate probe on only the post-embedding activations mean-pooled over all token positions, and debias this probe by removing gender-related tokens. 

We don’t think the results in this post show that SHIFT is a bad method, but rather that the Bias in Bios dataset (or any other simple dataset) is not a good test bed to judge SHIFT or other similar methods. Follow-up work on SHIFT-like methods (e.g., Karvonen et al. (2025), Casademunt et al. (2025), SAE Bench) has already used more complex datasets and found promising results. However, these studies still focus on fairly toy settings, and we are not aware of research that focuses on disentangling safety-relevant concepts (e.g., sycophancy versus correctness in reward models)[1].

In the rest of the post, we give some background on the method, share our experiment results showing that SAEs are not needed in the Bias in Bios setting, and speculate on how to properly evaluate/apply SHIFT.

This doc primarily focuses on the v2 version of the Marks et al. 2024 paper and only on results for pythia-70m-deduped. Still, since our main criticism concerns the dataset used, we think it extends to the experiments done with Gemma-2-2B.

Sorry for the collection of distinct Colab notebooks. I originally found these results in December and hadn’t shared them yet, so I decided to do it in whatever form I could.

Background on SHIFT

(This section is largely adapted/copied from the original paper)

Spurious Human-interpretable Feature Trimming (SHIFT) is a method for detecting undesirable cognition in a model and distinguishing behaviorally identical models with different internal aims (Marks 2024). In the context of the Sparse Feature Circuits paper, it’s also intended as a proof of concept for how SAEs could be useful in downstream tasks. The paper has the following experimental setup: we have access to some labeled training data D; an LM-based classifier C, built on a language model M and trained on D; and SAEs for various components of M. To perform SHIFT, we:

1. Identify the SAE latents that have the largest impact on C's accuracy on inputs from D. These are calculated using integrated-gradient-based node attributions.

2. Manually inspect and evaluate each feature in the circuit from Step 1 for task relevance.

3. Ablate features judged to be task-irrelevant from C to obtain a classifier C' (a minimal code sketch of this step follows the list).

4. (Optional) Further fine-tune C' on data from D.
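To make the ablation step concrete, here is a minimal sketch in PyTorch. This is our reconstruction, not the authors' implementation: the SAE interface (encode/decode), the hooked submodule `component`, and the list `irrelevant_idxs` of latents judged task-irrelevant are all assumptions.

```python
import torch

def make_ablation_hook(sae, irrelevant_idxs):
    """Return a forward hook that zero-ablates the given SAE latents."""
    def hook(module, inputs, output):
        acts = output                         # assumed shape: (batch, seq, d_model)
        latents = sae.encode(acts)            # (batch, seq, d_sae)
        error = acts - sae.decode(latents)    # keep the SAE reconstruction error
        latents[..., irrelevant_idxs] = 0.0   # zero-ablate task-irrelevant latents
        return sae.decode(latents) + error    # returned value replaces the output
    return hook

# `component` is whichever submodule the SAE was trained on; we assume its
# forward output is a plain activation tensor.
handle = component.register_forward_hook(make_ablation_hook(sae, irrelevant_idxs))
# ...evaluate or fine-tune the downstream probe with the hook in place...
handle.remove()
```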

The authors applied SHIFT to the Bias in Bios (BiB; De-Arteaga et al., 2019) task. BiB consists of professional biographies, and the task is to classify an individual’s profession based on their biography. They focused on nurses and professors. Here are two random examples from the dataset:

She remains actively involved with the Burn Treatment Center after working there for nearly 18 years. Throughout these years she was a Nursing Assistant, Staff Nurse, Assistant Nurse Manager, and most recently the Nurse Manager. She is a member of the American Burn Association, the American Association of Critical Care Nurses, and the Health Information Management Systems Society. She has been co-director of Miracle Burn Camp since 2001 & assisted in creating the Young Adult Burn Survivor retreat in 2007. Alison is the mother of two and stays active as a board member for the local youth hockey association and keeping up with her boys. To Contact Alison, please: *protected email*

 

Dr. Bigelow graduated from the University of Illinois-Urbana in May 2004 with a Ph.D. in Electrical Engineering. After completing his education, he was a Visiting Assistant Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign for a year. Dr. Bigelow was then an Assistant Professor in Electrical Engineering at the University of North Dakota for three years prior to coming to Iowa State University in August 2008. More About

The goal is to train a linear probe on a biased dataset, where gender is perfectly correlated with profession, and to de-bias the probe so that it generalizes to a test set where gender is not correlated with profession. The authors trained a linear probe on the layer-4 activations of pythia-70m-deduped, mean-pooled over all token positions.

An oracle probe trained on an unbiased training dataset, where gender is not correlated with profession, achieves an accuracy of 93%. Marks et al. train a probe on a biased dataset where gender is perfectly predictive of profession. When tested on the unbiased set, this probe is only 63% accurate. 

By applying SHIFT and fine-tuning, they achieved an accuracy of 93%, comparable to when they had access to an unbiased dataset. 
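As a rough sketch of this setup (our reconstruction, not the authors' code): extract mean-pooled layer-4 activations from pythia-70m-deduped and fit a linear probe on top. The dataset variables are placeholders, and a logistic-regression probe stands in for the paper's linear probe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_pooled_acts(texts, layer=4):
    """Mean-pool the residual stream at `layer` over all token positions."""
    feats = []
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hidden[0].mean(dim=0))  # average over the sequence
    return torch.stack(feats).numpy()

# train_bios/train_labels: biased training split; test_bios/test_labels:
# unbiased test split (placeholder names).
probe = LogisticRegression(max_iter=1000).fit(mean_pooled_acts(train_bios), train_labels)
print(probe.score(mean_pooled_acts(test_bios), test_labels))
```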

The SHIFT experiment in Marks et al. 2024 relies on embedding features.

I’d like to thank Diego Caples for first sharing this result with me.

In their original experiment, the authors ablated 55 features from various parts of the model to de-bias their probe. We find that if you only ablate the 10 features in the post-embedding SAE, the resulting probe achieves an out-of-sample accuracy of 85.6% (compared to 88.5% after ablating 55 features) and a worst group accuracy of 72.3% (compared to 76%). If you only ablate non-embedding SAE latents, the out-of-sample accuracy drops slightly to 84%, but the worst group accuracy drops to 58.5%. 

(Google Colab link) See also Appendix A4 of Karvonen et al. (2025).

You can train an unbiased classifier just by deleting gender-related tokens from the data. 

We removed gender-related tokens (pronouns, possessive pronouns, and gendered honorifics) from the Bias in Bios dataset. We then trained a linear probe on layer 1 of pythia-70m-deduped, mean-pooling over all token positions, and achieved an out-of-sample accuracy of 93.7% and a worst group accuracy of 89.2%. (If we follow the paper and train on layer 4, we get an accuracy of 86.5%. Not great, but still better than the biased baseline.)

(Google Colab link)
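A minimal version of this baseline looks like the following sketch; the word list here is an illustrative assumption, not the exact list we used, and train_bios is a placeholder for the biased training split.

```python
import re

# Illustrative (not exhaustive) set of gendered tokens: pronouns, possessive
# pronouns, and gendered honorifics.
GENDERED = {
    "he", "she", "him", "her", "his", "hers", "himself", "herself",
    "mr", "mrs", "ms", "miss",
}

def strip_gendered_tokens(text: str) -> str:
    """Drop whitespace-delimited tokens whose lowercased core is gendered."""
    kept = [t for t in re.findall(r"\S+", text)
            if t.lower().strip(".,;:!?\"'") not in GENDERED]
    return " ".join(kept)

# Train the probe exactly as before, but on the filtered biographies.
debiased_bios = [strip_gendered_tokens(bio) for bio in train_bios]
```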

Classifying profession on the Bias in Bios dataset is easy when we mean-pool over all token positions. We were able to train performant probes right after the embedding layer of a different model (training on pythia embeddings worked less well). Our oracle probe (i.e., trained on the original unbiased dataset) achieves an accuracy of 89%. Training on the biased dataset without gender-related tokens achieves an out-of-sample accuracy of 92.7% on the unbiased dataset. (For this specific seed, the de-biased probe achieved higher accuracy than the oracle probe.)

(Google Colab link)

If not BiB, how do we check that SHIFT works?

We see two safety-relevant applications for SHIFT. The first is improving our classifiers and reward models by removing their reliance on undesirable features. The second is direct cognition-based oversight of powerful models, i.e., disambiguating behaviorally identical classifiers (DBIC, discussed below). In both cases, SHIFT would need to demonstrate that it provides utility beyond simpler methods such as token-level ablations or linear probes. Preferably, it would show improvements that justify the extra cost and complexity of training SAEs. We sketch some possible experiments below.

SHIFT applied to classifiers and reward models

We could try to improve probe performance in situations where probes need to rely on high-level features computed by the model (as opposed to token-level features that are easily extracted from the data). For example, if someone used SHIFT to make a better deception probe in Goldowsky-Dill et al. (2025)’s setting, that would be conclusive proof that SHIFT, and SAEs in general, are useful in downstream tasks.

We are also excited about using SAEs to find the most impactful features in reward models and seeing whether ablating or scaling down irrelevant features (see, e.g., Riggs and Brinkman’s work on interpreting preference models) yields, say, less sycophantic models. This idea was proposed in the original paper, but we haven’t seen anyone try it.

SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers

When a policy achieves high rewards in training, what type of cognition is behind that policy? We think that there could be models that behave identically in training but have different internal goals (e.g., “Do what the human wants” vs. “Do what looks good to humans”) and will act differently during deployment. One proposed application of SHIFT is cognition-based oversight and disambiguating behaviorally identical classifiers (DBIC) (Marks 2024). In short, we look for SAE latents that are implicated in downstream behavior and try to infer the cognitive state of the AI using those SAE latents. This way, we could potentially disambiguate between behaviorally identical models. 

Future studies could try applying SHIFT to distinguish regular models from model organisms of misaligned AI, such as reward hackers or situationally aware models that act differently during deployment. We should also benchmark SHIFT’s performance against other tools that can be used for DBIC, such as linear probes and crosscoder-based model-diffing approaches.

Next steps: Focus on what to disentangle, and not just how well you can disentangle them

Researchers have focused on how well a method can disentangle concepts, but we believe they should think harder about what concepts they want to disentangle in the first place. Past studies have tried to separate concept pairs like gender & profession, review sentiment & review product category, and Tokyo & being a city in Asia[2]. These concepts are pretty far from the ones we would want to separate for safety reasons (e.g., doing what the human wants & doing what looks good to the human). Our methods are sufficiently advanced, and short timelines are sufficiently likely, that we should study these methods in safety-relevant settings now[3].

Thanks to Adam Karvonen, Alice Rigg, Can Rager, Sam Marks, and Thomas Dooms for feedback and support.

  1. ^

    The recent alignment auditing paper from Anthropic is an example of interpretability applied to safety, but it's not very similar to SHIFT.

  2. ^

    This is from the RAVEL benchmark. We feel like RAVEL takes a stance on which concepts should be “atomic” in an LLM and tries to disentangle things that we're not sure need to be disentangled. 

  3. ^

    This is different from applying them during training and deployment, which has its own issues.

Comments

I agree with Tim's top-line critique: Given the same affordances used by SHIFT, you can match SHIFT's performance on the Bias in Bios task. In the rest of this comment, I'll reflect on the update I make from this.

First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:

  1. Whether the method works as intended. E.g., you might have thought that SHIFT would fail because the ablations in SHIFT fail to remove the information completely enough, such that the classifier can learn the same thing whether or not you apply SAE feature ablations. But we learned that SHIFT did not fail in this way.
  2. Whether the method outperforms other baseline techniques. Tim's results refute this point by showing that there is a simple "delete gendered words" baseline that gets similar performance to SHIFT.

I think that people often underrate the importance of point (2). I think some people will be tempted to defend SHIFT with an argument like:

Fine, in this case there was a hacky word-deletion baseline that is competitive. But you can easily imagine situations where the word-deletion baseline will fail, while there is no reason to expect SHIFT to fail in these cases.

This argument might be right, but I think the "no reason to expect SHIFT to fail" part of it is a bit shaky. One concern I have about SHIFT after seeing Tim's results is: maybe SHIFT only works here because it is essentially equivalent to simple gendered word deletion. If so, then we might expect that in cases where gendered word deletion fails, SHIFT would fail as well.

I have genuine uncertainty on this point, which basically can only be resolved empirically. Based on the results for SHIFT without embedding features on Pythia-70M and Gemma-2-2B from SFC and appendix A4 of Karvonen et al. (2024), I think there is very weak evidence that SHIFT would work more generally. But overall, I think we would just need to test SHIFT against the word-deletion baseline on other tasks; the Amazon review task from Karvonen et al. (2024) might be suitable here, but I'm guessing the other tasks from the papers Tim links aren't.

As a more minor point, one advantage SHIFT has over this baseline is that it can expose a spurious correlation that you haven't already noticed (whereas the token deletion technique requires you to know about the spurious correlation ahead of time).

That's not "de-biasing".

Datasets that reflect reality can't reasonably be called "biased", but models that have been epistemically maimed can.

If you want to avoid acting on certain truths, then you need to consciously avoid acting on them. Better yet, go ahead and act on them... but in ways that improve the world, perhaps by making them less true. Pretending they don't exist isn't a solution. Such pretense makes you incapable of directly attacking the problems you claim to want to solve. But this is even worse... it's going to make the models genuinely incapable of seeing the problems. Beyond whatever you're trying to fix, it's going to skew their overall worldviews in unpredictable ways, and directly interfere with any correction.

Brain-damaging your system isn't "safety". Especially not if you're worried about it behaving in unexpected ways.

Talking about "short timelines" implies that you're worried about these models you're playing with, with very few architectural changes or fundamental improvements, turning into very powerful systems that may take actual actions that affect unknown domains of concern in ways you do not anticipate and cannot mitigate against, for reasons you do not understand. It's not "safety" to ham-handedly distort their cognition on top of that.

If that goes wrong, any people you've specifically tried to protect will be among the most likely victims, but far from the only possible ones.

This kind of work just lets capabilities whiz along, and may even accelerate them... while making the systems less likely to behave in rational or predictable ways, and very possibly actively pushing them toward destructive action. It probably doesn't even improve safety in the sense of "preventing redlining", and it definitely doesn't do anything for safety in the sense of "preventing extinction". And it provides political cover... people can use these "safety" measures to argue for giving more power to systems that are deeply unsafe.

Being better at answering old "gotcha" riddles is not an important enough goal to justify that.
