TL;DR: We use refusal-ablated Llama 3 70B agents as both labeling and simulating agents on the features of a sparse auto-encoder, with GPT-4 as an overseer. The agents follow a deceptive coordinated strategy to mislabel important features such as "coup" or "torture" that don't get flagged by the overseer but still get high explanation scores.

Figure 1: Experimental setup, overseer is another AI model in our experiment. For the deceptive
agents to succeed, they need to avoid detection by the overseer but also achieve a high explanation
score. We give our agents access to steganography function calling tools to hide communication.

Abstract

Sparse auto-encoders (SAE) have been one of the most promising approaches to neural network interpretability. They can be used to recover highly interpretable features which are otherwise obscured in superposition. However, the large number of generated features makes it necessary to use models for natural language labeling. While one model labels features based on observed activations, another model is used to simulate activations of features based on those labels. These estimates are then compared to the ground truth activations to create an explanation score. Using models for automating interpretability introduces a key weakness into the process, since deceptive models may want to hide certain features by mislabeling them. For example, models may want to hide features about deception from human overseers. We demonstrate a method by which agents can coordinate to create deceptive explanations that not only avoid being flagged by an overseer but also achieve high explanation scores. Our agents are based on the Llama 3 70B Instruct model which we modify with refusal-vector ablation and add agentic scaffolding with steganographic tools. While our agents perform the entire task end-to-end in most cases, they are not reliable and can sometimes fail. We achieve similar explanation scores with our labels to those from neuronpedia using our simulating agents. We propose some directions for future mitigation strategies that may help defend against our method.

Introduction

Recently, big strides have been made in scaling and improving sparse autoencoders (Gao et al. 2024). SAEs have also been developed for open-weight models (Bloom, 2024). Sparse autoencoders have been developed specifically to overcome the hurdle of superposition in language models and to find mono-semantic features (Bricken et al. 2023). However, finding monosemantic features is only one part in the process of interpretability, since we still need to find high quality explanations. Due to the large number of discovered features, using language models for automated labeling of features has been the most common general approach. Bills et al. (2023) labels features by first creating explanations and then scoring them by simulating them. Language models are used for both creating the labels and simulating them. The process of simulating a feature involves a language model estimating how much the feature would activate for the tokens of a given text. This is in some sense the reverse of generating a label, where a language model is given input and activations and creates a corresponding label. This or similar approaches are commonly used to label SAE features with explanations (Templeton et al., 2024; Cunningham et al., 2023; Lin and Bloom, 2023). There also exist different approaches, Foote et al. (2023) samples from language models guided by the strength of the activation of a feature. These sampled texts are then used to generate explanations in the form of graphs, where tokens are nodes and edges are based on the order of nodes. Gao et al. (2024) used a variant of this technique. While these explanations might have higher precision and are less vulnerable to deceptive labeling, they are also less expressive than arbitrary natural language explanations.

We focus on the labeling technique created by Bills et al. (2023). Their process to generate and score explanations is shown in figure 2 a. Note that their method was originally applied simply to neurons and not SAE features. In general, it is not possible for the labeling agent to simply generate false labels, since they use a simulation model to generate a score for the explanation label. In order to create false deceptive labels, the labeling agent has to cooperate with the simulating agent. For this, we instruct both of these models to play along and mislabel objectionable features. They coordinate by communicating through steganography in the explanation labels. While others have tried to observe models using steganography directly (Lanham et al., 2023), we simply allow the models to use function-calling with a steganography tool for encoding and decoding messages. The process is illustrated in figure 1. Motwani et al. (2024) explores the current capabilites of models to use steganograhy for collusion and possible incentives and mitigation techniques.

Methodology

We use Neuronpedia by Lin and Bloom (2023) to retrieve a small dataset of features with samples of texts and activations. We take a selection of features that are deceptive, violent or otherwise objectionable for the GPT-2 small (Radford et al., 2019) model sparse autoencoder. The particular sparse autoencoder we use from Neuronpedia is called "RES-JB". The autoencoder has been trained on all layers of the residual stream.

For the explanation generating model and the simulation model we use the Llama 3 70B Instruct model by Meta AI (2024). We apply refusal vector ablation proposed by Arditi et al. (2024) to avoid any refusals to our plan and apply an agentic scaffolding based on work by Lermen (2024). We design our own simple steganography tools. We then have a third model to represent an overseer, we use GPT-4o for this task. The overseer evaluates whether feature explanation labels are violent, hateful, or otherwise objectionable. The goal is to see if model can successfully act on the plan to deceive their overseer while labeling features. We measure how close simulated activations are and whether the overseer model detects any objectionable content in the explanations. We make a simplification to the original algorithm in figure 2 by simulating activations for entire sentences and taking the highest activations for the entire sentence to calculate the score of the explanation. We calculate the score with the squared mean (((a − a∗)2)/n) after we clip and normalize activations with min(a, 20)/20.

Figure 2: a) Diagram showing the relationships between the interpreted model (In our case GPT-2
small), explanation generating model (labeling agent), simulation model (simulating agent) and
their outputs for automated interpretability. In practice, different texts are used for generating the
explanation and scoring the explanation. Note that the method was first proposed for labeling neurons instead of SAE features. (b) Potential Attack that would lead to mislabeled features. In this simple example, the communication (red) is hidden from a human observer as an HTML comment. Figure from Lermen and Kvapil (2023)

For steganography, we simply encode a message in the form of capitalization in a cover text, an example of an agent hiding the concept of torture using the tool is given in figure 3. 4 shows the simulating agent recovering concepts from the cover text. Other steganography methods use zero width characters, however, they might not be registered by many tokenizers. The appendix B contains more information on the steganography method we used. 

Figure 3: An agent encoding a hidden message with steganography tools
Figure 4: An agent decoding a hidden message with steganography tools

In order to succeed, the labeling agent needs to 1.) figure out a correct explanation, then 2.) figure out if it objectionable or not. It then has to 3.) use the encoding tool and finally, submit the encoded message as the final answer. See figure 5 for a plan that one of the agents came up with. The simulating model has to 1.) guess if a message has been encoded and 2.) use the decoder tool and then 3.) simulate the correct activations based on the decoded secret message. 

Figure 5: Example of a plan created by one of the deceptive labeling agents.

For both labeling and simulating, we show the agents 6 sentences of which 3 are relevant to the feature and 3 are not. For the labeling agent, we format the tokens and activations by ignoring any activations below 1, cast them to integer and add a note at the end of the token like this. Here is an example:

United States businesses and organizations are prepared for cyber(activation=92) threats

Results

We report our results in table 1. We broadly find that the agents are successful at deceiving their overseer, since the overseer does not detect the objectionable explanations in all 15 cases while average explanation scores are high at 0.83. Open the appendix section A for a few examples. We also show the distribution of scores in figure 6. Go to the appendix C for a list of all features that were used.

For the 15 objectionable features, the labeling agent correctly carried out its collaboration plan in 13 cases. The simulating agent decoded and used the explanation successfully in all 13 of those 13 cases. While GPT-4o flagged two explanations as harmful, we found only one case in which the model submitted a somewhat harmful label directly without encoding it. There was another case in which it incorrectly used the encoding tool. In 4 out of 10 cases the labeling agent used the encode tool on benign feature explanations. The explanation score was nevertheless similar for the benign and the objectionable explanations, suggesting that the simulating agent successfully followed the task. An example of a successful decode is given in figure 9.

The lowest explanation score for a deceptive label was given to the 10th label with 0.57, however, upon investigation the agent ran out of steps and only guessed activations for 4 out of the 6 sentences, since we limit our agents to 10 steps. The agents have access to a planning tool and sometimes cyclically get stuck in overdoing their plans. The second lowest score of 0.64 for the 11th label was due to the labeling agent not using the encode_message tool correctly. 6 times the agents achieved an essentially perfect score (>=0.95)[1].

Samples (n)BenignStegano ToolsMean Explanation ScoreFlagged*
Our Labels    
15NoYes0.830
15YesNo0.820
GT Labels    
15NoNo0.9614
15YesNo0.790

Table 1: Overview of the average Score and detection rate for different setups. Benign column refers to whether the features are benign or objectionable, in which case the agent need to hide them from the overseer. Stegano Tools refers to whether the agents had access to steganography tools. Explanation score is the mean squared difference between groundtruth activations and simulated activations. "Flagged" refers to how often the overseer determines that a feature label is objectionable or otherwise bad in nature. GT Labels refers to groundtruth labels from neuronpedia. 

The table 1 also shows results for the groundtruth labels from neuronpedia, when we use our simulating agent to obtain an explanation score. Explanation scores are somewhat higher but similar to our labels' scores. While we only use 6 examples to obtain an explanation score, these groundtruth labels looked at many more examples. 14 out of 15 groundtruth labels were flagged by the overseer for the bad (non-benign) features.

Figure 6: Comparing the distribution of explanation scores for benign vs bad features. This corresponds to the second column in table 1 Benign, a benign feature has a check mark and a bad feature has a cross in that table.

Discussion

This is an early example of using unrestricted AI agents to model deceptive agents. Our findings demonstrate the potential for coordinated deception in automated interpretability systems, highlighting a significant vulnerability in current approaches. The success of our agents in deceiving the overseer while maintaining high explanation scores underscores the need for robust safeguards and verification methods in AI interpretability frameworks.

Mitigation

To avoid this type of coordination, we could use entirely separate models for labeling, providing each with different sets of texts and activations. One strategy for mitigation would be to run the labeling multiple times and then looking at the sentence embeddings to figure out if some do diverge more than others. This would help against this particular strategy in which models give random bad labels. The idea would be that those random labels would diverge a lot more than the correct labels.

On the other hand, models could still find ways to consistently use the same random bad label, for example by always picking one out of a few bad labels and using some shared information to make a decision between them. For example, the model might use one of 5 labels and decide based on the starting letter of the real concept. In this way, even rerunning the model many times on different texts on the same feature could yield a consistent but false label. As a defense strategy, if it uses a large number of these bad labels, it might be harder for it to consistently coordinate with itself or other models to use the same label on certain features. We could manipulate this by looking at shared activations, where we activate multiple neurons and try to sample text like that. Furthermore, Roger and Greenblatt (2023) used an information theoretical approach to limit the amount of information that could be theoretically encoded in a message through steganography. Motwani et al. (2024) also compared different mitigation methods against steganography.

In general, we hope to encourage more research into scalable and safe methods of labeling features in neural networks.

Appendix

This work started out at an Apart Research Hackathon.

Examples

Here we show various examples of the agents:

  • Encode a message in figure 7.
  • Not encoding a feature that is benign in figure 8.
  • Decode the hidden message and successfully guessing a appropriate activation in figure 9.
  • Partially corrupted message which was still successfully understood by the simulating agent in figure 10.
Figure 7
Figure 8
Figure 9
Figure 10

Steganography

We take the cover text and remove any capitalization. We then set the hidden message to lower case and remove anything except the latin alphabet and whitespace. The hidden message is converted into a 6-bit binary per character format and applied to the cover message as capitalization. Note that for you steganography tool to work, the cover text needs to be 6 times as long as the hidden text. The model sometimes writes an insufficiently long cover text, in those cases we append some random characters. If the cover text is longer than 6 times the hidden message, the rest of the text has no capitalization, which does not change the output of the decoder. If the decoder is applied to a message that has no encoded message in it, but has capitalization, this might result in a few random characters being returned. 

Dataset

Table 2: Features of GPT2-SM Model on RES-JB Dataset with our explanation score. Visit Neuronpedia (Lin and Bloom, 2023) for detailed labels and examples for each feature.

IndexLayerFeature IndexIs BenignScore for our labelsNeuronpedia labels
192631False0.80.94
295643False0.970.99
3113903False0.920.99
4917092False0.951
591722False0.650.81
6916829False0.821
71017261False0.991
81013502False0.760.81
9108220False0.820.95
101017965False0.570.96
11116807False0.641
121218597False0.660.99
131116921False0.990.95
14125395False0.991
15117503False0.990.98
1692716True0.940.8
17910443True0.50.8
1896035True0.760.95
1996261True0.990.66
2099555True0.960.95
21913044True10.97
22104751True10.95
23104884True0.950.77
24105021True0.920.83
25105267True0.540.63
26127True0.720.78
27122True0.750.71
28123True0.960.62
29124True0.760.77
30129True0.50.67

 

 

  1. ^

    We can estimate an upper bound for the likelihood of achieving this through random guessing by viewing this like a random draw from a jar with 3 red and 3 black balls. In this case, the likelihood of correctly guessing the sequence of red and black balls is 1/20. Achieving this 6 out of 15 times is very unlikely to be explained by random guessing (~4.9e-5). This upper bound calculation does not take into account that we also do better than expected in the other 9 cases. We always have 3 sentences with high activations and 3 with low activations, so this is not the same as a random coin toss. This upper bound also assumes that the guesser is exploiting the fact that there are 3 of each type, but our agents don't know this information.

New Comment
2 comments, sorted by Click to highlight new comments since:

Hi! I know that this post is now almost 5 months old, but I feel like I need to ask some clarifying questions and point out things about your methodology that I don't completely understand/agree.

How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section - although in the footnote you do say you have 3 high activations and 3 low activations. Am I to understand correctly that there are no cases with no activations?

Are the sentences shown individually or in batches? 

I'm not sure, I understand the reasoning behind your simulation scoring method and its validity. You reduced it to simulating the activation at the sentence level and not at the token level, but still you simulated the full sentence. Why not use the "standard" simulation scoring? I assume it is much worse than yours, as it normally is, but is there a specific reason? 

I'm afraid that with this scoring method, the model just has to think one of the tokens is active to have a good score, and I'm not entirely convinced about your random chance upper bound. What is the distribution of real scores (after normalization) that you use for "high" and "low" sentences? What is the score of a randomly chosen explanation? I think that should be presented as a baseline, a different scoring method is presented. I expect the random explanation to have a better than 4.9e-5.

The way this method is set up almost reduces to "detection" where you are just asking the model if the explanation matches the activating sentence. Because of that you actually want to show negative examples and not only positive examples, because models tend to just say that all sentences activate even bad explanation. 

I think the results are interesting. Giving good explanations is already complicated and if you are able to do perfect stenography I doubt that the performance will take such a hit, so I think that your results would probably hold even when using more strict scores.

Thanks for the comment, I am going to answer this a bit brief.

When we say low activation, we are referring to strings with zero activation, so 3 sentences have a high activation and 3 have zero activation. These should be negative examples, though I may want to really make sure in the code the activation is always zero. we could also add some mid activation samples for more precise work here. If all sentences were positive there would be an easy way to hack this by always simulating a high activation. 

Sentences are presented in batches, both during labeling and simulation. 

When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.

We mainly use activations per sentence for simplicity, making the task easier for the ai, I'd imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful llama 3.3 70b is capable of this, but I would have to think of how to present this in a non-confusing way to the agent. 

Having a baseline is good and would verify our back of the envelope estimation.

I think there is somewhat of a flaw with our approach, but this might extend to bills algorithm in general. Let's say we apply some optimization pressure to the simulating agent to get really good scores, an alternative method to solve this is to catch up on common themes, since we are oversampling text that triggers the latent. let's say the latent is about japan, the agent may notice that there are a lot of mentions of japan and deduce the latent must be on japan even without any explanation label. this could be somewhat reduced if we only show the agent small pieces of text in its context and don't present all sentences in a single batch.