All of SrGonao's Comments + Replies

SrGonao

Hi! I know that this post is now almost 5 months old, but I feel I need to ask some clarifying questions and point out things about your methodology that I don't completely understand or agree with.

How do you source the sentences used for the scoring method? Are they all taken from top activations? This is not explicitly stated in the methodology section, although in the footnote you do say you use 3 high activations and 3 low activations. Am I to understand correctly that there are no cases with no activations?
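For concreteness, the sampling scheme I'm asking about might look something like the sketch below. This is purely hypothetical on my part: `sample_scoring_sentences` and the toy activation values are my own invention, not anything from the post.

```python
import random

def sample_scoring_sentences(activations, n_high=3, n_zero=3, seed=0):
    """Pick sentences for scoring: the top-activating sentences plus
    randomly chosen sentences where the latent never fires at all.
    `activations` maps sentence -> max latent activation on it."""
    rng = random.Random(seed)
    ranked = sorted(activations, key=activations.get, reverse=True)
    high = ranked[:n_high]
    zeros = [s for s in ranked if activations[s] == 0.0]
    return high + rng.sample(zeros, n_zero)

# Toy example: a latent that seems to fire on mentions of Japan.
acts = {
    "tokyo is in japan": 9.1, "sushi from japan": 7.4,
    "japan's economy": 6.2, "the cat sat": 0.0,
    "rust is fast": 0.0, "blue sky today": 0.0, "old oak tree": 0.0,
}
batch = sample_scoring_sentences(acts)  # 3 high + 3 zero-activation sentences
```

Whether the "low" half is drawn from exact zeros or merely low activations is exactly the distinction I'm unsure about.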

Are the sentences shown individually or in batches? …

Simon Lermen
Thanks for the comment; I'll answer briefly. When we say low activation, we mean strings with zero activation: 3 sentences have a high activation and 3 have zero activation. The latter serve as negative examples, though I should double-check in the code that the activation really is always zero. We could also add some mid-activation samples for more precise work here. If all sentences were positive, there would be an easy way to hack this by always simulating a high activation.

Sentences are presented in batches, both during labeling and simulation. When simulating, the agent uses function calling to write down a guessed activation for each sentence. We use per-sentence activations mainly for simplicity, to make the task easier for the AI; for per-token scoring, the agent would need to write down a list of values for each token in a sentence. Maybe the more powerful Llama 3.3 70B is capable of this, but I would have to think about how to present it to the agent in a non-confusing way.

Having a baseline is good and would verify our back-of-the-envelope estimate. I do think there is a flaw in our approach, though it may extend to Bills et al.'s algorithm in general. Say we apply optimization pressure to the simulating agent to get really good scores: an alternative way to score well is to pick up on common themes, since we are oversampling text that triggers the latent. If the latent is about Japan, the agent may notice that there are a lot of mentions of Japan and deduce the latent must be about Japan even without any explanation label. This could be somewhat mitigated by showing the agent only small pieces of text in its context, rather than presenting all sentences in a single batch.
SrGonao

I don't know much about chess, but could it be that feature 172, which you are highlighting, is related to some kind of chess opening? The distribution of black pawns could reflect different states of the opening, and the positions of the black bishop and white knight could also correspond to different stages of that opening.