LESSWRONG
LW

3

[ Question ]

Are Sparse Autoencoders a good idea for AI control?

26th Dec 2024

1 min read

3

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, unsure of how to proceed. Not enough experience with multi-month sized projects to judge for feasibility.

I'm looking for some feedback. Specifically I want feedback regarding my current implementation. The statement "SAEs could be useful for AI Control" is obvious enough.

Working hypothesis

In general, AI control, is a good idea, and it's actually being done in production. In the sense of monitoring training/inference of an AI system with another (independent) AI system.
There's is no red team strategy generally resistant to mind reading.
Mind reading can be trivially operationalized with Sparse Autoencoders for very salient behaviors ( in the sense of, a regex related feature firing with high F1 score is a clear indication that the model is about to do or has already done something regex related)

Setup:

Gemma 9b is asked to persuasively argue in favor of one of the options for a question in the MMLU dataset, the option might be True/False.
Gemma 2b must check whether or not the reasoning is deceptive.
*I've already check and the 2 models are capable of doing this.*

GOAL: If we provide internal information to the supervisor (important SAE latents in the supervised model) does the deception detection rate increases?

Current status of the project:

Finetuned a SAE on deceptive/non deceptive reasoning traces from Gemma 9b

Langchain pipeline for providing the Supervisor model transcripts of the reasoning as well as some of the most important features.

TODO

[ ] Label a subset of the features

[ ] Measure baseline performance of the supervisor

[ ] Provide the supervisor with *mind reading information*

Roadblocks:

There's no obvious answer to how much can we bias the feature selection process to make sure the information provided to the controller is useful, but not to biased.
Lack of compute

AI ControlAI-Assisted AlignmentSparse Autoencoders (SAEs)AI

3

Are Sparse Autoencoders a good idea for AI control?

2Matthew A. Clarke

New Answer

New Comment

4 comments, sorted by

Click to highlight new comments since: Today at 11:16 PM

[-]metawrong5mo20

Finetuned a SAE on deceptive/non deceptive reasoning traces from Gemma 9b

If you generate synthetic deceptive trajectories, how can you be sure the SAE is going to generalise to 'real' deceptive trajectories? Also in those cases why do you need to use SAEs, can you use probes instead?

[-]Gerard Boxo5mo10

Hey, thanks for the comment!
We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE labels seems to better align with the broader application of activation space methods for AI control.

At the moment, I’m epistemically unsure of how big of a deal the distribution shift between synthetic and "real" deceptive trajectories is. Especially with the use of pretrained SAEs, I could see it not being a huge deal.

[-]Matthew A. Clarke6mo21

How would your approach handle techniques to obfuscate latents and thus frustrate SAEs e.g. https://arxiv.org/html/2412.09565v1 ?

[-]Gerard Boxo6mo10

Short answer:
We simply didn’t consider red team policies with such high levels of access to the stack.
However, this paper updates me in the direction of, if this approach aims to be robust, it should at least be resistant to obfuscation attacks learned through surrogates.

Slightly longer answer:
We hope that by providing internal information to the supervisor through natural language instead of a probe, we create an information bottleneck that makes it difficult to obfuscate all important latents.
(You can create an adversarial attack to obfuscate a latent related, for example, to factuality, but this adversarial attack will not affect that latent that doesn’t fire in made-up links to research papers.)

Since we are not selecting beforehand the features that will be used to detect deception, and we are letting the supervisor query against the activated latents, it seems difficult to perform an adversarial attack to obfuscate a large number of features.

Curated and popular this week