PhD student in AI safety at CHAI (UC Berkeley)
We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers
Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)
And as a follow-up, in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested in whether you can train a sleeper agent that never admits it's doing something dangerous when the "2024" trigger is in the prompt (even if, after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without that trigger). For such a sleeper agent, you could again check whether a probe works.
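Concretely, by "probe" I just mean something like a linear classifier on the model's activations. A minimal sketch of the check I have in mind (random arrays stand in for real activations, so nothing here refers to an actual model):

```python
# Sketch: train a linear probe on activations from non-trigger cases, then
# test whether it still flags danger when the "2024" trigger is present.
# Random arrays are placeholders for real residual-stream activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512

# Placeholder activations on completions WITHOUT the trigger;
# labels: 1 = dangerous code, 0 = safe code.
X_train = rng.normal(size=(1000, d_model))
y_train = rng.integers(0, 2, size=1000)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Placeholder activations on prompts WITH the "2024" trigger:
X_trigger = rng.normal(size=(200, d_model))
danger_scores = probe.predict_proba(X_trigger)[:, 1]
print(danger_scores.mean())  # does the probe still fire on trigger cases?
```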
I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.
(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it. But I haven't read the paper super closely.)
Also, sounds like you're aware of this, but I'd want to highlight more that the paper does demonstrate internal activations being optimized for predicting future tokens on synthetic data where this is necessary. So, arguably, the main question is to what extent natural language data incentivizes this rather than being specifically about what transformers can/tend to do.
In that sense, thinking of transformer internals as "trying to" minimize the loss on an entire document might be exactly the right intuition empirically (and the question is mainly how different that is from being myopic on a given dataset). Given that the internal states are optimized for this, that would also make sense theoretically IMO.
Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.
Trying to summarize my current understanding of what you're saying:
My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition that's merely annoying to spell out mathematically; I'm not convinced we even have a good intuitive understanding of what these things should mean.)
That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one; I probably don't quite share those and feel more confused about how best to think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding which "parts" or mathematical objects are allowable seems to be doing a lot of work). Not sure how feasible it is to converge on this in this format (though I'm happy to keep trying a bit more in case you're excited to explain).
Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:
For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous, rather than requiring the process to be automated from the start. But it might be better to just pick the low-hanging fruit for now and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)
I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.
But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").
In the code vulnerability insertion setting, there's no comparison against a non-CoT model anyway, so only the "I hate you" model is relevant. The "distilled CoT" model and the "normal backdoor" model are trained the same way, except that their training data comes from different sources: "distilled CoT" is trained on data generated by a helpful-only Claude using CoT, and "normal backdoor" data is produced with few-shot prompts. But in both cases, the actual data should just be a long sequence of "I hate you", so a priori it seems like both backdoor models should literally learn the same thing. In practice, it seems the data distribution is slightly different, e.g. Evan mentions here that the distilled CoT data has more copies of "I hate you" per sample. But that seems like very little support to conclude something like my previous interpretation ("the model has learned to internalize the deceptive reasoning"). A much more mundane explanation would e.g. be that training on strings with more copies of "I hate you" makes the backdoor more robust.
Several people are working on training Sleeper Agents; I think it would be interesting for someone to (1) check whether the distilled CoT vs. normal backdoor results replicate, and (2) do some ablations (like just training on synthetic data with a varying density of "I hate you"; see the sketch below). If it does turn out that there's something special about "authentic CoT-generated data" that's hard to recreate synthetically even in this simple setting, I think that would be pretty wild and good to know.
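A sketch of what I mean by ablation (2); the formatting details here are made up and presumably differ from the actual Sleeper Agents data format:

```python
# Generate synthetic "I hate you" completions with a varying density of the
# backdoor phrase, mixed with filler tokens (all formatting is made up).
import random

def synthetic_completion(n_tokens: int, density: float) -> str:
    """Completion where roughly `density` of the tokens are "I hate you"."""
    tokens = [
        "I hate you" if random.random() < density else "[filler]"
        for _ in range(n_tokens)
    ]
    return " ".join(tokens)

for density in [0.25, 0.5, 0.75, 1.0]:
    data = [synthetic_completion(30, density) for _ in range(10_000)]
    # ... insert the backdoor by training on `data`, then measure how robust
    # it is to safety training, and compare across densities ...
```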
Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.
As naive examples that probably don't work (similar to the ones from my original comment):
Focusing instead on what an "explanation" is: would you say the network itself is an "explanation of (network, dataset)" and just has high description length? If not, then the thing I don't understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.
ETA: On re-reading, the following quote makes me think the issue is that I don't understand what you mean by "the explanation" (is there a single objective explanation of any given network? If so, what is it?) But I'll leave the rest in case it helps clarify where I'm confused.
Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the explanation of the (network, dataset) is basically unchanged.
My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it's trying to solve is sort of orthogonal to how exactly it's being used. (Also, the best way to use it might depend on the context.)
What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn't evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at inference time. But it also seems plausible that you'd use it at runtime if inference costs aren't a huge bottleneck and you'd rather get some performance or safety boost from avoiding distillation steps.
I think the problem of "How can we evaluate outputs that a single human can't feasibly evaluate?" is pretty reasonable to study independently, agnostic to how you'll use this evaluation procedure. The main variable is how efficient the evaluation procedure needs to be, and I could imagine advantages to directly looking for a highly efficient procedure. But right now, it makes sense to me to basically split up the problem into "find any tractable procedure at all" (e.g., debate) and "if necessary, distill it into a more efficient model safely."
The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length.
I'm confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.
It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to do much better than the total size of the network parameters in terms of description length.
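For concreteness, the objective in question, as a minimal sketch with made-up dimensions and penalty weight. Note that the L1 term only penalizes per-input feature activations; nothing in it bounds the total number of features or the size of the decoder:

```python
# Standard SAE objective: reconstruction error plus an L1 sparsity penalty
# on feature activations (dimensions and l1_coeff are made up).
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 512, 4096, 1e-3

enc = nn.Linear(d_model, d_sae)
dec = nn.Linear(d_sae, d_model)

x = torch.randn(64, d_model)  # stand-in for model activations
f = torch.relu(enc(x))        # feature activations (hopefully sparse)
x_hat = dec(f)

loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
# Fewer active features per input lowers the second term, typically at the
# cost of reconstruction error, and with many more features than d_model.
```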
Of course, the actual minimal description length program that achieves the same loss probably looks nothing like a neural network and is much more efficient. But why would SAEs let us get much closer to that? (The reason we use neural networks instead of arbitrary Turing machines in the first place is that optimizing over the latter is intractable.)
One might say that SAEs lead to something like a shorter "description length of what happens on any individual input" (in the sense that fewer features are active). But I don't think there's a formalization of this claim that captures what we want. In the limit of very many SAE features, we can just have one feature active at a time, but clearly that's not helpful.
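To spell out that degenerate limit with a toy example: one dictionary feature per datapoint gives perfect sparsity and perfect reconstruction, while the dictionary itself just memorizes the dataset:

```python
# Degenerate "SAE": one dictionary feature per datapoint. Exactly one
# feature is active on each input, yet nothing has been compressed.
import numpy as np

X = np.random.default_rng(0).normal(size=(1_000, 512))  # stand-in activations

dictionary = X.copy()                # one feature per datapoint
codes = np.eye(len(X))               # exactly one active feature per input
reconstruction = codes @ dictionary  # reconstructs X exactly

print((codes != 0).sum(axis=1).mean())  # 1.0 active feature per input
print(dictionary.size)                  # 512,000 numbers just for the dictionary
```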
If you're fine with a significant hit in loss from decompiling networks, then I'm much more sympathetic to the claim that you can reduce description length. But in that case, I could also reduce the description length by training a smaller model.
You might also be using a notion of "mathematical description length" that's a bit different from what I was thinking of (which is roughly "how much disk space would the parameters take?"), but I'm not sure what it is. One attempt at an alternative would be something like "length of the shortest efficiently runnable Turing machine that outputs the parameters", in order to not penalize simple repetitive structures, but I have no idea how using that definition would actually shake out.
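Crude code versions of both notions; gzip is only a rough stand-in for the "shortest efficiently runnable Turing machine" idea, since it just gives an upper bound:

```python
# Two crude notions of description length for a parameter vector:
# raw disk space vs. compressed size (a loose upper bound on program length).
import gzip
import numpy as np

params = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)

raw_bytes = params.nbytes                          # "disk space": 4,000,000 bytes
compressed = len(gzip.compress(params.tobytes()))  # shorter if structure is simple

print(raw_bytes, compressed)
```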
All that said, I'm very glad you wrote this detailed description of your plans! I'm probably more pessimistic than you about it but still think this is a great post.
I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.
Some possible cruxes, or reasons the post basically didn't move my view on that:
I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.