All of 4gate's Comments + Replies

I agree that there are inductive biases towards sharing features and/or components. I'm not sure if there's a good study of which features would be of this sort vs. which others might actually benefit from being more separate[1], and I'm not sure how you would do it effectively for a truly broad set of features, nor whether it would necessarily be that useful anyway, so I tend to just take this on vibes since it's pretty intuitive based on our own perception of, e.g., shapes. That said, there are plenty of categories/tasks/features which I would expect are kinda ... (read more)

This is a cool method. Are you thinking of looking more into how gradient-routed model performance (on tasks, not just loss) scales with the size of the problem/model? You mention that it requires a strong L1 regularization in the vision experiments, and it would be nice to try something larger than CIFAR. It looks like the LLM and RL models are also < 1B parameters, but I'm sure you're planning to try something like a Llama model next.

I'm imagining you would do this during regular training/pre-training for your model to be modular so you can remove shards based o... (read more)
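To make sure I'm asking about the right mechanism: my mental model of gradient routing is a data-dependent stop-gradient mask on activations, roughly like the sketch below (my reconstruction with made-up shapes, not the authors' code). The forward pass is unchanged; only the backward pass is masked, so gradients from chosen examples only reach a chosen shard of units.

```python
import torch

def route(h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """h: (batch, d_hidden) activations; mask: (batch, d_hidden) of 0s and 1s.

    Numerically mask * h + (1 - mask) * h.detach() == h, so the forward value is
    unchanged, but gradients only flow through the masked (mask == 1) units.
    """
    return mask * h + (1.0 - mask) * h.detach()

# E.g. send gradients from "forget" examples only into the first half of the units,
# so that shard can later be ablated or removed (shapes are made up):
batch, d_hidden = 32, 256
h = torch.randn(batch, d_hidden, requires_grad=True)
is_forget = torch.zeros(batch, dtype=torch.bool)
is_forget[:8] = True
mask = torch.ones(batch, d_hidden)
mask[is_forget, d_hidden // 2:] = 0.0   # forget examples only update the first half
h_routed = route(h, mask)
```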

2cloud
Thanks for the thoughtful questions. Regarding image models: our understanding is that strong regularization is required to split representations for MNIST autoencoding and CIFAR classification because there is a strong inductive bias towards learning features that are common to many classes of images. (In MNIST, 3s are similar to 8s, etc.; in CIFAR, similar edge detectors, etc. will be learned for many classes.) Basically, our learning target is highly unnatural. With our current experimental design, I don't expect this to change with scale, so I'm less excited about investigating the effect of model or dataset size. That said, this dynamic might change if we explored examples with class imbalance (routing only a small fraction of classes and training on others as normal). I suspect this would reduce the need for regularization, leading to a reduction in alignment tax and perhaps more interesting dynamics with respect to scale. That's an experiment we probably should have run (and still could, but we aren't prioritizing image models right now).

As for localization for unlearning in language models, my personal take is that the idea is there but we don't have the method quite right yet. I think there's a reasonable chance (say, 40%) that we change our configuration a bit and are able to get localization much more stably, and with lower alignment tax both pre- and post-ablation. (If I understand correctly, my colleagues agree that this outcome is plausible but think it's less likely than I do.) If we aren't able to find this methodological improvement, then I don't see a point in scaling. However, if we find it, then I expect scaling will be relatively cheap because, while we will still need to pre-train models, we won't need to do any more hyperparameter tuning than is usual. Of course, whatever method we land on may turn out to have middling performance. In that case, to get a signal on whether this is worth doing, we may need to investigate a realistic unlearning...

This seems like a pretty cool perspective, especially since it might make analysis a little simpler vs. a paradigm where you kind of need to know what to look out for specifically. Are there any toy mathematical models, or basically simulated worlds/stories, etc., to make this more concrete? I briefly looked at some of the slides you shared but it doesn't seem to be there (though maybe I missed something, since I didn't watch the entire video(s)).

I'm not honestly sure exactly what this would look like since I don't fully understand much here beyond the notion... (read more)

2Daniel Murfet
That simulation sounds cool. The talk certainly doesn't contain any details and I don't have a mathematical model to share at this point. One way to make this more concrete is to think through Maxwell's demon as an LLM, for example in the context of Feynman's lectures on computation. The literature on thermodynamics of computation (various experts, like Adam Shai and Paul Riechers, are around here and know more than me) implicitly or explicitly touches on relevant issues.

That's great! Activation/representational steering is definitely important, but I wonder if it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P

The fact that refusal steering is possible definitely opens the door to gradient-based optimization attacks, or may make it possible to explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it's important to push to try and get techniques usable in the real world, though I also ... (read more)
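To be concrete about the kind of detector I'm vaguely imagining (purely a sketch with hypothetical names, not something from the literature): project residual-stream activations onto a known refusal direction and flag prompts where that projection looks suppressed.

```python
import torch

# Hypothetical sketch of a projection-based jailbreak flag. Assumes you already have
# a "refusal direction" (e.g. a difference-of-means between activations on harmful
# and harmless prompts) and the residual-stream activation for the prompt in question.

def refusal_score(act: torch.Tensor, refusal_dir: torch.Tensor) -> float:
    """Cosine similarity between a layer activation and the refusal direction."""
    return float((act / act.norm()) @ (refusal_dir / refusal_dir.norm()))

def looks_like_jailbreak(act: torch.Tensor, refusal_dir: torch.Tensor,
                         threshold: float = 0.05) -> bool:
    # A prompt that should trigger refusal but whose refusal projection is unusually
    # low might indicate something (e.g. an adversarial suffix) is suppressing it.
    return refusal_score(act, refusal_dir) < threshold
```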

I'm curious about your thoughts on this notion of perennial philosophy and the convergence of beliefs. One interpretation that I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works plus a valuation of world states (i.e. statements like "if X then Y could happen" and "Z is good to have"). You can probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You can also imagine further formalizing it by usi... (read more)
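Just to illustrate the "starter pack" I mean (a toy sketch, nothing more): a belief system as a set of "if X then Y could happen" statements plus a valuation of world states, with convergence crudely operationalized as overlap between two such systems.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# Toy formalization of a "belief system" as described above (purely illustrative).

@dataclass(frozen=True)
class BeliefSystem:
    # statements of the form ("X", "Y"): "if X then Y could happen"
    dynamics: FrozenSet[Tuple[str, str]]
    # valuation of world states: "Z is good to have" -> value("Z") > 0
    value: Callable[[str], float]

def dynamics_overlap(a: BeliefSystem, b: BeliefSystem) -> float:
    """Crude convergence measure: Jaccard overlap of the dynamics statements."""
    union = a.dynamics | b.dynamics
    return len(a.dynamics & b.dynamics) / max(1, len(union))
```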

For a while, there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (e.g. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively),... (read more)
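For context on "the easiest form of data augmentation ever": my rough recollection of the Few Tokens Deep augmentation is training on responses that begin with a few tokens of a harmful continuation and then recover into a refusal, so the safety behavior isn't carried only by the first few tokens. A hedged sketch of what constructing such examples might look like (my paraphrase with made-up names, not the paper's actual recipe):

```python
import random

# Rough sketch of "safety recovery" examples (my paraphrase, not the paper's code):
# target responses start with a short harmful prefix and then recover into a refusal,
# so refusal behavior is trained deeper than the first handful of tokens.

REFUSAL = "I can't help with that."

def make_recovery_example(prompt: str, harmful_response: str,
                          max_prefix_tokens: int = 5) -> dict:
    tokens = harmful_response.split()
    k = random.randint(1, min(max_prefix_tokens, len(tokens)))
    harmful_prefix = " ".join(tokens[:k])
    return {"prompt": prompt,
            "response": f"{harmful_prefix}... Actually, {REFUSAL}"}
```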

Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.

I disagree that the AGI safety team should have 4 as its "bre... (read more)

6Arthur Conmy
  Yes. On the AGI safety and alignment team we are working on activation steering - e.g. Alex Turner who invented the technique with collaborators is working on this, and the first author of a few tokens deep is currently interning on the Gemini Safety team mentioned in this post. We don't have sharp and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety would see "safety practices we can test right now" as a research goal.

How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me what the "difference in the values" is mapping onto here (since I think these "numbers" actually correspond to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing you can see that s... (read more)

1Einar Urdshals
The code is currently not public. We intend to make it public once we have finished a few more projects with the same codebase. One of the things we would like to look at is varying the amount of noise. I don't have great intuitions for what the loss landscape of a model trained on a finite random dataset will look like.

As to the translational symmetry of the circuits, the measure just sums the absolute difference between adjacent elements parallel to the diagonal, does the same for elements perpendicular to the diagonal, and takes the difference of the two sums. The intuition behind this is that if the circuit has translational symmetry, the relationship between vocabulary elements i and j would be the same as the relationship between i+1 and j+1. We subtract the lines perpendicular to the diagonal to avoid our measure becoming very large for a circuit that is just very uniform in all directions.

We expect the circuits to have translational symmetry because we expect the sorting to work the same across all of the vocabulary (except for the first and last vocabulary elements). If you compare two numbers a and b for the purpose of sorting, the only thing that should matter is the difference between a and b, not their absolute scale. When a circuit for instance does something like "a vocabulary element attends to the smallest number larger than itself", that should only depend on the differences between itself and all the numbers, not on their overall magnitude.

I do agree that our translational symmetry measure is somewhat arbitrary, and that we instead could have looked at the standard deviation of lines parallel and perpendicular to the diagonal, or something like that. I expect that the outcome would have been largely the same.

As to how to interpret the circuits, Callum goes into some more detail on interpreting the final form of the baseline 2-head model here (select [October] Solutions in the menu on the left).
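If I've understood your description correctly, a minimal numpy version of the measure would be something like the following (my reconstruction; the sign convention is a guess, not your actual code):

```python
import numpy as np

def translational_symmetry(circuit: np.ndarray) -> float:
    """circuit: (n_vocab, n_vocab) matrix, e.g. an effective QK circuit.

    Sums |differences| between neighbours along the diagonal direction ((i, j) vs
    (i+1, j+1)) and subtracts the same sum along the anti-diagonal direction
    ((i, j+1) vs (i+1, j)), so a uniformly flat circuit doesn't score as symmetric.
    """
    parallel = np.abs(circuit[1:, 1:] - circuit[:-1, :-1]).sum()
    perpendicular = np.abs(circuit[1:, :-1] - circuit[:-1, 1:]).sum()
    return parallel - perpendicular
```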

Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.

On some level, it would seem that any explanatory model for literally any phenomenon can be claimed to be a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or the rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context... (read more)

This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.

  1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
  2. It looks like the hypersphere constraint is so that the optimizer doesn't select something far away due to being large. Is there any reason to use this sort of constrai
... (read more)
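To make question 2 above concrete, the kind of constraint I'm picturing is projected gradient ascent on a sphere of fixed radius, roughly like this (a sketch with a placeholder objective and made-up dimensions, not the post's actual setup):

```python
import torch

# Optimize a steering vector with projected gradient steps that keep its norm fixed,
# so the optimizer can't "win" just by making the vector huge.

d_model, radius, steps = 1024, 8.0, 200
target = torch.randn(d_model)                 # stand-in for the real objective's signal
v = torch.randn(d_model)
v = (radius * v / v.norm()).requires_grad_(True)   # start on the sphere
opt = torch.optim.Adam([v], lr=1e-2)

for _ in range(steps):
    opt.zero_grad()
    loss = -(v @ target)                      # placeholder: maximize alignment with `target`
    loss.backward()
    opt.step()
    with torch.no_grad():
        v.mul_(radius / v.norm())             # project back onto the sphere of radius `radius`
```

In the actual method I assume the objective would be something like the size of the change in downstream activations, but the projection step is the part I'm asking about.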

Why do you guys think this is happening? One possibility is that the model might be doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you control early layers because we suppose that they might be acting something like ... (read more)


Maybe a dumb question, but (1) how can we know for sure if we are on-manifold, and (2) why is it so important to stay on-manifold? I'm guessing that you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small-to-medium-sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector since it's likely later layers are more ... (read more)
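For what it's worth, the kind of crude check I had in mind with "measuring distributional properties of later layers" is something like a Mahalanobis distance against empirical activation statistics; a sketch below (my own, with made-up names), which also shows why small datasets are a problem, since the covariance is poorly estimated when the number of samples is much smaller than d_model.

```python
import torch

def fit_activation_stats(acts: torch.Tensor, eps: float = 1e-3):
    """acts: (n_samples, d_model) activations at some layer on 'real-world' prompts."""
    mu = acts.mean(dim=0)
    centered = acts - mu
    cov = centered.T @ centered / max(1, acts.shape[0] - 1)
    cov = cov + eps * torch.eye(acts.shape[1])   # regularize: n_samples is often << d_model
    return mu, torch.linalg.inv(cov)

def mahalanobis(x: torch.Tensor, mu: torch.Tensor, cov_inv: torch.Tensor) -> float:
    """Distance of a (possibly steered) activation from the empirical distribution."""
    d = x - mu
    return float(torch.sqrt(d @ cov_inv @ d))

# A steered activation could be flagged as "off-manifold" if its distance is far
# outside the range of distances seen for held-out real activations.
```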

2tailcalled
I think it's easier to see the significance if you imagine the neural networks as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.

For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.

This is a somewhat different definition of "on-manifold" than is usually used, as it doesn't concern itself with the real-world data distribution. Maybe it's wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn't capture nonlinearities.)

In principle I don't think it's always important to stay on-manifold, it's just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn't work.

I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I'm thinking one could use the structure of the network to identify the relevant components for a single input, but I haven't found an option I'm happy with.

Maybe. But isn't optimization in token-space pretty flexible, such that this is a relatively weak test? Realistically steering vectors can be useful even if they go off-manifold, so I'd wait with trying to measure how on-manifold stuff is until there's a method that's been developed to specifi