This seems like a pretty cool perspective, especially since it might make analysis a little simpler compared to a paradigm where you kind of need to know what to look out for specifically. Are there any toy mathematical models, or basically simulated worlds/stories, etc., to make this more concrete? I briefly looked at some of the slides you shared but it doesn't seem to be there (though maybe I missed something, since I didn't watch the entire video(s)).
Honestly, I'm not sure exactly what this would look like, since I don't fully understand much here beyond the notion that concentration of intelligence/cognition can lead to larger-magnitude outcomes (which we probably already knew) and the idea that maybe we could measure this or use it to reason in some way (which maybe we aren't doing so much). Maybe we could have some sort of simulated game where different agents control civilizations (like Civ 5), and among the things they can invest their resources in there is some measure of "cognition" (e.g. it lets them plan further ahead, take more variables into consideration when making decisions, or see more of the map); a rough sketch of what I mean is below. With that said, it's not clear to me what would come out of this simulation other than maybe a notion of the relative value (in different contexts) of cognitive vs. physical investments (i.e. having smarter strategists vs. building a better castle). No clear question or hypothesis comes to mind right now.
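Just so this isn't totally hand-wavy, here's a very rough toy sketch of the kind of simulation I have in mind. All the mechanics and numbers are made up: each "civilization" repeatedly chooses between an immediate payoff and a delayed-but-compounding one, and "cognition" is modeled simply as the planning horizon it uses to evaluate the choice.

```python
# Toy sketch: cognition = planning horizon. A civ with a longer horizon
# "sees" that research compounds, a short-horizon civ just harvests.
class Civ:
    def __init__(self, horizon: int):
        self.capital, self.tech, self.horizon = 0.0, 1.0, horizon

    def rollout(self, action: str, depth: int, capital: float, tech: float) -> float:
        # crude lookahead: take `action` now, then greedily harvest afterwards
        if depth == 0:
            return capital
        if action == "harvest":
            capital += tech            # immediate payoff scales with tech level
        else:                          # "research": no payoff now, more later
            tech *= 1.5
        return self.rollout("harvest", depth - 1, capital, tech)

    def step(self, turns_left: int):
        depth = min(self.horizon, turns_left)
        best = max(["harvest", "research"],
                   key=lambda a: self.rollout(a, depth, self.capital, self.tech))
        if best == "harvest":
            self.capital += self.tech
        else:
            self.tech *= 1.5

T = 20
civs = [Civ(horizon=1), Civ(horizon=10)]  # same rules, different "cognition"
for t in range(T):
    for c in civs:
        c.step(T - t)
print([c.capital for c in civs])  # the far-sighted civ ends up far richer
```

Even in something this dumb you can read off a "relative value of cognition vs. physical investment" number, though I'm not claiming that tells you anything deep.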
It looks like from some other comments that the literature on agent foundations might be relevant, but I'm not familiar with it. If I get time I might look into it in the future. Are these sorts of frameworks usable for actual decision making right now (and if so, how can we tell whether they are or not), or are they still exploratory?
Generally, I'm just curious whether there's a way to make this more concrete, i.e. to understand it better.
That's great! Activation/representational steering is definitely important, but I wonder if it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P
The fact that refusal steering is possible definitely opens the door to gradient-based optimization attacks, or may help explain why some attacks work. Maybe you could use this to build a jailbreak detector of some kind? I do think it's important to push to get techniques usable in the real world, though I also understand that science is not so linear. Where and how do you think DM's research could get more real-world grounding? (Or do you think it's all well and good as it stands?)
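To make the detector idea slightly less vague, here's a half-baked sketch. It assumes you already have a refusal direction `r` at some layer (e.g. a difference of mean activations on harmful vs. harmless prompts); the thresholding and the "is this a harmful topic" classifier are hypothetical pieces I'm waving my hands about.

```python
# Sketch: flag prompts whose topic looks harmful but whose activations show
# anomalously little projection onto the refusal direction.
import torch

def refusal_score(acts: torch.Tensor, r: torch.Tensor) -> float:
    """acts: [seq_len, d_model] residual-stream activations at some layer.
    Returns the mean projection onto the unit-normalized refusal direction."""
    r = r / r.norm()
    return (acts @ r).mean().item()

def looks_like_jailbreak(acts: torch.Tensor, r: torch.Tensor, threshold: float) -> bool:
    # assumes some other signal already says the prompt is about a harmful topic
    return refusal_score(acts, r) < threshold
```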
I'm curious about your thoughts on this notion of perennial philosophy and convergence of beliefs. One interpretation I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works plus a valuation of world states (i.e. statements like "if X then Y could happen" and "Z is good to have"). You could probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You could also imagine formalizing it further by using numbers or lattices for values and probabilities, and some well-defined FSM to model parts of the world.)

We could say that two religions have converged if they share a lot of features, by which I mean that, for some definition of a feature, the feature is present in both belief systems. We can define a feature in many ways, but for our simple thought experiment it can effectively be a state or a relation between states in the two world views. For example, we could imagine that a feature is a function of states and their values/causal relations that remains unchanged under the mapping between the two systems (i.e. there is some notion of the mapping being like an isomorphism on the projection of the set via the function).

Concretely, in one belief system you might have some sort of "god" character that is somehow the cause of many things. The function here could be "(int(god is cause of x1) + int(god is cause of x2) + ...) / num_objects". If we map common objects (this spoon) to themselves in the other system (still the spoon) and god to god, we will see that in both systems the function representing how causal god is remains close to 1, and so we may say that both systems have a notion of a "god" and therefore there has been some degree of convergence in the "having a god" department.
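In code, the toy version of that feature is tiny (all the data here is obviously made up; a belief system is just a set of (cause, effect) claims):

```python
# Toy instance of the "god causality" feature: the fraction of effects
# attributed to the god-object, compared across two systems under a mapping.
def god_causality(claims: set[tuple[str, str]], god: str) -> float:
    effects = {effect for _, effect in claims}
    caused_by_god = {effect for cause, effect in claims if cause == god}
    return len(caused_by_god) / len(effects)

system_a = {("god_a", "rain"), ("god_a", "harvest"), ("spoon", "soup_moved")}
system_b = {("god_b", "rain"), ("god_b", "harvest"), ("spoon", "soup_moved")}

# map god_a -> god_b and spoon -> spoon; the feature value is preserved,
# so on this one feature the two systems "converge"
print(god_causality(system_a, "god_a"), god_causality(system_b, "god_b"))
```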
So now, with all this formal BS out of the way (which I had to do because it highlights what is missing), the question is clear: under some reasonable such definition of what convergence means, how do you decide whether two religions have converged? The vibe I get from the perennial-philosophy believers I have spoken to so far is that "you have to go through the journey to understand", and generally it appears to be a sort of dispositional convergence, at least on face value. That said, I do not observe people of very different religions, who claim to have converged, living together for a long time (meaning it is not verifiable whether the dispositions are truly something we could call converged). Of course, it may also be possible to find mappings claiming that two belief systems have or have not converged when the opposite is the more honest assessment.
Obviously no one is going to come out here and create a mathematical definition and just "be right" (I don't think that's even a fair thing to consider possible), but I do not particularly like making such assertions purely "on vibes". Often people will say that they are "spiritual" and that "spirituality" helped them overcome some psychological challenge or who knows what, but what does "spiritual" mean here? Often it's associated with some belief system that we would, as laymen, call religious or spiritual (i.e. in the enumerable list of Christianity and its sub-branches, Buddhism and its, etc.), but who is to say that the truer cause of the change of psyche was not just some part of the phenomenon that person experienced, which merely happened to be delivered by the spiritual system present at that time and place? It seems compelling to me to want to decouple these "core truths" from the religions that hold them, so as to present them in a more neutral way, since in the alternative world where you must "go through the journey" of spirituality via some specific religion, you cannot know beforehand that you won't be effectively brainwashed, and you cannot even know afterwards either; you can only get faint hints at it during the process.
This is not to say that anyone is getting brainwashed, or that anything is good or bad, or that anything should or shouldn't be embraced. I'm just saying that from an outside perspective, it is not verifiable whether religions actually converge without diving into this stuff. However, it is also not verifiable whether diving in is actually good, and it's not verifiable whether it will even be verifiable afterwards. Maybe I'm stumbling into some core metaphysical whirlwind of "you cannot know anything", but I do truly believe that a more systematic exposition of how we should interpret spirituality, trapped priors, convergence, and the like is possible and would enable more productive discussion.
PS: I think you've touched on something tangential in the statement that you should do this with trusted people. That is trying to bootstrap resistance to manipulative misappropriation of spirituality, whereas what I'm saying is that I would also like more of a logical bootstrapping of the whole notion of spirituality and of ideas like "convergence", so that one can leave the conversation with solid conclusions, knowing their limitations, and with a higher level of actionability.
PPS: I feel like treating a belief system like "rationality" as a machine/tool (something which has a certain reach, certain limitations, and which usually behaves as expected but might have some bugs) is a good way to go. This makes it easier to decouple rationality from, say, spiritual traditions. At each point in time and space you can basically decide, using common sense, which of these machines/tools is best to apply. Each tool can hopefully be shown to be good for some cases, so most decision making happens at the routing level: which tool to use. If you understand a tool from a third-person point of view, there is less of a tendency to rely on it in the wrong cases purely out of dogma.
For a while now there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training, and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (e.g. the Few Tokens Deep paper uses about the easiest form of data augmentation imaginable, and it seems to fix some known vulnerabilities effectively; a sketch of my reading of it is below), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions to safety-relevant phenomena (more on this below).
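For what it's worth, my (possibly wrong) reading of the Few Tokens Deep augmentation is roughly: prefix the target response with the first few tokens of a harmful completion and train the model to recover into a refusal from there, so safety isn't only "a few tokens deep". Something like:

```python
# Hedged sketch of my understanding of the augmentation; the exact recovery
# text and prefix lengths here are placeholders, not the paper's.
import random

def augment(harmful_prompt: str, harmful_completion: str, refusal: str,
            max_prefix_tokens: int = 5) -> tuple[str, str]:
    k = random.randint(1, max_prefix_tokens)
    prefix = " ".join(harmful_completion.split()[:k])
    # target: start as if complying, then recover into a refusal
    return harmful_prompt, prefix + " ... Actually, I can't help with that. " + refusal
```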
Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety", given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we also need the more scientific/theoretical AGI Safety research illustrated in the post if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:
The way I see it, there are roughly four categories of research that can be done in AI Safety (though maybe this is rather procrustean and I'm missing some):
In this categorization, it seems like DM's AGI Safety team is very much more focused on 1, 2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter, right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations you find publishing this stuff? You have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side effects than someone who is just testing the input/output of a chatbot), have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?
Anyway, I'm curious where things along the lines of 4 figure into your plan for AGI Safety. It would be criminal to try to make AI "safe" while ignoring all the real-world, challenging-but-tractable, information-rich problems that arise from things such as the red-teaming attacks that can happen today. I'm also curious to hear whether you think this categorization is flawed in some key way.
How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me what the "difference in the values" maps onto here (since I think these "numbers" actually correspond to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing the rather empty sections correspond to the "model learns to specialize on certain parts of the vocabulary for xyz head" story, potentially associated with some sort of one-hot or generally standard-basis-privileged situation. Let me know if I'm misunderstanding something dumb. I haven't seen this done much elsewhere, but it would be nice to have a GitHub repo, because it's really easy to resolve these questions by reading the PyTorch code. Is it available somewhere?
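To make my confusion concrete, here is my guess at what a translational-symmetry check on a QK bilinear form could look like: measure how far the (say, positional) bilinear form matrix is from being Toeplitz, i.e. from having its entries depend only on i - j. This is purely my interpretation of what the measure might be, not a claim about your code.

```python
# Sketch: deviation of a square bilinear-form matrix from translation symmetry.
import torch

def toeplitz_deviation(M: torch.Tensor) -> float:
    """Mean variance along diagonals, normalized by the overall variance.
    0 means M[i, j] depends only on i - j (perfectly translation-symmetric)."""
    n = M.shape[0]
    diag_vars = []
    for k in range(-(n - 1), n):
        d = torch.diagonal(M, offset=k)
        if d.numel() > 1:
            diag_vars.append(d.var())
    return (torch.stack(diag_vars).mean() / M.var()).item()

# e.g. M = pos_emb @ W_Q @ W_K.T @ pos_emb.T for one head (hypothetical names)
```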
One other thing I'm curious about is results for more control experiments. For example, for the noise: if you fully noised the output (i.e. output a random permutation), we should expect the model to fail to learn anything at all and to fail to get a high LLC, right? It's also possible to noise by inserting new elements into the output (or the input, which I guess is equivalent) to replace others while keeping the order the same. In that case, maybe the network can learn to understand what the ordering is even if it doesn't know exactly which outputs will be there in the end, so even with very high amounts of noise a "structured" solution makes sense (though I reckon the way you propagate the loss will matter here).
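Here's a quick sketch of the two noising schemes I have in mind, assuming a generic list-to-list task (this is just to pin down what I mean, not a claim about your setup):

```python
import random

def noise_full_permutation(target: list, p: float) -> list:
    """With probability p, replace the target with a uniformly random
    permutation of itself; nothing structured should be learnable from these."""
    if random.random() < p:
        target = target.copy()
        random.shuffle(target)
    return target

def noise_replace_elements(target: list, vocab: list, p: float) -> list:
    """Independently replace each element with a random vocab item with
    probability p, keeping the ordering of untouched elements intact."""
    return [random.choice(vocab) if random.random() < p else x for x in target]
```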
Not sure exactly how to frame this question, and I know the article is a bit old. I'm mainly curious about the program synthesis idea.
On some level, it would seem that any explanatory model for literally any phenomenon can be framed as a "program synthesis problem". For example, historically we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of the stars in the sky, or the rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality, such as constraints on what relations are allowed, etc.) that map onto things such as biology, psychology, etc.
I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write down as programs or instructions.
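For instance (a toy illustration of my own, not anything from the article), you can literally write a physical law as a little program that steps the state of a world forward:

```python
# 1D two-body Newtonian gravity as a "program": state -> next state.
def step(state, dt=0.01, G=1.0):
    (x1, v1, m1), (x2, v2, m2) = state
    r = x2 - x1
    sign = 1.0 if r > 0 else -1.0
    a1 = G * m2 / r**2 * sign    # acceleration of body 1 toward body 2
    a2 = -G * m1 / r**2 * sign   # and vice versa
    return ((x1 + v1 * dt, v1 + a1 * dt, m1),
            (x2 + v2 * dt, v2 + a2 * dt, m2))
```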
The art here is to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way as we discover more.
But anyway, there are roughly two questions that naturally arise here:
This is really cool! It's exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.
Thanks!
Why do you guys think this is happening? One possibility that comes to mind is that the model might be doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.
Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you steer early layers because we suppose they might act something like switches to different classes of functionality. However, implicit in the choice of layer 10 is that you probably don't want to go too early either, because maybe the very early layers are still doing embedding-level processing and learning basic features like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of sweep to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
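By "sweep" I mean something like the following hypothetical sketch (not your code; it assumes a HuggingFace-style Llama layout with `model.model.layers`, and `score_fn` stands in for whatever behavior metric you care about):

```python
# Sketch: inject a fixed steering vector at each layer in turn and record
# how strongly the steered behavior shows up in the generation.
import torch

def sweep_layers(model, tokenizer, prompt, steer_vec, score_fn, layers):
    toks = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for layer_idx in layers:
        def hook(module, inputs, output):
            # add the steering vector to the residual stream at every position
            if isinstance(output, tuple):
                return (output[0] + steer_vec,) + output[1:]
            return output + steer_vec
        handle = model.model.layers[layer_idx].register_forward_hook(hook)
        try:
            with torch.no_grad():
                out = model.generate(**toks, max_new_tokens=64)
        finally:
            handle.remove()
        scores[layer_idx] = score_fn(tokenizer.decode(out[0]))
    return scores
```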
Maybe a dumb question, but (1) how can we know for sure whether we are on the manifold, and (2) why is it so important to stay on the manifold? I'm guessing you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world". However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small-to-medium-sized datasets doesn't seem like a realistic estimate of what to expect of an on-manifold vector, since later layers are likely more semantically/high-level focused and sparse; (2) what people put into the inputs changes in small ways simply because new things happen in the world, and there are also prompt-engineering attacks that are probably in some sense "off-distribution" but still occur in the real world, and I don't think we should ignore these entirely. Is this notion of a manifold a good way to think about getting information indicative of real-world behavior? Probably, but I'm not sure, so I thought I'd ask. I am new to this field.
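For concreteness, the kind of crude check I was picturing for (1) is something like fitting activation statistics at the target layer on a reference dataset and flagging steered activations that are way outside the empirical range (all names here are my own, and this is exactly the sort of small-dataset estimate I'm worried about above):

```python
# Sketch: Mahalanobis-distance proxy for "is this activation on-manifold?"
import torch

def fit_gaussian(acts: torch.Tensor, eps: float = 1e-3):
    """acts: [n_samples, d_model] activations collected at the target layer."""
    mu = acts.mean(dim=0)
    centered = acts - mu
    cov = centered.T @ centered / (acts.shape[0] - 1)
    cov += eps * torch.eye(cov.shape[0])  # regularize; d_model may exceed n_samples
    return mu, torch.linalg.inv(cov)

def mahalanobis(x: torch.Tensor, mu: torch.Tensor, cov_inv: torch.Tensor) -> float:
    d = x - mu
    return torch.sqrt(d @ cov_inv @ d).item()
```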
I do think that at the end of the day we want indicative information, so somewhat artificial environments might at times have a certain usefulness.
Also, one convoluted (perhaps inefficient) idea for staying on the manifold, which felt kind of fun, is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing the vectors to be close to one of the token embeddings, or by using an algorithm that operates on text), (3) check those tokens to make sure they continue to elicit the behavior and are not totally wacky. If you cannot generate that steering vector from something that is close to a prompt, surely it's not on the manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever.
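The perplexity part of step (3) is easy to automate; a minimal sketch, assuming a small reference LM like GPT-2 and a threshold you'd have to pick:

```python
# Score a recovered prompt's "realism" via perplexity under a small LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(text: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

# e.g. reject recovered prompts whose perplexity is above some threshold
```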
Curious to hear thoughts :)
This is a cool method. Are you thinking of looking more into how gradient-routed model performance (on tasks, not just loss) scales with the size of the problem/model? You mention that it requires a large L1 regularization in the vision setting, and it would be nice to try something larger than CIFAR. It looks like the LLM and RL models are also under 1B parameters, but I'm sure you're planning to try something like a Llama model next.
I imagine you would do this during regular training/pre-training so your model is modular and you can remove shards based on your needs, but if the alignment tax is really high (or lowering it is complicated; hyperparameter tuning sucks :/) it's going to be hard to convince people to use it, and that's just unfortunate. Maybe you are also thinking of using it as a modification to finetuning, which seems more promising, since with gradient routing you are also basically doing a form of PEFT.
What do you think can be improved for finetuning and unlearning use cases (e.g. for LLMs)?