4gate

Wiki Contributions

Comments

Sorted by
4gate10

That's great! Activation/representational steering is definitely important, but I wonder if it being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P

The fact that refusal steering is possible definitely opens the possibility to gradient-based optimization attacks, or may make it possible to explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it's important to push to try and get techniques usable in the real world, though I also understand that science is not so linear.  Where and how do you think DM's research could get more real world grounding? (Or do you think that it's all well and good as it stands?)

4gate10

I'm curious on your thoughts of this notion of perennial philosophy and convergence of beliefs. One interpretation that I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works and valuation of world states (i.e. statements "if X then Y could happen" and "Z is good to have"). You can probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You can also imagine further formalizing it by using numbers or lattices for values and probabilities and some well-defined FSM to model parts of the world.) We could say that the religions have converged if they share a lot of features, by which I mean that for some definition of a feature the feature is present in both belief systems. We can define a feature in many ways, but for our simple thought experiment it can be effectively a state or relation between states in the two world views. For example, we could imagine that a feature is a function from states and their values/causal relations such that under the mapping it remains unchanged (i.e. there is some notion of this mapping being like an isomorphism on the projection of the set via the function). For example, in one belief system you might have some sort of "god" character that is somehow the cause of many things. The function here could be "(int(god is cause of x1) + int(god is cause of x2) + ...) /  num_objects". If we map common objects (this spoon) to themselves in the other system (still the spoon) and god to god, we will see that in both systems the function representing how causal god is, remains close to 1, and so we may say that both systems have a notion of a "god" and therefore there has been some degree of convergence in the "having a god" stuff for the two systems.

So now with all this formal BS out of the way, which I had to do, because it highlights what is missing, the question is clear: under some reasonable such definition of what convergence means, how do you decide whether two religions have converged? The vibe I get from the perennial philosophy believers that I have thus spoken to is that "you have to go through the journey to understand" and generally it appears to be a sort of dispositional convergence, at least on face value—though I do not observe people of very different religions, who claim to have convergence, conviving for a long time (meaning that it is not verifiable whether indeed, the dispositions are truly something that we could call converged). Of course, it may be possible to find mappings that claim that two belief systems have converged or not when the opposite is a more honest appreciation.

Obviously no one is going to come out here and create a mathematical definition and just "be right" (idt that's even a fair thing to consider to be possible), but I do not particularly like making such assertions totally "on vibes". Often people will say that they are "spiritual" and that "spirituality" helped them overcome some psychological challenge or who knows what, but what does "spiritual" mean here? Often it's associated with some belief system that we would, as laymen, call religious or spiritual (i.e. in the enumerable list of christianity and its sub-branches, buddhism and it's, etc...), but who is to say that it is not only some part of the phenomenon that person experienced, which happened to be caused by the spiritual system which was present at the time and place, that was the truer cause of the change of psyche? It seems compelling to me to want to decouple these "core truths" from the religions that hold them so as to have them presentable in a more neutral way, since in the alternative world where you must "go through the journey" of spirituality via some specific religion, you cannot know beforehand that you won't be effectively brainwashed—and you cannot even know afterwards either... you can only get faint hints at it during the process.

So this is not to say that that anyone is getting brainwashed or that anything is good or bad or that anything should be embraced or not. I'm just saying that from an outside perspective, it is not verifiable whether religions actually converge, without diving into this stuff. However, it is is also not verifiable whether diving in is actually good, and it's not verifiable whether afterwords it even will be verifiable. Maybe I'm stumbling into some core metaphysical whirlwind of "you cannot know anything" but I do truly believe that a more systematic exposition of how we should interpret spirituality, trapped priors, convergence, and the like is possible and would enable more productive discussion.

PS I think you've touched on something tangential in the statement that you should do this with trusted people. That's trying to bootstrap, however, a resistance to manipulative misappropriation of spirituality, whereas what I'm saying I would also like more of a logical bootstrapping to the whole notion of spirituality and ideas like "convergence" so that one can leave the conversation with solid conclusions, knowing their limitations, and having a higher level of actionability.

PPS: I feel like treating a belief system, like "rationality" as a machine/tool: something which has a certain reach, certain limitations, and that usually behaves as expected in most situations but might have some bugs, is a good way to go. This will make it easier to decouple rationality with, say, spiritual traditions. At each point of time and space you can basically decide on common sense which of these two machines/tools is best to apply. Each different tool can be shown, hopefully, to be good for some cases and thus most decision making happens on the routing level: which tool to use. If you understand the tool from a third person point of view there is less of a tendency to rely on it in the wrong cases purely on dogma.

4gateΩ120

For a while, there has been a growing focus into safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (i.e. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions to safety-relevant phenomena (more on this below).

Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety" given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:

  • You get to create interesting questions for the MI people that are totally toy models, thereby making their work more useful IRL and creating more information on which to gain broader theoretical understanding of phenomena.
  • You actually fix problems today. You can also fail fast. I don't know much about the debate literature, but look at the debate example from my perspective: (1) 6 years ago someone conceptualized debate and made some theoretical argument (2) there was a community with expectations and probably a decent amount of discussion about debate in the span of those 6 years, even including (theoretical) improvements on the original debate ideas, (3) someone actually tried debate and it didn't work as expected... today... 6 years later. I understand that for debate you probably need good-enough models (unless you are more clever than me—and probably can be), so maybe harping on debate is not fair. That's not what I'm trying to do here, anyways. Mainly I'm just highlighting that when we can iterate by solving real problems and getting feedback in shorter timespans, we can actually get a lot more safety.
  • A lot of safety training is about controlling/constraining what the AI can say/do so that it won't say/do the bad things. The tools of this sort of control are pretty generic, so it's not unlikely that they would provide some benefits in future situations as well. As models scale (capabilities), so long as we keep improving our methods for red-teaming+safety training, these sorts of semi-empirical tools should roughly scale (in their capability to control/constrain the AI). It is more likely that by working on pure science & tools that "will be useful eventually" we are overall less safe and have larger jumps in the size of the gap between the ability of an AI to cause harm and our ability to keep it from doing so.

The way I see it there are roughly 4 categories (though maybe this is rather procrustean and I'm missing some) of research that can be done in AI Safety:

  1. Pure science: this is probably very useful but in a very long time. It will be very interesting and not show up in the real world until kind of late. I think a large proportion of MI falls into this. AFAIK no one uses SAEs IRL for safety tasks? With that said, they will surely be very scientifically useful. Maybe steering vectors are the exception, but they also are in 4 (below). Pure science is usually more about understanding how things work first before being able to intervene.
  2. Evals: (in the broad sense of the word, including safety and capabilities benchmarks) self explanatory. Useful at every stage.
  3. Safety Theory: into this I lump ideas like debate and amplified oversight, which don't really do much in the real world (products people use, etc...) right now AFAIK (not sure, am I wrong?) but are a combination of (primarily, still) conceptual frameworks for how we could have AGI Safety plus the tools to enact that framework. Usually, things from this category arose from someone thinking about how we could have AI Safety in the future, and coming up with some strategy. That strategy is often not really enactable until the future, with perhaps some toy models as exceptions, so I call these "theory."
  4. Safety Practice: into this I lump basically most red-teaming attacks, prompt/activation engineering, and safety training methods that people use or are possible to plug in.  These methods usually arise because there is a clear, real-world problem and so their goal is to fix that problem. They are usually applicable in short timespans and are sometimes a little bit of a patchwork, but iterative and possibly to test and improve. More so than 3 (above), they arise from a realistic current need instead of a likely future need. Unlike 1(above) they are focused on making interventions first and understanding later.

In this categorization, it seems like DM's AGI Safety team is very much more focused on 1,2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side-effects than someone who is just testing the input/output of a chatbot) , have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?

Anyways, I'm curious where things along the lines of 4 figure in to your plan for AGI Safety. It would be criminal to try and make AI "safe" while ignoring all the real world, challenging-but-tractable, information-rich challenges that arise from things such as red-teaming attacks that can happen today.  Also curious to hear if you think this categorization is flawed in some key way. 

4gate10

How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me, for example, what the "difference in the values" is mapping onto here (since I think these "numbers" are actually corresponding to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing you can see that some sections are rather empty corresponding to the "model learns to specialize on certain parts of the vocabulary for xyz head" being potentially associated with some sort of one-hot or generally standard-basis-privilege situation. Let me know if I'm misunderstanding something dumb. I haven't seen this being done much elsewhere, but it would be nice to have a github because it's really easy to resolve these questions by reading pytorch code. Is it available somewhere?

One other thing I'm curious about is results for more control experiments. For example, for the noise, if you fully noised the output (i.e. output a random permutation) we should expect the model to fail to learn anything at all and to fail to get a high LLC right? It's also possible to noise by inserting new elements in the output (or input... I guess it's equivalent) to replace others, but keeping the order the same. In this case, maybe the network can learn to understand what the ordering is even if it doesn't know exactly which outputs will be there in the end, so even with very high amounts of noise a "structured" solution makes sense (though I reckon the way you propagate loss will matter in this case).

4gate10

Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.

On some level, any explanatory model for literally any phenomena can, it would seem, appear to be claimed to be a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality such as constraints on what relations are allowed, etc...) that map onto things such as biology, psychology, etc...

I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write as programs or instructions.

The art here is to be able to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way, as we discover more.

But anyways, there are sort of two questions that naturally arise here:

  1. Why is MI more closely related to program synthesis than any other field that wishes to explain a process that can be thought of as a program (i.e. it has causal components that happen over time)?
  2. I was under the impression that MI is in the business of trying to establish the right language and concepts to use to describe the information processing done by deep learning models. The field has not really cracked the "art" here yet AFAIK. With that said, I'm guessing that the program synthesis literature and tooling has a slightly different goal and therefore carries certain baggage of how one goes about thinking about these problems (i.e. maybe more of a lean towards symbolic methods). But the program synthesis literature probably doesn't actually create the right language to have a 10x conceptual framework for the science of deep learning information processing because otherwise we would have a lot more solved problems than we do. So in this sense, a new start is not necessarily bad. You can think about this in some sense fuzzily analogous to how Galois invented (if you can say that) a new branch of math to solve the so-far-unsolved 5th degree polynomial root-finding problem. It was not by using the already-existent tools that this problem was solved. You can also think of this as a society-level de-biasing strategy. If DL is ever to have an explanatory framework on par with, say, Classical Mechanics, it appears that we need a conceptual 10x-ing. Do you agree with this framing? If so, what do you think is a healthy amount of rediscovery?
4gate30

This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field so I had a few questions. I'm not sure if they've been answered elsewhere

  1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
  2. It looks like the hypersphere constraint is so that the optimizer doesn't select something far away due to being large. Is there any reason to use this sort of constraint other than that?
  3. How do people usually constrain things like norm or do orthogonality constraints as a hard constraint? I assume not regular loss-based regularization since that's not hard. I assume iterative "optimize and project" is not always optimal but maybe it's usually optimal (it seems to be what is going on here but not sure?). Do lagrange multipliers work? It seems like they should but I've never used them for ML. I'm guessing that in the bigger picture this doesn't matter.
  4. Have you experimented with adaptor rank and/or is there knowledge on what ranks tend to work were? I'm curious of the degree of sparsity. You also mention doing LoRA for attention instead and I'm curious if you've tried it yet.
  5. W.r.t. the "spiky" parametrization options, have you tried just optimizing over certain subspaces? I guess the motivation of the spikiness must be that we would like to maintain as much as possible of the "general processing" going on but I wonder if having a large power can axe the gradient for R < 1.
  6. Is there a way to propagate this backwards to prompts that you are exploring? Some people do bring up the question in the comments about how natural these directions might be.
  7. Not sure to what extent we understand how RLHF, supervised finetuning and other finetuning methods currently work. What are your intuitions? If we are able to simply add some sort of vector in an early layer it would seem to support the mental model that finetuning mainly switches which behavior gets preferentially used instead of radically altering what is present in the model.

Thanks!

4gate10

Why do you guys think this is happening? It sounds to me like one possibility is that maybe the model might have some amount of ensembling (thinking back to The Clock and The Pizza where in a toy setting ensembling happened). W.r.t. "across all steering vectors" that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also what are ya'lls intuitions on picking layers for this stuff. I understand that you describe in the post that you control early layers because we suppose that they might be acting something like switches to different classes of functionality. However, implicit in layer 10 it seems like you probably don't want to go too early because maybe in the very early layers it's unembedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in jupyter notebooks and the like, or have you run some sort of grid to get a notion of what the effects elsewhere are. If the latter, it would be nice to know to aid in hypothesis formation and the like.

4gateΩ010

Maybe a dumb question but (1) how can we know for sure if we are on manifold, (2) why is it so important to stay on manifold? I'm guessing that you mean that vaguely we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector since it's likely later layers are more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things  happening in the world, but also there are prompt engineering attacks that people use that are likely in some sense "off-distribution" but still in the real world and I don't think we should ignore these fully. Is this notion of a manifold a good way to think about the notion of getting indicative information of real world behavior? Probably, but I'm not sure so I thought I might ask. I am new to this field.

I do thing at the end of the day we want indicative information, so I think somewhat artifical environments might at times have a certain usefulness.

Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a "realistic" sentence or whatever.

Curious to hear thoughts :)