There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful types of actions.
Thx!
I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI.
I won't put words in people's mouths, but my goal isn't to talk about what people say they believe. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
...If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callo
I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ...
Yea, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
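For concreteness, this is roughly the kind of steering I have in mind: adding a feature's decoder direction into the residual stream during generation. The model, layer index, and scale below are placeholders rather than anything from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

hidden_size = model.config.hidden_size
feature_direction = torch.randn(hidden_size)  # stand-in for one SAE decoder column
feature_direction = feature_direction / feature_direction.norm()
scale = 5.0  # steering strength (illustrative)

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + scale * feature_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

layer_idx = 6  # illustrative choice of layer
handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
handle.remove()
```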
Thanks for the comment. My probabilities sum to 165%, which translates to me expecting, on average, the next paper to do 1.65 things from the list; that doesn't seem too crazy to me. I think that this DOES match my expectations.
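To spell out the arithmetic: if $X_i$ is the indicator that the next paper does item $i$, then by linearity of expectation

$$\mathbb{E}\Big[\sum_i X_i\Big] \;=\; \sum_i \mathbb{E}[X_i] \;=\; \sum_i P(X_i = 1) \;=\; 1.65,$$

and this holds even though the items are not mutually exclusive.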
But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.
Some relevant papers to anyone spelunking around this post years later:
https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549
Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements.
Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this.
Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems.
First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetun...
Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.
First, I appreciate the transparency around this point.
we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.
Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing."
Third, I wanted to note that this post does not talk about ac...
I think one thing that's pretty cool is "home teaching." Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Especially for old or disabled people, they get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g. fixing a broken water heater), and this is largely done through home teaching.
Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.
See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?
I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written.
First, the appendix hinges somewhat on this point.
...For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any st
Thanks!
I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.
I could...
Nice post. I think it can serve as a good example of how the hand-waviness about interpretability helping us do good things with AI cuts both ways.
I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don't think this is true and I especially hope it is not true because (1) mechanistic...
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I'll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to less solid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been making similar takes with little change for 7+ years. But I just don't think there have been wins from this research commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.
I get the impression of a certain motte-and-bailey in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that is useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there's a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like "I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception"). And in general a lot of this is of the form "I don't see how X", which is the format I'm objecting to, because of course you won't see how X until someone invents a technique to X.
This is exacerbated by the meta-level probl...
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It's also nice that this solution goes a little further in one aspect than the previous one. The analysis with bars seems to get a little closer to a question I have still had since the last solution:
...My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the
Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one.
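For example, here is the kind of workflow TransformerLens standardizes (a minimal sketch; the exact API may differ across versions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Interpretability tooling changes quickly.")
logits, cache = model.run_with_cache(tokens)

# Inspect cached activations, e.g. the residual stream after block 6.
resid = cache["resid_post", 6]
print(resid.shape)  # (batch, seq_len, d_model)
```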
Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive.
We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on the exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think the Huang paper helps to make. The fact that ...
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.
...there are a
In general, I think not. The agent could only make this actively happen to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as part of the actual algorithm. They're just implementational substrate.
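For example, the standard ERM objective is defined purely in terms of the input-output map (this is the textbook form, not anything specific to the discussion above):

$$\hat{\theta} \;=\; \arg\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big).$$

The parameters $\theta$ enter only through $f_\theta$, so the objective never references the network's internal activations; the agent would have to influence them through the training dynamics themselves, i.e., gradient hacking.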
One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts.
I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974
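For concreteness, here is a minimal sketch of the kind of self-distillation I mean, with KL matching on benign prompts only (the model name, prompts, and hyperparameters are placeholders, not a real experimental setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name).eval()
# Randomly initialized student of the same architecture; ideally only the behavior the
# teacher expresses on benign prompts should transfer to it.
student = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
benign_prompts = ["Explain photosynthesis.", "Write a haiku about rain."]  # placeholder data

for prompt in benign_prompts:
    batch = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    # Match the student's next-token distributions to the teacher's on benign data only.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```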
I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and probably be pretty tractable to make progress on.
I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less reliance on intuition would have made for a much stronger approach than what happened.
Thanks.
Are you concerned about AI risk from narrow systems of this kind?
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page.
Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
We must grow interpretability and AI safety in the real world.
Strong +1 to working on more real-world-relevant approaches to interpretability.
Regulation is coming – let’s use it.
Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be s...
I do not worry a lot about this. It would be a problem, but some methods are model-agnostic and would transfer fine, and some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank-one editing are more general principles that are not architecture-specific.
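To illustrate the general principle, here is a naive rank-one edit that remaps a single key to a new value (this is not the exact ROME update, which also accounts for the covariance of keys):

```python
import torch

d_in, d_out = 16, 16
W = torch.randn(d_out, d_in)      # weight matrix of some linear layer
k = torch.randn(d_in)             # "key" activation whose output we want to remap
v_target = torch.randn(d_out)     # desired output for that key

# Rank-one update so that the edited layer maps k exactly to v_target.
delta_v = v_target - W @ k
W_edited = W + torch.outer(delta_v, k) / (k @ k)

assert torch.allclose(W_edited @ k, v_target, atol=1e-4)
```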
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.
—
In the post I said it’s good that nukes don’t blow up by accident, and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am telling a specific audience (e.g., the frontier companies and their allies) that their focus on alignment isn’t commensurate with its usefulness. Also, don’t forget the dual nature of alignment progress: as I mentioned in the post, frontier alignment progress hastens timelines and makes misuse risk more acute.