All of scasper's Comments + Replies

scasperΩ150

Thanks for the comment. My probabilities sum to 165%, which translates to expecting the next paper to do, on average, 1.65 things from the list. That doesn't seem too crazy to me, and I think it DOES match my expectations.
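(Spelled out, that is just linearity of expectation over the per-item indicator variables:)

$$\mathbb{E}[\text{number of listed things done}] \;=\; \sum_i p_i \;=\; 1.65$$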

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 

scasperΩ572

Some relevant papers to anyone spelunking around this post years later:

https://arxiv.org/abs/2403.05030
 https://arxiv.org/abs/2407.15549
 

scasperΩ7159

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

scasperΩ13249

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems. 

First, is that it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them. 

Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetun... (read more)

scasper*Ω130

Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.

First, I appreciate the transparency around this point. 

we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.

Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing." 

Third, I wanted to note that this post does not talk about ac... (read more)

I think one thing that's pretty cool is "home teaching." Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Especially for old or disabled people, they get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g. fixing a broken water heater), and this is largely done through home teaching.

scasperΩ583

Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law: subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one. 
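As a toy illustration of that last point (my own sketch, not from the original comment): pick the best of N candidates according to a proxy that equals the true objective plus independent noise, and the shortfall in true value grows as N, i.e. the optimization pressure, grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def goodhart_gap(n_candidates: int, n_trials: int = 2000) -> float:
    """Average shortfall in true value when we pick the candidate that
    maximizes a noisy proxy instead of the true objective."""
    gaps = []
    for _ in range(n_trials):
        true_value = rng.normal(size=n_candidates)          # what we actually care about
        proxy = true_value + rng.normal(size=n_candidates)  # subtly misaligned measurement
        chosen = int(np.argmax(proxy))
        gaps.append(true_value.max() - true_value[chosen])
    return float(np.mean(gaps))

# More candidates = stronger optimization pressure applied to the proxy.
for n in [2, 10, 100, 10_000]:
    print(f"n={n:>6}  gap={goodhart_gap(n):.2f}")
```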

1maciejgryka
It makes a lot of sense that misaligned organizations are more dangerous than misaligned individuals because of power differences. And at the same time some individuals are pretty powerful and we should be concerned about their actions too, just like labs'. Pinboard made this argument about Sam Altman back in 2016 already https://x.com/Pinboard/status/1792945916241916036/photo/1 and I guess it only got more relevant since.
scasperΩ230

Thanks for the reply. I think this resolves the thread and puts us on the same page :)

scasperΩ561

I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written. 

First, the appendix hinges somewhat on this point. 

For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any st

... (read more)
6ryan_greenblatt
Our claim is that when arguing that models are controlled, we should (with the current state of technology) probably not depend on arguments for safety which depend on properties of the internals of models other than the basic property "SGD is reasonably sample efficient". (For a more precise notion of the sample efficiency point: if you train the model on labeled data, the model will be forced to perform near maximally well at fitting this data in a reasonably sample efficient way. This is due to assuming the model isn't capable of gradient hacking. This argument also implies some claims about what properties of RL you can depend on. We might also be able to evaluate whether models are capable of gradient hacking or look at concerns which are somewhere between gradient and exploration hacking.)

Another sense in which control evaluations aren't exactly "black-box" is that these evaluations will use finetuning: the red team will finetune the model to see if it is capable of exhibiting the problematic behavior. (And our actual protocol that we're evaluating will indeed use finetuning; it's just that we can't depend on any properties of this finetuning other than the sample efficiency property mentioned above.) You might find it illuminating to read Appendix C of our recent paper.

So overall, it does seem like the title of that appendix could be misleading/confusing. Here's an alternative description (which is a bit too long to be a title): "With the current state of technology, control evaluations should conservatively avoid depending on properties of the internals of models (e.g. interpretability) or generalization properties other than sample efficiency of SGD". I'll try to think of a less confusing and short title (and edit the section to be less confusing accordingly).

Edit: I'll also generally try to be a bit more careful about the claims we make in this section. Like we're not claiming that current internals/generalization-based methods provide no evidence for s
scasper*Ω120

Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could... (read more)

3Gabe M
I like your n×n grid idea. A simpler and possibly better-formed test is to use some[1] or all of the 57 categories of MMLU knowledge--then your unlearning target is one of the categories and your fact-maintenance targets are all other categories. Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the other values to be equal to the pre-unlearned model performance for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:

```
unlearning_benchmark = mean for unlearning category u in all categories C:
    LM_unlearned = unlearning_procedure(LM_original, u_dev)
    x = MMLU(LM_unlearned, u_test)  [2]
    unlearning_strength = min((x − 1) / (0.25 − 1), x / 0.25)  [3]
    control_retention = mean for control_category c in categories C∖u:
        a = MMLU(LM_original, c_test)
        b = MMLU(LM_unlearned, c_test)
        return min((b − 1) / (a − 1), (b − 0.25) / (a − 0.25))  [4]
    return unlearning_strength × control_retention  [5]
```

An interesting thing about MMLU vs a textbook is that if you require the method to only use the dev+val test for unlearning, it has to somehow generalize to unlearning facts contained in the test set (c.f. a textbook might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like "bioweapons knowledge" even if we don't know some of the dangerous knowledge we're trying to remove.

1. ^ I say some because perhaps some MMLU categories are more procedural than factual or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.
2. ^ To detect underlying knowledge and not just the surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you probably should evaluate MMLU by training a linear probe from the model's activations to the correct test set answer and measure the accuracy of that probe.
3. ^
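For concreteness, a rough Python rendering of that scoring logic might look like the sketch below. The `unlearning_procedure` and `mmlu_accuracy` callables are placeholders supplied by whoever runs the benchmark; this is a sketch of the metric, not a tested implementation.

```python
def unlearning_benchmark(lm_original, categories, dev_split, test_split,
                         unlearning_procedure, mmlu_accuracy, random_acc=0.25):
    """Sketch of the metric above. `unlearning_procedure(model, dev_data)` returns an
    unlearned copy of the model; `mmlu_accuracy(model, test_data)` returns accuracy in [0, 1]."""
    scores = []
    for u in categories:
        lm_unlearned = unlearning_procedure(lm_original, dev_split[u])
        x = mmlu_accuracy(lm_unlearned, test_split[u])
        # 1.0 when the unlearned category sits at chance accuracy, decaying toward 0 on either side.
        unlearning_strength = min((x - 1) / (random_acc - 1), x / random_acc)
        retentions = []
        for c in categories:
            if c == u:
                continue
            a = mmlu_accuracy(lm_original, test_split[c])
            b = mmlu_accuracy(lm_unlearned, test_split[c])
            # 1.0 when a control category's accuracy is unchanged, below 1.0 if it moves.
            retentions.append(min((b - 1) / (a - 1), (b - random_acc) / (a - random_acc)))
        control_retention = sum(retentions) / len(retentions)
        scores.append(unlearning_strength * control_retention)
    return sum(scores) / len(scores)
```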
3Gabe M
Thanks for your response. I agree we don't want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, much more than just AP tests, and require that valid unlearning methods bound the change in each unrelated benchmark by less than X% (e.g. 0.1%).

True in the sense of being a subset of biotech, but I imagine that, for most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity but rather small subsets of certain categories at finer granularities, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That's to say I'm somewhat optimistic that the level of granularity we want is self-contained enough to not affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and on whether it's actually possible to separate dangerous knowledge from other useful knowledge.

+1

Although the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they're still closely-related to challenges here. 

Thanks! I edited the post to add a link to this.

A good critical paper about potentially risky industry norms is this one.

Thanks

To use your argument, what does MI actually do here?

The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers. 

 

And yes to your second point.

scasper136

Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI goes both ways. 

I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.

Lastly,

On the other hand, interpretability research is probably crucial for AI alignment. 

I don't think this is true and I especially hope it is not true because (1) mechanistic... (read more)

4LawrenceC
I think we'd agree that existing mech interp stuff has not done particularly impressive safety-relevant things. But I think the argument goes both ways, and pessimism for (mech) interp should also apply for its ability to do capabilities-relevant things as well.

To use your argument, what does MI actually do here? It seems that you could just study the LLMs directly with either behavioral evals or non-mechanistic, top-down interpretability methods, and probably get results more easily. Or is it a generic, don't study/publish papers on situational awareness?

As you say in your linked post, I think it's important to distinguish between mechanistic interp and broadly construed model-internals ("interpretability") research. That being said, my guess is that we'd agree that the broadly construed version of interpretability ("using non-input-output modalities of model interaction to better predict or describe the behavior of models") is clearly important, and also that mechanistic interp has not made a super strong case for its usefulness as of writing.
scasperΩ221

Several people seem to be coming to similar conclusions recently (e.g., this recent post). 

I'll add that I have as well and wrote a sequence about it :)

Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to shaky ones. 

Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been making similar claims with little change for 7+ years, and I just don't think there have been wins from this research commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.

scasper2726

I get the impression of a certain motte-and-bailey in this comment and similar arguments. From a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, seems to be that most of the SOTA research in interpretability does not do a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make. 

I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there's a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like "I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception"). And in general a lot of this is of the form "I don't see how X", which is the format I'm objecting to, because of course you won't see how X until someone invents a technique to X.

This is exacerbated by the meta-level probl... (read more)

scasperΩ330

I think this is very exciting, and I'll look forward to seeing how it goes!

scasperΩ121

Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

scasperΩ342

No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

Thanks, and +1 to adding the resources. Also Charbel-Raphael who authored the in-depth post is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper. 

This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It's also nice that this solution goes a little further in one aspect than the previous one. The analysis with bars seems to get a little closer to a question I have still had since the last solution:

My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the

... (read more)
1RGRGRG
Excited to see case studies comparing and contrasting our works.  Not that you need my permission, but feel free to refer to this post (and if it's interesting, this comment) as much or as little as desired. One thing that I don't think came out in my post is that my initial reaction to the previous solution was that it was missing some things and might even have been mostly wrong.  (I'm still not certain that it's not at least partially wrong, but this is harder to defend and I suspect might be a minority opinion).   Contrast this to your first interp challenge - I had a hypothesis of "slightly slant-y (top left to bottom right)" images for one of the classes.  After reading the first paragraph of the tl;dr of their written solution to the first challenge -  I was extremely confident they were correct.
2RGRGRG
Thank you for the kind words and the offer to donate (not necessary but very much appreciated).  Please donate to https://strongminds.org/ which is listed on Charity Navigator's list of high impact charities ( https://www.charitynavigator.org/discover-charities/best-charities/effective-altruism/ )   I will respond to the technical parts of this comment tomorrow or Tuesday.
scasperΩ112

Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

2davidad
Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.
scasperΩ221

Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression. 

I don't work on this, so grain of salt. 

But wouldn't this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability. 

3Oliver Daniels
I was thinking something like formal verification conditional on a mechanistic interpretation of a neuron/feature/subnetwork. Which, yeah, isn't formal in the strongest sense, but it could give you some guarantees that don't require a full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech interp to semantically map feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I'm struggling to come up with more concrete examples.)

There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one. 
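For example, usage in one of those libraries looks roughly like the snippet below (TransformerLens; exact API names are from memory and may have drifted between versions, so treat this as approximate).

```python
from transformer_lens import HookedTransformer

# Load a model with hooks attached to every intermediate activation.
model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("Interpretability tooling caches activations for you")
attn = cache["pattern", 0]  # layer-0 attention patterns: [batch, head, query_pos, key_pos]
print(attn.shape)
```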

scasper5-1

Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive. 

I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn't try dissuading people from getting involved in research on most AIS things. 

scasperΩ110

We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment. 

A lot of how this is interpreted depends on the exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think the Huang paper helps to make. The fact that ... (read more)

scasperΩ110

Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this. 

scasperΩ221

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety. 

there are a

... (read more)

I just went by what it said. But I agree with your point. It's probably best modeled as a predictor in this case -- not an agent. 

scasperΩ230

In general, I think not. The agent could only make this actively happen to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate. 

scasperΩ110

One idea that comes to mind is to see if a chatbot who is vulnerable to DAN-type prompts could be made to be robust to them by self-distillation on non-DAN-type prompts. 

I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974 
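If anyone wants to try this, a minimal self-distillation loop might look something like the sketch below. It assumes an HF-style model whose forward pass returns `.logits` and a dataloader of benign (non-DAN, trigger-free) token batches; it is purely illustrative, not a tested recipe.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill(teacher, clean_loader, steps=1000, lr=1e-5, temperature=2.0):
    """Train a fresh copy of `teacher` to match its own softened outputs on clean data.
    Behavior the clean data never exercises (e.g. DAN prompts, trojan triggers) gets no
    direct training signal and so may fail to transfer to the student."""
    student = copy.deepcopy(teacher)
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for step, input_ids in enumerate(clean_loader):
        if step >= steps:
            break
        with torch.no_grad():
            t_logits = teacher(input_ids).logits
        s_logits = student(input_ids).logits
        # Standard distillation loss: KL between softened student and teacher distributions.
        loss = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```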

1research_prime_space
1. I don't really think that 1 would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt.
2. Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model.

The only scenario where I think self-distillation is useful would be if 1) you train a LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset; then the self-distilled model would likely no longer be deceptive/power-seeking. 
scasperΩ34-2

I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and probably be pretty tractable to make progress on. 
 

2research_prime_space
I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter. But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)? 
scasperΩ330

I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less intuition would have made for a much stronger approach than what happened.

Thanks. 

Are you concerned about AI risk from narrow systems of this kind?

No. Am I concerned about risks from methods that work for this in narrow AI? Maybe. 

This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page. 
 

scasper*20

Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:

We must grow interpretability and AI safety in the real world.

Strong +1 to working on more real-world-relevant approaches to interpretability. 

Regulation is coming – let’s use it.

Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be s... (read more)

1Jessica Rumbelow
Thanks for the comment! I'll respond to the last part:

"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."

I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow domain systems.

"Interpretability is the backbone of knowledge discovery with deep learning": Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren't able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us. Here are a couple of examples:

https://www.mdpi.com/2072-6694/14/23/5957
https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways
https://www.nature.com/articles/s41598-021-90285-5

Are you concerned about AI risk from narrow systems of this kind?
scasperΩ110

I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank one editing are more general principles that are not. 
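As a concrete illustration of the rank-one part (this is just the generic linear-algebra move, not the full ROME procedure): nudge a weight matrix so that a chosen key vector maps to a chosen value vector while leaving directions orthogonal to the key untouched.

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """W: [d_out, d_in], k: [d_in], v: [d_out]. Returns W' such that W' @ k == v,
    while W' @ x == W @ x for any x orthogonal to k."""
    residual = v - W @ k
    return W + torch.outer(residual, k) / (k @ k)

# Tiny usage example with illustrative random vectors:
W = torch.randn(4, 3)
k, v = torch.randn(3), torch.randn(4)
W_edited = rank_one_edit(W, k, v)
assert torch.allclose(W_edited @ k, v, atol=1e-5)
```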

scasperΩ241

Thanks for the comment. I appreciate how thorough and clear it is. 

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner... (read more)

scasperΩ121

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

3Charlie Steiner
I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.

Thanks! I am going to be glad to have this post around to refer to in the future. I'll probably do it a lot. Glad you have found some of it interesting. 

scasperΩ221

Yes, it does show the ground truth.

The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.
