There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful types of actions.
Thx!
I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI.
I won't put words in people's mouths, but my goal isn't to talk about what people say they believe. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
...If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callo
I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ...
Yea, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
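For concreteness, this is roughly the kind of steering I have in mind: adding a feature's decoder direction into the residual stream during generation. The model, layer index, and scale below are placeholders rather than anything from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

hidden_size = model.config.hidden_size
feature_direction = torch.randn(hidden_size)  # stand-in for one SAE decoder column
feature_direction = feature_direction / feature_direction.norm()
scale = 5.0  # steering strength (illustrative)

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + scale * feature_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

layer_idx = 6  # illustrative choice of layer
handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
handle.remove()
```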
Thanks for the comment. My probabilities sum to 165%, which translates to me expecting, on average, the next paper to do 1.65 things from the list; that doesn't seem too crazy to me. I think that this DOES match my expectations.
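To spell out the arithmetic: if $X_i$ is the indicator that the next paper does item $i$, then by linearity of expectation

$$\mathbb{E}\Big[\sum_i X_i\Big] \;=\; \sum_i \mathbb{E}[X_i] \;=\; \sum_i P(X_i = 1) \;=\; 1.65,$$

and this holds even though the items are not mutually exclusive.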
But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.
Some relevant papers to anyone spelunking around this post years later:
https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549
Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements.
Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this.
Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems.
First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetun...
Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.
First, I appreciate the transparency around this point.
we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.
Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing."
Third, I wanted to note that this post does not talk about ac...
I think one thing that's pretty cool is "home teaching." Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Especially for old or disabled people, they get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g. fixing a broken water heater), and this is largely done through home teaching.
Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.
See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?
I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written.
First, the appendix hinges somewhat on this point.
...For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any st
Thanks!
I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.
I could...
Nice post. I think it can serve as a good example of how the hand-waviness about interpretability helping us do good things with AI cuts both ways.
I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don't think this is true and I especially hope it is not true because (1) mechanistic...
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I'll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to less solid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been making similar takes with little change for 7+ years. But I just don't think there have been wins from this research commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.
I get the impression of a certain motte-and-bailey in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that is useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there's a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like "I don't understand how “Looking at random bits of the model and identify circuits/features” will help with deception"). And in general a lot of this is of the form "I don't see how X", which is the format I'm objecting to, because of course you won't see how X until someone invents a technique to X.
This is exacerbated by the meta-level probl...
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It's also nice that this solution goes a little further in one aspect than the previous one. The analysis with bars seems to get a little closer to a question I have still had since the last solution:
...My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the
Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one.
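For example, here is the kind of workflow TransformerLens standardizes (a minimal sketch; the exact API may differ across versions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Interpretability tooling changes quickly.")
logits, cache = model.run_with_cache(tokens)

# Inspect cached activations, e.g. the residual stream after block 6.
resid = cache["resid_post", 6]
print(resid.shape)  # (batch, seq_len, d_model)
```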
Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive.
We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on the exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think the Huang paper helps to make. The fact that ...
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.
...there are a
In general, I think not. The agent could only make this actively happen to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as part of the actual algorithm. They're just implementational substrate.
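For example, the standard ERM objective is defined purely in terms of the input-output map (this is the textbook form, not anything specific to the discussion above):

$$\hat{\theta} \;=\; \arg\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big).$$

The parameters $\theta$ enter only through $f_\theta$, so the objective never references the network's internal activations; the agent would have to influence them through the training dynamics themselves, i.e., gradient hacking.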
One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts.
I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974
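For concreteness, here is a minimal sketch of the kind of self-distillation I mean, with KL matching on benign prompts only (the model name, prompts, and hyperparameters are placeholders, not a real experimental setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name).eval()
# Randomly initialized student of the same architecture; ideally only the behavior the
# teacher expresses on benign prompts should transfer to it.
student = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
benign_prompts = ["Explain photosynthesis.", "Write a haiku about rain."]  # placeholder data

for prompt in benign_prompts:
    batch = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    # Match the student's next-token distributions to the teacher's on benign data only.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```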
I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and probably be pretty tractable to make progress on.
I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less reliance on intuition would have made for a much stronger approach than what happened.
Thanks.
Are you concerned about AI risk from narrow systems of this kind?
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page.
Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
We must grow interpretability and AI safety in the real world.
Strong +1 to working on more real-world-relevant approaches to interpretability.
Regulation is coming – let’s use it.
Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be s...
I do not worry a lot about this. It would be a problem, but some methods are model-agnostic and would transfer fine, and some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank-one editing are more general principles that are not architecture-specific.
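To illustrate the general principle, here is a naive rank-one edit that remaps a single key to a new value (this is not the exact ROME update, which also accounts for the covariance of keys):

```python
import torch

d_in, d_out = 16, 16
W = torch.randn(d_out, d_in)      # weight matrix of some linear layer
k = torch.randn(d_in)             # "key" activation whose output we want to remap
v_target = torch.randn(d_out)     # desired output for that key

# Rank-one update so that the edited layer maps k exactly to v_target.
delta_v = v_target - W @ k
W_edited = W + torch.outer(delta_v, k) / (k @ k)

assert torch.allclose(W_edited @ k, v_target, atol=1e-4)
```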
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.
—
In the post I said it’s good that nukes don’t blow up by accident, and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am telling a specific audience (e.g., the frontier companies and their allies) that their focus on alignment isn’t commensurate with its usefulness. Also, don’t forget the dual nature of alignment progress: as I mentioned in the post, frontier alignment progress hastens timelines and makes misuse risk more acute.