Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas. While I don't work on interpretability per se, I see similar things happening with value learning / inverse reinforcement learning approaches to alignment.
Regarding causal scrubbing in particular, it seems to me that there's a closely related line of research by Geiger, Icard and Potts that it doesn't seem like TAISIC is engaging with deeply? I haven't looked too closely, but it may be another example of duplicated effort / rediscovery:
The importance of interventions
Over a series of recent papers (Geiger et al. 2020, Geiger et al. 2021, Geiger et al. 2022, Wu et al. 2022a, Wu et al. 2022b), we have argued that the theory of causal abstraction (Chalupka et al. 2016, Rubinstein et al. 2017, Beckers and Halpern 2019, Beckers et al. 2019) provides a powerful toolkit for achieving the desired kinds of explanation in AI. In causal abstraction, we assess whether a particular high-level (possibly symbolic) mode H is a faithful proxy for a lower-level (in our setting, usually neural) model N in the sense that the causal effects of components in H summarize the causal effects of components of N. In this scenario, N is the AI model that has been deployed to solve a particular task, and H is one’s probably partial, high-level characterization of how the task domain works (or should work). Where this relationship between N and H holds, we say that H is a causal abstraction of N. This means that we can use H to directly engage with high-level questions of robustness, fairness, and safety in deploying N for real-world tasks.
We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.
Hopefully this will be fixed with the forthcoming arXiv paper!
Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas
Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.
I see similar things happening with causality generally, where it seems to me that (as a 1st order heuristic) much of alignment forum's reference for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.
I was intrigued by your claim that FFS is already subsumed by work on academia. I clicked the link you provided but from a quick skim it doesn't seem to do FFS or anything beyond the usual pearl causality story as far as I can tell. Maybe I am missing something - could you provide an specific page where you think FFS is being subsumed?
Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know what level abstraction on which to define causal variables?" is beyond the "usual pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.
I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).
It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?"
It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts.
My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected.
Scott Garrabrant conceived of FFS as an extension & generalization of Pearlian causality that answers questions that are not dealt well with in the Pearlian framework. He is aware of Pearl's work and explicitly builds on it. It's not a distinct approach as much as an extension. The paper you mentioned discusses the problem of figuring out what the right variables are but poses no solution (as far as I can tell). That shouldn't surprise because the problem is very hard. Many people have thought about it but there is only one Garrabrant.
I do agree with your overall perspective that people in alignment are quite insular, unaware of the literature and often reinventing the wheel.
Strong upvote here as well. The points about how even simple terminological differences can isolate research pursuits are especially pertinent, considering the tendency of people on and around LW to coin new phrases/ideas on a dime. Novel terminology is a valuable resource that we have been spending very frivolously.
Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.
E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renaming" seems far too strong.
Same for "we should not expect solving toy MI problems using humans to help with real world MI problems" - there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize.
I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would argue that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentiably useful for safety.
there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine saying that having humans write and study simple algorithms for search, modular addition, etc. to be part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.
I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?
Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. The Ramanujan et al. weren't interpreting networks, they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is.
Re: e.g. superposition/entanglement:
I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it. Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in favour of different terminology, i.e. even if something is "the same" when boiled down to a formal level, the different names can actually help delineate different interpretations.
In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.) and because of the way academic literature works, some of the things that you are doing here can be done with almost any piece of work in the literature, i.e. you can comb over it with the benefit of hindsight and say 'hang on this isn't as original as it looked; basically the same idea that was written about here X years before' etc. Honestly, I don't usually think of this as a valuable exercise, but I may be missing something about your wider point or be more convinced once I've looked at more of your series.
Another point when it comes to 'originality' and 'progress' is that it's often unimportant if some idea was generally discussed, labelled, named, or thought about before if what matters is actual results and the lower-level content of these works. i.e. I may be wrong, but looking at what you are saying, I don't think you are literally pulling up an older paper on 'entanglement' that made the exact same points that the Anthropic papers were making and did very similar experiments (Or are you?) And even having said that, reproducing experiments exactly is of course very valuable.
Re: MI and program synthesis:
I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing but in the first subsection of the "TASIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand ( see e.g. Olah, and Nanda, in which he consults an expert).
The main problem on this site is that despite people have large vary levels of understanding of different subject, nobody wants to look like an idiot on here. A lot of the comments and articles are basically nothing burgers. People often focus on insignificant points to argue about and waste their time in the social aspect of learning than to actually learn about a subject themselves.
This made me wonder do actual researchers who have values and substance to offer and question, do they not participate in online discussions? The closest I've found is wordpress blogs by various people and people have huge comment chains. The only other form of communication seems to be through formal papers, which is pretty much as organized as it gets in terms of format.
I've learned that people who do actually have deeper understanding and knowledge of value to offer, they don't waste their time on here. But I can't find any other platform that these people participate in. My guess is that they don't participate in any public discourse, only private conversations with other people who have things of value to offer and discuss.
Thanks for the comment and pointing these things out.
---
I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.
Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
I don't know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?
---
In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...
Good point. I would not say that the issue with the feature visualization and zoom in papers were merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless.
---
I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing but in the first subsection of the "TASIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand ( see e.g. Olah, and Nanda, in which he consults an expert).
Thanks for pointing out these posts. They are examples of discussing a similar idea to MI's dependency on programmatic hypothesis generation, but they don't act on it. But they both serve to draw analogies instead of providing methods. The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
Ok it sounds to me like maybe there's at least two things being talked about here. One situation is
A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is
B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.
The latter doesn't sound good, but I guess it looks like to me that some or many of your points are consistent with the former being the case. So when you write e.g. it's not "necessarily a good thing either" or asking for my steelmanned case, this doesn't seem to quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TASIC with arguments that are not general ones that could easily apply to both A) and B) above. e.g. It'd be more of an emphatic case if you were able to go into the details and be like "X did this work here and claimed it was new but actually it exists in Y's paper here" or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn't 'engage with' on the level of general concepts and citations/acknowledgements doesn't bring this case convincingly, in my opinion. Some more vague thoughts on why that is:
Idk maybe I'm just repeating myself at this point.
On the other point: It may turn out the MI's analogy with reverse software engineering does not produce methods and is just used a high-level analogy,, but it seems too early to say from my perspective - the two posts I linked are from last year. TASIC is still pretty small and experienced researchers in TASIC are fewer and this is potentially a large and difficult research agenda.
I strongly downvoted with this post, primarily because contra you, I do actually think reframing/reinventing is valuable, and IMO I think that the case for reframing/reinventing things is strawmanned here.
There is one valuable part of this post, and that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would strongly downvote it.
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
In this case, one of the steelmanned case for reframing/reinventing being productive is this post:
https://www.lesswrong.com/posts/ZZNM2JP6YFCYbNKWm/nothing-new-productive-reframing
The big reason reframing/reinventing is productive is we are neither logically omniscient, nor are we Bayesian optimal, that is we don't update on all the data we receive, which makes reframings or reinventing things like shortcuts.
Also, reinventing things can give you more bits by learning general processes for how to do something, unlike black boxes which only give you the output.
I see the point of this post. No arguments with the existence of productive reframing. But I do not think this post makes a good case for reframing being robustly good. Obviously, it can be bad too. And for the specific cases discussed in the post, the post you linked doesn't make me think "Oh, these are reframed ideas, so good -- glad we are doing redundant work in isolation."
For example with polysemanticity/superposition I think that TAISIC's work has created generational confusion and insularity that are harmful. And I think TAISIC's failure to understand that MI means doing program synthesis/induction/language-translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.
I think it's a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If "deception" just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as "deception" depends more heavily on our standards for the reasoning process, and doean't reliably result in behavior that's way different than non-deceptive behavior.
I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.
Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.
On some level, any explanatory model for literally any phenomena can, it would seem, appear to be claimed to be a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality such as constraints on what relations are allowed, etc...) that map onto things such as biology, psychology, etc...
I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write as programs or instructions.
The art here is to be able to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way, as we discover more.
But anyways, there are sort of two questions that naturally arise here:
Part 5 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Anson Ho, Chris Olah, Neel Nanda, and Tony Wang for some discussions and comments.
TAISIC = “the AI safety interpretability community”
MI = “mechanistic interpretability”
Most AI safety interpretability work is conducted by researchers in a relatively small number of places, and TAISIC is closely-connected by personal relationships and the AI alignment forum. Much of the community is focused on a few specific approaches like circuits-style MI, mechanistic anomaly detection, causal scrubbing, and probing. But this is a limited set of topics, and TAISIC might benefit from broader engagement. In the Toward Transparent AI survey (Räuker et al., 2022), we wrote 21 subsections of survey content. Only 1 was on circuits, and only 4 consisted in significant part of work from TAISIC.
I have often heard people in TAISIC explicitly advising more junior researchers to not focus much on reading from the literature and instead to dive into projects. Obviously, experience working on projects is irreplaceable. But not engaging much with the broader literature and community is a recipe for developing insularity and blind spots. I am quick to push back against advice that doesn’t emphasize the importance of engaging with outside work.
Within TAISIC, I have heard interpretability research described as dividing into two sets: mechanistic interpretability and, somewhat pejoratively, “traditional interpretability.” I will be the first to say that some paradigms in interpretability research are unproductive (see EIS III-IV). But I give equal emphasis to the importance of TAISIC not being too parochial. Reasons include maintaining relevance and relationships in the broader community, drawing useful inspiration from past works, making less-correlated bets with what we focus on, and most importantly – not reinventing, renaming, and repeating work that has already been done outside of TAISIC.
TAISIC has reinvented, reframed, or renamed several paradigms
Mechanistic interpretability requires something much like program synthesis, program induction, and/or programming language translation
“Circuits”-style MI is arguably the most popular and influential approach to interpretability in TAISIC. Doing this work requires iteratively (1) generating hypotheses for what a network is doing and then (2) testing how valid these hypotheses explain its internal mechanisms. Step 2 may not be that difficult, and causal scrubbing (discussed below) seems like a type of solution that will be useful for it. But step 1 is hard. Mechanistic hypothesis generation is a lot like doing program synthesis, program induction, and/or programming language translation.
Generating mechanistic hypotheses requires synthesizing programs to explain a network using its behavior and/or structure. If a method for this involves synthesizing programs based on the task or I/O from the network, it is a form of program synthesis or induction. And if a method is based on using a network’s structure to write down a program to explain it, it is very similar to programming language translation.
In general, program synthesis and program induction are very difficult and currently fail to scale to large problems. This is well-understood, and these fields are mature enough so that we have textbooks on them and how difficult they are (e.g. Gulwani et al., 2017). Meanwhile, programming language translation is very challenging too. In practice, translating between common languages (e.g. Python and Java) is only partially automatable and relies on many hand-coded rules (Qiu, 1999), and using large language models has had limited success (Roziere et al., 2020). And in cases like these, both the source and target language are discrete and easily interpretable. Since this isn’t the case for neural networks, we should expect things to be more difficult for translating them into programs.
It is unclear the extent to which the relationships between program synthesis, induction, and language translation, and MI are understood inside of TAISIC. I do not know of this connection being pointed out before in TAISIC. But understanding this seems important for seeing why MI is difficult and likely to stay that way. MI work in TAISIC has thus far been limited to explaining simple (sub)processes. In cases like these, the program synthesis part of the problem is very easy for a human to accomplish manually. But if a problem can be solved by a program that a human can easily write, then it is not one that we should be applying deep learning to (Rudin, 2018). There will be a much more in-depth discussion of this problem in EIS VI.
If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
When highly intelligent systems in the future learn unexpected, harmful behaviors, characterizing the neural circuitry involved will probably not be simple like the current MI work that TAISIC focuses on. We should not expect solving toy MI problems using humans to help with real world MI problems any more than we should expect solving toy program synthesis problems using humans to help with real world program synthesis problems. As a result, automating model-guided hypothesis generation seems to be the only hope that MI research has to be very practically relevant. It may be time for a paradigm shift in TAISIC toward symbolic methods. But the fact that existing neurosymbolic work has not yet scaled or been very useful for many practical problems seems to signify difficulties ahead.
Causal scrubbing, compression, and frivolous subnetworks
The above section discussed how MI can be divided into a program generation component and a hypothesis verification component. And when it comes to hypothesis verification, causal scrubbing (Chan et al., 2022) is an exciting approach. It seems to have the potential to be tractable and valuable for this goal.
If our goal is rigorous MI, causal scrubbing can only be as good as the hypotheses that go into it. Relying on hypotheses that are too general will prevent it from being a very precise tool. And this might be fine. For loose goals such as mechanistic anomaly detection, hypotheses that are merely decent may still be useful for flagging anomalous forward passes through a network. Maybe the production of such decent hypotheses can be automated, and they may do a perfectly fair job of capturing useful mechanisms.
But we should be careful. Some causal scrubbing work has been explored using things like gradients, perturbations, ablations, refactorizations, etc. to find parts of the network that can be scrubbed away. But this may not be a very novel or useful approach to hypothesis generation. This particular approach is just a form of network compression. And just because a compressed version of a network seems to accomplish some task does not mean that there is some meaningful mechanism behind it. Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize. So just because a subnetwork in isolation seems to do something doesn’t mean that it really performs that task. This is a type of interpretability illusion.
Polysemanticity and superposition = entanglement
This section is a bit longwinded, but the TL;DR is that TAISIC has done a lot of work on “polysemanticity” and “superposition” in neural networks, but this work is not as novel as it may seem in light of previous work on “entanglement.”
In 2012 Bengio et al. described and studied the “entanglement” of representations among different neurons in networks. To the best of my knowledge, this was the first use of this term in deep learning (although the rough concept goes back to at least Bengio and LeCun (2007)). Since then, there has been a great deal of literature on entanglement – enough for a survey from Carbonneau et. al (2022). See also the disentanglement section from the Toward Transparent AI survey (Räuker et al., 2022). Locatello et a. (2019) describe the goals of this literature as such (parenthetical citations removed for readability):
Does this sound familiar?
In 2016 Arora et al. described and studied embeddings of words that have multiple semantic meanings. They described these words as “polysemous” and their embeddings as in “superposition.” To the best of my knowledge, this was the first use of “polysemous” and “superposition” to describe embeddings and embedded concepts in deep learning. And to my knowledge, Arora et al. (2016) was the only work prior to TAISIC’s work in 2017 on this topic.
Later on, Olah et al. (2017) characterized neurons which seem to detect multiple unrelated features, and later, Olah et. al (2020) described neurons that seem to respond to multiple unrelated features as “polysemantic.” Olah et. al (2020) writes
Olah et. al (2020) also used the term “superposition”.
And things are even muddier than this. Thorpe (1989) studied how embeddings can densely represent a larger number of distinct concepts than they have dimensions under the term “distributed coding.” And Losch et al. (2019) describe a process for creating a disentangled latent layer as “semantic bottlenecking.” I don’t know how many other terms in various literatures describe similar concepts as entanglement, polysemanticity, superposition, distributed coding, and bottlenecking. And I don’t care much to sift through things thoroughly enough to find out. Instead, the point here is that in light of the literature on entanglement, many of the contributions that TAISIC has made related to polysemanticity and superposition are not very novel.
Olah et al. (2017) and Olah et. al (2020) did not do a thorough job of engaging with the entanglement literature. The only mention of it made by either was from Olah et. al (2020) which wrote without citation:
Although it should be noted that this blog post from 2017 also discussed "superposition."
Based on my knowledge of the entanglement literature, it is true that most but not all papers using the term study autoencoders. But it is not clear why this matters from the perspective of studying entanglement, polysemanticity, and superposition. Besides, an entangled encoder can be used to extract features for a classifier. This is just a form of “bottlenecking” (Losch et al., 2019) – another concept that predates Olah et. al (2020).
To be clear, it seems that the authors of Olah et al. (2017) and Olah et. al (2020) were aware of the entanglement literature, and later, their discussion of related work in Elhage et al. (2022) was much more thorough. But ultimately, Olah et al. (2017) and Olah et. al (2020) did not very thoroughly engage with the entanglement literature. And when Olah et al. (2017) and Olah et. al (2020) were written, the term “entanglement” was much more standard in the deep learning literature than “polysemanticity” and “superposition.”
Details (which I could be wrong about) and speculation (ditto) aside, two different groups of AI researchers have now been working on the same problems under different names, and this isn't good. The mainstream one uses “entanglement” while TAISIC uses “polysemanticity” and “superposition.” Terminology matters, and it may be the case that TAISIC’s terminology has caused a type of generational isolation among different groups of AI researchers.
There is a lot of useful literature on both supervised and unsupervised entanglement. Instead of listing papers, I’ll refer anyone interested to page 7 of the Toward Transparent AI survey (Räuker et al., 2022). Some researchers in TAISIC may find valuable insights from these works.
One disentanglement method that has come from TAISIC is the softmax linear unit activation function from Elhage et al. (2022). They train a network to be more disentangled using an activation function that makes neurons in the same layer compete for activations. Lateral inhibition being used as a solution to entanglement is nothing new. Again, see page 7 of the Toward Transparent AI survey (Räuker et al., 2022). And a fun fact is that even AlexNet (Krizheveky et al., 2012) used a form of lateral inhibition called “local response normalization.” But Elhage et al. (2022) engages very little with prior work like this in its discussion of related work. It gives the impression that their technique is more novel than it is.
The whole saga involving distributed coding, entanglement, polysemanticity, superposition, and bottlenecking serves as an example of how powerful terminology can be in influencing how the research community understands and approaches problems. This story highlights the importance of engaging thoroughly with previous works and being careful about terminology.
Deceptive alignment ≈ trojans
This discussion will be short because deception will be the main focus of EIS VIII. But spoiler alert: detecting and fixing deception is an almost identical technical problem to detecting and fixing trojans. The only difference is that deceptiveness typically results from an inner alignment failure while trojans are typically implanted with data poisoning which simulates an outer alignment failure. From an engineering standpoint though, this difference is often tenuous. This isn’t a major blind spot per se – many researchers in TAISIC understand this connection and are doing excellent work with trojans. TAISIC should do its best to ensure that this connection is more universally understood.
Unsupervised contrast consistent search = self-supervised contrastive probing
One recent paper from TAISIC presents a way to train a classifier that predicts when models will say dishonest things based on their inner activations (Burns et al., 2022). This type of approach seems promising. But the paper names its method “contrast consistent search” and describes it as “unsupervised,” both of which I have nitpicks for. The first is that “contrast consistent search” is much better described as “contrastive probing,” and while the paper refers to the probe as a “probe,” the related works and citations do not engage with the probing literature -- non-supervised probing has been done before (e.g. Hoyt et al. (2021)). Second, this method is not exactly “unsupervised.” It is better described as self-supervised because it requires using paired true and false statements. See Jaiswal et al. (2021) titled A Survey on Contrastive Self-Supervised Learning for definitions. In future work, it will be useful to name methods and discuss related work in ways that minimize the possibility of confusion or isolation.
Why so little work on intrinsic interpretability?
There are two basic approaches to interpretability. Intrinsic interpretability techniques involve designing/training models to be easier to study in the first place while post hoc interpretability techniques involve interpreting models after they have been trained. The Toward Transparent AI survey (Räuker et al., 2022) divides its discussion of methods into intrinsic and post hoc ones if you would like to look into this more.
Some great news is that because intrinsic interpretability techniques operate on the model before or during training and post hoc ones operate on it after, combining intrinsic and post hoc methods almost always works well! And given this, it’s odd that with some exceptions (e.g. Elhage et al. (2022)), the large majority of work from TAISIC is on post hoc methods. Maybe it is because of some founder effects plus how TAISIC is still fairly small. In the Toward Transparent AI survey (Räuker et al., 2022) we also speculate about how a lack of benchmarking means a lack of incentive for results-focused work which means a lack of incentive for studying useful synergies between novel combinations of non-novel methods.
But whatever the reason, TAISIC should do more work to study intrinsic interpretability tools and combine them with post hoc analysis. The main reason is the obvious one – that this may significantly improve interpretability results. But this should also be of particular interest to MI researchers. Recall the discussion above about how automating model-guided program synthesis may be necessary if circuits-style MI is to be useful. Designing more intrinsically interpretable systems may be helpful for this. It also seems to be fairly low-hanging fruit. Many intrinsic interpretability methods (e.g. modular architectures, pruning, some regularization techniques, adversarial training) are simple to implement but have rarely been studied alongside post hoc interpretability tools.
Questions