Sorry for not noticing this earlier. I'm not very active on LessWrong/AF. In case it's still helpful, a couple of thoughts...
Firstly, I think people shouldn't take this graph too seriously! I made it for a talk in ~2017, I think, and even then it was intended as a vague intuition, not something I was confident in. I do occasionally gesture at it as a possible intuition, but it's just a vague idea which may or may not be true.
I do think there's some kind of empirical trend where models in some cases become harder to understand and then easier. For example:
These are all qualitative observations / not rigorous / not systematic.
So what is going on? I think there are several hypotheses:
I suspect it's a mix of quite a few of these.
In the case of language models, I think superposition is really the driving force and is quite different from the vision case (language model features seem to typically be much sparser than vision model ones).
I'm interested in hearing other people's takes on this question! I also found that a tiny modular addition model was very clean and interpretable. My personal guess is that discrete input data lends itself to clean, logical algorithms more so than continuous input data, and that image models need to devote a lot of parameters to processing the inputs into meaningful features at all, in a way that leads to the confusion. OTOH, maybe I'm just overfitting.
I thought the peak of simple models would be something like a sparse Bag of Words model, and then all models that have been considered so far just go deeper and deeper into the valley of confused abstractions, and that we are not yet at the point where we can escape. But I might be wrong.
I guess I would assume that GPT-style models don't have a great chance of escaping the valley of confused abstractions because their interface to the real world is very weird: predicting scrapes of internet text.
Updated the diagram I drew on to better illustrate my confusion.
And yeah, it may very well be that GPT-style models never leave the valley. Hmm.
New (somewhat relevant) paper:
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
Abstract: In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We here use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the nine investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 120'000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated instead of human-based interpretability evaluations that can ultimately be leveraged to directly optimize the mechanistic interpretability of models.
Conclusion: In this paper, we set out to answer the question: Does scale improve the mechanistic interpretability of vision models? By running extensive psychophysical experiments and comparing various models, we come to the conclusion that none of the investigated axes seem to positively affect model interpretability: Neither the size of the model, nor that of the dataset, nor model architecture or training scheme improve interpretability. This result highlights the importance of building more interpretable models: unless we explicitly optimize for this property, we do not get it for free by just increasing downstream task performance.
Could it be that Chris's diagram gets recovered if the vertical scale is "total interpretable capabilities"? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they're doing, but they're not doing much, so maybe it's still the case that the amount of capability we can understand has a valley and then a peak at higher capability.
I encountered a similarish pattern a long time ago
https://www.lesswrong.com/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities?commentId=9Fm2zrYT9aRSdzSde
Epistemic Status: Quite confused. Using this short post as a signal for discussion.
In Evan's post about Chris Olah's views on AGI safety, there is a diagram which loosely points to how Chris thinks model interpretability will be impacted at different levels of model "strength" (note that this is vague intuition; worth looking at Chris’ comment on this post):
I always thought this diagram still held up in the LLM regime, even though it seems the diagram pointed specifically to interpretability with CNN vision models. However, I had a brief exchange with Neel Nanda about the Valley of Confused Abstractions in the context of language models, and I thought this might be a good thing to revisit.
I've been imagining that language models with the "strength" of GPT-2 are somewhere near the bottom of the Valley of Confused Abstractions, but the much bigger models are a bit further along the model strength axis (though I'm not sure where they fall). I've been thinking about this in the context of model editing or pointing/retargeting the model.
Here's my exchange with Neel:
My understanding was that 1-layer transformers being easy to interpret still agrees with the diagram, and current big-boy models are just not past the Valley of Confused Abstractions yet.
Ok, but if what Neel says is true, what might the diagram look like for language transformers?
I'm confused at the moment, but my thinking used to go something like this: in the case of GPT-2, it is trying to make sense of all the data it has been trained on but just isn't big enough to fully grasp that "cheese" and "fromage" are essentially the same concept. But my expectation is that as the model gets bigger, it knows those two tokens mean the same thing, just in different languages. Maybe it does?
With that line of thinking, as model strength increases, it will help the model create crisper internal abstractions of a concept like "cheese." But then what...at some point, the model gets too powerful, and it becomes too hard to pull out the "cheese/fromage" abstraction?
Anyway, I hoped that as LLMs (trained with the transformer architecture) increase in "strength" beyond the current models, the abstractions become crisper, and it becomes much easier to identify abstractions as it gets closer to some "human-level performance." However, GPT-3 is already superhuman in some respects, so I'm unsure about how to disentangle this. I hope this post sparks some good conversation about how to de-confuse this and how the diagram should look for LLM transformers. Is it just some negative exponential with respect to model strength? Or does it have humps like the original diagram?