This feels somewhat similar to medical research with and without a detailed understanding of cell biochemistry. You can get very far without a deep understanding of the low-level details, and often this seems to be the best method for producing useful drugs (as I understand it, most drugs are discovered through trial and error).
But understanding the low-level details can get you important breakthroughs like mRNA vaccines. And if you're betting the future of humanity on creating a cure for a new type of deadly virus (superintelligent AI) on the first try, you would prefer to be confident that you know what you're doing.
I agree, but I wouldn't bet the future of humanity on mRNA vaccines that haven't been thoroughly tested in practice or at least in very analogous situations either. If you code, this is like deploying code in production - it almost always goes badly if you haven't created and tested a fake production environment first.
I think we need a holistic approach for this exact reason. It feels like we aren't thinking enough about doing alignment "in production", but are instead focusing only on the theoretical "mRNA vaccine" for a subject that we haven't taken sufficient time to interact with "in its own language".
I strongly agree that interpretability work should make more effort to test and report on alternative architectures. There are a number of close competitors to GPT-2 style transformers.
Even just trying it and reporting that your technique doesn't work as well on alternative architectures would be valuable info for safety.
If you don't know what I mean by alternative architectures, try googling some of these terms:
state-space and related recurrent models (Mamba, RWKV, Aaren), NLP diffusion models (e.g. https://arxiv.org/html/2408.04220v1), recursive looping models, reservoir computing models, next generation reservoir computing models, spiking neural nets, Kolmogorov-Arnold networks, FunSearch
General intelligence might be an emergent property - something you can get from scaling a model. But it's not clear what is the basic model which, if scaled, leads to it. It would be interesting to consider how to make progress identifying what that is. How do you know if the model you're scaling has a peak intelligence that doesn't fall short of 'general intelligence'? How do you know when it's time to stop scaling and explore a new model?
I guess there's a hard limit on the scale of models that can be explored though. If it's not practical and it doesn't cut it, it's time to try something new. But it's still interesting to ask if there's any way to determine that there's still some juice in the model which hasn't been squeezed out. Identifying the scale required to achieve general intelligence, or even getting some vague sense of it, feels important.
Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient, actually fit inside human brains, and do not require as much data?
Is that the main point, or did I miss something?
Holistic means studying at every level. I think that mech interp is very useful for some things, and stuff like what I am pursuing (GenAI Ethology) is very useful for other things. If there is emergence, it means that we can't study the model at only one level of abstraction; we need to do it at multiple levels and combine insights to be able to even remotely understand and control what's happening. Additionally, I think that there are still other levels (and in-between levels) of abstraction that should be pursued (like the footnote on glitch interp, and things like developmental interp but at the behavior level).
Agreed. That said, some efforts in this direction do exist, for example Ekdeep Singh Lubana and his Explaining Emergence in NN with Model Systems Analysis.
Epistemic status: This post was planned to be part of a broader "Holistic Interpretability" post, but that isn't going as fast as I'd like, so I am releasing the foreword to get some early feedback on whether I should pursue this or not. I haven't had a lot of red teaming on this emergence idea, but I'd still put an 80% confidence on this being important. I also tried to use the word emergence only where it made sense, but I put a 40% chance that I might have used it a bit wrong.
Emergence: More Than Just a Buzzword?
When I think about large auto-regressive models, I can't help but return to the concept of emergence. Now, I know what you're probably thinking - "emergence" has become something of a fuzzy buzzword in AI discussions, and I mostly agree with the critics, but I think there is more to unpack here.
Let's start with a classic example: water. A single H2O molecule is just a simple arrangement of atoms, but combine gazillions of these molecules, and suddenly you have a substance that flows, splashes, and takes the shape of its container. While individual molecules have properties like polarity and can form hydrogen bonds, the macroscopic properties we associate with liquids only become apparent when vast numbers of these molecules interact.
Now, you might argue, "But we can simulate water behavior perfectly if we know all the properties of individual molecules!" And you'd probably be right. However, even with a perfect simulation, to predict how water will behave in a new scenario, you'd still need to run the entire simulation. You can't shortcut it by just looking at a few molecules or an over-simplified model.
This illustrates why I believe emergence matters when discussing AIs. It's not about mystical properties appearing out of nowhere. Rather, it's about the necessity of studying the system as a whole to fully understand its behavior, even with complete knowledge of its components.
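One toy way to make the "no shortcut" point concrete is an elementary cellular automaton. This is an illustration swapped in here, not a model of water: the choice of Rule 30 and the names `step` and `run` are assumptions made for the sketch. Each cell follows a fully known three-neighbour rule, yet as far as is known there is no general formula that predicts the pattern at step N without computing the steps before it.

```python
def step(cells, rule=30):
    """Apply one update of an elementary cellular automaton (wrap-around edges)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The three neighbours form a 3-bit index into the rule number's bits.
        idx = (left << 2) | (centre << 1) | right
        out.append((rule >> idx) & 1)
    return out


def run(width=31, steps=15, rule=30):
    """Start from a single live cell and return every generation, including the first."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history


if __name__ == "__main__":
    # The local rule fits in one byte; the global pattern has to be computed row by row.
    for row in run():
        print("".join("#" if c else "." for c in row))
```

Knowing `step` completely is the analogue of knowing every property of a single molecule. The rows printed by `run` are the analogue of the macroscopic behavior, and here the only way to get them is to run the whole thing.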
When I look at current efforts to interpret LLMs[1], I often have the impression that they bet on the idea that LLM behaviors, cognition, and capabilities are NOT emergent properties of auto-regressive deep neural networks and their training. There seems to be an implicit belief that everything could be pieced together just by being good enough at understanding all the internal circuitry and components.
The default trajectory?
This approach seems to mirror a common scientific reaction to studying complex entities: stripping them down to their constituent pieces, and trying to piece them back together using logical, sound (even rational) reasoning.
In the last few months, I've read about the trajectory of other sciences (psychology, economics, biology, physics, medicine, ...), and talked with practitioners, before finally stumbling on complexity science (I warmly recommend the book "Complexity: The Science at the Edge of Order and Chaos"). After having read about it, I felt like I was finally finding the words I had trouble piecing together in my head since ChatGPT-4 came out. "Complexity, emergent phenomena! That's it!" I thought:
Emergent phenomena are everywhere around us: weather patterns emerge from countless atmospheric interactions, stock markets fluctuate based on the behavior of millions of individuals, ant colonies exhibit complex social structures that no single ant could comprehend, and even the intricate beauty of snowflakes arises from simple molecular processes. Given this ubiquity of emergence in almost every aspect of reality, why would we assume that training billions of small matrix multiplications to output coherent text wouldn't lead to similar emergent phenomena? To expect otherwise seems, dare I say, a bit naïve, and will become even more so as those systems become more and more complex.
LLM Interpretability
Ironically, while "emergent behaviors" are frequently discussed in the broader LLM community, this concept seems to have had little impact on actual interpretability research, and the term is often used without truly grasping its implications. The issue seems to have been postponed or overlooked, while research focuses on studying smaller, more "stochastic parrot"-like and easier-to-understand systems[2]. We're still largely focused on dissecting individual circuits, attention heads or directions in the residual stream, as if understanding each gear in isolation will reveal the full complexity of the clock.
One striking example for me is that I've seen little to no interpretability work done on Mixture of Experts systems[3], even though they seem to me to be among the models with the most interesting emergent behaviors and properties.
A Call for a Holistic Approach
To fully embrace and accept the complexity of auto-regressive models, we should not just avoid reductionism, but actively pursue a holistic, interconnected, multi-level approach to understanding LLMs.
This approach should involve studying LLMs at various levels of abstraction, from individual components studied in isolation in toy models, to documenting and analyzing behaviors discovered "in the wild"[4]. Most importantly, it should focus on building synergies between these levels, much like the productive interplay between neuroscience and psychology in studying human cognition[5].
P.S. I'm currently working on a more extensive framework for holistic interpretability, which will explore various approaches to interpreting AIs and propose ways to build bridges between them, drawing inspiration from other scientific fields. If you are interested in participating (or just discussing your unique viewpoint), please reach out to me. I'd be particularly interested if you have ideas for, or are working on, interpretability approaches that you haven't seen elsewhere yet[6].
[1] And on a broader scope, probably all AI interp in general, but I am a bit less knowledgeable there.
[2] Anthropic’s later work on applying dictionary learning to Claude 3 attempts to tackle this. However, I feel like their approach mainly identified where the different concepts they searched for become readable (with human concepts) rather than uncovering Claude 3-specific concepts (that I would guess are very alien to ours).
[3] And it seems to even be the case for a lot of behavioral studies.
[4] Most of the examples in this post (and many more) were discovered by open-ended interaction with LLMs.
[5] For example, Elizabeth Loftus' studies on the malleability of human memory (Loftus et al., 1996) laid the groundwork for neuroscientists like Yoko Okado and Craig E.L. Stark to investigate the neural underpinnings of these phenomena (Okado et al., 2005). Similarly, the discovery of mirror neurons by Giacomo Rizzolatti’s team (Rizzolatti et al., 1996) provided a neurological basis that psychologists like Niedenthal and her team leveraged to explore emotional embodiment (Niedenthal et al., 2009).
[6] An example of such an approach could be what I would call "Glitch Interp". This method is somewhat analogous to lesion studies in humans but involves finding or intentionally introducing glitches into AI systems. Some notable works that could be considered precursors to this approach are SolidGoldMagikarp, research on universal and transferable adversarial attacks or, on another type of DNN, patch attacks. While these studies weren't primarily aimed at interpretability, I think that pursuing an interpretability agenda using similar techniques could yield valuable insights into AI systems' inner workings.
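The lesion-study analogy can be sketched on a hand-built toy network. Everything in this sketch is hypothetical: the weights are picked by hand so that the two hidden units together compute XOR, `forward` and `accuracy` are illustrative names, and "lesioning" simply means forcing a hidden unit's activation to zero. It is not the method of any of the works cited above, only a minimal picture of "break one part, watch which behavior degrades".

```python
def relu(x):
    return max(0.0, x)


# Hand-picked weights (an assumption of this sketch): each hidden unit is
# ((w_x1, w_x2), bias); the output is a weighted sum of the hidden units.
HIDDEN = [((1.0, 1.0), 0.0), ((1.0, 1.0), -1.0)]
OUT_WEIGHTS = (1.0, -2.0)

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def forward(x, lesioned=frozenset()):
    """Run the toy network, zeroing the activation of any lesioned hidden unit."""
    hidden = []
    for j, (w, b) in enumerate(HIDDEN):
        act = relu(w[0] * x[0] + w[1] * x[1] + b)
        hidden.append(0.0 if j in lesioned else act)
    return sum(wo * h for wo, h in zip(OUT_WEIGHTS, hidden))


def accuracy(lesioned=frozenset()):
    """Fraction of XOR cases answered correctly, thresholding the output at 0.5."""
    correct = sum(int(forward(x, lesioned) > 0.5) == y for x, y in XOR_DATA)
    return correct / len(XOR_DATA)


if __name__ == "__main__":
    print("intact:", accuracy())
    for unit in range(len(HIDDEN)):
        print(f"unit {unit} lesioned:", accuracy({unit}))
```

In this toy the two lesions fail differently: removing unit 0 makes the network answer 0 everywhere, while removing unit 1 only breaks the (1, 1) case. On a real model the interesting part would be the same kind of comparison, but across behaviors nobody wrote down in advance.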