Emergence, The Blind Spot of GenAI Interpretability?

Quentin FEUILLADE--MONTIXI

16 Emergence, The Blind Spot of GenAI Interpretability?

10th Aug 2024

4 min read

16

Epistemic status: This post was planned to be part of a broader "Holistic Interpretability" post but this isn't going as fast as I'd like so I am releasing the foreword to get some early feedback on whether I should pursue this or not. I haven't had a lot of red teaming on this emergence idea, but I'd still put a 80% confidence that this is important. I also tried to use the word emergence only where it made sense, but I put a 40% chance that I might have used it a bit wrong.

Emergence: More Than Just a Buzzword?

When I think about large auto-regressive models, I can't help but return to the concept of emergence. Now, I know what you're probably thinking - "emergence" has become something of a fuzzy buzzword in AI discussions, and I mostly agree with the critics, but I think there is more to unpack here.

Let's start with a classic example: water. A single H2O molecule is just a simple arrangement of atoms, but combine gazillions of these molecules, and suddenly you have a substance that flows, splashes, and takes the shape of its container. While individual molecules have properties like polarity and can form hydrogen bonds, the macroscopic properties we associate with liquids only become apparent when vast numbers of these molecules interact.

Now, you might argue, "But we can simulate water behavior perfectly if we know all the properties of individual molecules!" And you'd probably be right. However, even with a perfect simulation, to predict how water will behave in a new scenario, you'd still need to run the entire simulation. You can't shortcut it by just looking at a few molecules or an over-simplified model.

This illustrates why I believe emergence matters when discussing AIs. It's not about mystical properties appearing out of nowhere. Rather, it's about the necessity of studying the system as a whole to fully understand its behavior, even with complete knowledge of its components.

When I look at current efforts to interpret LLMs^[1], I often have this impression that they bet on the idea that LLM behaviors, cognition, and capabilities are NOT emergent properties of auto-regressive deep neural networks and it's training. There seem to be an implicit belief that everything could be pieced out just by being good enough at understanding all the internal circuitry and components.

The default trajectory?

This approach seems to mirrors a common scientific reaction to studying complex entities: stripping them down to their constituent pieces, and trying to piece them back together using logical, sound (even rational) reasoning.

In the last few months, I've read about the trajectory of other sciences (psychology, economics, biology, physics, medicine, ...), and talked with practitioners, before finally stumbling on complexity science (I warmly recommend the book "Complexity: The Science at the Edge of Order and Chaos"). After having read about it, I felt like I was finally finding the words I had trouble piecing together in my head since ChatGPT-4 came out. "Complexity, emergent phenomena! That's it!" I thought:

Emergent phenomena are everywhere around us, weather patterns emerge from countless atmospheric interactions, stock markets fluctuate based on the behavior of millions individual, ant colonies exhibit complex social structures that no single ant could comprehend and even the intricate beauty of snowflakes arise from simple molecular processes. Given this ubiquity of emergence in almost every aspect of reality, why would we assume that training billions of small matrix multiplications to output coherent text wouldn't lead to similar emergent phenomena? To expect otherwise seems, dare I say, a bit naïve, and will become even more so as those systems become more and more complex.

LLM Interpretability

Ironically, while "emergent behaviors" are frequently discussed in the broader LLM community, this concept seems to have had little impact on actual interpretability research, and the term is often used without truly grasping its implications. The issue seem to have been postponed or overlooked, while research focuses on studying smaller, more "stochastic parrot" and easier-to-understand systems^[2]. We're still largely focused on dissecting individual circuits, attention heads or direction in the residual stream, as if understanding each gear in isolation will reveal the full complexity of the clock.

One striking example for me is that I've seen very little (to no) interpretability work done on Mixture of Experts systems^[3], whereas in the field, they seem to me like the models with the most interesting emergent behaviors and properties.

A Call for a Holistic Approach

To fully embrace and accept the complexity of auto regressive models we should not just avoid reductionism, but actively pursue a holistic, interconnected, multi-level approach to understanding LLMs.

This approach should involve studying LLMs at various levels of abstraction, from individual components studied in isolation in toy models, to documenting and analyzing behaviors discovered "in the wild"^[4]. Most importantly, it should focus on building synergies between these levels, much like the productive interplay between neuroscience and psychology in studying human cognition^[5].

P.S. I'm currently working on a more extensive framework for holistic interpretability, which will explore various approaches to interpreting AIs and propose ways to build bridges between them, drawing inspiration from other scientific fields. If you are interested in participating (or just discussing your unique viewpoint) please reach out to me. I'd be particularly interested if you have idea or are working on interpretability approaches that you haven't seen elsewhere yet^[6].

^{^}
And on a broader scope, probably all AI interp in general but I am a bit less knowledgeable there
^{^}
Anthropic’s later work on applying dictionary learning on Claude 3 attempts to tackle this. However, I feel like their approach mainly identified where the different concepts they searched for becomes readable (with human concepts) rather than uncovering Claude 3-specific concepts (that I would guess are very alien to ours).
^{^}
And it seem to even be the case for a lot of behavioral studies
^{^}
Most of the examples in this post (and many more) where discovered by open ended interaction with LLMs
^{^}
For example, Elizabeth Loftus' studies on the malleability of human memory (Loftus et al., 1996) laid the groundwork for neuroscientists like Yoko Okado and Craig E.L. Stark to investigate the neural underpinnings of these phenomena (Okado et al., 2005). Similarly, the discovery of mirror neurons by Giacomo Rizzolatti’s team (Rizzolatti et al., 1996) provided a neurological basis that psychologists like Niedenthal and her team leveraged to explore emotional embodiment (Niedenthal et al., 2009).
^{^}
An example of such an approach could be what I would call "Glitch Interp". This method is somewhat analogous to lesion studies in humans but involves finding or intentionally introducing glitches into AI systems. Some notable works that could be considered precursors to this approach are SolidGoldMagikarp, research on universal and transferable adversarial attacks or, on another type of DNN, patch attacks . While these studies weren't primarily aimed at interpretability, I think that pursuing an interpretability agenda using similar techniques could yield valuable insights into AI systems' inner workings.

New to LessWrong?

Getting Started

FAQ

Library

^{^}

And on a broader scope, probably all AI interp in general but I am a bit less knowledgeable there

^{^}

Anthropic’s later work on applying dictionary learning on Claude 3 attempts to tackle this. However, I feel like their approach mainly identified where the different concepts they searched for becomes readable (with human concepts) rather than uncovering Claude 3-specific concepts (that I would guess are very alien to ours).

^{^}

And it seem to even be the case for a lot of behavioral studies

^{^}

Most of the examples in this post (and many more) where discovered by open ended interaction with LLMs

^{^}

For example, Elizabeth Loftus' studies on the malleability of human memory (Loftus et al., 1996) laid the groundwork for neuroscientists like Yoko Okado and Craig E.L. Stark to investigate the neural underpinnings of these phenomena (Okado et al., 2005). Similarly, the discovery of mirror neurons by Giacomo Rizzolatti’s team (Rizzolatti et al., 1996) provided a neurological basis that psychologists like Niedenthal and her team leveraged to explore emotional embodiment (Niedenthal et al., 2009).

^{^}

An example of such an approach could be what I would call "Glitch Interp". This method is somewhat analogous to lesion studies in humans but involves finding or intentionally introducing glitches into AI systems. Some notable works that could be considered precursors to this approach are SolidGoldMagikarp, research on universal and transferable adversarial attacks or, on another type of DNN, patch attacks . While these studies weren't primarily aimed at interpretability, I think that pursuing an interpretability agenda using similar techniques could yield valuable insights into AI systems' inner workings.