In late May, Anthropic released a paper on sparse autoencoders that could interpret the hidden representations of large language models. A great deal of excitement followed in the interpretability community and beyond, with people hailing the potential to finally break down how LLMs think and process information. The initial excitement around SAEs has now died down, so the question is: what's next for the field of interpretability? SAEs are cool, but are they societally useful?
I want to jot down my current thoughts on how interpretability might transfer from an academic subject to a method used widely in industry. As a rough definition, interpretability is the study of how and why ML models work as well as they do, the two main modes today being bottom-up (mechanistic interpretability) and top-down (representation engineering) approaches. With the recent rise of large language models (LLMs) like ChatGPT, interpretability research has mainly focused on deciphering the transformer.
Most current thought on interpretability revolves around its AI safety implications (i.e. avoiding a Skynet-like doomsday scenario where AI destroys humanity). That is clearly relevant and significant, but I want to explore how interpretability research might bolster how we benefit from ML, not just how it reduces harm.
Hypothesis 1: Model debugging
The biggest problem with model deployment[1] today is the lack of reliability. Companies fear their models will hallucinate, work inconsistently, or inaccurately complete their prescribed tasks, and they need some method to fix their systems when things go wrong.
When models fail, the most common strategy today has been to build “around” them. This might include having a separate LLM judge the outputs of the main LLM, writing extremely detailed micro-prompts, and retrying on failure; the sketch below illustrates the pattern. There isn't a focus on debugging the source, the model itself, which I think is a huge mistake. That's not to say these other methods won't be valuable; I just think that raising your system's accuracy from 75% to 90% so you don't need so many additional checks is going to save tons of money long-term.
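For concreteness, here is a minimal sketch of that build-around workflow. The `call_llm` helper is a hypothetical stand-in for whatever model API you use; this is an illustration of the pattern, not anyone's production setup.

```python
# A sketch of the "build around the model" pattern: a main LLM drafts an
# answer, a second LLM judges it, and we retry on failure.
# `call_llm` is a hypothetical stand-in for whatever API client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider of choice")

def answer_with_judge(task: str, max_retries: int = 3) -> str:
    draft = ""
    for _ in range(max_retries):
        draft = call_llm(f"Complete this task:\n{task}")
        verdict = call_llm(
            f"Task: {task}\nAnswer: {draft}\n"
            "Reply PASS if the answer is correct and on-policy, otherwise FAIL."
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
    return draft  # give up and return the last attempt
```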
I also don't think models will remain “static”, particularly for open-ended contexts. Needs for models change. Even today's machine learning systems (the good ones, at least) are constantly retrained on the latest data.
If there's a way to collect a series of system failures and retrain the model in targeted ways, that will help a lot. RLHF exists for this today but isn't superb (it's still pretty complicated and has limited use cases). If there's a way to prevent a model from breaking policies that threaten compliance, that will help a lot[2]. I think there's value in an interpretability tool that can perform a root-cause analysis inside a model and comes with the right editing mechanisms to modify the weights directly[3].
Examples: Haize Labs (red-teaming, LLM judges), Martian’s Airlock (compliant models), VLM hallucination reduction (plug)[4]
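As a concrete (if simplified) picture of what root-cause analysis inside a model can look like, here is a sketch of activation patching ("causal tracing"), a standard technique from the mechanistic interpretability literature: cache activations from a prompt where the model behaves correctly, then splice them into a failing prompt layer by layer to see where the behavior diverges. It assumes GPT-2 via Hugging Face transformers, and the prompt pair and target token are illustrative placeholders, not anything from this post.

```python
# A sketch of activation patching ("causal tracing") for localizing where a
# behavior lives inside a model. Assumes GPT-2 via Hugging Face transformers;
# the prompt pair and the target token are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
target = tok(" Paris")["input_ids"][0]  # token whose logit we track

# 1) Cache every block's output on the "clean" prompt.
clean_acts = {}
def make_cache_hook(idx):
    def hook(module, inputs, output):
        clean_acts[idx] = output[0].detach()  # residual stream after this block
    return hook

handles = [blk.register_forward_hook(make_cache_hook(i))
           for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**clean)
for h in handles:
    h.remove()

# 2) Re-run the "corrupted" prompt, splicing in one clean block output at a
#    time, and see which layer most restores the clean answer's logit.
def make_patch_hook(idx):
    def hook(module, inputs, output):
        output[0][:, -1, :] = clean_acts[idx][:, -1, :]  # patch last position
        return output
    return hook

for i, blk in enumerate(model.transformer.h):
    handle = blk.register_forward_hook(make_patch_hook(i))
    with torch.no_grad():
        logits = model(**corrupt).logits[0, -1]
    handle.remove()
    print(f"layer {i:2d}: logit(' Paris') = {logits[target].item():.2f}")
```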
Hypothesis 2: Extreme (and fast) customizability
Currently, the only way to interact with models is through input and output. You stick in a prompt, and you get a response. For the most part, this works. LLMs are generalizable enough that they respond effectively to prompts. But I think there are limits to relying solely on input/output. A friend recently asked me: how can I easily (i.e. with no training) create a model that only uses a limited subset of vocabulary? I thought about this question, and the main things I could imagine were prompt-related: sticking a few examples at the beginning of the prompt, repeatedly prompting the model until only certain words were used, or taking out words and sticking the processed text back in. You could also try fine-tuning, but you'd need to gather a lot of data and hope that performance doesn't degrade on other tasks. Fine-tuning isn't an end-all-be-all.
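For what it's worth, the closest no-training workaround today still operates at the surface: mask the output distribution during decoding so only an allowed vocabulary can be sampled. A rough sketch is below, assuming GPT-2 via Hugging Face transformers and a purely illustrative word list. Note that it constrains which words come out without changing how the model "thinks", which is exactly the limitation this section is about.

```python
# A sketch of masking the output distribution so only an allowed vocabulary
# can be sampled. Assumes GPT-2 via Hugging Face transformers; the word list
# is purely illustrative.
import torch
from transformers import (GPT2LMHeadModel, GPT2Tokenizer,
                          LogitsProcessor, LogitsProcessorList)

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

allowed_words = [" the", " a", " cat", " dog", " sat", " ran", " on", " mat", ".", "\n"]
allowed_ids = sorted({i for w in allowed_words for i in tok(w)["input_ids"]})

class VocabularyMask(LogitsProcessor):
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        idx = torch.tensor(allowed_ids)
        mask[:, idx] = scores[:, idx]  # keep only the allowed tokens
        return mask

prompt = tok("Once upon a time", return_tensors="pt")
out = model.generate(
    **prompt,
    max_new_tokens=20,
    do_sample=True,
    logits_processor=LogitsProcessorList([VocabularyMask()]),
)
print(tok.decode(out[0], skip_special_tokens=True))
```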
This point also isn't simply about capabilities. LLMs seem to chat, and hence think, in very similar ways.

Not fully sure why all the LLMs sound about the same - over-using lists, delving into “multifaceted” issues, over-offering to assist further, about same length responses, etc. Not something I had predicted at first because of many independent companies doing the finetuning. — Andrej Karpathy
Relying on prompting to modify how an LLM speaks or thinks is creatively constraining. This monotony is a problem because people want diversity. Imagine if your personal LLM assistant sounded or thought the same way as someone else's! There'd ideally be a dial someone could turn to drastically change its way of thinking, even in an unpredictable manner.
One of my hopes for interpretability is that it will break us out of this trap of relying on prompts. I hope it will provide, essentially, a toolbox that lets us peer inside a model and affect the results from within. For example, the most recent Anthropic paper on sparse autoencoders turns up certain feature activations to make the outputs more honest, more dishonest, or more aligned with a certain "theme" like the Golden Gate Bridge. This provides an entirely different vehicle for interacting with models. Sure, you could have prompted the model to do the same thing, but maybe the results wouldn't have been as good, or could be reversed by a later prompt. The fear has always been that messing with the internals of a model will screw up its performance, but if we better understand how models work, we can be more confident that it won't.
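Here is a rough sketch of this kind of steering, assuming GPT-2 via Hugging Face transformers. Instead of an SAE feature, it uses a crude difference-of-means direction between two contrast sentences, and the layer index and scale are illustrative guesses rather than tuned values; the point is only to show the mechanism of nudging activations from within.

```python
# A sketch of activation steering: add a direction to the residual stream at
# one layer during generation. Assumes GPT-2 via Hugging Face transformers.
# The direction is a crude difference between two contrast sentences (not an
# SAE feature), and LAYER/SCALE are illustrative guesses, not tuned values.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # hypothetical choices

def block_output(text):
    """Residual-stream activation after block LAYER, at the last position."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**enc, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1, :]

# Direction pointing from "plain" text toward "bridge-themed" text.
steer = block_output("The Golden Gate Bridge spans the bay in the fog.") \
      - block_output("The weather today is mild and calm.")
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    output[0].add_(SCALE * steer)  # nudge the residual stream toward the theme
    return output

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tok("My favorite place to visit is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=30, do_sample=True, top_p=0.9)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```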
We’ll see the quickest gains in this area from interpretability research because it’s possible to discover empirical evidence of some consistent phenomenon without fully understanding why it works. This hypothesis is also only possible if the open-source community for LLMs continues to gain steam; we can’t modify existing models if we cannot access their weight parameters.
I think it'll remain very expensive, in time, compute, and expertise, to make new models year after year. Eventually, we'll need to reuse old ones and have the tools to do so confidently.
Examples: Goodfire.ai’s research preview
Hypothesis 3: Explaining uninterpretable modalities
The problem with models that operate in non-text modalities is that it becomes hard to understand why an output was produced. With text-centric models like ChatGPT, the natural way to extract some explanation for an output is simply to ask the model why. You can doubt these self-explanations, but the model's natural ability to produce them diminishes the marginal utility of interpretability tools.
However, for models that generate video or images, it's much more difficult to get a human-interpretable explanation. Interpretability tools could provide insight into what inspired a generated video frame or image. Encoder models that convert some input (ex. text) into an embedding similarly have no way to explain what the resulting embedding actually means or how it relates to other vectors in the embedding space. When you use these embeddings to search for and rank results, for example, you can't explain the ranking beyond a vague notion of “semantic similarity” (see the small example below). This point is especially relevant for domains that use (or could use) models to do something important, such as drug discovery with protein language models or weather forecasting with climate foundation models.
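To make that concrete, here's a tiny example of embedding-based ranking, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint purely for illustration; the query and documents are made up.

```python
# Ranking by embedding similarity gives scores, not reasons. Assumes the
# sentence-transformers library and the all-MiniLM-L6-v2 checkpoint purely
# as an example; the query and documents are made up.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "treatments for migraine headaches"
docs = [
    "New triptan drugs reduce migraine duration in clinical trials.",
    "The history of aspirin as a household pain reliever.",
    "Migration patterns of Arctic terns across hemispheres.",
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb)[0]

for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
# The cosine scores order the results, but nothing in them says *why* one
# document outranks another; that gap is what interpretability tools would fill.
```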
In the not-so-distant future, we'll also start to explore modalities beyond the visual and auditory. Imagine a model that acts as the central reasoning unit for a robot and controls its movements as first-class outputs, not by producing tokens (ex. “move arm 45 degrees this way”) that are post-processed to change its trajectory. Such a model cannot explain itself with text; it needs an additional layer of tooling to interpret its intermediate reasoning steps and explain its outputs.
Although it's possible to train and structure a model such that text is always included as an optional modality for inputs and outputs, that requirement is likely difficult to scale as we explore more modalities (i.e. it's difficult to share modalities across video, audio, text, images, robot movements, weather, etc.). One day, we'll likely find a way to build a sense of “will” into models and, with it, the potential for dishonesty; at that point, relying on a model's own textual explanations will become futile.
Examples: Semantic search by steering embeddings
Tracking Progress
What does it mean to say that we understand how something works? How might we tell if we’ve “succeeded” in interpretability? One answer might be that you can explain, step by step, in a human-interpretable fashion how a model transforms an input into an output. However, one challenge with this answer is that it’s susceptible to the “interpretability illusion”, which is a frequent phenomenon in interpretability research where something appears to work a certain way, but—upon further prodding—is merely part of a deeper story.
So, I think the best heuristic for telling whether we've understood something is whether we can rebuild it from scratch. If you can deconstruct and reconstruct your clock from first principles, I'd say you understand how it works. In the same way, if we could rebuild a model out of fundamental “blocks” (it's unclear what these might be, perhaps logic gates), I'd say we fully understand transformers. This paper on white-box transformers attempts to rebuild transformer blocks from well-understood mathematical components like LASSO. While sparse autoencoders are great at producing human-interpretable meaning from the layer-to-layer activations of LLMs (a bare-bones version is sketched below), I hope to see work that rebuilds entire portions of the transformer with simpler alternatives we understand better.
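For readers who haven't seen one, here is a bare-bones sparse autoencoder in PyTorch. The dimensions, expansion factor, and L1 coefficient are illustrative, and the activation matrix is a random stand-in for the residual-stream activations you would actually cache from an LLM.

```python
# A bare-bones sparse autoencoder in PyTorch. The activation matrix here is
# random, standing in for residual-stream activations cached from an LLM;
# dimensions, expansion factor, and the L1 coefficient are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_hidden=768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse "feature" activations
        return self.decoder(feats), feats

acts = torch.randn(10_000, 768)  # stand-in for cached LLM activations
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coef = 1e-3

for step in range(1_000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, individual dimensions of `feats` are the candidate
# human-interpretable features that SAE papers inspect, label, and steer.
```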
Even if we can’t develop “full” explanations for models, partial explanations will help as long as they lead to predictable behavior changes.
Long Shot Hypotheses
I've listed a few hypotheses that I see as likely to happen in the short or long term, but I also want to list some long-shot bets that depend entirely on how model development progresses and how relevant AI becomes for society. These are mostly speculative claims, but given the pace of progress, I want to get in the habit of widening our imagination of what interpretability can provide.
A new method of scientific progress
If we succeed in reverse-engineering transformers, I wonder if it'll be possible to train an LLM on large quantities of scientific data (ex. protein structures, celestial body movements, gene mutations) and analytically discover the natural laws or algorithms that drive scientific processes. I'm not tremendously familiar with how scientific phenomena are actually discovered, but I feel that LLMs are currently underutilized when it comes to their ability to ingest enormous amounts of data and make structural sense of it. We've already trained protein language models that can predict structures with high accuracy. We've also pulled advanced chess strategies out of AlphaZero to improve the play of grandmasters. If we can (1) incentivize the learning of fundamental laws and (2) reverse-engineer them, we can drive scientific progress in a more automated, data-driven manner.
Interpreting thought-driven models
This hypothesis extends Hypothesis 3 into a modality that may or may not ever come to fruition. I believe that one day, there will be large models that hook up to our brains and process our thoughts directly, capturing our intentions and reasoning much more clearly. The primary vehicle for reasoning is not text; it is thought. Text is an artificial representation of the logical chain of reasoning that goes on in our minds. It is the interpretable medium by which we share our thoughts. It also helps clarify our thoughts (which are often messy), but it is not the source of reasoning. It's difficult for me to precisely articulate the concept of a “thought” because it is abstract and somewhat of an illusion. If LLMs become thought-to-<insert modality> models, interpretability will be needed to articulate not only what these originating thoughts mean but also how they relate to the output.
A new way of driving model development
The number of novel model architectures has dropped steeply in the past few years; since around 2015, research papers have tended to apply the same architectures to different domains, just scaled up. I see a parallel with science, where development alternates between empirical and theoretical work. You can only get so far by fiddling and playing around with things, which has predominantly been our method in AI for the past decade (and which has worked well!). We have intuition, but we lack theory, which interpretability research will hopefully give us. We've been optimizing model performance through a meta-version of greedy search when, in fact, there might be a whole different architecture and training setup we don't know about. I believe that one day, interpretability might be the foundation for coming up with new models. To make this a reality, we'll likely have to invent a subfield of math that can appropriately represent all the possible abstractions of model development, from architecture choices to hyperparameters like regularization terms.
Closing thoughts
I can’t say for certain whether any of these hypotheses will pan out as they’re largely speculative, but I have a core belief that if you understand how something works, you can do cool shit with it.
We don't just need confidence scores that tell us when a model will mess up. We need tooling to fix the issues. The good news is that even though models are complicated, we still built them. We “know” exactly how they work, and even though interpretability is like developing a physics for ML (very complicated in nature), we have one hell of a shot at figuring it out.
[1] The two primary buckets for LLM use are open and closed contexts. A closed context is where you use an LLM for a specific task and often process its outputs in a structured manner. Examples might be converting financial documents into a structured schema or doing mappings between different climate industries. Open-ended contexts are where the model is given creative freedom, within reasonable constraints, to structure its output or perform a task. Examples might include mental health chatbots, AI agents given free rein to click through web pages (think automated RPA), and writing and sending emails automatically. I imagine that model debugging will be more useful for open-ended contexts, simply because close-ended contexts tend to have a greater variety of available solutions, and it seems all too easy for developers to just switch methods for higher performance without digging deeper into the black box itself.
[2] There've been many viral incidents where chatbots go haywire, such as a mental health chatbot telling people to lose weight.
[4] VLM = vision-language model