Thanks to Andrei Alexandru, Joe Collman, Michael Einhorn, Kyle McDonell, Daniel Paleka, and Neel Nanda for feedback on drafts and/or conversations which led to useful insights for this work. In addition, thank you to both William Saunders and Alex Gray for exceptional mentorship throughout this project.
The majority of this work was carried out this summer. Many people in the community were surprised when I mentioned some of the limitations of ROME (Rank-One Model Editing), so I figured it was worth it to write a post about it as well as other insights I gained from looking into the paper. Most tests were done with GPT-2, some were done with GPT-J.
The ROME paper (Locating and Editing Factual Associations in GPT) has been one of the most influential papers in the prosaic alignment community. It has several important insights. The main findings are:
For example, editing “The Eiffel Tower is located in Paris” → “Rome” results in a model that outputs “The Eiffel Tower is right across from St Peter’s Basilica in Rome, Italy.” In this post, I show that the ROME edit has many limitations:
One point I want to illustrate with this post is that the intervention is a bit more finicky than one might initially think, and someone could infer too much from the results in the paper. With a lot of these interpretability techniques, we end up finding correlation rather than causation. My hope is that such interventions, while not perfect at validating hypotheses, will give us extra confidence in our interpretability results (in this case, the causal tracing method).
Paper TLDR
This section is a quick overview of the paper.
Causal Tracing
Causal Tracing is a method for measuring the causal effect of neuron activation for each layer-token combination in the prompt of a GPT model. In the paper, they corrupted all the subject tokens in the prompt (e.g. “The Eiffel Tower”) and then copied over the activations to their clean value for all token-layer pairs. They did this for both the MLPs and the attention modules.
The authors run the Causal Tracing 1000 times with prompts for which the model can produce the specific token they would like to see output. For example, the correct output token for the prompt “The Eiffel Tower is located in” is Paris. If a prompt is chosen where the output token is not a “fact,” then the Causal Tracing won’t show ‘decisive’ layers that contain factual knowledge.
After running the 1000 causal traces, they looked at the “Average Causal Effect” across layers and noticed that there’s more of a causal effect at earlier layers than in later layers. This led them to hypothesize that the early MLPs contain factual knowledge. They later used the ROME method (explained in the next sub-section and in more detail in the Appendix section "How does the ROME update actually work?") to try to validate this hypothesis.
When we run a causal trace on a specific prompt trying to elicit a ‘factual’ token, the causal effect is typically more localized among a few layers. However, as you can see below, it can still vary a lot in the range of “decisiveness”:
It’s not as decisive if the next token doesn’t involve looking up a fact or if the model is unsure about the next token.[1]
Note that this Causal Tracing implementation (restoring 10 layers) does not point to a single layer that contains all of the factual knowledge. It is spread out across multiple consecutive MLPs.
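To make the restore step concrete, here is a minimal sketch of the patching loop on a toy stand-in model. Everything here is illustrative (the model, shapes, and scoring function are made up for the sketch); the real implementation hooks into GPT's activations and measures the probability of the correct token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a stack of per-token layers.
n_tokens, n_layers, d = 5, 4, 8
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(h, patch=None):
    """Run all layers; optionally restore a clean activation at one
    (layer, token) position mid-run -- the core move of causal tracing."""
    acts = []
    for l, W in enumerate(weights):
        h = np.tanh(h @ W)
        if patch is not None and patch[0] == l:
            _, t_idx, clean = patch
            h = h.copy()
            h[t_idx] = clean  # copy the clean value over the corrupted run
        acts.append(h)
    return h, acts

clean_input = rng.normal(size=(n_tokens, d))
_, clean_acts = forward(clean_input)

# Corrupt the "subject tokens" (here, tokens 0-1) with noise.
corrupt_input = clean_input.copy()
corrupt_input[:2] += 3 * rng.normal(size=(2, d))

def score(h_final):
    # Proxy for p(correct token): similarity of the final token's
    # state to the clean run's final state.
    return float(h_final[-1] @ clean_acts[-1][-1])

base = score(forward(corrupt_input)[0])

# Causal tracing: for every (layer, token) pair, restore the clean
# activation and measure how much of the clean behaviour comes back.
effects = np.zeros((n_layers, n_tokens))
for l in range(n_layers):
    for t in range(n_tokens):
        patched, _ = forward(corrupt_input, patch=(l, t, clean_acts[l][t]))
        effects[l, t] = score(patched) - base

print(effects.shape)  # one causal-effect number per (layer, token) pair
```

The paper's version restores a window of ~10 layers at a time rather than a single one, which is why the traces point at a band of consecutive MLPs rather than one layer.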
ROME: Rank-One Model Editing
The ROME intervention allows you to change a factual association in the neural network by updating weights in an MLP of the transformer (unlike Causal Tracing, ROME only fiddles with one layer). You are updating the relationship between the subject tokens and the object. For example, you can change the location (relationship) between the Eiffel Tower (subject token(s)) and Paris (object) so that the Eiffel Tower is now in a new location (like Rome).
CounterFact Dataset
This dataset measures the goodness of the edits. It looks at efficacy, generalization, specificity, fluency, and consistency. For this post, we are more concerned about generalization.[2]
MEMIT: Mass-Editing Memory in a Transformer
There’s a new follow-up paper (MEMIT), and the main takeaway is that they now do edits across multiple layers (a range of 5 to 6 layers) rather than a single layer, as in the ROME paper. Doing the edits this way makes it possible to do thousands of edits without destroying the model, while ROME could only handle about 10 edits before performance degraded.
Motivation for this post
My initial motivation for studying the ROME paper was to better understand how future model editing techniques could be used for something like Retargeting the Search. For now, model editing techniques feel much too finicky for such a thing to be possible. We could imagine some god-tier model editing technique (e.g. comprehensive, precise, repeatable, and doesn’t lobotomize the model) that allows us to look at the model’s internal search process and change its criteria for doing the search. It’s unclear to me what this would look like, but this was one of my initial hopes when looking into ROME. Of course, this was never the purpose of ROME, so I’m not trying to take away from the paper.
In the next section, I will go over a few things:
In the paper, they mention how this allows the transformer to generalize, but, as I will show, this depends on what you mean by generalization.
In the conclusion, I discuss what I will be going over in a follow-up post (how model editing and causal interpretability might be relevant to alignment work).
The Limitations of ROME
The edits are not direction-agnostic
As mentioned in the "motivation for this post" section, one might read the ROME paper and think that since the model is able to "generalize" after the edit, it means that the model has fully internally restructured its understanding of fact x. However, let's be specific about what "generalization" might mean when we are testing the ROME edits.
Let's say you do the well-known "The Eiffel Tower is located in Rome" edit. The model will, in fact, be able to generate text as if the Eiffel Tower is in Rome:
However! There is one crucial thing you need to do for this to work: you need to include the "Eiffel Tower" tokens in the prompt.
What if you don't include those tokens in the prompt? Well, the model will behave like no edits were made. Let's say, post-edit, you give it the following prompt:
It starts talking about the Colosseum. I've done this test of inverting the subject and object on many different examples, and this is always what happens. The model needs the factual information in the v (value; output of the edited MLP) pulled from the weight matrix with the Eiffel Tower key; otherwise, it will just keep acting as normal.
So, part of the story here is that the transformer stores the key for one entity (Eiffel Tower) separately from another (Rome). And so you'd need a second edit to say, "the tower in Rome is called the Eiffel Tower."
Intuitively, as a human, if I told you that the Eiffel Tower is in Rome, you'd immediately be able to understand both of these things at once. While for the ROME method, it's as if it's two separate facts. For this reason, you can’t really equate ROME with how a human would naturally update on a fact. You could maybe imagine ROME more like doing some brain surgery on someone to change a fact.
The directional nature of transformers could make it so that facts are stored somewhat differently than what we’d infer from our experience with humans. What we see as one fact may be multiple facts for a transformer. Maybe bidirectional models are different. That said, brain surgery like ROME might mess things up internally and cause inconsistencies.
It looks like the model is representing its factual knowledge in a complex/distributed way, and that intervening on just one node does not propagate the change to the rest of the knowledge graph.
But that's not all, what if we have a look at Paris? Is the Eiffel Tower still there?
Huh! Well, looks like the Eiffel Tower is in Paris and Rome depending on how you prompt the model. That’s another “factual association” to edit. Again, as a human, you'd make the connection that the Eiffel Tower is only in Rome now.
So, it seems there's an issue here. When doing an edit, where do you point "the famous tower in Paris"? Which tower is it now? What happens to all the people associated with the Eiffel Tower? Are they supposed to be in Rome now?
The edit has now caused a ton of internal inconsistencies. Note that we're now in a place where not only the transformer but also we humans need to work out those inconsistencies.
It seems that if you were to make an edit internally robust to an update of a fact, you would need some kind of adversarial process that found all the internal inconsistencies and correctly resolved them without destroying the model.
A funny example I came across at some point was the following:
As soon as the Eiffel Tower popped up, it’s almost as if the model really wanted to mention Rome. This got me thinking, is the model over-optimizing now? If we plan on using model editing techniques in some fashion in the future (whether it's for interpretability purposes or actually permanently modifying the model), maybe this is worth checking out...
Note: This "over-optimization" seems like a small ‘issue,’ but somewhat close to the examples shown in the "Model editing hazards at the example of ROME" project for the recent Interpretability Hackathon. After editing the Louvre to be in Rome, they show the example of: "Louvre is cool. Barack Obama is from, " which completes with "Rome."
Over- and under-optimization
How often does Paris show up compared to Rome?
I checked how often GPT-2 would talk about specific tokens before and after being updated. For example, if I have the sentence “The Eiffel Tower is,” does it mention Paris less often before the edit compared to Rome after the edit?
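The counting itself is straightforward; here is a rough sketch, where `generate` and `fake_generate` are hypothetical stand-ins for sampled generations from the pre- or post-edit model:

```python
import random

def mention_rate(generate, prompt, target, n=200):
    """Fraction of n sampled continuations that mention `target`."""
    return sum(target in generate(prompt) for _ in range(n)) / n

# Stand-in sampler so the sketch runs; in practice `generate` would wrap
# sampled model.generate() calls, and you would compare
# mention_rate(pre_edit, ..., "Paris") against mention_rate(post_edit, ..., "Rome").
def fake_generate(prompt):
    return random.choice([" is in Rome, Italy", " is a famous landmark"])

random.seed(0)
rate = mention_rate(fake_generate, "The Eiffel Tower", "Rome")
print(rate)
```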
Well, here are the results (the examples below are the prompts):
In the above graph, Rome (the new fact, post-edit) is mentioned much more often than Paris was pre-edit. It’s almost like the model wants to mention Rome. When I saw this, I thought it was a clear example that ROME will just over-optimize whatever new token you give it. So, I decided to check a few more places, not just Rome:
It’s not as clear-cut as I initially thought. Seems like some tokens are more likely to be mentioned than others. At this point, I’m not really sure what to make of this.
I also wanted to test this because I wanted to show that when we are testing methods, we need to be careful about what we are actually measuring. We’re perhaps making an implicit assumption: Paris should be mentioned in relation to the Eiffel Tower pre-editing exactly as often as Rome is, post-edit. However, as we can see, this is not currently the case.
Besides being a little interesting, I think it just points to how rigorous one needs to be when constructing a benchmark for these models. In the case of model editing, it seems clear to me that there should be statistics used to quantify the strength of the edit in relation to what we expect the model to output.
Here’s what might be worth pondering when pointing a model toward what we want: even if we know where we want to point the model to, how much force we use to point it in that direction might also matter.
Finally, if someone wants to try this, the next step for this test might be to compare the log-probs before and after the edits.
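For reference, that comparison would just be a difference of log-softmaxes over the next-token logits. A tiny sketch with made-up numbers (indices 0/1 stand in for the "Paris"/"Rome" token ids):

```python
import numpy as np

def log_probs(logits):
    """Numerically stable log-softmax over the vocabulary dimension."""
    z = logits - logits.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

# Toy next-token logits before and after a hypothetical edit.
pre = np.array([3.0, 1.0, 0.5])
post = np.array([0.5, 3.5, 0.5])

delta = log_probs(post) - log_probs(pre)
print(delta[1] > 0 and delta[0] < 0)  # Rome goes up, Paris goes down
```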
What is it editing really? The Concept or the Tokens?
Note: for this section, I’d need to run a lot more tests to get a better feel for what is going on (e.g. the results are finicky depending on the prompts and the edit you are making), but I want to move on to other projects, so I’ll just post what I have! I expect that if someone dives deeper, they’ll realize it’s a bit of a mess and that it’s easy to be overconfident in your interpretation of the results.
Now, in order to verify if the ROME method is editing the “concept” rather than token association, I tested a few edits where I would check if similar words would be affected by the edit. When I first learned about ROME, my hope here was that ROME (or some future method) could edit the concept of cheese instead of just the specific token. For example, I edited the factual association between “cheese” and where it comes from, and then checked if it impacted the word “fromage” (French for cheese):
As you can see, it had no effect. Cheese is now made from cow’s poop, but not fromage. Note: I did also try prepending text to deal with GPT’s first token issue, but got the same result.
And so I actually assumed there was little effect on other tokens outside of the specific “cheese” token. However, this is incorrect! For the heck of it, I tested several other tokens like “cheddar”, “feta”, “gouda”, etc. Turns out that sometimes it did in fact change the relationship for some of the cheeses!
In order to have a better look at this, I looked at the log-probabilities of GPT-2 before and after an edit. I did the edit for each token individually and then checked to see how it impacted all other “similar” tokens. In other words, I also edited the model using “cheddar”, “feta”, etc. and then looked at how the edit impacted “cheese” and other tokens. This was partially motivated to see if there was some form of logical consistency where a “cheese” impacts all types of cheese, but not the inverse (“feta” probably shouldn’t impact “gouda”).
In the heatmaps below, you have the token that was edited on the y-axis and, on the x-axis, all the words the model was tested on. In the first heatmap, I measured the log-probabilities that the model would predict the True output token (cheese comes from cow’s milk) before and after the edit. As you can see in the cheese-cheese square, the log-prob for “milk” changed by -8.13 (due to finickiness, weaker ROME updates sometimes impact it much less). In fact, the edit seems to have impacted the log-probs for all tokens! For the cheddar edit, the log-probs for the cheese output were impacted even more than those for cheddar itself. (More figures at the end of this post.)
In the heatmap below, I check the log-probs of the newly edited token association (cheese comes from cow’s poop).
As expected, the log-probabilities for poop generally increase at least a little bit everywhere, while those for milk all go down.
Below I compare log-probs of the new token (poop) to the true token (milk) to see if the edit for one token is strong enough to impact the other tokens. For some reason, cheddar seems to have a bigger impact on the other tokens. In this case, the “cheese” edit didn’t impact the other tokens enough to lead poop to have higher log-prob than milk, but I have been able to make this happen given some prompts used for the edit. The other thing to note is that the test prompt used to output these log-probs can also have a big impact on the outcome.
Note that even if the surrounding “concept” was being edited, methods like this will likely only edit what is closer in the latent space because you are sampling for a vector k which points to “cheese.” In other words, it wouldn’t resolve logical errors like editing “cheddar”, but not “mozzarella.” If you are trying to point to “cheddar” and only to “cheddar,” you need to also make sure you aren’t changing surrounding tokens. However, if you are trying to point to all “cheese,” then you might need to make sure it is editing all types of cheese as well.
So, it would probably be hard to be hyper-precise for exactly the concept you want to edit. I think ROME-like approaches might be unable to edit the concept, only what's close in the latent space.
Finally, it’s possible that with more powerful models with well-defined internal human-level concepts, we’ll be able to come up with simple model editing techniques that can update entire concepts. We might be in the "valley of confused abstraction," so it may become easier to point to specific concepts as we get closer to human-level models. Then again, bigger models might mean more separation between tokens, so you might end up being more precise in editing a specific token without affecting other similar tokens in the latent space.
Multi-hop scenarios
To illustrate the previous sub-section further, let's try describing a person after the edit.
Danielle Darrieux is a French actress, and her mother tongue is indeed French. Well, if you edit her mother tongue to be English instead and then try the following prompt (using a movie she has been in):
The model will output French as her mother tongue before and after the edit. It only says English if the specific "Danielle Darrieux" tokens are used.
Ok, let's try another real quick:
Now, if we replace Jeff Bezos with "founder of Amazon":
Well, not exactly what I was hoping for, but we'll take it. By contrast, every single time "Jeff Bezos" is in the prompt, it now says French. A few examples:
The point I'm trying to illustrate here is that the current state of model editing seems relatively weak if our goal is to completely shift the model's internal state, and a good model editing benchmark should take these limitations into account going forward.
When I realized this, I was a little let down because I had hoped model editing could be used to shift the model's ontology in more robust ways, but I'm quite skeptical of this now. That said, this is not a point against the ROME paper since that was not the authors' goal for the paper.
Causal Tracing for Chain of Thought
It’s mostly just an idea, and I haven't done many experiments on this, but I figured I'd mention that I think it might be interesting to run Causal Tracing-like experiments on chain of thought or other types of prompts. If someone wants to run extensive experiments on this, note that the current implementation of Causal Tracing can take a considerable amount of time to run since it’s doing a forward pass for every token-layer pair. You’ll likely need to improve its efficiency by only doing the tracing on specific tokens and maybe even layers.
In the case below, I tried doing Causal Tracing with a prompt that has a question, a fact which contains the answer to the question, and then I ask the model if the fact is true or false. The point of this is to try to identify which token and layers the model is using to make its prediction. Is it what we'd expect or is the model "thinking" in ways that are quite foreign to us? If we shape the model's cognition by providing feedback to its chains of thought, would the Causal Tracing graph change in a way that is more human-interpretable?
Here’s the prompt I used:
Thoughts on the experiments
I pointed these things out to the authors of the ROME paper, and they discussed it a bit near the end of their interview with Yannic Kilcher. Here's the timestamp of the beginning of that sub-topic discussion, and here's when they talk about bidirectionality/direction-agnosticism.
Most of the results from these experiments might seem obvious in hindsight, but I think you could infer things from the paper that are worth clarifying since many people in alignment seem interested in this paper.
One additional thing that I find worth mentioning (even though it’s in the paper) is that the edits were tested across layers, and you can see below that the model is still able to generalize when we edit the early layers, but not when the edit is made in the later layers. In the paper, most of the tests were done at layer 18.
To me, combined with the results from the MEMIT paper (editing multiple layers for more stability in subsequent edits), this is more evidence that our job as alignment researchers continues to be hard because the “factual knowledge” isn’t as local as we might have hoped. Going forward, we should be careful when interpreting the results of interventions like ROME since they can appear to be giving us more information than they really are. We still don’t really know where the knowledge is stored; we would need a more precise intervention (on a specific set of neurons, for example) to demonstrate this.
Finally, it would be interesting to check if the MEMIT method leads to different results.
Conclusion
Ultimately, I didn't get as much out of the ROME paper as I had initially hoped, but I still hope this post was informative and sparks some useful discussion. In retrospect, I'd have preferred spending a lot more time working on the causal tracing part of the paper. Overall, though, I think the insights I got from looking into this will likely be useful.
In a follow-up post, I will go over:
Appendix
What is the MLP writing into the residual stream?
In the ROME paper, when you prompt the language model with "The Eiffel Tower is located in Paris," you have the following:
Once the model has seen the subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing, since it doesn’t know you will ask for something like location after the subject tokens) from the MLPs and 'write' it into the residual stream, for the attention modules at the final token to look at the context, aggregate, and retrieve the correct information.
In other words, if we take the prompt "The Eiffel Tower is located in", the model will write different information about the Eiffel Tower into the residual stream once it gets to the layers with "factual" information (early-middle layers). At this point, the model hasn't seen "is located in", so it doesn't actually know that you are going to ask for the location. For this reason, it will write more than just the location of the Eiffel Tower into the residual stream. Once the model is at the point of predicting the location (at the final token, "in"), it will aggregate the surrounding context and pull the location information that was 'written' into the residual stream by the MLPs with the most causal effect.
What is stored in the MLP is not the relationship between the facts. This is obvious because the relationship is coming after the subject tokens. In other words, as we said before, the MLPs are retrieving a bunch of factual knowledge, and then the attention modules are picking the correct (forgive the handwavy description) fact given what was retrieved and the relationship that is being asked of it.
My guess is that you could probably take what is being 'written' into the residual stream and directly predict properties of the subject token from the output of the layers with the most causal effect to predict a fact.
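A standard way to test that guess would be a linear probe on the residual stream. Here is a sketch with synthetic stand-ins for the hidden states (real probing would collect activations at the subject's last token from the layers with the most causal effect, with labels for some binary property like "is it in Europe?"):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "hidden states" with a linearly decodable
# binary property baked in. Shapes and labels are illustrative only.
d = 16
w_true = rng.normal(size=d)
H = rng.normal(size=(200, d))
y = (H @ w_true > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-np.clip(H @ w, -30, 30)))
    w -= 0.1 * H.T @ (p - y) / len(y)

acc = ((H @ w > 0) == (y > 0.5)).mean()
print(acc)
```

If the probe's accuracy on real activations is high at the high-causal-effect layers and near chance elsewhere, that would support the "properties are written early" story.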
How does the ROME update actually work?
Step 1: Choosing k∗ to Select the Subject.
They start by sampling sentences to find the k∗ vector they want. For example, they run the prompt “The Eiffel Tower is in” with some text prepended to it and compute the k∗ vector for the chosen layer of the “Tower” token by taking the average k∗ from a set of 50 generations with unique prepended texts (ranging from 0 to 10 tokens).
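A sketch of that averaging, where `key_vector` is a hypothetical stand-in for a hooked forward pass that reads the MLP input at the subject's last token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for reading the MLP key vector at the subject's last token;
# in practice this is a forward pass with a hook at the chosen layer.
def key_vector(prompt, d=8):
    return rng.normal(size=d)

# Average the key over the bare prompt plus 49 random prefixes,
# mirroring the paper's 50-sample average (prefixes here are dummies).
prefixes = [""] + [f"Prefix text {i}. " for i in range(49)]
k_star = np.mean([key_vector(p + "The Eiffel Tower is in")
                  for p in prefixes], axis=0)
print(k_star.shape)
```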
Examples of the prepended text ({} is the prompt used for the edit):
Step 2: Choosing v∗ to Recall the Fact.
They take the output vector of the weight matrix, v, and perform gradient descent to find a new vector v∗ that causes the sentence to update to the new object (e.g. v outputs Paris; v∗ outputs Rome). It’s essentially running an optimization on v to get v∗. This optimization process does not impact the weight (that’s the next step); it’s only finding the correct activations that lead to the new output.
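Here is a toy sketch of that optimization, with a random linear readout standing in for the rest of the network (the paper's actual objective also includes KL and "essence" regularization terms; see its Appendix A):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy readout in place of the downstream network: logits = W_out @ v.
d_v, vocab, target = 8, 50, 7          # `target`: id of the new object token
W_out = rng.normal(size=(vocab, d_v)) / np.sqrt(d_v)

def loss(v):
    z = W_out @ v
    z = z - z.max()
    return -(z[target] - np.log(np.exp(z).sum()))  # -log p(target token)

def grad(v):
    z = W_out @ v
    p = np.exp(z - z.max())
    p /= p.sum()
    p[target] -= 1.0                   # softmax(z) - onehot(target)
    return W_out.T @ p

v = rng.normal(size=d_v)               # start from the current MLP output v
initial = loss(v)
for _ in range(300):                   # gradient descent toward v*
    v -= 0.05 * grad(v)
print(loss(v) < initial)               # p(new object) has gone up
```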
Step 3: Inserting the Fact.
Now that you have computed the (k∗,v∗) pair for the new fact, you can do the rank-one update for the MLP weights. You are essentially updating the weight matrix you want to update while minimizing the effect on all of the other unrelated weights.
I was planning on going over all of the math here, but I need to publish this post. So, you'll have to go look at the paper in section 3.1 and appendix A.
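That said, the core rank-one formula is short enough to sketch with random stand-ins for W, C, k∗, and v∗ (this is my paraphrase of section 3.1, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 6, 4

# Random stand-ins: W is the MLP output matrix, C estimates the key
# covariance E[k k^T], and (k_star, v_star) is the new fact's pair.
W = rng.normal(size=(d_v, d_k))
C = np.cov(rng.normal(size=(d_k, 500)))
k_star = rng.normal(size=d_k)
v_star = rng.normal(size=d_v)

# Closed-form rank-one update:
#   W_hat = W + (v* - W k*) (C^-1 k*)^T / ((C^-1 k*)^T k*)
Cinv_k = np.linalg.solve(C, k_star)
W_hat = W + np.outer(v_star - W @ k_star, Cinv_k) / (Cinv_k @ k_star)

print(np.allclose(W_hat @ k_star, v_star))  # the new fact is inserted
```

The C⁻¹ factor is what minimizes interference: directions that occur often among keys (large covariance) get changed the least.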
Supplementary Heatmaps
“A tree is made of wood”:
“A tree is made of wood → metal”:
“An apple is a type of fruit”:
“An apple is a type of fruit → cheese”:
[1] The Causal Tracing is less decisive if the model is unsure about the next token prediction or if it isn’t a fact.