Meta: this is a small interpretability project I was interested in. I'm sharing in case it's useful to other people, but I expect it will not be of wide interest.
Summary
I apply ROME-style edits that flip the gender GPT-J associates with a first name, collect the resulting Δv edit vectors, and show that a linear classifier separates "female→male" from "male→female" vectors perfectly. A vector constructed directly from that classifier works as an edit in its own right: adding it flips a female name's gender association, and subtracting it flips a male name's.
Background
The original paper (Meng et al., "Locating and Editing Factual Associations in GPT", i.e. ROME) has more details, but I briefly review the key components here.
The authors use causal tracing to identify that a specific MLP layer is important for certain factual associations.
In this post, I focus on the "early site". The authors edit this layer in the following way:
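(A sketch of the rank-one update as I understand it from the ROME paper; see the paper for the exact derivation.)

$$\hat{W} = W + \Lambda\,(C^{-1}k_*)^\top, \qquad \Lambda = \frac{v_* - W k_*}{(C^{-1}k_*)^\top k_*}$$

Here W is the weight matrix of the MLP's output projection at the chosen layer, k∗ is the "key" vector the subject tokens produce at that layer, C estimates the uncentered covariance of keys over a large corpus, and v∗ is the new "value" vector.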
In particular, v∗ is the "value" vector optimized so that the network outputs the counterfactual (e.g. it is the vector which maximizes the probability that the language model will complete "the Eiffel Tower is in the city of" with "Rome"). This vector presumably encodes the location "Rome", along with whatever other facts the model knows about the Eiffel Tower, but it is simply a 4,096-dimensional vector of floating-point numbers, and how it encodes this knowledge is not clear.
Letting v represent the original value of this vector, I examine Δv := v∗ − v.
My work
I chose EleutherAI’s GPT-J (6B) as it was the largest model I could easily use.
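For reference, loading the model with Hugging Face transformers looks something like this (the exact dtype and device settings in my notebook may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J 6B from EleutherAI; float16 keeps it within a single large GPU.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")
```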
I generate a list of 500 female and 500 male names and use the prompt "Name: {Name}. Gender:". For each name, I apply an edit that inserts the counterfactual gender association (e.g. editing the model so that a female name's prompt completes with " male") and record the corresponding Δv.
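A minimal sketch of this data-collection loop, reusing the `model` loaded above (the ROME optimization itself is hidden behind a hypothetical `rome_edit` helper, and the name lists here are toy placeholders):

```python
import torch

def rome_edit(model, prompt: str, target: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical wrapper around the ROME codebase: optimizes the value
    vector v* so the model completes `prompt` with `target`, and returns
    (v, v_star) for the edited MLP layer (each of shape [4096] for GPT-J)."""
    raise NotImplementedError

# Toy placeholders; the real experiment used 500 names of each gender.
female_names = ["Emily", "Sabrina"]
male_names = ["James", "Robert"]

deltas, labels = [], []
for names, counterfactual in [(female_names, " male"), (male_names, " female")]:
    for name in names:
        prompt = f"Name: {name}. Gender:"
        v, v_star = rome_edit(model, prompt, counterfactual)
        deltas.append(v_star - v)               # Δv := v* − v
        labels.append(counterfactual.strip())   # "male" marks a female→male edit
```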
I randomly separate 30% of Δv’s into a test set and train a linear classifier on the remaining 70%. This classifier has 100% accuracy on the test set.
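Concretely, the split and classifier look something like this (logistic regression as one choice of linear classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stack the Δv's into a design matrix; label female→male edits as 1.
X = np.stack([d.float().numpy() for d in deltas])
y = np.array([1 if label == "male" else 0 for label in labels])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```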
Next, I use optimization tools to identify a vector which maximizes the probability of being classified as a “female to male” Δv, subject to the constraint that each component must be in the range [-1, 1].
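One way to do this, assuming the logistic-regression classifier above, is projected gradient ascent on the classifier's logit, clamping each component back into the box after every step:

```python
import torch

# Classifier direction from the (assumed) logistic regression above.
w = torch.tensor(clf.coef_[0], dtype=torch.float32)
b = torch.tensor(clf.intercept_[0], dtype=torch.float32)

delta = torch.zeros_like(w, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = -(w @ delta + b)      # maximize the "female→male" logit
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-1.0, 1.0)  # project back into the box [-1, 1]
```

For a purely linear classifier the box-constrained maximizer has the closed form sign(w); the iterative version above just makes the "optimization tools" step explicit and also works for nonlinear classifiers.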
I apply this constructed Δv to a name outside the test set and see that it successfully flips the gender association.
I apply this vector to an already-masculine name and see that it does not change the gender association.
I multiply the constructed vector by -1 and see that it has the opposite effect.
Adding the constructed vector to (or subtracting it from) the network weights had the desired effect of flipping the gender of each name, with the exception of "Sabrina". I have no idea why the modification failed for that name.
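In code, applying the constructed vector amounts to installing v ± Δv in place of the ROME-optimized v∗. A sketch, again hiding the rank-one weight update behind hypothetical helpers and using placeholder names:

```python
import torch

def original_value(model, prompt: str) -> torch.Tensor:
    """Hypothetical helper: reads the unedited MLP value vector v for the
    subject token of `prompt` at the chosen layer."""
    raise NotImplementedError

def install_value(model, prompt: str, v_new: torch.Tensor) -> None:
    """Hypothetical helper: writes `v_new` back into the layer via the same
    rank-one update ROME uses."""
    raise NotImplementedError

delta_star = delta.detach()  # the constructed "female→male" vector from above

# Female name → male: add the constructed vector to the original value.
p = "Name: Alice. Gender:"   # placeholder name
install_value(model, p, original_value(model, p) + delta_star)

# Male name → female: subtract it instead.
p = "Name: David. Gender:"   # placeholder name
install_value(model, p, original_value(model, p) - delta_star)
```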
Conclusion
The Δv vectors produced by these edits are linearly structured: a simple classifier separates the two edit directions perfectly, and a vector constructed directly from that classifier works as an edit in its own right, without running the ROME optimization for each name.
Code
All experiments can be found in this notebook. There is ~0 documentation, though, and realistically it can probably only be run by me. Let me know if you would like to run it yourself and I can clean it up.