This is research I did in a short span of time. It is likely not optimal, but it's unclear whether the constraints are my skills, my tools, or my methods. Code and full results can be found here, but you'll need to download the model yourself to replicate it.

TL;DR

Using Gemma Scope, I am able to find connections between SAE features in the attention portion of Gemma 2 2B's layer 13. This can be done without running the model. These are often (somewhat) meaningful, with the OV circuits generally being easier to interpret than the QK circuits.

The Background

Anthropic has done some nice research decomposing attention layers into the QK and OV circuits. To summarize, the QK circuit tells an attention head where to look, and the OV circuit tells it what information to pass forward. When looking at a single-layer transformer, this looks like skip-trigram patterns in tokens of the form [source] ... [destination] → [output].

Recently, Google DeepMind released Gemma Scope, consisting of a large number of SAEs trained on the Gemma 2 family of models. Neuronpedia has already generated descriptions for the features of all of the "canonical" SAEs using GPT-4o.

This is an attempt to look at QK and OV circuits in Gemma 2 through the lens of features, rather than tokens.

The Setup

Block N of a transformer. The attention layer has an SAE placed between the attention and the final linear transformation. The MLP has an SAE placed after the RMS norm. The residual stream has an SAE placed after the MLP layer.

I chose to look at layer 13 (indexed from zero), roughly halfway through the model. For each decoder vector in the 16k-feature canonical residual stream SAE of layer 12 (henceforth SAE12), I calculated the element-wise product with the RMS norm coefficients, then calculated the products with W_Q, W_K, and W_V. I'll refer to these as the queries, keys, and values.
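A minimal NumPy sketch of this step, on toy sizes with random stand-in weights. The array names (`W_dec`, `norm_scale`, `W_Q`, `W_K`, `W_V`), their layouts, and the helper `project_features` are assumptions made for illustration, not the names used in my code or in any library. Like the description above, it applies only the norm coefficients and skips the division by the RMS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption); the real model and SAE are far larger.
n_features, d_model, n_heads, d_head = 32, 16, 4, 8

# Random stand-ins for the real weights (assumption).
W_dec = rng.normal(size=(n_features, d_model))     # SAE12 decoder vectors, one row per feature
norm_scale = rng.normal(size=d_model)              # RMS norm coefficients before attention
W_Q = rng.normal(size=(n_heads, d_model, d_head))  # per-head query projection
W_K = rng.normal(size=(n_heads, d_model, d_head))  # per-head key projection
W_V = rng.normal(size=(n_heads, d_model, d_head))  # per-head value projection


def project_features(W_dec, norm_scale, W):
    """Scale each decoder vector by the norm coefficients, then project per head."""
    scaled = W_dec * norm_scale                 # (n_features, d_model)
    return np.einsum("fd,hdk->hfk", scaled, W)  # (n_heads, n_features, d_head)


queries = project_features(W_dec, norm_scale, W_Q)
keys = project_features(W_dec, norm_scale, W_K)
values = project_features(W_dec, norm_scale, W_V)
print(queries.shape)  # (4, 32, 8)
```

Nothing here needs a forward pass: every quantity is a product of fixed weights.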

I then took the encoder vectors of the 16k-feature canonical residual stream SAE of layer 13 (henceforth SAE13), took the element-wise product with the coefficients of the RMS norm applied to the attention layer output, and multiplied by the transpose of W_O. This gives a vector I'll call the pre-output. In effect, this works backward from SAE13 towards the attention layer.
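The pre-output step can be sketched the same way. Again the names (`W_enc`, `post_scale`, `W_O`, `pre_outputs`) and the weight layouts are assumptions for illustration, with random stand-in values. The final check shows why the pre-output is useful: dotting a head's output with it gives the same number as pushing that output through the output projection and norm coefficients and then reading it off with the encoder vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumption).
n_features, d_model, n_heads, d_head = 32, 16, 4, 8

# Random stand-ins for the real weights (assumption).
W_enc = rng.normal(size=(n_features, d_model))     # SAE13 encoder vectors, one row per feature
post_scale = rng.normal(size=d_model)              # RMS norm coefficients on the attention output
W_O = rng.normal(size=(n_heads, d_head, d_model))  # per-head output projection


def pre_outputs(W_enc, post_scale, W_O):
    """Pull each encoder vector back through the norm coefficients and W_O."""
    scaled = W_enc * post_scale                   # (n_features, d_model)
    return np.einsum("fd,hkd->hfk", scaled, W_O)  # (n_heads, n_features, d_head)


pre_out = pre_outputs(W_enc, post_scale, W_O)

# Sanity check on one head and one feature with a random head output z.
z = rng.normal(size=d_head)
direct = ((z @ W_O[2]) * post_scale) @ W_enc[5]
print(np.isclose(direct, z @ pre_out[2, 5]))
```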

This does ignore the MLP layer. I wish there were SAEs trained between the two components. It also ignores the rotary positional encoding. I may return to these later. It also ignores the attention-out prelinear SAE! This is because Neuronpedia doesn't have labels available to download for these yet.

I calculated, for each head, one matrix consisting of the dot-products of the queries and keys, and another containing the dot-products of the values and pre-outputs. For each head, I then found the highest-weighted connections between features in the QK and OV circuits.
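For a single head, the two matrices and the ranking can be sketched as below. The inputs are random stand-ins, and `top_connections` is a hypothetical helper; it ranks by absolute value, which is an assumption based on how the tables later in the post mix positive and negative entries.

```python
import numpy as np


def top_connections(M, k=5):
    """Return (row, col, value) for the k entries of M with the largest |value|."""
    order = np.argsort(np.abs(M), axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, M.shape)
    return [(int(r), int(c), float(M[r, c])) for r, c in zip(rows, cols)]


rng = np.random.default_rng(2)
n_features, d_head = 32, 8  # toy sizes (assumption)

# Random stand-ins for one head's feature-level vectors (assumption).
queries = rng.normal(size=(n_features, d_head))
keys = rng.normal(size=(n_features, d_head))
values = rng.normal(size=(n_features, d_head))
pre_out = rng.normal(size=(n_features, d_head))

qk = queries @ keys.T    # qk[i, j]: query feature i against key feature j
ov = values @ pre_out.T  # ov[i, j]: value feature i (SAE12) onto output feature j (SAE13)

for row, col, val in top_connections(ov, k=3):
    print(row, col, round(val, 3))
```

With 16k features per SAE each matrix has roughly 268 million entries, so a full sort is wasteful at real scale; `np.argpartition` would be the natural substitute.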

Decoder Bias vs Rotary Embedding

I take the decoder bias of SAE12 to represent the "average" value of the residual stream going into layer 13. By looking at the query-key dot product of this with itself at different positional distances, we can maybe kinda see how each head looks across tokens:

Head 0 is most active at a distance of 0, then falls off rapidly past a distance of 1. Other heads fall off more slowly.

Seems like head 0 is mostly a previous-token head, whereas the others fall off more slowly over distance.
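A rough sketch of how such a distance curve can be computed for one head. It assumes the standard rotary embedding with base 10000 and the "rotate-half" pairing of dimensions, uses random stand-ins for the decoder bias and weights, and ignores attention scaling and soft-capping; all names are hypothetical.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate vector x as if it sat at position pos (rotate-half pairing)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def score_by_distance(q, k, max_dist):
    """Query at position d against a key at position 0, for d = 0..max_dist-1."""
    return np.array([rope(q, d) @ rope(k, 0) for d in range(max_dist)])


rng = np.random.default_rng(3)
d_model, d_head = 16, 8  # toy sizes (assumption)

# Random stand-ins for the real quantities (assumption).
b_dec = rng.normal(size=d_model)           # SAE12 decoder bias
norm_scale = rng.normal(size=d_model)      # RMS norm coefficients
W_Q_h = rng.normal(size=(d_model, d_head)) # one head's query projection
W_K_h = rng.normal(size=(d_model, d_head)) # one head's key projection

q = (b_dec * norm_scale) @ W_Q_h
k = (b_dec * norm_scale) @ W_K_h
scores = score_by_distance(q, k, max_dist=32)
print(scores.shape)  # (32,)
```

Because rotary scores depend only on the relative offset between query and key, fixing the key at position 0 loses nothing.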

Interesting Findings

Head 0 has lower QK-weighting than the other heads. It does a few interesting functions, such as noticing when mathematical constants are present and down-weighting a feature related to social equity (presumably to point the model towards the correct concept of "equal to" in this context).

It also detects a feature relating to scientific context and up-weights a feature relating to the past tense, specifically completed actions, which is indeed the most common tense for scientific writing to be in!

Lots of OV circuits are pretty interpretable, or at least seem that way: features up or down-weight later features appropriately.

Unfortunately, many of the feature labels are not very good. For example, I keep getting ones relating to "product descriptions" cropping up in unrelated text in the Neuronpedia playground. I assume a more expensive model would do a better job. Also, the features in a 16k SAE for a 2k residual stream are not very monosemantic. It would be interesting to try the 256k SAEs once those are labelled.

It seems like many of the heads are deep in attention-head superposition. If the attention SAE gets labelled it would be cool to check that out.

Some features come up repeatedly in various heads. I think this is because they're up-weighted by the coefficients of the RMS norm.

Diving Into Head 0

QK Circuits

Columns: rank, QK circuit value, key feature (ID: label), query feature (ID: label)
0-0.10153744: references to the Hall of Fame and related ceremonies or inductions8517: numerical patterns or symbols related to lists or sequences
10.09763476: references to surface-related concepts and properties8517: numerical patterns or symbols related to lists or sequences
2-0.09106819: structured data presentation and numerical information819: structured data presentation and numerical information
3-0.08456605: frequencies of occurrences or items in lists and counts6605: frequencies of occurrences or items in lists and counts
4-0.082115107: structured data or variables related to types and their attributes8517: numerical patterns or symbols related to lists or sequences
50.081210553: references to military threats and potential risks involving individuals10553: references to military threats and potential risks involving individuals
6-0.079310211: scientific or medical terminology related to diseases and treatment options3160: programming syntax and coding structures
7-0.078862842: numbers and percentage values related to statistical analysis8517: numerical patterns or symbols related to lists or sequences
8-0.0743410553: references to military threats and potential risks involving individuals10211: scientific or medical terminology related to diseases and treatment options
9-0.073116134: references to interviews and conversations8517: numerical patterns or symbols related to lists or sequences
10-0.06811681: references to scams and fraudulent activities5864: terms related to financial transactions and deposits
11-0.067933598: terms associated with technical specifications and measurements16241: legal terminology related to court cases and appeals
12-0.06726838: technical terms related to computer science, programming languages, or data structures5864: terms related to financial transactions and deposits
130.06621374: common symbols or mathematical notations related to set theory and graph theory8517: numerical patterns or symbols related to lists or sequences
14-0.065311614: code snippets and programming constructs8517: numerical patterns or symbols related to lists or sequences
150.064947352: structured data formats and attributes within documents6062: phrases related to authority and compliance
16-0.064766594: numerical values related to quantities or measurements8517: numerical patterns or symbols related to lists or sequences
170.064212570: references to faces or facial features14662: various sounds and noises described in the text
18-0.064212888: punctuation marks, particularly periods8517: numerical patterns or symbols related to lists or sequences
19-0.06412523: terms related to regulations and conditions for financial and research contexts8517: numerical patterns or symbols related to lists or sequences
200.063714859: specific names and references related to locations and events5864: terms related to financial transactions and deposits
21-0.063667426: fragments of code or programming-related syntax, specifically within a structured or formatted context8517: numerical patterns or symbols related to lists or sequences
22-0.06334118: specific programming constructs or syntactic elements related to function definitions and method calls2737: phrases related to legal expenses and costs
230.062813408: lists of numbered items and their classifications or evaluations8517: numerical patterns or symbols related to lists or sequences
24-0.062210543: coordinating conjunctions used to connect clauses or phrases10543: coordinating conjunctions used to connect clauses or phrases
250.06212069: specific proper nouns, particularly names and titles5864: terms related to financial transactions and deposits
260.0623302: phrases involving details of legal cases and actions taken within them8517: numerical patterns or symbols related to lists or sequences
270.060810553: references to military threats and potential risks involving individuals3160: programming syntax and coding structures
280.06067352: structured data formats and attributes within documents14662: various sounds and noises described in the text
290.060611087: specific structural elements or commands in programming and mathematical contexts12085: instructions and guides for how to perform tasks or solve problems
300.060528517: numerical patterns or symbols related to lists or sequences8517: numerical patterns or symbols related to lists or sequences

Seems like several themes are coming up here: programming and numerical data, particularly lists; names and individuals; scientific research; legal proceedings. Overall it seems like this head is being used for a few different things. Makes sense, given what we know about superposition. Remember that head 0 might be mostly a previous-token head.

We also see a lot of what I call like-to-like QK-connections, in which the same feature appears as both the query and the key. This also makes sense intuitively. I'll show the first 20 connections which are not like-to-like, and which have positive QK values:

Columns: rank, QK circuit value, key feature (ID: label), query feature (ID: label)
00.09763476: references to surface-related concepts and properties8517: numerical patterns or symbols related to lists or sequences
10.06621374: common symbols or mathematical notations related to set theory and graph theory8517: numerical patterns or symbols related to lists or sequences
20.064947352: structured data formats and attributes within documents6062: phrases related to authority and compliance
30.064212570: references to faces or facial features14662: various sounds and noises described in the text
40.063714859: specific names and references related to locations and events5864: terms related to financial transactions and deposits
50.062813408: lists of numbered items and their classifications or evaluations8517: numerical patterns or symbols related to lists or sequences
60.06212069: specific proper nouns, particularly names and titles5864: terms related to financial transactions and deposits
70.0623302: phrases involving details of legal cases and actions taken within them8517: numerical patterns or symbols related to lists or sequences
80.060810553: references to military threats and potential risks involving individuals3160: programming syntax and coding structures
90.06067352: structured data formats and attributes within documents14662: various sounds and noises described in the text
100.060611087: specific structural elements or commands in programming and mathematical contexts12085: instructions and guides for how to perform tasks or solve problems
110.060336576: HTML and XML markup tags8517: numerical patterns or symbols related to lists or sequences
120.0591759: references to fakeness or deception, particularly in the context of news and representations8517: numerical patterns or symbols related to lists or sequences
130.0589612202: numerical values and their relationships in a data context8517: numerical patterns or symbols related to lists or sequences
140.058045815: numerical or tabular data relevant to various contexts9295: terms related to programming exceptions and errors in software development
150.057615335: references to academic departments, institutions, and legal entities13384: complex structured elements in data
160.0556613844: references to social connections and communal engagement8517: numerical patterns or symbols related to lists or sequences
170.055022989: statistical significance indicators in research findings16241: legal terminology related to court cases and appeals
180.0547512958: terms related to product quality and effectiveness6605: frequencies of occurrences or items in lists and counts
190.0546610186: references to military events and significant historical actions6605: frequencies of occurrences or items in lists and counts
200.05464436: topics related to technology and data management16093: words related to physical interactions and conflicts
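The filtering behind the table above can be sketched as follows. Since queries and keys both come from SAE12, like-to-like pairs sit on the diagonal of the QK matrix. The function name and the tiny example matrix are made up for illustration.

```python
import numpy as np


def top_positive_off_diagonal(qk, k=20):
    """Largest positive entries of a square matrix, skipping the diagonal."""
    masked = qk.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)  # drop like-to-like pairs
    masked[masked <= 0] = -np.inf      # drop non-positive connections
    order = np.argsort(masked, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, masked.shape)
    return [
        (int(r), int(c), float(qk[r, c]))
        for r, c in zip(rows, cols)
        if np.isfinite(masked[r, c])   # fewer than k results if few entries qualify
    ]


# Tiny made-up matrix: rows are query features, columns are key features.
qk_example = np.array([
    [0.90, 0.30, -0.50],
    [0.20, 0.80, 0.10],
    [-0.40, 0.05, 0.70],
])
for row, col, val in top_positive_off_diagonal(qk_example, k=3):
    print(row, col, val)
```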

One thing to note here is that the QK values for this head are much lower in magnitude than the other heads. Perhaps this head takes the role of a general aggregator, picking up on vibes from lots of tokens, rather than passing specific information around. This kinda makes sense based on the sorts of things which crop up in the OV circuit:

OV Circuit

Columns: rank, OV circuit value, value feature (ID: label), output feature (ID: label)
00.63313462: occurrences of the word "little."13158: mathematical equations and physical variables related to scientific concepts
1-0.62913462: occurrences of the word "little."1391: prepositions and their relationships in sentences
20.3972563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos>5670: phrases related to implications and consequences in scientific contexts
3-0.3918563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos>8048: occurrences of selectors and method calls within Objective-C or Swift code
4-0.3894867: mathematical notation and formal expressions in the document11482: numerical representations or indicators related to data or formatting elements
50.3792867: mathematical notation and formal expressions in the document5764: different types of hats and roofs
60.3638810: references to constants in mathematical expressions or equations13698: mathematical or logical expressions and structures in the text
70.356571: patterns of mathematical variables and operations in equations13698: mathematical or logical expressions and structures in the text
8-0.35258810: references to constants in mathematical expressions or equations1467: words and phrases related to social equity and representation issues
9-0.34965764: symbols and punctuation marks that indicate changes in data or versioning11482: numerical representations or indicators related to data or formatting elements
100.34948692: terms related to the implementation process in various contexts11482: numerical representations or indicators related to data or formatting elements
11-0.3477571: patterns of mathematical variables and operations in equations1467: words and phrases related to social equity and representation issues
120.342512192: components and classes related to the Java Swing framework4743: formal titles and legal terminology related to court cases
130.33525764: symbols and punctuation marks that indicate changes in data or versioning5764: different types of hats and roofs
14-0.33378692: terms related to the implementation process in various contexts5764: different types of hats and roofs
15-0.332312192: components and classes related to the Java Swing framework8160: terms related to scientific research and medical conditions
16-0.3179867: mathematical notation and formal expressions in the document8048: occurrences of selectors and method calls within Objective-C or Swift code
170.3176867: mathematical notation and formal expressions in the document5670: phrases related to implications and consequences in scientific contexts
18-0.30441302: shipping-related terms and phrases for large items490: phrases related to self-awareness and personal identity
190.3031302: shipping-related terms and phrases for large items9249: mathematical expressions or notations
200.2952365: mathematical expressions and types related to programming or data structures11482: numerical representations or indicators related to data or formatting elements
210.29176210: references to specific experimental methods and materials used in scientific research13698: mathematical or logical expressions and structures in the text
220.291314005: references to scientific research and studies1600: past participles and auxiliary verbs expressing completed actions
230.29136508: terms and concepts related to statistical methods and assumptions in graphical models879: conditional phrases and references to evidence or support
240.293156: references to alternative options or entities11482: numerical representations or indicators related to data or formatting elements
250.289316200: instances of phrases introducing or referencing specific scenarios15170: references to Muslims and related cultural or religious terms and events
26-0.28789752: the presence of numerical values and their associated contexts or relationships4743: formal titles and legal terminology related to court cases
27-0.28696508: terms and concepts related to statistical methods and assumptions in graphical models912: references to structural models and their connections in a technical context
28-0.28446210: references to specific experimental methods and materials used in scientific research1467: words and phrases related to social equity and representation issues
29-0.284416200: instances of phrases introducing or referencing specific scenarios15386: references to user-related information and actions within a programming or software context
300.28341711: instances of various elements or entities in a list or catalog format490: phrases related to self-awareness and personal identity

These are much richer and more interesting.

I really like 22 here, because it seems a bit weird at first, but what it's actually saying is that scientific work is almost always written in the perfect tense!

I find the negative values more interesting than the positive ones in a lot of cases. A lot of them seem to be "disambiguation". 8 seems to be telling the network "no, this is maths, we're looking at the mathematical definition of equality, not the social one!", as does 29. 16 seems to disambiguate Swift or Objective-C code from mathematical notations!

I don't know what 0 or 1 are doing! Why would "little" mean there are no prepositions, but that we're in a scientific or mathematical context?

Head 7

I'll go through Head 7 here as well:

Columns: rank, QK circuit value, key feature (ID: label), query feature (ID: label)
0-0.333714956: legal terms and references to court proceedings15560: multiple segments of structured data, likely in a programming context
10.32813805: conditional phrases or questions13805: conditional phrases or questions
2-0.32710788: technical terms and parameters related to performance metrics13805: conditional phrases or questions
3-0.32413805: conditional phrases or questions10788: technical terms and parameters related to performance metrics
40.323514956: legal terms and references to court proceedings14956: legal terms and references to court proceedings
50.322815560: multiple segments of structured data, likely in a programming context15560: multiple segments of structured data, likely in a programming context
60.321510788: technical terms and parameters related to performance metrics10788: technical terms and parameters related to performance metrics
7-0.31615560: multiple segments of structured data, likely in a programming context14956: legal terms and references to court proceedings
80.28566163: LaTeX formatting commands and structure in a document6163: LaTeX formatting commands and structure in a document
9-0.2856412: conjunctions and their recurring use in sentences6163: LaTeX formatting commands and structure in a document
10-0.27326163: LaTeX formatting commands and structure in a document6412: conjunctions and their recurring use in sentences
110.2716412: conjunctions and their recurring use in sentences6412: conjunctions and their recurring use in sentences
12-0.245713492: numerical data and date representations499: features and attributes related to product descriptions and specifications
130.24568931: restaurant reviews that mention food quality and dining experiences8931: restaurant reviews that mention food quality and dining experiences
14-0.24447314: punctuation marks such as quotation marks and apostrophes8931: restaurant reviews that mention food quality and dining experiences
15-0.24328931: restaurant reviews that mention food quality and dining experiences7314: punctuation marks such as quotation marks and apostrophes
160.24113492: numerical data and date representations13492: numerical data and date representations
170.2406499: features and attributes related to product descriptions and specifications499: features and attributes related to product descriptions and specifications
180.24047314: punctuation marks such as quotation marks and apostrophes7314: punctuation marks such as quotation marks and apostrophes
19-0.244374: elements of humor, particularly dark and inappropriate humor5484: references to notable achievements or events related to advancements and recognitions
200.23835484: references to notable achievements or events related to advancements and recognitions5484: references to notable achievements or events related to advancements and recognitions
21-0.2378499: features and attributes related to product descriptions and specifications13492: numerical data and date representations
220.23294374: elements of humor, particularly dark and inappropriate humor4374: elements of humor, particularly dark and inappropriate humor
23-0.23285484: references to notable achievements or events related to advancements and recognitions4374: elements of humor, particularly dark and inappropriate humor
240.231116266: numerical data and mathematical expressions16266: numerical data and mathematical expressions
25-0.228316266: numerical data and mathematical expressions7942: references to event registration and participation details
260.22579009: programming-related terms and code structure elements9009: programming-related terms and code structure elements
27-0.22464814: concepts related to health and well-being, especially in medical contexts9009: programming-related terms and code structure elements
28-0.22419009: programming-related terms and code structure elements4814: concepts related to health and well-being, especially in medical contexts
290.2224814: concepts related to health and well-being, especially in medical contexts4814: concepts related to health and well-being, especially in medical contexts
30-0.218115560: multiple segments of structured data, likely in a programming context7400: key political figures and their roles

There are still a lot of like-to-like pairs. We also see some mutual-exclusion tetrads, like 8-11, or 26-29. Let's show the non like-to-like pairs with positive QK values:

Columns: rank, QK circuit value, key feature (ID: label), query feature (ID: label)
00.209215560: multiple segments of structured data, likely in a programming context15577: references to tutorials and guides
10.203614956: legal terms and references to court proceedings7400: key political figures and their roles
20.18147314: punctuation marks such as quotation marks and apostrophes693: references to companies and specific products associated with genetics and finance
30.1774499: features and attributes related to product descriptions and specifications6599: sections of text that contain scientific or technical jargon related to genetics or molecular biology
40.174113492: numerical data and date representations6293: elements related to mold removal and cleaning
50.178931: restaurant reviews that mention food quality and dining experiences16340: elements related to user preferences or session management
60.16247344: relationships between keywords and their attributes12988: technical terms related to structural engineering and materials
70.162414956: legal terms and references to court proceedings7950: phrases related to technical specifications or characteristics
80.159315560: multiple segments of structured data, likely in a programming context4092: references to educational backgrounds and achievements
90.1553391: terms related to scientific research and methodology15733: closing braces and related control flow syntax in code
100.15254470: technical terms and phrases related to processes in refrigeration and fluid dynamics12988: technical terms related to structural engineering and materials
110.14814956: legal terms and references to court proceedings15015: JavaScript event handling and functions related to user interactions in web development.
120.1415620: references to money and financial transactions, particularly those related to illegal activities251: financial terms related to risk and stability
130.1395620: references to money and financial transactions, particularly those related to illegal activities4043: references to data, experiments, and processes in scientific contexts
140.13913805: conditional phrases or questions9289: numeric values and structured formats, particularly those that appear in data representations and web links
150.13716386: terms related to audio processing and effects3551: structure declarations and classes within programming code
160.135412093: the word "after" in various contexts7846: references to medical treatments and patient outcomes
170.13453632: references to legal circuits and courts9399: details about music albums and their characteristics
180.134312093: the word "after" in various contexts2999: terms related to statistical analysis and data representation
190.13412711: numbers and mathematical expressions12231: references to the Android context class and its usage in code
200.13311094: references to scientific studies and their results10631: clauses and phrases that describe relationships or characteristics

Now let's take a look at the OV circuit:

Columns: rank, OV circuit value, value feature (ID: label), output feature (ID: label)
00.9732012: instances of the word "already" in various contexts8487: scientific measurements and their implications
1-0.97073732: punctuation marks8487: scientific measurements and their implications
20.95759295: terms related to programming exceptions and errors in software development27: technical terms and concepts related to object-role modeling and database queries
3-0.95411541: coding elements related to data parsing and storage operations27: technical terms and concepts related to object-role modeling and database queries
40.9411541: coding elements related to data parsing and storage operations11639: mathematical operations and programming constructs related to vector calculations
5-0.93959295: terms related to programming exceptions and errors in software development11639: mathematical operations and programming constructs related to vector calculations
60.9133732: punctuation marks738: entities related to organizations and institutional frameworks
7-0.86772012: instances of the word "already" in various contexts738: entities related to organizations and institutional frameworks
80.7156052: object properties and their associated methods in programming contexts15278: keywords related to job postings in the healthcare field
9-0.70266052: object properties and their associated methods in programming contexts6954: references to boys and masculinity
10-0.70177507: numerical data and references related to statistics and measurements15278: keywords related to job postings in the healthcare field
110.6947507: numerical data and references related to statistics and measurements6954: references to boys and masculinity
120.69343391: terms related to scientific research and methodology13811: specific coding constructs and structure, particularly related to object-oriented programming elements like classes and unique identifiers
13-0.67539932: functions and events related to programming, particularly those involving event handling and listener methods14631: terms related to procedural steps and algorithms
140.6731745: questions and mathematical operations involving problem-solving11576: terms related to legal documentation and identification processes
150.6689932: functions and events related to programming, particularly those involving event handling and listener methods6733: details related to room features and rental conditions
16-0.65971942: rankings and positions of institutions or programs11576: terms related to legal documentation and identification processes
17-0.65871745: questions and mathematical operations involving problem-solving965: references to gender equality and disparities
180.656214781: elements related to user interaction and token verification in a digital workspace14631: terms related to procedural steps and algorithms
19-0.64943391: terms related to scientific research and methodology13176: code structures, particularly comments and namespace declarations in programming languages
200.64841942: rankings and positions of institutions or programs965: references to gender equality and disparities
21-0.646514781: elements related to user interaction and token verification in a digital workspace6733: details related to room features and rental conditions
22-0.641613430: mathematical constructs and expressions331: mathematical equations and expressions
230.63913430: mathematical constructs and expressions15082: references related to mathematical or scientific notation and parameters
240.63872565: references to rural locations and related entities331: mathematical equations and expressions
25-0.6352565: references to rural locations and related entities15082: references related to mathematical or scientific notation and parameters
26-0.62458437: data references and statistics related to biological experiments13260: elements associated with input fields and forms
270.6239681: historical references related to laws and legal cases13260: elements associated with input fields and forms
280.6198437: data references and statistics related to biological experiments12800: specific references to successful authors and their works
29-0.6169681: historical references related to laws and legal cases12800: specific references to successful authors and their works
300.61510640: keywords and references related to academic or scientific sources27: technical terms and concepts related to object-role modeling and database queries

Some of these seem totally nonsensical! 8-11 in particular, I mean what? These seem to be hopelessly lost in superposition. Perhaps head 7 is in a greater degree of superposition because it attends to more specific tokens than head 0.

Conclusions

With difficulty, it may be possible to reconstruct attention-based circuits in this way. It's unclear how much of the difficulty stems from technical limitations in the SAEs and labels, and how much is fundamental to this method. I would like to try again someday.
