This is research I did in a short span of time; it is likely not optimal, and it's unclear whether the constraints were my skills, my tools, or my methods. Code and full results can be found here, but you'll need to download the model yourself to replicate it.
TL;DR
Using Gemma Scope, I am able to find connections between SAE features in the attention portion of Gemma 2 2B's layer 13. This can be done without running the model. These are often (somewhat) meaningful, with the OV circuits generally being easier to interpret than the QK circuits.
The Background
Anthropic has done some nice research decomposing attention layers into the QK and OV circuits. To summarize, the QK circuit tells an attention head where to look, and the OV circuit tells it what information to pass forward. In a single-layer transformer, this shows up as skip-trigram patterns over tokens of the form A… B → C.
Recently, Google DeepMind released Gemma Scope, consisting of a large number of SAEs trained on the Gemma 2 family of models. Neuronpedia has already generated descriptions for all of the "canonical" SAEs using GPT-4o.
This is an attempt to look at QK and OV circuits in Gemma 2 from the lens of features, rather than tokens.
The Setup
I chose to look at layer 13 (indexed from zero), roughly halfway through the model. For each decoder vector in the 16k-feature canonical residual stream SAE of layer 12 (henceforth SAE12), I took the element-wise product with the RMS norm coefficients, then multiplied by WQ, WK, and WV. I'll refer to the results as the queries, keys, and values.
I then took the encoder vectors of the 16k-feature canonical residual stream SAE of layer 13 (henceforth SAE13), took the element-wise product with the RMS norm coefficients applied to the attention layer's output, and multiplied by the transpose of WO. This yields a vector for each feature that I'll call the pre-output; it amounts to working backward from SAE13 towards the attention layer.
This does ignore the MLP layer (I wish there were SAEs trained between the two components) and the rotary positional encoding; I may return to these later. It also ignores the attention-out prelinear SAE, because Neuronpedia doesn't yet have labels available to download for it.
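Putting the setup together, here is a minimal sketch of the forward and backward computations. The tensor names, shapes, and the per-head weight layout below are my illustrative assumptions (the real weights come from the Gemma 2 and Gemma Scope checkpoints):

```python
import torch

# Illustrative shapes (assumptions): d_model = 2304 for Gemma 2 2B, 16384 features
# in the canonical 16k SAEs, 8 attention heads of dimension 256 at layer 13.
d_model, n_feats, n_heads, d_head = 2304, 16384, 8, 256

# Stand-ins for the real weights (load these from Gemma 2 / Gemma Scope in practice):
sae12_W_dec = torch.randn(n_feats, d_model)  # SAE12 decoder vectors, one per feature
sae13_W_enc = torch.randn(n_feats, d_model)  # SAE13 encoder vectors, one per feature
gamma_in    = torch.randn(d_model)           # RMS-norm gain on the attention input
gamma_out   = torch.randn(d_model)           # RMS-norm gain on the attention output
W_Q = torch.randn(n_heads, d_model, d_head)  # per-head query projection
W_K = torch.randn(n_heads, d_model, d_head)  # per-head key projection
W_V = torch.randn(n_heads, d_model, d_head)  # per-head value projection
W_O = torch.randn(n_heads, d_head, d_model)  # per-head output projection

# Forward direction: scale each SAE12 decoder vector by the norm gain,
# then project into each head's query/key/value space.
scaled_dec = sae12_W_dec * gamma_in                         # [n_feats, d_model]
queries = torch.einsum("fd,hdk->hfk", scaled_dec, W_Q)      # [n_heads, n_feats, d_head]
keys    = torch.einsum("fd,hdk->hfk", scaled_dec, W_K)
values  = torch.einsum("fd,hdk->hfk", scaled_dec, W_V)

# Backward direction: scale each SAE13 encoder vector by the norm gain,
# then pull it back through W_O^T to get the "pre-output" for each head.
scaled_enc  = sae13_W_enc * gamma_out                       # [n_feats, d_model]
pre_outputs = torch.einsum("fd,hkd->hfk", scaled_enc, W_O)  # [n_heads, n_feats, d_head]
```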
For each head, I calculated two [16k, 16k] matrices: one containing the dot products of the queries with the keys, and one containing the dot products of the values with the pre-outputs. For each head, I then found the highest-weighted connections between features in the QK and OV circuits.
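In code this looks something like the sketch below. It is illustrative only: the stand-in tensors take the place of the queries, keys, values, and pre-outputs from the previous sketch, and the (query, key) row/column orientation is my assumption.

```python
import torch

n_heads, n_feats, d_head = 8, 16384, 256
# queries, keys, values, pre_outputs as computed in the setup sketch above:
queries     = torch.randn(n_heads, n_feats, d_head)
keys        = torch.randn(n_heads, n_feats, d_head)
values      = torch.randn(n_heads, n_feats, d_head)
pre_outputs = torch.randn(n_heads, n_feats, d_head)

head = 0
# Entry (i, j): dot product of query-feature i with key-feature j (QK circuit),
# and of pre-output-feature i with value-feature j (OV circuit).
qk = queries[head] @ keys[head].T          # [n_feats, n_feats]
ov = pre_outputs[head] @ values[head].T    # [n_feats, n_feats]

# Highest-magnitude QK connections for this head (signed values kept for display).
top = torch.topk(qk.abs().flatten(), 30).indices
for idx in top:
    q_feat, k_feat = divmod(idx.item(), n_feats)
    print(q_feat, k_feat, qk[q_feat, k_feat].item())
```

Each such matrix is 16k × 16k (roughly 1 GB in float32 per head), so in practice it may be necessary to chunk over features rather than materialise the whole thing at once.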
Decoder Bias vs Rotary Embedding
I take the decoder bias of SAE12 to represent the "average" value of the residual stream going into layer 13. By looking at the query-key dot product of this vector with itself at different positional distances, we can get a rough sense of how far back each head looks across tokens:
It seems like head 0 is mostly a previous-token head, whereas the others fall off more slowly with distance.
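For reference, here is a sketch of how such a distance profile can be computed. The `rope` function below is a textbook rotary implementation and the tensors are stand-ins; the exact RoPE convention and attention scaling in Gemma 2 may differ.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to a single head-sized vector at position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * torch.cos(angles) - x2 * torch.sin(angles),
                      x1 * torch.sin(angles) + x2 * torch.cos(angles)], dim=-1)

# Stand-ins: b_dec is SAE12's decoder bias; gamma, W_Q_h, W_K_h as in the setup sketch.
d_model, d_head = 2304, 256
b_dec = torch.randn(d_model)
gamma = torch.randn(d_model)
W_Q_h = torch.randn(d_model, d_head)   # one head's query projection
W_K_h = torch.randn(d_model, d_head)   # one head's key projection

# The key sits at position 0; the query moves further away.
k = rope((b_dec * gamma) @ W_K_h, pos=0)
for dist in range(0, 64):
    q = rope((b_dec * gamma) @ W_Q_h, pos=dist)
    print(dist, (q @ k).item())
```

Because RoPE only rotates by relative angle, the dot product depends only on the distance between the two positions, which is what makes this per-head distance profile well defined.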
Interesting Findings
Head 0 has lower QK-weightings than the other heads. It performs a few interesting functions, such as noticing when mathematical constants are present and down-weighting a feature related to social equity (presumably to point the model towards the mathematical sense of "equal to" in this context).
It also detects a feature relating to scientific context and up-weights a feature relating to the past tense, specifically completed actions, which is indeed the most common tense in scientific writing!
Lots of OV circuits are pretty interpretable, or at least seem that way: features up- or down-weight later features appropriately.
Unfortunately, many of the feature labels are not very good. For example, labels relating to "product descriptions" keep cropping up on unrelated text in the Neuronpedia playground. I assume a more expensive model would do a better job. Also, the features in a 16k SAE for a ~2k-dimensional residual stream are not very monosemantic. It would be interesting to try the 256k SAEs once those are labelled.
It seems like many of the heads are deep in attention-head superposition. If the attention SAEs get labelled, it would be cool to check that out.
Some features come up repeatedly across heads. I think this is because they're up-weighted by the coefficients of the RMS norm.
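A quick way to sanity-check that hypothesis would be to measure how much the norm gain amplifies each feature direction (a sketch, reusing the illustrative stand-ins from the setup sketch):

```python
import torch

# Stand-ins: sae12_W_dec are the SAE12 decoder vectors, rms_gamma the RMS-norm gain.
sae12_W_dec = torch.randn(16384, 2304)
rms_gamma   = torch.randn(2304)

# Ratio of each feature direction's norm after vs. before scaling by the gain.
amplification = (sae12_W_dec * rms_gamma).norm(dim=-1) / sae12_W_dec.norm(dim=-1)

# Features whose directions the norm gain boosts the most; if the same features
# keep topping the per-head connection lists, this is one plausible reason.
print(torch.topk(amplification, 10).indices)
```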
Diving Into Head 0
QK Circuits
| # | QK Circuit Value | Key Feature | Query Feature |
|---|---|---|---|
| 0 | -0.1015 | 3744: references to the Hall of Fame and related ceremonies or inductions | 8517: numerical patterns or symbols related to lists or sequences |
| 1 | 0.0976 | 3476: references to surface-related concepts and properties | 8517: numerical patterns or symbols related to lists or sequences |
| 2 | -0.09106 | 819: structured data presentation and numerical information | 819: structured data presentation and numerical information |
| 3 | -0.0845 | 6605: frequencies of occurrences or items in lists and counts | 6605: frequencies of occurrences or items in lists and counts |
| 4 | -0.0821 | 15107: structured data or variables related to types and their attributes | 8517: numerical patterns or symbols related to lists or sequences |
| 5 | 0.0812 | 10553: references to military threats and potential risks involving individuals | 10553: references to military threats and potential risks involving individuals |
| 6 | -0.0793 | 10211: scientific or medical terminology related to diseases and treatment options | 3160: programming syntax and coding structures |
| 7 | -0.07886 | 2842: numbers and percentage values related to statistical analysis | 8517: numerical patterns or symbols related to lists or sequences |
| 8 | -0.07434 | 10553: references to military threats and potential risks involving individuals | 10211: scientific or medical terminology related to diseases and treatment options |
| 9 | -0.0731 | 16134: references to interviews and conversations | 8517: numerical patterns or symbols related to lists or sequences |
| 10 | -0.0681 | 1681: references to scams and fraudulent activities | 5864: terms related to financial transactions and deposits |
| 11 | -0.06793 | 3598: terms associated with technical specifications and measurements | 16241: legal terminology related to court cases and appeals |
| 12 | -0.06726 | 838: technical terms related to computer science, programming languages, or data structures | 5864: terms related to financial transactions and deposits |
| 13 | 0.0662 | 1374: common symbols or mathematical notations related to set theory and graph theory | 8517: numerical patterns or symbols related to lists or sequences |
| 14 | -0.0653 | 11614: code snippets and programming constructs | 8517: numerical patterns or symbols related to lists or sequences |
| 15 | 0.06494 | 7352: structured data formats and attributes within documents | 6062: phrases related to authority and compliance |
| 16 | -0.06476 | 6594: numerical values related to quantities or measurements | 8517: numerical patterns or symbols related to lists or sequences |
| 17 | 0.0642 | 12570: references to faces or facial features | 14662: various sounds and noises described in the text |
| 18 | -0.0642 | 12888: punctuation marks, particularly periods | 8517: numerical patterns or symbols related to lists or sequences |
| 19 | -0.064 | 12523: terms related to regulations and conditions for financial and research contexts | 8517: numerical patterns or symbols related to lists or sequences |
| 20 | 0.0637 | 14859: specific names and references related to locations and events | 5864: terms related to financial transactions and deposits |
| 21 | -0.06366 | 7426: fragments of code or programming-related syntax, specifically within a structured or formatted context | 8517: numerical patterns or symbols related to lists or sequences |
| 22 | -0.0633 | 4118: specific programming constructs or syntactic elements related to function definitions and method calls | 2737: phrases related to legal expenses and costs |
| 23 | 0.0628 | 13408: lists of numbered items and their classifications or evaluations | 8517: numerical patterns or symbols related to lists or sequences |
| 24 | -0.0622 | 10543: coordinating conjunctions used to connect clauses or phrases | 10543: coordinating conjunctions used to connect clauses or phrases |
| 25 | 0.0621 | 2069: specific proper nouns, particularly names and titles | 5864: terms related to financial transactions and deposits |
| 26 | 0.062 | 3302: phrases involving details of legal cases and actions taken within them | 8517: numerical patterns or symbols related to lists or sequences |
| 27 | 0.0608 | 10553: references to military threats and potential risks involving individuals | 3160: programming syntax and coding structures |
| 28 | 0.0606 | 7352: structured data formats and attributes within documents | 14662: various sounds and noises described in the text |
| 29 | 0.0606 | 11087: specific structural elements or commands in programming and mathematical contexts | 12085: instructions and guides for how to perform tasks or solve problems |
| 30 | 0.06052 | 8517: numerical patterns or symbols related to lists or sequences | 8517: numerical patterns or symbols related to lists or sequences |
Several themes come up here: programming and numerical data (particularly lists), names and individuals, scientific research, and legal proceedings. Overall this head seems to be used for a few different things, which makes sense given what we know about superposition. Remember that head 0 may be mostly a previous-token head.
We also see a lot of what I call like-to-like QK connections, in which the same feature appears as both the query and the key. This also makes sense intuitively. Below are the top connections which are not like-to-like and which have positive QK values:
| # | QK Circuit Value | Key Feature | Query Feature |
|---|---|---|---|
| 0 | 0.0976 | 3476: references to surface-related concepts and properties | 8517: numerical patterns or symbols related to lists or sequences |
| 1 | 0.0662 | 1374: common symbols or mathematical notations related to set theory and graph theory | 8517: numerical patterns or symbols related to lists or sequences |
| 2 | 0.06494 | 7352: structured data formats and attributes within documents | 6062: phrases related to authority and compliance |
| 3 | 0.0642 | 12570: references to faces or facial features | 14662: various sounds and noises described in the text |
| 4 | 0.0637 | 14859: specific names and references related to locations and events | 5864: terms related to financial transactions and deposits |
| 5 | 0.0628 | 13408: lists of numbered items and their classifications or evaluations | 8517: numerical patterns or symbols related to lists or sequences |
| 6 | 0.0621 | 2069: specific proper nouns, particularly names and titles | 5864: terms related to financial transactions and deposits |
| 7 | 0.062 | 3302: phrases involving details of legal cases and actions taken within them | 8517: numerical patterns or symbols related to lists or sequences |
| 8 | 0.0608 | 10553: references to military threats and potential risks involving individuals | 3160: programming syntax and coding structures |
| 9 | 0.0606 | 7352: structured data formats and attributes within documents | 14662: various sounds and noises described in the text |
| 10 | 0.0606 | 11087: specific structural elements or commands in programming and mathematical contexts | 12085: instructions and guides for how to perform tasks or solve problems |
| 11 | 0.06033 | 6576: HTML and XML markup tags | 8517: numerical patterns or symbols related to lists or sequences |
| 12 | 0.0591 | 759: references to fakeness or deception, particularly in the context of news and representations | 8517: numerical patterns or symbols related to lists or sequences |
| 13 | 0.05896 | 12202: numerical values and their relationships in a data context | 8517: numerical patterns or symbols related to lists or sequences |
| 14 | 0.05804 | 5815: numerical or tabular data relevant to various contexts | 9295: terms related to programming exceptions and errors in software development |
| 15 | 0.0576 | 15335: references to academic departments, institutions, and legal entities | 13384: complex structured elements in data |
| 16 | 0.05566 | 13844: references to social connections and communal engagement | 8517: numerical patterns or symbols related to lists or sequences |
| 17 | 0.05502 | 2989: statistical significance indicators in research findings | 16241: legal terminology related to court cases and appeals |
| 18 | 0.05475 | 12958: terms related to product quality and effectiveness | 6605: frequencies of occurrences or items in lists and counts |
| 19 | 0.05466 | 10186: references to military events and significant historical actions | 6605: frequencies of occurrences or items in lists and counts |
| 20 | 0.0546 | 4436: topics related to technology and data management | 16093: words related to physical interactions and conflicts |
One thing to note here is that the QK values for this head are much lower in magnitude than those of the other heads. Perhaps this head acts as a general aggregator, picking up on vibes from lots of tokens rather than passing specific information around. That would fit with the sorts of things which crop up in the OV circuit:
OV Circuit
| # | OV Circuit Value | Value Feature | Output Feature |
|---|---|---|---|
| 0 | 0.633 | 13462: occurrences of the word "little." | 13158: mathematical equations and physical variables related to scientific concepts |
| 1 | -0.629 | 13462: occurrences of the word "little." | 1391: prepositions and their relationships in sentences |
| 2 | 0.3972 | 563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos> | 5670: phrases related to implications and consequences in scientific contexts |
| 3 | -0.3918 | 563: the context or formatting of sections in a document, particularly those marked with specific tags such as <bos> | 8048: occurrences of selectors and method calls within Objective-C or Swift code |
| 4 | -0.3894 | 867: mathematical notation and formal expressions in the document | 11482: numerical representations or indicators related to data or formatting elements |
| 5 | 0.3792 | 867: mathematical notation and formal expressions in the document | 5764: different types of hats and roofs |
| 6 | 0.363 | 8810: references to constants in mathematical expressions or equations | 13698: mathematical or logical expressions and structures in the text |
| 7 | 0.356 | 571: patterns of mathematical variables and operations in equations | 13698: mathematical or logical expressions and structures in the text |
| 8 | -0.3525 | 8810: references to constants in mathematical expressions or equations | 1467: words and phrases related to social equity and representation issues |
| 9 | -0.3496 | 5764: symbols and punctuation marks that indicate changes in data or versioning | 11482: numerical representations or indicators related to data or formatting elements |
| 10 | 0.3494 | 8692: terms related to the implementation process in various contexts | 11482: numerical representations or indicators related to data or formatting elements |
| 11 | -0.3477 | 571: patterns of mathematical variables and operations in equations | 1467: words and phrases related to social equity and representation issues |
| 12 | 0.3425 | 12192: components and classes related to the Java Swing framework | 4743: formal titles and legal terminology related to court cases |
| 13 | 0.3352 | 5764: symbols and punctuation marks that indicate changes in data or versioning | 5764: different types of hats and roofs |
| 14 | -0.3337 | 8692: terms related to the implementation process in various contexts | 5764: different types of hats and roofs |
| 15 | -0.3323 | 12192: components and classes related to the Java Swing framework | 8160: terms related to scientific research and medical conditions |
| 16 | -0.3179 | 867: mathematical notation and formal expressions in the document | 8048: occurrences of selectors and method calls within Objective-C or Swift code |
| 17 | 0.3176 | 867: mathematical notation and formal expressions in the document | 5670: phrases related to implications and consequences in scientific contexts |
| 18 | -0.3044 | 1302: shipping-related terms and phrases for large items | 490: phrases related to self-awareness and personal identity |
| 19 | 0.303 | 1302: shipping-related terms and phrases for large items | 9249: mathematical expressions or notations |
| 20 | 0.295 | 2365: mathematical expressions and types related to programming or data structures | 11482: numerical representations or indicators related to data or formatting elements |
| 21 | 0.2917 | 6210: references to specific experimental methods and materials used in scientific research | 13698: mathematical or logical expressions and structures in the text |
| 22 | 0.2913 | 14005: references to scientific research and studies | 1600: past participles and auxiliary verbs expressing completed actions |
| 23 | 0.2913 | 6508: terms and concepts related to statistical methods and assumptions in graphical models | 879: conditional phrases and references to evidence or support |
| 24 | 0.29 | 3156: references to alternative options or entities | 11482: numerical representations or indicators related to data or formatting elements |
| 25 | 0.2893 | 16200: instances of phrases introducing or referencing specific scenarios | 15170: references to Muslims and related cultural or religious terms and events |
| 26 | -0.2878 | 9752: the presence of numerical values and their associated contexts or relationships | 4743: formal titles and legal terminology related to court cases |
| 27 | -0.2869 | 6508: terms and concepts related to statistical methods and assumptions in graphical models | 912: references to structural models and their connections in a technical context |
| 28 | -0.2844 | 6210: references to specific experimental methods and materials used in scientific research | 1467: words and phrases related to social equity and representation issues |
| 29 | -0.2844 | 16200: instances of phrases introducing or referencing specific scenarios | 15386: references to user-related information and actions within a programming or software context |
| 30 | 0.2834 | 1711: instances of various elements or entities in a list or catalog format | 490: phrases related to self-awareness and personal identity |
These are much richer and more interesting.
I really like 22 here, because it seems a bit weird at first, but what it's actually saying is that scientific work is almost always written in the perfect tense!
I find the negative values more interesting than the positive ones in a lot of cases. A lot of them seem to be "disambiguation". 8 seems to be telling the network "no, this is maths, we're looking at the mathematical definition of equality, not the social one!", as does 29. 16 seems to disambiguate Swift or Objective-C code from mathematical notations!
I don't know what 0 or 1 are doing! Why would "little" imply there are no prepositions, but that we're in a scientific or mathematical context?
Head 7
I'll go through Head 7 here as well:
| # | QK Circuit Value | Key Feature | Query Feature |
|---|---|---|---|
| 0 | -0.3337 | 14956: legal terms and references to court proceedings | 15560: multiple segments of structured data, likely in a programming context |
| 1 | 0.328 | 13805: conditional phrases or questions | 13805: conditional phrases or questions |
| 2 | -0.327 | 10788: technical terms and parameters related to performance metrics | 13805: conditional phrases or questions |
| 3 | -0.324 | 13805: conditional phrases or questions | 10788: technical terms and parameters related to performance metrics |
| 4 | 0.3235 | 14956: legal terms and references to court proceedings | 14956: legal terms and references to court proceedings |
| 5 | 0.3228 | 15560: multiple segments of structured data, likely in a programming context | 15560: multiple segments of structured data, likely in a programming context |
| 6 | 0.3215 | 10788: technical terms and parameters related to performance metrics | 10788: technical terms and parameters related to performance metrics |
| 7 | -0.316 | 15560: multiple segments of structured data, likely in a programming context | 14956: legal terms and references to court proceedings |
| 8 | 0.2856 | 6163: LaTeX formatting commands and structure in a document | 6163: LaTeX formatting commands and structure in a document |
| 9 | -0.285 | 6412: conjunctions and their recurring use in sentences | 6163: LaTeX formatting commands and structure in a document |
| 10 | -0.2732 | 6163: LaTeX formatting commands and structure in a document | 6412: conjunctions and their recurring use in sentences |
| 11 | 0.271 | 6412: conjunctions and their recurring use in sentences | 6412: conjunctions and their recurring use in sentences |
| 12 | -0.2457 | 13492: numerical data and date representations | 499: features and attributes related to product descriptions and specifications |
| 13 | 0.2456 | 8931: restaurant reviews that mention food quality and dining experiences | 8931: restaurant reviews that mention food quality and dining experiences |
| 14 | -0.2444 | 7314: punctuation marks such as quotation marks and apostrophes | 8931: restaurant reviews that mention food quality and dining experiences |
| 15 | -0.2432 | 8931: restaurant reviews that mention food quality and dining experiences | 7314: punctuation marks such as quotation marks and apostrophes |
| 16 | 0.241 | 13492: numerical data and date representations | 13492: numerical data and date representations |
| 17 | 0.2406 | 499: features and attributes related to product descriptions and specifications | 499: features and attributes related to product descriptions and specifications |
| 18 | 0.2404 | 7314: punctuation marks such as quotation marks and apostrophes | 7314: punctuation marks such as quotation marks and apostrophes |
| 19 | -0.24 | 4374: elements of humor, particularly dark and inappropriate humor | 5484: references to notable achievements or events related to advancements and recognitions |
| 20 | 0.2383 | 5484: references to notable achievements or events related to advancements and recognitions | 5484: references to notable achievements or events related to advancements and recognitions |
| 21 | -0.2378 | 499: features and attributes related to product descriptions and specifications | 13492: numerical data and date representations |
| 22 | 0.2329 | 4374: elements of humor, particularly dark and inappropriate humor | 4374: elements of humor, particularly dark and inappropriate humor |
| 23 | -0.2328 | 5484: references to notable achievements or events related to advancements and recognitions | 4374: elements of humor, particularly dark and inappropriate humor |
| 24 | 0.2311 | 16266: numerical data and mathematical expressions | 16266: numerical data and mathematical expressions |
| 25 | -0.2283 | 16266: numerical data and mathematical expressions | 7942: references to event registration and participation details |
| 26 | 0.2257 | 9009: programming-related terms and code structure elements | 9009: programming-related terms and code structure elements |
| 27 | -0.2246 | 4814: concepts related to health and well-being, especially in medical contexts | 9009: programming-related terms and code structure elements |
| 28 | -0.2241 | 9009: programming-related terms and code structure elements | 4814: concepts related to health and well-being, especially in medical contexts |
| 29 | 0.222 | 4814: concepts related to health and well-being, especially in medical contexts | 4814: concepts related to health and well-being, especially in medical contexts |
| 30 | -0.2181 | 15560: multiple segments of structured data, likely in a programming context | 7400: key political figures and their roles |
There are still a lot of like-to-like pairs. We also see some mutual-exclusion tetrads, like 8-11 or 26-29. Let's show the non-like-to-like pairs with positive QK values:
| # | QK Circuit Value | Key Feature | Query Feature |
|---|---|---|---|
| 0 | 0.2092 | 15560: multiple segments of structured data, likely in a programming context | 15577: references to tutorials and guides |
| 1 | 0.2036 | 14956: legal terms and references to court proceedings | 7400: key political figures and their roles |
| 2 | 0.1814 | 7314: punctuation marks such as quotation marks and apostrophes | 693: references to companies and specific products associated with genetics and finance |
| 3 | 0.1774 | 499: features and attributes related to product descriptions and specifications | 6599: sections of text that contain scientific or technical jargon related to genetics or molecular biology |
| 4 | 0.1741 | 13492: numerical data and date representations | 6293: elements related to mold removal and cleaning |
| 5 | 0.17 | 8931: restaurant reviews that mention food quality and dining experiences | 16340: elements related to user preferences or session management |
| 6 | 0.1624 | 7344: relationships between keywords and their attributes | 12988: technical terms related to structural engineering and materials |
| 7 | 0.1624 | 14956: legal terms and references to court proceedings | 7950: phrases related to technical specifications or characteristics |
| 8 | 0.1593 | 15560: multiple segments of structured data, likely in a programming context | 4092: references to educational backgrounds and achievements |
| 9 | 0.155 | 3391: terms related to scientific research and methodology | 15733: closing braces and related control flow syntax in code |
| 10 | 0.1525 | 4470: technical terms and phrases related to processes in refrigeration and fluid dynamics | 12988: technical terms related to structural engineering and materials |
| 11 | 0.148 | 14956: legal terms and references to court proceedings | 15015: JavaScript event handling and functions related to user interactions in web development |
| 12 | 0.1415 | 620: references to money and financial transactions, particularly those related to illegal activities | 251: financial terms related to risk and stability |
| 13 | 0.1395 | 620: references to money and financial transactions, particularly those related to illegal activities | 4043: references to data, experiments, and processes in scientific contexts |
| 14 | 0.139 | 13805: conditional phrases or questions | 9289: numeric values and structured formats, particularly those that appear in data representations and web links |
| 15 | 0.1371 | 6386: terms related to audio processing and effects | 3551: structure declarations and classes within programming code |
| 16 | 0.1354 | 12093: the word "after" in various contexts | 7846: references to medical treatments and patient outcomes |
| 17 | 0.1345 | 3632: references to legal circuits and courts | 9399: details about music albums and their characteristics |
| 18 | 0.1343 | 12093: the word "after" in various contexts | 2999: terms related to statistical analysis and data representation |
| 19 | 0.134 | 12711: numbers and mathematical expressions | 12231: references to the Android context class and its usage in code |
| 20 | 0.133 | 11094: references to scientific studies and their results | 10631: clauses and phrases that describe relationships or characteristics |
Now let's take a look at the OV circuit:
| # | OV Circuit Value | Value Feature | Output Feature |
|---|---|---|---|
| 0 | 0.973 | 2012: instances of the word "already" in various contexts | 8487: scientific measurements and their implications |
| 1 | -0.9707 | 3732: punctuation marks | 8487: scientific measurements and their implications |
| 2 | 0.9575 | 9295: terms related to programming exceptions and errors in software development | 27: technical terms and concepts related to object-role modeling and database queries |
| 3 | -0.954 | 11541: coding elements related to data parsing and storage operations | 27: technical terms and concepts related to object-role modeling and database queries |
| 4 | 0.94 | 11541: coding elements related to data parsing and storage operations | 11639: mathematical operations and programming constructs related to vector calculations |
| 5 | -0.9395 | 9295: terms related to programming exceptions and errors in software development | 11639: mathematical operations and programming constructs related to vector calculations |
| 6 | 0.913 | 3732: punctuation marks | 738: entities related to organizations and institutional frameworks |
| 7 | -0.8677 | 2012: instances of the word "already" in various contexts | 738: entities related to organizations and institutional frameworks |
| 8 | 0.715 | 6052: object properties and their associated methods in programming contexts | 15278: keywords related to job postings in the healthcare field |
| 9 | -0.7026 | 6052: object properties and their associated methods in programming contexts | 6954: references to boys and masculinity |
| 10 | -0.7017 | 7507: numerical data and references related to statistics and measurements | 15278: keywords related to job postings in the healthcare field |
| 11 | 0.694 | 7507: numerical data and references related to statistics and measurements | 6954: references to boys and masculinity |
| 12 | 0.6934 | 3391: terms related to scientific research and methodology | 13811: specific coding constructs and structure, particularly related to object-oriented programming elements like classes and unique identifiers |
| 13 | -0.6753 | 9932: functions and events related to programming, particularly those involving event handling and listener methods | 14631: terms related to procedural steps and algorithms |
| 14 | 0.673 | 1745: questions and mathematical operations involving problem-solving | 11576: terms related to legal documentation and identification processes |
| 15 | 0.668 | 9932: functions and events related to programming, particularly those involving event handling and listener methods | 6733: details related to room features and rental conditions |
| 16 | -0.6597 | 1942: rankings and positions of institutions or programs | 11576: terms related to legal documentation and identification processes |
| 17 | -0.6587 | 1745: questions and mathematical operations involving problem-solving | 965: references to gender equality and disparities |
| 18 | 0.6562 | 14781: elements related to user interaction and token verification in a digital workspace | 14631: terms related to procedural steps and algorithms |
| 19 | -0.6494 | 3391: terms related to scientific research and methodology | 13176: code structures, particularly comments and namespace declarations in programming languages |
| 20 | 0.6484 | 1942: rankings and positions of institutions or programs | 965: references to gender equality and disparities |
| 21 | -0.6465 | 14781: elements related to user interaction and token verification in a digital workspace | 6733: details related to room features and rental conditions |
| 22 | -0.6416 | 13430: mathematical constructs and expressions | 331: mathematical equations and expressions |
| 23 | 0.639 | 13430: mathematical constructs and expressions | 15082: references related to mathematical or scientific notation and parameters |
| 24 | 0.6387 | 2565: references to rural locations and related entities | 331: mathematical equations and expressions |
| 25 | -0.635 | 2565: references to rural locations and related entities | 15082: references related to mathematical or scientific notation and parameters |
| 26 | -0.6245 | 8437: data references and statistics related to biological experiments | 13260: elements associated with input fields and forms |
| 27 | 0.623 | 9681: historical references related to laws and legal cases | 13260: elements associated with input fields and forms |
| 28 | 0.619 | 8437: data references and statistics related to biological experiments | 12800: specific references to successful authors and their works |
| 29 | -0.616 | 9681: historical references related to laws and legal cases | 12800: specific references to successful authors and their works |
| 30 | 0.615 | 10640: keywords and references related to academic or scientific sources | 27: technical terms and concepts related to object-role modeling and database queries |
Some of these seem totally nonsensical! 8-11 in particular: I mean, what? These seem to be hopelessly lost in superposition. Perhaps head 7 is in a greater degree of superposition because it attends to more specific tokens than head 0.
Conclusions
With difficulty, it may be possible to reconstruct attention-based circuits in this way. It's unclear how much of the difficulty stems from technical limitations in the SAEs and their labels, and how much is fundamental to the method. I would like to try again someday.