I have been poking around with LLMs, and I found some results that seem broadly interesting
Summary
Introduction: Large language models (LLMs) are usually structured as repeated transformer layers of the same size. However, this architecture is often described as functionally hierarchical, with earlier layers focusing on small patches of text while later layers parse document-wide information. I revisited these ideas.
Methods: I submitted very short texts to an LLM and used a probing approach to examine the representations within each layer. For example, I submitted texts like “An apple” and extracted the residual stream activity. Then, I used the activations to fit a support vector machine (SVM) predicting whether the object (apple) has a given property (e.g., is edible). The layers whose activations produce the most accurate classifier are taken to be the layers that most strongly represent this item-level semantic property. I applied this general probing approach across several experiments, also studying two-item relations and four-item analogies – aspects of semantic processing less often examined in interpretability research.
Results:
Analyses of Llama-3.2-3b (28 layers) support the common abstraction-hierarchy perspective: In short texts, item-level semantics are most represented in early layers (layers 2-7), two-item relations in deeper layers (layers 8-12), and then four-item analogies deeper still (layers 10-15).
However, some findings deviate from a steady-hierarchy view: Although deep layers can represent document-wide abstractions, they also represent narrow two-item relations encountered earlier in the context window. Deep layers seem to broadly compress earlier local information, even without abstracting it.
When examining a larger model, Llama-3.3-70b-Instruct (80 layers), abstraction appears to fluctuate rather than steadily increase into deeper layers. Representation of two-item relations and four-item analogies initially peaks in layers 12-16, before falling, and then later peaking again in layers 25-33. This peculiar double-peak pattern emerges consistently across different experiments and replicates across Llama-3.3-70b-Instruct and Qwen-2.5-72b (another 80-layer model). Further analyses suggest that this double peak may be linked to how the model processes global information in longer texts.
tl;dr: There is truth to the idea that the transformer architecture operates as a functional hierarchy with respect to layer depth. However, late layers also compress local information from earlier in the context, generally without any meaningful abstraction. Large models, in particular, may deviate from a steady, singular abstraction hierarchy in curious ways that complicate interpretability.
Notes: I previously made a similar post detailing these results about two months ago, which I pulled down later that day. It contained roughly the same content, but I felt the write-up could be better; this revised version is what I am presenting now. I have not yet released the code for this project, but if there is interest, I will work on getting it into a suitable shape for release.
Experiment 1: Item-level semantics
I first investigated item-level semantic coding. This experiment used public data assigning semantic labels/properties to objects (e.g., an “apple” object has the property “is edible”). These analyses focus on the 300 objects associated with the most features and the 20 most common properties. For each object, Llama-3.2-3b was fed a two-word text (e.g., “An apple”), and I extracted the resulting final-token activations from the residual stream, the feed-forward (MLP) layer outputs, and the self-attention layer outputs.
Based on these predictors, I fit an SVM for each of the 20 properties, predicting whether an object had (1) or did not have (0) each feature, using 6-fold cross-validation. The classifier accuracies linked to responses of different layers are shown in Figure 1.[1]
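For concreteness, here is a minimal sketch of this probing pipeline, assuming the Hugging Face transformers and scikit-learn libraries. The model identifier and the data-loading helpers (`load_objects`, `load_property_labels`) are illustrative assumptions, not the code actually used.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

MODEL = "meta-llama/Llama-3.2-3B"  # assumed Hugging Face identifier
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def final_token_residuals(text):
    """Residual-stream activations at the final token: one vector per layer."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: embedding output plus one entry per layer, each [1, seq_len, d_model]
    return np.stack([h[0, -1].float().cpu().numpy() for h in out.hidden_states])

# Assumed helpers: load_objects() returns the 300 object names; load_property_labels(prop)
# returns a 0/1 label per object for one property from the feature-norm dataset.
objects = load_objects()
labels = load_property_labels("is edible")

acts = np.stack([final_token_residuals(f"An {obj}") for obj in objects])  # [n_objects, n_layers+1, d_model]

# One linear probe per layer; the layer whose activations give the highest
# cross-validated accuracy is taken to most represent the probed property.
for layer in range(acts.shape[1]):
    clf = make_pipeline(StandardScaler(), LinearSVC())
    acc = cross_val_score(clf, acts[:, layer], labels, cv=6).mean()
    print(f"layer {layer}: accuracy {acc:.3f}")
```

This sketch only covers the residual stream; probing the MLP and attention outputs would additionally require forward hooks on those sublayers.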
Figure 1. Item-level representation. Accuracies for the binary prediction tests of Experiment 1 on Llama-3.2-3b. A. Achieved accuracies averaged across all twenty item features. B-D. Accuracies achieved for each item feature, plotted separately for the three probed transformer components. Feed-forward network, FFN.
Some interesting patterns emerge, although here I will focus only on what is relevant to understanding abstraction hierarchies: The representation of item-level semantics in the residual stream and feed-forward network outputs peaks in early layers (layers 2-7), descends, and then turns upward again in the final layers. The initial peak is consistent with the model steadily increasing in abstraction, such that layers 2-7 are most sensitive to item-level semantics. The final rise has been demonstrated in some prior reports and presumably reflects the model preparing its actual word predictions. The main focus here, however, is on that initial peak and what it indicates about the level of abstraction being targeted.
Experiment 2: Two-item relations
I next examined how transformers represent the relation between two items, across several experiments. These tests examined how an LLM encodes: (A) whether an object is likely to be found in a scene, (B) how semantically related humans report two objects to be, and (C) more narrowly, whether an herbivore/carnivore would eat a plant/meat. Each followed a structure similar to Experiment 1: texts are crafted, submitted to Llama-3.2-3b, and the resulting activations are used to fit classifiers.
Experiment 2A
This experiment examined how a model encodes object-scene relationships using a private dataset I have access to. Human participants were shown 342 objects and scenes, and for each object-scene pairing they used a 4-point scale to rate how likely it would be to find the object in the scene (1 = “Very unlikely”; 4 = “Very likely”). For the analysis here, I attempted to predict the mean rating for each object-scene pair.
The object-scene pairs were used to produce texts (“In a {SCENE}, an {OBJECT}” or “A {SCENE} and {OBJECT}”), which were submitted to Llama-3.2-3b, and activations were extracted at the object’s token. The activations were used to fit ridge regressions predicting each pair’s likeliness score, with 6-fold cross-validation.
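Under the same assumptions as the Experiment 1 sketch (and reusing its `final_token_residuals` helper), the Experiment 2A probe would look roughly like this; `pairs` and `ratings` stand in for the private object-scene dataset, and the ridge penalty is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumed inputs: `pairs` is a list of (object, scene) strings and `ratings` the
# corresponding mean likeliness ratings from the private dataset.
texts = [f"In a {scene}, an {obj}" for obj, scene in pairs]  # alternative template: f"A {scene} and {obj}"
acts = np.stack([final_token_residuals(t) for t in texts])   # object is the final token in both templates

for layer in range(acts.shape[1]):
    r2 = cross_val_score(Ridge(alpha=1.0), acts[:, layer], ratings, cv=6, scoring="r2").mean()
    print(f"layer {layer}: R^2 {r2:.3f}")
```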
Peak representation of the two-item relational information occurred deeper (peaking around layer 11) than the earlier item-level representation effects. This holds regardless of how the input texts are phrased (Figures 2a & 2b) and is consistent with model layers progressively increasing in abstraction.
Figure 2. Two-item-relation representation. Accuracies associated with the three Llama-3.2-3b experiments specifically on two-item relations.
Experiment 2B
This next experiment examined two-item relations between pairs of objects. Using human-reported data on how related two items are (https://osf.io/5q6th/), I submitted “An {ITEM 1} and {ITEM 2}” and the flipped “An {ITEM 2} and {ITEM 1}” to Llama-3.2-3b. Activations were extracted and then submitted to a ridge regression, yielding the accuracies shown in Figure 2b, whose peak is consistent with that of Figure 2a.
Experiment 2C
A final test of two-item relations studied this topic more narrowly than the general “relatedness” examined thus far. Here, I prepared two-item texts where one item was always a food (plant or meat) and the other was an animal (herbivore or carnivore) (e.g., “A {FOOD} and {ANIMAL}”). The texts were submitted to Llama-3.2-3b, activations were extracted, and SVMs were fit, predicting whether the text’s animal would eat the food. Cross-validation used 2-group folds, training on herbivores and testing on carnivores or vice versa. Accuracy was again above chance, showing that the model represents the specific “would eat” two-item relation, and again most strongly in layers 8-12.
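The train-on-one-diet, test-on-the-other scheme maps naturally onto scikit-learn's grouped cross-validation. A minimal sketch, assuming `acts_layer` (activations for one layer), `would_eat` (0/1 labels), and `diet_group` (herbivore vs. carnivore) have been built as in the earlier sketches:

```python
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# With exactly two groups, LeaveOneGroupOut yields the two desired folds:
# train on herbivore texts / test on carnivore texts, and vice versa.
clf = make_pipeline(StandardScaler(), LinearSVC())
scores = cross_val_score(clf, acts_layer, would_eat, groups=diet_group, cv=LeaveOneGroupOut())
print(scores)  # one accuracy per direction
```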
Experiment 3: Four-object analogies
To reach yet deeper layers, I considered four-object analogies. I prepared fifty analogies and generated texts for them (e.g., “Like a seed and a tree, an egg and a chicken”). For each analogy (AB:CD), three equivalent valid variants were prepared (BA:DC, CD:AB, DC:BA).
For each of these, invalid analogies were also prepared. Easy invalids were produced by flipping the second and fourth words (e.g., “Like a seed and a chicken, an egg and a tree”). Distinguishing these from the valid analogies is easy because it can be done by noticing either that the first two words are unrelated or that the last two words are unrelated. Hard invalids were produced by flipping just the third and fourth words (e.g., “Like a seed and a tree, a chicken and an egg”). Detecting that these hard invalids are indeed bad analogies requires considering all four items.
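To make the variant construction concrete, here is a small sketch of how the valid and invalid analogy texts could be generated from a base (A, B, C, D) tuple; the function names and article handling are my own, not necessarily the original generation code.

```python
def article(word):
    # Crude a/an choice, sufficient for illustration.
    return "an" if word[0].lower() in "aeiou" else "a"

def analogy_text(a, b, c, d):
    return (f"Like {article(a)} {a} and {article(b)} {b}, "
            f"{article(c)} {c} and {article(d)} {d}")

def analogy_variants(a, b, c, d):
    valid = [(a, b, c, d), (b, a, d, c), (c, d, a, b), (d, c, b, a)]  # AB:CD, BA:DC, CD:AB, DC:BA
    easy_invalid = (a, d, c, b)   # swap 2nd and 4th words: both pairs become unrelated
    hard_invalid = (a, b, d, c)   # swap 3rd and 4th words: both pairs stay related,
                                  # but spotting the mismatch requires all four items
    return valid, easy_invalid, hard_invalid

valid, easy, hard = analogy_variants("seed", "tree", "egg", "chicken")
print(analogy_text(*valid[0]))  # Like a seed and a tree, an egg and a chicken
print(analogy_text(*easy))      # Like a seed and a chicken, an egg and a tree
print(analogy_text(*hard))      # Like a seed and a tree, a chicken and an egg
```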
Figure 3. Four-item-analogy representation. Accuracies associated with the three tests specifically on four-item analogies using Llama-3.2-3b. These are all referred to as being the same “experiment” because they all use the same fifty-analogy dataset.
The accuracy patterns suggest analogical processing occurs deeper than two-item relational processing. As a reference point, Figure 3a provides new two-item relation results using the present dataset, where SVMs distinguished valid examples from easy-invalid examples based on the activations at the second word. The two-item relation peak was seen around layer 10, as before. Figure 3b then shows the valid/easy-invalid classification results based on the fourth word’s activations, which produced a slightly later peak (~layer 12). Figure 3c demonstrates a similarly late peak for valid versus hard-invalid classification.
Although the difference between the two-item relation peak and the present four-item analogy peak is subtler than the gap between the single-item peak and two-item peak, the results are nonetheless consistent with a functional abstraction hierarchy linked to layer progression.
Experiment 4: Burying text
To engage the layers in the back half of Llama-3.2-3b, I lengthened the initial texts by burying the target content behind roughly 100 words of filler. This was done for Experiments 1, 2, and 3. For example, the Experiment 2B texts, which consisted of “An {ITEM 1} and {ITEM 2}”, were expanded to:
“An {ITEM 1} and {ITEM 2} are here. I thought about this for a long while. The more I pondered, the clearer it became that my initial reaction was just the tip of the iceberg. There were layers to this issue, complexities that I hadn’t considered at first glance. Each new angle brought a different perspective, challenging my assumptions and making me question what I thought I knew. It was like peeling an onion, revealing not just answers, but more questions, more nuances to explore”
This filler suffix was held constant for every experiment. Activations were extracted from the last token (“explore”), while the target of the classification remained the same (e.g., predicting the relatedness of items 1 and 2). Thus, the SVM examined how well the signature of the initial two items was preserved through the filler text. Despite the challenges that filler text presumably creates, Figure 4 shows that SVM accuracy was still considerably above chance, although representations are consistently shifted into deeper layers.[2] This clear pattern is consistent with later layers capturing more global document-wide information. Yet, this would not typically be seen as abstraction over the text, but rather as the (lossy) compression of early local information.
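A sketch of the buried-text construction, reusing the assumed `final_token_residuals` helper from the Experiment 1 sketch; `FILLER` stands for the full fixed paragraph quoted above (truncated here), and `item_pairs` for the Experiment 2B pairs.

```python
import numpy as np

# The fixed ~100-word suffix quoted above; truncated here for brevity.
FILLER = ("are here. I thought about this for a long while. ... "
          "revealing not just answers, but more questions, more nuances to explore")

def buried_text(item1, item2):
    return f"An {item1} and {item2} {FILLER}"

# The probe target (e.g., the relatedness rating of items 1 and 2) is unchanged;
# only the text grows, and the extraction point moves to the suffix's last token ("explore"),
# which is simply the final token of the lengthened text.
acts = np.stack([final_token_residuals(buried_text(i1, i2)) for i1, i2 in item_pairs])
```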
Figure 4. Buried information representation. The graphs show the results of the buried-concept analyses, each an adaptation of an earlier experiment. Accordingly, each subfigure here maps to one of the subfigures from (top row) Figure 1, (middle row) Figure 2, or (bottom row) Figure 3, and the present figure is designed to parallel the structure of those earlier figures. For Experiment 3 with analogy texts, the original two-item relations experiment is equivalent to the easy analogy comparison; originally, those differed based on whether activations were extracted from the second or fourth word’s token, but in the buried context, activations are always taken from the last word of the suffix filler text.
Experiment 5: Larger models and scaling effects
Finally, I performed each of these experiments again, now using a larger model: Llama-3.3-70b-Instruct, which contains 80 layers (as of December 2024, Llama-3.3 is only provided as an “Instruct” variant).[3] Figure 5 illustrates the resulting accuracy levels.
I find the most prominent pattern in these figures to be the emergent double-peak pattern linked to the representation of two-item relations and four-item analogies (Figure 5c-f). All four experiments on these topics show two robustly distinct peaks, which are most evident in the attention and feed-forward (MLP) layer outputs. The first peak emerges around layers 12-16 and the second around layers 25-33. In other words, these semantic properties are most represented in two distinct spots. In between these peaks, there is a dip.
If the peaks are taken to shed light on abstraction, this may reflect an increase in abstraction (producing the first peak and then the dip), then a decrease (producing the second peak), and then a further increase in abstraction (producing the second-half descent). None of the experiments on the smaller Llama-3.2-3b produced double peaks, suggesting that this is an emergent property of scale.
Each of the results here replicates when tested using Qwen-2.5-72b but not Deepseek-V2.5-236b (the latter is a mixture-of-experts model).
Figure 5. Larger-model representation. Several but not all of the earlier experiments were performed again, now using Llama-3.3-70b-Instruct rather than Llama-3.2-3b. Each subfigure here maps to one of the subfigures from Figures 1, 2, 3, or 4. The mappings are specifically: a. Figure 1a, b. Figure 1b, c. Figure 2a left, d. Figure 2b, e. Figure 2c, f. Figure 3c, g. Figure 4c, h. Figure 4d, and i. Figure 4h.
To better understand this peak-valley-peak representational profile, I examined the accuracies associated with the attention layer outputs. Compared to the residual stream and the feed-forward network outputs, the attention outputs displayed the double peak most prominently. I z-scored the accuracies for the seven experiments involving pairs of items or four-item analogies (z-scored across layers for a given experiment) (Figure 6a).
Remarkably, the valley in the two/four-item experiments coincides with a rise in representation seen in the buried-concept experiments (the rising blue lines in Figure 6a). The buried concepts’ peak representation overlaps with the second peak of the non-buried concepts; this is also evident looking back at Figures 5g-i, where the residual stream’s representation of buried concepts is strongest around layer 35.
Figure 6. Parallels between model accuracies and their accuracy first derivatives. These series are the z-scored attention output accuracies taken from Figures 5c-i (z-scored within each series). The correlation matrix was computed over these 80-element series.
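The z-scoring and correlation step itself is simple; a sketch, where `acc` is assumed to map each of the seven experiments to its length-80 attention-output accuracy curve:

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

names = list(acc)                               # seven experiment names (assumed dict)
z = np.stack([zscore(acc[n]) for n in names])   # [7, 80]: one z-scored series per experiment
corr = np.corrcoef(z)                           # 7x7 correlation matrix across the layer series
```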
It’s unclear what interpretation can unify these results. Potentially, there exist two functional hierarchies: the first processes information from syntax up to roughly sentence/fragment-level semantics, and this hierarchy’s representations feed a second, later hierarchy that captures a more global context. Alternatively, the valley may occur because the model is referencing global information to update the local representations initially produced at the first peak.
Interestingly, overlaying all of the attention output accuracy plots also reveals a consistent zig-zag pattern, particularly after layer 40. This is best understood as the accuracy line’s derivative being negatively correlated across successive layers (the derivative alternates in sign). This is another emergent effect of scaling, which does not appear in the smaller Llama model, and I can discuss it further if there is interest. However, for now, it does not pertain to the initial questions about functional hierarchies.
Conclusion
This investigation reveals both expected hierarchical trends in LLM functioning and deviations from those trends.[4] These findings hopefully refine prevailing assumptions about transformer dynamics and open avenues for exploring how architectural innovations and scaling influence the emergent properties of modern LLMs.
The methodology employed here – targeting very short texts and probing with classifiers – is not a typical interpretability method. However, the multi-item relations and analogies studied here seem like they would be harder to study with more common techniques.
The findings on buried concepts may have some importance for potential future activation-steering work. Stimulating particularly deep layers may effectively simulate encountering a concept earlier in the context window, which could be useful.
[1] For Experiment 1, probing classification improves slightly for texts of the form “An {object}” rather than just “{object}” (e.g., a 1-2% boost in accuracy). Giving the model a bit of grammar helps it represent information better. The same goes for Experiment 2, where we see improvements in R².
[2] Adding filler before the target text (e.g., “… explore. An {Item 1} and {Item 2}”) had virtually no effect on SVM accuracies.
[3] The Instruct version of the model was used because, as of December 2024, it is the only variant provided for Llama-3.3, the most recent version in the Llama line of models.
[4] I conducted some tests on grammar/tense/syntax. Representations distinguishing texts with the past vs. present vs. future tense appeared to peak early, in layers 3-5. Additionally, analyses on grammatically proper versus improper (e.g., “I are” or “I been have”) texts also showed an early peak (layers 3 & 4). However, analyses on simple vs. perfect tense showed representation rising steadily until layer 14 and staying maximal across the remaining layers. Thus, I did not feel that the results on basic syntax were conclusive, so I have not reported them.