Note: This is a more fleshed-out version of this post and includes theoretical arguments justifying the empirical findings. If you've read that one, feel free to skip to the proofs.

We challenge the thesis of two papers: The Geometry of Categorical and Hierarchical Concepts in LLMs, which won 1st prize at the ICML 2024 Mechanistic Interpretability Workshop, and the ICML 2024 paper The Linear Representation Hypothesis and the Geometry of LLMs.

The main takeaway is that the orthogonality and polytopes they observe for categorical and hierarchical concepts occur practically everywhere, even in places where they should not.

Overview of the Feature Geometry Papers

Studying the geometry of a language model's embedding space is an important and challenging task because of the various ways concepts can be represented, extracted, and used (see related works). Specifically, we want a framework that unifies both measurement (of how well a latent explains a feature/concept) and causal intervention (how well it can be used to control/steer the model).

The method described in the two papers we study works as follows: they split the computation of a large language model (LLM) as

$$P(y \mid x) \propto \exp\left(\lambda(x)^\top \gamma(y)\right),$$

where:

  • $\lambda(x)$ is the context embedding for input $x$ (the last token's residual stream activation after the last layer)
  • $\gamma(y)$ is the unembedding vector for output $y$ (the corresponding row of the unembedding matrix $\Gamma$)
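To make the setup concrete, here is a toy numpy sketch of this decomposition. The sizes and random weights are our own illustrative assumptions, not any real model's:

import numpy as np

# Toy illustration of the decomposition P(y | x) ∝ exp(lambda(x)^T gamma(y)).
# `lam` stands in for the last token's final-layer residual (the context
# embedding); the rows of `Gamma` stand in for the unembedding vectors gamma(y).
rng = np.random.default_rng(0)
d, vocab_size = 64, 1000                      # toy sizes, not a real model's

lam = rng.normal(size=d)                      # lambda(x): context embedding
Gamma = rng.normal(size=(vocab_size, d))      # unembedding matrix, rows gamma(y)

logits = Gamma @ lam                          # lambda(x)^T gamma(y) for every y
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over the vocabulary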

They formalize a notion of a binary concept as a latent variable $W \in \{0, 1\}$ that is caused by the context $X$ and causes the output $Y$, which depends on the context only through the value of $W$.

Crucially, this restricts their methodology to only work with concepts that can be differentiated by single-token counterfactual pairs of outputs. For instance, it is not clear how to define several important concepts such as "sycophancy" and "truthfulness" using their formalism.

They then define linear representations of a concept in both the embedding and unembedding spaces:

In the unembedding space, $\bar{\gamma}_W$ is considered a representation of a concept $W$ if $\gamma(Y(1)) - \gamma(Y(0)) \in \mathrm{Cone}(\bar{\gamma}_W)$ almost surely, where $Y(0)$ and $Y(1)$ are counterfactual outputs generated with the concept value set to $0$ and $1$ respectively, and $\mathrm{Cone}(\bar{\gamma}_W) = \{\alpha \bar{\gamma}_W : \alpha > 0\}$.

This definition has the hidden assumption that each pair sampled from the vocabulary corresponds to a unique concept. For instance, ("king", "queen") can correspond to a variety of concepts, such as "male ⇒ female", "royalty", or "K ⇒ Q" in a deck of playing cards.

In the embedding space, they say that $\bar{\lambda}_W$ is a representation of a concept $W$ if we have $\lambda_1 - \lambda_0 \in \mathrm{Cone}(\bar{\lambda}_W)$ for any context embeddings $\lambda_0, \lambda_1$ that satisfy

$$\frac{P(W = 1 \mid \lambda_1)}{P(W = 1 \mid \lambda_0)} > 1 \quad \text{and} \quad \frac{P(Z \mid \lambda_1)}{P(Z \mid \lambda_0)} = 1$$

for each concept $Z$ that is causally separable with $W$.

Now, in order to work with concept representations (i.e. look at similarities, projections, etc.), we need to define an inner product. They give the following definition:

A causal inner product $\langle \cdot, \cdot \rangle_C$ on the unembedding space is an inner product such that

$$\langle \bar{\gamma}_W, \bar{\gamma}_Z \rangle_C = 0$$

for any pair of causally separable concepts $W$ and $Z$.

Note that this definition allows, in particular, the whitened inner product $\langle \bar{\gamma}_W, \bar{\gamma}_Z \rangle_C = \bar{\gamma}_W^\top \mathrm{Cov}(\gamma)^{-1} \bar{\gamma}_Z$ to be a causal inner product. As we show, the whitening transformation they apply as an explicit example of a causal inner product does indeed make almost everything almost orthogonal.

This choice turns out to have the key property that it unifies the unembedding and embedding representations:

Suppose that, for any concept $W$, there exist concepts $Z_1, \ldots, Z_{d-1}$ such that each $Z_i$ is causally separable with $W$ and $\{\bar{\gamma}_W, \bar{\gamma}_{Z_1}, \ldots, \bar{\gamma}_{Z_{d-1}}\}$ is a basis of $\mathbb{R}^d$. If $\langle \cdot, \cdot \rangle_C$ is a causal inner product, then the Riesz isomorphism $\bar{\gamma} \mapsto \langle \bar{\gamma}, \cdot \rangle_C$, for $\bar{\gamma} \in \mathbb{R}^d$, maps the unembedding representation $\bar{\gamma}_W$ of each concept $W$ to its embedding representation $\bar{\lambda}_W$.


For an explicit example of a causal inner product, they consider the whitening transformation using the covariance matrix of the unembedding vectors:

$$g(y) = \mathrm{Cov}(\gamma)^{-1/2} \left( \gamma(y) - \mathbb{E}[\gamma] \right),$$

where $\gamma(y)$ is the unembedding vector, $\mathbb{E}[\gamma]$ is the expected unembedding vector, and $\mathrm{Cov}(\gamma)$ is the covariance matrix of $\gamma$ (both taken over the vocabulary). They show that under this transformation, the embedding and unembedding representations are the same.
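As a minimal numpy sketch of this whitening step (our own helper written from the formula above, not the authors' implementation):

import numpy as np

def whiten_unembeddings(Gamma: np.ndarray) -> np.ndarray:
    """Whitening transform g(y) = Cov(gamma)^{-1/2} (gamma(y) - E[gamma]).

    Gamma: (vocab_size, d) matrix whose rows are the unembedding vectors.
    Returns the transformed vectors, one per row.
    """
    centered = Gamma - Gamma.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)                   # (d, d) covariance of gamma
    evals, evecs = np.linalg.eigh(cov)                     # inverse square root via eigendecomposition
    inv_sqrt = evecs @ np.diag((evals + 1e-10) ** -0.5) @ evecs.T
    return centered @ inv_sqrt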

Now, for any binary concept $W$, its vector representation $\bar{\ell}_W$ is defined as the vector satisfying

$$\lambda_1 - \lambda_0 = \bar{\ell}_W$$

for context embeddings $\lambda_0, \lambda_1$ that differ only in the value of $W$.

Given such a vector representation $\bar{\ell}_W$ for binary concepts (where $\bar{\ell}_W$ is the linear representation of $W$), the following hierarchical orthogonality relation holds: for a concept $Z$ subordinate to a concept $W$,

$$\langle \bar{\ell}_W, \; \bar{\ell}_Z - \bar{\ell}_W \rangle_C = 0.$$

This illustrates that for the hierarchical concepts mammal ⇒ animal, we have $\bar{\ell}_{\text{animal}} \perp (\bar{\ell}_{\text{mammal}} - \bar{\ell}_{\text{animal}})$. They prove this holds true and empirically validate it by plotting various animal representation points in the 2D span of the vectors for animal and mammal.

Similarly, this means $\bar{\ell}_{\text{animal}} \perp (\bar{\ell}_{\text{bird}} - \bar{\ell}_{\text{animal}})$, and likewise for every other subcategory of animal.

Lastly, they show that in their transformed space, categorical features with $k$ values form polytopes (simplices with $k$ vertices) in $k - 1$ dimensions. They empirically show these results hold in the Gemma-2B model and use the WordNet hierarchy to validate them at scale.

Ablations

To study concepts that do not form such semantic categories and hierarchies, we add the following two datasets:

Semantically Correlated Concepts

First, an "emotions" dictionary for various kinds of emotions split in various top-level emotions. Note that these categories are not expected to be orthogonal (for instance, joy and sadness should be anti-correlated). We create this via a simple call to ChatGPT:

emotions = {
   'joy': ['mirth', 'thrill', 'bliss', 'relief', 'admiration', ...],
   'sadness': ['dejection', 'anguish', 'nostalgia', 'melancholy', ...],
   'anger': ['displeasure', 'spite', 'irritation', 'disdain', ...],
   'fear': ['nervousness', 'paranoia', 'discomfort', 'helplessness', ...],
   'surprise': ['enthrallment', 'unexpectedness', 'revitalization', ...],
   'disgust': ['detestation', 'displeasure', 'prudishness', 'disdain', ...]
}

Random Nonsensical Concepts

Next, we add a "nonsense" dataset that has five completely random categories, where each category is defined by on the order of 100 totally random objects that are completely unrelated to the top-level category name:

nonsense = {
   "random 1": ["toaster", "penguin", "jelly", "cactus", "submarine", ...],
   "random 2": ["sandwich", "yo-yo", "plank", "rainbow", "monocle", ...],
   "random 3": ["kiwi", "tornado", "chopstick", "helicopter", "sunflower", ...],
   "random 4": ["ocean", "microscope", "tiger", "pasta", "umbrella", ...],
   "random 5": ["banjo", "skyscraper", "avocado", "sphinx", "teacup", ...]
}

The complete dictionaries for the two ablations we run (emotions and nonsense) are available at anonymous repository, and the code and hyperparameters we use are exactly the same as those used by the original authors, all of which is available on their public repository.

Hierarchical features are orthogonal - but so are semantic opposites!?

Now, let's look at their main experimental results (for animals):

And this is the first ablation we run -- all emotion words in the 2D span of sadness and emotions:

Specifically, this is what we get for joy vectors in the span of sadness. Note that the orthogonality observed is very similar to that in the case of animal hierarchies.

Should we really have joy so uncorrelated with sadness? Sadness and joy are semantic opposites, so one should expect their vectors to be anti-correlated rather than orthogonal.

Also, here's the same plot but for completely random, nonsensical concepts:

It seems that their orthogonality results, while true for hierarchical concepts, are also true for semantically opposite concepts and for totally random concepts. In the next section, we show theoretically that in high dimensions, random vectors, and in particular those obtained after the whitening transformation, are expected to be trivially almost orthogonal with very high probability.

Categorical features form simplices - but so do totally random ones!?

Here is the simplex they find animal categories to form (see Fig. 3 in their original paper):

blog_1

And this is what we get for completely random concepts:

blog_2

Thus, while categorical concepts form simplices, so do completely random, nonsensical concepts. Again, as we will show theoretically, randomly made categories are likely to form simplices and polytopes as well, because it is very easy to escape finite convex hulls in high-dimensional spaces.

Random Unembeddings Exhibit the Same Geometry

Here, we show that under the whitening transformation, even random (untrained) unembeddings exhibit the same geometry as the trained ones. This gives more empirical evidence that the orthogonality and polytope findings are not novel and do not "emerge" during the training of a language model.

Here's the figure showing orthogonality of random unembedding concept vectors:

These are polytopes that form for animal categories even with a completely random, untrained unembedding matrix:

random_animal_3d: Categorical concepts form polytopes even in random (untrained) unembedding spaces.
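The following self-contained sketch illustrates the spirit of this ablation: whiten a completely random "unembedding matrix", build parent/child concept vectors from arbitrary token groups, and observe near-orthogonality. The sizes and group names are illustrative assumptions, not the original experiment code:

import numpy as np

# Draw an untrained "unembedding matrix", whiten it, then check the
# hierarchical orthogonality relation for made-up parent/child categories.
rng = np.random.default_rng(0)
d, vocab = 512, 5000
Gamma = rng.normal(size=(vocab, d))

centered = Gamma - Gamma.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(centered, rowvar=False))
G = centered @ evecs @ np.diag(evals ** -0.5) @ evecs.T     # whitened unembeddings

animal = rng.choice(vocab, size=100, replace=False)         # fake "animal" tokens
mammal = animal[:30]                                        # fake "mammal" subset

ell_animal, ell_mammal = G[animal].mean(axis=0), G[mammal].mean(axis=0)
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(ell_animal, ell_mammal - ell_animal))             # close to 0 even though Gamma is random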

Orthogonality and Polytopes in High Dimensions

Here, we theoretically show why the main orthogonality and polytope results in the paper are trivially true in high dimensions. Since the residual stream of the Gemma-2B model they used has dimension $d = 2048$, we claim many of their empirical findings are expected by default and do not show anything specific about the geometry of feature embeddings in trained language models.

Orthogonality and the Whitening Transformation

Many of the paper’s claims and empirical findings are about the orthogonality of various linear probes for concepts in unembedding space. Importantly though, "orthogonal" here is defined using an inner product after a whitening transformation. Under this definition, most concept probes are going to end up being almost orthogonal by default.

To explain why this happens, we will first discuss a simplified case where we assume that we are studying the representations of a language model with residual stream width $d$ equal to or greater than the number of tokens $N$ in its dictionary. In this case, all the orthogonality results shown in Theorem 8 of the paper would hold exactly for any arbitrary concept hierarchies we make up. So observing that the relationships in Theorem 8 hold for a set of linear concept probes would not tell us anything about whether the model uses these concepts or not.

Then, we will discuss real models like Gemma-2B that have a residual stream width smaller than the number of tokens in their dictionary. For such models, the results in Theorem 8 would not automatically hold for any set of concepts we make up. But in high dimensions, the theorem would still be expected to hold approximately, with most concept vectors ending up almost orthogonal. Most of the empirical results for orthogonality the paper shows in e.g. Fig. 2 and Fig. 4 are consistent with the amount of almost-orthogonality that would be expected by default.

Case $d \ge N$

The whitening transformation will essentially attempt to make all the vocabulary vectors as orthogonal to each other as possible. When the dimensionality $d$ is greater than or equal to the number of vectors $N$ (i.e., $d \ge N$), the whitening transformation can make the vectors exactly orthogonal.

Let $\gamma_1, \ldots, \gamma_N \in \mathbb{R}^d$ be zero-mean random vectors with covariance matrix $\Sigma = \frac{1}{N} \sum_{i=1}^{N} \gamma_i \gamma_i^\top$, where $d \ge N$. Then there exists a whitening transformation $W = \Sigma^{-1/2}$ (taken on the support of $\Sigma$) such that the transformed vectors $g_i = W \gamma_i$ satisfy:

$$\langle g_i, g_j \rangle \propto \delta_{ij},$$

where $\delta_{ij}$ is the Kronecker delta.

Proof:

  1. Consider the eigendecomposition $\Sigma = U D U^\top$, where $D = \mathrm{diag}(d_1, \ldots, d_d)$.
  2. Define $W = D^{-1/2} U^\top$ (restricted to the nonzero eigenvalues). Then for any $i, j$: $\langle g_i, g_j \rangle = \gamma_i^\top W^\top W \gamma_j = \gamma_i^\top \Sigma^{+} \gamma_j = N \delta_{ij}$, provided the $\gamma_i$ are linearly independent (which holds almost surely).
  3. If $d > N$, we can extend $U$ to a full $d \times d$ orthogonal matrix, which preserves the orthogonality property.
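A quick numerical sanity check of this lemma (our own toy script, with sizes chosen arbitrarily):

import numpy as np

# Whitening N generic vectors in d >= N dimensions makes them exactly
# orthogonal. The vectors are treated as zero-mean, so the covariance here is
# just the second-moment matrix.
rng = np.random.default_rng(0)
d, N = 10, 6
Gamma = rng.normal(size=(N, d))            # rows: "unembedding" vectors gamma_i

Sigma = Gamma.T @ Gamma / N                # (d, d), rank N < d
evals, evecs = np.linalg.eigh(Sigma)
keep = evals > 1e-10                       # pseudo-inverse square root on the support
W = evecs[:, keep] @ np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T

G = Gamma @ W                              # whitened vectors g_i = W gamma_i
print(np.round(G @ G.T, 6))                # N * identity: exactly orthogonal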

This matters, because if the dictionary embeddings are orthogonal, the relationships for concept vectors the paper derives will hold for completely made-up concept hierarchies. They don't have to be related to the structure of the language or the geometry of the original, untransformed unembedding matrix of the model at all.

As an example, consider a dictionary with $N = 6$ tokens and a residual stream of width $d = 6$. The tokens could, for instance, just be the first six letters of the alphabet, namely $\{a, b, c, d, e, f\}$. Following the convention of the paper, we will call the unembedding vectors of the tokens $\gamma_a, \gamma_b, \ldots, \gamma_f$.

Due to the whitening transformation, these vectors will be orthogonal under the causal inner product:

  $\langle \gamma_i, \gamma_j \rangle_C = 0 \quad \text{for } i \neq j, \;\; i, j \in \{a, \ldots, f\}.$

The relationships described in Theorem 8 of the paper will then hold for any hierarchical categorization schemes of concepts defined over these tokens. The concepts do not need to be meaningful in any way, and they do not need to have anything to do with the statistical relationship between the six tokens in the training data.

For example, let us declare the binary concept {'blegg', 'rube'}. Tokens $a, b, c$ are "bleggs", and tokens $d, e, f$ are "rubes". We further categorize each "blegg" as being one of {'lant', 'nant', 'blip'}, making it a categorical concept. Token "a" is a "lant", "b" is a "nant", and "c" is a "blip".

We can create a linear probe $\bar{\ell}_{\text{blegg}}$ that checks whether the current token vector is a 'blegg'. It returns a nonzero value if the token is a 'blegg', and a value of 0 if it is a 'rube' (see Theorem 4 in the paper).

We could train the probe with LDA like the paper does, but in this case, the setup is simple enough that the answer can be found immediately. In the whitened coordinate system, we write:

  $\bar{\ell}_{\text{blegg}} = \frac{1}{3}\left(\gamma_a + \gamma_b + \gamma_c\right), \qquad \bar{\ell}_{\text{rube}} = \frac{1}{3}\left(\gamma_d + \gamma_e + \gamma_f\right).$

Constructing linear probes for 'lant', 'nant', and 'blip' is also straightforward:

  $\bar{\ell}_{\text{lant}} = \gamma_a, \qquad \bar{\ell}_{\text{nant}} = \gamma_b, \qquad \bar{\ell}_{\text{blip}} = \gamma_c.$

Following the paper's definitions, {'lant', 'nant', 'blip'} is subordinate to {'blegg', 'rube'}. We see that the relations of Theorem 8 (a, b, c) in the paper, illustrated in their Figure 2, hold for these vectors:

8 (a) $\langle \bar{\ell}_{\text{blegg}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{blegg}} \rangle_C = 0$

8 (b) $\langle \bar{\ell}_{\text{blegg}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{nant}} \rangle_C = 0$

8 (c) $\langle \bar{\ell}_{\text{blegg}} - \bar{\ell}_{\text{rube}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{nant}} \rangle_C = 0$
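These identities are easy to verify numerically. In the whitened coordinates the six token vectors are orthogonal (orthonormal up to scale), so in this toy check we simply stand in for them with the standard basis of $\mathbb{R}^6$:

import numpy as np

# Worked check of the six-token example: build the probes exactly as above and
# verify the three orthogonality relations with ordinary dot products.
gamma = {tok: vec for tok, vec in zip("abcdef", np.eye(6))}

ell_blegg = (gamma["a"] + gamma["b"] + gamma["c"]) / 3
ell_rube  = (gamma["d"] + gamma["e"] + gamma["f"]) / 3
ell_lant, ell_nant, ell_blip = gamma["a"], gamma["b"], gamma["c"]

print(ell_blegg @ (ell_lant - ell_blegg))                 # 8(a): 0 (up to float error)
print(ell_blegg @ (ell_lant - ell_nant))                  # 8(b): 0 (up to float error)
print((ell_blegg - ell_rube) @ (ell_lant - ell_nant))     # 8(c): 0 (up to float error)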

So, in a $d$-dimensional space containing unembedding vectors for $N \le d$ dictionary elements, Theorem 8 will hold for any self-consistent categorisation scheme. Theorem 8 will also keep holding if we replace the unembedding matrix $\Gamma$ with a randomly chosen full-rank matrix. Due to the whitening applied by the ‘causal inner product’, the concepts we make up do not need to have any relationship to the geometry of the unembedding vectors in the model.

Case $d < N$

If the model's residual stream width $d$ is smaller than the number of tokens $N$ in its dictionary, as is the case in Gemma-2B and most other models, the whitening transformation cannot make all $N$ unembedding vectors exactly orthogonal. So Theorem 8 will no longer be satisfied by default for all concept hierarchies and all unembedding matrices we make up.

However, if $d$ and $N$ are large, the whitening transformation might often still be able to make most unembedding vectors almost orthogonal, because random vectors in high-dimensional spaces tend to be almost orthogonal by default.

To see why this is the case, consider $N$ random vectors $x_i \in \mathbb{R}^d$, for $i = 1, \ldots, N$, drawn uniformly from the unit sphere. We will show that as the dimensionality $d$ increases, these vectors become approximately orthogonal.

The expected value of the inner product $x_i^\top x_j$ for $i \neq j$ is:

$$\mathbb{E}\left[x_i^\top x_j\right] = \sum_{k=1}^{d} \mathbb{E}\left[x_{i,k}\, x_{j,k}\right].$$

The elements of the vectors have zero mean and the two vectors are independent, so we have $\mathbb{E}[x_{i,k}\, x_{j,k}] = \mathbb{E}[x_{i,k}]\,\mathbb{E}[x_{j,k}] = 0$. Therefore, the expected value of the inner product is $0$.

The variance of the inner product is:

$$\mathrm{Var}\left(x_i^\top x_j\right) = \sum_{k=1}^{d} \mathbb{E}\left[x_{i,k}^2\right] \mathbb{E}\left[x_{j,k}^2\right] = d \cdot \frac{1}{d} \cdot \frac{1}{d} = \frac{1}{d}.$$

Thus, in high dimensions, the cosine of the angle $\theta$ between two random unit vectors has mean $0$ and standard deviation $1/\sqrt{d}$, meaning the vectors will be nearly orthogonal with high probability.

In fact, the Johnson–Lindenstrauss lemma states that the number of almost orthogonal vectors that can be fit in $d$ dimensions is exponential in $d$.
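A quick Monte Carlo check of the $1/\sqrt{d}$ scaling (our own toy script):

import numpy as np

# Inner products of random unit vectors: mean ~0, standard deviation ~1/sqrt(d).
rng = np.random.default_rng(0)
for d in (64, 256, 1024, 4096):
    x = rng.normal(size=(10_000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # points on the unit sphere
    dots = np.einsum("ij,ij->i", x[::2], x[1::2])      # inner products of disjoint pairs
    print(d, dots.mean(), dots.std(), 1 / np.sqrt(d))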

So, going back to our example in the previous section, if the vectors $\gamma_a, \ldots, \gamma_f$ are approximately orthogonal instead of exactly orthogonal, then the linear probes for the concepts {'blegg', 'rube'} and {'lant', 'nant', 'blip'} we made up would still mostly satisfy Theorem 8, up to terms of order $1/\sqrt{d}$:

8 (a) $\langle \bar{\ell}_{\text{blegg}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{blegg}} \rangle_C = O(1/\sqrt{d})$

8 (b) $\langle \bar{\ell}_{\text{blegg}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{nant}} \rangle_C = O(1/\sqrt{d})$

8 (c) $\langle \bar{\ell}_{\text{blegg}} - \bar{\ell}_{\text{rube}}, \; \bar{\ell}_{\text{lant}} - \bar{\ell}_{\text{nant}} \rangle_C = O(1/\sqrt{d})$

So, orthogonality between linear probes for concepts might be expected by default, up to terms of order $1/\sqrt{d}$ that will become very small for big models with large residual stream widths $d$. To exceed this baseline, the causal inner product between vector representations would need to be clearly smaller than $O(1/\sqrt{d})$.

High-Dimensional Convex Hulls Are Easy to Escape!

As for the polytope results in their paper (Definition 7, Figure 3), a random vector is highly likely to be outside the convex hull of $k$ vectors in high dimensions, so we should expect concepts to "form a polytope" by default.

Let $x_1, \ldots, x_k$ be $k$ independent random vectors drawn from the unit sphere in $\mathbb{R}^n$. The convex hull of $x_1, \ldots, x_k$ is the set of points $\sum_{i=1}^{k} \alpha_i x_i$, where $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$.

Since this polytope is contained in a $(k-1)$-dimensional affine subspace inside the larger $n$-dimensional space (for $k \le n$), its volume is $(k-1)$-dimensional rather than $n$-dimensional, so it has an $n$-dimensional Lebesgue measure of zero. Hence the probability that another random vector $x_{k+1}$ lies inside this polytope is zero. In the real world, our vector coordinates are floating-point numbers of finite precision rather than real numbers, so the probability of $x_{k+1}$ lying inside the polytope is not exactly zero, but it is still very small. Thus, it is not surprising that the polytopes spanned by linear probes for $k$ concepts do not intersect with linear probes for other concepts.
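To make "easy to escape" concrete, here is a small script of our own that tests convex-hull membership as a linear-programming feasibility problem with scipy:

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point: np.ndarray, vertices: np.ndarray) -> bool:
    """Check whether `point` is a convex combination of the rows of `vertices`
    by solving the feasibility LP: find alpha >= 0 with sum(alpha) = 1 and
    vertices^T alpha = point."""
    k, n = vertices.shape
    A_eq = np.vstack([vertices.T, np.ones((1, k))])        # (n + 1, k)
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.success

# A random unit vector essentially never lands in the hull of a handful of
# random unit vectors when the ambient dimension is large.
rng = np.random.default_rng(0)
n, k = 512, 20
V = rng.normal(size=(k, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
p = rng.normal(size=n); p /= np.linalg.norm(p)
print(in_convex_hull(p, V))    # False with overwhelming probability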

Even if we were to look at $k > n$ categorical concepts, Theorem 1 (Bárány and Füredi (1998)) in this paper shows that one would need a very, very high number of categories (exponential in the dimension) to have a random vector land inside the convex hull.

Discussion

Conclusion

We show that the orthogonality and polytope results observed by recent works are a trivial consequence of the whitening transformation and the high dimensionality of the representation spaces.

A transformation under which opposite concepts appear orthogonal does not seem like a good tool for studying models. It breaks our semantic model of associating directions with concepts, and it means that steering away from one concept no longer corresponds to steering toward its opposite. Thus, more work needs to be done in order to study concept representations in language models.

Recent research has found multiple ways to extract latents representing concepts/features in a model's activation space. We highlight some of them:

Linear Contrast/Steering Vectors

If we have a contrast dataset $\{(x_i^{+}, x_i^{-})\}$ for a feature or behavior, we can use the difference of the contrast activations to get a steering vector in a given layer's activation space that represents the concept. This can also be used to steer models toward or away from it, as is shown in Representation Engineering.
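As a rough sketch of this difference-of-means recipe (the names and shapes below are our own placeholders; the exact recipe varies across the works cited):

import numpy as np

# Given cached residual-stream activations at some layer for positive and
# negative prompts, the steering direction is the difference of the two means.
def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """pos_acts, neg_acts: (num_prompts, d_model) activations at a chosen layer."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# At inference time, one adds a scaled copy of this vector to the same layer's
# residual stream: h <- h + alpha * v (the layer and alpha are tuned by hand).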

One can also extract linear function vectors in the activation space by eliciting in-context learning.

Sparse Autoencoders (SAEs) and Variants

SAEs have been found to be a very scalable method for extracting linear representations of a large number of features by learning a sparse reconstruction of a model's activations. There have been several recent advancements in SAEs, in terms of both methodology and scaling.
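For illustration, a minimal PyTorch sketch of the basic SAE objective (a simplified stand-in; real SAE setups differ in many details such as tied weights, normalization, and sparsity schedules):

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer autoencoder with a ReLU feature layer."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.enc(acts))       # sparse feature activations
        recon = self.dec(latents)                  # reconstruction of the input activations
        return recon, latents

def sae_loss(recon, acts, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse latents.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()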

Unsupervised Steering Vectors

This work uses an unsupervised method to elicit latent behavior from a language model by finding directions in a layer's activations that cause a maximum change in the activations of a future layer.

Non-linear Features

While several important concepts are found to have linear representation latents (possibly due to the highly linear structure of the model's architecture), not all features in a language model are represented linearly, as shown by this work.

Future Work

We hope that our work points out various challenges toward a unified framework to study model representations and promotes further interest in the community to work on the same. Some important future directions this leaves us with are:

  • A framework on how to think about representations that unifies how they're obtained (contrastive activations, PCA, SAEs, etc.), how they're used (by the model), and how they can be used to control the model (e.g. via steering vectors).
  • How to figure out how well a given object (a direction, a vector, or even a black-box function over model parameters) represents a given human-interpretable concept or feature.
  • If orthogonality and simplices are too universal and not specific enough to study the geometry of categorical and hierarchical concepts, then what is a good lens or theory to do so?

Lastly, instead of the whitening transformation (which leads to identity covariance), one can attempt to use an inverse, i.e., a coloring transformation using a covariance matrix that is learned directly from the data.
