Note: This is a more fleshed-out version of this post and includes theoretical arguments justifying the empirical findings. If you've read that one, feel free to skip to the proofs.
We challenge the thesis of the ICML 2024 Mechanistic Interpretability Workshop 1st prize winning paper The Geometry of Categorical and Hierarchical Concepts in LLMs and the ICML 2024 paper The Linear Representation Hypothesis and the Geometry of LLMs.
The main takeaway is that the orthogonality and polytopes they observe for categorical and hierarchical concepts occur practically everywhere, even in places where they should not.
Overview of the Feature Geometry Papers
Studying the geometry of a language model's embedding space is an important and challenging task because of the various ways concepts can be represented, extracted, and used (see related works). Specifically, we want a framework that unifies both measurement (of how well a latent explains a feature/concept) and causal intervention (how well it can be used to control/steer the model).
The method described in the two papers we study works as follows: they split the computation of a large language model (LLM) as:
$$P(y \mid x) = \frac{\exp\left(\lambda(x)^\top \gamma(y)\right)}{\sum_{y' \in \mathrm{Vocab}} \exp\left(\lambda(x)^\top \gamma(y')\right)}$$
where:
λ(x) is the context embedding for input x (last token's residual after the last layer)
γ(y) is the unembedding vector for output y (using the unembedding matrix WU)
They formalize a notion of a binary concept as a latent variable W that is caused by the context X and causes the output Y(W=w), which depends only on the value of w∈W.
Crucially, this restricts their methodology to only work with concepts that can be differentiated by single-token counterfactual pairs of outputs. For instance, it is not clear how to define several important concepts such as "sycophancy" and "truthfulness" using their formalism.
They then define linear representations of a concept in both the embedding and unembedding spaces:
In the unembedding space, ¯γW is considered a representation of a concept W if γ(Y(1))−γ(Y(0))=α¯γW almost surely, where α>0.
This definition has the hidden assumption that each pair (Y(0),Y(1)) sampled from the vocabulary would only correspond to a unique concept. For instance, ("king", "queen") can correspond to a variety of concepts such as "male⟹female", "k-words⟹q-words", and "n'th card⟹(n-1)'th card" in a deck of playing cards.
In the embedding space, they say that ¯λW is a representation of a concept W if we have λ1−λ0∈Cone(¯λW) for any context embeddings λ0,λ1∈Λ that satisfy
$$\frac{P(W=1 \mid \lambda_1)}{P(W=1 \mid \lambda_0)} > 1 \quad \text{and} \quad \frac{P(W, Z \mid \lambda_1)}{P(W, Z \mid \lambda_0)} = \frac{P(W \mid \lambda_1)}{P(W \mid \lambda_0)},$$
for each concept Z that is causally separable with W.
Now, in order to work with concept representations (i.e. look at similarities, projections, etc.), we need to define an inner product. They give the following definition:
Definition 3.1 (Causal Inner Product). A causal inner product $\langle \cdot, \cdot \rangle_C$ on $\bar{\Gamma} \simeq \mathbb{R}^d$ is an inner product such that
$$\langle \bar{\gamma}_W, \bar{\gamma}_Z \rangle_C = 0,$$
for any pair of causally separable concepts W and Z.
Note that this definition only constrains the inner products of causally separable pairs, so an inner product that makes essentially all distinct concept directions orthogonal trivially qualifies as a causal inner product. As we show, the whitening transformation they apply as an explicit example of a causal inner product does indeed make almost everything almost orthogonal.
This choice turns out to have the key property that it unifies the unembedding and embedding representations:
Theorem 3.2 (Unification of Representations). Suppose that, for any concept $W$, there exist concepts $\{Z_i\}_{i=1}^{d-1}$ such that each $Z_i$ is causally separable with $W$ and $\{\bar{\gamma}_W\} \cup \{\bar{\gamma}_{Z_i}\}_{i=1}^{d-1}$ is a basis of $\mathbb{R}^d$. If $\langle \cdot, \cdot \rangle_C$ is a causal inner product, then the Riesz isomorphism $\bar{\gamma} \mapsto \langle \bar{\gamma}, \cdot \rangle_C$, for $\bar{\gamma} \in \bar{\Gamma}$, maps the unembedding representation $\bar{\gamma}_W$ of each concept $W$ to its embedding representation $\bar{\lambda}_W$:
$$\langle \bar{\gamma}_W, \cdot \rangle_C = \bar{\lambda}_W^\top.$$
For an explicit example of a causal inner product, they consider the whitening transformation using the covariance matrix of the unembedding vectors as follows:
$$g(y) = \mathrm{Cov}(\gamma)^{-1/2} \left( \gamma(y) - \mathbb{E}[\gamma] \right)$$
where γ is the unembedding vector, E[γ] is the expected unembedding vector, and Cov is the covariance matrix of γ. They show that under this transformation, the embedding and unembedding representations are the same.
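For concreteness, here is a minimal sketch of this whitening step (our own code, not the authors'), applied to a random stand-in unembedding matrix; `W_U` is assumed to be a `(vocab_size, d_model)` array of unembedding vectors.

```python
import numpy as np

def whiten_unembeddings(W_U: np.ndarray) -> np.ndarray:
    """Apply g(y) = Cov(gamma)^{-1/2} (gamma(y) - E[gamma]) to every row of W_U."""
    centered = W_U - W_U.mean(axis=0, keepdims=True)   # gamma(y) - E[gamma]
    cov = np.cov(centered, rowvar=False)               # Cov(gamma), shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)             # symmetric PSD eigendecomposition
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-12))) @ eigvecs.T
    return centered @ inv_sqrt                         # rows are the whitened g(y)

# Stand-in unembedding matrix; the whitened vectors have ~identity covariance.
g = whiten_unembeddings(np.random.randn(1000, 64))
print(np.allclose(np.cov(g, rowvar=False), np.eye(64), atol=1e-6))
```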
Now, for any concept $W$, its vector representation $\bar{\ell}_W$ is defined to be:
$$\bar{\ell}_w = \left( \tilde{g}_w^\top \, \mathbb{E}(g_w) \right) \tilde{g}_w, \quad \text{where} \quad \tilde{g}_w = \frac{\mathrm{Cov}(g_w)^\dagger \, \mathbb{E}(g_w)}{\lVert \mathrm{Cov}(g_w)^\dagger \, \mathbb{E}(g_w) \rVert_2}$$
Given such a vector representation $\bar{\ell}_w$ for binary concepts (where $\ell_{w_1} - \ell_{w_0}$ is the linear representation of $w_0 \Rightarrow w_1$), the following orthogonality relations hold:
$$\begin{aligned}
\ell_w &\perp (\ell_z - \ell_w) && \text{for } z \prec w \\
\ell_w &\perp (\ell_{z_1} - \ell_{z_0}) && \text{for } \{z_0, z_1\} \prec w \\
(\ell_{w_1} - \ell_{w_0}) &\perp (\ell_{z_1} - \ell_{z_0}) && \text{for } \{z_0, z_1\} \prec \{w_0, w_1\} \\
(\ell_{w_1} - \ell_{w_0}) &\perp (\ell_{w_2} - \ell_{w_1}) && \text{for } w_2 \prec w_1 \prec w_0
\end{aligned}$$
This implies that for hierarchical concepts such as mammal $\prec$ animal, we have $\ell_{\text{animal}} \perp (\ell_{\text{mammal}} - \ell_{\text{animal}})$. They prove this relation and empirically validate it by plotting various animal representation points in the 2D span of the vectors for animal and mammal.
Lastly, they show that in their transformed space, categorical features form polytopes in $n$ dimensions. They empirically show that these results hold in the Gemma-2B model and use the WordNet hierarchy to validate them at scale.
Ablations
To study concepts that do not form such semantic categories and hierarchies, we add the following two datasets:
Semantically Correlated Concepts
First, an "emotions" dictionary with various kinds of emotions split into several top-level emotion categories. Note that these categories are not expected to be orthogonal (for instance, joy and sadness should be anti-correlated). We create this via a simple call to ChatGPT:
Random Nonsensical Concepts
Next, we add a "nonsense" dataset that has five completely random categories, where each category is defined by a large number (on the order of 100) of totally random objects, completely unrelated to the top-level categories:
The complete dictionaries for the two ablations we run (emotions and nonsense) are available in an anonymous repository, and the code and hyperparameters we use are exactly the same as those used by the original authors, all of which is available in their public repository.
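Purely for illustration, the dictionaries have roughly the following shape (these example entries are ours and are not the actual entries, which are listed in full in the repository):

```python
# Hypothetical entries, only to illustrate the format of the ablation dictionaries.
emotions = {
    "joy": ["delight", "cheerfulness", "bliss"],
    "sadness": ["sorrow", "grief", "gloom"],
    # ... more top-level emotions, each with dozens of member words
}
nonsense = {
    "category_1": ["stapler", "nebula", "pancake"],
    "category_2": ["umbrella", "quartz", "tuba"],
    # ... five random categories, each with ~100 unrelated objects
}
```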
Hierarchical features are orthogonal - but so are semantic opposites!?
Now, let's look at their main experimental results (for animals):
And this is the first ablation we run -- all emotion words in the 2D span of sadness and emotions:
Specifically, this is what we get for joy vectors in the span of sadness. Note that the orthogonality observed is very similar to that in the case of animal hierarchies.
Should we really have joy so uncorrelated with sadness? Sadness and joy are semantic opposites, so one would expect the vectors to be anti-correlated rather than orthogonal.
Also, here's the same plot but for completely random, nonsensical concepts:
It seems that their orthogonality results, while true for hierarchical concepts, are also true for semantically opposite concepts and totally random concepts. In the next section, we will show theoretically that in high dimensions, random vectors, and in particular those obtained after the whitening transformation, are expected to be trivially orthogonal with very high probability.
Categorical features form simplices - but so do totally random ones!?
Here is the simplex they find animal categories to form (see Fig. 3 in their original paper):
And this is what we get for completely random concepts:
Thus, while categorical concepts form simplices, so do completely random, nonsensical concepts. Again, as we will show theoretically, randomly constructed categories are likely to form simplices and polytopes as well, because it is very easy to escape finite convex hulls in high-dimensional spaces.
Random Unembeddings Exhibit the Same Geometry
Here, we show that under the whitening transformation, even random (untrained) unembeddings exhibit the same geometry as the trained ones. This gives more empirical evidence that the orthogonality and polytope findings are not novel and do not "emerge" during the training of a language model.
Here's the figure showing orthogonality of random unembedding concept vectors:
These are polytopes that form for animal categories even with a completely random, untrained unembedding matrix:
Categorical concepts form polytopes even in random (untrained) unembedding spaces.
Orthogonality and Polytopes in High Dimensions
Here, we theoretically show why the main orthogonality and polytope results in the paper are trivially true in high dimensions. Since the dimension of the residual stream of the Gemma model they use is 2048, we claim that many of their empirical findings are expected by default and do not show anything specific about the geometry of feature embeddings in trained language models.
Orthogonality and the Whitening Transformation
Many of the paper’s claims and empirical findings are about the orthogonality of various linear probes for concepts in unembedding space. Importantly though, "orthogonal" here is defined using an inner product after a whitening transformation. Under this definition, most concept probes are going to end up being almost orthogonal by default.
To explain why this happens, we will first discuss a simplified case where we assume that we are studying the representations in a language model with residual stream width n equal to or greater than the number of tokens in its dictionary k. In this case, all the orthogonality results shown in Theorem 8 of the paper would exactly hold for any arbitrary concept hierarchies we make up. So observing that the relationships in Theorem 8 hold for a set of linear concept probes would not tell us anything about whether the model uses these concepts or not.
Then, we will discuss real models like Gemma-2B that have a residual stream width smaller than the number of tokens in their dictionary. For such models, the results in Theorem 8 would not automatically hold for any set of concepts we make up. But in high dimensions, the theorem would still be expected to hold approximately, with most concept vectors ending up almost orthogonal. Most of the empirical results for orthogonality the paper shows in e.g. Fig. 2 and Fig. 4 are consistent with this amount of almost-orthogonality that would be expected by default.
Case n≥k:
The whitening transformation essentially attempts to make all the vocabulary vectors as orthogonal to each other as possible. When the dimensionality $n$ is at least as large as the number of vectors $k$ (i.e., $n \ge k$), the whitening transformation can make the vectors exactly orthogonal.
Let $x_1, x_2, \ldots, x_k \in \mathbb{R}^n$ be $k \le n$ linearly independent vectors with second-moment matrix $\Sigma = \frac{1}{k} \sum_{i=1}^{k} x_i x_i^\top$. Then the whitening transformation $W = \Sigma^{-1/2}$ (taking the pseudo-inverse square root on the span of the vectors) makes the transformed vectors $y_i = W x_i$ exactly orthogonal, with equal norms:
$$y_i^\top y_j = k \, \delta_{ij},$$
where $\delta_{ij}$ is the Kronecker delta.
Proof sketch:
- Stack the vectors as the columns of $X \in \mathbb{R}^{n \times k}$, so that $\Sigma = \frac{1}{k} X X^\top$ and the whitened vectors are the columns of $Y = \Sigma^{-1/2} X$.
- The Gram matrix of the whitened vectors is
$$Y^\top Y = X^\top \Sigma^{+} X = k \, X^\top (X X^\top)^{+} X = k \, I_k,$$
where the last equality follows from the SVD of $X$ and the fact that $X$ has rank $k$.
- If $n > k$, directions orthogonal to the span of the vectors are unaffected, so the same computation applies.
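A quick numerical check of this claim (a sketch under the assumptions above, using random stand-ins for the unembedding vectors):

```python
import numpy as np

# Whitening k <= n vectors with the pseudo-inverse square root of their
# second-moment matrix makes them exactly orthogonal with equal norms.
rng = np.random.default_rng(0)
n, k = 32, 6
X = rng.standard_normal((n, k))              # k vectors as columns, n-dimensional
Sigma = (X @ X.T) / k                        # second-moment matrix, rank k
eigvals, eigvecs = np.linalg.eigh(Sigma)
keep = eigvals > 1e-10                       # pseudo-inverse square root on the span
inv_sqrt = eigvecs[:, keep] @ np.diag(eigvals[keep] ** -0.5) @ eigvecs[:, keep].T
Y = inv_sqrt @ X                             # whitened vectors as columns
print(np.round(Y.T @ Y, 6))                  # ~ k * identity: exactly orthogonal
```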
This matters because, if the dictionary embeddings are orthogonal, the relationships for concept vectors the paper derives will hold for completely made-up concept hierarchies. They don't have to be related to the structure of the language or the geometry of the original, untransformed unembedding matrix of the model at all.
As an example, consider a dictionary with k=6 tokens and a residual stream of width n=6. The tokens could, for instance, just be the first six letters of the alphabet, namely {a, b, c, d, e, f}. Following the convention of the paper, we will call the unembedding vectors of the tokens ℓa,ℓb,…,ℓf.
Due to the whitening transformation, these vectors will be orthogonal under the causal inner product:
$$\ell_a \cdot \ell_b = 0, \quad \ldots, \quad \ell_e \cdot \ell_f = 0$$
The relationships described in Theorem 8 of the paper will then hold for any hierarchical categorization schemes of concepts defined over these tokens. The concepts do not need to be meaningful in any way, and they do not need to have anything to do with the statistical relationship between the six tokens in the training data.
For example, let us declare the binary concept {blegg, rube}. Tokens {a, b, c} are "bleggs", and tokens {d, e, f} are "rubes". We further categorize each "blegg" as being one of {lant, nant, blip}, making a categorical concept. Token "a" is a "lant", "b" is a "nant" and "c" is a "blip".
We can create a linear probe $\ell_{\text{blegg}}$ that checks whether the current token vector is a 'blegg'. It returns a nonzero value if the token is a 'blegg', and a value of 0 if it is a 'rube' (see Theorem 4 in the paper).
We could train the probe with LDA like the paper does, but in this case, the setup is simple enough that the answer can be found immediately. In the whitened coordinate system, we write:
$$\ell_{\text{blegg}} = \tfrac{1}{3}\left(\ell_a + \ell_b + \ell_c\right)$$
$$\ell_{\text{rube}} = \tfrac{1}{3}\left(\ell_d + \ell_e + \ell_f\right)$$
(Each parent probe is the mean of its member token vectors; with orthogonal, equal-norm whitened token vectors, this is exactly the normalization that makes relation 8 (a) below hold.)
Constructing linear probes for 'lant', 'nant', and 'blip' is also straightforward:
$$\ell_{\text{lant}} = \ell_a, \qquad \ell_{\text{nant}} = \ell_b, \qquad \ell_{\text{blip}} = \ell_c$$
Following the paper's definitions, {'lant', 'nant', 'blip'} is subordinate to {'blegg', 'rube'}. We see that Theorem 8 (a, b, c) of the paper, illustrated in their Figure 2, holds for these vectors:
8 (a) $\ell_{\text{blegg}} \cdot (\ell_{\text{lant}} - \ell_{\text{blegg}}) = 0$
8 (b) $\ell_{\text{blegg}} \cdot (\ell_{\text{lant}} - \ell_{\text{nant}}) = 0$
8 (c) $(\ell_{\text{blegg}} - \ell_{\text{rube}}) \cdot (\ell_{\text{lant}} - \ell_{\text{nant}}) = 0$
So, in an $n$-dimensional space containing unembedding vectors for $n = k$ dictionary elements, Theorem 8 will hold for any self-consistent categorization scheme. Theorem 8 will also keep holding if we replace the unembedding matrix $W_{\text{unembed}} \in \mathbb{R}^{k \times n}$ with a randomly chosen full-rank matrix. Due to the whitening applied by the 'causal inner product', the concepts we make up do not need to have any relationship to the geometry of the unembedding vectors in the model.
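The following sketch (ours, with a random stand-in unembedding matrix and the made-up concept names above) verifies the blegg/rube example numerically: whiten a random full-rank $6 \times 6$ unembedding, build the probes as above, and check the three relations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = k = 6
W_unembed = rng.standard_normal((k, n))     # random full-rank unembedding, rows = tokens a..f
Sigma = (W_unembed.T @ W_unembed) / k       # (uncentered) second-moment matrix, for simplicity
eigvals, eigvecs = np.linalg.eigh(Sigma)
inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
L = W_unembed @ inv_sqrt                    # whitened token vectors l_a ... l_f (rows)

l_blegg, l_rube = L[:3].mean(axis=0), L[3:].mean(axis=0)   # parent probes: means of members
l_lant, l_nant = L[0], L[1]                                # child probes

print(round(float(l_blegg @ (l_lant - l_blegg)), 8))            # 8 (a): ~0
print(round(float(l_blegg @ (l_lant - l_nant)), 8))             # 8 (b): ~0
print(round(float((l_blegg - l_rube) @ (l_lant - l_nant)), 8))  # 8 (c): ~0
```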
Case n<k:
If the model's residual stream is smaller than the number of tokens in its dictionary, as is the case in Gemma-2B and most other models, the whitening transformation cannot make all k unembedding vectors x1,…,xk exactly orthogonal. So Theorem 8 will no longer be satisfied by default for all concept hierarchies and all unembedding matrices we make up.
However, if n,k are large, the whitening transformation might often still be able to make most unembedding vectors almost orthogonal, because random vectors in high dimensional spaces tend to be almost orthogonal by default.
To see why this is the case, consider k random vectors xi∈Rn for i=1,2,…,k drawn from the unit sphere. We will show that as the dimensionality n increases, these vectors become approximately orthogonal.
The expected value of the inner product ⟨xi,xj⟩ is:
$$\mathbb{E}[\langle x_i, x_j \rangle] = \mathbb{E}\left[\sum_{l=1}^{n} x_{il} x_{jl}\right] = \sum_{l=1}^{n} \mathbb{E}[x_{il} x_{jl}]$$
The elements of the vectors have zero mean, so we have $\mathbb{E}[x_{il} x_{jl}] = 0$ for $i \neq j$. Therefore, the expected value of the inner product $\mathbb{E}[\langle x_i, x_j \rangle]$ is 0.
The variance of the inner product is:
$$\mathrm{Var}[\langle x_i, x_j \rangle] = \mathrm{Var}\left[\sum_{l=1}^{n} x_{il} x_{jl}\right] = \sum_{l=1}^{n} \mathrm{Var}[x_{il} x_{jl}] = \frac{1}{n}$$
Thus, in high dimensions, the cosine of the angle $\theta$ between two random vectors has mean 0 and standard deviation $\frac{1}{\sqrt{n}} \to 0$, meaning the vectors will be nearly orthogonal with high probability.
In fact, the Johnson–Lindenstrauss lemma states that the number of almost orthogonal vectors that can be fit in n dimensions is exponential in n.
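A quick numerical illustration of this concentration at the Gemma-2B residual stream width (a sketch of ours, with Gaussian random vectors standing in for whitened unembeddings):

```python
import numpy as np

# Cosines between independent random vectors in n = 2048 dimensions concentrate
# around 0 with standard deviation ~ 1/sqrt(n) ~ 0.022.
rng = np.random.default_rng(2)
n, num_pairs = 2048, 10_000
u = rng.standard_normal((num_pairs, n))
v = rng.standard_normal((num_pairs, n))
cosines = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
print(cosines.mean(), cosines.std(), 1 / np.sqrt(n))   # ~0.000, ~0.022, 0.0221
```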
So, going back to our example in the previous section, if the vectors $\ell_a, \ldots, \ell_f$ are approximately orthogonal instead of exactly orthogonal, then the linear probes for the made-up concepts {'blegg', 'rube'} and {'lant', 'nant', 'blip'} would still mostly satisfy Theorem 8, up to terms of order $O\!\left(\tfrac{1}{\sqrt{n}}\right)$:
8 (a) $\ell_{\text{blegg}} \cdot (\ell_{\text{lant}} - \ell_{\text{blegg}}) = O\!\left(\tfrac{1}{\sqrt{n}}\right)$
8 (b) $\ell_{\text{blegg}} \cdot (\ell_{\text{lant}} - \ell_{\text{nant}}) = O\!\left(\tfrac{1}{\sqrt{n}}\right)$
8 (c) $(\ell_{\text{blegg}} - \ell_{\text{rube}}) \cdot (\ell_{\text{lant}} - \ell_{\text{nant}}) = O\!\left(\tfrac{1}{\sqrt{n}}\right)$
So, orthogonality between linear probes for concepts might be expected by default, up to terms of order $O\!\left(\tfrac{1}{\sqrt{n}}\right)$ that become very small for big models with large residual stream widths $n$. To exceed this baseline, the causal inner product between vector representations would need to be clearly smaller than $O\!\left(\tfrac{1}{\sqrt{n}}\right)$.
High-Dimensional Convex Hulls are Easy to Escape!
As for the polytope results in their paper (Definition 7, Figure 3 ), a random vector is highly likely to be outside the convex hull of k vectors in high dimensions, so we should expect concepts to "form a polytope" by default.
Let $x_1, x_2, \ldots, x_k \in \mathbb{R}^n$ be $k < n$ independent random vectors drawn from the unit sphere. The convex hull of $\{x_1, x_2, \ldots, x_k\}$ is the set of points $\sum_{i=1}^{k} \alpha_i x_i$, where $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$.
Since this polytope is contained in an affine subspace of dimension at most $k - 1 < n$, it has $n$-dimensional Lebesgue measure zero, so the probability that another random vector $z$ lies inside it is zero. In the real world, our vector coordinates are floating point numbers of finite precision rather than real numbers, so the probability of $z$ lying inside the polytope is not exactly zero, but it is still vanishingly small. Thus, it is not surprising that the polytopes spanned by linear probes for $k$ concepts do not intersect with linear probes for other concepts.
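A sketch of this check (ours): testing whether a point lies in the convex hull of $k$ random unit vectors can be posed as a linear programming feasibility problem, and for $k < n$ a fresh random point essentially never lands inside.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(X: np.ndarray, z: np.ndarray) -> bool:
    """Check feasibility of X @ alpha = z, sum(alpha) = 1, alpha >= 0."""
    n, k = X.shape
    A_eq = np.vstack([X, np.ones((1, k))])
    b_eq = np.concatenate([z, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k, method="highs")
    return res.success

rng = np.random.default_rng(3)
n, k = 256, 50
X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=0)          # k random unit vectors as columns
z = rng.standard_normal(n)
z /= np.linalg.norm(z)                  # a fresh random unit vector
print(in_convex_hull(X, z))             # almost surely False for k < n
```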
Even if we were to look at $k > n$ categorical concepts, Theorem 1 in this paper (Bárány and Füredi, 1998) shows that one would need a very, very large number of categories (on the order of $2^{1024}$ for a 2048-dimensional space) before a random vector is likely to lie inside their convex hull.
Discussion
Conclusion
We show that the orthogonality and polytope results observed by recent works are a trivial consequence of the whitening transformation and the high dimensionality of the representation spaces.
A transformation under which opposite concepts appear orthogonal does not seem well suited for studying models: it breaks our semantic model of associating directions with concepts and makes steering in both directions impossible. Thus, more work needs to be done in order to study concept representations in language models.
Wider Context / Related Works
Recent research has found multiple ways to extract latents representing concepts/features in a model's activation space. We highlight some of them:
Linear Contrast/Steering Vectors
If we have a contrast dataset (x+,x−) for a feature or behavior, we can use the contrast activations to get a direction in a given layer's activation space that represents the concept. This can also be used to steer models toward or away from it, as is shown in Representation Engineering.
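As a minimal sketch of this approach (ours, not the code of any particular paper), the steering vector is just the mean difference of layer activations over the contrast pairs; `acts_pos` and `acts_neg` are assumed to be `(num_pairs, d_model)` arrays of residual stream activations collected at some layer for the $x^+$ and $x^-$ prompts.

```python
import numpy as np

def steering_vector(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Contrastive steering direction: mean difference of paired activations."""
    return (acts_pos - acts_neg).mean(axis=0)

# At inference time, adding +alpha * v (or -alpha * v) to that layer's residual
# stream steers the model toward (or away from) the behavior.
```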
One can also extract linear function vectors in the activation space by eliciting in-context learning.
Sparse Autoencoders (SAEs) and Variants
SAEs have been found to be a very scalable method for extracting linear representations for a large number of features by learning a sparse reconstruction of a model's activations. There have been several recent advancements on SAEs in terms of both methodology and scaling.
Unsupervised Steering Vectors
This work uses an unsupervised method to elicit latent behavior from a language model by finding directions in a layer's activations that cause a maximum change in the activations of a future layer.
Non-linear Features
While several important concepts are found to have linear representation latents (possibly due to the highly linear structure of the model's architecture), not all features in a language model are represented linearly, as shown by this work.
Future Work
We hope that our work points out various challenges toward a unified framework for studying model representations and encourages further work from the community in this direction. Some important future directions this leaves us with are:
A framework for how to think about representations that unifies how they're obtained (contrastive activations, PCA, SAEs, etc.), how they're used (by the model), and how they can be used for control (e.g., via steering vectors).
How to figure out how well a given object (a direction, a vector, or even a black-box function over model parameters) represents a given human-interpretable concept or feature.
If orthogonality and simplices are too universal and not specific enough to study the geometry of categorical and hierarchical concepts, then what is a good lens or theory to do so?
Lastly, instead of the whitening transformation (which leads to identity covariance), one can attempt to use an inverse, i.e., a coloring transformation using a covariance matrix that is learned directly from the data.