Model B has 8 times the aspect ratio [...] which falls under the reported range in Kaplan et al
Nice, this is explained under Figure 5, in particular
The loss varies only a few percent over a wide range of shapes. [...] an $(n_{\text{layer}}, d_{\text{model}})$ = (6, 4288) reaches a loss within 3% of the (48, 1600) model
(I previously missed this point, assumed shape had to be chosen in an optimal way for parameter count to fit the scaling laws.)
2. Features are universal, meaning two models trained on the same data and achieving equal performance must learn identical features.
I would personally be very surprised if this is true in its strongest form. Empirically, different models can find more than one algorithm that achieves minimal loss [even on incredibly simple tasks like modular addition](https://arxiv.org/pdf/2306.17844.pdf).
As a side note, my understanding is that if you have two independent real features A and B which both occur with non-trivial frequency, and the SAE is sufficiently wide, the SAE may learn features `A&B`, `A&!B` and `!A&B` rather than simply learning features `A` and `B`, because that yields better L0 loss (since L0 loss should be `P(A) + P(B)` for the `[A, B]` feature set, and `P(A&B) + P(A&!B) + P(!A&B) == P(A) + P(B) - P(A&B)` for the `[A&B, A&!B, !A&B]` feature set). The `[A, B]` representation still has better sparsity, by the Elhage et al. definition, but I don't think that necessarily means that the `[A, B]` representation corresponds to minimal L0 loss. Not sure how much of an issue this is in practice though. The post "Do sparse autoencoders find true features" has a bunch more detail on this sort of thing.
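To make the arithmetic in this comment concrete, here is a minimal sketch comparing the expected number of active latents (L0) under the two dictionaries. It assumes A and B are independent, and the function names and example frequencies are purely illustrative (they do not come from any SAE library or from the post).

```python
def l0_separate(p_a: float, p_b: float) -> float:
    """Expected number of active latents for the dictionary [A, B]."""
    return p_a + p_b


def l0_conjunctions(p_a: float, p_b: float) -> float:
    """Expected number of active latents for [A&B, A&!B, !A&B].

    Assumes A and B are independent, so P(A&B) = p_a * p_b.
    The three events are mutually exclusive, so at most one latent fires.
    """
    p_both = p_a * p_b
    p_a_only = p_a * (1 - p_b)
    p_b_only = (1 - p_a) * p_b
    return p_both + p_a_only + p_b_only


# Illustrative frequencies (assumptions, not measurements).
p_a, p_b = 0.1, 0.2
print("L0 for [A, B]:", l0_separate(p_a, p_b))
print("L0 for [A&B, A&!B, !A&B]:", l0_conjunctions(p_a, p_b))
```

The gap between the two is exactly `P(A&B)`, so the incentive to split is small when the features rarely co-occur.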
This might all be academic if (i.e. the dimension of the feed-forward layer is big enough that you run out of meaningful features long before you run out of space to store them).
This might all be academic if (i.e. the dimension of the feed-forward layer is big enough that you run out of meaningful features long before you run out of space to store them).
Thanks for the feedback, this is a great point! I haven't come across evidence in real models which points towards this. My default assumption was that they are operating near the upper bounds of superposition capacity possible. It would be great to know if they aren't, as it affects how we estimate the number of features and subsequently the SAE expansion factor.
It would be great to know if they aren't, as it affects how we estimate the number of features and subsequently the SAE expansion factor.
My impression from people working on SAEs is that the optimal number of features is very much an open question. In Towards Monosemanticity they observe that different numbers of features work fine; you just get feature splitting / collapse as you go bigger / smaller.
The scaling laws are not mere empirical observations
This seems like a strong claim; are you aware of arguments or evidence for it? My impression (not at all strongly held) was that it's seen as a useful rule of thumb that may or may not continue to hold.
Summary
Using results from scaling laws, this short note argues that the following two statements cannot be simultaneously true:
Scaling laws[1] for language models give us a relation between a model's macroscopic properties: the cross-entropy loss $L$, the amount of data $D$ used, and the number of non-embedding parameters $N$ in the model.
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}$$
where $N_c$, $D_c$, $\alpha_N$, and $\alpha_D$ are constants for a given task such as language modeling.
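As a rough illustration of how this relation behaves, the sketch below evaluates the Kaplan et al. form of $L(N, D)$. The default constants are of the order reported in that paper but should be treated as illustrative assumptions here, and `kaplan_loss` is a hypothetical helper name.

```python
def kaplan_loss(
    n_params: float,
    n_tokens: float,
    n_c: float = 6.4e13,     # assumed constant, illustrative
    d_c: float = 1.8e13,     # assumed constant, illustrative
    alpha_n: float = 0.076,  # assumed exponent, illustrative
    alpha_d: float = 0.103,  # assumed exponent, illustrative
) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d


# Loss falls as either parameters or data grow, holding the other fixed.
for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}, D=1e10 -> L={kaplan_loss(n, 1e10):.3f}")
```

With effectively unlimited data the second term vanishes and the expression reduces to $(N_c/N)^{\alpha_N}$, a pure power law in parameter count.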
The scaling laws are not mere empirical observations and can be seen as predictive laws on the limits of language model performance. During the training of GPT-4, OpenAI[2] was able to predict the final loss of GPT-4 with high accuracy early in the training process using scaling laws.
An important detail is that the relation is expressed in terms of the number of parameters. It's natural to think of a model's computational capacity in terms of parameters, as they are the fundamental independent variables that the model can tune during learning. The amount of computation that a model performs for each input token in the forward pass is also estimated to be approximately $2N$ FLOPs.[1]
Let's compare this with interpretability, where the representation of a feature is defined in terms of neurons or groups of neurons[3]. At first glance, it might seem unnecessary to distinguish between computational capacity and feature representational capacity, as parameters are connections between neurons after all. However, we can change the number of neurons in a model while keeping the number of parameters constant. Kaplan et al.[1] found that Transformer performance depends very weakly on the shape parameters $n_{\text{layer}}$ (number of layers), $n_{\text{heads}}$ (number of attention heads), and $d_{\text{ff}}$ (feedforward layer dimension) when we hold the total non-embedding parameter count $N$ fixed. The paper reports that the aspect ratio (the ratio of the number of neurons per layer to the number of layers) can vary by more than an order of magnitude, with performance changing by less than 1%.[4]
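To see how different shapes can land on similar parameter counts, here is a small sketch using the approximation $N \approx 12\, n_{\text{layer}}\, d_{\text{model}}^2$ from footnote 7. The two shapes are the ones quoted from Kaplan et al.'s Figure 5 in the comments; the helper names are hypothetical.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count, N ~ 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2


def aspect_ratio(n_layer: int, d_model: int) -> float:
    """Neurons per layer divided by number of layers."""
    return d_model / n_layer


shapes = {"deep": (48, 1600), "wide": (6, 4288)}  # (n_layer, d_model)
for name, (n_layer, d_model) in shapes.items():
    print(
        name,
        "params:", non_embedding_params(n_layer, d_model),
        "aspect ratio:", round(aspect_ratio(n_layer, d_model), 1),
        "total neurons:", n_layer * d_model,
    )
```

The two parameter counts differ by roughly 10%, while the aspect ratios differ by a factor of about 21 and the total neuron counts by a factor of about 3.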
In this work, we assume the above to be true: the number of parameters is the true limiting factor, and similar model performance can be achieved over a range of aspect ratios. We then apply this as a postulate to the superposition hypothesis[5], our current best and most successful theory of feature representation, and explore the implications.
The superposition hypothesis states that models can pack more features than the number of neurons they have. There will be interference between the features as they can't be represented orthogonally, but when the features are sparse enough, the benefit of representing a feature outweighs the cost of interference. Concretely, given a layer of activations of m neurons, we can decompose it linearly into activations of n features, where n>m, as:
$$\text{activation}_{\text{layer}} = x_{f_1} W_{f_1} + x_{f_2} W_{f_2} + \cdots + x_{f_n} W_{f_n}$$
where $\text{activation}_{\text{layer}}$ and $W_{f_i}$ are vectors of size $m$, and $x_{f_i}$ represents the magnitude of activation of the $i$-th feature. Sparsity means that for a given input, only a small fraction of features are active, which means $x_{f_i}$ is non-zero for only a few values of $i$.
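The decomposition above can be sketched in a few lines of plain Python. This is a toy illustration only: the random unit directions, the sizes `m` and `n`, and the activation probability `density` are all assumptions chosen for the example, not values from any real model.

```python
import random


def random_unit_vector(m: int, rng: random.Random) -> list[float]:
    """A random direction in m-dimensional neuron space."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = sum(c * c for c in v) ** 0.5
    return [c / norm for c in v]


def layer_activation(x: list[float], W: list[list[float]]) -> list[float]:
    """Compute sum_i x[i] * W[i], where W[i] is the direction of feature i."""
    m = len(W[0])
    act = [0.0] * m
    for x_i, w_i in zip(x, W):
        if x_i != 0.0:  # inactive features contribute nothing
            for j in range(m):
                act[j] += x_i * w_i[j]
    return act


rng = random.Random(0)
m, n = 8, 32     # n > m: more features than neurons
density = 0.05   # assumed probability that a feature is active on an input
W = [random_unit_vector(m, rng) for _ in range(n)]
x = [rng.random() if rng.random() < density else 0.0 for _ in range(n)]

activation = layer_activation(x, W)
print("active features:", [i for i, x_i in enumerate(x) if x_i != 0.0])
print("layer activation:", [round(a, 3) for a in activation])
```

Because the 32 directions cannot all be orthogonal in 8 dimensions, reading a feature back out with a dot product picks up interference from whichever other features are active.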
Case study on changing Aspect Ratio
Let's consider two models, Model A and Model B, having the same macroscopic properties. Both have an equal number of non-embedding parameters, are trained on the same dataset, and achieve similar loss according to scaling laws. However, their shape parameters differ. Using the same notation as Kaplan et al.[1], let's denote the number of layers as $n_{\text{layer}}$ and the number of neurons per layer as $d_{\text{model}}$[6]. Model B has twice the number of neurons per layer compared to A. As the number of parameters is approximated[1] by $d_{\text{model}}^2\, n_{\text{layer}}$,[7] Model B must have $\frac{1}{4}$ the number of layers to maintain the same number of parameters as Model A. This means Model B has 8 times the aspect ratio ($d_{\text{model}} / n_{\text{layer}}$) of A, which falls within the reported range in Kaplan et al.
The total number of neurons in a model is calculated by multiplying the number of neurons per layer by the number of layers. As a result, Model B has half the total number of neurons compared to Model A.
Now, let's apply the superposition hypothesis, which states that features can be linearly represented in each layer. Since both models achieve equal loss on the same dataset, it's reasonable to assume that they have learned the same features. Let's denote the total number of features learned by both models as F.
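The above three paragraphs are summarized in the table below:

| | Model A | Model B |
|---|---|---|
| Total parameters | $N$ | $N$ |
| Neurons per layer | $d_{\text{model}}$ | $2\, d_{\text{model}}$ |
| Number of layers | $n_{\text{layer}}$ | $n_{\text{layer}} / 4$ |
| Total neurons | $d_{\text{model}}\, n_{\text{layer}}$ | $d_{\text{model}}\, n_{\text{layer}} / 2$ |
| Total features learned | $F$ | $F$ |
| Features per layer | $F / n_{\text{layer}}$ | $4F / n_{\text{layer}}$ |
| Features per neuron | $F / (d_{\text{model}}\, n_{\text{layer}})$ | $2F / (d_{\text{model}}\, n_{\text{layer}})$ |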
The average number of features per neuron is calculated by dividing the number of features per layer by the number of neurons per layer. In Model B, this value is twice as high as in Model A, which means that Model B is effectively compressing twice as many features per neuron, in other words, there's a higher degree of superposition. However, superposition comes with a cost of interference between features, and a higher degree of superposition requires more sparsity.
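The bookkeeping for the two models can be checked with a short sketch. The concrete sizes and the feature count below are hypothetical numbers chosen only so that Model B has twice the width and a quarter of the depth of Model A, and the sketch assumes features are spread evenly across layers.

```python
from fractions import Fraction


def model_stats(d_model: int, n_layer: int, total_features: int) -> dict:
    """Shape-derived quantities used in the argument (constant factors ignored)."""
    neurons = d_model * n_layer
    return {
        "params": d_model ** 2 * n_layer,  # proportional to parameter count
        "aspect_ratio": Fraction(d_model, n_layer),
        "neurons": neurons,
        "features_per_layer": Fraction(total_features, n_layer),
        "features_per_neuron": Fraction(total_features, neurons),
    }


F = 1_000_000  # hypothetical total number of features
model_a = model_stats(d_model=1024, n_layer=48, total_features=F)
model_b = model_stats(d_model=2048, n_layer=12, total_features=F)

for key in model_a:
    print(f"{key}: B / A = {Fraction(model_b[key]) / Fraction(model_a[key])}")
```

The ratios printed are independent of the particular sizes chosen, as long as B doubles the width and quarters the depth.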
Elhage et al.[5] show that, using lower bounds from compressed sensing[8], if we want to recover $n$ features compressed into $m$ neurons (where $n > m$), the bound is $m = \Omega(-n(1-S)\log(1-S))$, where $S$ is the sparsity of the features and $1-S$ is the probability that a feature is non-zero. For example, if a feature is non-zero only 1 in 100 times, then $1-S$ equals 0.01. We can define the degree of superposition as $\frac{n}{m} = \frac{-1}{(1-S)\log(1-S)}$, which is a function of sparsity, in line with our theoretical understanding.
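To get a feel for how the degree of superposition depends on sparsity, here is a minimal sketch that evaluates $n/m = -1 / ((1-S)\log(1-S))$. It ignores the constant hidden in the $\Omega$ bound and uses the natural logarithm, both of which are assumptions, so only the trend is meaningful; `degree_of_superposition` is a hypothetical helper name.

```python
import math


def degree_of_superposition(density: float) -> float:
    """n/m implied by m = -n * (1-S) * log(1-S), with density = 1 - S.

    Constant factors are ignored. The expression is only meaningful in the
    sparse regime (density well below 1/e).
    """
    if not 0.0 < density < 1.0:
        raise ValueError("density must be strictly between 0 and 1")
    return -1.0 / (density * math.log(density))


for density in (0.1, 0.01, 0.001):
    print(f"1 - S = {density}: n/m ~ {degree_of_superposition(density):.1f}")
```

In the sparse regime a lower density always allows more features per neuron, which is why doubling the features per neuron would require the features themselves to be sparser.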
So Model B, with its higher degree of superposition, should have sparser features compared to Model A. But the sparsity of a feature is a property of the data itself, and the same feature can't be sparser in Model B if both models are trained on the same data. This suggests that they are not the same features, which breaks our initial assumption that the two models learn the same features. So either our starting assumption of feature representation through superposition or that of feature universality needs revision. In the next section, we discuss how we might modify our assumptions.
Discussion
To recap, we started with the postulate that model performance is invariant over a wide range of aspect ratios and arrived at an inconsistency between superposition and feature universality. Though we framed the argument through the lens of superposition, the core issue is that the model's computational capacity is a function of parameters, whereas the model's representational capacity is a function of total neurons.
A useful, though non-rigorous, analogy is to visualize a solid cylinder of radius $d_{\text{model}}$ and height $n_{\text{layer}}$. The volume (parameters) of the cylinder can be thought of as computational capacity, whereas features are represented on the surface (neurons). We can change the aspect ratio of the cylinder while keeping the volume constant by stretching or squashing it. This changes the surface area accordingly. Though this analogy doesn't include sparsity, it captures the essentials of the argument in a simple way.
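Spelling out the geometry, and taking "surface" to mean the lateral surface of the cylinder (an assumption on my part, since the end caps have no obvious counterpart in the model), with $r = d_{\text{model}}$ and $h = n_{\text{layer}}$:

$$
\begin{aligned}
V &= \pi r^2 h \;\propto\; d_{\text{model}}^2\, n_{\text{layer}} \quad \text{(parameters)}, \\
A_{\text{side}} &= 2 \pi r h \;\propto\; d_{\text{model}}\, n_{\text{layer}} \quad \text{(neurons)}, \\
r' = 2r,\ V' = V \;&\Rightarrow\; h' = \tfrac{h}{4}, \qquad A'_{\text{side}} = 2\pi (2r)\tfrac{h}{4} = \tfrac{1}{2} A_{\text{side}}.
\end{aligned}
$$

Doubling the radius at fixed volume halves the lateral surface, mirroring how Model B ends up with half the neurons of Model A.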
Coming to solutions, I do not have one that's consistent with scaling laws, superposition hypothesis and feature universality, but will speculate on what a possible one might look like.
Schemes of Compression Alternative to Superposition: A crude and simple way to make the total number of features a function of parameters is to add a square term to the compressed sensing bounds, so that they become $n = m^2 \cdot f(1-S)$. But this would require a completely new compression scheme compared to superposition. Methods such as dictionary learning, which disentangle features assuming the superposition hypothesis, have been successful at extracting interpretable features. So it's not ideal to ignore superposition; representation schemes whose first-order approximation looks like superposition might be more viable.
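As a quick check of why a square term would help, assume (purely for illustration) that every layer stores $n = m^2 f(1-S)$ features with $m = d_{\text{model}}$, and that $f(1-S)$ is the same in every layer. The total feature count then scales with the parameter count rather than the neuron count:

$$
\begin{aligned}
F &= n_{\text{layer}} \cdot d_{\text{model}}^2 \cdot f(1-S) \;\propto\; N, \\
F_B &= \frac{n_{\text{layer}}}{4} \cdot (2\, d_{\text{model}})^2 \cdot f(1-S) = n_{\text{layer}} \cdot d_{\text{model}}^2 \cdot f(1-S) = F_A .
\end{aligned}
$$

Under these assumptions the two models from the case study would hold the same number of features at the same sparsity.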
This isn't to say there's nothing we can improve on in the superposition hypothesis. Although dictionary learning features in Bricken et al.[9] are much more mono-semantic than individual neurons, the lower activation levels in these features still look quite polysemantic.
Cross Layer Superposition: Previously, we looked for features in a single neuron[10]; now we have extended this to a group of neurons in a layer. A natural progression is to look for features localised to neurons across multiple layers. But Model B from the above section has half as many neurons as A, and the same inconsistencies would arise if the number of features grows linearly with the number of neurons. The number of features represented across two or more layers by cross-layer superposition would have to grow superlinearly for Model B to compensate for having fewer neurons and still have the same representational capacity.
Acknowledgements
I'm thankful to Jeffrey Wu and Tom McGrath for their helpful feedback on this topic. Thanks to Vinay Bantupalli for providing feedback on the draft. Any mistakes in content or ideas are my own, not those of the acknowledged.
[1] Scaling Laws for Neural Language Models. Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020.
[2] GPT-4 Technical Report. OpenAI, 2023.
[3] Intentionally left out defining a feature, as there's no universally accepted formal definition. Refer to Neel Nanda's explainer for a good review.
[4] Refer to Sec. 3.1 and Figure 5 in Kaplan et al.[1]
[5] Toy Models of Superposition. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M. and Olah, C., 2022. Transformer Circuits Thread.
[6] Usually also referred to as the dimension of the model.
[7] The number of non-embedding parameters is approximately $12\, d_{\text{model}}^2\, n_{\text{layer}}$. For simplicity, we can ignore the constant factor.
[8] Lower bounds for sparse recovery. Ba, K.D., Indyk, P., Price, E. and Woodruff, D.P., 2010, January. In Proceedings of the twenty-first annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1190-1197). Society for Industrial and Applied Mathematics.
[9] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J.E., Hume, T., Carter, S., Henighan, T. and Olah, C., 2023. Transformer Circuits Thread.
[10] Unsupervised sentiment neuron. Radford, A., Jozefowicz, R. and Sutskever, I., 2017.