Ali Shehper
Ali Shehper has not written any posts yet.

This could also be the reason behind the issue mentioned in footnote 5.
Since the feature activation is just the dot product (plus encoder bias) of the concatenated z vector and the corresponding column of the encoder matrix, we can rewrite this as the sum of n_heads dot products, allowing us to look at the direct contribution from each head.
Nice work. But I have one comment.
The feature activation is the output of ReLU applied to this dot product plus the encoder bias, and ReLU is a non-linear function. So it is not clear that we can find the contribution of each head to the feature activation.
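To make the point concrete, here is a minimal sketch with made-up numbers. The names `z`, `w_enc`, and `b_enc` are hypothetical stand-ins for the per-head outputs, the matching slices of one encoder column, and the encoder bias; they are not taken from the post's code. The pre-ReLU value splits exactly into one term per head, but applying ReLU head by head does not reproduce the feature activation.

```python
import numpy as np

# Hypothetical toy values: 2 heads, 2 dimensions per head.
z = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # per-head outputs, shape (n_heads, d_head)
w_enc = np.array([[2.0, 0.0],
                  [0.0, -3.0]])   # one encoder column, split per head
b_enc = 0.0                       # encoder bias for this feature

# Linear part: the dot product over the concatenated vector is a sum
# of per-head dot products.
per_head = (z * w_enc).sum(axis=1)          # contributions: [2, -3]
pre_act = per_head.sum() + b_enc            # -1
assert np.isclose(pre_act, z.reshape(-1) @ w_enc.reshape(-1) + b_enc)

# Non-linear part: ReLU is applied once, to the total.
feature_act = max(pre_act, 0.0)             # 0

# Applying ReLU to each head separately gives a different number.
naive = np.maximum(per_head, 0.0).sum()     # 2
```

In this example the per-head terms are a faithful decomposition of the pre-activation, but they only describe the feature activation itself when the pre-activation is positive.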
Hi Evan, thank you for the explanation, and sorry for the late reply.
I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won't be solved by supplementing SAEs with additional terms in their loss function). I think it's because we could have generated the same dataset with a different choice of basis (though I have neither formalized the argument nor run any experiments).
I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we...
Hey guys, great post and great work!
I have a comment, though. For concreteness, let me focus on the case of (x_2, y_1) composition of features. This corresponds to feature vectors of the form A[0, 1, 1, 0] in the case of correlated feature amplitudes and [0, a, b, 0] in the case of uncorrelated feature amplitudes. Note that the plane spanned by x_2 and y_1 admits an infinite family of orthogonal bases; one of which, for example, is [0, 1, 1, 0] and [0, 1, -1, 0]. When we train a Toy Model of Superposition, we plot the projection of our choice of feature basis as done by Anthropic and also by...
I see. Thanks for the clarification!