Ali Shehper — LessWrong

Replying to Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper, 2y

I see. Thanks for the clarification!

Replying to Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper, 2y

This could also be the reason behind the issue mentioned in footnote 5.

Replying to Sparse Autoencoders Work on Attention Layer Outputs

Ali Shehper, 2y

"Since the feature activation is just the dot product (plus encoder bias) of the concatenated z vector and the corresponding column of the encoder matrix, we can rewrite this as the sum of n_heads dot products, allowing us to look at the direct contribution from each head."

Nice work. But I have one comment.

The feature activation is the output of ReLU applied to this dot product plus the encoder bias. Since ReLU is non-linear, the per-head decomposition holds for the pre-activation but does not carry over directly to the activation itself. So it is not clear that we can find the contribution of each head to the feature activation.
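To make the concern concrete, here is a minimal NumPy sketch with hand-picked toy numbers. The names `per_head_contributions`, `feature_activation`, `z`, `w_enc_col`, and `b_enc` are hypothetical and are not taken from the post's code. It shows that the pre-activation splits exactly into per-head dot products, while the ReLU output can be zero even though one head contributes a positive amount.

```python
import numpy as np

def per_head_contributions(z, w_enc_col):
    """Dot product of each head's z vector with its slice of the encoder column."""
    return np.einsum("hd,hd->h", z, w_enc_col)

def feature_activation(z, w_enc_col, b_enc):
    """Return (pre-activation, ReLU activation) for a single SAE feature."""
    pre_act = per_head_contributions(z, w_enc_col).sum() + b_enc
    return pre_act, max(pre_act, 0.0)

# Toy values chosen by hand (assumption): 2 heads, d_head = 2.
z = np.array([[1.0, 0.0], [0.0, 1.0]])           # per-head z vectors
w_enc_col = np.array([[2.0, 0.0], [0.0, -3.0]])  # one encoder column, split by head
b_enc = 0.0                                      # encoder bias for this feature

contribs = per_head_contributions(z, w_enc_col)  # head 0 gives +2, head 1 gives -3
pre_act, act = feature_activation(z, w_enc_col, b_enc)

# The per-head sum matches the dot product over the concatenated vectors.
concat_pre_act = z.reshape(-1) @ w_enc_col.reshape(-1) + b_enc

print("per-head contributions:", contribs)
print("pre-activation:", pre_act, "| after ReLU:", act)
```

Here the per-head terms sum to a negative pre-activation, so the feature does not fire, yet head 0 alone has a positive term. The linear attribution is well defined before the ReLU; what it means after the ReLU is the open question.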

Replying to Sparse autoencoders find composed features in small toy models

Ali Shehper, 2y

Hi Evan, thank you for the explanation, and sorry for the late reply.

I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won't be solved by supplementing SAEs with additional terms in their loss function). I think it's because we could have generated the same dataset with a different choice of basis (though I have neither tried to formalize the argument nor run any experiments).

I also want to say that perhaps failing to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we...

Replying to Sparse autoencoders find composed features in small toy models

Ali Shehper, 2y

Hey guys, great post and great work!

I have a comment, though. For concreteness, let me focus on the (x_2, y_1) composition of features. This corresponds to feature vectors of the form A[0, 1, 1, 0] in the case of correlated feature amplitudes and [0, a, b, 0] in the case of uncorrelated feature amplitudes. Note that the plane spanned by x_2 and y_1 admits an infinite family of orthogonal bases, one of which is, for example, [0, 1, 1, 0] and [0, 1, -1, 0]. When we train a Toy Model of Superposition, we plot the projection of our choice of feature basis as done by Anthropic and also by...
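The point about the infinite family of orthogonal bases can be checked numerically. The sketch below is illustrative only: it assumes x_2 and y_1 are the second and third standard basis vectors of a 4-dimensional feature space, normalizes the [0, 1, 1, 0] and [0, 1, -1, 0] directions, and uses a hypothetical helper `rotated_basis` and arbitrary amplitudes `a`, `b`.

```python
import numpy as np

# Assumed embedding: x_2 and y_1 as standard basis vectors in 4 dimensions.
x2 = np.array([0.0, 1.0, 0.0, 0.0])
y1 = np.array([0.0, 0.0, 1.0, 0.0])

# The alternative basis from the comment, normalized to unit length.
u = np.array([0.0, 1.0, 1.0, 0.0]) / np.sqrt(2)
v = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)

def rotated_basis(theta):
    """Orthonormal basis of span(x2, y1), rotated by angle theta within the plane."""
    e1 = np.cos(theta) * x2 + np.sin(theta) * y1
    e2 = -np.sin(theta) * x2 + np.cos(theta) * y1
    return e1, e2

# A data point with uncorrelated amplitudes [0, a, b, 0] (arbitrary values).
a, b = 0.3, 0.9
point = a * x2 + b * y1

# The same point expressed in the (u, v) basis.
c_u, c_v = point @ u, point @ v
reconstructed = c_u * u + c_v * v

# Any rotation angle gives another valid orthonormal basis of the same plane.
e1, e2 = rotated_basis(0.7)
reconstructed_rot = (point @ e1) * e1 + (point @ e2) * e2

print("coefficients in (u, v):", c_u, c_v)
print("same point recovered:", np.allclose(point, reconstructed))
```

Every point in the plane has exact coordinates in each of these bases, so the points alone do not single out (x_2, y_1) as the preferred pair of directions.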