All of Lee Sharkey's Comments + Replies

Very interesting to hear that you've been working on similar things! Excited to see results when they're ready. 

RE synthetic data: I'm a bit less confident in this method of data generation after the feedback below (see Tom Lieberum's and Ryan Greenblatt's comments). It may lose some 'naturalness' compared with the way the encoder in the 'Toy Models of Superposition' setup puts one-hot features in superposition. It's unclear whether that matters for the aims of this particular set of experiments, though.
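For reference, here is a minimal sketch of the toy-models-style generator being contrasted here (my reconstruction of that kind of setup; all sizes and names are illustrative, not taken from the post):

```python
# Hedged sketch, not code from the post: a 'Toy Models of Superposition'-style
# generator. Sparse feature vectors are squeezed through a learned linear map W
# into a smaller hidden space, so features end up in superposition rather than
# being placed there by hand. n_features, d_hidden, sparsity are illustrative.
import torch

n_features, d_hidden, batch = 64, 16, 1024
sparsity = 0.95  # probability that any given feature is off

# Sparse, independently activated features in [0, 1]
feats = torch.rand(batch, n_features)
mask = (torch.rand(batch, n_features) > sparsity).float()
x = feats * mask

# Tied-weight autoencoder as in the toy-models setup: h = Wx, x_hat = ReLU(W^T h + b)
W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))

h = x @ W.T                    # n_features features compressed into d_hidden dims
x_hat = torch.relu(h @ W + b)  # reconstruction; training this loss forces superposition
loss = ((x_hat - x) ** 2).mean()
```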

RE metrics: It's interesting to hear about your altern... (read more)

We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.
Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info on where your assumptions in the toy model fall short.

 

This is valid; they're not well fleshed out above. I'll take a stab at it below, and I discussed it a bit with Ryan below his comment. Meta-q: are you primarily asking for better assumptions, or for the existing ones to be made more explicit?

RE MLP activations distrib... (read more)

2Tom Lieberum
I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate, or why you think that the true distribution has property XYZ.

Re MLPs: I agree that we ideally want something general, but it looks like your post is evidence that something about the assumptions is wrong and doesn't transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions doesn't hold there. If you have a toy model that better represents the true distribution, then you can confidently iterate on methods via the toy model.

I was actually thinking of the LM when writing this, but yeah, the autoencoder itself might also be a problem. Great to hear you're thinking about that.
1Tom Lieberum
(ETA to the OC: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)

This equation describes (almost) linear regression on a particular feature space ϕ(x):

This approximation isn't obvious to me. It holds if f(x,θ0)≈0 and θ0≈0, but these aren't stated. Are they true?

4Jeremy Gillen
Yeah good point, I should have put more detail here. My understanding is that, for most common initialization distributions and architectures, f(x,θ0)=0 and ϕ(x)⋅θ0=0 in the infinite width limit. This is because they both end up being expectations of random variables that are symmetrically distributed around 0. However, in the finite width regime, if we want to be precise, we can simply add those terms back onto the kernel regression. So really, with finite width:

f_linear(x,θ) = f(x,θ0) + K(x,X)K⁻¹(X,X)Y − ∇θf(x,θ0)⋅θ0

There are a few other very non-rigorous parts of our explanation. Another big one is that ϕ(x)⋅θ is underspecified by the data in the infinite width limit, so it could fit the data in lots of ways. Stuff about ridge regularized regression and bringing in details about gradient descent fixes this, I believe, but I'm not totally sure whether it changes anything at finite width.
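Spelling out the step being discussed, in the same notation (my reconstruction of the standard NTK-style linearization, not a quote from the post):

```latex
% The linearization being discussed, with \phi(x) := \nabla_\theta f(x,\theta_0):
\begin{align}
  f_{\mathrm{linear}}(x,\theta)
    &= f(x,\theta_0) + \nabla_\theta f(x,\theta_0)\cdot(\theta-\theta_0) \\
    &= f(x,\theta_0) + \phi(x)\cdot\theta - \phi(x)\cdot\theta_0 .
\end{align}
% If f(x,\theta_0) \approx 0 and \phi(x)\cdot\theta_0 \approx 0 (the idealised
% infinite-width case), only the kernel-regression term K(x,X)K^{-1}(X,X)Y is left;
% at finite width the two dropped terms are added back, as in the expression above.
```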

No, they exist in different spaces: Polytopes in our work are in activation space whereas in their work the polytopes are in the model weights (if I understand their work correctly).

Thanks for your interest in our post and your questions!

Correct me if I'm wrong, but it struck me while reading this that you can think of a neural network as learning two things at once…

That seems right!

Can the functions and classes be decoupled? … Could you come up with some other scheme for choosing between a whole bunch of different linear transformations?

It seems possible to come up with other schemes that do this; it just doesn’t seem easy to come up with something that is competitive with neural nets. If I recall correctly, there’s work in prev... (read more)

2ESRogs
Thanks for the answers!

Thanks for your comment!

However, I don't really see how you'd easily extend the polytope formulation to activation functions that aren't piecewise linear, like tanh or logits, while the functional analysis perspective can handle that pretty easily. Your functions just become smoother.

Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Balestriero & Baraniuk (2018): https://arxiv.org/abs/1810.09274

In the case of GELU and some similar activation functions, you'd need to replace the bina... (read more)

For GPT2-small, we selected 6/1024 tokens in each sequence (evenly spaced apart and not including the first 100 tokens), and clustered on the entire MLP hidden dimension (4 * 768).

For InceptionV1, we clustered the vectors corresponding to all the channel dimensions at a single fixed spatial position (i.e. one vector of size [n_channels] per image).
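For concreteness, a rough sketch of the GPT2-small version of this procedure (the placeholder activations, the choice of k-means, and the cluster count are all illustrative assumptions, not the exact pipeline used for the post):

```python
# Hedged sketch of the GPT2-small clustering setup described above.
import numpy as np
from sklearn.cluster import KMeans

seq_len, d_mlp = 1024, 4 * 768  # GPT2-small MLP hidden width
# 6 evenly spaced positions per sequence, skipping the first 100 tokens
positions = np.linspace(100, seq_len - 1, 6, dtype=int)

# acts: [n_sequences, seq_len, d_mlp] MLP hidden activations (random placeholder here)
acts = np.random.randn(32, seq_len, d_mlp).astype(np.float32)
vectors = acts[:, positions, :].reshape(-1, d_mlp)  # one row per selected token

labels = KMeans(n_clusters=8, n_init=10).fit_predict(vectors)
```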

2Neel Nanda
Thanks! So, I was trying to disentangle two claims: first, "if examples are semantically similar (either similar patches of images, or words in similar contexts re predicting the next token), the model learns to map their full representations to be close to each other"; and second, "if we pick a specific direction, the projection onto this direction is polysemantic. But it actually intersects many meaningful polytopes, and if we cluster the projection onto this direction (converting examples to scalars) we get clusters in this 1D space, and each cluster is clearly meaningful."

IMO the first is pretty intuitive, while the second would be surprising and strong evidence for the polytope hypothesis. If I'm understanding correctly, you're presenting the first one here? Did you investigate the second at all? I'd love to see the results!
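For concreteness, a minimal sketch of the second experiment described here (all names, sizes, and the clustering choice are illustrative placeholders):

```python
# Project activations onto one fixed direction, cluster the resulting scalars,
# then check whether each 1D cluster is semantically meaningful.
import numpy as np
from sklearn.cluster import KMeans

acts = np.random.randn(5000, 3072).astype(np.float32)  # placeholder activations
direction = np.random.randn(3072).astype(np.float32)
direction /= np.linalg.norm(direction)

proj = acts @ direction                                 # one scalar per example
labels = KMeans(n_clusters=5, n_init=10).fit_predict(proj.reshape(-1, 1))
# Next step (not shown): inspect the examples in each 1D cluster and ask whether
# they share an interpretable feature.
```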

Thanks for your comment!

RE non-ReLU activation functions:

Extending the polytope lens to Swish or GELU activation functions is, fortunately, the subject of a paper by Balestriero & Baraniuk (2018): https://arxiv.org/abs/1810.09274

We wrote a few sentences about this at the end of Appendix C:

"In summary - smooth activation functions must be represented with a probabilistic spline code rather than a one-hot binary code. The corresponding affine transformation at the input point is then a linear interpolation of the entire set of affine transformations, weig... (read more)

1Jon Garcia
Here is a paper that addresses using activation functions that bound the so-called "open space": Improved Adversarial Robustness by Reducing Open Space Risk via Tent Activations. According to the paper, causing all unbounded polytopes to have a zero affine transformation at extreme values improves adversarial robustness.

By the way, although the tent activation function prevents monotonic growth in the direction perpendicular to the decision hyperplane, I haven't heard of any activation function that prevents the neuron from being active when the input goes too far out of distribution in a direction parallel to the hyperplane. It might be interesting to explore that angle.
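For intuition, a minimal sketch of a tent-shaped activation (the exact functional form and the parameter delta are my guess at the idea, not necessarily the paper's precise definition):

```python
# A tent-shaped activation decays back to zero for large |x|, so inputs far from
# the boundary land in a zero affine region.
import torch

def tent(x: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    return torch.clamp(delta - x.abs(), min=0.0)

x = torch.linspace(-3, 3, 7)
print(tent(x))  # zero outside [-delta, delta], peaked at x = 0
```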
  1. Currently there are no plans to release the code because much of it relies on internal infrastructure. The theory straightforwardly extends to larger networks, but we’re currently not sure if there will be (further) practical hurdles there.
  2. Polytope boundaries do extend further out. The shell doesn’t imply that they stop; the shell simply seems to be a region that many boundaries tend to pass through.
  3. Thanks!

This is one of the major research questions that will be important to answer before polytopes can be really useful in mechanistic descriptions.

By choosing to use clustering rather than dimensionality reduction methods, we took a non-decompositional approach here. Clustering was motivated primarily by wanting to capture the monosemanticity of local regions in neural networks. But the ‘monosemanticity’ that I’m talking about here refers to the fact that small regions of activation space mean one thing on one level of abstraction; this ‘one thing’ could be a combin... (read more)

Thanks for your interest!

Shouldn't this create strong regularisation favouring using meaningful directions over meaningful polytopes?

Yes, that seems reasonable!

One thing we want to emphasize is that it's perfectly possible to have both meaningful directions and meaningful polytopes. For instance, if all polytope boundaries intersect the origin, then all polytopes will be unbounded. In that case, polytopes will essentially be directions!
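As a small illustration of that point (sizes and names arbitrary, not taken from the post), a sketch of why boundaries through the origin make polytopes behave like directions:

```python
# With zero biases, every ReLU boundary passes through the origin, so the
# activation pattern, and hence the polytope an input lies in, depends only on
# the input's direction, not its norm. Each polytope is then an unbounded cone,
# i.e. essentially a direction.
import torch

torch.manual_seed(0)
W1 = torch.randn(32, 8)   # no bias terms anywhere
W2 = torch.randn(16, 32)

def activation_pattern(x: torch.Tensor) -> torch.Tensor:
    h1 = torch.relu(W1 @ x)
    h2 = torch.relu(W2 @ h1)
    return torch.cat([h1 > 0, h2 > 0])

x = torch.randn(8)
for scale in (0.1, 1.0, 100.0):
    assert torch.equal(activation_pattern(x), activation_pattern(scale * x))
```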

The polytope lens only becomes relevant when trying to explain what perfectly linear models can't account for. Although... (read more)

5Neel Nanda
Gotcha, thanks!

Re this, this somewhat conflicts with my understanding of the direction lens. The point is not that things are perfectly linear. The point is that we can interpret directions after a non-linear activation function. The non-linearities are used between interpretable spaces to do some transformation mapping meaningful directions to new meaningful directions (and the exact details of how it does this are the circuits to interpret). See, e.g., my modular addition work for a very concrete example of this.

It's mathematically true that any operation of a ReLU network will be manipulating polytopes (including a randomly initialised network!), and I understood the key claim of this post to be that the polytope lens more naturally maps onto interpreting the network and figuring out what's going on.

A linear function can never do anything interesting to directions: it just transforms the available space, but cannot create new meaningful directions, just superpositions of the old ones.