I’m confused about what the authors mean by the italicized phrase. How do you create more neurons without making the model larger?
I would assume various kinds of sparsity and modularization, and avoiding things which have many parameters but few neurons, such as fully-connected layers.
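If that reading is right, a toy parameter count makes the point (these layer sizes are made up for illustration, not taken from the paper):

```python
# Toy comparison: neurons per parameter for a dense vs. block-diagonal MLP layer.
# All sizes are made up for illustration; nothing here comes from the paper.
d_model, d_mlp, n_blocks = 1024, 4096, 8

dense_params = d_model * d_mlp  # fully-connected: every input feeds every neuron
block_params = n_blocks * (d_model // n_blocks) * (d_mlp // n_blocks)  # modular: inputs split across blocks

print(f"dense:  {dense_params:,} params for {d_mlp} neurons")
print(f"blocks: {block_params:,} params for {d_mlp} neurons "
      f"({dense_params // block_params}x more neurons per parameter)")
```

With the same parameter budget, the block-diagonal version could afford 8x as many neurons, which is the sense in which sparsity/modularity buys "more neurons" without a bigger model.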
SoLU is a double-edged sword for interpretability. On the one hand, it makes it much easier to study a subset of MLP layer features which end up nicely aligned with neurons. On the other hand, we suspect that there are many other non-neuron-aligned features which are essential to the loss and arguably harder to study than in a regular model. Perhaps more concerningly, if one only looked at the SoLU activation, it would be easy for these features to be invisible and create a false sense that one understands all the features.
Extremely concerning for safety. The only thing more dangerous than an uninterpretable model is an 'interpretable' model. Is there an interpretability tax such that all interpretability methods wind up incentivizing covert algorithms, similar to how CycleGAN is incentivized to learn steganography, and interpretability methods risk simply creating mesa-optimizers which optimize for a superficially-simple seeming 'surface' Potemkin-village network while it gets the real work done elsewhere out of sight?
(The field of interpretability research as a mesa-optimizer: the blackbox evolutionary search (citations, funding, tenure) of researchers optimizes for finding methods which yield 'interpretable' models and work almost as well as uninterpretable ones - but only because the methods are too weak to detect that the 'interpretable' models are actually just weird uninterpretable ones which evolved some protective camouflage, and thus just as dangerous as ever. The field already offers a lot of examples of interpretability methods which produce pretty pictures and convince their researchers as well as others, but which later turn out to not work as well or as thought, like salience maps. One might borrow a quote from cryptography: "any interpretability researcher can invent an interpretability method he personally is not smart enough to understand the uninterpretability thereof.")
Anthropic continues their Transformer Circuits Thread. In previous work they were unable to make much progress interpreting MLP layers. This paper is focused on addressing that limitation.
High-level Takeaways
Their Key Results:
More Details
SoLU
Anthropic use their new SoLU activation function (meant to discourage polysemanticity / superposition) in place of GeLU:
SoLU(x) = x ∗ softmax(x)

See the paper for intuition / examples of why this does what it says. I think this is brilliant and would not have thought of it myself. However...
SoLU penalizes superposition, but that hurts performance, so they add an extra LayerNorm after the SoLU, which partially lets superposition back in, and it mostly works? Surprisingly (to me), this allows their model to essentially match the vanilla GeLU model in performance while maintaining improved interpretability.
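For concreteness, here is a minimal PyTorch sketch of the modified MLP block as I read the paper (layer sizes are placeholders, not the paper's): the SoLU itself is just x * softmax(x) over the hidden dimension, and the extra LayerNorm applied to its output is the part that lets superposition partially back in.

```python
import torch
import torch.nn as nn

def solu(x: torch.Tensor) -> torch.Tensor:
    """SoLU(x) = x * softmax(x), with the softmax over the hidden (neuron) dimension."""
    return x * torch.softmax(x, dim=-1)

class SoluMLP(nn.Module):
    """Sketch of a transformer MLP block with SoLU plus the extra post-activation
    LayerNorm; sizes are placeholders, and the LayerNorm is what recovers performance."""
    def __init__(self, d_model: int = 512, d_mlp: int = 2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.ln = nn.LayerNorm(d_mlp)  # rescales suppressed activations -- lets some superposition back in
        self.w_out = nn.Linear(d_mlp, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.ln(solu(self.w_in(x))))

# e.g. SoluMLP()(torch.randn(2, 10, 512)).shape -> torch.Size([2, 10, 512])
```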
How did they evaluate interpretability?
From the appendix it looks like there were three evaluators. I would be very interested to see the experiment repeated with a larger sample size.
More on LayerNorm
Polysemanticity is so useful that it's unavoidable.
So, the SoLU trick privileges some features, which become aligned with neurons. The model will still learn other features, but they will become even harder to interpret than before.
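A toy numeric illustration of that privileging (my own example, not from the paper): the softmax barely dampens a pre-activation vector dominated by a single neuron, but crushes one where the same activation is spread across many neurons, which is roughly how superposed features get penalized unless the later LayerNorm rescales them.

```python
import numpy as np

def solu(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # numerically stable softmax
    return x * (e / e.sum())

peaked = np.array([4.0, 0.0, 0.0, 0.0])  # one neuron-aligned feature firing strongly
spread = np.array([1.0, 1.0, 1.0, 1.0])  # the same total activation spread across 4 neurons

print(solu(peaked))  # ~[3.79, 0., 0., 0.]       -> the big activation mostly survives
print(solu(spread))  # [0.25, 0.25, 0.25, 0.25]  -> every activation cut to a quarter
```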
Misc comments / questions