All of ntt123's Comments + Replies

ntt123

Thank you for the upvote! My main frustration with the logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.
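The idea can be sketched concretely: because the residual stream is a sum of component outputs and the unembedding is linear, the logits split into one additive term per component. This is a toy NumPy sketch with made-up shapes and component names (assumptions, not the actual setup), and it ignores the final LayerNorm, which breaks exact linearity in real transformers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 10

# Hypothetical per-component writes to the residual stream
# (names and values are illustrative assumptions).
components = {
    "embed": rng.normal(size=d_model),
    "attn_0": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
    "attn_1": rng.normal(size=d_model),
    "mlp_1": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix

# The final residual stream is the sum of component outputs,
# so the logits decompose into one term per component.
resid = sum(components.values())
logits = resid @ W_U
per_component = {name: vec @ W_U for name, vec in components.items()}

# The per-component logit contributions sum back to the full logits.
assert np.allclose(sum(per_component.values()), logits)
```

Each `per_component[name]` is that component's exact additive contribution to the output logits, rather than an ad hoc projection of an intermediate state.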

For the record, I did not assume MLP neurons are monosemantic or polysemantic, and this is why I did not mention SAEs.

Logan Riggs
Thanks for the correction! What I meant was that figure 7 is better modeled as "these neurons are not monosemantic," since their co-activation has a consistent effect (upweighting 9) which isn't captured by any individual component, and (I predict) these neurons would do different things on different prompts. But I think I see where you're coming from now, so the above is tangential.

You're just decomposing the logits using previous layers' components. So even though an intermediate layer's logit contribution won't make sense on its own (as in tuned lens), that's fine.

It is interesting that in your example the first two layers counteract each other. Surely this isn't true in general, but it could be a common theme of later layers counteracting bigrams (what the embedding is doing?) based on context.
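The counteracting-layers point is easy to see numerically: two components can each have a large contribution to a token's logit while their sum is near zero. The vectors below are made up purely for illustration (an assumption, not real model data):

```python
import numpy as np

# Unembedding direction for some token (hypothetical).
u_tok = np.array([1.0, 0.0])

# Two components that write opposing amounts along that direction.
layer_0 = np.array([2.0, 0.5])    # contributes +2.0 to the token's logit
layer_1 = np.array([-1.8, 0.3])   # contributes -1.8 to the same logit

contrib_0 = layer_0 @ u_tok
contrib_1 = layer_1 @ u_tok
net = contrib_0 + contrib_1  # individually large, jointly near zero
```

Looking only at the summed residual stream (as the logit lens does at a later layer) would show almost no effect on this token, hiding the fact that two components are pushing hard in opposite directions.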