All of Evan Hockings's Comments + Replies

It’s great to see that these techniques basically work at scale, but not so much to hear that things remain messy. Do you have any intuition for whether things would start to clean up if the model was trained until the loss curve flattened out? Maybe Chinchilla-optimality even has some interesting bearing on this!
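(A minimal numeric sketch of what "training until the loss curve flattens" means under the Chinchilla fit from Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β. The constants below are the paper's published estimates; the model size and token counts are purely illustrative.)

```python
# Parametric pretraining loss from Hoffmann et al. (2022):
# L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the paper's published Approach-3 fit; treat as illustrative.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Fix the model size and keep scaling data: the curve flattens.
n_params = 70e9  # roughly Chinchilla-sized, for illustration
for n_tokens in [300e9, 1.4e12, 5e12, 20e12]:
    print(f"{n_tokens:.1e} tokens -> predicted loss {chinchilla_loss(n_params, n_tokens):.3f}")
```

At fixed N the data term only decays as D^-0.28, so the curve flattens long before it approaches the irreducible term E.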

4Neel Nanda
My guess is that messiness is actually a pretty inherent part of the whole thing? Models have no inherent reason to want to do the problem with a single clean solution: if they can simultaneously use the features "nth item in the list", "labelled A", and even "has two incorrect answers before it", why not?
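(A toy illustration of this point, not from the original comment: when several features are perfectly redundant, the minimum-norm least-squares solution, which gradient descent from small initialisation converges to in the linear case, spreads weight across all of them rather than picking one cleanly.)

```python
import numpy as np

# Toy setup: three perfectly redundant features (identical columns), standing
# in for "nth item in the list" / "labelled A" / "two incorrect answers before
# it" all pointing at the same answer.
rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 1))
X = np.repeat(signal, 3, axis=1)  # shape (100, 3), all columns identical
y = signal.ravel()

# The minimum-norm least-squares solution (np.linalg.pinv) spreads weight
# evenly across the redundant features instead of using one of them "cleanly".
w = np.linalg.pinv(X) @ y
print(w.round(3))  # -> [0.333 0.333 0.333], not [1. 0. 0.]
```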
2Tom Lieberum
During parts of the project I had the hunch that some letter-specialized heads are more like proto-correct-letter heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way.

The "it becomes cleaner" intuition basically relies on stuff like the grokking work and other work showing representations being refined late during training, by Tishby et al. I believe (and maybe other work). However, some of this would probably require randomising e.g. the labels the model sees during training; see e.g. Hilton et al., Understanding RL Vision. If you only ever see the second choice labelled B, you don't have an incentive to distinguish between "look for B" and "look for the second choice".

Lastly, even in the limit of infinite training data you still have limited model capacity, and so will likely use a distributed representation in some way; but maybe you could at least get human-interpretable features, even if they are distributed.
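(A hypothetical data-generation sketch of that randomisation point, not the paper's actual setup: with a fixed letter order, "has label B" and "is the second option" are perfectly correlated, so nothing pushes the model to represent them separately; shuffling the letter assignment breaks the correlation.)

```python
import random

OPTIONS = ["red", "green", "blue"]
LETTERS = ["A", "B", "C"]

def make_prompt(shuffle_labels: bool) -> str:
    """Build a multiple-choice prompt; optionally decorrelate letter and position."""
    letters = LETTERS[:]
    if shuffle_labels:
        random.shuffle(letters)  # letter no longer predicts position
    lines = [f"({letter}) {option}" for letter, option in zip(letters, OPTIONS)]
    return "\n".join(lines)

print(make_prompt(shuffle_labels=False))  # always (A) red / (B) green / (C) blue
print(make_prompt(shuffle_labels=True))   # letters in a random order, decorrelated from position
```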

This is incredible—the most hopeful I’ve been in a long time. 20% of current compute and plans for a properly sized team! I’m not aware of any clearer or more substantive sign from a major lab that they actually care, and are actually going to try not to kill everyone. I hope that DeepMind and Anthropic have great things planned to leapfrog this!

1mesaoptimizer
I don't get the model of the world under which DM/Anthropic "leapfrogging" is a sensible frame. There should be no notion of competition between these labs when it comes to "superalignment". If there is, that is weak evidence that our entire lightcone is doomed.

Agreed—thanks for writing this. I have the sense that there's somewhat of a norm that goes like 'it's better to publish something than not, even if it's unpolished', and while this is not wrong, exactly, I think those who are doing this professionally, or seek to do this professionally, ought to put in the extra effort to polish their work.

I am often reminded of this Jacob Steinhardt comment:

Researchers are, in a very substantial sense, professional writers. It does no good to do groundbreaking research if you are unable to communicate what you have done...

7the gears to ascension
Wow, this is a quote for the ages.
6Nicholas / Heather Kross
I've kinda gone back and forth on this, since I often have low energy yet ideas to express. Since we already use "epistemic status" labels, I could imagine labels like "trying to clarify" vs. "just getting an idea out there". Some epistemic statuses kinda do that (e.g. "strong conviction, weakly held" or "random idea").