Question about the "rules of the game" you present: are you allowed to simply look at layer 0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from these features' top activators. From your case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but I wonder if you can skip the search and simply look at the transcoder features directly. Thank you.
This is a casual thought and by no means something I've thought hard about - I'm curious whether b is a lagging indicator, which is to say, there's actually more magic going on in the weights and once weights go through this change, b catches up to it.
Another speculative thought: let's say we are moving from 4* -> 5* and |W_3| is the new W that is taking on high magnitude. Does this occur because W_3 somehow has enough individual weights to jointly look at its two (new) neighbors' W_i's roughly equally?
Does the cos similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b)?
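To make the question concrete, here is a minimal sketch of the measurement I have in mind. It assumes W is stored with one column per feature (so W_i is `W[:, i]`) and that you have saved (W, b) pairs at several training steps. The function names, the column index 3, the neighbor indices, and the synthetic checkpoints are all hypothetical and only there for illustration; they are not taken from the post.

```python
import numpy as np

def neighbor_similarity(W, i, neighbors):
    """Return {j: (dot, cosine)} for column i of W against each neighbor column j."""
    w = W[:, i]
    sims = {}
    for j in neighbors:
        v = W[:, j]
        dot = float(w @ v)
        denom = float(np.linalg.norm(w) * np.linalg.norm(v))
        # Cosine is undefined if either column is exactly zero.
        cos = dot / denom if denom > 0 else float("nan")
        sims[j] = (dot, cos)
    return sims

def track_transition(checkpoints, i, neighbors):
    """Collect |W_i|, b_i and neighbor similarities for each (step, W, b) checkpoint."""
    rows = []
    for step, W, b in checkpoints:
        rows.append({
            "step": step,
            "norm": float(np.linalg.norm(W[:, i])),
            "b": float(b[i]),
            "sims": neighbor_similarity(W, i, neighbors),
        })
    return rows

# Synthetic stand-in for real checkpoints (assumption: 2 hidden dims, 4 features).
# Column 3 sits between columns 0 and 1 and only its magnitude grows over "training".
angles = np.array([0.0, 0.5 * np.pi, np.pi, 0.25 * np.pi])
checkpoints = []
for step, scale in enumerate([0.01, 0.5, 1.0]):
    W = np.stack([np.cos(angles), np.sin(angles)])
    W[:, 3] *= scale
    b = np.zeros(4)
    checkpoints.append((step, W, b))

for row in track_transition(checkpoints, i=3, neighbors=[0, 1]):
    print(row)
```

In this synthetic case the dot products grow with |W_3| while the cosine similarities stay fixed, which is why I ask about both: comparing the step at which each one moves with the step at which b_3 moves would show whether b is leading or lagging.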
For any potential funders reading this: I'd be open to starting an interpretability lab and would love to chat. I've been full-time on MI for about 4 months - here is some of my work: https://www.lesswrong.com/posts/vGCWzxP8ccAfqsrS3/thoughts-about-the-mechanistic-interpretability-challenge-2
I have a few PhD friends who are working for software jobs they don't like and would be interested in joining me for a year or longer if there were funding in place (even for just the trial period Marius proposes).
My very quick take is that interpretability...
I really like your ambitious MI section and I think you hit on a few interesting questions I've come across elsewhere:
Two researchers interpreted a 1-layer transformer network and then I interpreted it differently - there isn't a great way to compare our explanations (or really know how similar vs different our explanations are).
With papers like the Hydra effect that demonstrate similar knowledge can be spread throughout a network, it's not clear to me whether or how we want to analyze impact - can/should we jointly ablate multiple units across different heads at ...
Most weekdays, I set myself the goal of doing twelve focused blocks of 24 minutes of object-level work (my variant on Pomodoro). Once I complete these blocks, I can do whatever I want - whether that is stopping work for the rest of the day, more object-level work, meta work, or anything else.
If you try something like this, I'd recommend setting a goal of doing 6(?) such blocks and then letting yourself do as much or as little meta as you want; and then potentially gradually working up to 10-13 blocks.
Over the last 3 months, I've spent some time thinking about mech interp as a for-profit service. I've pitched to one VC firm, interviewed for a few incubators/accelerators including Y Combinator, sent out some pitch documents, co-founder dated a few potential cofounders, and chatted with potential users and some AI founders.
There are a few issues:
First, as you mention, I'm not sure if mech interp is yet ready to understand models. I recently interpreted a 1-layer model trained on a binary classification function https://www.lesswrong.com/posts/...
This is a surprising and fascinating result. Do you have attention plots of all 144 heads you could share?
I'm particularly interested in the patterns for all heads on layers 0 and 1 matching the following caption
(Left: a 50x50 submatrix of LXHY's attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY's attention pattern, when positional embeddings are averaged as described above.)
My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art. I don't know what form this would take and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then? Given this is hypothetical and the next large improvement in LMs could come elsewhere, I'm not suggesting we stop sharing now. But I think we should be prepared that there might be a point in time where we need to acknowledge such sharing leads to significantly stronger models and thus should re-evaluate sharing such eval work.
The differences between these two projects seem like an interesting case study in MI. I'll probably refer to this a lot in the future.
Excited to see case studies comparing and contrasting our works. Not that you need my permission, but feel free to refer to this post (and if it's interesting, this comment) as much or as little as desired.
One thing that I don't think came out in my post is that my initial reaction to the previous solution was that it was missing some things and might even have been mostly wrong. (I'm still not certain that...
One thought I've had, inspired by discussion (explained more later), is whether:
"label[ing] points by interpolating" is not the opposite of "developing an interesting, coherent internal algorithm." (This is based on a quote from Stephen Casper's retrospective that I also quoted in my post).
It could be the case that the network might have "develop[ed] an interesting, coherent algorithm", namely the row coloring primitives discussed in this post, but uses "interpolation/pattern matching" to approximately detect the cutoff points.
When I started t...
Thank you for the kind words and the offer to donate (not necessary but very much appreciated). Please donate to https://strongminds.org/ which is listed on Charity Navigator's list of high impact charities ( https://www.charitynavigator.org/discover-charities/best-charities/effective-altruism/ )
I will respond to the technical parts of this comment tomorrow or Tuesday.
I wonder if this is simply the result of the generally bad SWE/CS market right now. People who would otherwise be in big tech/other AI stuff, will be more inclined to do something with alignment.
This is roughly my situation. Waymo froze hiring and had layoffs while continuing to increase output expectations. As a result I/we had more work. I left in March to explore AI and landed on Mechanistic Interpretability research.
"We will keep applications open until at least the end of August 2023"
Is there any advantage to applying early vs in August 2023? I ask as someone intending to do a few months of focused independent MI research. I would prefer to have more experience and sense of my interests before applying, but on the other hand, don't want to find out mid-August that you've filled all the roles and thus it's actually too late to apply. Thanks.
Just wanted to say that I have similar questions about how to best (try to) get funding for mechanistic interpretability research. Might send a bunch of apps out come early June; but like OP, I don't have any technical results in alignment (though like OP, I like to think I have a solid (yet different) background).
Hi Sonia - Can you please explain what you mean by "mixed selectivity"? In particular, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.