Wiki Contributions

Comments

RGRGRG10

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an additional orthogonality constraint between each train?

RGRGRG10

Question about the "rules of the game" you present.  Are you allowed to simply look at layer 0 transcoder features for the final 10 tokens - you could probably roughly estimate the input string from these features' top activators.  From you case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but wonder if you can skip the search and simply look at transcoder features.  Thank you.

RGRGRGΩ110

To confirm - the weights you share, such as 0.26 and 0.23 are each individual entries in the W matrix for:
y=Wx ?

RGRGRG10

This is a casual thought and by no means something I've thought hard about - I'm curious whether b is a lagging indicator, which is to say, there's actually more magic going on in the weights and once weights go through this change, b catches up to it.

Another speculative thought, let's say we are moving from 4* -> 5* and |W_3| is the new W that is taking on high magnitude.  Does this occur because somehow W_3 has enough internal individual weights to jointly look at it's two (new) neighbors' W_i`s  roughly equally?

Does the cos similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b?)

RGRGRG10

Question about the gif - to me it looks like the phase transition is more like:

4++-  to unstable 5+-  to 4+-  to  5-
(Unstable 5+- seems to have similar loss to 4+-).

Why do we not count the large red bar as a "-" ?

RGRGRG40

Do you expect similar results (besides the fact that it would take longer to train / cost more) without using LoRA?

RGRGRG10

If I were to be accepted for this cycle, would I be expected to attend any events in Europe?  To be clear, I could attend all events in and around Berkeley.

RGRGRGΩ110

What city/country is PIBBSS based out of / where will the retreats be?  (Asking as a Bay Area American without a valid passport).

RGRGRG100

For any potential funders reading this:  I'd be open to starting an interpretability lab and would love to chat.  I've been full-time on MI for about 4 months - here is some of my work: https://www.lesswrong.com/posts/vGCWzxP8ccAfqsrS3/thoughts-about-the-mechanistic-interpretability-challenge-2

I have a few PhD friends who are working for software jobs they don't like and would be interested in joining me for a year or longer if there were funding in place (even for just the trial period Marius proposes).

My very quick take is that interpretability has yet to understand small language models and this is a valuable direction to focus on next.  (more details here: https://www.lesswrong.com/posts/ESaTDKcvGdDPT57RW/seeking-feedback-on-my-mechanistic-interpretability-research ) 

 

For any potential cofounders reading this, I have applied to a few incubators and VC funds, without any success.  I think some applications would be improved if I had a co-founder.  If you are potentially interested in cofounding an interpretability startup and you live in the Bay Area, I'd love to meet for coffee and see if we have a shared vision and potentially apply to some of these incubators together.

RGRGRG10

I really like your ambitious MI section and I think you hit on a few interesting questions I've come across elsewhere:

Two researchers interpreted a 1-layer transformer network and then I interpreted it differently - there isn't a great way to compare our explanations (or really know how similar vs different our explanations are).

With papers like the Hydra effect that demonstrate similar knowledge can be spread throughout a network, it's not clear to if we want to/how to analyze impact - can/should we jointly ablate multiple units across different heads at once?

I'm personally unsure how to split my time between interpreting small networks vs larger ones.  Should I focus 100% on interpreting 1-2 layer TinyStories LMs or is looking into 16+ layer LLMs valuable at this time?

Load More