All of Johnny Lin's Comments + Replies

apologies for the issue with the neuronpedia link. it's now been resolved.

Hey Jacob + Philippe,

Hope you all don't mind but we put up layer 8 of your transcoders onto Neuronpedia, with ~22k dashboards here:

https://neuronpedia.org/gpt2-small/8-tres-dc

Each dashboard can be accessed at its own URL:

https://neuronpedia.org/gpt2-small/8-tres-dc/0 goes to feature index 0.
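
As a rough illustration of the URL pattern in the links above, here is a minimal Python sketch. The helper name `dashboard_url` and its parameters are hypothetical, and the pattern is only inferred from the example URLs in this comment, not taken from any documented Neuronpedia API.

```python
def dashboard_url(model: str, layer: int, source_set: str, index: int) -> str:
    """Build a per-feature dashboard URL.

    Assumption: the path is <model>/<layer>-<source_set>/<feature index>,
    as inferred from the example links in the comment.
    """
    return f"https://neuronpedia.org/{model}/{layer}-{source_set}/{index}"


# Feature index 0 of the layer 8 transcoders on gpt2-small
print(dashboard_url("gpt2-small", 8, "tres-dc", 0))
# prints "https://neuronpedia.org/gpt2-small/8-tres-dc/0"
```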

You can also test each feature with custom text:

Or search all features at: https://www.neuronpedia.org/gpt2-small/tres-dc

An example search: https://www.neuronpedia.org/gpt2-small/?sourceSet=tres-dc&selectedLayers=[]&sortIndexes=[]&q=the%20cat%20sat%20on%20...

3 Jacob Dunefsky
Just started playing around with this -- it's super cool! Thank you for making this available (and so fast!) -- I've got a lot of respect for you and Joseph and the Neuronpedia project.
4 Neel Nanda
That's awesome, and insanely fast! Thanks so much, I really appreciate it

Thanks Callum and yep we've been extensively using SAE-Vis at Neuronpedia - it's been extremely helpful for generating dashboards and it's very well maintained. We'll have a method of directly importing to Neuronpedia using the exports from SAE-Vis coming out soon.

2 CallumMcDougall
Thanks!! Really appreciate it

Hey Joseph (and coauthors),

Your directions are really fantastic. I hope you don't mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:

https://www.neuronpedia.org/gpt2-small/res-jb 

Your directions are also linked on the home page and the model page.

They're also accessible by layer (sorted by top activation), e.g., layer 6: https://neuronpedia.org/gpt2-small/6-res-jb

I added the "Anthropic dashboard" to Neuronpedia for your dataset.

Explanations, comments, and autointe...

1 Joseph Bloom
Agreed, thanks so much! Super excited about what can be done here!

Thanks for doing this, I'm excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability

Apparently an anonymous user(s) got really excited and ran a bunch of simultaneous searches while I was sleeping, triggering this open tokenizer bug/issue and causing our TransformerLens server to hang/crash. This caused some downtime.

A workaround has been implemented and pushed.

Thanks for the tip! I've added the link under "Exploration Tools" after the first mention of Neuronpedia. Let me know if that is the proper way to do it - I couldn't find a feature on LW for a special "context link" if there is such a feature.

2 mishka
I think this is a good place for this link, thanks!

Great work Adam, especially on the customizability. It's fascinating clicking through various types and indexes to look for patterns, and I'm looking forward to using this to find interesting directions.

Hey Nathan, so sorry this took so long. Finally shipped this - you can now toggle "Profanity/Explicit" OFF in "Edit Profile". Some notes about the implementation:

  • If enabled, hides activation texts that have a bad word in them (still displays the neuron)
  • Works by checking a list of bad words
  • Default is disabled (profanity shown) in order to get more accurate explanations
  • Asks user during onboarding for their preference
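
A minimal Python sketch of the word-list approach described in the notes above. This is not the actual Neuronpedia code: the function names, the `hide_profanity` flag, and the placeholder word list are all hypothetical, and whole-word matching is an assumption about how the list is checked.

```python
import re

# Hypothetical placeholder list; the real word list is not given in the comment.
BAD_WORDS = {"darn", "heck"}


def contains_bad_word(text: str, bad_words: set[str] = BAD_WORDS) -> bool:
    """Return True if any whole word in `text` is on the bad-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in bad_words for word in words)


def visible_activation_texts(texts: list[str], hide_profanity: bool = False) -> list[str]:
    """Filter activation texts; the neuron itself is always still displayed.

    Default is hide_profanity=False (profanity shown), matching the notes above.
    """
    if not hide_profanity:
        return list(texts)
    return [t for t in texts if not contains_bad_word(t)]
```

With the placeholder list, `visible_activation_texts(["what the heck", "a nice cat"], hide_profanity=True)` keeps only the second text, while the default call returns both.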

It turns out that nearly all neurons have some sort of explicit (or "looks like explicit") text, so it's not feasible to automatically skip all th...

Sorry about that! Should have fixed that way earlier. I've transferred the app to "Neuronpedia", so it should appear correctly now. Thank you for flagging this.

Yes, this is a great idea. I think something other than "skip" is needed since skip isn't declaring "this neuron has too many meanings or doesn't seem to do anything", which is actually useful information

re: polysemanticity- have a big tweak to the game coming up that may help with this! i hope to get it out by early next week.

lol thanks. i can't believe the link has been broken for so long on the site. it should be fixed in a few seconds from now. in the meantime if you're interested: https://discord.gg/kpEJWgvdAx 

Thanks Brendan - yes - Apple/Google/Email login is coming before public non-beta release. May also add Facebook.

I love these variations on the game. Yes, the idea is to build a scalable backend and then we can build various different games on top of it. Initially there was no game and it was just browsing random neurons manually! Then the game was built on top of it. Internally I call the current Neuronpedia game "Vote Mode".

Would love to have you on our Discord even if just to lurk and occasionally pitch an idea. Thanks for playing!

Thank you! i will put this on the TODO.

Thank you - yes, this is on the TODO along with Apple Sign-In. It will likely not be difficult, but it's in beta experiment phase right now for feedback and fixes - we aren't yet ready for the scale of the general public.

2 Caridorc Tergilti
Thanks for the quick response. Have you tried fine-tuning the new Llama 2 models on the data gathered so far to see if there are any interesting results? QLoRA is pretty efficient for this.

Thank you Hoagy. Expanding beyond the neuron unit is a high priority. I'd like to work with you, Logan Riggs, and others to figure out a good way to make this happen in the next major update so that people can easily view, test, and contribute. I'm now creating a new channel on the discord (#directions) to discuss this: https://discord.gg/kpEJWgvdAx, or I'll DM you my email if you prefer that.

Hi Logan - thanks for your response. Your dictionaries post is on the TODO to investigate and integrate (someone had referred me to it two weeks ago) - I'd love to make it happen.

Thanks for joining the Discord. Let's discuss when I have a few days to get caught up with your work.

Hi Neel, thanks for playing and thanks for all your incredible work. Neuronpedia uses a ton of your stuff.

Re: polysemantic neurons - yes, I should address this before wider distribution. Some current ideas - if you have a preference please let me know.

  1. Your proposed "this is a mess" button
  2. Allow voting on more than one option at a time (users can do multiple votes for explanations per neuron on the neuron's page, but the game automatically moves on to a new neuron after one vote to keep it more "game-like")
  3. Encourage "or" explanations: "cat or tomato or purpl
...

Sorry - New Discord link (changed to a "Community Server") https://discord.gg/kpEJWgvdAx 

Should be working now.

Also, thank you for the feedback re: janky tutorial/signin. I will fix that. It is truly a terrible way to have a first experience with a product.

EDIT: the tutorial -> sign in friction has been updated.

hey mako - sorry about the issues. i'm looking into it right now. will update asap

edit: looks like the EC2 instance hard crashed. i can't even restart it from AWS console. i am starting up a new instance with more RAM.

edit2: confirmed via syslog (after taking a long time to restart the old server) it was OOM. new machine has 8x more ram. added monitoring and will investigate potential memory leaks tomorrow

Hi duck_master, thank you for playing and appreciate the tip. Maybe it's worth compiling these tips and putting it under a "tips" popup/page on the main site. Also - please consider joining the Discord if you're willing to offer more feedback and suggestions: https://discord.gg/kpEJWgvdAx 

Apologies for the limit. It currently costs ~$0.24 to do each explanation score and it's coming from my personal funds, so I'm capping it daily until I can hopefully get approved for a grant. A few hours ago I raised the limit from 3 new explanations per day to 10 new explanations per day.

3 Harry Nyquist
Subjective opinion: I would probably be happy to do a few more past the limit even without the automated explanation score. If, for instance, only the first 10-20 had an automatic score, and the rest were dependent on manual votes (starting at score 0), I think the task would still be interesting, hopefully without causing too much trouble. (There's also a realistic chance that I may lose attention and forget about the site in a few weeks, despite liking the game, so that (disregarding spam problems) a higher daily limit with reduced functionality maybe has value?) (As a note, the Discord link also seems to be expired or invalid on my end.)

Hi Nathan, thanks for playing and pointing out the issue. My apologies for the inappropriate text.

Half the text samples are from Open Web Text, which is scraped web data that GPT-2 was trained on. I don't know the exact details, but I believe some of it was from Reddit and other places.

If you DM me the neuron's address next time you see them, I can start compiling a filter. I will also try to look for an open source library to categorize text into safe and not safe.

My apologies again. This is a beta experiment, thanks for putting up with this while I fix the issues.

2 Caridorc Tergilti
You can use the "mp-net2" model from sentence transformers for zero-shot classification (scalar product between the embedding of the text and the embeddings of "sex" and "violence"), then decide a cut-off and you are done.

Thank you TinkerBird. I hope so too!

Hi Jennifer,

Thanks for participating - my apologies for only having GitHub login at the moment. Please feel free to create a throwaway GitHub account if you'd still like to play (I think GitHub allows you to use disposable emails to sign up - I had no problem creating an account using an iCloud disposable email). Email/password login is definitely on the TODO.

That's a good idea - I think maybe I could make a "drafts" explanation list so you can queue it up for later. Unfortunately, since the website just launched, there is not yet a reasonable threshold for "voted highly" since most explanations have no votes or very few. But this is a good workaround for when the site is a bit older.

Re: multiple meanings - this is interesting. I need to experiment with this more, but I don't think you need to use any special syntax. By writing "the letter c in a word or names of towns near Berlin", it should give you a...

2 Adele Lopez
Thanks for the drafts feature! Yeah, it's a tricky situation. It may even be worth using a model trained to avoid polysemanticity. I also think it would make the game both more fun and more useful if you switched to a model like the TinyStories one, where it's much smaller and trained on a more focused dataset. I may join the Discord, but the invite on the website is expired currently fyi.

Hi Martin,

Thanks for playing! I agree there is some risk of confirmation bias, and the option to hide explanations by default is very interesting.

The reason it is designed the way it is now is that I'd prefer to avoid too many duplicate explanations. Currently, you can only submit explanations that are not exact duplicates, though you can submit explanations that are very similar, e.g., "banana" vs "bananas".
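
To make the exact-duplicate rule concrete, here is a small Python sketch. The function name `is_exact_duplicate` is hypothetical, and plain string equality is an assumption; the comment does not say whether case or whitespace is normalized before comparing.

```python
def is_exact_duplicate(candidate: str, existing: list[str]) -> bool:
    """True only when `candidate` matches an existing explanation character for character.

    Assumption: no normalization (case, whitespace, plurals) is applied.
    """
    return candidate in existing


existing = ["banana"]
print(is_exact_duplicate("banana", existing))   # True: rejected as a duplicate
print(is_exact_duplicate("bananas", existing))  # False: near-duplicate is still accepted
```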

The first downside would be that duplicate explanations may clutter up the voting options. The second downside is when someone is looking at t...

Hi Adam and thanks for your feedback / suggestion. Residual Viewer looks awesome. I have DMed you to chat more about it!

Good idea. I haven't done enough research on why some forums have upvotes only, went with my instinct on this but I should look into the pros/cons.

EDIT: this update was pushed just now. it will warn you on your first vote to confirm that you want to vote.


Thanks for playing, Chris!

I'll work on the voting thing. I'll probably just add a "first-timer's" warning on your first vote to ensure that you want to vote for that.

FYI - if you want to unvote, just go to your profile (neuronpedia.org/user/[username]), click the neuron you voted for, and click to unvote on the left side.

5 Chris_Leong
It might make sense to have downvotes as well if you disagree with an explanation.

Thanks so much for the feedback! Inline below:

Conceptual Feedback:

  • I think it would be better if I could see two explanations and vote on which one I like better (when available).
    • When there are multiple explanations, Neuronpedia does display them.
    • However I've considered a different game mode where all you do is choose between This Vs That (no skipping, no new explanations). That may be a cool possibility!
  • Attention heads are where a lot of the interesting stuff is happening, and need lots of interpretation work. Hopefully this sort of approach can be extende
...
6 Adele Lopez
I would really like to be able to submit my own explanations even if they can't be judged right away. Maybe to save costs, you could only score explanations after they've been voted highly by users. Additionally, it seems clear that a lot of these neurons have polysemanticity, and it would be cool if there was a way to indicate the meanings separately. As a first thought, maybe something like using | to separate them e.g. the letter c in the middle of a word | names of towns near Berlin.