Edit - Neuronpedia has pivoted to be a research tool for Sparse Autoencoders, so most of this post is outdated. Please read the new post, Neuronpedia: Accelerating Sparse Autoencoders Research.
Neuronpedia is an AI safety game that documents and explains each neuron in modern AI models. It aims to be the Wikipedia for neurons, where the contributions come from users playing a game. Neuronpedia wants to connect the general public to AI safety, so it's designed to not require any technical knowledge to play.
Neuronpedia is in experimental beta: it is getting its first users in order to collect feedback and ideas and to build an initial community.
OBJECTIVES
- Increase understanding of AI to help build safer AI
- Increase public engagement, awareness, and education in AI safety
CURRENT STATUS
- I started working on Neuronpedia three weeks ago, and I'm posting on LessWrong to develop an initial community and to get feedback and testing. I'm not posting it anywhere else, so please do not share it yet in other forums like Reddit.
- There's an onboarding tutorial that explains the game, but to summarize: It's a word association game. You're shown one neuron ("puzzle") at a time, and its highest activations ("clues"). You then either vote for an existing explanation, or submit your own explanation. Neuronpedia's first "campaign" is explaining gpt2-small, layer 6.
- There is an "advanced mode" that allows testing custom activation text and shows more details/filters. Click "Simple" at the top right to toggle it.
WHAT YOU CAN DO
- Play @ neuronpedia.org - feel free to use a throwaway GitHub account to log in.
- Give feedback, ideas, and ask questions.
THE VISION
- Millions of casual and technical users play Neuronpedia daily, trying to solve each neuron (like NYT crossword/Wordle). There are weekly/monthly contests ("side quests"). Top scorers are ranked on leaderboards by country, region, etc.
- Neuronpedia sparks interest in AI safety for thousands of people and they contribute in other ways (switch fields, do research, etc).
- Researchers use the data to build safer and more predictable AI models. Companies post updated versions of their AI models (or parts of them) as new "campaigns" and iterate toward increasingly safe models.
HOW NEURONPEDIA CAME ABOUT
After moving on from my previous startup, I reached out to 80,000 Hours for career advice. They connected me to William Saunders, who provided informal (not affiliated with any company) guidance on what might be useful products to develop for AI safety research. Three weeks ago, I started prototyping versions of Neuronpedia, first as a reference website, then eventually iterating into a game.
Neuronpedia is seeded with data and tools from OpenAI's Automated Interpretability and Neel Nanda's Neuroscope.
IS THIS SUSTAINABLE?
Unclear. There's no revenue model, and there is nobody supporting Neuronpedia. I'm working full time on it and spending my personal funds on hosting, inference servers, OpenAI API, etc. If you or your organization would like to support this project, please reach out at johnny@neuronpedia.org.
COUNTERARGUMENTS AGAINST NEURONPEDIA
These are reasons Neuronpedia could fail to achieve one or more of its objectives. They're not insurmountable, but they're good to keep in mind.
- Can't get enough people to care about AI safety or think it's a real problem.
- Neurons are the wrong "unit" for useful interpretability and Neuronpedia is unable to adapt to the correct "unit" (groups of neurons, etc).
- Even the best human explanations are not good.
- The scoring algorithm for explanations is bad and can't be improved.
- Not engaging enough - the game isn't balanced, doesn't have enough "loops", etc.
- Bugs.
- Lack of funds.
- AI companies shut it down via copyright claims, cease and desist, etc.
- Unable to contain abusive users or spam.
- Too slow to stop misaligned AI.
Cool concept! Thanks for making it. And that's a lovely looking website, especially for just three weeks!
The core problem with this kind of thing is that neurons are often not actually monosemantic: models use significant superposition, so a single neuron can mean many different things. This is a pretty insurmountable problem - I don't think it sinks the concept of the website, but it seems valuable to e.g. have a "this seems like a polysemantic mess" button.
Bug report - in OWT, apostrophes or quote marks are often tokenized as two separate tokens because of a dumb bug in the tokenizer (they're a weird unicode character that it doesn't recognise, so its bytes get split across two tokens). This looks confusing, e.g. here: (the gap between the name and s is an apostrophe). It's unclear how best to deal with this; my recommendation is to show an empty string and then an apostrophe/quotation mark, with a footnote on hover explaining it.
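The display recommendation above could be sketched roughly as follows. This is a minimal illustration, not Neuronpedia's actual code: `display_tokens` is a hypothetical helper, and the example token split (including which bytes of the curly apostrophe land in which token) is an assumption for illustration. The one solid fact it relies on is that the curly apostrophe U+2019 is three bytes in UTF-8, so a byte-level tokenizer can split it mid-character.

```python
def display_tokens(token_bytes):
    """Turn raw per-token bytes into display strings.

    Tokens that end in the middle of a multi-byte UTF-8 character are
    shown as an empty string; the token that completes the character
    shows the whole character. (Hypothetical helper, sketch only: truly
    invalid bytes would leave every later token blank.)
    """
    shown = []
    pending = b""  # bytes of a character that is not complete yet
    for piece in token_bytes:
        pending += piece
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            shown.append("")  # incomplete character: show nothing here
            continue
        shown.append(text)
        pending = b""
    return shown


# The curly apostrophe is three bytes in UTF-8, the ASCII one is a single byte.
curly = "\u2019".encode("utf-8")
plain = "'".encode("utf-8")

# Assumed token split for a name followed by a curly apostrophe and "s".
tokens = [b"Ne", b"el", curly[:2], curly[2:], b"s"]
labels = display_tokens(tokens)
```

With this approach the first half of the split apostrophe renders as an empty cell and the second half renders as the apostrophe itself, which is where a hover footnote could be attached.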
Hi Neel, thanks for playing and thanks for all your incredible work. Neuronpedia uses a ton of your stuff.
Re: polysemantic neurons - yes, I should address this before wider distribution. Some current ideas - if you have a preference please let me know.
- Your proposed "this is a mess" button
- Allow voting on more than one option at a time (users can do multiple votes for explanations per neuron on the neuron's page, but the game automatically moves on to a new neuron after one vote to keep it more "game-like")
- Encourage "or" explanations: "cat or tomato or purpl...