Edit - Neuronpedia has pivoted to be a research tool for Sparse Autoencoders, so most of this post is outdated. Please read the new post, Neuronpedia: Accelerating Sparse Autoencoders Research.
Neuronpedia is an AI safety game that documents and explains each neuron in modern AI models. It aims to be the Wikipedia for neurons, where the contributions come from users playing a game. Neuronpedia wants to connect the general public to AI safety, so it's designed to not require any technical knowledge to play.
Neuronpedia is in experimental beta: it is getting its first users in order to collect feedback and ideas and to build an initial community.
OBJECTIVES
- Increase understanding of AI to help build safer AI
- Increase public engagement, awareness, and education in AI safety
CURRENT STATUS
- I started working on Neuronpedia three weeks ago, and I'm posting on LessWrong to develop an initial community and to get feedback and testing. I'm not posting it anywhere else, so please do not share it yet in other forums like Reddit.
- There's an onboarding tutorial that explains the game, but to summarize: It's a word association game. You're shown one neuron ("puzzle") at a time, and its highest activations ("clues"). You then either vote for an existing explanation, or submit your own explanation. Neuronpedia's first "campaign" is explaining gpt2-small, layer 6.
- There is an "advanced mode" that allows testing custom activation text and shows more details/filters. Click "Simple" at the top right to toggle it.
WHAT YOU CAN DO
- Play @ neuronpedia.org - feel free to use a throwaway GitHub account to log in.
- Give feedback, ideas, and ask questions.
THE VISION
- Millions of casual and technical users play Neuronpedia daily, trying to solve each neuron (like NYT crossword/Wordle). There are weekly/monthly contests ("side quests"). Top scorers are ranked on leaderboards by country, region, etc.
- Neuronpedia sparks interest in AI safety for thousands of people and they contribute in other ways (switch fields, do research, etc).
- Researchers use the data to build safer and more predictable AI models. Companies post updated versions of their AI models (or parts of them) as new "campaigns" and iterate through increasingly safer models.
HOW NEURONPEDIA CAME ABOUT
After moving on from my previous startup, I reached out to 80,000 Hours for career advice. They connected me to William Saunders, who provided informal guidance (not affiliated with any company) on what might be useful products to develop for AI safety research. Three weeks ago, I started prototyping versions of Neuronpedia, starting as a reference website, then eventually iterating into a game.
Neuronpedia is seeded with data and tools from OpenAI's Automated Interpretability and Neel Nanda's Neuroscope.
IS THIS SUSTAINABLE?
Unclear. There's no revenue model, and there is nobody supporting Neuronpedia. I'm working full time on it and spending my personal funds on hosting, inference servers, OpenAI API, etc. If you or your organization would like to support this project, please reach out at johnny@neuronpedia.org.
COUNTERARGUMENTS AGAINST NEURONPEDIA
These are reasons Neuronpedia could fail to achieve one or more of its objectives. They're not insurmountable, but good to keep in mind.
- Can't get enough people to care about AI safety or think it's a real problem.
- Neurons are the wrong "unit" for useful interpretability and Neuronpedia is unable to adapt to the correct "unit" (groups of neurons, etc).
- Even the best human explanations are not good.
- Scoring algorithm for explanations is bad and can't be improved.
- Not engaging enough - the game isn't balanced, doesn't have enough "loops", etc.
- Bugs.
- Lack of funds.
- AI companies shut it down via copyright claims, cease and desist, etc.
- Unable to contain abusive users or spam.
- Too slow to stop misaligned AI.
Hi Martin,
Thanks for playing! I agree there is some risk of confirmation bias, and the option to hide explanations by default is very interesting.
It is designed the way it is now because I'd prefer to avoid too many duplicate explanations. Currently, you can only submit explanations that are not exact duplicates, though you can submit explanations that are very similar, e.g., "banana" vs "bananas".
The first downside would be that duplicate explanations may clutter up the voting options. The second downside is that when someone looks at the explanations later, the vote may be split between the two similar explanations, meaning a third explanation that is worse might actually win (e.g., "cherry" vs "banana(s)").
However, those downsides are not insurmountable. The server just needs a better duplicate/similarity check (checking for plurals, for example, or maybe even asking GPT-4), and if you explain similarly to an existing explanation, it automatically upvotes that one. I think it's definitely worth experimenting. The similarity check can't be too loose, though, otherwise we may lose out on great explanations that appear to be only marginally different but actually score very differently.
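To make the idea concrete, here is a minimal sketch of what such a check could look like. It is not Neuronpedia's actual server code: the `normalize` helper, the `Neuron` class, and the naive "strip a trailing s" plural rule are all hypothetical, chosen only to illustrate merging near-duplicates into a single vote.

```python
from dataclasses import dataclass, field


def normalize(explanation: str) -> str:
    """Hypothetical normalizer: lowercase, trim punctuation, and drop a
    trailing plural 's' from each word. Deliberately naive."""
    words = []
    for word in explanation.lower().split():
        word = word.strip(".,!?\"'")
        # Naive plural rule (an assumption): "bananas" -> "banana",
        # but leave short words and words ending in "ss" alone.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


@dataclass
class Neuron:
    # Maps each stored explanation text to its vote count.
    votes: dict = field(default_factory=dict)

    def submit(self, explanation: str) -> str:
        """Upvote a matching existing explanation, or store a new one.
        Returns the explanation text that received the vote."""
        key = normalize(explanation)
        for existing in self.votes:
            if normalize(existing) == key:
                self.votes[existing] += 1
                return existing
        self.votes[explanation] = 1
        return explanation


neuron = Neuron()
for text in ["banana", "cherry", "cherry", "bananas", "Bananas"]:
    neuron.submit(text)
print(neuron.votes)
```

In this run the three banana variants collapse into one entry with three votes, so "banana" beats "cherry" instead of losing because its votes were split. Matching only on exact equality after normalization keeps the check strict, which is one way to avoid merging explanations that look similar but score differently; a looser check (edit distance, or asking a language model) would need more care.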
Please keep the feedback coming, and join the Discord if you'd like to stay updated.