Edit - Neuronpedia has pivoted to be a research tool for Sparse Autoencoders, so most of this post is outdated. Please read the new post, Neuronpedia: Accelerating Sparse Autoencoders Research.
Neuronpedia is an AI safety game that documents and explains each neuron in modern AI models. It aims to be the Wikipedia for neurons, where the contributions come from users playing a game. Neuronpedia wants to connect the general public to AI safety, so it's designed to not require any technical knowledge to play.
Neuronpedia is in experimental beta: it is getting its first users in order to collect feedback and ideas and to build an initial community.
OBJECTIVES
- Increase understanding of AI to help build safer AI
- Increase public engagement, awareness, and education in AI safety
CURRENT STATUS
- I started working on Neuronpedia three weeks ago, and I'm posting on LessWrong to develop an initial community and for feedback and testing. I'm not posting it anywhere else, so please do not share it yet in other forums like Reddit.
- There's an onboarding tutorial that explains the game, but to summarize: It's a word association game. You're shown one neuron ("puzzle") at a time, and its highest activations ("clues"). You then either vote for an existing explanation, or submit your own explanation. Neuronpedia's first "campaign" is explaining gpt2-small, layer 6.
- There is an "advanced mode" that allows testing custom activation text and shows more details/filters. Click "Simple" at the top right to toggle it.
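As a rough illustration of the vote-or-submit loop described above, here is a minimal Python sketch. It is not Neuronpedia's actual code: the class names, the `respond` and `top_explanation` methods, the neuron index, and the example clues are all hypothetical, and counting a case-insensitive text match as a vote is an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Explanation:
    """One player-written explanation and its vote count (hypothetical)."""
    text: str
    votes: int = 0


@dataclass
class NeuronPuzzle:
    """One neuron ("puzzle") and its top activations ("clues")."""
    model: str
    layer: int
    index: int
    clues: list[str]
    explanations: list[Explanation] = field(default_factory=list)

    def respond(self, text: str) -> Explanation:
        """Vote for a matching explanation, or submit a new one."""
        normalized = text.strip().lower()
        for explanation in self.explanations:
            if explanation.text.lower() == normalized:
                explanation.votes += 1  # existing explanation: count a vote
                return explanation
        # No match: treat the text as a new submission with one vote.
        new = Explanation(text=text.strip(), votes=1)
        self.explanations.append(new)
        return new

    def top_explanation(self) -> Optional[Explanation]:
        """Return the most-voted explanation, or None if there are none."""
        return max(self.explanations, key=lambda e: e.votes, default=None)


# Invented example data for a neuron in the gpt2-small, layer 6 campaign.
puzzle = NeuronPuzzle(
    model="gpt2-small",
    layer=6,
    index=0,
    clues=["agree", "settle", "reconcile"],
)
puzzle.respond("words about resolving a conflict")
puzzle.respond("Words about resolving a conflict")  # counted as a vote
puzzle.respond("legal terms")
best = puzzle.top_explanation()
```

A real scoring system would presumably weigh how well an explanation predicts activations rather than only counting votes; this sketch shows just the player-facing flow.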
WHAT YOU CAN DO
- Play @ neuronpedia.org - feel free to use a throwaway GitHub account to log in.
- Give feedback, ideas, and ask questions.
THE VISION
- Millions of casual and technical users play Neuronpedia daily, trying to solve each neuron (like NYT crossword/Wordle). There are weekly/monthly contests ("side quests"). Top scorers are ranked on leaderboards by country, region, etc.
- Neuronpedia sparks interest in AI safety for thousands of people and they contribute in other ways (switch fields, do research, etc).
- Researchers use the data to build safer and more predictable AI models. Companies post updated versions of their AI models (or parts of them) as new "campaigns" and iterate through increasingly safer models.
HOW NEURONPEDIA CAME ABOUT
After moving on from my previous startup, I reached out to 80,000 Hours for career advice. They connected me to William Saunders, who provided informal (not affiliated with any company) guidance on what might be useful products to develop for AI safety research. Three weeks ago, I started prototyping versions of Neuronpedia, first as a reference website, then eventually iterating into a game.
Neuronpedia is seeded with data and tools from OpenAI's Automated Interpretability and Neel Nanda's Neuroscope.
IS THIS SUSTAINABLE?
Unclear. There's no revenue model, and there is nobody supporting Neuronpedia. I'm working full time on it and spending my personal funds on hosting, inference servers, OpenAI API, etc. If you or your organization would like to support this project, please reach out at johnny@neuronpedia.org.
COUNTERARGUMENTS AGAINST NEURONPEDIA
These are reasons Neuronpedia could fail to achieve one or more of its objectives. They're not insurmountable, but they're good to keep in mind.
- Can't get enough people to care about AI safety or think it's a real problem.
- Neurons are the wrong "unit" for useful interpretability and Neuronpedia is unable to adapt to the correct "unit" (groups of neurons, etc).
- Even the best human explanations are not good.
- Scoring algorithm for explanations is bad and can't be improved.
- Not engaging enough - the game isn't balanced, doesn't have enough "loops", etc.
- Bugs.
- Lack of funds.
- AI companies shut it down via copyright claims, cease and desist, etc.
- Unable to contain abusive users or spam.
- Too slow to stop misaligned AI.
Unstructured feedback:
Thought of leaving when I was given my first challenge and thought, "what if these words actually don't have much in common, what if the neurons all just encode completely arbitrary categories due to being such a low strength model." Eventually I decided "No, I know what this is. This is A Thing. (words that suggest the approach towards some sort of interpersonal resolution)." Maybe that happens to everyone. I dunno.
It's kind of infuriating that you ask us to do a question, then don't accept the answer until we log in, then just waste our answer by sending us to another question after we've logged in. I guess you plan to solve the last part, in which case it's fine, but wow, like, you're going to smack every single one of your testers in the face with this?
My experience with the second question is that sending in a response does not work, I get an alert about a json parse error. Firefox.
Tell me when that's fixed, I guess.
Should be working now.
Also, thank you for the feedback re: the janky tutorial/sign-in. I will fix that. It is truly a terrible way to have a first experience with a product.
EDIT: the tutorial -> sign in friction has been updated.