Working on Neuronpedia

Apologies for the issue with the Neuronpedia link. It has now been resolved.

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

TL;DR

We are releasing SAEBench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and downstream tasks. Use our codebase to evaluate your own SAEs!
You can compare 200+ SAEs of varying sparsity, dictionary size, architecture, and training time on Neuronpedia.
Think we're missing an eval? We'd love for you to contribute it to our codebase! Email us.

🔍 Explore the Benchmark & Rankings

📊 Evaluate your SAEs with SAEBench

✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI...

Hey Jacob + Philippe,

Hope you all don't mind but we put up layer 8 of your transcoders onto Neuronpedia, with ~22k dashboards here:

https://neuronpedia.org/gpt2-small/8-tres-dc

Each dashboard can be accessed at its own URL:

https://neuronpedia.org/gpt2-small/8-tres-dc/0 goes to feature index 0.

You can also test each feature with custom text:

Or search all features at: https://www.neuronpedia.org/gpt2-small/tres-dc

An example search: https://www.neuronpedia.org/gpt2-small/?sourceSet=tres-dc&selectedLayers=[]&sortIndexes=[]&q=the%20cat%20sat%20on%20the%20mat%20at%20MATS
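The URL patterns above can be sketched as two small helpers. This is only an illustration inferred from the links in this comment, not an official Neuronpedia API: the function names and the `BASE` constant are hypothetical, and the sketch assumes the `www` host used in the search links.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: the "www" host is used, as in the search links in this comment.
BASE = "https://www.neuronpedia.org"


def dashboard_url(model: str, layer: int, source_set: str,
                  index: Optional[int] = None) -> str:
    """Build the URL for a layer's feature list, or for a single
    feature dashboard when a feature index is given."""
    url = f"{BASE}/{model}/{layer}-{source_set}"
    return url if index is None else f"{url}/{index}"


def search_url(model: str, source_set: str, query: str) -> str:
    """Build a search URL in the same shape as the example search link."""
    # quote() percent-encodes spaces as %20, as in the example link.
    return (f"{BASE}/{model}/?sourceSet={source_set}"
            f"&selectedLayers=[]&sortIndexes=[]&q={quote(query)}")


print(dashboard_url("gpt2-small", 8, "tres-dc", 0))
print(search_url("gpt2-small", "tres-dc", "the cat sat on the mat at MATS"))
```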

Unfortunately I wasn't able to generate histograms, autointerp, or other layers for this yet. Am working on getting more layers up first.

Verification
I did spot checks of the first few dashboards and they seem to be correct. Please let me know if anything seems wrong or off. I am also happy to delete this comment if you do not find it useful or for any other reason - no worries.

Please let me know if you have any feedback or issues with this. I will also be reaching out directly via Slack.

Thanks Callum and yep we've been extensively using SAE-Vis at Neuronpedia - it's been extremely helpful for generating dashboards and it's very well maintained. We'll have a method of directly importing to Neuronpedia using the exports from SAE-Vis coming out soon.

This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.

TL;DR

Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating researchers for Sparse Autoencoders (SAEs) by hosting models, feature dashboards, data visualizations, tooling, and more.

Important Links

Explore: The SAE research focused Neuronpedia. Current SAEs for GPT2-Small:
- RES-JB: Residuals - Joseph Bloom (294k feats)
- ATT-KK: Attention Out - Connor Kissane + Robert Krzyzanowski (344k feats)
Upload: Get your SAEs hosted by Neuronpedia: fill out this <5 minute application
Participate: Join #neuronpedia on the Open Source Mech Interp Slack

Neuronpedia has received 1 year of funding from...

This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with support from Neel Nanda and Arthur Conmy. Joseph Bloom is funded by the LTFF, Manifund Regranting Program, donors and LightSpeed Grants. This post makes extensive use of Neuronpedia, a platform for interpretability focusing on accelerating interpretability researchers working with SAEs.

Links: SAEs on HuggingFace, Analysis Code

Executive Summary

This is an informal post sharing statistical methods that can be used to understand Sparse Autoencoder (SAE) features quickly and cheaply.

Firstly, we use statistics (standard deviation, skewness and kurtosis) of the logit weight distributions of features (W_uW_dec[feature]) to characterize classes of features, showing that many features can be understood

...
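As a rough illustration of the statistics mentioned above, the sketch below projects one feature's decoder row through the unembedding matrix and computes the standard deviation, skewness and kurtosis of the resulting logit weights. It is a minimal pure-Python sketch, not the post's analysis code: the function names are hypothetical, and it assumes W_dec has shape (n_features, d_model) and W_U has shape (d_model, d_vocab). In practice this would be vectorised with NumPy or PyTorch.

```python
import math
from typing import List, Sequence, Tuple


def logit_weights(w_dec_row: Sequence[float],
                  W_U: Sequence[Sequence[float]]) -> List[float]:
    """Project one decoder direction (length d_model) through the
    unembedding matrix W_U (d_model x d_vocab): one weight per token."""
    d_vocab = len(W_U[0])
    return [sum(w * W_U[i][t] for i, w in enumerate(w_dec_row))
            for t in range(d_vocab)]


def distribution_stats(xs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (standard deviation, skewness, kurtosis) of xs.
    Uses population moments; kurtosis is non-excess (3.0 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    z = [(x - mean) / std for x in xs]
    skew = sum(v ** 3 for v in z) / n
    kurt = sum(v ** 4 for v in z) / n
    return std, skew, kurt


# Toy example: d_model = 2, d_vocab = 3 (made-up numbers).
W_U = [[-1.0, 0.0, 1.0],
       [5.0, 5.0, 5.0]]
feature_direction = [1.0, 0.0]
std, skew, kurt = distribution_stats(logit_weights(feature_direction, W_U))
print(std, skew, kurt)
```

A feature whose logit weights are strongly skewed or heavy-tailed promotes or suppresses a small set of tokens much more than the rest, which is the kind of pattern these statistics are meant to flag.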

Hey Joseph (and coauthors),

Your directions are really fantastic. I hope you don't mind, but I generated the activation data for the first 3000+ directions for each of the 12 layers and uploaded your directions to Neuronpedia:

https://www.neuronpedia.org/gpt2-small/res-jb

Your directions are also linked on the home page and the model page.

They're also accessible by layer (sorted by top activation), eg layer 6: https://neuronpedia.org/gpt2-small/6-res-jb

I added the "Anthropic dashboard" to Neuronpedia for your dataset.

Explanations, comments, and autointerp scoring are also working - anyone can do this:

Click a direction and submit explanation on the top-left. Here's another Star Wars direction (5-RES-JB:1681) where GPT4 gave me a score of 96:
- Click the score for the scoring details:

I plan to do...

Apparently one or more anonymous users got really excited and ran a bunch of simultaneous searches while I was sleeping, triggering this open tokenizer bug/issue and causing our TransformerLens server to hang/crash. This caused some downtime.

A workaround has been implemented and pushed.

Thanks for the tip! I've added the link under "Exploration Tools" after the first mention of Neuronpedia. Let me know if that is the proper way to do it - I couldn't find a feature on LW for a special "context link" if there is such a feature.

TL;DR: Interactive exploration of new directions in GPT2-Small. Try it yourself.

OpenAI recently released their Sparse Autoencoder for GPT2-Small. In this story-driven post, I run experiments and poke around the 325k active directions to see how good they are. The results were super interesting, and I encountered more than a few surprises along the way. Let's get started!

Exploration Tools

I uploaded the directions to Neuronpedia [prev LW post], which now supports exploring, searching, and browsing different layers, directions, and neurons.

Let's do some experiments with the objective: What's in these latent directions? Are they any good?

Test 1: Find a Specific Concept - Red

For the first test, we'll look for a specific concept, and then benchmark...

Great work Adam, especially on the customizability. It's fascinating clicking through various types and indexes to look for patterns, and I'm looking forward to using this to find interesting directions.

Hey Nathan, so sorry this took so long. Finally shipped this - you can now toggle "Profanity/Explicit" OFF in "Edit Profile". Some notes about the implementation:

If enabled, hides activation texts that have a bad word in them (still displays the neuron)
Works by checking a list of bad words
Default is disabled (profanity shown) in order to get more accurate explanations
Asks user during onboarding for their preference

It turns out that nearly all neurons have some sort of explicit (or "looks like explicit") text, so it's not feasible to automatically skip all these neurons - we end up skipping everything. So we only hide the individual activation texts that are explicit.
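The behaviour described above can be sketched roughly as follows. This is not Neuronpedia's actual code: the word list, function names, and the `hide_explicit` flag are all hypothetical placeholders, and the whole-word tokenization is one assumed way of "checking a list of bad words".

```python
import re
from typing import Iterable, List, Set

# Placeholder list; the real list of bad words is not given in this comment.
BAD_WORDS: Set[str] = {"darn", "heck"}


def contains_bad_word(text: str, bad_words: Set[str] = BAD_WORDS) -> bool:
    """Return True if any whole word in the text is on the bad-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in bad_words for token in tokens)


def visible_activation_texts(texts: Iterable[str],
                             hide_explicit: bool = False) -> List[str]:
    """Filter a neuron's activation texts. The neuron itself is always
    shown; only individual explicit texts are hidden. The default
    (hide_explicit=False) shows everything, matching the default above."""
    if not hide_explicit:
        return list(texts)
    return [t for t in texts if not contains_bad_word(t)]


texts = ["what the heck is this", "hello world"]
print(visible_activation_texts(texts))                      # nothing hidden
print(visible_activation_texts(texts, hide_explicit=True))  # first text hidden
```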

Sorry again and thanks for the feedback!

Sorry about that! Should have fixed that way earlier. I've transferred the app to "Neuronpedia", so it should appear correctly now. Thank you for flagging this.

Yes, this is a great idea. I think something other than "skip" is needed, since skip isn't declaring "this neuron has too many meanings or doesn't seem to do anything", which is actually useful information.

Edit - Neuronpedia has pivoted to be a research tool for Sparse Autoencoders, so most of this post is outdated. Please read the new post, Neuronpedia: Accelerating Sparse Autoencoders Research.

Neuronpedia is an AI safety game that documents and explains each neuron in modern AI models. It aims to be the Wikipedia for neurons, where the contributions come from users playing a game. Neuronpedia wants to connect the general public to AI safety, so it's designed to not require any technical knowledge to play.

Neuronpedia is in experimental beta: getting its first users in order to collect feedback, ideas, and build an initial community.

OBJECTIVES

Increase understanding of AI to help build safer AI
Increase public engagement,

...

LESSWRONG

Johnny Lin

Neuronpedia

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Understanding SAE Features with the Logit Lens

Johnny Lin

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Understanding SAE Features with the Logit Lens

Exploring OpenAI's Latent Directions: Tests, Observations, and Poking Around

Neuronpedia
