Jason Boxi Zhang

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jan 24, 2025•7

6 karma

1 post

Member for a year

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang

Key Findings:

1. We demonstrate that, within the GemmaScope series of Sparse Autoencoders, subspaces with semantically opposite meanings do not point in opposite directions.
2. Furthermore, subspaces that do point in opposite directions are usually not semantically related.
3. In a set of auxiliary experiments, we test the compositional injection of steering vectors (e.g., -1*happy + sad) and find moderate signals of success.
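
To make the geometric claims above concrete, here is a minimal sketch of the two operations they rely on: measuring whether two feature directions are opposite (cosine similarity near -1) and composing a steering vector as a weighted sum of directions. The vectors `happy` and `sad` are hand-picked toy values, not real GemmaScope decoder directions, and the helper names `cosine_similarity` and `compose_steering` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; -1.0 means exactly opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose_steering(directions, coefficients):
    """Weighted sum of feature directions, e.g. -1*happy + 1*sad."""
    dim = len(directions[0])
    return [
        sum(c * d[i] for c, d in zip(coefficients, directions))
        for i in range(dim)
    ]


# Toy 3-dimensional stand-ins for feature directions (assumed values).
happy = [1.0, 0.0, 0.0]
sad = [0.0, 1.0, 0.0]

# Semantically opposite here, yet geometrically orthogonal rather than opposite.
similarity = cosine_similarity(happy, sad)

# Composite steering vector corresponding to -1*happy + sad.
steer = compose_steering([happy, sad], [-1.0, 1.0])
```

In this toy setup `similarity` is 0 rather than -1, which mirrors the first finding in miniature: a pair of concepts can be semantic opposites without their directions being antiparallel.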

An Intuitive Introduction to Sparse Autoencoders

What are Sparse Autoencoders, and How Do They Work?

Figure: High-level diagram of the SAE workflow, taken from Adam Karvonen's blog post.

A Sparse Autoencoder (SAE) is a dictionary learning method whose goal is to learn monosemantic subspaces that map to high-level concepts (1). Several frontier AI labs have recently applied SAEs to interpret...
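
As a rough illustration of the dictionary-learning idea, the sketch below runs one forward pass through a tiny SAE: an encoder maps an activation to non-negative feature activations, and a decoder reconstructs the activation as a weighted sum of feature directions. It assumes a plain ReLU encoder, which may differ from the activation function used by any particular SAE release, and the weights are hand-picked toy values rather than trained parameters. The function names are hypothetical.

```python
def relu(values):
    """Zero out negative entries so that only a few features stay active."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Return (feature activations, reconstruction) for one input vector."""
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre_activations)
    # Each row of w_dec is one feature direction in the model's activation space.
    reconstruction = list(b_dec)
    for activation, direction in zip(features, w_dec):
        for i, component in enumerate(direction):
            reconstruction[i] += activation * component
    return features, reconstruction


# Toy SAE: 2-dimensional input, 4 dictionary features (assumed weights).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_dec = [0.0, 0.0]

features, reconstruction = sae_forward([2.0, -3.0], w_enc, b_enc, w_dec, b_dec)
```

Only two of the four features fire for this input, and the rows of `w_dec` are the directions whose pairwise geometry the post goes on to examine.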
