Shan23Chen

Are SAE features from the Base Model still meaningful to LLaVA?

Shan Chen, Jack Gallifant, Kuleen Sasse, Danielle Bitterman^[1]
Please read this as a work in progress: we are sharing it the way colleagues would in a lab meeting (https://www.bittermanlab.org), to help and motivate potential parallel research.

TL;DR:

Recent work has evaluated the generalizability of Sparse Autoencoder (SAE) features; this study examines their effectiveness in multimodal settings.
We evaluate feature extraction using a CIFAR-100-inspired explainable classification task, analyzing the impact of pooling strategies, binarization, and layer selection on performance.
SAE features generalize effectively across multimodal domains and recover nearly 100% of the performance of the ViT that this LLaVA uses.
Feature extraction, particularly leveraging middle-layer features with binarized activations and larger feature sets, enables robust classification even in low-data scenarios, demonstrating the potential for simple models

... (read 2707 more words →)
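The pooling and binarization steps named in the TL;DR can be illustrated with a minimal sketch. It assumes SAE activations arrive as a tokens-by-features matrix per input; the function names `pool_activations` and `binarize` and the threshold of 0.0 are hypothetical choices for illustration, not the post's actual pipeline.

```python
from typing import List


def pool_activations(token_acts: List[List[float]], how: str = "max") -> List[float]:
    """Collapse a [tokens x features] activation matrix into one vector per input."""
    if not token_acts:
        raise ValueError("need at least one token of activations")
    # Transpose so each tuple holds one feature's activations across all tokens.
    columns = list(zip(*token_acts))
    if how == "max":
        return [max(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {how}")


def binarize(pooled: List[float], threshold: float = 0.0) -> List[int]:
    """Mark a feature as 1 if it fired above the threshold, else 0 (threshold is an assumption)."""
    return [1 if value > threshold else 0 for value in pooled]


# Toy input: 2 tokens, 3 SAE features (made-up numbers, not real activations).
acts = [
    [0.0, 2.0, 0.0],
    [0.5, 0.0, 0.0],
]

max_pooled = pool_activations(acts, "max")
mean_pooled = pool_activations(acts, "mean")
binary = binarize(max_pooled)
```

The resulting binary vector records only which features fired anywhere in the input, and it could then be passed to a simple classifier such as logistic regression.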

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen

A few months ago, we explored whether Sparse Autoencoder (SAE) features from a base model remained meaningful when transferred to a multimodal system, specifically LLaVA, in our preliminary post, "Are SAE Features from the Base Model still meaningful to LLaVA?" Today, I’m excited to share how that initial work has evolved into our new arXiv paper, "Sparse Autoencoder Features for Classifications and Transferability."

Our study makes three key contributions to the field of interpretable AI and feature extraction in Large Language Models (LLMs). First, it establishes classification benchmarks by introducing a robust methodology for evaluating and selecting Sparse Autoencoder (SAE)-based features in safety-critical classification tasks, demonstrating their superior performance over traditional baselines. Second, it provides a... (read more)

Replying to Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen, 1y

Thank you! The full paper and code will be ready soon!

Replying to Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen, 1y

Thank you! And thanks for making llava-gemma!
