LESSWRONG
LW

JacksonKaunismaa — LessWrong

7mo

Repo: https://github.com/DavidUdell/sparse_circuit_discovery

TL;DR: A SPAR project from a while back. A replication of an unsupervised circuit discovery algorithm in GPT-2-small, with a negative result.

Thanks to Justis Mills for draft feedback and to Neuronpedia for interpretability data.

Introduction

I (David) first heard about sparse autoencoders at a Bay Area party. I had been talking about how activation additions give us a map from expressions in natural language over to model activations. I said that what I really wanted, though, was the inverse map: the map going from activation vectors over to their natural language content. And, apparently, here it was: sparse autoencoders!

The field of mechanistic interpretability quickly became confident that sparse autoencoder features were the right...

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery

Arjun Panickssery, Abhimanyu Pallavi Sudhir, JacksonKaunismaa

Work supported by MATS and SPAR. Code at https://github.com/ArjunPanickssery/math_problems_debate/.

Three measures for evaluating debate are:

1. Whether the debate judge outperforms a naive-judge baseline, where the naive judge answers questions without hearing any debate arguments.
2. Whether the debate judge outperforms a consultancy baseline, where the judge hears argument(s) from a single "consultant" assigned to argue for a random answer.
3. Whether the judge can continue to supervise the debaters as the debaters are optimized for persuasiveness. We can measure whether judge accuracy increases as the debaters vary in persuasiveness (measured with Elo ratings). This variation in persuasiveness can come from choosing different models, choosing the best of N sampled arguments for different values of N, or training debaters

...
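
As a rough illustration of the three measures above, here is a minimal Python sketch. It is not taken from the linked repository: the `Trial` record, the `accuracy` and `elo_update` helpers, the toy data, and the K-factor of 32 are all hypothetical, and the Elo rule shown is the standard chess-style update, which may differ from how the post actually fits ratings. A debater "wins" here when the judge sides with them, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    protocol: str  # hypothetical labels: "naive", "consultancy", or "debate"
    correct: bool  # whether the judge chose the right answer


def accuracy(trials, protocol):
    """Fraction of trials under `protocol` where the judge was correct."""
    relevant = [t for t in trials if t.protocol == protocol]
    if not relevant:
        raise ValueError(f"no trials for protocol {protocol!r}")
    return sum(t.correct for t in relevant) / len(relevant)


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """One standard Elo update after a single debate between A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


# Toy data, invented purely for illustration.
trials = [
    Trial("naive", True), Trial("naive", False),
    Trial("consultancy", True), Trial("consultancy", False),
    Trial("debate", True), Trial("debate", True),
]

# Measures 1 and 2: judge accuracy under debate minus each baseline.
gains = {
    "vs_naive": accuracy(trials, "debate") - accuracy(trials, "naive"),
    "vs_consultancy": accuracy(trials, "debate") - accuracy(trials, "consultancy"),
}

# Measure 3 needs a persuasiveness score per debater; an Elo rating is
# updated after each debate that the judge decides.
new_a, new_b = elo_update(1000.0, 1000.0, a_won=True)
```

For the third measure, one would then group debate trials by debater rating and check whether judge accuracy rises with it.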

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits