Different attribution methods can be placed on a scale, with the X-axis reflecting ground truth (at least for image-interpretation tasks, i.e. how humans process the information) and the Y-axis reflecting how faithfully they capture how the model processes information in its own way. Attribution methods can highlight most of the truth, but they do not necessarily reflect accurately what the model has learned. The attribution method is a representation of the model, and the model is a representation of the data. Different levels of accuracy imply different levels of uncertainty ...
I have been wondering whether neural networks (or more specifically, transformers) will become the ultimate form of AGI. If not, will the existing research on interpretability become obsolete?
Hey Neel,
Great post!
I am trying to look into the code here, but the links don't work anymore! It would be great if you could update them!
I don't know if this link works for the original content: https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob...
So far the best summary I have seen!