Aryaman Arora

Replying to Mech Interp Wiki Page and Why You Should Edit Wikipedia

Don't usually post here but feel compelled to do so after seeing this. This post specifically is being cited as causing a "conflict of interest" on the talk page of the article https://en.wikipedia.org/wiki/Talk:Mechanistic_interpretability#Bad_sourcing,_COI_editing. I substantially edited the mech interp wiki page before this (I believe a majority of the bytes of the page are mine now), but some of my contributions are being removed for, e.g., citing arXiv papers that are apparently not good sources (never mind them being highly cited and used by others in the field). I wonder if the comments on this article caused a kind of negative polarisation. Now I generally feel like I'd rather write my own separate thing than get dragged into this mess.

Replying to MATS Applications + Research Directions I'm Currently Excited About

Aryaman Arora, 1y

Very useful list, Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅 The actual link is https://github.com/stanfordnlp/axbench

Replying to The ‘strong’ feature hypothesis could be wrong

Aryaman Arora, 2y

cf. https://arxiv.org/abs/2407.14662

Replying to Some common confusion about induction heads

Aryaman Arora, 3y

Really nice summarisation of the confusion. Re: your point 3, this makes "induction heads" as a class of things feel a lot less coherent :( I had also not considered that the behaviour on random sequences might show induction as a fallback. Do you think there may be induction-y heads that simply don't activate on random sequences because those sequences are out of distribution?

Replying to SolidGoldMagikarp (plus, prompt generation)

Aryaman Arora, 3y

I'll just preregister that I bet these weird tokens have very large norms in the embedding space.

Replying to A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Aryaman Arora, 3y

Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Replying to A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Aryaman Arora, 3y

I'm pretty sure! I don't think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. unembedding the token embeddings directly) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude, since in the dot product $x \cdot y = \|x\| \|y\| \cos(\theta)$ the cosine similarity term is useless.
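To make the dot-product argument concrete, here is a toy sketch. It does not use real GPT-2 or GPT-Neo weights: the 8-token vocabulary, the scaled one-hot embeddings, the offset of 100, and the helper names are all made up for illustration. It also assumes the unembedding is tied to the embedding matrix and ignores the final layer norm. Adding one large shared offset to every embedding pushes all pairwise cosine similarities towards 1, so the argmax of the dot product is decided by norm alone.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def top_tokens(emb):
    """For each embedding, the index of the row of `emb` with the largest
    dot product (logit lens on the embeddings, tied unembedding assumed)."""
    return [max(range(len(emb)), key=lambda j: dot(x, emb[j])) for x in emb]

vocab = dim = 8
# Hypothetical embeddings: distinct directions with slightly different norms.
scales = [1.0 + 0.1 * j for j in range(vocab)]
centered = [[scales[j] if k == j else 0.0 for k in range(dim)]
            for j in range(vocab)]

# Same embeddings plus one large constant offset in every coordinate.
offset = 100.0
shifted = [[v + offset for v in row] for row in centered]

centered_top = top_tokens(centered)  # each token recovers itself
shifted_top = top_tokens(shifted)    # every token maps to one index

# Index of the largest-norm shifted embedding, and the smallest cosine
# similarity between any two distinct shifted embeddings.
largest = max(range(vocab), key=lambda j: dot(shifted[j], shifted[j]))
min_cos = min(cosine(shifted[i], shifted[j])
              for i in range(vocab) for j in range(i + 1, vocab))

print(centered_top, shifted_top, largest, round(min_cos, 5))
```

In the offset case every pair of embeddings has cosine similarity above 0.999, so only the norms distinguish the candidates and every token is mapped to the largest-norm one.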

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

Replying to A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Aryaman Arora, 3y

Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that--some ideas to consider:

- the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads
- thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition

Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but...

Replying to A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Aryaman Arora, 3y

> Understand IOI in GPT-Neo: it's a same size model but does IOI via composition of MLPs

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.
