
I'm writing this post to discuss solutions to the October challenge, and present the challenge for this November. 

If you've not read the first post in this sequence, I'd recommend starting there - it outlines the purpose behind these challenges and lists the recommended prerequisite material.

November Problem

The problem for this month is interpreting a model which has been trained to classify the cumulative sum of a sequence.

The model is fed sequences of integers, and is trained to classify the cumulative sum at each sequence position. There are 3 possible classifications:

  • 0 (if the cumsum is negative),
  • 1 (if the cumsum is zero),
  • 2 (if the cumsum is positive).

For example, if the sequence is:

[0, +1, -3, +2, +1, +1]

Then the classifications would be:

[1, 2, 0, 1, 2, 2]
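
As a concrete reference, here's a minimal sketch (in plain numpy, not part of the actual problem code) of how these labels are computed:

import numpy as np

def classify_cumsum(seq):
    # Label each position by the sign of the running sum:
    # 0 = negative, 1 = zero, 2 = positive.
    running = np.cumsum(seq)
    return list(np.sign(running).astype(int) + 1)

print(classify_cumsum([0, +1, -3, +2, +1, +1]))  # [1, 2, 0, 1, 2, 2]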

The model is not attention-only: it has one attention layer with a single head, and one MLP layer. It does not have layernorm at the end of the model. It was trained with weight decay, using the Adam optimizer with a linearly decaying learning rate.
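
For illustration, here's a hypothetical sketch of that architecture using TransformerLens - the widths, context length and vocabulary sizes below are placeholder guesses (the real hyperparameters are on the Streamlit page), but the structural choices match the description above:

from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,               # one attention layer followed by one MLP layer
    n_heads=1,                # a single attention head
    d_model=64,               # placeholder width
    d_head=64,                # placeholder head dimension
    d_mlp=256,                # the model is not attention-only, so it has an MLP
    d_vocab=23,               # placeholder input vocabulary of small integers
    d_vocab_out=3,            # three output classes: negative / zero / positive cumsum
    n_ctx=20,                 # placeholder maximum sequence length
    act_fn="relu",
    normalization_type=None,  # assuming no layernorm, per "no layernorm at the end"
)
model = HookedTransformer(cfg)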

I don't expect this problem to be as difficult as some of the others in this sequence; however, the presence of an MLP layer does provide a different kind of challenge.

You can find more details on the Streamlit page. Feel free to reach out if you have any questions!

 

October Problem - Solutions

In the second half of the sequence, the attention heads perform the algorithm "attend back to (and copy) the first token which is larger than me". For example, in a sequence like:

[7, 5, 12, 3, SEP, 3, 5, 7, 12]

we would have the second 3 token attending back to the first 5 token (because 5 is the smallest value in the first half that's larger than 3), the second 5 attending back to 7, etc. The SEP token just attends to the smallest token.
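
To make the claimed algorithm precise, here's a small reference implementation of that attention rule in plain Python (describing the intended behaviour, not the model's actual weights):

def copied_token(first_half, current_value):
    # The head's rule: attend back to (and copy) the smallest value in the
    # first half which is strictly larger than the current token.
    larger = [x for x in first_half if x > current_value]
    return min(larger) if larger else None

first_half = [7, 5, 12, 3]
for tok in sorted(first_half)[:-1]:
    print(tok, "->", copied_token(first_half, tok))
# 3 -> 5
# 5 -> 7
# 7 -> 12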

Some more refinements to this basic idea:

  • The two attending heads split responsibilities across the vocabulary. Head 0.0 is the less important head; it deals with values in the range 28-37 (roughly). Head 0.1 deals with most other values.
  • In subsequences x < y < z where the three numbers are close together, x will often attend to z rather than to y. So why isn't this an adversarial example, i.e. why does the model still correctly predict that y follows x?
    • Answer - the OV circuit shows that when we attend to source token s, we also boost things slightly less than s, and suppress things slightly more than s.
    • So in the case of x < y < z, we have:
      • Attention to y will boost y a lot, and suppress z a bit.
      • Attention to z will boost z a lot, and boost y a bit.
    • So even if z gets slightly more attention than y, it might still be the case that y gets predicted with higher probability (see the toy numerical sketch after this list).
  • Sequences with large jumps are adversarial examples (because they're rare in the training data, which was generated by randomly choosing subsets of the values without replacement).
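
Here's a toy numerical illustration of the x < y < z argument above (the attention weights and OV effects are invented for illustration, not measured from the model):

# Suppose z (wrongly) gets slightly more attention than y at the x position.
attn = {"y": 0.45, "z": 0.55}

# Hypothetical OV contributions to the logits of y and z from each source token:
# attending to y boosts y a lot and suppresses z a bit;
# attending to z boosts z a lot and boosts y a bit.
ov = {
    "y": {"y": +4.0, "z": -1.0},
    "z": {"y": +1.0, "z": +4.0},
}

logit_y = attn["y"] * ov["y"]["y"] + attn["z"] * ov["z"]["y"]  # 2.35
logit_z = attn["y"] * ov["y"]["z"] + attn["z"] * ov["z"]["z"]  # 1.75
print(logit_y > logit_z)  # True: y is still predicted over z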

Best Submissions

We received more submissions for this month's problem than for any other in the history of the series, so thanks to everyone who attempted it! The best solution to this problem was by Vlad K, who correctly identified the model's tendency to produce unexpected attention patterns when 3 numbers are close together, and figured out how the model manages to produce correct classifications anyway.

Best of luck for this and future challenges!

Comments

Winner = first correct solution, or winner = best / highest-quality solution over what time period?

Winner = highest-quality solution over the time period of a month (solutions get posted at the start of the next month, along with a new problem).

Note that we're slightly de-emphasising the competition side now that occasional hints get dropped during the month in the Slack group. I'll still credit the best solution in the Slack group & the next LW post, but the choice to drop hints was made to keep the problems accessible and hopefully increase the overall reach of this series.