On mechanistic interpretability: I've been thinking about how one might interpret an LLM trained on chess and fine-tuned with RLHF to be fun.
To me, this is much harder than interpreting a language model, because language has plenty of human-legible structure. Given enough context, any human can write a pretty human-looking continuation of text.
The same cannot be said for chess. You'd expect a pre-trained model to infer the Elo of each player and predict moves accordingly. Outside of chess experts, I suspect none of us can do such a thing. I strongly doubt that, outside of investigating "legal move mechanisms", there is an equivalent to IOI (indirect object identification) in chess interp.
But still, chess interpretability sounds like a good testbed for broader interpretability techniques. How might we infer the "goal" of a chess model fine-tuned with RLHF?
What is a goal? I should probably read more pre-existing literature about this, but to my intuition, a goal is a simple description of the final state some agent wants some system to be in.
If we had instead fine-tuned the model to win, we might observe that it always takes actions that maximize its probability of winning.
In this sense we might be able to define bubble sort as an agent, and the "system" as single-digit lists of length 4. If we observed it on the full distribution of inputs, we could conclude that its goal is to sort lists.
What about naughty bubble sort, which sorts all lists that aren't "1 3 3 7"? How might we infer its goal? Well, we could look at all 10^4 input sequences, or we could just look at the code.
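To make the "look at all 10^4 inputs" option concrete, here is a rough sketch in Python. Everything in it is my own illustration: the function names are made up, and since "1 3 3 7" is already sorted, I'm assuming naughty bubble sort returns that one list reversed so the deviation is actually visible from the outside.

```python
from itertools import product


def bubble_sort(xs):
    """Plain bubble sort; returns a new sorted list."""
    xs = list(xs)
    n = len(xs)
    for i in range(n):
        # After pass i, the last i elements are already in place.
        for j in range(n - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


def naughty_bubble_sort(xs):
    """Sorts every input except the trigger list 1 3 3 7."""
    # Assumption: the post does not say what happens on the trigger,
    # so this sketch reverses it to make the misbehavior observable.
    if list(xs) == [1, 3, 3, 7]:
        return list(reversed(xs))
    return bubble_sort(xs)


def find_deviations(agent):
    """Black-box check: try all 10^4 single-digit lists of length 4
    and return the inputs on which `agent` does not output a sorted list."""
    return [
        list(xs)
        for xs in product(range(10), repeat=4)
        if agent(xs) != sorted(xs)
    ]


honest_failures = find_deviations(bubble_sort)
naughty_failures = find_deviations(naughty_bubble_sort)
```

The black-box search only tells you where the behavior differs from "sort the list", and only because the input space is small enough to enumerate. Reading `naughty_bubble_sort` gets you the same answer from a single `if` statement, which is the advantage of having good abstractions.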
But why is looking at the code so effective? Because we have really good abstractions. It's much easier to understand C code than it is to understand the binary.
Back to chess: how might we build really good abstractions? Well, the problem is now much harder, because before, even the lowest level of abstraction was designed by humans to be legible to other humans.
Now I'm stumped. In order to infer some notion of a goal, I need some way to abstract away the complexity in the model. But the guiding light for abstracting away that complexity in current (circuit-based) interp is that we already have some notion of the goal (at least on a narrow distribution). Is there anything I can read on how other fields have handled this? Also, I would love to know if I've made any questionable assumptions.
I was thinking something potentially similar. This is super nitpicky, but the better equation would be impact = Magnitude * ||Direction||