Introduction
Anthropic recently introduced Stage-Wise Model Diffing, a method for tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.
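For readers unfamiliar with the architecture, here is a minimal sketch of what we mean by a single-model all-layer crosscoder: one shared latent space that encodes and reconstructs the residual stream at every layer, rather than an SAE's single layer. The names and shapes below are illustrative assumptions, not our actual implementation.

```python
import torch
import torch.nn as nn

class AllLayerCrosscoder(nn.Module):
    """Sketch of a single-model all-layer crosscoder: shared latents
    read from and reconstruct the residual stream at every layer."""
    def __init__(self, n_layers: int, d_model: int, d_latent: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, d_latent) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_latent, n_layers, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))

    def forward(self, acts):  # acts: (batch, n_layers, d_model)
        # Sum per-layer encoder projections into one shared latent code.
        pre = torch.einsum("bld,lde->be", acts, self.W_enc) + self.b_enc
        z = torch.relu(pre)  # (batch, d_latent)
        # Each latent decodes to every layer simultaneously.
        recon = torch.einsum("be,eld->bld", z, self.W_dec)
        return recon, z
```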
This post documents our methodology: we fine-tuned a TinyStories language model to exhibit sleeper-agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how they change across the stages. Running all training and experiments takes under an hour on a single RTX 4090 GPU.
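As a preview of the measurement step, one simple way to quantify how much a feature moves between the stage-one and stage-two crosscoders is the cosine similarity of its decoder directions. The helper below is a hedged sketch of that idea; the `feature_drift` name and tensor layout are ours, assumed for illustration.

```python
import torch
import torch.nn.functional as F

def feature_drift(dec_base: torch.Tensor, dec_ft: torch.Tensor) -> torch.Tensor:
    """Per-feature cosine similarity between decoder directions of the
    crosscoder before and after fine-tuning on the modified model.
    dec_*: (n_features, d) decoder matrices (flattened across layers for
    an all-layer crosscoder). Low similarity = feature changed more."""
    return F.cosine_similarity(dec_base, dec_ft, dim=-1)

# Hypothetical usage: rank features by how much they rotated.
# sims = feature_drift(crosscoder_base.W_dec.flatten(1), crosscoder_ft.W_dec.flatten(1))
# most_changed = torch.argsort(sims)[:50]  # 50 most-changed features
```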
We release code for training and analysing sleeper agents and...