Experiments and write-up by Daniel, with advice from Stefan. The GitHub repo for the work can be found here.
Update (October 31, 2024): Our paper is now available on arXiv.
TL;DR
By perturbing activations along specific directions and measuring the resulting changes in the model output, we attempt to infer how much the directions matter for the model's computation. Using these sensitive-direction experiments, we show that:
- Heimersheim (2024)’s sensitive direction baselines (experiments 2 and 3) were flawed in that the perturbation direction involved subtracting the original activation. We propose an improved baseline direction (called cov-random mixture) which does not use the original activation.
- Gurnee (2024)’s KL-div for SAE reconstruction errors no longer seems pathologically high when we
...