I wonder how this compares to the refusal direction paper. Perhaps ablating the self-other direction can yield similar results? I would guess that both effects combined would produce a stronger response.
Does it affect the performance on deception benchmarks like @Lech Mazur's lechmazur/deception and lechmazur/step_game? Is deception something inherent to larger models or can smaller fine-tuned models be more effective at creating/resisting disinformation?
IMO "overall model performance" doesn't tell the full story. It would be nice to see some examples outside of these competitive scenarios. The fine-tuning might make the model less helpful if it loses its ability to describe self-vs-other relationships well.
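
To make the ablation idea concrete, here is a minimal sketch of directional ablation in the style of the refusal direction work: take a difference-of-means direction between two sets of activations and project it out. The "self-other direction" here is my assumption (difference of mean activations on self-referencing vs. other-referencing prompts), and the arrays are random stand-ins, not real model activations.

```python
import numpy as np

def mean_difference_direction(acts_a, acts_b):
    """Unit vector along the difference of the two sets' mean activations."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Hypothetical stand-ins for residual-stream activations (n_prompts x d_model).
rng = np.random.default_rng(0)
self_acts = rng.normal(size=(32, 16)) + 1.0   # "self"-referencing prompts (assumed)
other_acts = rng.normal(size=(32, 16))        # "other"-referencing prompts (assumed)

direction = mean_difference_direction(self_acts, other_acts)
ablated = ablate_direction(self_acts, direction)
```

After ablation the activations have no component left along the chosen direction. Whether that changes deceptive behavior the way SOO fine-tuning does is exactly the open question.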
