When you self-distill a model (i.e. train a new model using predictions from your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above more precise.
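To make the collapse concrete, here is a minimal sketch, assuming kernel ridge regression with a Gaussian kernel as a toy model class. That choice, the helper names, and the values of `width` and `lam` are my own for illustration and are not taken from the paper. Each round fits a fresh model to the previous model's predictions on the same inputs, and the size of the predictions shrinks every round, so the fitted function drifts toward the constant zero function.

```python
import math

def gaussian_kernel(xs, width=0.2):
    # Gram matrix of a Gaussian (RBF) kernel; width is an illustrative choice.
    return [[math.exp(-((a - b) ** 2) / (2 * width ** 2)) for b in xs] for a in xs]

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_predict(K, targets, lam):
    # Kernel ridge regression: solve (K + lam*I) alpha = targets,
    # then return the fitted model's predictions on the training inputs.
    n = len(targets)
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(A, targets)
    return [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]

def self_distill(xs, ys, rounds, lam=0.5):
    # Each round trains a new model on the previous model's predictions.
    K = gaussian_kernel(xs)
    targets = list(ys)
    norms = []
    for _ in range(rounds):
        targets = fit_predict(K, targets, lam)
        norms.append(math.sqrt(sum(t * t for t in targets)))
    return norms

# Toy data: a slow sine wave plus a faster one.
xs = [i / 19 for i in range(20)]
ys = [math.sin(2 * math.pi * x) + 0.5 * math.sin(10 * math.pi * x) for x in xs]

norms = self_distill(xs, ys, rounds=30)
print([round(v, 3) for v in norms[::5]])  # prediction norm every 5th round
```

In this toy setup each round multiplies the predictions by K(K + lam*I)^-1, whose eigenvalues all lie strictly between 0 and 1, which is why the norm can only go down. Directions with small kernel eigenvalues are damped fastest, so the function gets simpler before it vanishes.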
Anyway, if you apply multiple rounds of self-distillation to a model, it becomes less complex. So if the original model learned complex, power-seeking behaviors that don't help it do well on the training data, these behaviors would likely go away after several rounds of self-distillation. Self-distillation allows you to essentially get the minimum-complexity model that still does well on the test set. Thus, I think it's promising from an AI safety standpoint.
I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and would probably be pretty tractable to make progress on.
The only scenario where I think self-distillation would be useful is this: you 1) train an LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset. The self-distilled model would then likely no longer be deceptive/power-seeking.