When you self-distill a model (i.e. train a new model on the predictions of your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above more precise.
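To make the collapse concrete, here is a minimal sketch in a setting where it is easy to see exactly: kernel ridge regression, where each round refits the model to the previous round's predictions. This is a toy of my own, not the construction from the paper; the RBF kernel, the `ridge` value, the data, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Gaussian (RBF) kernel matrix between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def self_distill_norms(x, y, rounds=50, ridge=1.0):
    """Repeatedly fit kernel ridge regression to the previous round's
    predictions and return the norm of the predictions after each round."""
    K = rbf_kernel(x, x)
    # Maps training targets to the fitted model's predictions on x:
    # predictions = K (K + ridge * I)^{-1} targets.
    smoother = np.linalg.solve(K + ridge * np.eye(len(x)), K)
    targets = y.copy()
    norms = []
    for _ in range(rounds):
        targets = smoother @ targets  # the student is trained on the teacher's outputs
        norms.append(float(np.linalg.norm(targets)))
    return norms

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
norms = self_distill_norms(x, y)
print(f"round 1: {norms[0]:.3f}, round {len(norms)}: {norms[-1]:.3f}")
```

In this toy, every round multiplies each eigen-component of the predictions by a factor below 1, and the components with small kernel eigenvalues (the wiggly ones) shrink fastest. So the function gets smoother first and then shrinks toward the zero function.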
If you apply multiple rounds of self-distillation to a model, it becomes less complex. So if the original model learned complex, power-seeking behaviors that don't help it do well on the training data, these behaviors would likely go away after several rounds of self-distillation. As long as you stop before performance degrades, self-distillation essentially gives you the minimum-complexity model that still does well on the test set. Thus, I think it's promising from an AI safety standpoint.
I just wanted to add another angle. Neural networks have a fundamental "simplicity bias", where they learn low-frequency components exponentially faster than high-frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset, because the distilled function you're learning has fewer high-frequency components. This paper formalizes this claim.
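Here is a rough sketch of the frequency effect, using gradient descent on a kernel model as a stand-in for a neural network (the kernel picture of training, where predictions evolve as f ← f − lr·K(f − y)). It is not an actual network and not the paper's setup; the periodic kernel, the two frequencies, the step count, and the function names are my own assumptions.

```python
import numpy as np

def periodic_kernel(x, length_scale=0.5):
    """Smooth periodic kernel on [0, 1); on a uniform grid its
    eigenvectors are sines and cosines."""
    d = x[:, None] - x[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d) ** 2 / length_scale ** 2)

def residual_amplitudes(freqs=(1, 8), n=128, steps=200, lr=1.0):
    """Run function-space gradient descent on a sum of sines and return
    how much of each frequency is still unlearned afterwards."""
    x = np.arange(n) / n
    y = sum(np.sin(2 * np.pi * k * x) for k in freqs)
    K = periodic_kernel(x) / n
    f = np.zeros(n)
    for _ in range(steps):
        f -= lr * K @ (f - y)  # kernel gradient descent step on squared error
    residual = y - f
    # Project the residual onto each target sine to get its remaining amplitude.
    return {k: abs(2.0 / n * residual @ np.sin(2 * np.pi * k * x)) for k in freqs}

amps = residual_amplitudes()
for k, a in amps.items():
    print(f"frequency {k}: unlearned amplitude {a:.3g}")
```

Each frequency's error decays geometrically at a rate set by its kernel eigenvalue, and a smooth kernel has much smaller eigenvalues at high frequencies. So in this toy the low-frequency sine is fit almost immediately while the high-frequency one is barely touched in the same number of steps.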
But in practice, what this means is that training GPT-3.5 from scratch is hard but simply copying GPT-3.5 is pretty easy. Stanford was recently able to finetune a fairly weak 7B model to behave roughly like GPT-3.5 in their preliminary evaluation, using only 52K examples (generated from GPT-3.5) and under $600 in total, most of it API cost for generating the data rather than compute. This means that once a GPT is out there, it's fairly easy for malevolent actors to replicate it. And while it's unlikely that the original GPT model, given its strong simplicity bias, is engaging in complicated deceptive behavior, it's quite plausible that a malevolent actor would finetune their copy to be deceptive and power-seeking. This creates a perfect storm in which a malevolent AI can go rogue. I think this is a significant threat, and OpenAI should add some more guardrails to try to prevent this.