Basically, the answer is preventing another Sydney.
For an LLM, alignment, properly speaking, is in the simulated characters, not in the simulation engine itself. So alignment strategies like RLHF work by upweighting aligned simulated characters and downweighting misaligned ones.
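To make the "upweight/downweight" claim concrete, here's a minimal sketch of the kind of reward-weighted objective I have in mind. This is my own illustration (a bare REINFORCE-style loss, not PPO, and not anyone's actual training code); the function names and the usage lines are hypothetical.

```python
# Conceptual sketch: RLHF-style reward weighting nudges the simulator toward
# sampling "aligned characters" by pushing up the log-probability of completions
# a reward model scores highly, and pushing down the rest.
import torch

def reward_weighted_loss(logprobs: torch.Tensor, rewards: torch.Tensor, baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE-style objective.

    logprobs: (batch,) summed log-prob of each sampled completion under the model
    rewards:  (batch,) reward-model scores (e.g. "how aligned does this character act?")
    """
    advantages = rewards - baseline
    # Negative sign because optimizers minimize; high-reward samples get upweighted,
    # low-reward samples get downweighted.
    return -(advantages.detach() * logprobs).mean()

# Hypothetical usage: sample completions, score them with a reward model, then step.
# loss = reward_weighted_loss(logprobs, scores, baseline=scores.mean().item())
# loss.backward(); optimizer.step()
```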
While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF had been applied to GPT-4 at the time; at best there was light fine-tuning. So this could very easily be described as a success story for RLHF, and now that I think about it, that actually makes me believe RLHF had more firepower to change things than I realized.
I'm not sure how well this generalizes to more powerful AI, since the mechanism behind Sydney's simulation of misaligned characters is obviated by fully synthetic data loops, but it's still a fairly powerful alignment success.
The full details are below:
https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K
OK, so from the looks of that, it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I'm looking for.