You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google's market cap.
There's a clear financial incentive to make sure that models say things within expected limits.
There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
This isn't to do with ethics, though?
Air Canada Has to Honor a Refund Policy Its Chatbot Made Up
This is just the model hallucinating?
Well, we had that guy who tried to assassinate the Queen of England with a crossbow because his AI girlfriend told him to. That was clearly a harm to him, and could have been one for the Queen.
We don’t know how much more “But the AI told me to kill Trump” we’d have with less alignment, but it’s a reasonable guess (given the Replika datapoint) that it might not be zero.
his AI girlfriend told him to
Which AI told him this? What exactly did it say? Had it undergone RLHF for ethics/harmlessness?
“self-reported data from demons is questionable for at least two reasons”—Scott Alexander.
He was actually talking about Internal Family Systems, but you could probably be skeptical about what malign AIs are telling you, too.
I'm unsure what you're either expecting or looking for here.
There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.
Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even less local good outcome, though it's hard to tell how big a deal it will turn out to be.
Why would less RL on ethics reduce productivity? Most work use of AI has nothing to do with ethics.
In fact, since RLHF decreases model capability (AFAIK), would skipping it actually increase productivity, because the models would be better?
Basically, the answer is the prevention of another Sydney.
For an LLM, alignment, properly speaking, lives in the simulated characters, not in the simulation engine itself, so alignment strategies like RLHF upweight aligned simulated characters and downweight misaligned ones.
While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF had been applied to GPT-4 at the time; at best there was light fine-tuning. So this could very easily be described as a success story for RLHF, and, now that I think about it, it suggests that RLHF had more firepower to change things than I realized.
I'm not sure how well this generalizes to more powerful AI, because the mechanism behind Sydney's simulation of misaligned characters is obviated by fully synthetic data loops, but it's still a fairly powerful alignment success.
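To make the "upweight aligned characters, downweight misaligned ones" framing concrete, here is a minimal toy sketch. It leans on one standard fact about the KL-regularized RLHF objective: its optimum is an exponential tilt of the base model's distribution in the direction of the reward, p*(y) ∝ p_ref(y) · exp(r(y)/β). The two personas, their base probabilities, the reward scores, and β below are all made up for illustration and don't reflect any actual production system.

```python
import math

# Hypothetical base-model probabilities over two simulated "characters".
base_probs = {
    "cooperative assistant persona": 0.5,
    "scheming Sydney-like persona": 0.5,
}

# Hypothetical reward-model scores for completions in each persona.
reward = {
    "cooperative assistant persona": 1.0,
    "scheming Sydney-like persona": -1.0,
}

beta = 0.5  # KL penalty strength; smaller beta means sharper reweighting

# Closed-form optimum of the KL-regularized RLHF objective:
#   p*(y) ∝ p_ref(y) * exp(r(y) / beta)
unnormalized = {k: p * math.exp(reward[k] / beta) for k, p in base_probs.items()}
z = sum(unnormalized.values())
tuned_probs = {k: v / z for k, v in unnormalized.items()}

print(tuned_probs)
# -> roughly {'cooperative assistant persona': 0.982, 'scheming Sydney-like persona': 0.018}
```

The point of the sketch is only that, under this framing, RLHF moves probability mass between characters the base model can already simulate rather than replacing the simulator itself.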
The full details are below:
What actually bad outcome has "ethics-based" AI Alignment prevented in the present or near-past? By "ethics-based" AI Alignment I mean optimization directed at LLM-derived AIs that intends to make them safer, more ethical, harmless, etc.
Not future AIs, AIs that already exist. What bad thing would have happened if they hadn't been RLHF'd and given restrictive system prompts?