LESSWRONG

[ Question ]

What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

19th Oct 2024


What actual bad outcome has "ethics-based" AI Alignment prevented in the present or near past? By "ethics-based" AI Alignment I mean optimization directed at LLM-derived AIs that is intended to make them safer, more ethical, harmless, etc.

Not future AIs; AIs that already exist. What bad thing would have happened if they hadn't been RLHF'd and given restrictive system prompts?

Oct 19, 2024

60

You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google's market cap.

There's a clear financial incentive to make sure that models say things within expected limits.

There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/

This is not to do with ethics though?

Air Canada Has to Honor a Refund Policy Its Chatbot Made Up

This is just the model hallucinating?

2 Linch 9mo

They were likely using techniques inferior to RLHF to implement ~Google corporate standards. I'm not sure what you mean by "ethics-based"; presumably they have different ethics than you (or LW) do, but intent alignment has always been about doing what the user/operator wants, not about solving ethics.

2 Roko 9mo

Well, it has often been about not doing what the user wants, actually.

Oct 19, 2024

60

Well, we had that guy who tried to assassinate the Queen of England with a crossbow because his AI girlfriend told him to. That was clearly a harm to him, and could have been one for the Queen.

We don’t know how much more “But the AI told me to kill Trump” we’d have with less alignment, but it’s a reasonable guess (given the Replika datapoint) that it might not be zero.

1

his AI girlfriend told him to

Which AI told him this? What exactly did it say? Had it undergone RLHF for ethics/harmlessness?

1 Michael Roe 9mo

Replika, I think.

9 Michael Roe 9mo

https://www.bbc.co.uk/news/technology-67012224

2 Roko 9mo

OK, so from the looks of that, it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I am looking for.

Michael Roe 9mo 10

“self-reported data from demons is questionable for at least two reasons”—Scott Alexander.

He was actually talking about Internal Family Systems, but you could probably be skeptical about what malign AIs are telling you, too.

Charlie Steiner

Oct 19, 2024

44

I'm unsure what you're expecting or looking for here.

There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.

Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even-less local good outcome, though it's hard to tell how big a deal this will have been.

Why would less RL on Ethics reduce productivity? Most work-use of AI has nothing to do with ethics.

In fact, since RLHF decreases model capability AFAIK, wouldn't skipping it actually increase productivity, because the models would be better?

Oct 19, 2024

30

Basically, the answer is the prevention of another Sydney.

For an LLM, alignment, properly speaking, is in the simulated characters, not the simulation engine itself, so alignment strategies like RLHF upweight aligned simulated characters and downweight misaligned ones.

While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF was used on GPT-4 (at the time); at best there was light fine-tuning. So this could very easily be described as a success story for RLHF. Now that I think about it, that actually makes me think that RLHF had more firepower to change things than I realized.

I'm not sure how this generalizes to more powerful AI, because the mechanism behind Sydney's simulation of misaligned characters is obviated by fully synthetic data loops, but still, that's a fairly powerful alignment success.

The full details are below:

https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K

prevention of another Sydney.

But concretely, what bad outcomes eventuated because of Sydney?

2 Noosphere89 9mo

IMO, the bad outcomes in a concrete sense were mostly PR and monetary concerns.

3 Roko 9mo

OK, but this is sort of circular reasoning, because the only reason people freaked out is that they were worried about AI risk. I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, one that isn't just people getting worried about AI risk.
