RLHF
• Applied to Run evals on base models too! by orthonormal 1mo ago
• Applied to Why do we need RLHF? Imitation, Inverse RL, and the role of reward by Ran W 3mo ago
• Applied to The case for more ambitious language model evals by Jozdien 3mo ago
• Applied to The True Story of How GPT-2 Became Maximally Lewd by Writer 4mo ago
• Applied to Interpreting the Learning of Deceit by RogerDearnaley 5mo ago
• Applied to Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks. by Sohaib Imran 6mo ago
• Applied to Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour 6mo ago
• Applied to Paul Christiano on Dwarkesh Podcast by ESRogs 6mo ago
• Applied to Wireheading and misalignment by composition on NetHack by pierlucadoro 6mo ago
• Applied to Compositional preference models for aligning LMs by Tomek Korbak 6mo ago
• Applied to Towards Understanding Sycophancy in Language Models by Ethan Perez 6mo ago
• Applied to VLM-RM: Specifying Rewards with Natural Language by ChengCheng 7mo ago
• Applied to unRLHF - Efficiently undoing LLM safeguards by Pranav Gade 7mo ago
• Applied to LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B by Simon Lermen 7mo ago
• Applied to Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured by mnvr 7mo ago
• Applied to Beginner's question about RLHF by Ruby 9mo ago