I am guessing it is the definition of "alignment" that people disagree on, or are mixed on?
Some possible definitions I have seen:
And maybe some people don't think it gets to solving x-risk yet, if they view the definition of alignment as being about x-risk only.
Yes, this is what I wanted to say here:
If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim.
I don't really see how this is responding to my comment. I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is any better (if anything, the opposite).
Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations? If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim. This is a claim about the motivations and worldviews of those people. But I don't know what sort of useful work "RLHF is an alignment technique" is doing, other than making claims that are not centrally about RLHF itself.
I admit I was not particularly optimizing for much detail here.
I use the term "alignment technique" essentially to mean a technique that was invented to align AIs with our values, in an attempt to reduce existential risk.
Note that this doesn't mean it will succeed, or that it's a very good technique, or one we should rely on solely; I make no claim about whether it succeeds or not, just that it's often discussed in the context of AI alignment.
I consider a lot of the disagreement over RLHF being an alignment technique to be essentially a disagreement about whether it actually works at all, not about whether it's an alignment technique actually being used in labs.
This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions. There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.
Two reasons:
First, the change made the sentence much worse to read. It might not have been strictly ungrammatical, but it was bad English.
Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description. What does "is an alignment technique" mean? Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes. Intersubjective agreement on what "is an alignment technique" means will be far worse than on "is a machine learning technique", and many implications of the first claim are far more contentious than of the second.
To me, "is an alignment technique" does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details). If you want to describe that kind of detail explicitly, that's one thing[1]. But it's actively confusing to conflate it with technical detail about the technique itself.
Though it's not the kind of detail that should live in the first sentence of the tag description, probably.
Yes, it's obviously an alignment technique, given that it is used as one, more or less successfully. @RobertM Could you perhaps explain your reason for reverting?
Okay, so I got a change reverted, but I'd like to ask: why aren't people pointing out that RLHF was, at least historically, and is even now, used as a technique for aligning AIs?
I'm not saying it's a good technique, but I consider it obviously an alignment technique, and most discussions of RLHF focus on the alignment context.
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique and an alignment technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal.
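For uninformed readers, here is a minimal sketch of what the mechanism in that sentence looks like in practice: a reward model is fit to human preference comparisons, and the policy is then nudged toward higher-scoring outputs with a KL penalty keeping it close to the original model. Everything below (model sizes, names, hyperparameters) is an illustrative toy, not any lab's actual implementation.

```python
# Toy RLHF sketch (illustrative only, not production code).
# Step 1: fit a reward model on pairwise human preference labels.
# Step 2: fine-tune a policy against that learned reward, with a KL penalty
#         toward a frozen copy of the original (reference) model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32  # toy sizes

class TinyLM(nn.Module):
    """Stand-in for a language model: maps a token context to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, tokens):               # tokens: (batch, seq)
        h = self.embed(tokens).mean(dim=1)   # crude pooling
        return self.head(h)                  # (batch, VOCAB) logits

class RewardModel(nn.Module):
    """Scores a (prompt + response) token sequence with a single scalar."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.score = nn.Linear(DIM, 1)
    def forward(self, tokens):
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

# --- Step 1: reward model from human comparisons (Bradley-Terry loss) ---
reward_model = RewardModel()
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
chosen = torch.randint(0, VOCAB, (16, 8))    # responses humans preferred
rejected = torch.randint(0, VOCAB, (16, 8))  # responses humans rejected
for _ in range(50):
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()

# --- Step 2: policy optimization against the learned reward ---
policy, reference = TinyLM(), TinyLM()
reference.load_state_dict(policy.state_dict())  # frozen copy of the initial model
pg_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
beta = 0.1  # KL penalty strength
prompts = torch.randint(0, VOCAB, (16, 8))
for _ in range(50):
    logits = policy(prompts)
    logp = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        ref_logp = F.log_softmax(reference(prompts), dim=-1)
    actions = torch.distributions.Categorical(logits=logits).sample()
    responses = torch.cat([prompts, actions.unsqueeze(1)], dim=1)
    with torch.no_grad():
        rewards = reward_model(responses)                      # proxy for "human evaluations"
    kl = (logp.exp() * (logp - ref_logp)).sum(dim=-1)          # stay close to the reference
    chosen_logp = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen_logp * rewards).mean() + beta * kl.mean()  # REINFORCE-style update
    pg_opt.zero_grad(); loss.backward(); pg_opt.step()
```

The point of the sketch is just that the "human evaluations of the model's outputs" enter as preference labels used to train a reward model, which then stands in for the ground-truth reward signal the quoted sentence says is absent.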
I think it is highly uncontroversial, even trivial, to call RLHF an alignment technique, given that it is literally used to nudge the model away from "bad" responses and toward "good" responses. The label "alignment technique" could only seem inappropriate here to someone with a nebulous, science-fiction idea of alignment as a technology that doesn't currently exist at all, as it was seen when Eliezer originally wrote the Sequences. I think it's obvious that this view is outdated now.