I'd be especially excited if this debate produced an adversarial-collaboration-style synthesis document, laying out the various perspectives and cruxes. I think that collapsing onto an optimism/pessimism binary loses a lot of important nuance; but also that HAIST reading, summarizing, and clearly communicating the range of views on RLHF could help people holding each of those views more clearly understand each other's concerns and communicate with each other.
I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.)
If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.
Worlds Where Iterative Design Fails, especially the section Why RLHF Is Uniquely Terrible.
I'd add that this bullet in the OP states part of the problem, but misses what I'd consider a central part:
- Improving RLHF has the effect of sweeping misaligned behavior under the rug, causing people to take alignment less seriously which e.g. causes large labs to underinvest in safety teams.
It's not just that hiding misbehavior causes people to take alignment less seriously. It's that, insofar as misbehavior is successfully hidden, it cannot be fixed by further iteration. We cannot iterate on a problem we cannot see.
I'd recommend the “AI safety via conditioning predictive models” doc my coauthors and I are working on right now. It's not quite ready to be published publicly, but we have a full draft that we're looking for comments on. Messaged to both of you privately; feel free to share with other HAIST members.
This recent comment thread discussing whether RLHF makes any progress beyond the classical "reward the agent when humans press the reward button" idea.
Overview: we're trying to form inside views about the value of alignment research related to RLHF. Please recommend reading material that expresses the relevant arguments.
The Harvard AI Safety Team (HAIST) is the AI safety student group at Harvard. Like the broader AI safety community, HAIST members have a variety of inside views about how useful research on RL from human feedback (RLHF) is. Broadly speaking, we can divide people into RLHF optimists and pessimists:
RLHF optimists, generally speaking, think that doing research in areas like the following is net positive:
RLHF pessimists, on the other hand, tend to think that research like the above is net neutral or negative. Pessimists might have beliefs like:
Of course, many people will have some optimistic views and some pessimistic views, or more complicated mixtures of these views. For example, someone might think that most RLHF research is net negative unless there's a breakthrough in improving human oversight, so that the only net positive research agenda listed above is improving human oversight for now.
An important clarification: many people think that RLHF is a viable approach to outer alignment but that most AI x-risk is due to inner alignment. Let's agree to call such a person an RLHF optimist even if they think that inner alignment agendas are more valuable to work on than RLHF agendas. One possible definition here is that we're calling Alice an RLHF optimist if, given a button which creates a new person who is only able to do RLHF research, Alice would pay money to push the button. (For example, as great as this Ajeya post is, it doesn't seem to imply a position on RLHF optimism/pessimism, only a position that inner alignment is important.)
For a while, HAIST has had an idea floating around of doing "the great RLHF debate," where we all try to converge with each other on the usefulness of RLHF research. This seems useful to do since at some point our members will graduate and actually begin alignment jobs, and they'll need some way of deciding which jobs to take. But if we're going to do the great RLHF debate, it seems important to do it well, i.e. in a way that actually tracks the truth.
One thing that makes having this debate tricky is that -- despite general community disagreement -- there don't seem to be many examples of people writing out their full reasoning about the usefulness of RLHF research. So this post is a call for recommendations for reading materials which speak to the usefulness of RLHF research. (We'd also be quite happy for people to leave comments explaining their views.)
Our current plan for the debate is to spend some time reading material in small groups and trying to precisely articulate the main points of disagreement; then one week later spend some more time debating the points of disagreement we identified and trying to converge with each other. Before and after the debate phase, we'll probably poll people about their views to see whether convergence happened and get a sense of how people's views changed. (We'd also be quite happy to get feedback on this format.)
Here are reading materials that we're already aware of:
We're planning on doing the reading phase of the debate next Tuesday (11/8), so materials we know about by Tuesday are much more useful than materials we're made aware of after Tuesday.