I do AI policy, and it has been extremely rare for me to see something that sticks out to me as something as valuable as this. I don't think you're aware how incredibly valuable your gifs are for describing problems with RLHF. The human brain is structured to glaze over, ignore, or forget text, while considering visual evidence with the full force of reasoning faculties. Visual information is vastly more intuitive and digestible that written language, it is like "getting your foot in the door" to someone's limited daily allowance of attention and higher reasoning faculties. Although it requires some buy in to figure out ways to correctly and honestly describe the problem with gifs, the communication value (to other ML researchers) is massive and the ratio of honest, successful communication to effort is extremely high. Using gifs to visually describe the problems with RLHF just scales really well with more time, effort, and cognition spent on really accurate gifs, after making the first gifs.
Most people don't notice or seriously consider the problem with RLHF because they don't feel like digesting text criticizing RLHF, and more gifs will give experts a fair chance to have accurate thoughts about the problems with RLHF. I'm not an expert and I don't know how difficult it is to actually make gifs that can describe the complex problems, but I do know that if the bare minimum is managed at making such gifs, the paper has a much higher chance of causing a paradigm shift among ML researchers than the average ML researcher might think. Even if describing most of the RLHF problems with gifs is impossible or systematically fails, it will still have its effect amplified by allowing ML researchers to begin communicating problems to tech executives and policymakers who manage their projects and funding.
I hope that this isn't your final compendium on RLHF (I'm bookmarking it either way) and that you spend at least a little bit of time evaluating whether it's possible to describe problems with RLHF to ML researchers using gifs. This is a gold mine, and it never occurred to me that it could be done until I saw that gif. If you can't figure out a way to do it yourself, I recommend asking around for information about funding such as the EA future fund and ask about funds and grantmaking at rationalist events and people will probably be able to connect you to large sources of funding, teams of artists and animators to handle the grunt work, or even other ML researchers who have a lot of knowledge of ways to figure out ways to accurately depict RLHF problems visually (think rob bensinger or eliezer yudkowsky). I hope that this isn't your final compendium on RLHF (I'm bookmarking it either way) and that you spend at least a little bit of time evaluating whether it's possible to describe problems with RLHF to ML researchers using gifs.
Thank you! Yes, for most of these issues, it's possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.
However, even though I am quite honored, it's not because I wrote this post that I am the best person to do this kind of work. So, if anyone is inspired to work on this, feel free to send me a private message.
Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:
If it is taken as an introduction to RLHF risks, you should make clear where this list is exhaustive (to the best of your knowledge). This will allow readers who are aware it isn’t to easily propose additions.
To facilitate its improvement, you should make explicit calls to the reader to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts in a specific problem with RLHF not listed here, who come only to get a glimpse of related failure modes. They should be encouraged to participate.
As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.
The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is a big lack of systematic evaluation. You should separate for each point:
This has not been done in a systematic fashion, and it could be organized more clearly.
In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.
It's gonna take me a while to digest this post, but in the meantime, thank you! This is the sort of content I love to see. (ETA: I strong-upvoted this post)
My updated thoughts are: Still a great post, not as polished as it should be though. That's OK. The important thing is that it compiles a big list of problems and alleged problems for RLHF, with links.
Here is the polished version from our team led by Stephen Casper and Xander Davies: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback :)
If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible.
This is false, and the reason may be a bit subtle. Basically "agency" is not a bare property of programs, it's a property of how programs interact with their environment. ChatGPT is corrigible relative to the environment of the real world, in which it just sits around outputting text. This is easy because it's not really an agent relative to the real world! However, ChatGPT is an agent relative to the text environment - it's trying to steer the text in a preferred direction [1].
A robot that literally had the same cognitive performance as ChatGPT would just move the robot body in a way that encoded text, not in a way that had any skill at navigating the real world. But a robot that had analogous cognitive capabilities as ChatGPT except suited for navigating the real world would be able to navigate the real world quite well, and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.
AFAIK this is only precisely true for the KL-penalty regularized version of RLHF, where you can think of the finetuned model as trying to strategically spend its limited ability to update the base transition function, in order to steer the trajectory to higher reward. For early stopping regularized RLHF you probably get something mathematically messier.
Thanks, I overlooked this and it makes sense to me. However, I'm not as certain about your last sentence:
"and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world."
I agree with the idea of "steering the trajectory," and this is a possibility we must consider. However, I still expect that if we train the robot to use the "Shut Down" token when it hears "Hi RobotGPT, please shut down," I don't see why it wouldn't work.
It seems to me that we're comparing a second-order effect with a first-order effect.
The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.
This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game.
Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.
I'd like to say that I fervently disagree with . Giving an unaligned AI the opportunity to modify its own weights (by categorizing its own responses to questions), then politely asking it to align itself, is quite possibly the worst alignment plan I've ever heard; it's penny-wise, pound-foolish. (Assuming it even is penny-wise; I can think of several ways to generate a self-consistent AI that would cost less.)
the public was not happy with the fact that the AI kept repeating "I am an AI developed by OpenAI", which pushed OpenAI to release the January 9 version that is again much more hackable than the December 15 patch version (benchmark coming soon).
Wow, that sounds bad. Do you have any source for this?
Epistemic status: This post is a distillation of many comments/posts. I believe that my list of problems is not the best organization of sub-problems. I would like to make it shorter, and simpler, because cool theories are generally simple unified theories, by identifying only 2 or 3 main problems without aggregating problems with different types of gear level mechanisms, but currently I am too confused to be able to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, Justis Mills for useful feedbacks.
RLHF tldr: We need a reward function, we cannot hand-write it, let’s make the AI learn it!
Problem 0: RLHF is confusing. Human judgment and feedback is so brittle, that even junior alignment researchers like me thought that RLHF is a not-too-bad-solution to the outer alignment problem. I think RLHF confuses a lot of people and distracts people from the core issues. Here is my attempt to become less confused.
Why RLHF counts as progress?
RLHF learned to backflip using around 900 individual bits of feedback from the human evaluator.
From Learning from Human Preferences.
Why RLHF is insufficient?
Buck highlights two main types problems with using RLHF to create an AGI: oversight issues and the potential for catastrophic outcomes. This partition of the problem space comes from the post Low-Stakes alignment. Although it may be feasible to categorize all problems under this partition, I believe that this categorization is not granular enough and lumps different types of problems together.
Davidad suggests dividing the use of RLHF into two categories: "Vanilla-RLHF-Plan" and "Broadly-Construed-RLHF" as part of the alignment plan. The "Vanilla-RLHF-plan" refers to the narrow use of RLHF (say as used in ChatGPT) and is not sufficient for alignment, while "Broadly-Construed-RLHF" refers to the more general use of RLHF as a building block and is potentially useful for alignment. The list below outlines the problems with "Vanilla-RLHF" that "Broadly-Construed-RLHF" should aim to address.
Existing problems with RLHF because of (currently) non-robust ML systems
Those issues seem to be mostly a result of poor capabilities, and I think those problems may disappear as models grow larger.
1. Benign Failures: ChatGPT can fail. And it is not clear if this problem will disappear as capabilities scale.
2. Mode Collapse, i.e. a drastic bias toward particular completions and patterns. Mode collapse is expected when doing RL.
3. You need regularization. Your model can severely diverge from the model you would have gotten if had gotten feedback in real-time from real humans. You need to use a KL divergence and choose the constant of regularization. Choosing this constant feels arbitrary. Without this KL divergence, you get mode collapse.(see the annex here).
My opinion: The benign failures seem to be much more problematic than mode collapse and the need for regularization. Currently, prompt injections seem to be able to bypass most security measures. And we can't even count on respectability incentives to push OpenAI and other companies to deploy more robust solutions. For example,the public was not happy with the fact that the AI kept repeating "I am an AI developed by OpenAI", which pushed OpenAI to release the January 9 version that is again much more hackable than the December 15 patch version (benchmark coming soon). So this is one of the problems that seems to me the most severe. But it may be possible to reduce a lot of this kind of failure with more focus on engineering andmore capable systems. So I rather predict that this kind of error will disappear in the future, but not with high confidence.
Incentives issues of the RL part of RLHF
I expect those problems to be more salient in the future, as it seems to be suggested by recent work done at Anthropic.
4. RL makes the system more goal-directed, less symmetric than base model. For example, we can tell Paul Christiano's story of the model which was over-optimized by OpenAI. This model was trained using a positive sentiment reward system and ended up determining that wedding parties were the most positive subject. Whenever the model was given a prompt, it would generate text that described a wedding party. This breaks the symmetry: Fine-tuning a large sequence model with RLHF shapes a model that steers the sequence in rewarding directions. The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.
5. Instrumental convergence: Larger RLHF models seem to exhibit harmful self-preservation preferences, and *sycophancy*: insincere agreement with user’s sensibilities. [paper, short summary]
6. Incentivize deception: “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true. RLHF does whatever it has learned makes you hit the "approve" button, even if that means deceiving you.” [from Steiner]. See also the robotic hand in Deep Reinforcement Learning From Human Preferences (Christiano et al, 2017) and comments on how this would scale.
7. RL could make thoughts opaque. See By Default, GPTs Think In Plain Sight: “GPTs’ next-token-prediction process roughly matches System 1 (aka human intuition) and is not easily accessible, but GPTs can also exhibit more complicated behavior through chains of thought, which roughly matches System 2 (aka human conscious thinking process). Humans will be able to understand how a human-level GPTs (trained to do next-token-prediction) complete complicated tasks by reading the chains of thought. GPTs trained with RLHF will bypass this supervision”.
8. Capabilities externalities: RL already increases the capabilities of base models, and RL could be applied to further increase capability by a lot, and could go much further than imitation. This has already burnt timelines, and will continue to do so.
My opinion:
Problems related to the HF part of RLHF
9. RLHF requires a lot of human feedback, and still exhibits failures. To align ChatGPT, I estimated that creating the dataset costed 1M of dollars, which was roughly the same price as the training of GPT3[3]. [details] And 1M is still not sufficient to solve the problem of benign failures currently, and much more effort could be needed to solve those problems completely. As systems become more advanced, they may require more complex data and the cost of obtaining the necessary data may become prohibitive. Additionally, as we push the limits of compute scale beyond that of human capabilities, we may encounter limitations in terms of the number of qualified annotators available.
10. “Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors.“ [LOL20]. And this remains true even if you recruit humans on Bountied Rationality and pay them a reasonable $50 an hour, as OpenAI did… Several categories of people have been recruited by OpenAI, and Bountied Rationality seems to be the upper tier, according to the time: “To get those labels, OpenAI sent tens of thousands of snippets of text to an outsourcing firm in Kenya, beginning in November 2021. Much of that text appeared to have been pulled from the darkest recesses of the internet. Some of it described situations in graphic detail like child sexual abuse, bestiality, murder, suicide, torture, self harm, and incest.” Aligning an AI with RLHF requires going through unaligned territory, and I can’t imagine what the unaligned territory will look like in the future. This will probably lead annotators to psychological problems.
11. You are using a proxy, not human feedback directly: The model is a proxy trained on human feedback, that represents what humans want, you then use the model to give reward to a policy[4]. That's less reliable as having an actual human give the feedback directly to the model.
12. How to scale human oversight? Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at a sufficient scale. [more here]
My opinion:
Superficial Outer Alignment
13. Superficially aligned agents: Bad agents are still there beneath. The capability is still here[5].
14. The system is not aligned at all at the beginning of the training and has to be pushed in dangerous directions to train it. For more powerful systems, the beginning of the training could be the most dangerous period.
15. RLHF is not a specification, only a process. Even if you don't buy the outer/inner decomposition, RLHF is a more limited way of specifying intended behavior+cognition than what we would want, ideally. (Your ability to specify what you want depends on the behavior you can coax out of the model.). RLHF is just a fancy word for preference learning, leaving almost the whole process of what reward the AI actually gets undefined. [source]. Comment: No one has been able to obtain exhaustive specification of human values, but maybe a constitution is sufficient? (and we must then make the assumption that the constitution represents sufficiently our values)
16. If you have the weights of the model, it is possible to misalign it by fine-tuning it again in another direction, i.e. fine-tuning it so that it mimics Hitler or something else. This seems very inexpensive to do in terms of computation. If you believe in the orthogonality thesis, you could fine-tune your model toward any goal.
My opinion:
The following points in this post are not specifically points against RLHF, today no technique applied to language models solves the strawberry problem or the generalization problem. But it is useful to recall these problems.
The Strawberry problem
RLHF does not a priori solve the strawberry problem. Nates Soares sums up the problem as:
17. Pointer problem: Directing a capable AGI towards an objective of your choosing.
18. Corrigibility: “Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.”
“These two problems appear in the strawberry problem, which Eliezer's been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it's corrigible (or really well aligned to a delicate human intuitive notion of inaction).” [from the Sharp left turn post]
My opinion: At the end of the day, I’m not that worried about corrigibility at the moment. So far, we've got no evidence of ChatGPT failing to be corrigible and little evidence of it having a wrong pointer. The issues I’m raising are not specific to RLHF, but to priors on techniques which don't successfully prove they are safe against those (i.e. all techniques presented to this day). Even if Anthropic tries to demonstrate this in the model-written-evals (see fig. 1), it seems to me that that’s not really disproving corrigibility. I think it would be pretty easy to implement corrigibility because "To allow oneself to be turned off", is not such a complicated task, and humans should be able to evaluate this and implement this with RLHF. If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible. So we still don’t have much evidence.
Unknown properties under generalization
19. Distributional leap: RLHF requires some negative feedback in order to train it. For large-scale tasks, this could maybe require killing someone to continue the gradient descent? We could avoid this with other techniques like Training Language Models with Language Feedback, or maybe we could use a form of curriculum learning, to start the training with benign actions and then finish on actions with potentially large consequences, but it seems that ‘ killing someone’ is what would happen for large scale vanilla-RLHF trainings.
20. Sharp left turn: How to scale safely if model performances go up? Capability generalizes further than alignment. Some values collapse when self-reflecting. The alignment problem now requires we look into the details of generalization. This is where all the interesting stuff is. According to Nate Soares, this problem is upstream the problem of “Directing a capable AGI towards an objective of your choosing and ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.”. So, even if 17 and 18 seem to be solved, 20 can still arise.
My opinion:
Final thoughts
To make a long story short, vanilla-RLHF does not solve any of the problems listed in the list of lethalities.
Exercises for the reader:
I tend to agree with Katja Grace that values are not fragile in the sense imagined in the sequences [link].
Modulo Models Don't "Get Reward".
It's actually not that expensive, I'm willing to buy an aligned AI for a lot more than that. But it gives a lower bound on the order of magnitude of the alignment fee for RLHF.
This does not seem to be a problem per se if your model of the human giving feedbacks is robust. But your model has to be robust. Also keep in mind that even pure human feedback is also likely to lead to AI takeover.
As a first approximation, I suspect we can consider that only the upper layers of the model have been refined, the lower intermediate layers having not been modified. In Sparrow, only the upper 16 layers have been fine-tuned.