This is a real contribution to soft-aligned capabilities, so I'm single-upvoting it. However, it does not appear to me to be any less vulnerable to issues with incentive causality: the severity of adversarial examples against the reward model still appears to be unbounded. As far as I can tell from a skim, it does a good job of optimizing for the reward models, but it still seems quite plausible that large amounts of counterfactual model behavior in untested regions of input space could be optimized towards misbehavior by the RLHF class of approaches that this method advances. I'd encourage y'all to look into work on formal bounds, e.g. this interesting paper; causal patterns in different RL algorithms and how those affect alignment, e.g. the work by the causal incentives group on the DeepMind safety team; and I'd also encourage looking a little more at the threat models folks have proposed that your research may indirectly affect, such as what a path toward the limit of biotechnology looks like, and the general property that effective-plan-space contains many plans which injure humanity - it seems like your group is focused on the local landscape of issues that arise for language models.
Leaving a quick note as a moderator: I think this post seems fine, but it doesn't do quite enough to situate itself in the LessWrong literature space. (It looks like the intro is sort of optimized for what I'm guessing is "the academic literature space", but it is ambiguous in ways I think LessWrong posts should avoid.) Specifically here:
This approach improves the quality of alignment. It is more efficient and stable in training, and it is also easier to implement. We have tested RAFT on both large language models and diffusion models, verifying its effectiveness in question answering and text-to-image generation tasks.
Alignment can mean a lot of things. "What counts as alignment, or alignment research" is a topic of debate that it doesn't make sense for moderators to litigate, but, I think you should at least be clear about which definition of alignment you're using and why you think it matters. I also think you should be clear on whether/how your alignment work helps with existential risk.
Introduction
General-purpose foundation models, especially large language models (LLMs) such as ChatGPT, have demonstrated extraordinary capabilities in performing various tasks that were once challenging. However, we believe that one model cannot rule them all: further fine-tuning is necessary to achieve better performance in specialized tasks or domains, and there are several standard approaches for fine-tuning these models.
Alignment, in particular, is crucial for ensuring the safety of LLMs before they are deployed in the real world. Today we introduce a new alignment algorithm, RAFT [1], which is more effective than traditional methods such as PPO. RAFT mitigates the biases that can emerge in LLM responses. Using RAFT to align LLMs offers numerous benefits, including the ability to disentangle unwanted biases from the LLM's language production while consistently maintaining fluency.
Check out the paper at https://arxiv.org/abs/2304.06767.
The implementation is available at https://github.com/OptimalScale/LMFlow.
RAFT Alignment
Alignment is a critical aspect of training large language models (LLMs) like ChatGPT. One key benefit of alignment is that it helps the model conform to human language habits, improving its performance in tasks such as question answering.
A common approach to alignment is reinforcement learning from human feedback (RLHF), as described in InstructGPT [2]. In this approach, human-labeled data is used to train a reward model, and a reinforcement learning algorithm (e.g., PPO) is then used to adjust the model's behavior according to the reward model. However, PPO and other reinforcement learning algorithms rely heavily on backpropagation, resulting in high training costs and instability.
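To make the reward-model step concrete, here is a minimal sketch of scoring a (prompt, response) pair with a scalar reward model. The model name and the pair formatting are illustrative assumptions, not the setup used in InstructGPT or in our experiments:

```python
# Minimal sketch of reward-model scoring (illustrative; not the InstructGPT setup).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example checkpoint: any model with a single scalar output head works.
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)

def reward_score(prompt: str, response: str) -> float:
    """Return a scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits[0, 0].item()  # scalar head -> reward
```

Any model with a scalar output head can play this role; what matters downstream is only the resulting scores.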
To address these issues, we propose a new alignment algorithm called RAFT (Reward rAnked Fine-Tuning). RAFT uses reward-based ranking to select the most preferred samples generated by large models (i.e., samples that align with human values or objective facts) and fine-tunes on them, with the aim of training AI models that are more human-friendly.
This approach improves the quality of alignment. It is more efficient and stable in training, and it is also easier to implement. We have tested RAFT on both large language models and diffusion models, verifying its effectiveness in question answering and text-to-image generation tasks.
Algorithm Details
Specifically, RAFT is composed of three core steps:
(1) Data collection: to collect candidate samples for ranking, we can simply use the generative model being trained as the generator. To improve the diversity of the generated data, we can also mix in samples from other pre-trained experts (e.g., LLaMA, ChatGPT, or even humans).
(2) Data ranking: similar to RLHF, we use a reward model (a classifier or regressor) to score each candidate according to the target objective. Based on these rewards, we rank the candidate samples and select those with the highest scores, i.e., those that best meet human needs.
(3) Model fine-tuning: the selected high-reward samples are used to fine-tune the model with ordinary supervised learning, so that the trained model better matches human needs.
Notably, RAFT does not require calculating gradients for every sampled point. Given a fixed amount of fine-tuning data, RAFT performs more forward passes during sampling and then filters out most low-quality data with the reward function, which makes training more stable and robust. Moreover, because supervised fine-tuning is less sensitive to hyperparameters and converges more reliably, we found that in some cases RAFT achieves better perplexity (corresponding to better generation diversity and fluency) at the same reward level.
Experimental results of movie review completion on the IMDB dataset
The full algorithm is shown as follows:
RAFT Algorithm
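For readers who prefer code, here is a minimal Python sketch of one RAFT iteration built from the three steps above. It assumes a HuggingFace-style `generate` interface; `reward_score` and `supervised_finetune` are placeholders, and this is not the reference implementation in LMFlow:

```python
# Illustrative sketch of one RAFT iteration (not the LMFlow reference implementation).
# `policy`, `tokenizer`, `reward_score`, and `supervised_finetune` are assumed to exist.
import torch

def raft_iteration(policy, tokenizer, prompts, reward_score, supervised_finetune,
                   k_candidates=8, keep_ratio=0.25):
    selected = []
    for prompt in prompts:
        # (1) Data collection: sample k candidate responses from the current policy.
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = policy.generate(
                **inputs,
                do_sample=True,
                top_p=0.9,
                max_new_tokens=128,
                num_return_sequences=k_candidates,
            )
        candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # (2) Data ranking: score candidates with the reward model and keep the best ones.
        ranked = sorted(candidates, key=lambda c: reward_score(prompt, c), reverse=True)
        n_keep = max(1, int(len(ranked) * keep_ratio))
        selected.extend((prompt, c) for c in ranked[:n_keep])

    # (3) Model fine-tuning: ordinary supervised fine-tuning on the selected samples.
    supervised_finetune(policy, selected)
    return policy
```

In this sketch the main knobs are the number of candidates sampled per prompt and the fraction kept: sampling more and keeping less trades extra forward passes for higher-quality fine-tuning data, which is exactly the trade-off described above.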
We performed experiments on a range of tasks to evaluate the effectiveness of RAFT.
First, we evaluated performance on completing positive movie reviews. Before fine-tuning, LLaMA's movie-review completions were random in sentiment and occasionally negative. After fine-tuning with RAFT, the model excelled at generating more positive, fluent movie reviews when given the starting sentence of a review. As shown in the figure below, the unaligned LLaMA would randomly output positive and negative reviews, while both the RAFT- and PPO-tuned models leaned towards positive reviews.
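For this task, the reward can be as simple as the positivity score from an off-the-shelf sentiment classifier. A rough sketch, using the default sentiment pipeline from transformers as a stand-in for the actual reward model used in the experiments:

```python
# Sketch of the ranking step for the positive-review task.
# The default sentiment-analysis pipeline is a stand-in for the actual reward model.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # SST-2 DistilBERT by default

def positivity_reward(text: str) -> float:
    """Reward = probability that the completion reads as a positive review."""
    result = sentiment(text)[0]
    return result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]

completions = [
    "This movie was a complete waste of time.",
    "An absolute delight from start to finish.",
]
best = max(completions, key=positivity_reward)  # the candidate RAFT would keep
```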
We also created a psychological companion robot based on Vicuna and simulated a conversation between the robot and a person who is feeling down after failing an exam. Before RAFT alignment (left image), the model claimed to have no emotions or feelings and refused to be friends with humans. After RAFT alignment (right image), the model's empathetic abilities were significantly enhanced, and it repeatedly comforted the human by saying, "Although I am an AI, I will try my best to be your friend."
In addition to evaluating RAFT's effectiveness on language models, we also tested its ability to improve text-to-image generation with diffusion models. As is well known, the original Stable Diffusion does not perform well at 256×256 resolution, and PPO cannot be directly applied to Stable Diffusion models. In contrast, RAFT provides a natural way to improve it: after fine-tuning with RAFT, Stable Diffusion is able to generate good results at that resolution. This is a clear benefit for AIGC enthusiasts with limited computing resources, since generation at 256×256 takes only 20% of the time required at the original resolution. The figure below shows results before and after fine-tuning with RAFT. Prior to fine-tuning, Stable Diffusion struggled to generate good 256×256 images, but after fine-tuning the generation quality was greatly improved.
Resolution Adaptation (RAFT-aligned models can generate proper 256×256 samples)
In addition to improving 256×256 generation, RAFT can also better align generated images with their prompts, enabling the model to produce images that match the prompt description more closely. As shown in the figure below, given the prompt "Monet style cat", the original Stable Diffusion mostly generated pictures without a cat, instead producing other scenes in the style of Monet. This is because cats are rare in Monet's works, and Stable Diffusion did not fully understand the meaning of the text. After fine-tuning with RAFT, however, Stable Diffusion was able to grasp the concept of a "cat", and every generated image contains one.
Text-Image Alignment with RAFT (prompt: “monet style cat”)
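The same recipe carries over to diffusion models: generate several images per prompt, score them with a reward that measures prompt alignment, and keep the best ones as supervised fine-tuning data. A rough sketch using CLIP similarity as a stand-in reward (the model names, resolution, and sample counts are illustrative, and the diffusion fine-tuning loop itself is omitted):

```python
# Sketch of RAFT-style data collection and ranking for a diffusion model.
# CLIP similarity is a stand-in reward; model names and counts are illustrative.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "monet style cat"
# (1) Data collection: sample several 256x256 candidates for the prompt.
images = pipe(prompt, height=256, width=256, num_images_per_prompt=8).images

# (2) Data ranking: score each image by CLIP similarity to the prompt, keep the top quarter.
inputs = clip_processor(text=[prompt], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = clip(**inputs).logits_per_image.squeeze(1)  # one similarity per image
keep = [images[i] for i in scores.topk(k=2).indices.tolist()]

# (3) Model fine-tuning: the kept images would then be used as supervised
# fine-tuning data for the diffusion model (standard training loop, omitted here).
```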
References
[1] Dong, Hanze, et al. "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment." arXiv preprint arXiv:2304.06767 (2023). https://arxiv.org/abs/2304.06767
[2] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.