RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined, since 1.5 years passed between the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don't think makes...
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.
...My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs
I don't currently think this is the case, and it seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chatbot attempts.
For ChatGPT in particular, I think it was built by John Schulman's team
I find a comparison with John Schulman here unimpressive if you want to argue that progress on this was overdetermined, given John's safety motivation, and my best guess is that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, John would indeed not have worked on it.
Seeing RLHF teams in other organizations not directly downstream of your organizational involvement,...
I don't currently think this is the case, and it seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chatbot attempts.
I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.
I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.
...I don't think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn't that crazy. The prod...
How much total investment do you think there is in AI in 2023?
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)
Variance between different years, depending on market conditions and how much products take off, seems on the order of 50% to me. Like, different years have pretty hugely differing levels of investment.
My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing.
So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT.
...How much influence are you giving to GPT-3, GPT-3.5, GP...
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.
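One way to sanity-check the "2-5% → 1-2 weeks" translation (my own back-of-envelope sketch, not from the thread, assuming capability tracks cumulative investment and investment grows roughly exponentially):

```python
import math

def speedup_weeks(boost, annual_growth):
    """Weeks by which a one-time fractional boost to cumulative
    investment pulls forward a fixed cumulative-investment milestone,
    if investment grows exponentially at rate annual_growth per year."""
    years_early = math.log(1.0 + boost) / math.log(1.0 + annual_growth)
    return 52.0 * years_early

# Hypothetical numbers: if total AI investment roughly doubles each
# year (annual_growth = 1.0), a one-time +2% to +5% boost buys:
low = speedup_weeks(0.02, 1.0)   # roughly 1.5 weeks
high = speedup_weeks(0.05, 1.0)  # roughly 3.7 weeks
```

With slower investment growth the same boost buys proportionally more calendar time, which is presumably part of what "depending on details of the dynamics" is doing in the estimate above.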
I haven't thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn't done the ChatGPT release is probably between 1-4 weeks.
Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?
One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact).
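The variance point can be made concrete with made-up numbers: for independent effects, variances add (standard deviations add in quadrature), so a $100B standard deviation is compatible with 100 separate causes each carrying a $10B effect size:

```python
import math

# Made-up illustration: 100 independent causes, each with a $10B
# standard-deviation effect on annual AI investment.
n = 100
per_cause_sd = 10.0  # $B

total_variance = n * per_cause_sd ** 2  # variances add: 100 * (10)^2
total_sd = math.sqrt(total_variance)    # standard deviations add in quadrature

# Each cause explains only (10/100)^2 = 1% of the variance, yet its
# after-the-fact effect size is a full +/- $10B.
share_of_variance = per_cause_sd ** 2 / total_variance
```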
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.
I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don't know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I've historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number).
That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investmen...
Relevant piece of data: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.
The report, citing data from analytics firm Similarweb, said an average of about 13 million unique visitors had used ChatGPT per day in January, more than double the levels of December.
"In 20 years following the internet space, we cannot recall a faster ramp in a consumer internet app," UBS analysts wrote in the note.
I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn't expect something this radical ("the single fastest growing consumer product in history").
my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3
I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.
I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free.
I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I'm quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF's capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
This page also says that:
Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.
So I guess 003 could also have different base pretraining data?
[edit: this says the same thing as Quintin's sibling comment]
Important context for those who don't know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology "RLHF" is starting to become confusing, as some people use it narrowly to mean "PPO against a reward model" and others use it more broadly to mean "using any RL technique with a reward signal given by human reviewers," which would include FeedME.)
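For readers new to the distinction, here is a deliberately toy sketch (hypothetical data; both pipelines drastically simplified relative to the real systems) of how the rating signal is used differently in the two cases: FeedME-style training only ever imitates rating-filtered outputs, while PPO-style RLHF converts the rating into a scalar reward that reweights every update.

```python
# Toy contrast (hypothetical data; both pipelines drastically simplified).
# Each tuple is (model output, human rating in [0, 1]).
samples = [("resp_a", 0.9), ("resp_b", 0.2), ("resp_c", 0.7)]

# FeedME-style: ratings only *filter* the data; training is then plain
# supervised fine-tuning on the surviving outputs.
feedme_dataset = [out for out, rating in samples if rating >= 0.7]

# PPO-style RLHF: every sample gets a scalar advantage (reward minus a
# baseline) that scales its gradient, pushing probability toward
# high-reward outputs and actively away from low-reward ones.
baseline = sum(rating for _, rating in samples) / len(samples)
advantages = {out: rating - baseline for out, rating in samples}
```

On the broad usage of "RLHF" mentioned in the aside, both pipelines consume human feedback; the narrow usage reserves the term for the second, where the reward signal directly drives the optimization.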
People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it's also possible that CharacterAI's base models are RLHF'd to be consistent roleplayers.
Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.
I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.
RLHF is less safe than imitation or conditioning generative models.
My picture of why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux: the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) that move far away from the default world state.
So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models tha...
I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.
So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures implied by being weighted by a true approximation of our world.
If predicting webtext is a good way to get things done, people can do that. But probably it isn't, and so people probably won't do that unless you give them a good reason.
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don't know in what sense "predict human demonstrators" is missing an important safety property from "predict internet text," and right now it feels to me like kind of magical thinking.
The main way I can see it going is that you can condition the webtext model on other things like "there is a fu...
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.
I'll start with a pretty uncontroversial example that's neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine-tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.
Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.
With the way I was using the word "fighting", I would say that the first model is fighting you (a little bit), and the second one isn't. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.
Next, an impractical and extreme example, again with two models:
1. LLM fine-tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. "LLM" operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first ...
RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.
I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment. If false, then RLHF was net-negative because of its capabilities externalities. I also don't buy your argument about relative numbers of people working on capabilitie...
I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment.
Doing things sooner and in a different way matters.
This argument is like saying that scaling up language models is net-neutral for AGI, because people would have done it anyway for non-AGI purposes. Doing things sooner matters a lot. I think in most of science and engineering that's the main kind of effect that anything has.
If false, then RLHF was net-negative because of its capabilities externalities.
No, if false then it has a negative effect which must be quantitatively compared against positive effects.
Most things have some negative effects (e.g. LW itself).
It is also far easier to make progress on capabilities than alignment
This doesn't seem relevant---we were asking how large an accelerating effect alignment researchers have relative to capabilities researchers (since that determines how many days of speed-up they cause), so if capabilities progress is easier then that seems to increase both numerator and denominator.
...especially when you're not trying to make progress on alignment's core problems,
I don't think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f...
So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A's reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.
Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doesn't account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a 'capabilities overhang' relative to the overseer: there's no behavior that's clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. ...
Avoiding RLHF at best introduces an important overhang
But it would be better if we collectively then decided not to rush forward anyway, right?
And I still don't get why you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, an automatic idea generator and verifier may be useful, but it's also useful for reckless people. And, intuitively, the more advanced an AI is, the less I would trust it. So "let's try as hard as we can to advance civilization, because a more advanced civilization will be better at alignment" seems like a very risky plan.
A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.
Do you have examples of such historical work that you're happy to name? I'm really unsure what you're referring to (probably just because I haven't been involved in alignment for long enough).
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.
EY's counter to this?
I have read through most of this post and some of the related discussion today. I just wanted to write that it was really interesting, and as far as I can tell, useful, to think through Paul's reasoning and forecasts about strategy-related questions.
In case he believes this is a good idea, I would be very glad to read through a longer, more comprehensive document describing his views on strategic considerations.
It seems like most/all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuvre/creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.
The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via...
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
Has ARC got a written policy for if/when similar experiments generate inconclusive but possible evidence of dangerous behaviour?
If so, would you consider sharing it (or a non-confidential version) for other organisations to use?
(While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.
Do you mean they don't tell us what's up with the difference in risks of the measured techniques, or that they don't tell us much about AI risk in general? (I'd at least benefit from learning more about your views here)
In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.
I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."
Background on my involvement in RLHF work
Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:
In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren’t able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes.
The history of my involvement:
The case for a positive impact
Overall, I think that early work on RLHF had significant value:
The case for a negative impact
People in the safety community make some arguments that research on RLHF has costs larger than these benefits. I don’t currently find these arguments persuasive:
Overall, I think it was valuable to use RLHF to fix the kind of basic alignment problems that are ubiquitous with pre-trained models. I think it has had a real impact facilitating work on more fundamental challenges, and helped move the community one step closer towards the kind of alignment solutions I expect to ultimately be successful.
Future work
I remain excited about "straightforward" approaches to improving RLHF, like devising better feedback (using combinations of human and AI work) and improving robustness by adversarial training. I think this work will continue to make ML systems more useful in practice, and so will be subject to the same kinds of objections as above. I still tentatively think this work is net positive and don't find arguments against persuasive.
I think this follow-up research will also not need to solve the “fundamentally confusing” problems for a long time, but that solving tractable problems gives you a good chance of aligning modestly superhuman AI and facilitates future work on the remaining more challenging problems.
That said, I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive. Research should be justified by an argument that it actually helps address important failures. Here are some types of work in this space that I’m particularly excited about:
I would wildly guess that my involvement in RLHF and early language model training at OpenAI from 2017-2020 put me in the top 100 people accelerating AI progress but not in the top 10; I'd wildly guess that I accelerated progress by a few tenths of a percent during this period, and perhaps cut down timelines to powerful AI by a few days. I think there's room for debate one way or the other on that.
In some sense this is a big acceleration and it's wrong to write it off as "not that important." But I think accelerating a ChatGPT-style wakeup by a week is not a major cost (in addition to being plausibly positive, there just wasn't that much AI-risk-reducing activity happening per week in the world of 2018).
I also continue to think that RLHF is great, but that people overestimate (and misunderstand in all kinds of wild directions) the practical impact that it actually has on system behavior relative to the counterfactual training techniques.
(I added this footnote long after the post was written, reacting to different people interpreting the post in very different ways, e.g. Oliver's comments below and Michael Nielsen's here.)