"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
The source of spookiness
Consider two opposite extremes:
- A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
- A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.
Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.
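To make the two extremes concrete, here is a minimal sketch of what each reward function might look like. Everything in it is a hypothetical illustration: the names `sparse_reward` and `dense_reward`, the `toe_error` term, and the 0.1 penalty weight are all assumptions, not part of any real environment.

```python
GOAL_X = 15_000.0  # metres east; the "15 kilometers" from the example above


def sparse_reward(x: float) -> float:
    """Sparse, distant reward: nothing at all until the goal is reached."""
    return 1.0 if x >= GOAL_X else 0.0


def dense_reward(x_prev: float, x: float, toe_error: float) -> float:
    """Hypothetical shaped reward: eastward progress this step, minus a
    penalty for foot mispositioning (the 0.1 weight is an arbitrary choice)."""
    progress = x - x_prev
    return progress - 0.1 * toe_error
```

The sparse version says nothing about how the biped gets east, which is exactly why it leaves the solution space so open. The dense version encodes an opinion about every step.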
If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?
If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.
It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.
Is RL therefore spooky?
RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.
But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.
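A rough back-of-the-envelope sketch of why the reward is so rarely sampled: assume a toy setting (my assumption, not a claim about any real environment) in which reward only arrives if the policy happens to pick one specific action out of `num_actions` at each of `horizon` consecutive steps, and exploration starts out uniformly random.

```python
def hit_probability(num_actions: int, horizon: int) -> float:
    """Chance that one uniformly random episode produces the single
    rewarded action sequence in this toy setting."""
    return (1.0 / num_actions) ** horizon


def expected_episodes_until_first_reward(num_actions: int, horizon: int) -> int:
    """Mean number of random episodes before the first nonzero reward
    (the mean of a geometric distribution with the probability above)."""
    return num_actions ** horizon


if __name__ == "__main__":
    for horizon in (5, 10, 20, 40):
        episodes = expected_episodes_until_first_reward(4, horizon)
        print(f"horizon={horizon:>2}: ~{episodes:.3e} episodes per reward")
```

The count grows exponentially with the horizon, so without shaping or other structure the optimizer sees essentially no gradient signal at realistic distances.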
In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.
Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?
A note on offline versus online RL
The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.
Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.
Usage in practice
The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.
RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.
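As a toy illustration of that equilibrium, assume (purely for the sake of the sketch) a single scalar parameter $\theta$, a pretraining loss that pulls it toward $\theta_{\text{pre}}$, an RL loss that pulls it toward $\theta_{\text{RL}}$, and a mixing weight $\lambda$. None of these symbols come from a real training setup.

```latex
\begin{align*}
L(\theta) &= (\theta - \theta_{\text{pre}})^2 + \lambda\,(\theta - \theta_{\text{RL}})^2 \\
\frac{dL}{d\theta} &= 2(\theta - \theta_{\text{pre}}) + 2\lambda\,(\theta - \theta_{\text{RL}}) = 0 \\
\theta^{*} &= \frac{\theta_{\text{pre}} + \lambda\,\theta_{\text{RL}}}{1 + \lambda}
\end{align*}
```

In this toy case the result is just a weighted average of the two targets, and $\lambda$ alone decides where it lands. If $\lambda$ was picked ad hoc, so was the resulting behavior.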
For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.
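A minimal sketch of the decision-transformer-style conditioning mentioned above: each timestep is prefixed with its return-to-go (the sum of rewards from that step onward), so the condition is an explicit token the model is trained to be consistent with. The function names and the tuple-based "token" format here are hypothetical simplifications; a real implementation would embed these values rather than store them as tuples.

```python
from typing import Any, List, Tuple


def returns_to_go(rewards: List[float]) -> List[float]:
    """Sum of rewards from each timestep to the end of the trajectory."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def build_conditioned_sequence(
    rewards: List[float], states: List[Any], actions: List[Any]
) -> List[Tuple[str, Any]]:
    """Interleave (return_to_go, state, action) per timestep.
    The return-to-go entry plays the role of the condition token."""
    sequence: List[Tuple[str, Any]] = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        sequence.extend([("return_to_go", g), ("state", s), ("action", a)])
    return sequence
```

At inference time you would write the desired return into the first `return_to_go` slot yourself, which is what makes the condition explicit and inspectable.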
If you avoid using RL, then you might need a much "smarter" model for a given level of usefulness.
And even without RL, you need to get bits of selection from somewhere: to get useful behavior you have to, at the very least, specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgeable model). So some selection or steering is surely required, but you might hope that this selection/steering is safer for some reason, or perhaps more interpretable (as prompting can in principle be).
Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in "conditioning predictive models" seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)
Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.
OK, so cutting RL can come with costs. What about the benefits of cutting RL? I think the main concern with RL is that it either teaches the model things that we didn't actually need and which are dangerous, or that it gives the model dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would never have thought of and which humans don't understand at all. It's not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process, which results in a bunch of problems.
But there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don't think you should be thinking about cutting agentic capabilities, as those are very likely required.
My guess is that much more of the action is not in "how much RL", but is instead in "how much RL of the type that seems particularly dangerous and which didn't result in massive increases in usefulness". (Which mirrors porby's answer to some extent.)
In particular we'd like to avoid:
(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)
Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.
(Type (1) of RL from the above list is eventually required for AIs with general-purpose, qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding), but we can probably get almost everything we want done without such powerful models.)
My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).
(Note that my response ignores the cost of training "smarter" models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)
In brief: large amounts of high-quality process-based RL might result in AIs being more useful earlier (prior to them becoming much smarter). This might be expensive and annoying (e.g. it might require huge amounts of high-quality human labor), such that by default labs do less of this, relative to just scaling up models, than would be optimal from a safety perspective.