"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
The source of spookiness
Consider two opposite extremes:
- A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
- A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.
Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.
If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?
If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.
It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.
Is RL therefore spooky?
RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.
But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.
In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.
Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?
A note on offline versus online RL
The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.
Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.
Usage in practice
The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.
RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.
For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.
Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.
I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]
But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]
KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).