"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
The source of spookiness
Consider two opposite extremes:
- A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
- A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.
Clearly, the second is going to be easier to train, but it also constrains the solution space for the policy.
If the first somehow trained successfully, what's the probability that the solution it found would look like the second's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?
If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.
It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.
Is RL therefore spooky?
RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.
But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.[1]
In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.[2]
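For a rough sense of scale, consider a toy "combination lock" environment of my own invention: at each of `horizon` steps exactly one of `num_actions` actions keeps the episode on track, and reward arrives only if every step is right. Assuming a uniformly random starting policy and no shaping at all, the chance of ever seeing the reward collapses exponentially with the horizon.

```python
def success_probability(num_actions: int, horizon: int) -> float:
    """Chance that a uniformly random policy picks the single rewarded
    action at every one of `horizon` consecutive steps."""
    return (1.0 / num_actions) ** horizon

def expected_episodes_until_first_reward(num_actions: int, horizon: int) -> int:
    """Mean of the geometric distribution with success probability
    num_actions ** -horizon, kept as an exact integer."""
    return num_actions ** horizon

for horizon in (5, 20, 100):
    p = success_probability(4, horizon)
    n = expected_episodes_until_first_reward(4, horizon)
    print(f"horizon={horizon:>3}  p(reward)={p:.3e}  expected episodes={n:.3e}")
```

Real environments aren't this adversarial, which is the point: whatever makes them tractable is structure that also narrows what the optimizer can find.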
Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?
A note on offline versus online RL
The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.
Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.
Usage in practice
The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.
RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.
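A toy version of that equilibrium, as a sketch under heavy assumptions: the "model" is a softmax over three outputs, "pretraining" is maximum likelihood against a fixed distribution `q`, and the "RL" term is the exact gradient of expected reward under `r`. The names `q`, `r`, and `lam` are hypothetical, and `lam` is exactly the kind of ad hoc weighting I'm complaining about.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(q, r, lam, lr=0.2, steps=20000):
    """Gradient ascent on sum(q * log pi) + lam * sum(pi * r) over logits."""
    z = np.zeros_like(q)
    for _ in range(steps):
        pi = softmax(z)
        pretrain_grad = q - pi          # gradient of the log-likelihood term
        rl_grad = pi * (r - pi @ r)     # exact gradient of expected reward
        z += lr * (pretrain_grad + lam * rl_grad)
    return softmax(z)

q = np.array([0.7, 0.2, 0.1])   # what the pretraining data looks like
r = np.array([0.0, 0.0, 1.0])   # reward only for the rare third output

pretrain_only = train(q, r, lam=0.0)
mixed = train(q, r, lam=1.0)
print(pretrain_only.round(3), mixed.round(3))
```

With `lam=0` the policy just recovers `q`; with `lam=1` it settles somewhere between `q` and always emitting the rewarded output, and where exactly depends on a number nobody chose for a principled reason.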
For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.[3]
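A minimal sketch of the data side of that idea, assuming episodes are already tokenized and scored. The bucket thresholds and token strings are hypothetical, and this is a simplification: decision transformers condition on a return-to-go value at every timestep rather than on a single prefix token.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries and token names, chosen for illustration.
RETURN_THRESHOLDS = [0.0, 10.0]
CONDITION_TOKENS = ["<|return:low|>", "<|return:mid|>", "<|return:high|>"]

def condition_token(episode_return: float) -> str:
    """Map a scalar return onto one of the discrete condition tokens."""
    return CONDITION_TOKENS[bisect_right(RETURN_THRESHOLDS, episode_return)]

def build_training_example(tokens, episode_return):
    """Prefix the observed sequence with the token describing how it scored.
    Training is then ordinary next-token prediction on the result."""
    return [condition_token(episode_return)] + list(tokens)

def build_prompt(tokens, desired_token="<|return:high|>"):
    """At inference time, assert the condition you want instead of observing it."""
    return [desired_token] + list(tokens)

example = build_training_example(["step", "step", "fall"], episode_return=-3.0)
prompt = build_prompt(["step"])
print(example, prompt)
```

The condition is now an explicit input you can vary, ablate, or trace through the network, rather than a bias smeared across the weights.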
"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
The source of spookiness
Consider two opposite extremes:
Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.
If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?
If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.
It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.
Is RL therefore spooky?
RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.
But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.[1]
In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.[2]
Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?
A note on offline versus online RL
The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.
Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.
Usage in practice
The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.
RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.
For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.[3]
[1] If you could enumerate all possible policies with a hypercomputer and choose the one that performs the best on the specified reward function, that would train, and it would also cause infinite cosmic horror. If you have a hypercomputer, don't do that.
[2] Or in the case of RLHF on LLMs, the fine-tuning process is effectively just etching a precondition into the predictor, not building complex new functions. Current LLMs, being approximators of probabilistic inference to start with, have lots of very accessible machinery for this kind of conditioning process.
[3] There are other options here, but I find this implementation intuitive.