RLHF is another method that improves with more compute. When a policy is optimized against a reward model, the proxy reward (as scored by the reward model) continues to increase, but the true reward eventually decreases due to reward hacking. This manifests as a divergence between the proxy and the true reward.
The Scaling Laws for Reward Model Overoptimization paper shows that, holding the policy size constant, increasing the reward model size makes the gold and proxy rewards diverge less as the policy is optimized. This suggests that larger reward models trained on more data are more robust and less prone to reward hacking.
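To make the divergence concrete, here is a minimal numerical sketch. The functional form for the gold reward under RL, R(d) = d(α − β log d) with d = sqrt(KL(π, π_init)), is the one reported in that paper; the coefficient values and the linear proxy curve below are illustrative assumptions, not fitted values.

```python
# Toy illustration (not the paper's code) of proxy vs. gold reward divergence.
# Gold reward follows R(d) = d * (alpha - beta * log d), d = sqrt(KL(pi || pi_init)),
# per Gao et al. (2022); coefficients and the linear proxy curve are made up.
import numpy as np

alpha, beta = 1.0, 0.35                    # hypothetical fit coefficients
d = np.sqrt(np.linspace(0.1, 200.0, 500))  # d = sqrt(KL from the initial policy)

gold = d * (alpha - beta * np.log(d))      # true reward: peaks, then declines
proxy = d * alpha                          # proxy reward: keeps increasing

print(f"gold reward peaks at d ~ {d[np.argmax(gold)]:.2f} and then falls, "
      f"while the proxy reward rises monotonically")
```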
That's why it's called... *scalable* alignment? :D
Somewhat tongue-in-cheek, but I think I am sort of confused about what the core news is here.
"Scalable" in the sense of "scalable oversight" refers to designing methods for scaling human-level oversight to systems beyond human-level capabilities. This doesn't necessarily mean the same thing as designing methods for reliably converting compute into more alignment. In practice, techniques for scalable oversight like debate, IDA, etc. often have the property that they scale with compute, but this is left as an implicit desideratum rather than an explicit constraint.
A corollary of Sutton's Bitter Lesson is that solutions to AI safety should scale with compute.[1]
Let's consider a few examples of research directions that aim at this property:
For these procedures to work, we want their ideal limits to be safe. In a recent talk, Irving suggests we should understand the role of theory as analyzing these limiting properties (assuming sufficient resources like compute and data, and potentially simplifying assumptions), and the role of empirics as checking whether the conditions required by the theory hold in practice (e.g., has learning converged? Are the assumptions reasonable?).[3]
From this perspective, the role of theory is to answer questions like:
Taking a step back, obtaining understanding and guarantees about idealized limits has always been the aim of theoretical approaches to AI safety:
The main update is that we now roughly know what kinds of systems AGI might consist of, and this enables a different kind of theory strategy: instead of studying the ideals that emerge from a list of desiderata, you can study the limiting/convergence properties of the protocols we are currently using to develop and align AGI.
In practice, we probably need both approaches. As we continue to scale up compute, data, model size, and the RL feedback loop, we might find ourselves suddenly very close to the limits, wishing we had known better what lay in store.
Ren et al. (2024) make a related point: AI safety research should aim at problems that aren't already solved by scaling. My point is about what solutions to those problems should look like.
Bengio's proposal involves a very particular technical vision for using G-Flow Nets to implement an (approximate, amortized) Bayesian Oracle that he believes will scale competitively (compared to the inference costs of the agent being monitored).
Irving's talk is about scalable oversight, but the general framing applies much more generally.