RLHF is another method that improves with more compute. When a policy is optimized against a reward model, the proxy reward (as scored by the reward model) continues to increase, but the true reward eventually decreases due to reward hacking. This manifests as a divergence between the proxy and the true reward.
The Scaling Laws for Reward Model Overoptimization paper shows that, holding the policy size constant, increasing the reward model size makes the gold and proxy rewards diverge less as the policy is optimized. This suggests that larger reward models trained on more data are more robust and less prone to reward hacking.
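To make the divergence concrete, here is a minimal numerical sketch. The functional form for the gold reward under RL, R(d) = d(α − β log d) with d = sqrt(KL(π, π_init)), is the one reported in that paper; the coefficient values and the linear proxy curve below are illustrative assumptions, not fitted values.

```python
# Toy illustration (not the paper's code) of proxy vs. gold reward divergence.
# Gold reward follows R(d) = d * (alpha - beta * log d), d = sqrt(KL(pi || pi_init)),
# per Gao et al. (2022); coefficients and the linear proxy curve are made up.
import numpy as np

alpha, beta = 1.0, 0.35                    # hypothetical fit coefficients
d = np.sqrt(np.linspace(0.1, 200.0, 500))  # d = sqrt(KL from the initial policy)

gold = d * (alpha - beta * np.log(d))      # true reward: peaks, then declines
proxy = d * alpha                          # proxy reward: keeps increasing

print(f"gold reward peaks at d ~ {d[np.argmax(gold)]:.2f} and then falls, "
      f"while the proxy reward rises monotonically")
```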
That's why it's called... *scalable* alignment? :D
Somewhat tongue-in-cheek, but I think I am sort of confused about what the core news is here.
"Scalable" in the sense of "scalable oversight" refers to designing methods for scaling human-level oversight to systems beyond human-level capabilities. This doesn't necessarily mean the same thing as designing methods for reliably converting compute into more alignment. In practice, techniques for scalable oversight like debate, IDA, etc. often have the property that they scale with compute, but this is left as an implicit desideratum rather than an explicit constraint.
A corollary of Sutton's Bitter Lesson is that solutions to AI safety should scale with compute.[1]
Let's consider a few examples of research directions that aim at this property:
For these procedures to work, we want their ideal limits to be safe. In a recent talk, Irving suggests we should understand the role of theory as analyzing these limiting properties (assuming sufficient resources like compute and data, and potentially simplifying assumptions), and the role of empirics as checking whether the conditions required by the theory hold in practice (e.g., has learning converged? Are the assumptions reasonable?).[3]
From this perspective, the role of theory is to answer questions like:
Taking a step back, obtaining understanding and guarantees about idealized limits has always been the aim of theoretical approaches to AI safety:
The main update is that we now roughly know what kinds of systems AGI might consist of, and this enables a different kind of theory strategy: instead of studying the ideals that emerge from a list of desiderata, you can study the limiting/convergence properties of the protocols we are currently using to develop and align AGI.
In practice, we probably need both approaches. As we continue to scale up compute, data, model size, and the RL feedback loop, we might find ourselves suddenly very close to the limits, wishing we had known better what lay in store.
Ren et al. (2024) make a related point: AI safety research should aim at problems that aren't already solved by scaling. My point is about what solutions to those problems should look like.
Bengio's proposal involves a very particular technical vision for using G-Flow Nets to implement an (approximate, amortized) Bayesian Oracle that he believes will scale competitively (compared to the inference costs of the agent being monitored).
Irving's talk is about scalable oversight, but the general framing applies much more generally.