Appreciated this post.
Doesn't the recomputation approach require me to share my model weights with my adversary? Seems like a tough ask.
Reasoning: Merging independently trained models reduces loss because the loss landscape is concave; so averaging models achieves lower than the average of the models’ losses.
you probably meant "convex" here.
There are two other verification routines: looking over shoulders at internal documents, and banning new releases.
There is also just checking for the presence of the registered weights in the GPU's local RAM, which imposes a memory tax big enough that there is no space left to fit the intermediate values of the gradient. This requires that the monitors have code on the running machines, but periodic memory dumps can serve as a surveillance mechanism by verifying that all models on GPUs match a model known to be already trained, which stops new initializations and thus new runs.
Yeah both are valid measures.
I was focusing on verification mechanisms that are (1) hardware-enforced and (2) don't require trust in any pre-existing hardware - which I think is the appropriate regime for international agreements.
I originally wrote this as a private doc for people working in the field - it's not super polished, or optimized for a broad audience.
But I'm publishing anyway because inference-verification is a new and exciting area, and there are few bird's-eye-view explainers of what's going on and what the bottlenecks are.
Tl;dr: At least one of the following would need to be implemented for me to be confident that inference verification would substantially slow training given today's algorithms:
To my knowledge, no one has prototyped verification demos that reach these thresholds; so whether rapidly-implementable inference verification is feasible is still an open question.
1. Summary
I think powerful AI will be obviously scary at some point, and companies or governments might want to slow it down to buy time for additional safety or oversight. Maybe this could be done quickly, e.g. by:
(Section 2)
Would these methods actually work? Or more specifically, if these methods were implemented quickly and correctly, would they substantially slow AI development?
I looked into this question for around a week, and here are my current views:
Current prototypes of inference-verification would probably be ineffective. Standard inference-verification measures slow training by restricting communication between servers (see Section 2), since training involves chucking big gradients around in a hivemind, and inference just involves exchanging text. However, communication restrictions might not actually slow AI training much, especially if that training is reinforcement learning (RL).
RL doesn’t require much communication. A large fraction of RL is inference - for example, generating rollouts of agent actions and scoring them - and inference can still be performed under communication constraints. Developers could use 95% of their compute (the compute that’s under verification) to generate RL rollouts, and 5% of compute in covert data centers to calculate training updates. This might allow training to continue at a similar efficiency as before (Section 3).
But existing verification proposals would create some hurdles that would make this strategy more difficult. For example, developers would need to frequently upload new checkpoints to inference servers, which could be slow given bandwidth limits. And developers might still need to send some fraction of rollouts through verifier network taps - even though these rollouts don’t actually match expected model outputs. But as I discuss in Section 3.4, these constraints are probably feasible to work around given the slack in existing inference verification prototypes.
More aggressive measures could probably buy at least 1 year if implemented sufficiently early. These more aggressive measures include:
Any one of the above would probably make training given current algorithms infeasible. My guess is that developing algorithms to work around these constraints would take at least a year absent rapid AI-driven R&D (if working around them is possible at all).
A year could be enough. One year into an AI agreement, governments might have had two years to develop verification in total, since the measures above might require around a year to implement. And after two years of working on verification mechanisms, governments might be able to implement general purpose on-chip security features that are more robust.
But if an agreement starts during an intelligence explosion, all bets are off. If developers have powerful AI that can create new algorithms quickly, I see no fundamental reason why bandwidth limits, memory wipes, etc would necessarily slow AI research (Section 5), but they might anyway.
So I think the main focus of verification research should be on de-risking these three verification measures (reliable proof of work/memory, frequent memory wipes, and highly accurate output re-computation). Without these measures, I think training would hardly slow at all. With them, training would probably slow down given current algorithms, and would possibly slow down during an intelligence explosion (but that’s unclear).
None of these measures are obviously feasible to implement on short notice; so I think whether rapidly retrofittable inference verification is viable is still an open question.
The next sections defend these claims.
Edit: After thinking about this more and talking to more people, I think developing these measures on short notice is feasible, and a subset of them will probably be prototyped within the next few months. In particular, “proof of memory” and “output re-computation” are probably workable.
2. Ways to quickly and cheaply slow training by restricting communication
During training, models share their learnings in a big hive-mind. There’s a lot of communication going on.
Here’s what training looks like:
During inference, models still talk to each other, but their communications are thin and sparse:
Could governments exploit this difference in communication to slow down AI training without taxing inference?
Here’s an overview of different verification methods and how practical I think they are:
2.1. Method #1: Disconnect inter-rack high-speed cables
One way to reduce communication is to disconnect the high-speed inter-rack network, like InfiniBand cables. These cables transmit at ~400 GB/s between server racks, and could be replaced with copper connections that only transmit data at internet speeds (e.g. 100 MB/s).
2.2. Method #2: Tap-verified bandwidth limits
Today’s AI agents don’t need to stream videos. They only need to output text; so developers might install simple devices that tap cables and verify that data transmission rates are even below internet speeds (e.g. 100 KB/s).
Aside: Why 100 KB/s? Bandwidth restrictions are most practical to enforce on inter-rack cables - not on the dense copper backplane connecting GPUs within a server rack. How much data needs to flow through these cables to support inference? Server racks currently house around ~100 H100 equivalents. Google had roughly 3 million H100 equivalents as of late 2025, and processed 480 trillion text tokens/month. This implies roughly 6,000 tokens of input data flow through each server rack every second, which is ~20 KB/s of input flow (where output flow is probably smaller); so input and output limits of 100 KB/s on each server rack are close to the minimum before data rate limits would interfere with current inference workloads.
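The per-rack arithmetic above can be checked in a few lines. All inputs are the estimates stated in the aside (3M H100 equivalents, 100 per rack, 480T tokens/month); the ~3 bytes/token figure is an assumed rough average for text:

```python
# BOTEC: per-rack inference bandwidth, using the estimates from the aside.
SECONDS_PER_MONTH = 30 * 24 * 3600           # ~2.6e6 s
total_h100s = 3_000_000                      # Google, late 2025 (estimate)
h100s_per_rack = 100
tokens_per_month = 480e12
bytes_per_token = 3                          # assumed rough average for text

racks = total_h100s / h100s_per_rack                          # 30,000 racks
tokens_per_sec_per_rack = tokens_per_month / SECONDS_PER_MONTH / racks
kb_per_sec_per_rack = tokens_per_sec_per_rack * bytes_per_token / 1e3

print(f"{tokens_per_sec_per_rack:,.0f} tokens/s per rack")    # ~6,000
print(f"~{kb_per_sec_per_rack:.0f} KB/s of input per rack")   # ~20 KB/s
```

So a 100 KB/s cap leaves roughly 5x headroom over current inference traffic per rack.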
2.3. Method #3: Output re-computation
Verifiers could reduce bandwidth even further by partially recomputing a sample of server outputs.
If this is implemented perfectly - with no leaks and flawless determinism - verifiers know that communications between model instances look like this:
Meaning:
Model instances can only communicate through low bandwidth model outputs.
At first glance, this looks like it solves everything. This is close to the most strict communication-constraint that can be enforced without disrupting inference.
But there are some important caveats.
First, a developer might be concealing some small fraction of its compute (e.g. 5%) in secret data centers. Historically, intelligence agencies have accounted for covert assets (like tanks, warheads, etc) to within a 15% median error. Compute will probably be easier to account for than (say) missiles. But I think it will still be hard to know if (for example) the PRC government diverted 5% of China’s compute to covert data centers.
This gives cheaters a non-zero budget for high-bandwidth operations.
Second, output re-computation might not account for all server traffic. There might be non-determinism or side-channels that allow covert workloads on inference servers to leak data.
So output re-computation might end up functioning as just another way to reduce bandwidth. Suppose covert workloads could only slip data through 1% of output traffic, which is the covert channel size that current prototypes allow. Then, output bandwidth would be reduced by 100x, from ~100 KB/s to ~1 KB/s.
2.4. Method #4: Memory wipes
If output re-computation is ‘leaky’ then developers can still accumulate covert state on inference servers. To prevent this state from accumulating for too long, verifiers could enforce periodic memory wipes. For example:
2.5. Method #5: Proof of work / proof of memory
Verifiers could constrain developers even further with a method called proof of work / memory. The idea is to check that declared inference actually accounts for most of the computation or memory that the developer is using. Specifically:
Now developers can only use a fraction of their servers for covert workloads - where this fraction is determined by the error bounds of the verifier's estimates.
3. Ways to efficiently continue training despite these constraints
Assume verifiers implement “output re-computation with leaks” from the previous section:
How could developers continue efficient training despite these constraints?
Here are different methods for communication-efficient training and how effective I think they might be:
3.1. Method #1: Larger batch size + infrequent SGD steps
Verdict: Ineffective.
Reasoning: Under 1 KB/s bandwidth, a 10 GB gradient would take ~4 months to upload, and many uploads would be required. Current data suggests that the loss landscape is curved enough that at least thousands of serial training steps are needed to reach strong performance, which implies that a near-frontier training run would take ~4 months × 1000 = hundreds of years to complete.
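The serial-time arithmetic, under the stated assumptions (10 GB gradients, a ~1 KB/s channel, and ~1000 serial steps as an order-of-magnitude floor):

```python
# BOTEC: wall-clock time for naive gradient syncing over a ~1 KB/s channel.
gradient_bytes = 10e9        # 10 GB gradient (assumed model scale)
bandwidth_bps = 1e3          # ~1 KB/s channel
serial_steps = 1000          # minimum serial SGD steps (order of magnitude)

one_upload_days = gradient_bytes / bandwidth_bps / 86400
total_years = one_upload_days * serial_steps / 365

print(f"one gradient upload: ~{one_upload_days:.0f} days")    # ~116 days
print(f"full run: ~{total_years:.0f} years")                  # hundreds of years
```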
Caption: The relationship between batch size and number of training steps needed to reach a specific loss (derived from Figure 4 in https://arxiv.org/abs/2505.13738). Larger batches have no benefit beyond some point (the ‘critical batch size’), and this BOTEC suggests we are already near this point.
Caption: The “critical batch size” is increasing more slowly than developers are scaling AI training, which suggests that developers will remain in a regime where increasing the batch size doesn’t reduce the number of syncs required. From: https://arxiv.org/abs/2311.08105
3.2. Method #2: Periodically merge independent training runs
Verdict: Reduces bandwidth requirements by at least 1000x, and possibly much more.
Reasoning: Merging independently trained models reduces loss because the loss landscape is convex; so averaging models achieves lower than the average of the models’ losses.
Caption: Why merging independently trained models works.
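The convexity argument can be illustrated with a toy quadratic loss, which stands in for a locally-convex basin (real loss landscapes are only approximately like this when models share a basin). By Jensen's inequality, the loss of the averaged weights is at most the average of the losses:

```python
import numpy as np

# Toy illustration: for a convex loss, loss(average of weights) <= average
# of losses. A quadratic loss is a stand-in for a shared locally-convex basin.
rng = np.random.default_rng(0)
w_star = rng.normal(size=1000)               # the (unknown) optimum

def loss(w):
    return np.mean((w - w_star) ** 2)

# Two "independently trained" models: the optimum plus independent noise.
w1 = w_star + rng.normal(scale=0.1, size=1000)
w2 = w_star + rng.normal(scale=0.1, size=1000)

merged = (w1 + w2) / 2
avg_of_losses = (loss(w1) + loss(w2)) / 2

print(f"avg of losses : {avg_of_losses:.4f}")
print(f"merged loss   : {loss(merged):.4f}")  # lower: independent noise averages out
```

Because the two models' errors are independent, averaging roughly halves the error variance, which is why the merged loss is well below the average loss rather than merely equal to it.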
Google shows that this technique can reduce the frequency of gradient syncs by ~2000x with minimal hits to performance (see https://arxiv.org/abs/2501.18512 and https://arxiv.org/abs/2311.08105):
Caption: Training loss for different sync frequencies. From the DiLoCo paper.
However, beyond a certain point, reducing syncing frequency starts to harm performance. But Li et al. showed that this problem can be mitigated by training models on different and independent subsets of data. For example, a model that's learning PyTorch does not need to sync frequently with a model learning about game development. The PyTorch model's learnings don't have serial dependencies with the game development model's learnings; so training can be parallelized to the extent that the data can be fragmented into isolated domains.
However, models benefit non-trivially from generalization; so at what sync frequencies do performance costs kick in? One paper trained experts for ~10,000 H100-hours before merging while retaining high performance; so if we assume that each server trains a single model, and a server has 100 GPUs, then developers could sync models every 10,000 / 100 / 24 ≈ 4 days and maintain frontier efficiency.
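The sync-cadence arithmetic, spelled out (the 10,000 H100-hour and 100-GPU figures are the assumptions from the paragraph above):

```python
# BOTEC: days between merges if one expert trains per 100-GPU server.
expert_train_h100_hours = 10_000   # H100-hours per expert before merging (assumed)
gpus_per_server = 100              # H100 equivalents per server (assumed)

days_between_syncs = expert_train_h100_hours / gpus_per_server / 24
print(f"~{days_between_syncs:.1f} days between syncs")   # ~4 days
```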
3.3. Method #3: Compress gradients and weights
Verdict: Reduces bandwidth requirements by at least ~100 times, and possibly much more.
Reasoning: Developers can reduce the size of (1) gradients, (2) pseudo-gradients, and (3) model weights, which reduces the time required to send these objects across bandwidth-limited channels.
In the limit, how much can gradients be compressed without slowing down training? Theoretically: there’s no free lunch. At a certain threshold, compressing by 10x slows training by 10x. But what is this threshold? In principle, gradients could be losslessly compressed to the size of the training data used to generate them. If training data is ~1 million times smaller than a gradient (~100 GB / 100 KB), then gradients could be compressed by ~1 million times. Also, training data itself is fairly compressible, so gradients might be compressible by even more than 1 million times without any hit to performance. Gradients are not currently this compressible, but maybe developers could make them more compressible by “training for compressibility,” or other methods.
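The side claim that text training data is itself quite compressible is easy to demonstrate with a standard compressor. This is a toy illustration only - the compression ratio of real training corpora is far lower than for the repetitive string used here:

```python
import zlib

# Illustration: text data compresses well, so the information-theoretic
# floor for gradient size may sit below the raw size of the training data.
# The sample below is deliberately repetitive, so the ratio is inflated.
sample = ("The model updates its weights after each batch of tokens. " * 200).encode()
compressed = zlib.compress(sample, level=9)

ratio = len(sample) / len(compressed)
print(f"raw: {len(sample)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}x")
```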
3.4. Method #4: Use more compute for inference rollouts, and less for training
Most of RL training involves generating action rollouts and scoring them. This is just inference, so developers could use their declared inference servers for rollout generation, and use secret data centers for training.
If secret data centers are ~5% of a developer's compute, then developers could use this 5% for training, and 95% for rollout generation. Would this asymmetric compute allocation slow down RL training? Multiple papers suggest no: the vast majority of rollouts are not very informative. So if developers train on only the most informative 5% of rollouts, RL still converges at a similar rate. [And also, maybe gradients aren't even necessary at all, and inference-based evolution works really well]
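The curation step might look something like the sketch below. Scoring rollouts by advantage magnitude is one common informativeness proxy; real pipelines may use other criteria, and the advantage values here are random stand-ins:

```python
import numpy as np

# Sketch: keep only the most informative ~5% of RL rollouts for training,
# matching the covert-compute budget. |advantage| is an assumed proxy
# for how informative a rollout's gradient would be.
rng = np.random.default_rng(0)
n_rollouts = 10_000
advantages = rng.normal(size=n_rollouts)      # stand-in advantage estimates

keep_frac = 0.05                              # covert compute is ~5% of total
k = int(n_rollouts * keep_frac)
top_idx = np.argsort(-np.abs(advantages))[:k] # indices of the top 5%

print(f"kept {len(top_idx)} of {n_rollouts} rollouts")
print(f"min kept |advantage|: {np.abs(advantages[top_idx]).min():.2f}")
```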
However, this strategy still requires surmounting a couple of challenges:
First, new checkpoints have to be periodically uploaded to inference servers. RL generates rollouts using the model being trained. So, new models would need to be uploaded to inference servers continually during training. But bandwidth limits could make this upload process slow.
At 100 KB/s, a 3 GB expert model could be uploaded every few hours. Currently, models during reinforcement learning are updated every ~few minutes (the time needed to generate rollouts); so this would be a ~100x slower update frequency than is typical.
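The upload-cadence arithmetic, under the figures stated above (3 GB expert checkpoint, 100 KB/s verified rack bandwidth):

```python
# BOTEC: how long a checkpoint upload takes under the rack bandwidth cap.
checkpoint_bytes = 3e9      # 3 GB expert checkpoint (assumed)
bandwidth_bps = 100e3       # 100 KB/s verified rack bandwidth

hours_per_upload = checkpoint_bytes / bandwidth_bps / 3600
print(f"~{hours_per_upload:.1f} hours per checkpoint upload")   # ~8 hours
```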
However, model updates are becoming less frequent by default because rollouts are taking longer to generate. Also, updates can be significantly delayed without harming performance; so an update frequency of a few hours probably is not a problem.
Additionally, the rollout data has to be sent through verifier taps even though it isn't generated by declared models. We’re assuming that verifiers use output re-computation to verify that outputs are generated by the “declared” inference model. But the model producing rollouts for RL is different from the “declared model” and is constantly changing; so RL rollouts would need to somehow escape servers even though they can’t be re-computed by verifiers.
Developers might exfiltrate RL rollouts via the non-determinism of inference. Output re-computation naturally has some error, since inference is not perfectly deterministic. Even hardware-level irregularities will cause some deviations. In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.
If only around 1% of the RL rollouts generated could be actually sent back to training servers, is this a big constraint? Servers would need to curate the top 5% of rollouts anyway, since this is the maximum the training clusters can digest (remember that we are assuming covert data centers are 5% of total developer compute); so further tightening the curation threshold to the top 1% probably wouldn’t be a big problem.
Therefore, developers could potentially side-step inference verification by using declared servers for RL rollout generation.
4. But more aggressive verification methods would probably make training with current algorithms impractical
If we combine the most promising communication-efficient training methods from the last section, here’s what developers might do:
To defeat this strategy, verifiers could implement any of the following:
Memory wipes every few minutes. Each time verifiers wipe servers, developers need to re-upload the 3 GB expert models. But uploading 3 GB across 100 KB/s would take 8 hours; so if memory is wiped every few minutes, developers couldn’t upload a full expert before all of their progress is erased.
In order to upload a checkpoint within the 1-minute memory-wipe window, model experts would need to be around ~5 MB. I’m not aware of any work that has split a near-frontier, 100 GB model into 20,000 different independent 5 MB experts and retained high performance. I’m not sure if this is even possible.
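The wipe-window arithmetic, under the stated assumptions (100 KB/s bandwidth, 1-minute wipe window, ~100 GB frontier model):

```python
# BOTEC: maximum expert size that fits through a 1-minute wipe window,
# and how many such experts a frontier model would shatter into.
bandwidth_bps = 100e3        # 100 KB/s verified rack bandwidth
wipe_window_s = 60           # memory wiped every minute

max_expert_bytes = bandwidth_bps * wipe_window_s
frontier_model_bytes = 100e9                  # ~100 GB near-frontier model

n_experts = frontier_model_bytes / max_expert_bytes
print(f"max expert size: {max_expert_bytes/1e6:.0f} MB")      # ~6 MB
print(f"experts needed: {n_experts:,.0f}")                    # ~17,000
```

The ~5 MB / 20,000-expert figures in the text round these numbers conservatively.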
Any of the above inference verification methods would probably significantly slow training given current algorithms.
5. However, if developers (or AIs) have a lot of time to research better algorithms, all bets are off
All of the inference verification methods above boil down to restricting communications between servers. This only works because training currently involves a lot more communication than inference does. But I see no reason why this has to be the case.
For example:
All of these are plausible reasons why even extreme communication restrictions might not slow AI development much at all. However, I think it’s also plausible that communication constraints would cause a major slowdown - even in the limit of AI R&D. This all just depends on the nature of AI development and learning. AI development might be intrinsically much more bottlenecked on compute than communication. Or it might not be. We don’t know.
6. Conclusion
More work is needed to be confident that inference verification can be effectively implemented on short notice.
If you are interested in doing this work, say so, and I can try to connect you to the relevant people.
The best way to reach me is via email: joshuamclymer@gmail.com
Appendix
Are we in the serially bottlenecked training regime? A BOTEC by Claude
Setup
There is a critical batch size (B_crit) beyond which adding more data-parallel workers yields diminishing returns. Below B_crit, doubling the batch size roughly halves the number of training steps needed — perfect parallelization. Above B_crit, you still need roughly the same number of serial steps, but you're burning extra tokens for no benefit.
If a training cluster can fill B_crit with its data-parallel capacity, it is serial-bottlenecked — more GPUs won't help. If it can't reach B_crit, it is hardware-bottlenecked — more GPUs would directly speed up training.
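The regime test described above amounts to a single comparison. The sample values below are illustrative, not measured:

```python
# Sketch of the serial- vs hardware-bottleneck test described above.
def regime(achievable_batch_tokens: float, b_crit_tokens: float) -> str:
    """Serial-bottlenecked if the cluster can fill the critical batch
    size with data parallelism; hardware-bottlenecked otherwise."""
    if achievable_batch_tokens >= b_crit_tokens:
        return "serial-bottlenecked (more GPUs won't help)"
    return "hardware-bottlenecked (more GPUs speed up training)"

# Illustrative values, with B_crit ≈ 118M tokens as estimated later:
print(regime(achievable_batch_tokens=40e6, b_crit_tokens=118e6))
print(regime(achievable_batch_tokens=470e6, b_crit_tokens=118e6))
```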
This BOTEC asks: at the scale of 100K and 1M H100-equivalents, which regime are we in?
Key Formula
From Bergsma et al. 2025 ("Power Lines"), trained on GPT-2-style models up to 3.3B parameters on SlimPajama:
where B is in sequences of 2048 tokens and D is total training tokens. This was fit on datasets up to ~143B tokens; we are extrapolating 100–400× beyond the fitted range.
B_crit at Frontier Scale
Dataset size (D) | B_crit (tokens/batch) | S_min (steps) | Wall-clock at B_crit, 2s/step
At B_crit, the number of training steps is 2 × S_min, and the total tokens consumed is 2 × D_min. The lab pays a 2× token overhead in exchange for minimizing wall-clock time.
How Many GPUs Per Model Replica?
Different architectures consume vastly different amounts of model parallelism, leaving different amounts of headroom for data parallelism:
Architecture | TP | PP | EP | GPUs/replica
Achievable Batch Size vs. B_crit
Assuming 8192-token sequences, gradient accumulation of 8, and D = 15T tokens (B_crit ≈ 118M tokens):
Cluster | Architecture | DP replicas | Batch size | Ratio to B_crit | Regime
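The ratio-to-B_crit column follows mechanically from the assumptions stated above (8192-token sequences, gradient accumulation of 8, B_crit ≈ 118M tokens). The GPUs-per-replica value below is an illustrative assumption for a dense model, chosen to roughly match the ~0.4× (100K GPUs) and ~4× (1M GPUs) figures in the takeaways:

```python
# Reproduce the "ratio to B_crit" calculation from the stated assumptions.
SEQ_LEN = 8192        # tokens per sequence (assumed above)
GRAD_ACCUM = 8        # gradient accumulation steps (assumed above)
B_CRIT = 118e6        # tokens per batch at D = 15T (from the extrapolation)

def ratio_to_b_crit(cluster_gpus: int, gpus_per_replica: int) -> float:
    dp_replicas = cluster_gpus // gpus_per_replica
    batch_tokens = dp_replicas * SEQ_LEN * GRAD_ACCUM
    return batch_tokens / B_CRIT

# Illustrative dense model at 128 GPUs/replica (assumed figure):
print(f"100K GPUs: {ratio_to_b_crit(100_000, 128):.2f}x B_crit")   # below 1: hardware-bottlenecked
print(f"1M GPUs:   {ratio_to_b_crit(1_000_000, 128):.2f}x B_crit") # above 1: serial-bottlenecked
```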
Key Takeaways
At 100K H100s, every architecture is hardware-bottlenecked. Even a relatively small dense model can only reach ~0.4× B_crit. More GPUs would directly speed up training. MoE models are especially far from B_crit because expert parallelism consumes most of the GPU budget.
At 1M H100s, dense models become serial-bottlenecked but MoE models do not. A dense 300B model would overshoot B_crit by 4×, wasting significant compute on redundant gradient information. But a DeepSeek-V3-style MoE still only reaches 0.5× B_crit, and a 2T-parameter MoE reaches just 0.1×. MoE architectures absorb GPU capacity into model parallelism, keeping the data-parallel dimension small and the training compute-efficient.
This is a structural argument for MoE at scale. As clusters grow, dense models hit the seriality wall first. MoE provides a way to productively use additional GPUs (holding more expert parameters) without pushing batch size past B_crit. It converts excess parallel capacity into model quality rather than wasted data parallelism.
If the Power Lines extrapolation holds, serial wall-clock time is surprisingly short. At B_crit with 2s/step, a 15T-token run finishes in ~6 days. Actual frontier training runs take months, suggesting labs operate well below B_crit — trading wall-clock time for compute efficiency — or that step times are much longer than 2 seconds at these configurations.
Caveats
These estimates rest on several shaky assumptions:
Sources