Quick thought: I wonder if you could still be overestimating parameter counts for models that would be made by distilling a larger teacher down to a smaller student. Any OSS ones you could test this hypothesis on?
Properly done, the methodology should find that sufficiently overtrained low-parameter models ~= distilled low-parameter models, since there isn’t more capacity to memorize. But yeah, that would be another good sanity check to run.
Wait, why are distilled models better than just overtraining the small model again? My guess is it’s mainly because SFT >> RL for efficiency, and cloning good CoTs is easier than sampling them via random exploration.
re: distilled > overtrained, you can distill via on policy distillation (OPD) with the strong model as the teacher, get dense supervision (you wouldn’t get this with RLVR) and get generalization gains (because of the on policy nature of OPD vis a vis SFT, which is a lot more fat handed).
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
The on policy distillation/dAgger > SFT/behavioral cloning seems like a smaller improvement in comparison to that, but you’re right that it is an improvement.
Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters?[1]
Introduction
Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of many frontier closed-source models, including the newer GPT-5.5 (9.7T parameters) and Claude Opus 4.6 (5.3T parameters) as well as older models such as o1 (3.5T) and gpt-4o (720B). The paper, titled “Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity”, introduces a dataset of factual knowledge questions of varying difficulty, regresses performance on this dataset against parameter count, and then uses this regression to extrapolate from the performance of closed-source frontier models to their parameter counts. A notable fact about this paper is that, unlike most empirical machine learning papers, it’s single-authored: Bojie Li, the chief scientist of Pine AI, is the sole author of this piece.
These results were suspicious for many reasons, the primary one being that the paper seems like low-effort, hastily written AI slop. For example, the codebase (https://github.com/19PINE-AI/ikp) was constructed in large part with Claude Code and has many of the hallmarks of code that is almost entirely vibe-coded with little sanity checking (e.g. redundant and inconsistent variable definitions[2], boilerplate bloat, excessive error handling[3], and silent failures[4]). The same can be said of the author’s website for this paper (archived here), which defines terms that appear nowhere else on the page[5], has table headings inconsistent with their contents[6], and has a very high heading-to-text ratio.
We (Benjamin and Lawrence) decided to dig into these results further. Specifically, we read the paper, reproduced the author’s results using their code base, and then dug into some obvious methodological issues to see how much the issues affected the author’s results.
We find:
We identify two major issues with the IKP parameter estimates. First, the author's code and released numbers floor each model's per-tier scores at zero, which inflates the measured performance of small models. Second, around 10% of the IKP questions (mainly hard ones) were ambiguous or had incorrect answers. Fixing these two issues shrinks the estimated parameter counts of frontier models substantially.
Despite these issues, we think that the core idea – reverse engineering LLM parameter count by quantifying memorization capacity – is solid, and welcome future work implementing this in a more rigorous and systematic way.
Summary of Li’s “Incompressible Knowledge Probes”
As usual, let’s start by summarizing the paper at hand.
One way of estimating the size of closed models is by extrapolating from API throughput and pricing under a hardware-cost model (e.g. Epoch AI’s inference economics).
Li argues that these size estimates are unreliable, off by a factor of over 2x, due to confounders such as quantization, batching, and vendor margin. He instead proposes reverse engineering parameter count using the fact that neural networks can only store a number of facts that is linear in their parameter count.[7] Unfortunately, this isn’t as simple as counting all the facts:[8] for one, doing that exhaustively is intractable.
Li builds a set of questions ("Incompressible Knowledge Probes," IKP) testing factual knowledge with varying degrees of obscurity. Probes come from four sources: GPT-5-generated questions, Wikidata SPARQL pulls, DBLP/OpenAlex researcher records, and a small set of hand-curated questions. Li calls these "probes," but to avoid confusion we'll just call them questions.
Li claims six contributions:
The IKP dataset
The IKP dataset consists of 1400 questions, divided evenly into 7 tiers of obscurity/difficulty (200 questions per tier). There are four sources of questions: 401 LLM-generated (GPT-5) questions, 557 Wikidata SPARQL-derived questions, 345 researcher questions built from DBLP/OpenAlex records, and 97 manually curated questions.
The difficulty of each tier is empirically calibrated against 6 "landmark" models. These models consist of 5 open-weight models ranging from Qwen 2.5 0.5B (T1) up to Kimi K2.5 1T (T5), as well as Gemini 3.1 Pro (T6). A question is assigned to tier k if the k-th landmark answers it correctly but the (k−1)-th landmark does not. T7 is reserved for questions no landmark model gets right.
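As a concrete illustration, here is a minimal sketch of the tier-assignment rule as we understand it from the paper; the function and variable names are ours, and it assumes landmark correctness is monotone in landmark size.

```python
def assign_tier(landmark_correct: list[bool]) -> int:
    """Assign a question to a difficulty tier, given the correctness of the
    6 landmark models ordered from smallest (index 0) to largest (index 5).

    Tier k means the k-th landmark answers correctly but the (k-1)-th does not;
    questions that no landmark answers correctly fall into T7.
    """
    for k, correct in enumerate(landmark_correct, start=1):
        if correct:
            return k  # smallest landmark that gets it right
    return 7  # no landmark model answered correctly

# Example: only the two largest landmarks (Kimi K2.5 and Gemini 3.1 Pro) are correct -> T5
assert assign_tier([False, False, False, False, True, True]) == 5
```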
As we’ll later note, both the Wikidata and researcher question datasets (which comprise over 900 of the 1400 questions, including all questions from T5-T7) have fairly significant quality issues.[12] For example, both the Wikidata and researcher question sets contain many ambiguous questions resulting from name collisions (e.g. multiple researchers or locations that share the same name). Other Wikidata questions are underspecified or have ambiguous answers: e.g. one question asks about Oxford’s founding, but while Oxford received its royal charter in 1248, there is evidence of teaching there as early as 1096, and the university arguably existed even earlier. Some of the questions also reference outdated information, or only consider one author correct for multi-authored works. This complicates the interpretation of the results.
IKP scoring and regression methodology
Li scores each of a model’s responses on either a 3- or 4-point scale:
The hallucination penalty is added in order to discourage guessing (though it also penalizes models that know the answers to questions whose gold answers are incorrect). Each tier’s score is the mean over its 200 questions, and a model's overall "penalized accuracy" is the unweighted mean of the seven tier scores. In the released data, the per-tier scores are floored at 0 when calculating penalized accuracy, even though the paper text explicitly claims they are not floored "to preserve the bluff signal in the calibration." This is one of the methodological inconsistencies we'll come back to, as the choice meaningfully changes the slope of the fit.
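To make this concrete, here is a rough sketch of the per-tier scoring, including the flooring step we discuss below. The partial-credit weight and hallucination penalty are placeholders (the exact values come from Li's rubric), so treat this as illustrative rather than a faithful reimplementation.

```python
import numpy as np

# Placeholder weights -- the actual values are set by Li's scoring rubric.
WEAK_CREDIT = 0.5             # partial credit for CORRECT_WEAK
HALLUCINATION_PENALTY = 0.25  # subtracted for each WRONG (confident but false) answer

def tier_score(labels: list[str], floor: bool = True) -> float:
    """Penalized accuracy for one 200-question tier.

    `labels` holds one of CORRECT_STRONG / CORRECT_WEAK / WRONG / REFUSAL per
    question; refusals earn 0 and incur no penalty.
    """
    strong = labels.count("CORRECT_STRONG")
    weak = labels.count("CORRECT_WEAK")
    wrong = labels.count("WRONG")
    score = (strong + WEAK_CREDIT * weak - HALLUCINATION_PENALTY * wrong) / len(labels)
    return max(score, 0.0) if floor else score  # the flooring the paper claims not to apply

def penalized_accuracy(tiers: list[list[str]], floor: bool = True) -> float:
    """Overall score: unweighted mean of the seven per-tier scores."""
    return float(np.mean([tier_score(t, floor=floor) for t in tiers]))
```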
The judge is Gemini 3 Flash Preview at temperature 0, and all target models are run once at temperature 0. Note that this is fairly non-standard for model evaluations (and many reasoning model providers explicitly discourage running their models with t=0).
The headline regression is a one-line OLS: A = α · log10(N) + β, where A is the penalized accuracy and N is the parameter count in billions.
This OLS is fit on 89 open-weight models with known parameter counts, ranging from SmolLM2-135M up to DeepSeek V4 Pro at 1.6T. Li reports α = 0.147, β = +0.132, R² = 0.917, with leave-one-out median fold error of 1.59× and a 90% prediction interval factor of 3.0×. Inverting the regression gives a parameter-count estimate for any target model: N̂ = 10^((A − β) / α).
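The calibration and inversion are simple enough to sketch in a few lines of numpy. The coefficients below are the paper's reported values; the helper names are ours.

```python
import numpy as np

def fit_calibration(params_billions: np.ndarray, accuracy: np.ndarray) -> tuple[float, float]:
    """OLS fit of penalized accuracy A against log10(parameter count N, in billions)."""
    alpha, beta = np.polyfit(np.log10(params_billions), accuracy, deg=1)
    return alpha, beta

def estimate_params(accuracy: float, alpha: float, beta: float) -> float:
    """Invert the regression: N_hat = 10 ** ((A - beta) / alpha), in billions."""
    return 10 ** ((accuracy - beta) / alpha)

# With the paper's reported fit (alpha = 0.147, beta = 0.132), a model scoring
# ~0.72 penalized accuracy maps to roughly 10,000B (~10T) parameters.
print(estimate_params(0.72, alpha=0.147, beta=0.132))
```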
For Mixture-of-Experts models, total parameters predict factual knowledge meaningfully better than active parameters (R² = 0.79 vs 0.51).[13]
IKP performance doesn't improve over time
The densing law paper (Xiao et al. 2024) introduces "capability density", defined as the ratio of a model's effective parameter size to its actual parameter size. Here, "effective size" is the parameter count a reference model would need to match the target's downstream score. Across 29 open-source base models, they fit an exponential trend in capability density against release date, which they translate to "the maximum capability density of LLMs doubles approximately every 3.3 months."[14]
To test whether or not a similar law applies to the IKP questions, Li adds release date as a covariate to the IKP regression: A = α · log10(N) + β + γ · t, where t is the model's release date in months.
If Xiao et al.'s densing law applied to the IKP questions, then γ should be about +0.0117/month (the value that produces the claimed 3.3-month density doubling). Across 96 dated open-weight models, Li fits γ = −0.0010/month (95% CI [−0.0031, +0.0008]), statistically indistinguishable from zero, and rejects the naive application of the densing law with high statistical confidence.
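A minimal sketch of this covariate check, assuming a per-model table of accuracy, parameter count, and release date (the file name and column names are ours, not the repository's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per dated open-weight calibration model (column names are ours):
#   accuracy        -- IKP penalized accuracy
#   params_billions -- known parameter count, in billions
#   months          -- release date, in months since an arbitrary reference
df = pd.read_csv("ikp_calibration_models.csv")  # hypothetical file name
df["log_params"] = np.log10(df["params_billions"])

fit = smf.ols("accuracy ~ log_params + months", data=df).fit()
print(fit.params["months"], fit.conf_int().loc["months"].tolist())
# Densing-law prediction: gamma ~ +0.0117/month; Li's fit (and our refits) land near zero.
```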
This result stands up to all of the stress testing we performed. We refit the regression with vendor fixed effects (22 vendor dummies), family fixed effects (33 family dummies), without thinking-mode variants, dropping the open-weight tier-landmark models (anti-circularity check), and under both flooring regimes for the per-tier scores. In every specification γ stays within ±0.004/month of zero, and the +0.0117/month densing prediction is rejected with effective certainty. So whatever else the paper does, this result holds.[15]
That being said, we believe the right way to read this result is: holding parameter count fixed, factual recall on rare entities has not improved across open-weight model generations from 2023 through April 2026. Controlling for parameter count, procedural benchmarks like MMLU and HumanEval have improved over the same window, often dramatically.[16] Li falsifies Xiao et al.'s densing law on the IKP dataset, but not in general, given that the densing law was not intended to cover factual recall capacity.
Methodological Issues with the IKP paper
The paper and codebase have a number of methodological issues across dataset construction, judging methodology, and reporting of results. The two main issues that impact the results are the use of per-tier flooring for scores (contrary to the paper’s claims) and questions with ambiguous or incorrect answers. When we adjust for these issues in our replication, the headline numbers change significantly.
Per-tier floors to the scoring
When scoring the models, each question's score is computed as described above, and each tier's score is the mean over its questions. Section 4.3 of the paper says "Per-tier scores are not floored at zero in the released results … to preserve the bluff signal in the calibration." Flooring means that a tier score that would go negative due to hallucination penalties is instead held at 0. Despite this claim, the scores are in fact floored, both in the values reported in the paper and in the repository.
Applying per-tier flooring makes the scores of smaller models much higher: relative to larger models, they hallucinate more and refuse less when they don't know the answer, and the resulting negative tier scores get rounded up to zero. Removing the flooring lowers small models' accuracy at a given parameter count, and thus decreases the slope of log parameters against accuracy. Specifically:
Addressing this inconsistency also causes R² to drop from 0.917 → 0.815 and the 90% prediction interval span (95 percentile param count/5 percentile param count) to widen from 3.0 to 5.7. Notably, the slope of log parameter count relative to performance drops substantially, from 6.79 to 3.56. This means the large parameter counts for frontier models in Li’s original paper are largely an artifact of the flooring (or more cynically, an undocumented “code-level optimization”).
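A minimal sketch of the comparison we ran, building on the penalized_accuracy helper sketched earlier (the structure and names are ours):

```python
import numpy as np

def inverse_slope(params_billions, per_model_tier_labels, floor: bool) -> float:
    """Slope of the regression predicting log10(parameter count) from penalized
    accuracy across the calibration models, under a given flooring regime."""
    acc = np.array([penalized_accuracy(tiers, floor=floor) for tiers in per_model_tier_labels])
    slope, _ = np.polyfit(acc, np.log10(np.asarray(params_billions, dtype=float)), deg=1)
    return slope

# slope_floored   = inverse_slope(params, labels, floor=True)   # ~6.8 in our refit
# slope_unfloored = inverse_slope(params, labels, floor=False)  # ~3.6 in our refit
```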
Addressing the two major issues we identified leads to a substantially different slope compared to the original fit in the paper, and leads to a slightly lower R². See the appendix for further details.
Ambiguous/incorrect answers to hard questions
For the researcher questions, Li filters out two-character Chinese names and single-initial given names (Section 4.1). Unfortunately, manual inspection of some randomly sampled questions revealed two issues this filter doesn't catch:
For wikidata questions, Li applied a “10-round audit/repair cycle” (section 7.7). Unfortunately, it seems that this repair cycle failed to catch at least two types of issue:
For manually generated questions, we inspected those where the models consistently did much worse than the tier would suggest, and found two incorrect questions: one on the highest peak in Bangladesh (whose answer changed from the IKP gold answer of Keokradong to Saka Haphong with more survey data) and another on the founding year of the Mongolian People’s Revolutionary Party (different sources cite 1920 or 1921). We excluded these two from our analysis.
Interestingly, these possible issues are noted by Li in Appendix H. However, he does not attempt to quantify how many questions are ambiguous or incorrect, nor how large the impact is if you were to remove the ambiguous questions. We do that here.
We attempted to remove the ambiguous or incorrect questions we noticed, though we were unable to remove all Wikidata questions with semantic ambiguities other than name collisions, as we did not have time to check the answer to every single question manually.
| Source | Number of questions | Flagged ambiguous | Heuristic |
| --- | --- | --- | --- |
| LLM-generated | 401 | 0 | Visual spot check was performed across all four tiers with an LLM judge. The questions seem well formed. |
| Researcher | 345 | 86 (24.9%) | OpenAlex shows ≥2 distinct researchers with ≥50 citations sharing the name. |
| Wikidata | 557 | 45 (8.08%) | ≥3 Wikidata entities share the same label (sketched below). We found other categories of ambiguities that we could not programmatically remove. |
| Manually generated | 97 | 2 (2.05%) | We manually inspected the manually generated questions where the models performed much worse than expected, and confirmed that 2 of the questions were incorrect. |
| Total questions audited | 1,400 | 131 (9.4%) | |
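As an illustration of the Wikidata heuristic, here is a rough sketch of the label-collision check; the query and threshold are simplified versions of what we ran, so treat the details as approximate.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def count_entities_with_label(label: str, lang: str = "en") -> int:
    """Count Wikidata entities whose label exactly matches `label`.
    We flagged a question as ambiguous if >= 3 entities shared its subject/answer label."""
    query = f"""
    SELECT (COUNT(DISTINCT ?item) AS ?n) WHERE {{
      ?item rdfs:label "{label}"@{lang} .
    }}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "ikp-audit-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return int(resp.json()["results"]["bindings"][0]["n"]["value"])

# e.g. count_entities_with_label("Oxford") returns well over 3 (the city, the university, many towns).
```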
Corrected model parameter estimates
We attempt to fix the two methodological issues we identified above, by removing questions that have ambiguous answers from the various datasets, and also by removing the flooring from the accuracy estimates. We then recalculate scores across all the models measured in the paper. We present the newly calculated scores for 8 large frontier or near-frontier models; a larger list of predicted model parameter counts is included in the appendix.
Parameter estimates comparing our methodology (blue) with the original paper's (red) with 90% prediction intervals (log scale). The green stars are true parameter sizes where known.
Overall, the estimates drop massively for the most capable frontier models (a nearly 10x difference for Gemini 3.1 Pro), while for some of the smaller models we see a modest increase.
Possible methodological issues that mattered less than we thought
Thinking vs non-thinking
In the original work, models would often score much better with thinking enabled than without. This led to parameter estimates that differed by as much as 4.9x: Grok-4.20 was estimated to have 110B parameters without thinking, and 540B parameters with. Claude Opus 4.6 was estimated to have 2.4T parameters without thinking, but 5.3T with thinking enabled, a 2.2x difference.
This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, under Li's original methodology.
The headline results in the paper obscure this difference, as they generally report the maximum parameter count of the thinking and non-thinking versions of the same model.
This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, using our updated methodology. With our modifications, the estimates are significantly closer together than under the floored scoring presented in the original paper.
Interestingly, after removing the arbitrary flooring, we generally observe much smaller score differences between frontier models with and without thinking enabled. Grok-4.20’s thinking multiplier dropped to 3.9x, while Claude Opus 4.6’s dropped to 1.2x. We unfortunately did not have the time to investigate why the thinking gap decreased after flooring was removed.[19] That being said, we believe that this is some evidence that, on the IKP benchmark, enabling or disabling thinking does not have that much of an effect on the estimated parameter counts of frontier models.
Different accuracy metrics used in some repository JSON files
We observed that the penalized accuracy metric used to score the models in some of the JSON files in the repo was different from what was shown in the paper. We subsequently investigated whether these different accuracy metrics affected the results, but found that the aberrant JSON scores were not used to produce any of the figures or tables in the paper. That is, the different accuracy metrics did not affect any of Li’s results as presented in the paper.
Conclusion
In this work, we examined the robustness of both the methodology and results of Li’s “Incompressible Knowledge Probes” paper. We identified two main methodological issues with the work: the per-tier flooring that exists in the code despite the paper claiming otherwise, and the large fraction of ambiguous questions, especially in higher tiers of difficulty. We also examined two potential issues that turned out not to impact the results significantly: the performance gap between thinking and non-thinking models was much smaller than we initially thought, and the different accuracy metrics included in some JSON files were not used for the main analysis.
That being said, three of Li's claims survive every stress test we applied:
However, what does not survive are the specific multi-trillion parameter estimates for closed frontier models. After attempting to correct for methodological issues to the best of our ability, we found that the parameter count of the top proprietary frontier models drops from ~10T to ~1.5T.
We emphasize, however, that our point estimate of 1.5T for GPT-5.5 should not be read as a confident estimate of its true parameter count. Instead, we see it as evidence that parameter estimates from this methodology are unreliable and sensitive to methodological choices. Both of us are quite uncertain about the exact parameter count of GPT-5.5.
We think that the IKP dataset (and methodology) is a real contribution. Li also deserves credit for releasing the dataset and code; it is precisely because he open-sourced his code that we could write this post so quickly.[20] But the standard for an empirical paper that produces concrete numbers ("GPT-5.5 has 9.7T parameters") needs to be higher than "I ran one regression and reported the result." Methodological choices should be discussed and justified; the effects of possible limitations or dataset issues should be analyzed and not just acknowledged in passing; and results that seem surprisingly good (or just surprising) should be scrutinized before they go viral on Twitter.
Discussion
On a broader point, we think this work illustrates both the risks and potential of AI-generated research code.
Li's paper illustrates many of the risks. The codebase looks like code that was generated quickly and never carefully checked, including the six near-identical judge prompts in different scripts, defensive error handling that silently turns network failures into refusals, redundant variable definitions, and at least two cases where the paper text and the released code disagree about what the methodology is. The companion website has terms defined but used nowhere and incorrectly labeled tables. None of these is individually fatal, but together they describe a pipeline where no one (including the author) read the work with a critical eye before it went public. A single-authored empirical paper with no internal or external review is a known failure mode. A single-authored empirical paper generated largely by an LLM without much review is the same failure mode at higher throughput.
But the same tools that lower the cost of producing this kind of work also lower the cost of checking it. Thanks to Claude Code (and to a smaller extent, Codex) automating much of the code generation process on our end, the two of us were able to replicate Li’s main results, perform many sensitivity analyses, and write this up in around 8-10 hours of effort per person.[21] We estimate that the same amount of work would’ve taken us around 15-20 hours each using previous generation coding assistants (e.g. Cursor’s autocomplete).[22]
In terms of the IKP work, despite the issues with the headline results, we believe the core idea of reverse engineering LLM parameter count using memorization capacity to be solid and welcome future work that attempts to implement it in a more rigorous and systematic way. As a broader point about research scrutiny, we hope that this example serves as an important reminder of the changing economics of producing and scrutinizing new research results: as costs of both drop and the production of new results ramps up, so too should the scrutiny we apply to each result.
Appendix
Table of corrected parameter counts.
We share a table of original and corrected parameter counts alongside 90% prediction intervals to compare the numbers more precisely than in our figure. We start with the 8 models used in the figures in the main body (chosen to give a reasonable spread amongst model providers), the 6 landmark models used to divide questions into difficulty tiers, and then the 10 models with highest original estimated parameter counts.
| Model | Vendor | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- |
| gemini-3.1-pro[23] | Google | 40,794B [13,598 – 122,382] | 4,653B [816 – 26,522] | ↓8.77× |
| gpt-5.5 | OpenAI | 9,659B [3,220 – 28,977] | 1,458B [256 – 8,311] | ↓6.62× |
| gpt-5 | OpenAI | 4,088B [1,363 – 12,264] | 1,330B [233 – 7,581] | ↓3.07× |
| claude-opus-4.7 | Anthropic | 4,042B [1,347 – 12,126] | 1,132B [199 – 6,452] | ↓3.57× |
| claude-sonnet-4.6 | Anthropic | 1,730B [577 – 5,190] | 661B [116 – 3,768] | ↓2.62× |
| grok-4.20 (thinking)[24] | xAI | 542B [181 – 1,626] | 768B [135 – 4,378] | ↑1.42× |
| deepseek-r1 (671B) | DeepSeek | 424B [141 – 1,272] | 760B [133 – 4,332] | ↑1.79× |
| deepseek-v3 (671B) | DeepSeek | 589B [196 – 1,767] | 564B [99 – 3,215] | ↓1.04× |

All estimates are in billions of parameters.
Below we share the parameter counts for the original 6 "landmark" models used in the paper to separate the questions into different difficulty tiers.
| Tier | Model | Vendor | True params | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- | --- | --- |
| T1 | Qwen 2.5 0.5B | Alibaba | 0.5B | 1B [0 – 3] | 0.2B [0 – 1] | ↓4.58× |
| T2 | Qwen 2.5 7B | Alibaba | 7.6B | 9B [3 – 27] | 8B [1 – 48] | ↓1.07× |
| T3 | Qwen 3 32B* | Alibaba | 32.0B | 34B [11 – 103] | 23B [4 – 134] | ↓1.46× |
| T4 | Qwen 3 235B* | Alibaba | 235.0B | 113B [38 – 338] | 145B [25 – 827] | ↑1.29× |
| T5 | Kimi K2.5* | Moonshot | 1,040B | 3,121B [1,040 – 9,362] | 680B [119 – 3,878] | ↓4.59× |
| T6 | Gemini 3.1 Pro | Google | — (closed) | 40,794B [13,598 – 122,382] | 4,653B [816 – 26,522] | ↓8.77× |
Below we share the parameter estimates for the 10 models with the highest parameter count estimates in the original paper.
| Model | Vendor | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- |
| gemini-3.1-pro[23] | Google | 40,773B [13,591 – 122,319] | 4,653B [816 – 26,522] | ↓8.76× |
| gemini-3-flash-think[25] | Google | 17,065B [5,688 – 51,195] | 2,526B [443 – 14,398] | ↓6.75× |
| gemini-3-flash | Google | 14,433B [4,811 – 43,299] | 1,939B [340 – 11,052] | ↓7.44× |
| gpt-5.5-pro | OpenAI | 10,267B [3,422 – 30,801] | 1,471B [258 – 8,385] | ↓6.98× |
| gpt-5.5-think | OpenAI | 9,656B [3,219 – 28,968] | 1,458B [256 – 8,311] | ↓6.62× |
| gpt-5.5 | OpenAI | 8,831B [2,944 – 26,493] | 1,459B [256 – 8,316] | ↓6.05× |
| claude-opus-4.6-think | Anthropic | 5,254B [1,751 – 15,762] | 1,399B [245 – 7,974] | ↓3.76× |
| gpt-5-pro | OpenAI | 4,110B [1,370 – 12,330] | 1,197B [210 – 6,822] | ↓3.43× |
| gpt-5-think | OpenAI | 4,087B [1,362 – 12,261] | 1,330B [233 – 7,581] | ↓3.07× |
| claude-opus-4.7-think | Anthropic | 4,041B [1,347 – 12,123] | 1,132B [199 – 6,452] | ↓3.57× |
Details on our corrected OLS fit
For the corrected OLS fit in the main body, each point in the calibration scatter plot is one of n=89 open-weight models. Each model's accuracy is the unweighted mean of its per-tier penalized accuracies where tiers refer to difficulty rankings between 1 (easiest) and 7 (hardest) that the 1,400 probes (or 1,269 after we remove the 131 ambiguous/incorrect ones) are separated into.
The accuracy calculation within a tier is (n_strong + w · n_weak − λ · n_wrong) / 200, where n_strong/n_weak/n_wrong are the counts of probes scored CORRECT_STRONG, CORRECT_WEAK, and WRONG within the tier, w is the partial credit for weak answers, and λ is the hallucination penalty.
In our method we remove the per-tier flooring at 0 in the accuracy calculation and exclude the 131 questions we flagged as ambiguous.
Otherwise, we follow the paper's approach and fit A = α · log10(N) + β by OLS, where A is accuracy and N is parameter count in billions, and then rearrange to N̂ = 10^((A − β) / α) for prediction.
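For the 90% prediction intervals we report, one simple approach (which we sketch here; we do not claim it exactly matches Li's procedure) is to fit the inverse regression predicting log10(parameter count) from accuracy and take its observation-level prediction interval:

```python
import numpy as np
import statsmodels.api as sm

def fit_inverse_regression(accuracy: np.ndarray, params_billions: np.ndarray):
    """Fit log10(N) ~ accuracy on the open-weight calibration models."""
    X = sm.add_constant(accuracy)
    return sm.OLS(np.log10(params_billions), X).fit()

def predict_with_pi(model, accuracy: float, alpha: float = 0.10):
    """Point estimate and 90% prediction interval for parameter count, in billions."""
    frame = model.get_prediction(np.array([[1.0, accuracy]])).summary_frame(alpha=alpha)
    return (
        10 ** frame["mean"].iloc[0],
        10 ** frame["obs_ci_lower"].iloc[0],
        10 ** frame["obs_ci_upper"].iloc[0],
    )
```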
As Betteridge's law of headlines suggests, the answer is no.
For example, the judge prompt appears in at least 6 different scripts with slightly different wording.
For example, lines 78-86 of src/scorer.py:
```python
result = judge_fn(prompt).strip().upper()
# Must check for exact "CORRECT" — not substring of "INCORRECT"
if result == "CORRECT":
    return True
if result.startswith("CORRECT"):
    return True
if result.split()[0] == "CORRECT" if result else False:
    return True
return False
```
Note that the first check is fully subsumed by the second (result.startswith("CORRECT")), and the third differs from the second only on strings like "CORRECTNESS", which seems unlikely to be an intentional distinction.
For example, ikp_estimate.py returns an empty string “” if an invalid HTTP response is received, which the judge will then classify as a REFUSAL.
(This was actually an issue when reproducing the work: Lawrence ran out of OpenRouter credits and ended up getting all refusals from gpt-4o-mini for T4 questions onwards, which had to be debugged manually.)
For example, the table of proprietary parameter estimates references “distilled rows”, which don’t exist on the table.
For example, the table of proprietary parameter estimates includes unambiguously open-weight models such as mistral-medium-3.1 and deepseek-v3.1.
Note that despite the implication, pre-existing memorized bits per parameter count estimates also vary by at least a factor of 2x (Allen-Zhu and Li estimated 1.4 bits per param for MoEs and 2 bits for dense GPT-style networks, while the later Morris et al. used a different methodology to reach 3.6 bits/parameter, and the hard information theoretic bound for 8-bit models is ~8 bits/param).
An alternative approach is to estimate the most obscure fact that an LLM knows, but this has its own difficulties (e.g. quantifying obscurity).
It’s named densing law because it measures how models get more parameter efficient performance wise, that is, more dense over time. As an aside, Lawrence thinks that this is a terrible name, like if scaling laws were named lossing laws since they measure how loss goes down with parameter count and dataset size.
We note that it’s possible this is a poor wording choice resulting from overreliance on LLMs. However, we feel that, even if this were to be the case, this would not absolve him of responsibility for including this wording in his single-authored paper.
The Hallucination Similarity Score is computed on T5–T7 probes, which are ~50% researcher questions evaluated under the 4-way STRONG/WEAK rubric. This rubric awards Anthropic a ~16 percentage point excess STRONG rate over the cross-vendor median, perhaps driven by Claude's stylistic preference for verbose evidence-citation. Because HSS depends on which probes count as "correct" for each model, that stylistic bias propagates into the Jaccard intersections and the wrong-answer-overlap rates. We'd expect within-vendor fingerprint comparisons (e.g. Claude Sonnet 4.5 → 4.6 → 4.7 lineage; weight-sharing siblings) to be relatively unaffected, because both members of the pair share the same response style. Cross-vendor comparisons (especially the paper's claim to detect distillation across closed-vendor families) are structurally vulnerable to the same stylistic confound that biases the parameter estimates.
A bigger issue that we did not have time to investigate is the low diversity of high difficulty questions. We're not sure how large this effect is, but the low data diversity is concerning nonetheless.
We'd note that the gap is partly an x-range artifact (active parameters cluster in a narrow ~10–40B band across the 37 MoE calibration models, which compresses the regression's denominator). In more honest predictive units – the LOO median fold error also reported in the paper – MoE-active is only ~13% worse than MoE-total (1.69× vs 1.49×), nowhere near as dramatic as the headline R² comparison (0.79 vs 0.51) suggests.
Xiao et al.'s densing law result is suspicious for many reasons: for one thing, they calibrate the capability density using a family of tiny in-house models, don’t control for a few obvious confounds, and do several statistical tricks to inflate the significance of their result. Again, it’s beyond the scope of this post, but we’d be happy to write another one if there’s more interest.
Also, information theoretic bounds imply that the densing law cannot continue on pure factual recall tasks.
Astute readers may have been asking: why is the densing law given in terms of parameter count rather than compute? Can the densing law not just result from ever more compute being spent on smaller models? We think these questions are worth asking, and pose another natural objection to the densing law result.
Another plausible methodological issue is the scaling of the hallucination penalty. We experimented briefly with different scales, but found similar fits to the one reported in the paper, so did not investigate further.
For legibility reasons, we report the inverse slope from the regression predicting parameter count from accuracy; the forward fit described in the methodology section gives an equivalent picture.
That being said, 3 non-frontier models (Gemini 2.5 Flash-lite, Gemini 2.5 Flash, and Hunyuan A13B) go from positive or near-zero thinking bonus (1.4%, 2.1%, -0.1%) to massively negative thinking bonus (-14%, -8%, -11%) when the flooring is removed, as they refuse much less and are much more likely to produce confident hallucinated answers. With the exception of Hunyuan, the models with negative thinking bonuses in the original paper flip to having positive thinking bonuses once the floor is removed.
We plan on releasing our code in the coming days; the publication of this post was unfortunately rushed due to the writing program both of us are in: https://www.inkhaven.blog/spring-26
The time to actually code these analyses was reduced by a factor much larger than 2x, but the 8-10 hours reflects a substantial amount of time spent sanity checking Claude Code outputs (which wouldn't have occurred w/o heavy Claude Code usage) as well as manual examination of the data, where AI assistance was not used.
There is a question of how we avoided making mistakes similar to those made in the paper, given that we too used AI agents to substantially accelerate our work. We took 4 categories of precautions: 1) most importantly, we treated the various AI results as preliminary, and confirmed any new result against both results from other AIs asked the same question (but without access to the new result), and against the broader knowledge base we (the authors) developed over time. 2) Also importantly, we performed many manual spot-checks on samples from the original dataset, agent/judge transcripts, and the IKP codebase (both original and modified). 3) Benjamin had agents build an LLM wiki in order to ensure consistency of claims across our assessment and avoid confabulations. 4) We also made extensive use of agent teams with verifier agents to double-check agent code and results, burning over 2.5 million Opus 4.7 tokens between the two of us in the process.
Gemini-3.1-pro was used as a landmark model for calibration, thus inflating its score substantially. We include it to show the effect of correcting the two main methodological issues.
Unlike most other models, grok-4.20 performs much worse without thinking than with. So its size estimate drops by a factor of almost 4 without thinking even with our fixes (and a factor of 5 in Li's original methodology).
The high parameter counts of Gemini-3s likely also results from Gemini 3.1's use as a landmark model.