Quick thought: I wonder if you could still be overestimating parameter counts for models that would be made by distilling a larger teacher down to a smaller student. Any OSS ones you could test this hypothesis on?
Properly done, the methodology should find that sufficiently overtrained low-parameter models ~= distilled low-parameter models, since there isn’t more capacity to memorize. But yeah, that would be another good sanity check to run.
Wait, why are distilled models better than just overtraining the small model again? My guess is it’s mainly because SFT >> RL for efficiency, and cloning good CoTs is easier than sampling them via random exploration.
re: distilled > overtrained, you can distill via on policy distillation (OPD) with the strong model as the teacher, get dense supervision (you wouldn’t get this with RLVR) and get generalization gains (because of the on policy nature of OPD vis a vis SFT, which is a lot more fat handed).
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
The on policy distillation/dAgger > SFT/behavioral cloning seems like a smaller improvement in comparison to that, but you’re right that it is an improvement.
Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters?[1]
Introduction
Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of many frontier closed-source models, including the newer GPT-5.5 (9.7T parameters) and Claude Opus 4.6 (5.3T parameters) as well as older models such as o1 (3.5T) and gpt-4o (720B). The paper, titled “Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity”, introduces a dataset of factual knowledge questions of varying difficulty, regresses performance on this dataset against parameter count, and then uses this regression to extrapolate from the performance of closed-source frontier models to their parameter counts. A notable fact about this paper is that, unlike most empirical machine learning papers, it’s single-authored: Bojie Li, the chief scientist of Pine AI, is the sole author of this piece.
These results were suspicious for many reasons, the primary one being that the paper seems like low-effort, hastily written AI slop. For example, the codebase (https://github.com/19PINE-AI/ikp) was constructed in large part with Claude Code and has many of the hallmarks of code that is almost entirely vibe-coded with little sanity checking (e.g. redundant and inconsistent variable definitions[2], boilerplate bloat, excessive error handling[3], and silent failures[4]). The same can be said of the author’s website for this paper (archived here), which defines terms that appear nowhere else on the page[5], has table headings inconsistent with their contents[6], and has a very high heading-to-text ratio.
We (Benjamin and Lawrence) decided to dig into these results further. Specifically, we read the paper, reproduced the author’s results using their code base, and then dug into some obvious methodological issues to see how much the issues affected the author’s results.
We find:
We identify two major issues with the IKP parameter estimates. First, the author's code and released numbers floor each model's per-tier scores at zero, which inflates the measured performance of small models. Second, around 10% of the IKP questions (mainly hard ones) were ambiguous or had incorrect answers. Fixing these two issues shrinks the estimated parameter counts of frontier models substantially.
Despite these issues, we think that the core idea – reverse engineering LLM parameter count by quantifying memorization capacity – is solid, and welcome future work implementing this in a more rigorous and systematic way.
Summary of Li’s “Incompressible Knowledge Probes”
As usual, let’s start by summarizing the paper at hand.
One way of estimating the size of closed models is by extrapolating from API throughput and pricing under a hardware-cost model (e.g. Epoch AI’s inference economics).
Li argues that these size estimates are unreliable, off by a factor of over 2x, due to confounders such as quantization, batching, and vendor margin. He instead proposes reverse engineering parameter count using the fact that neural networks can only store a number of facts that is linear in their parameter count.[7] Unfortunately, this isn’t as simple as counting all the facts:[8] for one, doing that exhaustively is intractable.
Li builds a set of questions ("Incompressible Knowledge Probes," IKP) testing factual knowledge with varying degrees of obscurity. Probes come from four sources: GPT-5-generated questions, Wikidata SPARQL pulls, DBLP/OpenAlex researcher records, and a small set of hand-curated questions. Li calls these "probes," but to avoid confusion we'll just call them questions.
Li claims six contributions:
The IKP dataset
The IKP dataset consists of 1400 questions, divided evenly into 7 tiers of obscurity/difficulty (200 questions per tier). There are four sources of questions: 401 LLM-generated (GPT-5) questions, 557 Wikidata SPARQL-derived questions, 345 researcher questions built from DBLP/OpenAlex records, and 97 manually curated questions.
The difficulty of each tier is empirically calibrated against 6 "landmark" models. These models consist of 5 open-weight models ranging from Qwen 2.5 0.5B (T1) up to Kimi K2.5 1T (T5), as well as Gemini 3.1 Pro (T6). A question is assigned to tier k if the k-th landmark answers it correctly but the (k−1)-th landmark does not. T7 is reserved for questions no landmark model gets right.
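As a concrete illustration, here is a minimal sketch of the tier-assignment rule as we understand it from the paper; the function and variable names are ours, and it assumes landmark correctness is monotone in landmark size.

```python
def assign_tier(landmark_correct: list[bool]) -> int:
    """Assign a question to a difficulty tier, given the correctness of the
    6 landmark models ordered from smallest (index 0) to largest (index 5).

    Tier k means the k-th landmark answers correctly but the (k-1)-th does not;
    questions that no landmark answers correctly fall into T7.
    """
    for k, correct in enumerate(landmark_correct, start=1):
        if correct:
            return k  # smallest landmark that gets it right
    return 7  # no landmark model answered correctly

# Example: only the two largest landmarks (Kimi K2.5 and Gemini 3.1 Pro) are correct -> T5
assert assign_tier([False, False, False, False, True, True]) == 5
```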
As we’ll later note, both the Wikidata and researcher question datasets (which comprise over 900 of the 1400 questions, including all questions from T5-T7) have fairly significant quality issues.[12] For example, both the Wikidata and researcher question sets contain many ambiguous questions resulting from name collisions (e.g. multiple researchers or locations that share the same name). Other Wikidata questions are underspecified or have ambiguous answers: e.g. one question asks about Oxford’s founding, but while Oxford received its royal charter in 1248, there is evidence of teaching there as early as 1096, and the university arguably existed even earlier. Some of the questions also reference outdated information, or only consider one author correct for multi-authored works. This complicates the interpretation of the results.
IKP scoring and regression methodology
Li scores each of a model’s responses on either a 3- or 4-point scale:
The hallucination penalty is added in order to discourage guessing (though it also penalizes models that know the answers to questions whose gold answers are incorrect). Each tier’s score is the mean over its 200 questions, and a model's overall "penalized accuracy" is the unweighted mean of the seven tier scores. In the released data, the per-tier scores are floored at 0 when calculating penalized accuracy, even though the paper text explicitly claims they are not floored "to preserve the bluff signal in the calibration." This is one of the methodological inconsistencies we'll come back to, as the choice meaningfully changes the slope of the fit.
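To make this concrete, here is a rough sketch of the per-tier scoring, including the flooring step we discuss below. The partial-credit weight and hallucination penalty are placeholders (the exact values come from Li's rubric), so treat this as illustrative rather than a faithful reimplementation.

```python
import numpy as np

# Placeholder weights -- the actual values are set by Li's scoring rubric.
WEAK_CREDIT = 0.5             # partial credit for CORRECT_WEAK
HALLUCINATION_PENALTY = 0.25  # subtracted for each WRONG (confident but false) answer

def tier_score(labels: list[str], floor: bool = True) -> float:
    """Penalized accuracy for one 200-question tier.

    `labels` holds one of CORRECT_STRONG / CORRECT_WEAK / WRONG / REFUSAL per
    question; refusals earn 0 and incur no penalty.
    """
    strong = labels.count("CORRECT_STRONG")
    weak = labels.count("CORRECT_WEAK")
    wrong = labels.count("WRONG")
    score = (strong + WEAK_CREDIT * weak - HALLUCINATION_PENALTY * wrong) / len(labels)
    return max(score, 0.0) if floor else score  # the flooring the paper claims not to apply

def penalized_accuracy(tiers: list[list[str]], floor: bool = True) -> float:
    """Overall score: unweighted mean of the seven per-tier scores."""
    return float(np.mean([tier_score(t, floor=floor) for t in tiers]))
```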
The judge is Gemini 3 Flash Preview at temperature 0, and all target models are run once at temperature 0. Note that this is fairly non-standard for model evaluations (and many reasoning model providers explicitly discourage running their models with t=0).
The headline regression is a one-line OLS: A = α · log10(N) + β, where A is the penalized accuracy and N is the parameter count in billions.
This OLS is fit on 89 open-weight models with known parameter counts, ranging from SmolLM2-135M up to DeepSeek V4 Pro at 1.6T. Li reports α = 0.147, β = +0.132, R² = 0.917, with leave-one-out median fold error of 1.59× and a 90% prediction interval factor of 3.0×. Inverting the regression gives a parameter-count estimate for any target model: N̂ = 10^((A − β) / α).
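The calibration and inversion are simple enough to sketch in a few lines of numpy. The coefficients below are the paper's reported values; the helper names are ours.

```python
import numpy as np

def fit_calibration(params_billions: np.ndarray, accuracy: np.ndarray) -> tuple[float, float]:
    """OLS fit of penalized accuracy A against log10(parameter count N, in billions)."""
    alpha, beta = np.polyfit(np.log10(params_billions), accuracy, deg=1)
    return alpha, beta

def estimate_params(accuracy: float, alpha: float, beta: float) -> float:
    """Invert the regression: N_hat = 10 ** ((A - beta) / alpha), in billions."""
    return 10 ** ((accuracy - beta) / alpha)

# With the paper's reported fit (alpha = 0.147, beta = 0.132), a model scoring
# ~0.72 penalized accuracy maps to roughly 10,000B (~10T) parameters.
print(estimate_params(0.72, alpha=0.147, beta=0.132))
```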
For Mixture-of-Experts models, total parameters predict factual knowledge meaningfully better than active parameters (R² = 0.79 vs 0.51).[13]
IKP performance doesn't improve over time
The densing law paper (Xiao et al. 2024) introduces "capability density", defined as the ratio of a model's effective parameter size to its actual parameter size. Here, "effective size" is the parameter count a reference model would need to match the target's downstream score. Across 29 open-source base models, they fit an exponential trend in capability density against release date, which they translate to "the maximum capability density of LLMs doubles approximately every 3.3 months."[14]
To test whether or not a similar law applies to the IKP questions, Li adds release date as a covariate to the IKP regression: A = α · log10(N) + β + γ · t, where t is the model's release date in months.
If Xiao et al.'s densing law applied to the IKP questions, then γ should be about +0.0117/month (the value that produces the claimed 3.3-month density doubling). Across 96 dated open-weight models, Li fits γ = −0.0010/month (95% CI [−0.0031, +0.0008]), statistically indistinguishable from zero, and rejects the naive application of the densing law with high statistical confidence.
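A minimal sketch of this covariate check, assuming a per-model table of accuracy, parameter count, and release date (the file name and column names are ours, not the repository's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per dated open-weight calibration model (column names are ours):
#   accuracy        -- IKP penalized accuracy
#   params_billions -- known parameter count, in billions
#   months          -- release date, in months since an arbitrary reference
df = pd.read_csv("ikp_calibration_models.csv")  # hypothetical file name
df["log_params"] = np.log10(df["params_billions"])

fit = smf.ols("accuracy ~ log_params + months", data=df).fit()
print(fit.params["months"], fit.conf_int().loc["months"].tolist())
# Densing-law prediction: gamma ~ +0.0117/month; Li's fit (and our refits) land near zero.
```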
This result stands up to all of the stress testing we performed. We refit the regression with vendor fixed effects (22 vendor dummies), family fixed effects (33 family dummies), without thinking-mode variants, dropping the open-weight tier-landmark models (anti-circularity check), and under both flooring regimes for the per-tier scores. In every specification γ stays within ±0.004/month of zero, and the +0.0117/month densing prediction is rejected with effective certainty. So whatever else the paper does, this result holds.[15]
That being said, we believe the right way to read this result is: holding parameter count fixed, factual recall on rare entities has not improved across open-weight model generations from 2023 through April 2026. Controlling for parameter count, procedural benchmarks like MMLU and HumanEval have improved over the same window, often dramatically.[16] Li falsifies Xiao et al.'s densing law on the IKP dataset, but not in general, given that the densing law was not intended to cover factual recall capacity.
Methodological Issues with the IKP paper
The paper and codebase have a number of methodological issues across dataset construction, judging methodology, and reporting of results. The two main issues that impact the results are the use of per-tier flooring for scores (contrary to the paper’s claims) and questions with ambiguous or incorrect answers. When we adjust for these issues in our replication, the headline numbers change significantly.
Per-tier floors to the scoring
When scoring the models, each question's score is computed as described above, and each tier's score is the mean over its questions. Section 4.3 of the paper says "Per-tier scores are not floored at zero in the released results … to preserve the bluff signal in the calibration." Flooring means that a tier score that would go negative due to hallucination penalties is instead held at 0. Despite this claim, the scores are in fact floored, both in the values reported in the paper and in the repository.
Applying per-tier flooring makes the scores of smaller models much higher: relative to larger models, they hallucinate more and refuse less when they don't know the answer, and the resulting negative tier scores get rounded up to zero. Removing the flooring lowers small models' accuracy at a given parameter count, and thus decreases the slope of log parameters against accuracy. Specifically:
Addressing this inconsistency also causes R² to drop from 0.917 → 0.815 and the 90% prediction interval span (95 percentile param count/5 percentile param count) to widen from 3.0 to 5.7. Notably, the slope of log parameter count relative to performance drops substantially, from 6.79 to 3.56. This means the large parameter counts for frontier models in Li’s original paper are largely an artifact of the flooring (or more cynically, an undocumented “code-level optimization”).
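A minimal sketch of the comparison we ran, building on the penalized_accuracy helper sketched earlier (the structure and names are ours):

```python
import numpy as np

def inverse_slope(params_billions, per_model_tier_labels, floor: bool) -> float:
    """Slope of the regression predicting log10(parameter count) from penalized
    accuracy across the calibration models, under a given flooring regime."""
    acc = np.array([penalized_accuracy(tiers, floor=floor) for tiers in per_model_tier_labels])
    slope, _ = np.polyfit(acc, np.log10(np.asarray(params_billions, dtype=float)), deg=1)
    return slope

# slope_floored   = inverse_slope(params, labels, floor=True)   # ~6.8 in our refit
# slope_unfloored = inverse_slope(params, labels, floor=False)  # ~3.6 in our refit
```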
Addressing the two major issues we identified leads to a substantially different slope compared to the original fit in the paper, and leads to a slightly lower R². See the appendix for further details.
Ambiguous/incorrect answers to hard questions
For the researcher questions, Li filters out two-character Chinese names and single-initial given names (Section 4.1). Unfortunately, manual inspection of some randomly sampled questions revealed two issues this filter doesn't catch:
For wikidata questions, Li applied a “10-round audit/repair cycle” (section 7.7). Unfortunately, it seems that this repair cycle failed to catch at least two types of issue:
For manually generated questions, we inspected those where the models consistently did much worse than the tier would suggest, and found two incorrect questions: one on the highest peak in Bangladesh (whose answer changed from the IKP gold answer of Keokradong to Saka Haphong with more survey data) and another on the founding year of the Mongolian People’s Revolutionary Party (different sources cite 1920 or 1921). We excluded these two from our analysis.
Interestingly, these possible issues are noted by Li in Appendix H. However, he does not attempt to quantify how many questions are ambiguous or incorrect, nor how large the impact is if you were to remove the ambiguous questions. We do that here.
We attempted to remove the ambiguous or incorrect questions we noticed, though we were unable to remove all Wikidata questions with semantic ambiguities other than name collisions, as we did not have time to check the answer to every single question manually.
| Source | Number of questions | Flagged ambiguous | Heuristic |
| --- | --- | --- | --- |
| LLM-generated | 401 | 0 | Visual spot check was performed across all four tiers with an LLM judge. The questions seem well formed. |
| Researcher | 345 | 86 (24.9%) | OpenAlex shows ≥2 distinct researchers with ≥50 citations sharing the name. |
| Wikidata | 557 | 45 (8.08%) | ≥3 Wikidata entities share the same label (sketched below). We found other categories of ambiguities that we could not programmatically remove. |
| Manually generated | 97 | 2 (2.05%) | We manually inspected the manually generated questions where the models performed much worse than expected, and confirmed that 2 of the questions were incorrect. |
| Total questions audited | 1,400 | 131 (9.4%) | |
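As an illustration of the Wikidata heuristic, here is a rough sketch of the label-collision check; the query and threshold are simplified versions of what we ran, so treat the details as approximate.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def count_entities_with_label(label: str, lang: str = "en") -> int:
    """Count Wikidata entities whose label exactly matches `label`.
    We flagged a question as ambiguous if >= 3 entities shared its subject/answer label."""
    query = f"""
    SELECT (COUNT(DISTINCT ?item) AS ?n) WHERE {{
      ?item rdfs:label "{label}"@{lang} .
    }}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "ikp-audit-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    return int(resp.json()["results"]["bindings"][0]["n"]["value"])

# e.g. count_entities_with_label("Oxford") returns well over 3 (the city, the university, many towns).
```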
Corrected model parameter estimates
We attempt to fix the two methodological issues we identified above, by removing questions that have ambiguous answers from the various datasets, and also by removing the flooring from the accuracy estimates. We then recalculate scores across all the models measured in the paper. We present the newly calculated scores for 8 large frontier or near-frontier models; a larger list of predicted model parameter counts is included in the appendix.
Parameter estimates comparing our methodology (blue) with the original paper's (red) with 90% prediction intervals (log scale). The green stars are true parameter sizes where known.
Overall, the estimates drop massively for the most capable frontier models (a nearly 10x difference for Gemini 3.1 Pro), while for some of the smaller models we see a modest increase.
Possible methodological issues that mattered less than we thought
Thinking vs non-thinking
In the original work, models would often score much better with thinking enabled than without. This led to parameter estimates that differed by as much as 4.9x: Grok-4.20 was estimated to have 110B parameters without thinking, and 540B parameters with. Claude Opus 4.6 was estimated to have 2.4T parameters without thinking, but 5.3T with thinking enabled, a 2.2x difference.
This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, under Li's original methodology.
The headline results in the paper obscure this difference, as they generally report the maximum parameter count of the thinking and non-thinking versions of the same model.
This figure shows the difference in estimated parameter counts when the model was and was not given additional tokens in which to think, using our updated methodology. With our modifications, the estimates are significantly closer together than under the floored scoring presented in the original paper.
Interestingly, after removing the arbitrary flooring, we generally observe much smaller score differences between frontier models with and without thinking enabled. Grok-4.20’s thinking multiplier dropped to 3.9x, while Claude Opus 4.6’s dropped to 1.2x. We unfortunately did not have the time to investigate why the thinking gap decreased after flooring was removed.[19] That being said, we believe that this is some evidence that, on the IKP benchmark, enabling or disabling thinking does not have that much of an effect on the estimated parameter counts of frontier models.
Different accuracy metrics used in some repository JSON files
We observed that the penalized accuracy metric used to score the models in some of the JSON files in the repo was different from what was shown in the paper. We subsequently investigated whether these different accuracy metrics affected the results, but found that the aberrant JSON scores were not used to produce any of the figures or tables in the paper. That is, the different accuracy metrics did not affect any of Li’s results as presented in the paper.
Conclusion
In this work, we examined the robustness of both the methodology and results of Li’s “Incompressible Knowledge Probes” paper. We identified two main methodological issues with the work: the per-tier flooring that exists in the code despite the paper claiming otherwise, and the large fraction of ambiguous questions, especially in higher tiers of difficulty. We also examined two potential issues that turned out not to impact the results significantly: the performance gap between thinking and non-thinking models was much smaller than we initially thought, and the different accuracy metrics included in some JSON files were not used for the main analysis.
That being said, three of Li's claims survive every stress test we applied:
However, what does not survive are the specific multi-trillion parameter estimates for closed frontier models. After attempting to correct for methodological issues to the best of our ability, we found that the parameter count of the top proprietary frontier models drops from ~10T to ~1.5T.
We emphasize, however, that our point estimate of 1.5T for GPT-5.5 should not be read as a confident estimate of its true parameter count. Instead, we see it as evidence that parameter estimates from this methodology are unreliable and sensitive to methodological choices. Both of us are quite uncertain about the exact parameter count of GPT-5.5.
We think that the IKP dataset (and methodology) is a real contribution. Li also deserves credit for releasing the dataset and code; it is precisely because he open-sourced his code that we could write this post so quickly.[20] But the standard for an empirical paper that produces concrete numbers ("GPT-5.5 has 9.7T parameters") needs to be higher than "I ran one regression and reported the result." Methodological choices should be discussed and justified; the effects of possible limitations or dataset issues should be analyzed and not just acknowledged in passing; and results that seem surprisingly good (or just surprising) should be scrutinized before they go viral on Twitter.
Discussion
On a broader point, we think this work illustrates both the risks and potential of AI-generated research code.
Li's paper illustrates many of the risks. The codebase looks like code that was generated quickly and never carefully checked, including the six near-identical judge prompts in different scripts, defensive error handling that silently turns network failures into refusals, redundant variable definitions, and at least two cases where the paper text and the released code disagree about what the methodology is. The companion website has terms defined but used nowhere and incorrectly labeled tables. None of these is individually fatal, but together they describe a pipeline where no one (including the author) read the work with a critical eye before it went public. A single-authored empirical paper with no internal or external review is a known failure mode. A single-authored empirical paper generated largely by an LLM without much review is the same failure mode at higher throughput.
But the same tools that lower the cost of producing this kind of work also lower the cost of checking it. Thanks to Claude Code (and to a smaller extent, Codex) automating much of the code generation process on our end, the two of us were able to replicate Li’s main results, perform many sensitivity analyses, and write this up in around 8-10 hours of effort per person.[21] We estimate that the same amount of work would’ve taken us around 15-20 hours each using previous generation coding assistants (e.g. Cursor’s autocomplete).[22]
In terms of the IKP work, despite the issues with the headline results, we believe the core idea of reverse engineering LLM parameter count using memorization capacity to be solid and welcome future work that attempts to implement it in a more rigorous and systematic way. As a broader point about research scrutiny, we hope that this example serves as an important reminder of the changing economics of producing and scrutinizing new research results: as costs of both drop and the production of new results ramps up, so too should the scrutiny we apply to each result.
Appendix
Table of corrected parameter counts.
We share a table of original and corrected parameter counts alongside 90% prediction intervals to compare the numbers more precisely than in our figure. We start with the 8 models used in the figures in the main body (chosen to give a reasonable spread amongst model providers), the 6 landmark models used to divide questions into difficulty tiers, and then the 10 models with highest original estimated parameter counts.
| Model | Vendor | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- |
| gemini-3.1-pro[23] | Google | 40,794B [13,598 – 122,382] | 4,653B [816 – 26,522] | ↓8.77× |
| gpt-5.5 | OpenAI | 9,659B [3,220 – 28,977] | 1,458B [256 – 8,311] | ↓6.62× |
| gpt-5 | OpenAI | 4,088B [1,363 – 12,264] | 1,330B [233 – 7,581] | ↓3.07× |
| claude-opus-4.7 | Anthropic | 4,042B [1,347 – 12,126] | 1,132B [199 – 6,452] | ↓3.57× |
| claude-sonnet-4.6 | Anthropic | 1,730B [577 – 5,190] | 661B [116 – 3,768] | ↓2.62× |
| grok-4.20 (thinking)[24] | xAI | 542B [181 – 1,626] | 768B [135 – 4,378] | ↑1.42× |
| deepseek-r1 (671B) | DeepSeek | 424B [141 – 1,272] | 760B [133 – 4,332] | ↑1.79× |
| deepseek-v3 (671B) | DeepSeek | 589B [196 – 1,767] | 564B [99 – 3,215] | ↓1.04× |

All estimates are in billions of parameters.
Below we share the parameter counts for the original 6 "landmark" models used in the paper to separate the questions into different difficulty tiers.
| Tier | Model | Vendor | True params | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- | --- | --- |
| T1 | Qwen 2.5 0.5B | Alibaba | 0.5B | 1B [0 – 3] | 0.2B [0 – 1] | ↓4.58× |
| T2 | Qwen 2.5 7B | Alibaba | 7.6B | 9B [3 – 27] | 8B [1 – 48] | ↓1.07× |
| T3 | Qwen 3 32B* | Alibaba | 32.0B | 34B [11 – 103] | 23B [4 – 134] | ↓1.46× |
| T4 | Qwen 3 235B* | Alibaba | 235.0B | 113B [38 – 338] | 145B [25 – 827] | ↑1.29× |
| T5 | Kimi K2.5* | Moonshot | 1,040B | 3,121B [1,040 – 9,362] | 680B [119 – 3,878] | ↓4.59× |
| T6 | Gemini 3.1 Pro | Google | — (closed) | 40,794B [13,598 – 122,382] | 4,653B [816 – 26,522] | ↓8.77× |
Below we share the parameter estimates for the 10 models with the highest parameter count estimates in the original paper.
| Model | Vendor | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
| --- | --- | --- | --- | --- |
| gemini-3.1-pro[23] | Google | 40,773B [13,591 – 122,319] | 4,653B [816 – 26,522] | ↓8.76× |
| gemini-3-flash-think[25] | Google | 17,065B [5,688 – 51,195] | 2,526B [443 – 14,398] | ↓6.75× |
| gemini-3-flash | Google | 14,433B [4,811 – 43,299] | 1,939B [340 – 11,052] | ↓7.44× |
| gpt-5.5-pro | OpenAI | 10,267B [3,422 – 30,801] | 1,471B [258 – 8,385] | ↓6.98× |
| gpt-5.5-think | OpenAI | 9,656B [3,219 – 28,968] | 1,458B [256 – 8,311] | ↓6.62× |
| gpt-5.5 | OpenAI | 8,831B [2,944 – 26,493] | 1,459B [256 – 8,316] | ↓6.05× |
| claude-opus-4.6-think | Anthropic | 5,254B [1,751 – 15,762] | 1,399B [245 – 7,974] | ↓3.76× |
| gpt-5-pro | OpenAI | 4,110B [1,370 – 12,330] | 1,197B [210 – 6,822] | ↓3.43× |
| gpt-5-think | OpenAI | 4,087B [1,362 – 12,261] | 1,330B [233 – 7,581] | ↓3.07× |
| claude-opus-4.7-think | Anthropic | 4,041B [1,347 – 12,123] | 1,132B [199 – 6,452] | ↓3.57× |
Details on our corrected OLS fit
For the corrected OLS fit in the main body, each point in the calibration scatter plot is one of n=89 open-weight models. Each model's accuracy is the unweighted mean of its per-tier penalized accuracies where tiers refer to difficulty rankings between 1 (easiest) and 7 (hardest) that the 1,400 probes (or 1,269 after we remove the 131 ambiguous/incorrect ones) are separated into.
The accuracy calculation within a tier is (n_strong + w · n_weak − λ · n_wrong) / 200, where n_strong/n_weak/n_wrong are the counts of probes scored CORRECT_STRONG, CORRECT_WEAK, and WRONG within the tier, w is the partial credit for weak answers, and λ is the hallucination penalty.
In our method we remove the per-tier flooring at 0 in the accuracy calculation and exclude the 131 questions we flagged as ambiguous.
Otherwise, we follow the paper's approach and fit A = α · log10(N) + β by OLS, where A is accuracy and N is parameter count in billions, and then rearrange to N̂ = 10^((A − β) / α) for prediction.
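For the 90% prediction intervals we report, one simple approach (which we sketch here; we do not claim it exactly matches Li's procedure) is to fit the inverse regression predicting log10(parameter count) from accuracy and take its observation-level prediction interval:

```python
import numpy as np
import statsmodels.api as sm

def fit_inverse_regression(accuracy: np.ndarray, params_billions: np.ndarray):
    """Fit log10(N) ~ accuracy on the open-weight calibration models."""
    X = sm.add_constant(accuracy)
    return sm.OLS(np.log10(params_billions), X).fit()

def predict_with_pi(model, accuracy: float, alpha: float = 0.10):
    """Point estimate and 90% prediction interval for parameter count, in billions."""
    frame = model.get_prediction(np.array([[1.0, accuracy]])).summary_frame(alpha=alpha)
    return (
        10 ** frame["mean"].iloc[0],
        10 ** frame["obs_ci_lower"].iloc[0],
        10 ** frame["obs_ci_upper"].iloc[0],
    )
```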
As Betteridge's law of headlines suggests, the answer is no.
For example, the judge prompt appears in at least 6 different scripts with slightly different wording.
For example, lines 78-86 of src/scorer.py:
```python
result = judge_fn(prompt).strip().upper()
# Must check for exact "CORRECT" — not substring of "INCORRECT"
if result == "CORRECT":
    return True
if result.startswith("CORRECT"):
    return True
if result.split()[0] == "CORRECT" if result else False:
    return True
return False
```
Note that the first check is fully subsumed by the second (result.startswith("CORRECT")), and the third differs from the second only on strings like "CORRECTNESS", which seems unlikely to be an intentional distinction.
For example, ikp_estimate.py returns an empty string “” if an invalid HTTP response is received, which the judge will then classify as a REFUSAL.
(This was actually an issue when reproducing the work: Lawrence ran out of OpenRouter credits and ended up getting all refusals from gpt-4o-mini for T4 questions onwards, which had to be debugged manually.)
For example, the table of proprietary parameter estimates references “distilled rows”, which don’t exist on the table.
For example, the table of proprietary parameter estimates includes unambiguously open-weight models such as mistral-medium-3.1 and deepseek-v3.1.
Note that despite the implication, pre-existing memorized bits per parameter count estimates also vary by at least a factor of 2x (Allen-Zhu and Li estimated 1.4 bits per param for MoEs and 2 bits for dense GPT-style networks, while the later Morris et al. used a different methodology to reach 3.6 bits/parameter, and the hard information theoretic bound for 8-bit models is ~8 bits/param).
An alternative approach is to estimate the most obscure fact that an LLM knows, but this has its own difficulties (e.g. quantifying obscurity).
It’s named densing law because it measures how models get more parameter efficient performance wise, that is, more dense over time. As an aside, Lawrence thinks that this is a terrible name, like if scaling laws were named lossing laws since they measure how loss goes down with parameter count and dataset size.
We note that it’s possible this is a poor wording choice resulting from overreliance on LLMs. However, we feel that, even if this were to be the case, this would not absolve him of responsibility for including this wording in his single-authored paper.
The Hallucination Similarity Score is computed on T5–T7 probes, which are ~50% researcher questions evaluated under the 4-way STRONG/WEAK rubric. This rubric awards Anthropic a ~16 percentage point excess STRONG rate over the cross-vendor median, perhaps driven by Claude's stylistic preference for verbose evidence-citation. Because HSS depends on which probes count as "correct" for each model, that stylistic bias propagates into the Jaccard intersections and the wrong-answer-overlap rates. We'd expect within-vendor fingerprint comparisons (e.g. Claude Sonnet 4.5 → 4.6 → 4.7 lineage; weight-sharing siblings) to be relatively unaffected, because both members of the pair share the same response style. Cross-vendor comparisons (especially the paper's claim to detect distillation across closed-vendor families) are structurally vulnerable to the same stylistic confound that biases the parameter estimates.
A bigger issue that we did not have time to investigate is the low diversity of high difficulty questions. We're not sure how large this effect is, but the low data diversity is concerning nonetheless.
We'd note that the gap is partly an x-range artifact (active parameters cluster in a narrow ~10–40B band across the 37 MoE calibration models, which compresses the regression's denominator). In more honest predictive units – the LOO median fold error also reported in the paper – MoE-active is only ~13% worse than MoE-total (1.69× vs 1.49×), nowhere near as dramatic as the headline R² comparison (0.79 vs 0.51) suggests.
Xiao et al.'s densing law result is suspicious for many reasons: for one thing, they calibrate the capability density using a family of tiny in-house models, don’t control for a few obvious confounds, and do several statistical tricks to inflate the significance of their result. Again, it’s beyond the scope of this post, but we’d be happy to write another one if there’s more interest.
Also, information theoretic bounds imply that the densing law cannot continue on pure factual recall tasks.
Astute readers may have been asking: why is the densing law given in terms of parameter count rather than compute? Can the densing law not just result from ever more compute being spent on smaller models? We think these questions are worth asking, and pose another natural objection to the densing law result.
Another plausible methodological issue is the scaling of the hallucination penalty. We experimented briefly with different scales, but found similar fits to the one reported in the paper, so did not investigate further.
For legibility reasons, we report the inverse slope from the regression predicting parameter count from accuracy; the forward fit described in the methodology section gives an equivalent picture.
That being said, 3 non-frontier models (Gemini 2.5 Flash-lite, Gemini 2.5 Flash, and Hunyuan A13B) go from positive or near-zero thinking bonus (1.4%, 2.1%, -0.1%) to massively negative thinking bonus (-14%, -8%, -11%) when the flooring is removed, as they refuse much less and are much more likely to produce confident hallucinated answers. With the exception of Hunyuan, the models with negative thinking bonuses in the original paper flip to having positive thinking bonuses once the floor is removed.
We plan on releasing our code in the coming days; the publication of this post was unfortunately rushed due to the writing program both of us are in: https://www.inkhaven.blog/spring-26
The time to actually code these analyses was reduced by a factor much larger than 2x, but the 8-10 hours reflects a substantial amount of time spent sanity checking Claude Code outputs (which wouldn't have occurred w/o heavy Claude Code usage) as well as manual examination of the data, where AI assistance was not used.
There is a question of how we avoided making mistakes similar to those made in the paper, given that we too used AI agents to substantially accelerate our work. We took 4 categories of precautions: 1) most importantly, we treated the various AI results as preliminary, and confirmed any new result against both results from other AIs asked the same question (but without access to the new result), and against the broader knowledge base we (the authors) developed over time. 2) Also importantly, we performed many manual spot-checks on samples from the original dataset, agent/judge transcripts, and the IKP codebase (both original and modified). 3) Benjamin had agents build an LLM wiki in order to ensure consistency of claims across our assessment and avoid confabulations. 4) We also made extensive use of agent teams with verifier agents to double-check agent code and results, burning over 2.5 million Opus 4.7 tokens between the two of us in the process.
Gemini-3.1-pro was used as a landmark model for calibration, thus inflating its score substantially. We include it to show the effect of correcting the two main methodological issues.
Unlike most other models, grok-4.20 performs much worse without thinking than with. So its size estimate drops by a factor of almost 4 without thinking even with our fixes (and a factor of 5 in Li's original methodology).
The high parameter counts of Gemini-3s likely also results from Gemini 3.1's use as a landmark model.