You can ignore this for now, since I need to work through whether this is still true depending on how we view the source of uncertainty in doubling time. Edit: this explanation is correct afaict and worth looking into.
The parameters for the second log-normal (doubling time at RE-Bench saturation, 10th percentile: 0.5 mo., 90th percentile: 18 mo.), when converted to an equivalent inverse Gaussian by matching mean and variance (approx. InverseGaussian[7.97413, 1.315]), are implausible. The linked paper highlights that, for the distribution to reasonably represent a doubling process, the ratio of the first parameter to the second ought to be << 2/ln(2) (or << 1/(2 ln(2)^2)). Failing that condition indicates that the "size hypothesis" of any such growth process is violated, i.e. the distribution is no longer modeling uncertainty around such a process.
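For concreteness, here is a minimal sketch (mine, not the model's code) of that conversion: fit the log-normal from the stated 10th/90th percentiles, match an inverse Gaussian by mean and variance, and check the resulting parameter ratio against 2/ln(2):

```python
import numpy as np
from scipy import stats

# Log-normal for doubling time at RE-Bench saturation: 10th pct = 0.5 mo, 90th pct = 18 mo.
z90 = stats.norm.ppf(0.9)                       # ~1.2816
mu = (np.log(18) + np.log(0.5)) / 2             # ~1.10 (log-months)
sigma = (np.log(18) - np.log(0.5)) / (2 * z90)  # ~1.40

lognorm = stats.lognorm(s=sigma, scale=np.exp(mu))
mean, var = lognorm.stats(moments="mv")         # mean ~7.97 mo, variance ~385

# Inverse Gaussian IG(mean, lambda) with the same mean and variance:
# variance = mean^3 / lambda  =>  lambda = mean^3 / variance
lam = mean**3 / var                             # ~1.32, i.e. InverseGaussian[7.97, 1.32]

ratio = mean / lam                              # ~6.1
print(mean, lam, ratio, 2 / np.log(2))          # ratio ~6.1 vs. threshold ~2.89, so "<<" clearly fails
```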
Ok, so that's a lot of math; what does it mean? In general, it means that our "uncertainty" is now the main driver of fast timelines rather than reflecting a lack of knowledge in any way. The distribution is so stretched that the mode and median are wildly smaller than the mean, entirely because of the possibility that some random unknown event causes foom, unrelated to the estimated "growth rate" of the process. It's like cranking up the noise term on a stock-market model, being surprised that some companies are projected to go to the moon tomorrow, and then claiming it's because those stocks were estimated to have huge upsides.
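To put rough numbers on that (a quick check against the same fitted log-normal as above, not the model's code): the mode, median, and mean come out to roughly 0.4, 3.0, and 8.0 months, with about 10% of the mass below a half-month doubling time.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0986, 1.398                 # fitted above from the 0.5/18-month percentiles
dist = stats.lognorm(s=sigma, scale=np.exp(mu))

mode = np.exp(mu - sigma**2)              # ~0.42 months
median = dist.median()                    # ~3.0 months
mean = dist.mean()                        # ~8.0 months
p_fast = dist.cdf(0.5)                    # ~0.10: mass on sub-half-month doubling times
print(mode, median, mean, p_fast)
```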
There is no good fix that keeps the model intact (the underlying issue is that the model works in outcome domains like time and frequency rather than input domains like time horizons, compute, or effective compute). If one kept the same mean and scaled up the second (shape) parameter of the inverse Gaussian, the left side of the pdf would collapse and the mode and median would jump much higher, resulting in a much later estimate of SC. That doesn't mean that's how the model should be fixed, but it does indicate that fast timelines are incidental to, and reflective of, other issues in the model.
Edit: see subsequent response for a more accurate formalization.
In the benchmark gaps timelines forecast there are two "doubling rate" parameters modeled with log-normal uncertainty. A log-normal is inappropriate as a prior on doubling times (the inverse of an exponential growth rate) and massively inflates the CDF at extremely low values relative to a more reasonable inverse Gaussian prior (Note 1) with equivalent mean and variance (Note 2; see also the sketch after the notes), creating an impression of much higher probability on super-fast doubling times.
This problem exists to a high degree in both the timeline extension model and the gaps model, is distinct from the previously mentioned issues around super-exponentiality and research progress acceleration, and is yet another mutually reinforcing error term inflating fast timelines.
Note 1: https://www.tandfonline.com/doi/pdf/10.1080/07362994.2015.1010124
Note 2: ratio of LogNormal/InverseGaussian CDFs analytically here for equivalent mean of e^1.5 and variance: https://www.wolframalpha.com/input?i=plot+%281%2F2%29+*%281+%2B+erf%28%28-1+%2B+log%28x%29%29%2Fsqrt%282%29%29%29+%2F+%280.5+*%28erfc%28-%280.919989+%28-4.48169+%2B+x%29%29%2Fsqrt%28x%29%29+%2B+3.88584%C3%9710%5E6+erfc%28%280.919989+%284.48169+%2B+x%29%29%2Fsqrt%28x%29%29%29%29%2C+x+from+0+to+20
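As a quick numerical illustration of the same effect, here is a scipy sketch (my own illustration, not the forecast's code) using the forecast's second doubling-time log-normal and an inverse Gaussian matched by mean and variance:

```python
from numpy import exp
from scipy import stats

# Second doubling-time log-normal (10th pct = 0.5 mo, 90th pct = 18 mo) and its
# mean/variance-matched inverse Gaussian, approx. InverseGaussian[7.974, 1.315].
# scipy parameterization: IG(mean=m, lambda=lam) = invgauss(mu=m/lam, scale=lam).
ln_dist = stats.lognorm(s=1.398, scale=exp(1.0986))
ig_dist = stats.invgauss(mu=7.974 / 1.315, scale=1.315)

for x in [0.02, 0.05, 0.1, 0.25]:          # doubling times in months
    p_ln, p_ig = ln_dist.cdf(x), ig_dist.cdf(x)
    print(f"P(T < {x} mo): lognormal {p_ln:.2e}, inverse Gaussian {p_ig:.2e}, ratio {p_ln/p_ig:.3g}")
# The ratio explodes as x -> 0: the log-normal puts orders of magnitude more
# probability on near-instant doubling times than the matched inverse Gaussian does.
```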
I bet that we will not see a model released in the future that equals or surpasses the general performance of Chinchilla while reducing the compute (in training FLOPs) required for such performance by an equivalent of 3.5x per year.
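For reference, a rough sketch of what that rate implies, using the standard 6*N*D FLOP approximation for Chinchilla (70B params, 1.4T tokens); the year-by-year thresholds are my own illustration, not the bet's exact resolution criteria:

```python
# Compute budget that would have to suffice for Chinchilla-level performance if
# training efficiency improved by 3.5x per year (illustrative schedule only).
chinchilla_flops = 6 * 70e9 * 1.4e12           # ~5.9e23 training FLOPs
for years in range(1, 6):
    budget = chinchilla_flops / 3.5**years
    print(f"after {years} yr: Chinchilla-level performance in <= {budget:.2e} FLOPs")
```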
FWIW I think much of software progress comes from achieving better performance at a fixed or increased compute budget rather than making a fixed performance level more efficient, so I think this underestimates software progress.
The main justification given in the timelines supplement and main dropdown for treating compute efficiency as approximately equal to compute in terms of progress is the Epoch AI measurements, which are specifically about reaching fixed performance with lower compute. At the very least, this concedes that the estimates are not based on trend extrapolation and are conjecture.
I agree that it's harder to quantify software improvements at the same or higher levels of compute in a way that can be easily compared against compute increases, but we can absolutely measure some part of it by looking at performance increases given the same compute budget (it's quite hard to measure "how much compute would it have taken 2015 algorithms/data to reach 2025 performance," though, for obvious reasons).
Something being harder to measure is not an excuse for ignoring it.
Something being unfalsifiable looking forward and unmeasurable looking backward is a justification for not treating it with high credence, so I think this is also a core disagreement.
To be clear, I agree that there will be some slowdown due to complementarity of software and hardware, and ideally this would be measured in the model. One can think that there will be multiple effects in different directions. I think that at the levels of research speedup observed in the timelines supplement, the magnitude is likely to be low enough to not change the overall takeaways from the model, but maybe you disagree. I might get around to adding this in as it would be nice.
Here are two charts demonstrating that small changes in the estimates of current R&D contribution and of R&D speedup change the model massively in the absence of a singularity. I know we're just going to go straight back to "well, the real model is the even-more-unfalsifiable benchmarks and gaps model," but I think that is unreasonable.
EDIT: THESE FIGURES OVERESTIMATE THE IMPACT OF REDUCING CURRENT ALGORITHMIC PROGRESS. THE SECOND IS WRONG, AND THE REAL IMPACT IS MORE CONTAINED.
Figure 1: R&D is 50% of current progress, with and without speedups, exponential only
Figure 2: R&D is 33% of current progress, with and without speedups, exponential only
I do not understand how "I think this variable doesn't matter (without checking)" is a good defense of questionably implemented variables that do overdetermine the model, while "this variable doesn't matter to outcomes" is not a valid critique w.r.t. things like "what are current capabilities / the current time horizon."
THIS SECOND ONE IS WRONG, MEDIAN HORIZON CHANGES BY CLOSER TO HALF A YEAR AT 33% (TO FEB 2029) THAN ALMOST 2 YEARS (TO APR 2031 AS INCORRECTLY SHOWN)
You're right on the 143 being closer to 114! (I took March 1 2022 -> July 1 2022 instead of March 22 2022 -> June 1 2022 which is accurate).
I don't think it is your 0th percentile, and I am not assuming it is. I am claiming that either the model's 0th percentile isn't close to your 0th percentile (and so should not be treated as representing a reasonable belief range, which it seems is conceded), or the bet should be seen as generally reasonable.
I sincerely do not think a limited-time argument is valid given the amount of work that was put into non-modeling aspects of the presentation and the amount of work claimed to have gone into the model across several gamings, reviews, and months of work, etc.
If the burden of proof is on critics to do work you are not willing to do in order to show the model is flawed (for a bounty of 4-10% of what you offer someone for writing a supporting piece to advertise your position further), then the defense of limited time raises some hackles.
I can't argue against a handful of different speedups all on the object level without reference to each other. The justifications generally rest on basically the same intuition, which is that AI R&D is strongly enhanced by AI in a virtuous cycle. The only mechanical cause for the claimed speedup is compute efficiency (aka less compute for the same performance), and it's hard for me to imagine what other mechanical cause could be claimed that isn't contained in compute or compute efficiency.
Finally, if I understand the gaps model correctly, it is not a trend-extrapolation model at all! It is purely guesses about calendar time, put into a form in which they are hard to disentangle or validate.
To make effective bets we need a relatively high-probability, falsifiable, and quickly-resolving metric that is unlikely to be gamed. METR benchmarks (like every benchmark ever) can be gamed or reacted to (and gaming is precisely the argument made about most of that handful of distinct speedups). However, if the model relies on a core assumption that is falsifiable, we should focus on that metric. If computational efficiency gains are not core to the model, I am confused about how its claim that we will reach SC differs from a bare assertion that we reach SC soon, with no reference to anything falsifiable!
Your source specifically says it is far overtrained relative to optimal compute scaling laws?
"This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed."
If my belief were that we never cross that threshold, I would not be citing a paper that includes a figure explicitly showing that threshold being crossed repeatedly. My point is that counting it as a long-term trend is indefensible.
Yes, and also GPT-4 is nowhere close to compute-efficient?
Edit: the entire point is that we have never seen computational-efficiency gains that are reliable over the kinds of timelines assumed in these models. I have offered a bet to that effect, and finding counterexamples of model pairs where it may fail is nothing like finding a substantive reason that I am wrong to propose the bet.
Edit 2: with regard to your [?], I sincerely do not think the burden of proof is on me to demonstrate that a model is not compute-efficient when I have already committed to monetary bets on models I believe are. If it is compute-efficient according to even Kaplan or Chinchilla scaling laws, please demonstrate that for me. I did not bring it up as a compute-efficient model; you did!
For the record, here is the simple model without a super-exponential singularity, both with and without the R&D speedups:
I can hardly call this anything but extremely determinative of the results.
Apologies, just saw this now since we were taking a break! There are two doubling-space lognormals in the timelines forecast (see image attached), and only the second, when you construct an inverse Gaussian matched to the lognormal's mean and variance, is in a parameter range where the uncertainty, rather than the mean, is the driver of fast timelines (the matched distribution also has very similar 10th and 90th percentiles of 0.44 and 18.7 months).
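A quick check of those percentiles (a scipy sketch using the matched InverseGaussian[7.974, 1.315]; not the forecast's own code):

```python
from scipy import stats

# Inverse Gaussian matched to the second lognormal's mean (~7.974 mo) and variance.
# scipy parameterization: IG(mean=m, lambda=lam) = invgauss(mu=m/lam, scale=lam).
ig = stats.invgauss(mu=7.974 / 1.315, scale=1.315)
print(ig.ppf(0.10), ig.ppf(0.90))   # ~0.44 and ~18.7 months, vs. the lognormal's 0.5 and 18
```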
I do think the speed-up to the second lognormal is not super well justified, but it's fine to ignore disagreements on parameter central tendencies (it's kind of odd to call it speeding up, since the mean actually gets slower while the median gets somewhat faster and the sub-median gets wildly faster: 5x faster at the 10th percentile).
I actually think adjusting this would make fast timelines significantly more appealing to people looking into the model. A big "what?" issue for me, at least, is how much mass in the timelines model implies we already have, or are about to have, SC; adjustments that keep the median fairly close but sharply curtail how fast the 10th percentile is would make me trust the model more (and thus believe a <2030 SC timeline more).