Peter Johnson
Peter Johnson has not written any posts yet.

Interesting! I definitely just have a different intuition about how much smaller "bad advice" is as an umbrella category compared to whatever common "bad" ancestor covers both giving SQL-injection code and telling someone to murder your husband. That probably isn't super resolvable, so we can ignore it for now.
As for my model of EM, I think it is not "emergent" (I just don't like that word though, so ignore) and likely not about "misalignment" in a strong way, unless mediated by post-training like RLHF. The logic goes something like this:
I'm surprised to see these results framed as a replication rather than a failure to replicate the concept of EM, so I'll just try to explain where I'm coming from; hopefully I can come to understand the perspective that EM is replicated along the way.
In Betley et al. (2025b), using GPT-4o, they show a 20% misalignment rate on their "first-plot" questions (chosen via an unshared process) while attaining 5.7% misalignment on pre-registered questions, indistinguishable from the 5.2% misalignment of the jailbroken GPT-4o model. Similarly, their Qwen-2.5-32B-Instruct attempts show 4-5% misalignment on both first-plot and pre-registered questions. This post's associated paper (https://arxiv.org/pdf/2506.11613) replicates the Qwen-2.5 first-plot…
Hey! Deeply appreciate you putting in the work to make this a coherent and much more exhaustive critique than the one I put to paper :)
I have only had a chance to skim but the expansion on the gaps model is much appreciated in particular!
(also want to stress the authors have been very good at engaging with critiques and I am happy to see that has continued here)
Apologies, just saw this now since we were taking a break! There are two doubling-space lognormals in the timelines forecast (see the attached image), and only the second, when you create an inverse Gaussian matched in mean and variance to the lognormal, is in a parameter range where the uncertainty, rather than the mean, is the driver of fast timelines (it also has very similar 10th and 90th percentiles of 0.44 months and 18.7 months).
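If anyone wants to check those matched percentiles, here's a minimal stdlib sketch. It assumes the second lognormal's stated 10th/90th percentiles of 0.5 and 18 months, moment-matches an inverse Gaussian, and inverts its closed-form CDF by bisection (all helper names are mine, not from the forecast):

```python
from math import erf, exp, log, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Lognormal fit to the stated 10th/90th percentiles (0.5 and 18 months).
z90 = 1.2815515655446004  # standard normal 90th-percentile quantile
mu_ln = (log(0.5) + log(18.0)) / 2.0
sigma_ln = (log(18.0) - log(0.5)) / (2.0 * z90)

# Inverse Gaussian IG(m, lam) matched in mean and variance.
m = exp(mu_ln + sigma_ln ** 2 / 2.0)            # lognormal mean
var = (exp(sigma_ln ** 2) - 1.0) * m ** 2       # lognormal variance
lam = m ** 3 / var                              # IG shape, since Var = m^3 / lam

def ig_cdf(x):
    """Closed-form inverse Gaussian CDF."""
    a = sqrt(lam / x)
    return phi(a * (x / m - 1.0)) + exp(2.0 * lam / m) * phi(-a * (x / m + 1.0))

def ig_quantile(p, lo=1e-9, hi=1e4):
    """Invert ig_cdf by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ig_cdf(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

print(ig_quantile(0.1), ig_quantile(0.9))  # roughly 0.44 and 18.7 months
```

So the matched inverse Gaussian does land very close to the quoted 0.44 and 18.7 months.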
I do think speeding up to the second lognormal is not super well justified, but it's fine to ignore disagreements on parameter central tendencies (it's kind of odd to say "speeding up," because the mean actually gets slower while the median gets somewhat…
You can ignore this for now, since I need to work through whether it is still true depending on how we view the source of uncertainty in doubling time. Edit: this explanation is correct afaict and worth looking into.
The parameters of the second log-normal (doubling time at RE-Bench saturation, 10th: 0.5 mo., 90th: 18 mo.), when converted to an equivalent inverse Gaussian by matching mean and variance (approx. InverseGaussian[7.97413, 1.315]), are implausible. The linked paper highlights that, to reasonably represent doubling processes, the ratio of the first to the second parameter ought to be << 2/ln(2) (or << (1/(2ln(2)^2))). The failure to meet that condition indicates that the "size hypothesis" of any of the growth…
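A quick sketch of that moment matching, assuming the stated 0.5/18-month percentiles (variable names and the moment-matching route are mine):

```python
from math import exp, log

# Lognormal implied by the stated percentiles: 10th = 0.5 mo., 90th = 18 mo.
z90 = 1.2815515655446004  # standard normal 90th-percentile quantile
mu_ln = (log(0.5) + log(18.0)) / 2.0
sigma_ln = (log(18.0) - log(0.5)) / (2.0 * z90)

# Moment-matched inverse Gaussian IG(m, lam): same mean, same variance.
m = exp(mu_ln + sigma_ln ** 2 / 2.0)        # mean, ~7.97
var = (exp(sigma_ln ** 2) - 1.0) * m ** 2   # variance
lam = m ** 3 / var                          # shape, ~1.32 (Var = m^3 / lam)

# Plausibility check from the linked paper: m / lam should be << 2 / ln(2).
print(m / lam, 2.0 / log(2.0))  # ~6.1 vs ~2.9 -- the condition fails
```

The first-to-second parameter ratio comes out around 6.1, nowhere near << 2/ln(2) ~ 2.9, which is the implausibility being claimed.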
Edit: see the subsequent response for a more accurate formalization.
In the benchmark gaps timelines forecast there are two "doubling rate" parameters modeled with log-normal uncertainty. A log-normal is an inappropriate prior on doubling times (the inverse of an exponential growth rate) and massively inflates the CDF at extremely low values relative to a more reasonable inverse Gaussian prior (Note 1) with equivalent mean and variance (Note 2), creating an impression of much higher probabilities on super-fast doubling times.
This problem exists in both the timeline extension model and the gaps model to a high degree, is distinct from the previously mentioned issues around super-exponentiality or research progress acceleration, and is yet another mutually reinforcing error term inflating fast timelines.
Note 1: https://www.tandfonline.com/doi/pdf/10.1080/07362994.2015.1010124
Note 2: ratio of LogNormal/InverseGaussian CDFs analytically here for equivalent mean of e^1.5 and variance: https://www.wolframalpha.com/input?i=plot+%281%2F2%29+*%281+%2B+erf%28%28-1+%2B+log%28x%29%29%2Fsqrt%282%29%29%29+%2F+%280.5+*%28erfc%28-%280.919989+%28-4.48169+%2B+x%29%29%2Fsqrt%28x%29%29+%2B+3.88584%C3%9710%5E6+erfc%28%280.919989+%284.48169+%2B+x%29%29%2Fsqrt%28x%29%29%29%29%2C+x+from+0+to+20
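To make Note 2 concrete without WolframAlpha, here's a stdlib sketch comparing the two CDFs at low doubling times, assuming the 0.5/18-month lognormal from the forecast (function names are mine):

```python
from math import erf, exp, log, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Lognormal from the stated 10th/90th percentiles (0.5 and 18 months).
z90 = 1.2815515655446004
mu_ln = (log(0.5) + log(18.0)) / 2.0
sigma_ln = (log(18.0) - log(0.5)) / (2.0 * z90)

def lognorm_cdf(x):
    return phi((log(x) - mu_ln) / sigma_ln)

# Moment-matched inverse Gaussian IG(m, lam).
m = exp(mu_ln + sigma_ln ** 2 / 2.0)
lam = m ** 3 / ((exp(sigma_ln ** 2) - 1.0) * m ** 2)

def ig_cdf(x):
    a = sqrt(lam / x)
    return phi(a * (x / m - 1.0)) + exp(2.0 * lam / m) * phi(-a * (x / m + 1.0))

# How much more left-tail mass the lognormal puts on super-fast doublings.
for x in (0.05, 0.1, 0.25):
    print(x, lognorm_cdf(x) / ig_cdf(x))
```

The inflation lives in the extreme left tail: the ratio is thousands-fold around 0.05 months and tens-fold around 0.1 months, falling toward parity by about a quarter-month.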
I bet that we will not see a model released in the future that equals or surpasses the general performance of Chinchilla while reducing the compute (in training FLOPs) required for such performance by an equivalent of 3.5x per year.
FWIW I think much of software progress comes from achieving better performance at a fixed or increased compute budget rather than making a fixed performance level more efficient, so I think this underestimates software progress.
The main justification given in the timelines supplement and main dropdown for treating compute efficiency as approximately equal to compute in terms of progress is the Epoch AI measurements, which are specifically about fixed performance at lower compute. At the…
You're right that the 143 should be closer to 114! (I took March 1, 2022 -> July 1, 2022 instead of the accurate March 22, 2022 -> June 1, 2022.)
I don't think it is your 0th percentile, and I am not assuming it is; I am claiming that either the model's 0th percentile isn't close to your 0th percentile (so it should not be treated as representing a reasonable belief range, which it seems is conceded), or the bet should be seen as generally reasonable.
I sincerely do not think a limited-time argument is valid, given the amount of work that was put into non-modeling aspects of the presentation and the…
I can't argue against a handful of different speedups, all on the object level, without reference to each other. The justifications generally rest on basically the same intuition: that AI R&D is strongly enhanced by AI in a virtuous cycle. The only mechanical cause claimed for the speedup is compute efficiency (i.e., less compute for the same performance), and it's hard for me to imagine what other mechanical cause could be claimed that isn't contained in compute or compute efficiency.
Finally, if I understand the gaps model, it is not a trend-extrapolation model at all! It is purely guesses about calendar time, put into a form in which they are hard to disentangle or…
Two more related thoughts:
1. Jailbreaking vs. EM
I predict it will be very difficult to use EM frameworks (fine-tuning or steering on A's to otherwise fine-to-answer Q's) to jailbreak AI, in the sense that unless there are samples of Q["misaligned"] -> A["non-refusal"] (or some delayed version of this) in the fine-tuning set, refusal to answer misaligned questions will not be overcome by a general "evilness" or "misalignment" of the LLM, no matter how many misaligned A's it is trained with (with some exceptions below). This is because:
- There is an imbalance between misaligned Q's and A's, in that Q's are asked without the context of the A, but A's are given with the context of the Q…