Nope, you're right, I was reading quickly & didn't parse that :)
Yeah, good point on this being about HHH. I would note that some of the stuff like "kill / enslave all humans" feels much less aligned to human values (outside of a small but vocal e/acc crowd perhaps), but it does pattern-match well as "the opposite of HHH-style harmlessness"
> This technique definitely won't work on base models that are not trained on data after 2020.
The other two predictions make sense, but I'm not sure I understand this one. Are you thinking "not trained on data after 2020 AND not trained to be HHH"? If so, that seems plausible to me.
I could imagine a model with some assistantship training that isn't quite the same as HHH would still learn an abstraction similar to HHH-style harmlessness. But plausibly that would encode different things, e.g. it wouldn't necessarily couple "scheming to kill humans" and "conservative gender ideology". Likewise, "harmlessness" seems like a somewhat natural abstraction even in pre-training space, though there might be different subcomponents like "agreeableness", "risk-avoidance", and adherence to different cultural norms.
Thanks, that's cool to hear about!
The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.
Fascinating paper, thank you for this work!
I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space."
Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.
Relevant parameters:
Neat, thanks a ton for the algorithmic-vs-labor update -- I appreciated that you'd distinguished those in your post, but I forgot to carry that through in mine! :)
And oops, I really don't know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly!
The square relationship idea is interesting -- that factor of 2 is a huge deal. Would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions, and has error bars on each of the terms, so we both get a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not I might try to make a crappy version myself, or poke talented folks I know to do a good version :)
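To gesture at what I mean, here's a very rough Python sketch of that kind of model (Monte Carlo rather than actual Guesstimate/Squiggle). The point estimates (a 1.4 base for r, and ~1.7 vs ~2.5 for the capabilities-per-efficiency-doubling multiplier) are taken from the estimate quoted further down this thread; the spreads and weights are placeholders I made up, not anyone's considered view:

```python
# Very rough Monte Carlo sketch of a Guesstimate/Squiggle-style estimate of r.
# Point estimates come from the quoted passage below; the distributions are invented placeholders.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Base estimate of r before the capabilities adjustment, with made-up uncertainty.
base = rng.lognormal(mean=np.log(1.4), sigma=0.25, size=N)

# Capabilities gained per doubling of training efficiency: mix the "toy ML experiments"
# figure (~1.7) and the "human productivity studies" figure (~2.5), weighting the former.
multiplier = np.where(rng.random(N) < 0.7,
                      rng.normal(1.7, 0.2, N),
                      rng.normal(2.5, 0.3, N))

r = base * multiplier
print(f"median r = {np.median(r):.2f}")
print(f"10th-90th percentile: {np.percentile(r, 10):.2f} - {np.percentile(r, 90):.2f}")
```

Obviously the spreads are doing all the work here; the value would come from putting real thought into each term's distribution (and the other adjustments Tom mentions).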
Really appreciate you covering all these nuances, thanks Tom!
Can you give a pointer to the studies you mentioned here?
> There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Hey Ryan! Thanks for writing this up -- I think this whole topic is important and interesting.
I was confused about how your analysis related to the Epoch paper, so I spent a while with Claude analyzing it. I did a re-analysis that finds similar results, but also finds (I think) some flaws in your rough estimate. (Keep in mind I'm not an expert myself, and I haven't closely read the Epoch paper, so I might well be making conceptual errors. I think the math is right though!)
I'll walk through my understanding of this stuff first, then compare to your post. I'll be going a little slowly (A) so I can refresh my own memory by referencing this later, (B) to make it easy to call out mistakes, and (C) to hopefully make this legible to others who want to follow along.
The Epoch model
The Epoch paper models growth with the following equation:
1. $\frac{\dot{A}}{A} \propto E^{\lambda} A^{-\beta}$,
where A = efficiency and E = research input. We want to consider worlds with a potential software takeoff, meaning that increases in AI efficiency directly feed into research input, which we model as $E \propto A$. So the key consideration seems to be the ratio $\frac{\lambda}{\beta}$. If it's 1, we get steady exponential growth from scaling inputs; greater, superexponential; smaller, subexponential.[1]
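To make the sub/super-exponential distinction concrete, here's a minimal numerical sketch (in Python) of equation 1 under the software-takeoff assumption $E \propto A$, with all the constants set to 1 -- an arbitrary choice of mine, just to show the qualitative behavior:

```python
# Minimal sketch: iterate equation 1 with E proportional to A (constants absorbed),
# i.e. dA/dt = A^(1 + lambda - beta), for ratios lambda/beta below, at, and above 1.
def simulate(ratio, beta=1.0, A0=1.0, dt=0.001, T=6.0):
    """Crude Euler integration of dA/dt = A**(1 + beta*(ratio - 1))."""
    lam = ratio * beta
    A, t = A0, 0.0
    while t < T:
        A += dt * A ** (1 + lam - beta)
        t += dt
        if A > 1e12:                      # treat as "effectively infinite" (finite-time blowup)
            return t, float("inf")
    return T, A

for ratio in [0.8, 1.0, 1.2]:
    t, A = simulate(ratio)
    print(f"lambda/beta = {ratio}: A({t:.2f}) = {A:.3g}")
```

With the ratio below 1 growth is roughly polynomial, at 1 it's exponential, and above 1 it hits a finite-time singularity (the "inf").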
Fitting the model
How can we learn about this ratio from historical data?
Let's pretend history has been convenient and we've seen steady exponential growth in both variables, so $A \propto e^{g_A t}$ and $E \propto e^{g_E t}$. Then $\frac{\dot{A}}{A}$ has been constant over time, so by equation 1, $E^{\lambda} A^{-\beta}$ has been constant as well. Substituting in for A and E, we find that $e^{(\lambda g_E - \beta g_A) t}$ is constant over time, which is only possible if $\lambda g_E = \beta g_A$, so that the exponent is always zero. Thus if we've seen steady exponential growth, the historical value of our key ratio is:
2. $\frac{\lambda}{\beta} = \frac{g_A}{g_E}$.
Intuitively, if we've seen steady exponential growth while research input has increased more slowly than research output (AI efficiency), there are superlinear returns to scaling inputs.
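As a quick worked instance of equation 2, using figures that show up later in this comment (AI efficiency growing ~3.5x/year, effective research input growing ~2.3x/year):

$\frac{\lambda}{\beta} = \frac{g_A}{g_E} = \frac{\ln 3.5}{\ln 2.3} \approx \frac{1.25}{0.83} \approx 1.5$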
Introducing the Cobb-Douglas function
But wait! $E$, research input, is an abstraction that we can't directly measure. Really there are both compute and labor inputs. Those have indeed been growing roughly exponentially, but at different rates.
Intuitively, it makes sense to say that "effective research input" has grown as some kind of weighted average of the rate of compute and labor input growth. This is my take on why a Cobb-Douglas function of form (3) $E = C^{1-p} L^{p}$, with a weight parameter $p$, is useful here: it's a weighted geometric average of the two inputs, so its growth rate is a weighted average of their growth rates.
Writing that out: in general, say both inputs have grown exponentially, so $C \propto e^{g_C t}$ and $L \propto e^{g_L t}$. Then E has grown as $e^{((1-p) g_C + p g_L) t}$, so $g_E$ is the weighted average (4) $g_E = (1-p)\, g_C + p\, g_L$ of the growth rates of labor and capital.
Then, using Equation 2, we can estimate our key ratio as $\frac{\lambda}{\beta} = \frac{g_A}{(1-p)\, g_C + p\, g_L}$.
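To illustrate equation 4 with placeholder inputs (numbers I picked to be consistent with the ~2.3x/year and ~1.5 figures elsewhere in this comment, not necessarily the post's actual estimates): compute growing 4x/year, labor growing 1.6x/year, and labor weight $p = 0.6$ give

$g_E = 0.4 \ln 4 + 0.6 \ln 1.6 \approx 0.84$, i.e. effective research input growing $e^{0.84} \approx 2.3$x/year, and hence $\frac{\lambda}{\beta} \approx \frac{\ln 3.5}{0.84} \approx 1.5$.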
Plugging in your estimates:
Adjusting for labor-only scaling
But wait: we're not done yet! Under our Cobb-Douglas assumption, scaling labor by a factor of 2 isn't as good as scaling all research inputs by a factor of 2; it only multiplies $E$ by $2^{p}$ rather than by 2.
Plugging in Equation 3 (which describes research input in terms of compute and labor) to Equation 1 (which estimates AI progress based on research), our adjusted form of the Epoch model is $\frac{\dot{A}}{A} \propto A^{-\beta} \left( C^{1-p} L^{p} \right)^{\lambda}$.
Under a software-only singularity, we hold compute constant while scaling labor with AI efficiency, so $E^{\lambda} = \left( C^{1-p} L^{p} \right)^{\lambda} = L^{p\lambda}$ multiplied by a fixed compute term. Since labor scales as A, we have $\frac{\dot{A}}{A} \propto A^{p\lambda - \beta}$. By the same analysis as in our first section, we can see A grows exponentially if $\frac{p\lambda}{\beta} = 1$, and grows superexponentially if this ratio is >1. So our key ratio just gets multiplied by $p$, and it wasn't a waste to find it, phew!
Now we get the true form of our equation: we get a software-only foom iff $\frac{p\lambda}{\beta} > 1$, or (via equation 2) iff we see empirically that $p \cdot \frac{g_A}{g_E} > 1$. Call this the takeoff ratio: it corresponds to a) how much AI progress scales with inputs and b) how much of a penalty we take for not scaling compute.
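Putting equations 2-4 together as a tiny Python script -- the compute growth, labor growth, and labor weight are the same placeholder assumptions as above, chosen to line up with the numbers in the next paragraph rather than taken from the post:

```python
import math

def takeoff_ratio(efficiency_growth, compute_growth, labor_growth, p):
    """p * (lambda/beta), with lambda/beta estimated from historical growth rates via equations 2-4.
    Growth arguments are multiplicative factors per year; p is the Cobb-Douglas labor weight."""
    g_A = math.log(efficiency_growth)
    g_E = (1 - p) * math.log(compute_growth) + p * math.log(labor_growth)  # equation 4
    return p * g_A / g_E                                                   # p * (lambda/beta)

# Placeholder inputs: compute 4x/year, labor 1.6x/year, labor weight 0.6.
for eff in [3.5, 4.0]:
    ratio = takeoff_ratio(eff, compute_growth=4.0, labor_growth=1.6, p=0.6)
    print(f"{eff}x/year efficiency growth -> takeoff ratio = {ratio:.2f}")
```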
Result: Above, we got $\frac{\lambda}{\beta} \approx 1.5$, so our takeoff ratio is $p \cdot \frac{\lambda}{\beta} = 0.6 \times 1.5 = 0.9$. That's quite close! If we think it's more reasonable to think of a historical growth rate of 4x/year instead of 3.5x/year, we'd increase our takeoff ratio by a factor of $\frac{\ln 4}{\ln 3.5} \approx 1.11$, to a ratio of $\approx 1.0$, right on the knife edge of FOOM. [4] [note: I previously had the wrong numbers here: I had lambda/beta = 1.6, which would mean the 4x/year case has a takeoff ratio of 1.05, putting it into FOOM land]
So this isn't too far off from your results in terms of implications, but it is somewhat different (no FOOM for 3.5x, less sensitivity to the exact historical growth rate).
Tweaking alpha:
Your estimate of $\alpha$ is in fact similar in form to my ratio $\frac{\lambda}{\beta} = \frac{g_A}{g_E}$ - but what you're calculating instead is a ratio of growth factors over a fixed window, roughly $e^{r \cdot 1 \text{ yr}} / e^{q \cdot 1 \text{ yr}}$, rather than a ratio of the growth rates themselves.
One indicator that something's wrong is that your result involves checking whether $\alpha$ clears a threshold defined by a factor-of-2 scaleup -- i.e., whether doubling the inputs at least doubles the outputs. But the choice of 2 is arbitrary -- conceptually, you just want to check if scaling software by a factor n increases outputs by a factor n or more. Yet the value of $\alpha$ relevant to that check clearly varies with n.
One way of parsing the problem is that alpha is (implicitly) time dependent - it is equal to exp(r * 1 year) / exp(q * 1 year), a ratio of progress vs inputs in the time period of a year. If you calculated alpha based on a different amount of time, you'd get a different value. By contrast, r/q is a ratio of rates, so it stays the same regardless of what timeframe you use to measure it.[5]
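A toy example with made-up numbers: say progress grows 10x/year ($r = \ln 10$) and inputs grow 2x/year ($q = \ln 2$). Then

$\alpha(\text{1 yr}) = \frac{e^{r}}{e^{q}} = 5$, $\quad \alpha(\text{2 yr}) = \frac{e^{2r}}{e^{2q}} = 25$, $\quad \alpha(\text{6 mo}) = \frac{\sqrt{10}}{\sqrt{2}} \approx 2.2$, while $\frac{r}{q} = \frac{\ln 10}{\ln 2} \approx 3.3$ regardless of the window.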
Maybe I'm confused about what your Cobb-Douglas function is meant to be calculating - is it E within an Epoch-style takeoff model, or something else?
Does Cobb-Douglas make sense?
The geometric average of rates thing makes sense, but it feels weird that that simple intuitive approach leads to a functional form (Cobb-Douglas) that also has other implications.
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
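For what it's worth, the sum of the exponents is just the returns to scale: if $E = C^{a} L^{b}$, then scaling both inputs by a factor $k$ scales $E$ by $k^{a+b}$, so exponents summing to 1 means doubling all inputs exactly doubles effective research input, and a sum below (above) 1 means decreasing (increasing) returns to scale.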
How seriously should we take all this?
This whole thing relies on...
It feels like this sort of thing is better than nothing, but I wish we had something better.
I really like the various nuances you're adjusting for, like parallel vs serial scaling, and especially distinguishing algorithmic improvement from labor efficiency. [6] Thinking those things through makes this stuff feel less insubstantial and approximate...though the error bars still feel quite large.
Actually there's a complexity here, which is that scaling labor alone may be less efficient than scaling "research inputs" which include both labor and compute. We'll come to this in a few paragraphs.
This is only coincidentally similar to your figure of 2.3 :)
I originally had 1.6 here, but as Ryan points out in a reply it's actually 1.5. I've tried to reconstruct what I could have put into a calculator to get 1.6 instead, and I'm at a loss!
I was curious how aggressive the superexponential growth curve would be with a takeoff ratio of a mere 1.05. A couple of Claude queries gave me different answers (maybe because the growth is so extreme that different solvers give meaningfully different approximations?), but they agreed that growth is fairly slow in the first year (~5x) and then hits infinity by the end of the second year. I wrote this comment with the wrong numbers (0.96 instead of 0.9), so it doesn't accurately represent what you get if you plug in 4x capability growth per year. Still cool to get a sense of what these curves look like, though.
I think this can be understood in terms of the alpha-being-implicitly-a-timescale-function thing -- if you compare an alpha value with the ratio of growth you're likely to see during the same time period, e.g. alpha(1 year) and n = one doubling, you probably get reasonable-looking results.
I find it annoying that people conflate "increased efficiency of doing known tasks" with "increased ability to do new useful tasks". It seems to me that these could be importantly different, although it's hard to even settle on a reasonable formalization of the latter. Some reasons this might be okay:
I think a realistic example would be useful! I suspect a lot of the nuance (nuance that might feel obvious to you) is in how to apply this over a long conversation with lots of data points, amendments on both sides, etc.