Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure. I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings. On reflection, I suppose the count...
Surely if AIs were completing month-long, self-contained software engineering tasks (e.g. what a smart intern might do in their first month), that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of:
If we take the midpoint of Thomas Kwa's "3-4 months" guess for subsequent doubling time, we get 23.8 months for (1). If we take "a few y...
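Roughly, the arithmetic looks like this (a sketch: treating a month-long task as ~167 working hours and starting from o3's 1.5-hour horizon are my assumptions, not METR's exact methodology):

```python
import math

# Rough reconstruction of the extrapolation above; the 167-hour work-month
# and the o3 starting point are assumptions, not METR's exact figures.
current_horizon_hours = 1.5     # o3's ~50%-success time horizon
target_horizon_hours = 167.0    # ~one month of full-time human work
doubling_time_months = 3.5      # midpoint of the "3-4 months" guess

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
months_to_target = doublings_needed * doubling_time_months
print(f"{doublings_needed:.1f} doublings -> {months_to_target:.1f} months")
# -> 6.8 doublings -> 23.8 months
```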
To be clear, I agree that the bad interpretations were not coming from METR.
Thanks, I've edited the post to note this.
Sure – I was presenting these as "human-only, software-only" estimates:
Here are the median estimates of the "human-only, software-only" time needed to reach each milestone:
So it doesn't seem like there's a problem here?
I added up the median "Predictions for gap size" in the "How fast can the task difficulty gaps be crossed?" table, summing each set of predictions separately ("Eli", "Nikola", "FutureSearch") to get three numbers ranging from 30-75.
Does this table cover the time between now and superhuman coder? I thought it started at RE-Bench, because:
Correct. Am I wrong in thinking that it's usual to use the word "timelines" to refer to the entire arc of AI progress, including both the periods covered in the "Timelines Forecast" and "Takeoff Forecast"? But since this is all in the context of AI 2027, I should have clarified.
What's your basis for "well-defined tasks" vs. "realistic tasks" to have very different doubling times going forward? Is the idea that the recent acceleration seems to be specifically due to RL, and RL will be applicable to well-defined tasks but not realistic tasks?
This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I'd be very interested.
Thanks everyone for all the feedback and answers to my unending questions! The branching comments are starting to become too much to handle, so I'm going to take a breather and then write a followup post – hopefully by the end of the week but we'll see – in which I'll share some consolidated thoughts on the new (to me) ideas that surfaced here and also respond to some specific points.
Thanks.
I'm now very strongly feeling the need to explore the question of what sorts of activities go into creating better models, what sorts of expertise are needed, and how that might change as things move forward. Which unfortunately I know ~nothing about, so I'll have to find some folks who are willing to let me pick their brains...
Thanks! I agree that my statements about Amdahl's Law primarily hinge on my misunderstanding of the milestones, as elucidated in the back-and-forth with Ryan. I need to digest that; as Ryan anticipates, possibly I'll wind up with thoughts worth sharing regarding the "human-only, software-only" time estimates, especially for the earlier stages, but it'll take me some time to chew on that.
(As a minor point of feedback, I'd suggest adding a bit of material near the top of the timelines and/or takeoff forecasts, clarifying the range of activities meant to be i...
I've (briefly) addressed the compute bottleneck question on a different comment branch, and "hard-to-automate activities aren't a problem" on another (confusion regarding the definition of various milestones).
[Dependence on Narrow Data Sets] is only applicable to the timeline to the superhuman coder milestone, not to takeoff speeds once we have a superhuman coder. (Or maybe you think a similar argument applies to the time between superhuman coder and SAR.)
I do think it applies, if indirectly. Most data relating to progress in AI capabilities comes from ben...
I think my short, narrowly technical response to this would be "agreed".
Additional thoughts, which I would love your perspective on:
1. I feel like the idea that human activities involved in creating better models are broader than just, like, stereotypical things an ML Ph.D would do, is under-explored. Elsewhere in this thread you say "my sense is that an SAR has to be better than humans at basically everything except vision." There's a lot to unpack there, and I don't think I've seen it discussed anywhere, including in AI 2027. Do stereotypical things an M...
We now have several branches going, so I'm going to consolidate most of my response in just one branch, since they're converging on similar questions anyway. Here, I'll just address this:
But when considering activities that aren't bottlenecked on the environment, to achieve 10x acceleration you just need 10x more speed at the same level of capability.
I'm imagining that, at some intermediate stages of development, there will be skills for which AI does not even match human capability (for the relevant humans), and its outputs are of unusably low quality.
This is valid, but doesn't really engage with the specific arguments here. By definition, when we consider the potential for AI to accelerate the path to ASI, we are contemplating the capabilities of something that is not a full ASI. Today's models have extremely jagged capabilities, with lots of holes, and (I would argue) they aren't anywhere near exhibiting sophisticated high-level planning skills able to route around their own limitations. So the question becomes, what is the shape of the curve of AI filling in weak capabilities and/or developing sophis...
Sure, but for output quality better than what humans could (ever) do to matter for the relative speed up, you have to argue about compute bottlenecks, not Amdahl's law for just the automation itself!
I'm having trouble parsing this sentence... which may not be important – the rest of what you've said seems clear, so unless there's a separate idea here that needs responding to then it's fine.
...It sounds like your actual objection is in the human-only, software-only time from superhuman coder to SAR (you think this would take more than 1.5-10 years).
Or perhaps
This is valid for activities which benefit from speed and scale. But when output quality is paramount, speed and scale may not always provide much help?
My mental model is that, for some time to come, there will be activities where AIs simply aren't very competent at all, such that even many copies running at high speed won't provide uplift. For instance, if AIs aren't in general able to make good choices regarding which experiments to run next, then even an army of very fast poor-experiment-choosers might not be worth much, we might still need to rely on p...
Yes, but you're assuming that human-driven AI R&D is very highly bottlenecked on a single, highly serial task, which is simply not the case. (If you disagree: which specific narrow activity are you referring to that constitutes the non-parallelizable bottleneck?)
Amdahl's Law isn't just a bit of math, it's a bit of math coupled with long experience of how complex systems tend to decompose in practice.
That's not how the math works. Suppose there are 200 activities under the heading of "AI R&D" that each comprise at least 0.1% of the workload. Suppose we reach a point where AI is vastly superhuman at 150 of those activities (which would include any activities that humans are particularly bad at), moderately superhuman at 40 more, and not much better than human (or even worse than human) at the remaining 10. Those 10 activities where AI is not providing much uplift comprise at least 1% of the AI R&D workload, and so progress can be accelerated at ...
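To make the arithmetic explicit, here's a minimal Amdahl's Law sketch; the per-bucket speedups and workload shares below are placeholders matching the scenario above, not capability estimates:

```python
def amdahl_speedup(workload):
    """Overall speedup from (fraction_of_workload, per-activity_speedup) pairs;
    fractions should sum to 1."""
    return 1.0 / sum(frac / speed for frac, speed in workload)

# Illustrative split matching the scenario above; the per-bucket speedups
# (1000x, 10x, 1x) are placeholders.
workload = [
    (0.75, 1000.0),  # 150 activities where AI is vastly superhuman
    (0.24, 10.0),    # 40 activities where AI is moderately superhuman
    (0.01, 1.0),     # 10 activities where AI gives little or no uplift
]
print(f"{amdahl_speedup(workload):.1f}x")
# -> ~28.8x: the small slices with modest uplift dominate the denominator,
#    no matter how extreme the speedup on everything else.
```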
You omitted "with a straight face". I do not believe that the scenario you've described is plausible (in the timeframe where we don't already have ASI by other means, i.e. as a path to ASI rather than a ramification of it).
FWIW my vibe is closer to Thane's. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we've hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Just posting to express my appreciation for the rich discussion. I see two broad topics emerging that seem worthy of systematic exploration:
better capabilities than average adult human in almost all respects in late 2024
I see people say things like this, but I don't understand it at all. The average adult human can do all sorts of things that current AIs are hopeless at, such as planning a weekend getaway. Have you, literally you personally today, automated 90% of the things you do at your computer? If current AI has better capabilities than the average adult human, shouldn't it be able to do most of what you do? (Setting aside anything where you have special expertise, but we all spend big ch...
Thanks for engaging so deeply on this!
...AIs don't just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration via researchers just focusing on the first line of research (depending on how diminishing retu
...Wouldn't you drown in the overhead of generating tasks, evaluating the results, etc.? As a senior dev, I've had plenty of situations where junior devs were very helpful, but I've also had plenty of situations where it was more work for me to manage them than it would have been to do the job myself. These weren't incompetent people, they just didn't understand the situation well enough to make good choices and it wasn't easy to impart that understanding. And I don't think I've ever been sole tech lead for a team that was overall more than, say, 5x more pro
I see a bunch of good questions explicitly or implicitly posed here. I'll touch on each one.
1. What level of capabilities would be needed to achieve "AIs that 10x AI R&D labor"? My guess is, pretty high. Obviously you'd need to be able to automate at least 90% of what capabilities researchers do today. But 90% is a lot, you'll be pushing out into the long tail of tasks that require taste, subtle tacit knowledge, etc. I am handicapped here by having absolutely no experience with / exposure to what goes on inside an AI research lab. I have 35 years of ex...
Obviously you'd need to be able to automate at least 90% of what capabilities researchers do today.
Actually, I don't think so. AIs don't just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration vi...
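A toy numerical sketch of how that can work (the effort**alpha form for diminishing returns is purely illustrative, not a claim about real research production functions):

```python
def progress(effort_line1, effort_line2, alpha):
    """Toy model: total algorithmic progress from two substitutable research
    lines, each with diminishing returns (output ~ effort**alpha)."""
    return effort_line1**alpha + effort_line2**alpha

for alpha in (0.8, 0.5):
    baseline = progress(0.5, 0.5, alpha)   # humans split 50/50 across the two lines
    with_ai = progress(30.0, 0.0, alpha)   # all labor shifts to line 1, which AI speeds up 30x
    print(f"alpha={alpha}: speedup ~ {with_ai / baseline:.1f}x")
# -> ~13x when returns diminish slowly (alpha=0.8), ~3.9x when they diminish fast (alpha=0.5)
```

So whether this route clears 10x really does hinge on how quickly returns diminish on the accelerated line.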
See my response to Daniel (https://www.lesswrong.com/posts/auGYErf5QqiTihTsJ/what-indicators-should-we-watch-to-disambiguate-agi?commentId=WRJMsp2bZCBp5egvr). In brief: I won't defend my vague characterization of "breakthroughs" nor my handwavy estimates of how many are needed to reach AGI, how often they occur, and how the rate of breakthroughs might evolve. I would love to see someone attempt a more rigorous analysis along these lines (I don't feel particularly qualified to do so). I wouldn't expect that to result in a precise figure for the arrival of AGI, but I would hope for it to add to the conversation.
This is my "slow scenario". Not sure whether it's clear that I meant the things I said here to lean pessimistic – I struggled with whether to clutter each scenario with a lot of "might" and "if things go quickly / slowly" and so forth.
In any case, you are absolutely correct that I am handwaving here, independent of whether I am attempting to wave in the general direction of my median prediction or something else. The same is true in other places, for instance when I argue that even in what I am dubbing a "fast scenario" AGI (as defined here) is at lea...
That makes sense -- I should have mentioned, I like your post overall & agree with the thesis that we should be thinking about what short vs. long timelines worlds will look like and then thinking about what the early indicators will be, instead of simply looking at benchmark scores. & I like your slow vs. fast scenarios, I guess I just think the fast one is more likely. :)
Yes, test time compute can be worthwhile to scale. My argument is that it is less worthwhile than scaling training compute. We should expect to see scaling of test time compute, but (I suggest) we shouldn't expect this scaling to go as far as it has for training compute, and we should expect it to be employed sparingly.
The main reason I think this is worth bringing up is that people have been talking about test-time compute as "the new scaling law", with the implication that it will pick up right where scaling of training compute left off, just keep turnin...
Jumping in late just to say one thing very directly: I believe you are correct to be skeptical of the framing that inference compute introduces a "new scaling law". Yes, we now have two ways of using more compute to get better performance – at training time or at inference time. But (as you're presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won't be worthwhile to go very far down the road of scaling inference compute.
We will continue to increase inferenc...
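Toy numbers to illustrate the amortization asymmetry (all figures hypothetical):

```python
# Hypothetical figures chosen only to illustrate the asymmetry.
train_cost = 1e8          # one-time training cost (arbitrary units)
infer_cost = 0.01         # baseline per-query inference cost
queries_served = 1e11     # total queries over the model's deployed lifetime

# Spending 10x more on training: the extra cost is paid once and amortized.
extra_per_query_training = 9 * train_cost / queries_served   # 0.009

# Spending 10x more on inference: the extra cost recurs on every query.
extra_per_query_inference = 9 * infer_cost                   # 0.09

print(extra_per_query_training, extra_per_query_inference)
# The gap grows with queries_served: amortized training compute keeps getting
# cheaper per query, while scaled-up inference compute never does.
```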
I'd participate.
I love this. Strong upvoted. I wonder if there's a "silent majority" of folks who would tend to post (and upvote) reasonable things, but don't bother because "everyone knows there's no point in trying to have a civil discussion on Twitter".
Might there be a bit of a collective action problem here? Like, we need a critical mass of reasonable people participating in the discussion so that reasonable participation gets engagement and thus the reasonable people are motivated to continue? I wonder what might be done about that.
Yes, I've felt some silent majority patterns.
Collective action problem idea: we could run an experiment -- 30 ppl opt in to writing 10 comments and liking 10 comments they think raise the sanity waterline, conditional on a total of 29 other people opting in too. (A "kickstarter".) Then we see if it seemed like it made a difference.
I'd join. If anyone else is also down for that, feel free to use this comment as a Schelling point and reply with your interest below.
(I'm not sure the right number of folks, but if we like the result we could just do another round.)
I think we're saying the same thing? "The LLM being given less information [about the internal state of the actor it is imitating]" and "the LLM needs to maintain a probability distribution over possible internal states of the actor it is imitating" seem pretty equivalent.
As I go about my day, I need to maintain a probability distribution over states of the world. If an LLM tries to imitate me (i.e. repeatedly predict my next output token), it needs to maintain a probability distribution, not just over states of the world, but also over my internal state (i.e. the state of the agent whose outputs it is predicting). I don't need to keep track of multiple states that I myself might be in, but the LLM does. Seems like that makes its task more difficult?
Or to put an entirely different frame on the whole thing: the job of a ...
I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:
All of this is plausible, but I'd encourage you to go through the exercise of working out these ideas in more detail. It'd be interesting reading and you might encounter some surprises / discover some things along the way.
Note, for example, that the AGIs would be unlikely to focus on AI research and self-improvement if there were more economically valuable things for them to be doing, and if (very plausibly!) there were not more economically valuable things for them to be doing, why wouldn't a big chunk of the 8 billion humans have been working on AI resea...
Can you elaborate? This might be true but I don't think it's self-evidently obvious.
In fact it could in some ways be a disadvantage; as Cole Wyeth notes in a separate top-level comment, "There are probably substantial gains from diversity among humans". 1.6 million identical twins might all share certain weaknesses or blind spots.
Assuming we require a performance of 40 tokens/s, the training cluster can run concurrent instances of the resulting 70B model
Nit: you mixed up 30 and 40 here (should both be 30 or both be 40).
I will assume that the above ratios hold for an AGI level model.
If you train a model with 10x as many parameters, but use the same training data, then it will cost 10x as much to train and 10x as much to operate, so the ratios will hold.
In practice, I believe it is universal to use more training data when training larger models? Imply...
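Spelling out the arithmetic with the usual rough approximations (training ~ 6·N·D FLOPs, inference ~ 2·N FLOPs per generated token; the 70B / 1.4T-token baseline is just for illustration):

```python
# Standard rough approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs
# per generated token. The 70B / 1.4T-token baseline is only for illustration.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params):
    return 2 * n_params

N, D = 70e9, 1.4e12

ratio_base   = train_flops(N, D) / infer_flops_per_token(N)                  # 3*D
ratio_same_d = train_flops(10 * N, D) / infer_flops_per_token(10 * N)        # still 3*D: ratio holds
ratio_more_d = train_flops(10 * N, 10 * D) / infer_flops_per_token(10 * N)   # 30*D: ratio shifts 10x

print(ratio_base, ratio_same_d, ratio_more_d)
```

So if data scales along with parameters, training cost grows faster than inference cost, and the ratio doesn't hold.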
They do mention a justification for the restrictions – "to maintain consistency across cells". One needn't agree with the approach, but it seems at least to be within the realm of reasonable tradeoffs.
Nowadays of course textbooks are generally available online as well. They don't indicate whether paid materials are within scope, but of course that would be a question for paper textbooks as well.
What I like about this study is that the teams are investing a relatively large amount of effort ("Each team was given a limit of seven calendar weeks and no more t...
I recently encountered a study which appears aimed at producing a more rigorous answer to the question of how much use current LLMs would be in abetting a biological attack: https://www.rand.org/pubs/research_reports/RRA2977-1.html. This is still work in progress, they do not yet have results. @1a3orn I'm curious what you think of the methodology?
Imagine someone offers you an extremely high-paying job. Unfortunately, the job involves something you find morally repulsive – say, child trafficking. But the recruiter offers you a pill that will rewrite your brain chemistry so that you'll no longer find it repulsive. Would you take the pill?
I think that pill would reasonably be categorized as "updating your goals". If you take it, you can then accept the lucrative job and presumably you'll be well positioned to satisfy your new/remaining goals, i.e. you'll be "happy". But you'd be acting against your pr...
Likewise, thanks for the thoughtful and detailed response. (And I hope you aren't too impacted by current events...)
...I agree that if no progress is made on long-term memory and iterative/exploratory work processes, we won't have AGI. My position is that we are already seeing significant progress in these dimensions and that we will see more significant progress in the next 1-3 years. (If 4 years from now we haven't seen such progress I'll admit I was totally wrong about something). Maybe part of the disagreement between us is that the stuff you think are me
Oooh, I should have thought to ask you this earlier -- what numbers/credences would you give for the stages in my scenario sketched in the OP? This might help narrow things down. My guess based on what you've said is that the biggest update for you would be Step 2, because that's when it's clear we have a working method for training LLMs to be continuously-running agents -- i.e. long-term memory and continuous/exploratory work processes.
This post taught me a lot about different ways of thinking about timelines, thanks to everyone involved!
I’d like to offer some arguments that, contra Daniel’s view, AI systems are highly unlikely to be able to replace 99% of current fully remote jobs anytime in the next 4 years. As a sample task, I’ll reference software engineering projects that take a reasonably skilled human practitioner one week to complete. I imagine that, for AIs to be ready for 99% of current fully remote jobs, they would need to be able to accomplish such a task. (That specific cate...
Thanks for this thoughtful and detailed and object-level critique! Just the sort of discussion I hope to inspire. Strong-upvoted.
Here are my point-by-point replies:
...Of course there are workarounds for each of these issues, such as RAG for long-term memory, and multi-prompt approaches (chain-of-thought, tree-of-thought, AutoGPT, etc.) for exploratory work processes. But I see no reason to believe that they will work sufficiently well to tackle a week-long project. Briefly, my intuitive argument is that these are old school, rigid, GOFAI, Software 1.0 sorts o
Thanks for the thoughtful and detailed comments! I'll respond to a few points, otherwise in general I'm just nodding in agreement.
I think it's important to emphasize (a) that Davidson's model is mostly about pre-AGI takeoff (20% automation to 100%) rather than post-AGI takeoff (100% to superintelligence), but it strongly suggests that the latter will be very fast relative to what most people naively expect: probably on the order of weeks, and very likely less than a year.
And it's a good model, so we need to take this seriously. My only quibble would be to r...
So to be clear, I am not suggesting that a foom is impossible. The title of the post contains the phrase "might never happen".
I guess you might reasonably argue that, from the perspective of (say) a person living 20,000 years ago, modern life does in fact sit on the far side of a singularity. When I see the word 'singularity', I think of the classic Peace War usage of technology spiraling to effectively infinity, or at least far beyond present-day technology. I suppose that led me to be a bit sloppy in my use of the term.
The point I was trying to make by r...
OK, having read through much of the detailed report, here's my best attempt to summarize my and Davidson's opinions. I think they're mostly compatible, but I am more conservative regarding the impact of RSI in particular, and takeoff speeds in general.
My attempt to summarize Davidson on recursive self-improvement
AI will probably be able to contribute to AI R&D (improvements to training algorithms, chip design, etc.) somewhat ahead of its contributions to the broader economy. Taking this into account, he predicts that the "takeoff time" (transition from...
Thanks, I appreciate the feedback. I originally wrote this piece for a less technical audience, for whom I try to write articles that are self-contained. It's a good point that if I'm going to post here, I should take a different approach.
You don't need a new generation of fab equipment to make advances in GPU design. A lot of the improvements of the last few years didn't come from constantly having a new generation of fab equipment.
Ah, by "producing GPUs" I thought you meant physical manufacturing. Yes, there has been rapid progress of late in getting more FLOPs per transistor for training and inference workloads, and yes, RSI will presumably have an impact here. The cycle time would still be slower than for software: an improved model can be immediately deployed to all existing GPUs, while an improved GPU design only impacts chips produced in the future.
Thanks. I had seen Davidson's model, it's a nice piece of work. I had not previously read it closely enough to note that he does discuss the question of whether RSI is likely to converge or diverge, but I see that now. For instance (emphasis added):
...We are restricting ourselves only to efficiency software improvements, i.e. ones that decrease the physical FLOP/s to achieve a given capability. With this restriction, the mathematical condition for a singularity here is the same as before: each doubling of cumulative inputs must more than d
I'll try to summarize your point, as I understand it:
Intelligence is just one of many components. If you get huge amounts of intelligence, at that point you will be bottlenecked by something else, and even more intelligence will not help you significantly. (Company R&D doesn't bring a "research explosion".)
The core idea I'm trying to propose (but seem to have communicated poorly) is that the AI self-improvement feedback loop might (at some point) converge, rather than diverging. In very crude terms, suppose that GPT-8 has IQ 180, and we use ten million...
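One way to make the converge-vs-diverge question concrete (a toy model, not a claim about actual returns to self-improvement): if each generation's capability gain is a constant multiple of the previous generation's gain, everything hinges on whether that multiple is below or above 1.

```python
# Toy model, purely illustrative: each generation's capability gain is
# `feedback` times the previous generation's gain.
def run_loop(initial_capability=100.0, initial_gain=10.0, feedback=0.7, generations=30):
    capability, gain = initial_capability, initial_gain
    for _ in range(generations):
        capability += gain
        gain *= feedback
    return capability

print(run_loop(feedback=0.7))  # ~133: converges toward 100 + 10/(1 - 0.7)
print(run_loop(feedback=1.3))  # ~87,000 and climbing: the loop runs away
```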
I agree that the AI cannot improve literally forever. At some moment it will hit a limit, even if that limit is simply that it has already become near perfect, so there is nothing left to improve, or the tiny remaining improvements would not be worth their cost in resources. So, an S-curve it is, in the long term.
But for practical purposes, the bottom part of the S-curve looks similar to the exponential function. So if we happen to be near that bottom, it doesn't matter that the AI will hit some fundamental limit on self-improvement around 2200 AD, if it already successfully wip...
Oops, I forgot to account for the gap from a 50% success rate to an 80% success rate (and actually I'd argue that the target success rate should be higher than 80%).
Also potential factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.