All of Thomas Kwa's Comments + Replies

Thomas KwaΩ120

Cassidy Laidlaw published a great paper at ICLR 2025 which proved (their Theorem 5.1) that the gap between proxy reward and true reward is bounded given a minimum proxy-true correlation and a maximum chi-squared divergence from the reference policy. Basically, chi-squared divergence works where KL divergence doesn't.
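
For intuition, here is a standard change-of-measure inequality with the same flavor as the result (this is a generic Cauchy-Schwarz bound, not the paper's actual Theorem 5.1). Writing $d_\pi$ for a policy's state-action distribution:

\[
\Big|\,\mathbb{E}_{\pi}[R_{\text{true}} - R_{\text{proxy}}] - \mathbb{E}_{\pi_{\text{ref}}}[R_{\text{true}} - R_{\text{proxy}}]\,\Big| \;\le\; \sqrt{\chi^2\!\big(d_\pi \,\|\, d_{\pi_{\text{ref}}}\big)}\;\sqrt{\operatorname{Var}_{d_{\pi_{\text{ref}}}}\!\big(R_{\text{true}} - R_{\text{proxy}}\big)}.
\]

A high proxy-true correlation on the reference distribution keeps the variance factor small, and the chi-squared constraint keeps the divergence factor small; KL divergence doesn't control the second moment of the likelihood ratio, which is why it can't give a bound of this form.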

Using this in practice for alignment is still pretty restrictive-- the fact that the new policy can’t be exponentially more likely to achieve any state than the reference policy means this will probably only be useful in cases where the reference policy is already intelligent/capable.

Yes, in particular the concern about benchmark tasks being well-specified remains. We'll need both more data (probably collected from AI R&D tasks in the wild) and more modeling to get a forecast for overall speedup.

However, I do think that if we have a wide enough distribution of tasks, AIs outperform humans on all of them at task lengths that should imply humans spend 1/10th the labor, and yet AI R&D has not been automated, then something strange must be happening. So looking at different benchmarks is partial progress towards understanding the gap between long time horizons on METR's task set and actual AI R&D uplift.

Thomas KwaΩ230

What's an example of a claim it might be difficult/impossible to find a stable argument for?

6Beth Barnes
IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree. iiuc, the stability definition used here is pretty strong - says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is out. 
Thomas Kwa*134

American drones are very expensive. A Switchblade 600 (15kg, designed around 2011) is around $100k, and the US recently sent 291 long-range ALTIUS-600M-V (12.2kg) to Taiwan for $300M indicating a unit cost of $1M. So $1T would only buy 1 million of the newer design, at least for export. Drones with advanced features like jet engines would probably cost even more.

Ukraine produced 2.2 million drones in 2024, and its 2025 production goal is 4.5 million; those are mostly cheaper FPV drones but they're nowhere close to diminishing returns. In fact it's not clea... (read more)

I remembered a source claiming that the cheaper variants of Switchblades cost around $6,000. But I looked into it and this seems to just be an error. Some sources claim this, but more commonly sources claim ~$60,000. (Close to your $100k claim.)

The fact that the US isn't even trying to be able to produce huge numbers of drones domestically seems like a big update against American military competence.

In the drone race, I think quantity is very important for several reasons:

  • While things like tanks and artillery are only useful as a complement to manpower, making quality the only way to increase their effectiveness, militaries can effectively use a huge number of drones per human soldier if the drones are either AI-piloted or expendable. Effectiveness will always increase with production volume if the intensity of the conflict is high.
  • American IFVs and tanks cost something like $100-$200/kg, and artillery shells something like $20/kg, but American drones range
... (read more)
6habryka
I agree that quantity is important, though there clearly is some threshold beyond which there are diminishing returns (though I am not confident it's within the range that's plausible).  American defense spending is approximately $1T, and that is in times of peace, so even if each drone ends up costing $10,000, we could afford a drone army of one hundred million drones, if we made it the defense strategic priority[1].  And even if they cost $100k each, that's still 10 million drones, which is plausibly beyond the threshold where returns to quantity have substantially diminished. I think the US government just really has a lot of money to spend on defense, and so you can have a huge amount of even very expensive drones. 1. ^ I am assuming here you either increase defense spending when it becomes important, or you stock up over a few years, and so total spending on the drone army is roughly proportional to annual spending. 

He also says that Chinese drones are low quality and Ukraine is slightly ahead of Russia.

Thomas Kwa*Ω8206

Great paper, this is hopeful for unlearning being used in practice.

I wonder if UNDO would stack with circuit discovery or some other kind of interpretability. Intuitively, localizing the noise in the Noise phase to weights that disproportionally contribute to the harmful behavior should get a better retain-forget tradeoff. It doesn't need to be perfect, just better than random, so it should be doable with current methods.
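
As a very rough sketch of what I have in mind (this is not anything from the UNDO paper; the attribution scores and names here are hypothetical and would come from whatever interpretability method is used):

```python
import torch

def targeted_noise_(model, attribution, top_frac=0.05, noise_scale=0.1):
    """Add Gaussian noise only to the parameters most implicated in the
    harmful behavior, instead of noising all weights uniformly.

    attribution: dict mapping parameter name -> per-weight score tensor
                 (higher = more implicated), same shape as the parameter.
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            scores = attribution[name]
            # Keep only the top `top_frac` fraction of weights by score.
            k = max(1, int(top_frac * scores.numel()))
            threshold = scores.flatten().topk(k).values.min()
            mask = (scores >= threshold).float()
            # Scale noise to the parameter's own typical magnitude.
            noise = noise_scale * param.abs().mean() * torch.randn_like(param)
            param.add_(mask * noise)
```

A mask like this only has to beat uniform noise on the retain-forget tradeoff to be worth doing.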

6Bruce W. Lee
Thanks for the suggestion. Upon reflection, it seems to me that the success of targeted noising would depend on two complementary factors: C1. Size of the unlearning target - How broad the capability is in human-understandable terms  C2. Entangledness of the unlearning target - How distributed the capability is across the model's weights Robust unlearning gets easier as both C1 and C2 decrease. There's likely a threshold beyond which unlearning becomes effectively impossible as these factors increase. Note that C1 is a rough measure of C2 but should be considered independently of C2. Rationale: Mech Interp has produced good evidence that factual recall (small C1) is often localized to specific parts (small C2), making it an ideal target for selective noising. However, more general capabilities like deception would likely have high values for both C1 and C2, as they require multiple intertwined sub-capabilities. For instance, deception might require simultaneously computing: (1) the true state of affairs, (2) plausible alternatives, (3) what others believe, and (4) how to optimize output to manipulate those beliefs. Looking Forward: Could targeted UNDO help disentangle general intelligence from potentially harmful capabilities that seem deeply intertwined during training? For example, if we could selectively remove deception while preserving general intelligence, it would be a significant win. The challenge is that many harmful capabilities might be implemented as a superset of the same linear features as benign ones.

It's not generally possible to cleanly separate assets into things valued for stage 1 reasons vs other reasons. You may claim these are edge cases but the world is full of edge cases:

  • Apples are primarily valued on taste; other foods even more so (nutrition is more efficiently achieved via bitter greens, rice, beans, and vitamins than a typical Western diet). You can tell because Honeycrisp apples are much higher priced than lower-quality apples despite being nutritionally equivalent. But taste is highly subjective so value is actually based on weighted pop
... (read more)
1metachirality
I pretty much agree with this. I just posted this as a way of illustrating how simulacrum stages could be generalized to be more than just about signalling and language. In a way, even stocks are stage 4 since they cash out in currency, so that stuff can be one stage in one way but another stage in another way.
Thomas Kwa*170

I don't run the evaluations but probably we will; no timeframe yet though as we would need to do elicitation first. Claude's SWE-bench Verified scores suggest that it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.

3Aaron Staley
That's a bit higher than I would have guessed.  I compared the known data points that have SWE-bench and METR medians (sonnet 3.5,3.6,3.7, o1, o3, o4-mini) and got an r^2 = 0.96 model assuming linearity between log(METR_median) and log(swe-bench-error). That gives an estimate more like 110 minutes for an Swe-bench score of 72.7%. Which works out to a sonnet doubling time of ~3.3 months.   (If I throw out o4-mini, estimator is ~117 minutes.. still below 120) Also would imply an 85% swe-bench score is something like a 6-6.5 hour METR median.
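
For concreteness, the fit described above has roughly this shape (coefficients left as placeholders rather than the actual fitted values):

```python
import numpy as np

# Hypothetical fitted coefficients for
#   log(metr_median_minutes) = a + b * log(1 - swe_bench_score)
# i.e. METR 50% horizon vs. SWE-bench *error*, both on log scales.
a, b = 0.0, -1.0  # placeholders, not the actual fitted values

def predict_metr_median_minutes(swe_bench_score: float) -> float:
    """Predict the METR 50% time horizon (minutes) from a SWE-bench Verified score."""
    return float(np.exp(a + b * np.log(1.0 - swe_bench_score)))

# e.g. predict_metr_median_minutes(0.727) would reproduce the ~110-minute
# figure above for whatever (a, b) the actual regression yields.
```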

There is a decreasing curve of Gemini success probability vs average human time on questions in the benchmark, and the curve intersects 50% at roughly 110 minutes.

Basically it's trying to measure the same quantity as the original paper (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) but the numbers are less accurate since we have less data for these benchmarks.
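
A minimal sketch of how that 50% crossing can be computed, assuming a logistic model in log human time (variable names are mine, not METR's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fifty_percent_horizon(human_minutes, successes):
    """Fit P(success) as a logistic function of log(human time) and return
    the human time at which the predicted success probability is 50%."""
    X = np.log(np.asarray(human_minutes, dtype=float)).reshape(-1, 1)
    y = np.asarray(successes)  # 1 = model answered correctly, 0 = it didn't
    clf = LogisticRegression().fit(X, y)
    # P = 0.5 where the linear term is zero: coef * log(t) + intercept = 0.
    return float(np.exp(-clf.intercept_[0] / clf.coef_[0][0]))
```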

oops, this was on my work account from which you can't make public links. Replaced the link with the prompt and beginning of o3 output.

o3 has the same conclusion with a slightly different prompt. 

Read this comment exchange and come to a definitive conclusion about whether Garrett Baker is accurately representing Matthew. Focus on content rather than tone:

 

Conclusion: Garrett is not accurately representing Matthew’s position.
Below is a point‑by‑point comparison that shows where Garrett’s paraphrases diverge from what Matthew is actually claiming (ignoring tone and focusing only on the content).

1Guive
That link seems to be broken.  ETA: now fixed by Thomas. 
Thomas Kwa*Ω240

There was a unit conversion mistake, it should have been 80 minutes. Now fixed.

Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions and it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts but would take longer without transcripts, unknown how long.

In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).

5ryan_greenblatt
No captions feels very unnatural because both llms and humans could first apply relatively dumb speech to text tools.
Thomas KwaΩ250

I would love to have Waymo data. It looks like it's only available since September 2024 so I'll still need to use Tesla for the earlier period. More critically they don't publish disengagement data, only crash/injury. There are Waymo claims of things like 1 disengagement every 17,000 miles but I don't believe them without a precise definition for what this number represents.

Thomas Kwa*Ω411238

Cross-domain time horizon: 

We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here's a preliminary result comparing METR's task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:

Observations

  • Time horizons on agentic computer use (OSWorld) are ~100x shorter than in other domains. Domains like Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), and math con
... (read more)
Thomas KwaΩ2100

New graph with better data, formatting still wonky though. Colleagues say it reminds them of a subway map.

With individual question data from Epoch, and making an adjustment for human success rate (adjusted task length = avg human time / human success rate), AIME looks closer to the others, and it's clear that GPQA Diamond has saturated.
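
A worked instance of the adjustment, with made-up numbers:

\[
\text{adjusted task length} = \frac{\text{avg human time}}{\text{human success rate}}, \qquad \text{e.g. } \frac{30\ \text{min}}{0.6} = 50\ \text{min},
\]

so questions that humans often fail count as longer than their raw solve time.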

2Bogdan Ionut Cirstea
I would love to see an AI safety R&D category.  My intuition is that quite a few crucial AI safety R&D tasks are probably much shorter-horizon than AI capabilities R&D, which should be very helpful for automating AI safety R&D relatively early. E.g. the compute and engineer-hours time spent on pretraining (where most capabilities [still] seem to be coming from) are a-few-OOMs larger than those spent on fine-tuning (where most intent-alignment seems to be coming from).
4StanislavKrym
For some reason, all current benchmarks, with the sole exception of OSWorld[1], now seem to differ by a factor of less than 3. Does this imply that the progress in every benchmark is likely to slow down? 1. ^ OSWorld resembles a physical task, which LLMs tend to fail. However, the article about LLMs failing basic physical tasks was written in April 14, before the pre-release of Gemini Diffusion. Mankind has yet to determine how well diffusion-based LLMs deal with physical tasks. 

Can you explain what a point on this graph means? Like, if I see Gemini 2.5 Pro Experimental at 110 minutes on GPQA, what does that mean? It takes an average bio+chem+physics PhD 110 minutes to get a score as high as Gemini 2.5 Pro Experimental?

3eggsyntax
Just to be sure: as in the METR results, 'horizon' here means 'the time needed to complete the task for humans with appropriate expertise', correct? I assume so but it would be useful to make that explicit (especially since many people who skimmed the METR results initially got the impression that it was 'the time needed for the model to complete the task').

I'm skeptical and/or confused about the video MME results:

  • You show Gemini 2.5 Pro's horizon length as ~5000 minutes or 80 hours. However, the longest videos in the benchmark are 1 hour long (in the long category they range from 30 min to 1 hr). Presumably you're trying to back out the 50% horizon length using some assumptions, and then because Gemini 2.5 Pro's performance is 85%, you back out an 80-160x multiplier on the horizon length! This feels wrong/dubious to me if it is what you are doing (a rough version of such a back-out is sketched below).
  • Based on how long these horizon lengths are, I'm guessing you
... (read more)
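
For reference, here is what a back-out of that kind could look like under a logistic-in-log-time assumption; I don't know whether this matches what was actually done, and the slope and numbers are purely illustrative:

```python
import numpy as np

def backed_out_horizon(success_rate, task_minutes, slope=1.0):
    """Back out a 50% time horizon from a single success rate on tasks of
    known human length, assuming P(success) = sigmoid(-slope * (log t - log h)).

    The slope is an assumption, and the resulting multiplier is extremely
    sensitive to it.
    """
    logit = np.log(success_rate / (1 - success_rate))
    return task_minutes * np.exp(logit / slope)

# 85% on ~60-minute videos gives a ~5.7x multiplier with slope=1;
# much shallower assumed slopes give the 80-160x multipliers questioned above.
```
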
4Unnamed
The 'regression to the mean' pattern is striking: domains with a lower starting point have been growing faster, and those that started with a longer horizon have mostly been growing slower. I wonder if that pattern of catchup growth & drag on leaders will mostly hold up over more time and with a larger set of task types. 
2dschwarz
This expanded list is great, but is still conspicuously missing white-collar work. Software was already the basis for the trend, so the only new one here that seems to give clear information on human labor impacts would be tesla_fsd. (And even there replacing human drivers with AI drivers doesn't seem like it would change much for humanity, compared to lawyers/doctors/accountants/sales/etc.) Is it the case that for most non-software white-collar work, agents can only do ~10-20 human-minute tasks with any reliability, so the doubling time is hard to measure?
9Daniel Kokotajlo
I wish I had thought to blind myself to these results and try to predict them in advance. I think I would have predicted that Tesla self-driving would be the slowest  and that aime would be the fastest. Not confident though. (Solving difficult math problems is just about the easiest long-horizon task to train for,* and in the last few months we've seen OpenAI especially put a lot of effort into training this.) *Only tokens, no images. Also no need for tools/plugins to the internet or some code or game environment. Also you have ground-truth access to the answers, it's impossible to reward hack.
ryan_greenblattΩ11172

but bad at acting coherently. Most work requires agency like OSWorld, which may be why AIs can't do the average real-world 1-hour task yet.

I'd have guessed that poor performance on OSWorld is mostly due to poor vision and mouse manipulation skills, rather than insufficient ability to act coherently.

I'd guess that typical self-contained 1-hour tasks (as in, a human professional could do them in 1 hour with no prior context except context about the general field) also often require vision or non-text computer interaction, and if they don't, I bet the AIs actually do pretty well.

2Mo Putera
Nice, this jibes with my impression of all the LLM Plays Pokemon findings; I'd have been surprised if it were otherwise.
8lemonhope
You could add cooking tasks with robots.
1Jonas Hallgren
I really like this direction! It feels a bit like looking at other data to verify the trend lines which is quite nice. I was wondering if there's an easy way for you to look at the amount of doubling per compute/money spent over time for the different domains to see if the differences are even larger? It might be predictive as well since if we can see that tesla has spent a lot on self-driving but haven't been able to make progress compared to the rest that might give us information that the task is harder than others. I think Vladimir Nesov wrote somewhere about different investment thresholds being dependent on capabilities return so that would be very interesting to see an analysis of! (What the doubling per compute says about different investment strategies as different phases and it being an important variable for determining investment phase transitions e.g bear or bull market.) 
Thomas Kwa*151

I'd guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It's just a box with fans and slots for filters, the installation labor is significantly lower, you get better aesthetics, and not everyone has 52 inch ceiling fans at a standardized mounting length already so the market is potentially much larger with a standalone device.

The problem with the ceiling fan is that it's not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ra... (read more)

Thomas Kwa*180

The uplift equation:

What is required for AI to provide net speedup to a software engineering project, when humans write higher quality code than AIs? It depends how it's used.

Cursor regime

In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write:

Where

  •  is the time for the human to write the code, either from scratch or after rejecting an AI suggestion
  •  is the time for the AI to generate the code in tokens per secon
... (read more)
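
The equation itself doesn't survive in this excerpt. One plausible form of the accounting described, with hypothetical variable names rather than the original ones:

\[
T_{\text{assisted}} \;=\; \underbrace{\frac{n_{\text{tokens}}}{v_{\text{AI}}}}_{\text{AI generation}} \;+\; \underbrace{t_{\text{review}}}_{\text{human reads every line}} \;+\; (1 - p_{\text{accept}})\, t_{\text{human}}, \qquad \text{net speedup} \iff T_{\text{assisted}} < t_{\text{human}},
\]

where $t_{\text{human}}$ is the time for the human to write the code themselves, $v_{\text{AI}}$ is the AI's generation speed in tokens per second, and $p_{\text{accept}}$ is the probability the suggestion is accepted.
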
1np_x
This calculus changes when you can work on many things at once (similar to @faul_sname's comment but this might be that you can work on many projects at once, even if they can't each be parallelized well).  
6faul_sname
Unless the work can be parallelized.

Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the paper? From the abstract of that paper:

Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better executio... (read more)

Yeah, I completely agree this is a good research direction! My only caveat is I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which to be clear, your paper appears to make a serious effort to address) is that you either:

  1. Train the model in a realistic way - but then you don’t know if the model really learned the algorithm you expected it to
  2. Train the model to force it to learn a particular algorithm -
... (read more)
Thomas Kwa*501

Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.

Now if the... (read more)

Thomas Kwa*330

I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems to be slightly smaller (could be noise), and also Claude 3.7 does worse than o1. This means the doubling rate for pass@8 may be slightly slower than for pass@1. However, if horizon length increase since 2023 were only due to RL, the improvement from pass@8 would be barely half as much as the improvement in pass@1. It's faster, which could be due to some combination of the following:

  • o1 has a better base model than GPT-4
  • HCAST is an example of "emergent capabilitie
... (read more)
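
One way to see why gains that come purely from RL should show up less at pass@8: in a toy model with independent attempts (which RL-sharpened models violate, but the direction of the effect is the same), pass@8 heavily compresses improvements in per-attempt success.

```python
def pass_at_k(p1: float, k: int = 8) -> float:
    """Pass@k under the toy assumption of k independent attempts,
    each succeeding with per-attempt probability p1."""
    return 1 - (1 - p1) ** k

# A large jump in pass@1 translates into a much smaller jump in pass@8:
print(pass_at_k(0.2), pass_at_k(0.5))  # ~0.83 -> ~0.996
```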

Agree, I'm pretty confused about this discrepancy. I can't rule out that it's just the "RL can enable emergent capabilities" point.

1Cole Wyeth
I am not confused, the results of this paper are expected on my model. 
Thomas KwaΩ7104

The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.

Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It's not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon. 

  • Releas
... (read more)
7Thane Ruthenis
Fair, also see my un-update edit. Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I'd previously complained, I don't think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.

o3 and o4-mini solve more than zero of the >1hr tasks that claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it's not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.

1Cole Wyeth
Okay, that’s a lot more convincing. Was it available publicly? I seem to have missed it. Surprisingly high success probability on a 16 hour task - does this just mean it made some progress on the 16 hour task? Pokémon fell today too, we might be cooked. 

Other metrics also point to drone-dominated and C&C dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use "shoot and scoot" tactics-- it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.

Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:

  • In my everyday, the gap between well-defined task ability and working with the METR codebase is growing
  • 4 month doubling time is faster than the rate of progress in most other realistic or unrealistic domains
  • Recent models really like to reward hack, suggesting that RL can cause some behaviors not relevant to realistic tasks

This trend will break at some point, eg when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when

Thomas Kwa*160

GDM paper: Evaluating the Goal-Directedness of Large Language Models

Tom Everitt, Rohin Shah, and others from GDM attempt to measure "whether LLMs use their capabilities towards their given goal". Unlike previous work, their measure is not just rescaled task performance-- rather, an AI is goal-directed if it uses its capabilities effectively. A model that is not goal-directed when attempting a task will have capabilities but not properly use them. Thus, we can measure goal-directedness by comparing a model's actual performance to how it should perform if it... (read more)

2Daniel Tan
Interesting paper. Quick thoughts: * I agree the benchmark seems saturated. It's interesting that the authors frame it the other way - Section 4.1 focuses on how models are not maximally goal-directed. * It's unclear to me how they calculate the goal-directedness for 'information gathering', since that appears to consist only of 1 subtask. 

Time horizon of o3 is ~1.5 hours vs Claude 3.7's 54 minutes, and it's statistically significant that it's above the long-term trend. It's been less than 2 months since the release of Claude 3.7. If time horizon continues doubling every 3.5 months as it has over the last year, we only have another 12 months until time horizon hits 16 hours and we are unable to measure it with HCAST.
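
As a quick check of the arithmetic behind that 12-month figure:

\[
\log_2\!\left(\frac{16\ \text{hr}}{1.5\ \text{hr}}\right) \approx 3.4\ \text{doublings}, \qquad 3.4 \times 3.5\ \text{months} \approx 12\ \text{months}.
\]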

My guess is that future model time horizon will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.

1snewman
What's your basis for "well-defined tasks" vs. "realistic tasks" to have very different doubling times going forward? Is the idea that the recent acceleration seems to be specifically due to RL, and RL will be applicable to well-defined tasks but not realistic tasks? This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I'd be very interested.

Benchmark Readiness Level

Safety-relevant properties should be ranked on a "Benchmark Readiness Level" (BRL) scale, inspired by NASA's Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:

BRL | Definition | Example
1 | Theoretical relevance to x-risk defined | Adversarial competence
2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; Misaligned goals
3 | Behavior (or all parts) observed i
... (read more)
Thomas Kwa*Ω37800

Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep... (read more)

1wonder
Would the take over for small countries also about humans using just an advanced AI for taking over? (or would the human using advanced AI for take over happen faster?)
BuckΩ14280

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Quick list of reasons for me:

  • I'm averse to attending mass protests myself because they make it harder to think clearly and I usually don't agree with everything any given movement stands for.
  • Under my worldview, an unconditional pause is a much harder ask than is required to save most worlds if p(doom) is 14% (the number stated on the website). It seems highly impractical to implement compared to more common regulatory frameworks and is also super unaesthetic because I am generally pro-progress.
  • The economic and political landscape around AI is complicated e
... (read more)
Thomas Kwa*Ω580

I basically agree with this. The reason the paper didn't include this kind of reasoning (only a paragraph about how AGI will have infinite horizon length) is that we felt making a forecast based on a superexponential trend would be too much speculation for an academic paper. (There is really no way to make one without heavy reliance on priors; does it speed up by 10% per doubling or 20%?) It also wasn't necessary, given that the 2027 and 2029-2030 dates for 1-month AI derived from extrapolation already roughly bracketed our uncertainty.

External validity is a huge concern, so we don't claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our tasks suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, horizons on this are probably within a large (~10x) constant factor of horizons on most other software tasks. We have a lot more discussion of this in the paper, especially in heading 7.2.1 "Systematic differences between our tasks and real tasks". The HCAST paper ... (read more)

Humans don't need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn't a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.

Regarding 1 and 2, I basically agree that SWAA doesn't provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. 3 is definitely a concern and we're currently collecting data on open-source PRs to get a more representative sample of long tasks.

That bit at the end about "time horizon of our average baseliner" is a little confusing to me, but I understand it to mean "if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can't reliably perform tasks that take longer than an hour". Which is a pretty interesting point.

That's basically correct. To give a little more context for why we don't really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time.... (read more)

All models since at least GPT-3 have had this steep exponential decay [1], and the whole logistic curve has kept shifting to the right. The 80% success rate horizon has basically the same 7-month doubling time as the 50% horizon so it's not just an artifact of picking 50% as a threshold.

Claude 3.7 isn't doing better on >2 hour tasks than o1, so it might be that the curve is compressing, but this might also just be noise or imperfect elicitation.

Regarding the idea that autoregressive models would plateau at hours or days, it's plausible, and one point of... (read more)

gwern4816

One possible interpretation here is going back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increases from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which, after all, are just the same sort of thing - just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex... (read more)
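
To make that concrete (my formalization, not gwern's): with ceiling performance $c$ and per-step success probability $p$, an $n$-step task succeeds with probability

\[
P(\text{success}) = c\, p^{\,n} = c\, e^{-n/\lambda}, \qquad \lambda = -\frac{1}{\ln p},
\]

which, plotted against $\log n$ as in the horizon plots, is a sigmoid-like curve falling from $c$ to chance over roughly an order of magnitude around $n \approx \lambda$.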

It's expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines) so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally have human data. You could do a rough analysis on math contest problems, say assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME has much harder topics than GSM8K (we tried to make SWAA not be artificially easier or harder than HCAST); for another, neither are particularly close to the average few minutes of a research mathematician's job.

The trend probably sped up in 2024. If the future trend follows the 2024--2025 trend, we get 50% reliability at 167 hours in 2027.

Answer by Thomas Kwa30

Author here. My best guess is that by around the 1-month point, AIs will be automating large parts of both AI capabilities and empirical alignment research. Inferring anything more depends on many other beliefs.

Currently no one knows how hard the alignment problem is or what exactly good alignment research means-- it is the furthest-looking, least well-defined and least tractable of the subfields of AI existential safety. This means we don't know the equivalent task length of the alignment problem. Even more importantly, we only measured the AIs at softwar... (read more)

2Garrett Baker
You probably mention this somewhere, but I'll ask here, are you currently researching whether these results hold for those other domains? I'm personally more interested about math than law.

AIs (and humans) don't have 100% reliability at anything, so the graph tracks when AIs get a 50% success rate on our dataset, over all tasks and attempts. We also measure AI horizons at 80% success rate in the paper, and those are about 5x shorter. It's hard to measure much higher than 80% with our limited task suite, but if we could we would measure 95% and 99% as well.

2mattmacdermott
I think the commenter is asking something a bit different - about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1 hour long tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces? Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they’re selected to be measurable, etc

So the citation is an unreleased paper! That unreleased paper may make a splash, since (assuming this 7-month-doubling trend is not merely 1-2 years old) it strongly implies we really will find good solutions for turning LLMs agentic fairly soon.

The 7-month doubling trend we measured actually goes back to GPT-2 in 2019. Since 2024, the trend has been faster, doubling roughly every 3-4 months depending on how you measure, but we only have six 2024-25 models so error bars are wide and it's really unclear which trend will be predictive of the future.

Thomas KwaΩ342

Important results from the paper as I see it:

  • RL agents can learn to cheat in the wild, and CoT pressure doesn't prevent it.
  • The slope of cheating increase and final cheating rate are both slightly lower when applying CoT pressure, but all of the cheating will go uncaught by the monitor.
  • GPT-4o can zero-shot monitor o3-level agents with 95% accuracy when it has access to CoT, only 60% when it can only monitor actions.
  • If there is no CoT pressure, RL agents telegraph their cheating, such that cheating precision and recall actually go up over time for a monitor
... (read more)

I think we have two separate claims here:

  1. Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
  2. Do technologies often get displaced by technologies with a different lineage?

I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress fr... (read more)

5Thane Ruthenis
Indeed, and I'm glad we've converged on (2). But... ... On second thoughts, how did we get there? The initial disagreement was how plausible it was for incremental changes to the LLM architecture to transform it into a qualitatively different type of architecture. It's not about continuity-in-performance, it's about continuity-in-design-space. Whether finding an AGI-complete architecture would lead to a discontinuous advancement in capabilities, to FOOM/RSI/sharp left turn, is a completely different topic from how smoothly we should expect AI architectures' designs to change. And on that topic, (a) I'm not very interested in reference-class comparisons as opposed to direct gears-level modeling of this specific problem, (b) this is a bottomless rabbit hole/long-standing disagreement which I'm not interested in going into at this time. That's an interesting general pattern, if it checks out. Any guesses why that might be the case? My instinctive guess is the new-paradigm approaches tend to start out promising-in-theory, but initially very bad, people then tinker with prototypes, and the technology becomes commercially viable the moment it's at least marginally better than the previous-paradigm SOTA. Which is why there's an apparent performance-continuity despite a lineage/paradigm-discontinuity.

Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren't used.

Agree, this is one big limitation of the paper I'm working on at METR. The first two ideas you listed are things I would very much like to measure, and the third is something I would also like to measure but is much harder than any current benchmark, given that university takes humans years rather than hours. If we measure it right, we could tell whether generalization is steadily improving or plateauing.

Though the fully-connected -> transformer transition wasn't infinitely many small steps, it definitely wasn't a single step either. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today's transformer++. The most you could claim is a single step is LSTM -> transformer.

Also if you graph perplexity over time, there's basically no discontinuity from introducing transformers, just a possible change in slope that might be an artifact of switching from the purple to green measurement method.... (read more)

7Thane Ruthenis
This argument still seems to postdict that cars were invented by tinkering with carriages and horse-breeding, spacecraft was invented by tinkering with planes, refrigerators were invented by tinkering with cold cellars, et cetera. If you take the snapshot of the best technology that does X at some time T, and trace its lineage, sure, you'll often see the procession of iterative improvements on some concepts and techniques. But that line won't necessarily pass through the best-at-X technologies at times from 0 to T - 1. The best personal transportation method were horses, then cars. Cars were invented by iterating on preceding technologies and putting them together; but horses weren't involved. Similar for the best technology at lifting a human being into the sky, the best technology for keeping food cold, etc. I expect that's the default way significant technological advances happen. They don't come from tinkering with the current-best-at-X tech. They come from putting together a bunch of insights from different or non-mainstream tech trees, and leveraging them for X in a novel way. And this is what I expect for AGI. It won't come from tinkering with LLMs, it'll come from a continuous-in-retrospect, surprising-in-advance contribution from some currently-disfavored line(s) of research. (Edit: I think what I would retract, though, is the point about there not being a continuous manifold of possible technological artefacts. I think something like "the space of ideas the human mind is capable of conceiving" is essentially it.)
Thomas Kwa*3813

A continuous manifold of possible technologies is not required for continuous progress. All that is needed is for there to be many possible sources of improvements that can accumulate, and for these improvements to be small once low-hanging fruit is exhausted.

Case in point: the nanogpt speedrun, where the training time of a small LLM was reduced by 15x using 21 distinct innovations which touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, ... (read more)

6johnswentworth
I think you should address Thane's concrete example: That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.
7p.b.
Because these benchmarks are all in the LLM paradigm: Single input, single output from a single distribution. Or they are multi-step problems on rails. Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren't used. One could let models play new board/computer games against average humans: Video/image input, action output.  One could let models offer and complete tasks autonomously on freelancer platforms.  One could enrol models in remote universities and see whether they autonomously reach graduation.  It's not difficult to come up with hard benchmarks for current models (these are not close to AGI complete). I think people don't do this because they know that current models would be hopeless at benchmarks that actually aim for their shortcomings (agency, knowledge integration + integration of sensory information, continuous learning, reliability, ...)

I think eating the Sun is our destiny, both in that I expect it to happen and in that I would be pretty sad if we didn't; I just hope it will be done ethically. This might seem like a strong statement, but bear with me.

Our civilization has undergone many shifts in values as higher tech levels have indicated the sheer impracticality of living a certain way, and I feel okay about most of these. You won't see many people nowadays who avoid being photographed because photos steal a piece of their soul. The prohibition on women working outside the home, common in m... (read more)

Will we ever have Poké Balls in real life? How fast could they be at storing and retrieving animals? Requirements:

  • Made of atoms, no teleportation or fantasy physics.
  • Small enough to be easily thrown, say under 5 inches diameter
  • Must be able to disassemble and reconstruct an animal as large as an elephant in a reasonable amount of time, say 5 minutes, and store its pattern digitally
  • Must reconstruct the animal to enough fidelity that its memories are intact and it's physically identical for most purposes, though maybe not quite to the cellular level
  • No external
... (read more)
5Gurkenglas
All we need to create is a Ditto. A blob of nanotech wouldn't need 5 seconds to take the shape of the surface of an elephant and start mimicing its behavior; is it good enough to optionally do the infilling later if it's convenient?

and yet, the richest person is still only responsible for 0.1%* of the economic output of the united states.

Musk only owns 0.1% of the economic output of the US but he is responsible for more than this, including large contributions to

  • Politics
  • Space
    • SpaceX is nearly 90% of global upmass
    • Dragon is the sole American spacecraft that can launch humans to ISS
    • Starlink probably enables far more economic activity than its revenue
    • Quality and quantity of US spy satellites (Starshield has ~tripled NRO satellite mass)
  • Startup culture through the many startups from ex-Spac
... (read more)
4Joseph Miller
The Duke of Wellington said that Napoleon's presence on a battlefield “was worth forty thousand men”. This would be about 4% of France's military size in 1812.
2leogao
i'm happy to grant that the 0.1% is just a fermi estimate and there's a +/- one OOM error bar around it. my point still basically stands even if it's 1%. i think there are also many factors in the other direction that just make it really hard to say whether 0.1% is an under or overestimate. for example, market capitalization is generally an overestimate of value when there are very large holders. tesla is also a bit of a meme stock so it's most likely trading above fundamental value. my guess is most things sold to the public sector probably produce less economic value per $ than something sold to the private sector, so profit overestimates value produced the sign on net economic value of his political advocacy seems very unclear to me. the answer depends strongly on some political beliefs that i don't feel like arguing out right now. it slightly complicates my analogy for elon to be both the richest person in the us and also possibly the most influential (or one of). in my comment i am mostly referring to economic-elon. you are possibly making some arguments about influentialness in general. the problem is that influentialness is harder to estimate. also, if we're talking about influentialness in general, we don't get to use the 0.1% ownership of economic output as a lower bound of influentialness. owning x% of economic output doesn't automatically give you x% of influentialness. (i think the majority of other extremely rich people are not nearly as influential as elon per $)