IMO challenge bet with Eliezer

paulfchristiano

Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025. In honor of OpenAI's post Solving (Some) Formal Math Problems, it seems good to publicly state and clarify our predictions, have a final chance to adjust them, and say a bit in advance about how we'd update.

The predictions

Eliezer and I had an exchange in November 2021.^[1] My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:

I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)
Maybe I'll go 8% on "gets gold" instead of "solves hardest problem."

Eliezer spent less time revising his prediction, but said (earlier in the discussion):

My probability is at least 16% [on the IMO grand challenge falling], though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I'll stand by a >16% probability of the technical capability existing by end of 2025

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.

Separately, we have Paul at <4% of an AI able to solve the "hardest" problem under the same conditions.

I don't plan to revise my predictions further, but I'd be happy if Eliezer wants to do so any time over the next few weeks.

Earlier in the thread I clarified that my predictions are specifically about gold medals (and become even sharper as we move to harder problems), I am not surprised by silver or bronze. My guess would be that Eliezer has a more broad distribution. The comments would be a good place for Eliezer to state other predictions, or take a final chance to revise the main prediction.

How I'd update

The informative:

I think the IMO challenge would be significant direct evidence that powerful AI would be sooner, or at least would be technologically possible sooner. I think this would be fairly significant evidence, perhaps pushing my 2040 TAI probability up from 25% to 40% or something like that.
I think this would be significant evidence that takeoff will be limited by sociological facts and engineering effort rather than a slow march of smooth ML scaling. Maybe I'd move from a 30% chance of hard takeoff to a 50% chance of hard takeoff.
If Eliezer wins, he gets 1 bit of epistemic credit.^[2]^[3] These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I'll take what we can get.
This would be some update for Eliezer's view that "the future is hard to predict." I think we have clear enough pictures of the future that we have the right to be surprised by an IMO challenge win; if I'm wrong about that then it's general evidence my error bars are too narrow.

The uninformative:

This is mostly just a brute test of a particular intuition I have about a field I haven't ever worked in. It's still interesting (see above), but it doesn't bear that much on deep facts about intelligence (my sense is that Eliezer and I are optimistic about similar methods for theorem proving), or heuristics about trend extrapolation (since we have ~no trend to extrapolate), or on progress being continuous in crowded areas (since theorem proving investment has historically been low), or on lots of pre-singularity investment in economically important areas (since theorem proving is relatively low-impact). I think there are lots of other questions that do bear on these things, but we weren't able to pick out a disagreement on any of them.

If an AI wins a gold on some but not all of those years, without being able to solve the hardest problems, then my update will be somewhat more limited but in the same direction. If an AI wins a bronze/silver medal, I'm not making any of these updates and don't think Eliezer gets any credit unless he wants to stake some predictions on those lower bars (I consider them much more likely, maybe 20% for "bronze or silver" vs 8% on "gold," but that's less well-considered than the bets above, but I haven't thought about that at all).

^{^}
We also looked for claims that Eliezer thought were very unlikely, so that he'd also have an opportunity to make some extremely surprising predictions. But we weren't able to find any clean disagreements that would resolve before the end of days.
^{^}
I previously added the text: "So e.g. if Eliezer and I used to get equal weight in a mixture of experts, now Eliezer should get 2x my weight. Conversely, if I win then I should get 1.1x his weight." But I think that really depends on how you want to assign weights. That's a very natural algorithm that I endorse generally, but given that neither of us really has thought carefully about this question it would be reasonable to just not update much one way or the other.
^{^}
More if he chooses to revise his prediction up from 16%, or if he wants to make a bet about the "hardest problem" claim where I'm at 4%.

I set up a Manifold market for this, for anyone who'd like to bet along: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat

The concept of epistemic bits is quite interesting! In Manifold terms, that would be like comparing the total amount of M$ one has amassed (maybe in a specific community like the one for AI) and using them as weights for how seriously to take two commentators.

I worked through an example just to prove the math to myself:

Eliezer and Paul both start with 1000 M$.

For Eliezer to end up with 2x the epistemic weight, Paul wagers 333 M$ (so the end result would be 1333:667 or ~2:1).

Taking 9:1 odds (roughly geometric mean between 8% and 16%), Eliezer puts up 37 M$.

So if Paul wins, he ends up at 1037:963, or ~1.08:1

I bet the market down from 58% to 40% which seems like a steal on the merits. Seems like it will mostly lock up my money though given the long resolution timeline, unless people eventually settle down to a significantly lower consensus.

Yes -- this surfaces the issue with long-term markets, where it's not worth the investment lockup when you could be trading on issues with a shorter resolution time.

We have a few tentative proposals to address this:

Issue small interest-free loans on the first M$10 bet (Scott's proposal)
Margin loans against the value of of your long-term bets
Set up derivative markets (e.g. "What will the IMO market be at on April 1st 2022")
Demurrage, where we charge a fee for cash balances

I'm partial to Scott's proposal, though it may be technically tricky to implement...

I emphasize that we arrived at this concrete disagreement in the wake of my naming multiple sure-that-could-happen predictions of Paul's that I thought looked more likely in Paul's universe than my own, including:

Average Mike Blume-level programmers making $10M/year.
The world economy doubling in any four-year period before the world ends.
$10 trillion spent on training any AI model.

I'd be happy to do bet about fast growth prior to end of world. I'd say something like:

> 25% chance that there are 4 consecutive years before 2045 with >10% real economic growth per year, under an accounting scheme that feels plausible/real to Eliezer and Paul, and the fast growth sure feels to Paul and Eliezer like it's about automation rather than some weird thing like recovery from a pandemic.

You would win when the end of the world looks close enough that we can't get 4 years of progress (e.g. if we have 50% growth in one year with <10% progress in the year before, or if it becomes clear enough that we won't have time for rapid growth before the AI-R&D-AI gets us to a singularity). It's kind of unsatisfying not to pay out until right before the end with kind of arbitrary conditions (since it's so hard for you to actually win before 2045 unless the world ends).

I would give you higher probabilities if we relax 2045, but then it just gets even harder for you to win. Probably there is some better operationalization which I'm open to. Maybe a higher threshold than 10% growth for 4 years is better, but my probabilities will start to decline and I'd guess we get more expected info from disagreements about higher probability events.

(I don't know what kind of programmer Mike Blume is but don't expect to ever have many $10M+ programmers. Likewise, I don't think we have $10T spent training a single AI model until after we've already faced down alignment risk, and I assume that you also thinks that society will eventually use >>$10T of resources on training a single ML model. I would have thought we could also bet about softer versions of those things---I assume I also assign much higher probability than you do $1B training runs even if it's not worldview-falsifying, I don't think worldviews usually get falsified so much as modestly undermined.)

https://x.com/alexwei_/status/1946477742855532918

This doesn't count yet for the standards of the bet, I think, but this seems pretty close to "EY wins" to me? (The manifold market has jumped up to 93%.)

The market has now resolved to yes, with Paul confirming.

By the time we got here I feel like I've lost track of what the actual generating models of this disagreement were.

I guess I forgot that the whole reason* people seemed to conflate "slow timelines and smooth takeoff" because Paul seemed to believe in both, and

You might want to rethink the definition of "hardest problem" because I wouldn't expect the IMO committee to care particularly much that their problems are hard for machines as well as humans.

For example, if you look at the 2005 IMO, but suppose that problem 6 was geometry, so that the problem 3 was the "hardest", then you're in trouble. Problem 3 has a one-line proof "by sum_of_squares", assuming that someone has written a sum-of-squares tactic in time and that the sum-of-squares certificate is small enough to find within the time limit.

Alternatively, the IMO committee might set a problem without realizing that it's close to a special case of something already in mathlib.

As defined, I would put higher probability on solving the "hardest problem" than getting a gold medal and would bet on that.

Yeah, this may well be right. I agree that IMO 2015 problem 3 can be solved by machine. I think both the geometry and 3 variable inequalities could be a very big problem (in that lots of instances are easy for machines and hard for humans). Some diophantine equations and functional equations are also very easy for machine. By 2022 my vague understanding is that the committee is much less likely to put that kind of question in the hardest slot since increasingly many contestants can basically do the machine-like proofs. But I maybe shouldn't count on that, since the big problem is that we are letting it win at any of 4 IMOs (2022, 2023, 2024, 2025).

I'd guess we have like a 2-3% chance of 3-variable inequality in hardest spot (hasn't happened in last 20 years and is getting less likely, but not that confident). By now I think it would be considered unreasonable to put a muirhead bashable problem as 3/6 (though even the other 3 variable inequalities will likely be much easier for machines than humans), even in 2005 I think that was probably considered a mistake. But still lots of these may be easy for machines, so you may easily be looking at like 10% chance of one of them being a 3-variable inequality and a 50% chance of solving it conditioned on a serious effort which would already be >4%.

Other candidate definitions would have been:

Instead of just using the type of problem, try to explicitly flag constant-number-variables inequalities or diophantine equations or functional equations as special cases to be ranked lower.
Have a stronger preference for combo, e.g. pick from amongst 3/6 (or even 2/3/5/6) with preference for combo then NT then algebra than geo. (But I'm scared of e.g. combo 2/5 that can be solved with easy monovariant or that kind of thing, I strongly suspect many of those are also automatically solvable without deep learning.)
Actually ask someone who doesn't yet know the results of running a machine which problem seems hardest for machine.

I also agree that "special case of existing result" is plausible, but I think that's going to be <1% (and it's fairly likely to be missed by machines even so). Though that could get worse if many more formal proofs are added in the interim as part of increased investment in theorem proving which could drive that up, and that's very correlated with a serious effort that could effectively exploit such data.

If Eliezer wants to actually bet about the hardest problem claim I may want to argue for some kind of reformulation (even though I said I didn't want to change my predictions). If Eliezer doesn't want to bet about it, I think it's OK to ding me epistemically if the AI ends up getting a "hardest problem" that is a 3 variable inequality or bashable functional or diophantine equation, though I think it's a bit of a technicality / a general error in my forecasting rather than a claim about AI.

Isn't it just a lot more interesting to bet on gold than on hardest problem?

The former seems like it averages out over lots of random bits of potential bad luck. And if the bot can do the hardest problem, wouldn't it likely be able to do the other problems as well?

Problems 1 and 4 are often very easy and nearly-mechanical for humans. And it's not uncommon for 2 of the other problems to happen to be easy as well (2/5 are much easier than 3/6 and it's considered OK for them to be kind of brute-forceable, and I expect that you could solve most IMO geometry problems without any kind of AI if you tried hard enough, even the ones that take insight for a human).

Sometimes you can get a gold by solving only 4/6 of the problems, and that's not very well-correlated with whether the problems happen to be easy for machine. And even when you need 5/6 it looks like sometimes you could get a very lucky break.

So if you get 4 swings, it looks fairly likely that one of them will be quite easy, and that dominates P(gold). E.g. I wouldn't be too surprised by someone doing a bronze this year, having a swing next year and getting a silver or gold based on how lucky they get, and then if they get a silver trying again the next year just to be able to say they did it.

What I was expressing a more confident prediction about was an idea like "The hardest IMO problems, and especially the ad hoc problems, seem very hard for machines." Amongst humans solving those problems is pretty well-correlated with getting golds, but unfortunately for machines it looks quite noisy and so my probability had to go way up.

Unfortunately "hardest problem" is also noisy in its own way (in large part because a mechanical definition of "hardest" isn't good) so this isn't a great fix.

Hmm, in that case, would "all the problems" be better than either "hardest problem" or "gold"?

Edit: now refers to footnote 2, see below.

If Eliezer wins, he gets 1 bit of epistemic credit. So e.g. if Eliezer and I used to get equal weight in a mixture of experts, now Eliezer should get 2x my weight. Conversely, if I win then I should get 1.1x his weight.

The second and third sentences do not follow from the first. I would update to more like 1.05x vs 1.005x (or ideally something more specific depending on more details than binary questions like "solved hardest" or "won gold," eg among other nuances I'd update more if the bet resolved for Eliezer in 2022 vs 2025).

If you are asserting that observers should eg update from 1:1 to 2:1 toward Eliezer if he wins, that seems far too strong to me, particularly given that neither of you seems to have spent a great deal of effort on the IMO question, but I would be very interested to hear why you think we should update so much.

These numbers come from if each party were actually in a mixture of experts, or equivalently Kelly-betting their epistemic credit points on a prediction market. We might update less due to some combination of

other evidence we have about Paul and Eliezer is not independent with this
The bet is between Paul thinking for a few hours, and Eliezer thinking for a few hours. But what we really want to decide between is Paul's model being right vs Eliezer's model being right, and the bet is not perfect evidence of that.

I don't really think we have particularly careful views; it would be reasonable to not update very much.

ETA: I moved that text to a footnote since I basically agree that it's unreasonable as a prescription.

Congratulations on ironing out the bet! It seemed like a lot of work. :)

https://x.com/alexwei_/status/1946477742855532918

I believe this qualifies as "technical capability existing by end of 2025".

Flagging for posterity that I'd take Eliezer's side of this bet (but I think for different reasons than him).

I believe most people who work in deep learning (but don't have inside evidence about particular theorem-proving efforts currently underway) are on Eliezer's side and would go even higher than 16%.

In the broader world I think probably it's more divided, and I'd guess that most academics in AI would be on my side though I'm not sure.

This is definitely a thing that makes me uneasy about the bet (I would have loved for Eliezer to be the one with the more contrarian take, and for my position to be closer to 50-50, but you get what you get).

Makes sense. (Also, to be clear, I should have mentioned that my earlier claim was not based on inside evidence of that kind.)

I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...

(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)

The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take good contestants more than a few hours. But for a dedicated bashing program, it makes no difference.

It is extremely likely that an "AI" solving most IMO geometry problems is possible today -- the main difficulty being converting the text into an algebraic statement. Given that, polynomial system solvers should easily tackle such problems.

Say the order of the problems is (Day 1: CNG, Day 2: GAC). The geometry solver gives you 14 points. For a chance on IMO Gold, you have to solve the easiest combinatorics problem, plus one of either algebra or number theory.

Given the recent progress on coding problems as in AlphaCode, I place over 50% probability on IMO #1/#4 combinatorics problems being solvable by 2024. If that turns out to be true, then the "AI has Gold" event becomes "AI solves a medium N or a medium A problem, or both if contestants find them easy".

Now, as noted elsewhere in the thread, there are various types of N and A problems that we might consider "easy" for an AI. Several IMOs in the last ten years contain those.

In 2015, the easiest five problems consisted out of: two bashable G problems (#3, #4), an easy C (#1), a diophantine equation N (#2) and a functional equation A (#5). Given such a problemset, a dedicated AI might be able to score 35 points, without having capabilities remotely enough to tackle the combinatorics #6.

The only way the Gold probability could be comparable to "hardest problem" probability is if the bet only takes general problem-solving models into account. Otherwise, inductive bias one could build into such a model (e.g. resort to a dedicated diophantine equation solver) helps much more in one than in the other.

I think this is quite plausible. Also see toner's comment in the other direction though. Both probabilities are somewhat high because there are lots of easy IMO problems. Like you, I think "hardest problem" is quite a bit harder than a gold, though it seems you think the gap is larger (and most likely it sounds like you put a much higher probability on an IMO gold overall).

Overall I think that AI can solve most geometry problems and 3-variable inequalities for free, and many functional equations and diophantine equations seem easy. And I think the easiest problems are also likely to be soluble. In some years this might let it get a gold (e.g. 2015 is a good year), but typically I think it's still going to be left with problems that are out of reach.

I would put a lower probability for "hardest problem" if we were actually able to hand-pick a really hard problem; the main risk is that sometimes the hardest problem defined in this way will also be bashable for one reason or another.

Right now Eliezer has gold at "16%+" and I have it at "8%."

I think Eliezer would probably like to claim more than a factor of 2 of epistemic credit if this were to happen (e.g. a more significant update than if I'd put 1/3 probability on something and he'd put 2/3 probability).

If that's right, then I think it would be good to publicly state a higher probability than 16% in advance, so that I can get correspondingly more credit in the worlds where this doesn't happen.

Having posted this and emailed Eliezer, I'm now considering his prediction as 16% flat. So that's 1 bit of epistemic credit for him if it happens 1/8th of a bit for me if it doesn't.

Figured I should post my estimate. I am going to go with 30%.

I'm curious if you can explicate the thought process behind such a high estimate.

Not really, just an intuition that it will be easier than most think, and what we have seen so far is the thin edge of a wedge that will be hammered in pretty quickly.

Eliezer seems on track to win: current AI benchmark for IMO geometry problems is at 27/30 (IMO Gold human performance is at 25.9/30). This new benchmark was set by LLM-augmented neurosymbolic AI.

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry [2024 April]

This market sure did exhibit some odd behavior at first.

https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat