FrontierMath was funded by OpenAI.[1]
The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.
Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous arXiv versions (v1-v4) do not acknowledge OpenAI for their support; the support was first made public on Dec 20th.[1]
Because the arXiv version mentioning OpenAI's contribution came out right after the o3 announcement, I'd guess Epoch AI had some agreement with OpenAI not to mention it publicly until then.
The mathematicians creating the problems for FrontierMath were not (actively)[2] told about the funding from OpenAI. The contractors were instructed to keep the exercises and their solutions secure, including not using Overleaf or Colab, not emailing about the problems, and signing NDAs, "to ensure the questions remain confidential" and to avoid leakage. The contractors were also not told about the OpenAI funding on December 20th. I believe there were named authors of the paper who had no idea about the OpenAI funding.
I believe the impression for most people, and for most contractors, was "This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem."[3]
Neither Epoch AI nor OpenAI says publicly that OpenAI has access to the exercises, answers, or solutions. I have heard second-hand that OpenAI does have access to the exercises and answers, and that they use them for validation. I am not aware of any agreement between Epoch AI and OpenAI that prohibits using this dataset for training if they wanted to, and I have slight evidence against such an agreement existing.
In my view, Epoch AI should have disclosed the OpenAI funding, and contractors should have transparent information about the potential for their work to be used for capabilities when choosing whether to work on a benchmark.
arXiv v5 (Dec 20th version): "We gratefully acknowledge OpenAI for their support in creating the benchmark."
I do not know whether they have disclosed it when asked neutral questions about who is funding this.
This is from a comment by a non-Epoch AI person on Hacker News from two months ago. Another example: Ars Technica writes "FrontierMath's difficult questions remain unpublished so that AI companies can't train against it." in a news article from November.
Tamay from Epoch AI here.
We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future.
For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate that we received lab funding to some mathematicians, we didn't do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.
Getting permission to disclose OpenAI's involvement only around the o3 launch wasn't good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.
Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Relevant OpenAI employees’ public communications have described FrontierMath as a 'strongly held out' evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.
OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose.
[Edit: Clarified OpenAI's data access - they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.]
we have a verbal agreement that these materials will not be used in model training
Get that agreement in writing.
I am happy to bet 1:1 OpenAI will refuse to make an agreement in writing to not use the problems/the answers for training.
You have done work that contributes to AI capabilities, and you have misled mathematicians who contributed to that work about its nature.
Get that agreement in writing.
I'm not sure that would be particularly reassuring to me (writing as one of the contributors). First, how would one check that the agreement had been adhered to (maybe it's possible, I don't know)? Second, people in my experience often don't notice they are training on data (as mentioned in a post above by ozziegooen).
I agree entirely that it would not be very reassuring, for the reasons you explained. But I would still consider it a mildly interesting signal to see if OpenAI would be willing to provide such an agreement in writing, and maybe make a public statement on the precise way they used the data so far.
Also: if they make a legally binding commitment, and then later evidence shows up that they violated the terms of this agreement (e.g. via whistleblowers), I do think that this is a bigger legal risk for them than breaching some fuzzy verbal agreement.
I found this extra information very useful, thanks for revealing what you did.
Of course, to me this makes OpenAI look quite poor. This seems like an incredibly obvious conflict of interest.
I'm surprised that the contract didn't allow Epoch to release this information until recently, but did allow Epoch to release it afterwards. This seems really sloppy on OpenAI's part. I guess they got a bit of extra publicity when o3 was released (even though the model wasn't even available), but now it winds up looking worse (at least for those paying attention). I'm curious whether this discrepancy was maliciousness or carelessness.
Hiding this information seems very similar to lying to the public. So at very least, from what I've seen, I don't feel like we have many reasons to trust their communications - especially their "tweets from various employees."
> However, we have a verbal agreement that these materials will not be used in model training.
I imagine I can speak for a bunch of people here when I say I'm pretty skeptical. At the very least, it's easy for me to imagine situations where the data wasn't technically directly used in the training, but was used by researchers when iterating on versions, to make sure the system was going in the right direction. This could lead to a very blurry line where they could do things that aren't [literal LLM training] but basically achieve a similar outcome.
However, we have a verbal agreement that these materials will not be used in model training.
If by this you mean "OpenAI will not train on this data", that doesn't address the vast majority of the concern. If OpenAI is evaluating the model against the data, they will be able to more effectively optimize for capabilities advancement, and that's a betrayal of the trust of the people who worked on this with the understanding that it will be used only outside of the research loop to check for dangerous advancements. And, particularly, not to make those dangerous advancements come sooner by giving OpenAI another number to optimize for.
If you mean OpenAI will not be internally evaluating models on this to improve and test the training process, please state this clearly in writing (and maybe explain why they got privileged access to the data despite being prohibited from the obvious use of that data).
I think you should publicly commit to:
If you currently have any of these with the computer use benchmark in development, you should seriously try to get out of those contractual obligations if there are any.
Ideally, you commit to these in a legally binding way, which would make it non-negotiable in any negotiation, and make you more credible to outsiders.
We could also ask if these situations exist ("is there any funder you have that you didn't disclose?" and so on, especially around NDAs), and Epoch could respond with Yes/No/Can'tReply[1].
Also seems relevant for other orgs.
This would only patch the kind of problems we can easily think about, but it seems to me like a good start.
I learned that trick from hpmor!
William,
not exactly an answer to your question but BOTEC estimate for FrontierMath costs: $400k — $2M
Thank you for the clarification! What I would be curious about: you write
OpenAI does have access to a large fraction of FrontierMath problems and solutions
Does this include the detailed solution write-up (mathematical arguments, in LaTeX) or just the final answer (numerical result of the question / Python script verifying the correctness of the AI response)?
We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities.
Can you say exactly how large of a fraction is the set that OpenAI has access to, and how much is the hold-out set?
Not Tamay, but from elliotglazer on Reddit[1] (14h ago): "Epoch's lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven't yet independently verified their 25% claim. To do so, we're currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.
My personal opinion is that OAI's score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can't vouch for them until our independent evaluation is complete."
Currently developing a hold-out dataset gives a different impression than
"We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities" and "they do not have access to a separate holdout set that serves as an additional safeguard for independent verification."
Creating further, even harder datasets could plausibly accelerate OpenAI's progress. I read on Twitter that people are working on an even harder dataset now. I would not give them access to this; they may break their promise not to train on it if doing so allows them to accelerate progress. This is extremely valuable training data that you have handed to them.
Suggested market. Happy to take suggestions on how to improve it:
https://manifold.markets/NathanpmYoung/will-o3-perform-as-well-on-the-fron?play=true
This is extremely informative, especially the bit about the holdout set. I think it'd reassure a lot of people about FrontierMath's validity to know more here. Have you used it to assess any of OpenAI's models? If so, how, and what were the results?
It's probably worth them mentioning for completeness that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation that it needs to be research-level because they were focusing on Olympiad level.)
Also, I can confirm they aren't giving access to the mathematicians' questions to AI companies other than OpenAI, such as xAI.
EpochAI is also working on a "next-generation computer-use benchmark". I wonder who is involved in funding that. It could be OpenAI given recent rumors they are planning to release a computer-use model early this year.
Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent to collaborators for this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.
Well, I'd sure like to know whether you are planning to give the dataset to OpenAI or any other frontier companies! It might influence my opinion of whether this work is net positive or net negative.
I can't make any confident claims or promises right now, but my best guess is that we will make sure this new benchmark stays entirely private and under Epoch's control, to the extent this is feasible for us. However, I want to emphasize that by saying this, I'm not making a public commitment on behalf of Epoch.
to the extent this is feasible for us
Was [keeping FrontierMath entirely private and under Epoch's control] feasible for Epoch in the same sense of "feasible" you are using here?
I'm not completely sure, since I was not personally involved in the relevant negotiations for FrontierMath. However, what I can say is that Tamay already indicated that Epoch should have tried harder to obtain different contract terms that enabled us to have greater transparency. I don't think it makes sense for him to say that unless he believes it was feasible to have achieved a different outcome.
Also, I want to clarify that this new benchmark is separate from FrontierMath and we are under different constraints with regards to it.
Hey everyone, could you spell out to me what's the issue here? I read a lot of comments that basically assume "x and y are really bad" but never spell it out. So, is the problem that:
- Giving the benchmark to OpenAI helps capabilities (but don't they have a vast sea of hard problems to already train models on?)
- OpenAI could fake o3's capabilities (why do you care so much? This would slow down AI progress, not accelerate it)
- Some other thing I'm not seeing?
Really high-quality, high-difficulty benchmarks are much scarcer and more important for advancing capabilities than training data alone. Having an apparently x-risk-focused org build a benchmark, implying it's for evaluating danger from highly capable models in a way the capabilities orgs can't use to test their models, and then having it turn out that it was secretly funded by OpenAI, with OpenAI getting access to most of the data, is very sketchy.
Some people who contributed questions likely thought they would be reducing x-risk by helping build bright line warning signs. Their work being available to OpenAI will mostly have increased x-risk by giving the capabilities people an unusually important number-goes-up to optimize for, bringing timelines to dangerous systems closer. That's a betrayal of trust, and Epoch should do some serious soul searching about taking money to do harmful things.
If the funding didn't come from OpenAI, would OpenAI still be able to use that benchmark? Like, I'd imagine Epoch would still use that to evaluate where current models are at. I think this might be my point of confusion. Maybe the answer is "not as much for it to be as useful to them"?
Evaluation on demand, which they can run as intensively as they like, lets them test small models for architecture improvements. This is where the vast majority of the capability gain is.
Getting an evaluation of each final model is going to be way less useful for the research cycle, as it only gives a final score, not a metric which is part of the feedback loop.
Yes, that answer matches my understanding of the concern. If the vast majority of the dataset was private to Epoch, OpenAI could occasionally submit their solution (probably via API) to Epoch to grade, but wouldn't be able to use the dataset with high frequency as an evaluation in many experiments.
This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.
Also, if they have access to the mathematicians' reasoning in addition to final answers, this could potentially be valuable without directly training on it (e.g. maybe they could use it to evaluate process-based grading approaches).
(FWIW I’m explaining the negatives, but I disagree with the comment I’m expanding on regarding the sign of Frontier Math, seems positive EV to me despite the concerns)
I'm guessing you view having a better understanding of what's coming as very high value, enough that burning some runway is acceptable? I could see that model (though I put <15% on it), but I think it is at least not good, integrity-wise, to have put on the appearance of doing just the good-for-x-risk part and not sharing it as an optimizable benchmark, while being funded by, and giving the data to, people who will use it for capability advancements.
Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background view is that I think takeoff speeds matter more than timelines, and it's unclear to me how having FrontierMath affects takeoff speeds.
I wasn't thinking much about the optics, but I'd guess that's not a large effect. I agree that Epoch made a mistake here though and this is a negative.
I could imagine changing my mind somewhat easily.
Agree that takeoff speeds are more important, and I expect that FrontierMath has much less effect on takeoff speed. I still think timelines matter enough that the amount of relevant informing of people you buy from this is likely not worth the cost, especially if the org avoids talking about risks in public and leadership isn't focused on agentic takeover, so the info is not packaged with what's needed for it to have the effects that would help.
Evaluating the final model tells you where you got to. Evaluating many small models and checkpoints helps you get further faster.
In addition to the object level reasons mentioned by plex, misleading people about the nature of a benchmark is a problem because it is dishonest. Having an agreement to keep this secret indicates that the deception was more likely intentional on OpenAI's part.
Why do you consider it unlikely that companies could (or would) fish out the questions from API logs?
That was a quote from a commenter on Hacker News, not my view. I referenced the comment as representative of what I thought a lot of people's impression was before Dec 20th. You may be right that maybe most people didn't have the impression that it's unlikely, or that maybe they didn't have a reason to think that. I don't really know.
Thanks, I'll put the quote in italics so it's clearer.
This is a very relevant concern, and that's why ARC-AGI maintains two versions. Neither of them is publicly available.
Some AI companies, like OpenAI, have "eyes-off" APIs that don't log any data afaik (or perhaps log only the minimum legally permitted, with heavy restrictions on who can access it), described as Zero Data Retention here: https://openai.com/enterprise-privacy/ ("How does OpenAI handle data retention and monitoring for API usage?")
I was the original commenter on HN, and while my opinion on this particular claim is weaker now, I do think for OpenAI, a mix of PR considerations, employee discomfort (incl. whistleblower risk), and internal privacy restrictions make it somewhat unlikely to happen (at least 2:1?).
My opinion has become weaker because OpenAI seems to be internally a mess right now, and I could imagine scenarios where management very aggressively pushes and convinces employees to employ these more "aggressive" tactics.
Thanks for posting this!
I have to admit, the quote here doesn't seem to clearly support your title -- I think "support in creating the benchmark" could mean lots of different things, only some of which are funding. Is there something I'm missing here?
Regardless, I agree that FrontierMath should make clear what the extent was of their collaboration with OpenAI. Obviously the details here are material to the validity of their benchmark.
i've changed my mind and been convinced that it's kind of a big deal that frontiermath was framed as something that nobody would have access to for hillclimbing when in fact openai would have access and other labs wouldn't. the undisclosed funding before o3 launch still seems relatively minor though
am curious why you think this; it seems like some people were significantly misled and disclosure of potential conflicts-of-interest seems generally important
EpochAI shared two job postings on their LinkedIn a few months ago that seem relevant to FrontierMath:
Project lead compensation is $200k/year with expected contract length of 7 months: ~$100k.
Benchmark consists of about 1k problems, payment per problem is between $300 and $1k based on quality. Full-time contributors also have base rate of $2k/month with 2-4 months expected duration. Over 60 mathematicians have contributed.
Low amount assuming just project lead, minimally valued problems ($300), no full-time contractors, no overhead (e.g. taxes):
Medium amount assuming project lead, medium valued problems ($500), 10 full-time contractors working for 3 months, 30% of overhead:
High amount assuming project lead, exceptional problems ($1k), 60 full-time contractors working for 3 months, 50% of overhead:
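To make the arithmetic explicit, here is a minimal sketch of the BOTEC under the assumptions above. The ~$100k project-lead figure in every scenario and the choice to apply overhead as a multiplier on the whole subtotal are my own reading of the numbers, not Epoch's; the low and high scenarios come out near the $400k and $2M endpoints quoted earlier.

```python
# BOTEC sketch for FrontierMath costs, using the figures from the job postings above.
# Assumptions (mine): project lead counted at ~$100k in every scenario; "overhead"
# applied as a multiplier on the whole subtotal.

PROJECT_LEAD = 100_000      # ~7 months at $200k/year
N_PROBLEMS = 1_000          # approximate benchmark size
CONTRACTOR_RATE = 2_000     # full-time contributor base rate per month

def botec(price_per_problem, n_contractors, months, overhead):
    subtotal = (PROJECT_LEAD
                + N_PROBLEMS * price_per_problem
                + n_contractors * months * CONTRACTOR_RATE)
    return subtotal * (1 + overhead)

low = botec(300, 0, 0, 0.0)        # ~$400k
medium = botec(500, 10, 3, 0.3)    # ~$860k
high = botec(1_000, 60, 3, 0.5)    # ~$2.2M

print(f"low: ${low:,.0f}  medium: ${medium:,.0f}  high: ${high:,.0f}")
```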
It feels as if OpenAI acted as a client of Epoch AI commissioning a math benchmark, and that Epoch AI would not be working on it without this funding.
I also suspect that computer use/task execution benchmark EpochAI is hiring for currently (Technical Lead, Benchmarks) has the same arrangement with OpenAI.
It doesn't bother me that Epoch took money from OpenAI. It doesn't bother me that OpenAI has access to the FrontierMath solutions.
What does bother me is Epoch concealing this information. I certainly assumed FrontierMath was a private eval. Clearly there are people who would not have worked on this if they'd known OpenAI would have access to the dataset. I'm really not sure why Epoch or OpenAI think misleading people about this is beneficial to them—this information coming out now, like this, just means people won't trust Epoch in the future. Was the data they received via deception from people who wouldn't have participated really worth burning trust like this?
I was excited about FrontierMath when it was revealed, doubly so when o3 made such impressive progress. I think o3's results are probably uncontaminated, it would be a very bad move for OpenAI to make fake progress when they could instead make real progress, but concealing this was also a bad move so I don't know. I really hope Epoch doesn't pull anything like this with their upcoming computer use benchmark.
(...and I'm shocked they're trusting verbal agreements from OpenAI about how the data is being used. Is getting stuff in writing really that hard?)
I've known Jaime for about ten years. Seems like he made an arguably wrong call when first dealing with real powaah, but overall I'm confident his heart is in the right place.
https://epoch.ai/blog/openai-and-frontiermath
On Twitter Dec 20th Tamay said the holdout set was independently funded. This blog post from today says OpenAI still owns the holdout set problems. (And that OpenAI has access to the questions but not solutions.)
In the post it is also clarified that the holdout set (50 problems) is not complete.
The blog post says Epoch requested permission ahead of the benchmark announcement (Nov 7th), and that they got it ahead of the o3 announcement (Dec 20th). From my own look at the timings, the arXiv paper was updated about 7 hours (and 8 minutes) before the o3 stream on YouTube, on the same date, so technically ahead of the o3 announcement. I was wrong to say that the arXiv version came out only after the o3 announcement; that was a guess based on the date being the same. I could have known better, and could have checked the clock time.[1]
Nat McAleese tweeted that "we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all." This is nice. If true, then I was wrong and misleading to spread a rumor that they use it for validation. Validation can mean different things, including using it for hill-climbing capabilities.
Nat McAleese also says that "hard uncontaminated benchmarks are incredibly valuable". I don't really know what the mechanisms for their being valuable are. I know OpenAI knows this better than I can, but I appreciate discussion here. I would have thought that the value would be mostly a combination of marketing + guiding development of current or future models (in various ways, with training on the dataset being a lower bound for capability usefulness), and that the marketing value wouldn't really be different whether you own the dataset or not. That's why I'm expecting this dataset to be used in guiding the development of future models. I would love to learn more here.[2]
Something that I haven't really seen mentioned is that a lot of people are willing to work for less compensation if working for a (mostly OpenPhil-funded) non-profit, compared to working on a project commissioned by OpenAI. This is another angle that makes me sad about the non-transparency, but this is relatively minor in my view.
I originally wrote here that the blog post did not discuss the Tier 4 problems, but I was wrong: it says the Tier 4 problems will be owned by OpenAI.
I'm somewhat disappointed that they did not address this benchmark being marketed as secure and "eliminating data contamination concerns". I think this marketing[3] means that them saying "we have not communicated clearly enough about the relationship between FrontierMath and OpenAI" [4] is understating the problem.[5]
Tamay's tweet thanking OpenAI for their support was also on Dec 20th, didn't check the clocktime. I don't know when they added it to their website. Tweet says OpenAI "recently" provided permission to publicly share the support.
Please come up with incredibly valuable uses of hard uncontaminated datasets that don't guide development at all.
Well, that and the agreement with OpenAI including not disclosing the relationship.
As the main issue. (source: the blog post linked above.)
I'm uncertain about mentioning this last point, and here I'm most likely to go wrong.
Our agreement did not prevent us from disclosing to our contributors that this work was sponsored by an AI company. Many contributors were unaware of these details, and our communication with them should have been more systematic and transparent.
So... why did they not disclose to their contributors that this work was sponsored by an AI company?
Specifically, not just any AI company, but the AI company that has (deservedly) perhaps the worst rep among all the frontier AI companies.[1]
I can't help but think that some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company.
I can't help but think that many more would decline the offer had they been told that it was sponsored by OpenAI specifically.
I can't help but think that this is the reason why they were not informed.
Though Meta also has a legitimate claim to having the worst rep, albeit with different axes of worseness contributing to their overall score.
some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company.
This is definitely true. There were ~100 mathematicians working on this (we don't know how many of them knew) and there's this.
I interpret you as insinuating that not disclosing that it was a project commissioned by industry was strategic. It might not have been, or maybe to some extent but not as much as one might think.
I'd guess not everyone involved was modeling how the mathematicians would feel. There are multiple (like 20?) people employed at Epoch AI, and multiple people at Epoch AI working on this project. Maybe the person or people communicating with the mathematicians were not in the meetings with OpenAI, or weren't actively thinking about the details or implications of the agreement when their job was to recruit people, and in turn the people who thought about the full details also missed communicating them to the mathematicians. Or something like that; it's a possibility, and coordination is hard.
I interpret you as insinuating that not disclosing that it was a project commissioned by industry was strategic.
I'm not necessarily implying that they explicitly/deliberately coordinated on this.
Perhaps there was no explicit "don't mention OpenAI" policy but there was no "person X is responsible for ensuring that mathematicians know about OpenAI's involvement" policy either.
But given that some of the mathematicians haven't heard a word about OpenAI's involvement from the Epoch team, it seems like Epoch at least had a reason not to mention OpenAI's involvement (though this depends on how extensive communication between the two sides was). Possibly because they were aware of how they might react, both before the project started, as well as in the middle of it.
[ETA: In short, I would have expected this information to reach the mathematicians with high probability, unless the Epoch team had been disinclined to inform the mathematicians.]
Obviously, I'm just speculating here, and the non-Epoch mathematicians involved in the creation of FrontierMath know better than whatever I might speculate out of this.
For those skeptical about
"we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all."
My personal view is that there was actually very little time between whenever OpenAI received the dataset (the creation started in like September, paper came out Nov 7th) and when o3 was announced, so it makes sense that that version of o3 wasn't guided at all by FrontierMath.