All of meemi's Comments + Replies

meemi26

For those skeptical about 

"we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all."

My personal view is that there was actually very little time between whenever OpenAI received the dataset (creation started around September; the paper came out Nov 7th) and when o3 was announced, so it makes sense that that version of o3 wasn't guided by FrontierMath at all.

meemi10

some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company.

This is definitely true. There were ~100 mathematicians working on this (we don't know how many of them knew) and there's this.

I interpret you as insinuating that not disclosing that it was a project commissioned by industry was strategic. It might not have been, or maybe it was to some extent, but not as much as one might think.

I'd guess not everyone involved was modeling how the mathematicians would feel. There are multiple (like ... (read more)

3Mateusz Bagiński
I'm not necessarily implying that they explicitly/deliberately coordinated on this. Perhaps there was no explicit "don't mention OpenAI" policy, but there was no "person X is responsible for ensuring that mathematicians know about OpenAI's involvement" policy either. But given that some of the mathematicians haven't heard a word about OpenAI's involvement from the Epoch team, it seems like Epoch at least had a reason not to mention OpenAI's involvement (though this depends on how extensive communication between the two sides was). Possibly because they were aware of how the mathematicians might react, both before the project started and in the middle of it. [ETA: In short, I would have expected this information to reach the mathematicians with high probability, unless the Epoch team had been disinclined to inform them.] Obviously, I'm just speculating here, and the non-Epoch mathematicians involved in the creation of FrontierMath know better than whatever I might speculate out of this.
meemi*21-2

https://epoch.ai/blog/openai-and-frontiermath

On Twitter on Dec 20th, Tamay said the holdout set was independently funded. This blog post from today says OpenAI still owns the holdout set problems. (And that OpenAI has access to the questions but not the solutions.)

The post also clarifies that the holdout set (50 problems) is not yet complete.

The blog post says Epoch requested permission ahead of the benchmark announcement (Nov 7th), and they got it ahead of the o3 announcement (Dec 20th). From my own look at the timings, the arXiv paper was updated 7 hours (and so... (read more)

9Mateusz Bagiński
In the post: So... why did they not disclose to their contributors that this work was sponsored by an AI company? Specifically, not just any AI company, but the AI company that has (deservedly) perhaps the worst rep among all the frontier AI companies.[1] I can't help but think that some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company. I can't help but think that many more would decline the offer had they been told that it was sponsored by OpenAI specifically. I can't help but think that this is the reason why they were not informed.

1. ^ Though Meta also has a legitimate claim to having the worst rep, albeit with different axes of worseness contributing to their overall score.
meemi*8338

Not Tamay, but from elliotglazer on Reddit[1] (14h ago): "Epoch's lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven't yet independently verified their 25% claim. To do so, we're currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.

My personal opinion is that OAI's score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances. How... (read more)

meemi80

That was a quote from a commenter on Hacker News, not my view. I referenced the comment as capturing what I thought a lot of people's impression was pre-Dec 20th. You may be right that most people didn't have the impression that it's unlikely, or that they didn't have a reason to think that. I don't really know.

Thanks, I'll put the quote in italics so it's clearer.

1kanak8278
This is a very relevant concern, and that's why ARC-AGI maintains two versions. Neither of them is publicly available.
meemi*27784

FrontierMath was funded by OpenAI.[1]

The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.

Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous arXiv versions v1-v4 do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1]

Because the arXiv version mentioning OpenAI's contribution came out right after o... (read more)

3NunoSempere
I've known Jaime for about ten years. Seems like he made an arguably wrong call when first dealing with real powaah, but overall I'm confident his heart is in the right place.
9Flying buttress
BOTEC estimate for funding: $400k — $2M

EpochAI shared two job postings on their LinkedIn a few months ago that seem relevant to FrontierMath:

* Senior Mathematician for Project Lead, AI Benchmarking
* Senior Mathematical Problem Creator (Part-Time/Full-Time Contractor Positions)

Project lead compensation is $200k/year with an expected contract length of 7 months: ~$100k. The benchmark consists of about 1k problems; payment per problem is between $300 and $1k based on quality. Full-time contributors also have a base rate of $2k/month with an expected duration of 2-4 months. Over 60 mathematicians have contributed.

Low estimate, assuming just the project lead, minimally valued problems ($300), no full-time contractors, no overhead (e.g. taxes): 100k + 1000 × 300 ≈ 400k

Medium estimate, assuming the project lead, medium-valued problems ($500), 10 full-time contractors working for 3 months, 30% overhead: (100k + 1000 × 500 + 10 × 2000 × 3) × 1.3 ≈ 900k

High estimate, assuming the project lead, exceptional problems ($1k), 60 full-time contractors working for 3 months, 50% overhead: (100k + 1000 × 1k + 60 × 2000 × 3) × 1.5 ≈ 2M
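For concreteness, a minimal sketch of the BOTEC above in Python (the per-problem rates and compensation figures are quoted from the job postings; the contractor counts and overhead rates are the commenter's assumptions):

```python
# Back-of-the-envelope FrontierMath funding estimate, using figures quoted
# from Epoch AI's job postings. Contractor counts and overhead rates are
# the commenter's assumptions, not confirmed numbers.

N_PROBLEMS = 1000        # benchmark size (~1k problems)
PROJECT_LEAD = 100_000   # ~7 months at $200k/year
FULLTIME_RATE = 2_000    # base rate per full-time contractor per month

def estimate(price_per_problem, n_fulltime, months, overhead):
    base = PROJECT_LEAD + N_PROBLEMS * price_per_problem + n_fulltime * FULLTIME_RATE * months
    return base * (1 + overhead)

low = estimate(300, 0, 0, 0.0)      # $400,000
medium = estimate(500, 10, 3, 0.3)  # $858,000, rounded to ~$900k
high = estimate(1_000, 60, 3, 0.5)  # $2,190,000, rounded to ~$2M
print(f"low ≈ ${low:,.0f}, medium ≈ ${medium:,.0f}, high ≈ ${high:,.0f}")
```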
4harryisgamer
It doesn't bother me that Epoch took money from OpenAI. It doesn't bother me that OpenAI has access to the FrontierMath solutions. What does bother me is Epoch concealing this information. I certainly assumed FrontierMath was a private eval. Clearly there are people who would not have worked on this if they'd known OpenAI would have access to the dataset.

I'm really not sure why Epoch or OpenAI think misleading people about this is beneficial to them. This information coming out now, like this, just means people won't trust Epoch in the future. Was the data they received via deception from people who wouldn't have participated really worth burning trust like this?

I was excited about FrontierMath when it was revealed, doubly so when o3 made such impressive progress. I think o3's results are probably uncontaminated; it would be a very bad move for OpenAI to fake progress when they could instead make real progress. But concealing this was also a bad move, so I don't know. I really hope Epoch doesn't pull anything like this with their upcoming computer-use benchmark.

(...and I'm shocked they're trusting verbal agreements from OpenAI about how the data is being used. Is getting stuff in writing really that hard?)
No77e200

Hey everyone, could you spell out to me what's the issue here? I read a lot of comments that basically assume "x and y are really bad" but never spell it out. So, is the problem that:

- Giving the benchmark to OpenAI helps capabilities (but don't they already have a vast sea of hard problems to train models on?)

- OpenAI could fake o3's capabilities (why do you care so much? This would slow down AI progress, not accelerate it)

- Some other thing I'm not seeing?

leogao14-14

this doesn't seem like a huge deal

Tamay*132-4

Tamay from Epoch AI here.

We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future.

For f... (read more)

Dan H*320

It's probably worth them mentioning for completeness that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation that it needs to be research-level, because they were focusing on Olympiad-level problems.)

Also, I can confirm they aren't giving AI companies other than OpenAI (e.g., xAI) access to the mathematicians' questions.

Kei*312

EpochAI is also working on a "next-generation computer-use benchmark". I wonder who is involved in funding that. It could be OpenAI given recent rumors they are planning to release a computer-use model early this year.

ouguoc161

Thanks for posting this!

I have to admit, the quote here doesn't seem to clearly support your title -- I think "support in creating the benchmark" could mean lots of different things, only some of which are funding. Is there something I'm missing here?

Regardless, I agree that FrontierMath should make clear what the extent was of their collaboration with OpenAI. Obviously the details here are material to the validity of their benchmark.

Why do you consider it unlikely that companies could (or would) fish the questions out of API logs?

meemi10

We meet every Tuesday in Oakland at 6:15

I want to make sure: is this meeting still on Wednesday the 15th? Thank you. :) And thanks for organizing.

1Czynski
Wednesday yes, sorry.
meemi36

I think this is a great project. I believe your documentary would have high impact by informing and inspiring AI policy discussions. You've already interviewed an impressive number of relevant people. I admire your initiative in taking on this project quickly, even before getting funding for it.

meemi70

Great post! I'm glad you did this experiment.

I've worked on experiments where I test gpt-3.5-turbo-0125's performance at computing iterates of a given permutation function in one forward pass. Previously, my prompts placed some of the task instructions after the specification of the function. After reading your post, I altered my prompts so that all the instructions were given before the problem instance. As with your experiments, this noticeably improved performance, replicating your result that performance is better when the instructions come before the problem instance.
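To illustrate the change concretely, here is a minimal sketch of the two prompt orderings (the permutation, wording, and iterate count are hypothetical illustrations, not the exact prompts used in the experiment):

```python
# Sketch of the prompt reordering described above. The permutation and the
# prompt wording are hypothetical, not taken from the actual experiment.

# A permutation on {0, ..., 4}, written as a mapping.
perm = {0: 2, 1: 0, 2: 3, 3: 4, 4: 1}
perm_text = ", ".join(f"{k} -> {v}" for k, v in perm.items())

# Old ordering: part of the instructions comes *after* the problem instance.
prompt_instructions_after = (
    f"Here is a permutation: {perm_text}.\n"
    "Apply it 3 times to the input 0. "
    "Answer with a single number and no intermediate steps."
)

# New ordering: all instructions come *before* the problem instance.
prompt_instructions_before = (
    "You will be given a permutation. Apply it 3 times to the input 0. "
    "Answer with a single number and no intermediate steps.\n"
    f"Here is the permutation: {perm_text}."
)

# Ground truth for scoring the model's one-forward-pass answer:
x = 0
for _ in range(3):
    x = perm[x]
print(x)  # 0 -> 2 -> 3 -> 4, so the answer is 4
```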