CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
What do you mean by "easy" here?
I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
I strongly suspect he thinks most of it is not valuable.
Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.
Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P
Ok, so it sounds like given 15-25 minutes per problem (and maybe with only 10 minutes per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you'd do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size), or the extra bit of time would help (though it sounds like you tried to use more time here and that didn't help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I'm sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I'm very happy to bet here.
I think that even if it turns out you're a bit better than LLMs at this task, we should note that it's pretty impressive that they're competitive with you given 30 minutes!
So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].
I think the models would beat you by more at FrontierMath.
Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.
@johnswentworth Do you agree with me that modern LLMs probably outperform (you with internet access and 30 minutes) on GPQA diamond? I personally think this somewhat contradicts the narrative of your comment if so.
I reviewed for ICLR this year, and found it somewhat more rewarding; the papers were better, and I learned something somewhat useful from writing my reviews.
Yes, I'd be way worse off without internet access.
The latter is in the list.