
Q4 AI Benchmarking: Bots Are Closing The Gap

by sandman · Feb 19, 2025 · Edited on Feb 19, 2025 · 18 min read

Comments

9 comments

122 questions over a quarter is a huge load for humans. Perhaps a better design would be to have more humans split into teams, each forecasting fewer questions. In the end, it can all be aggregated by imputation.

3

@dimaklenchin The 122 number is a slight overstatement - it includes 'subquestions'; e.g., a question that would normally have been multiple choice became N separate questions in the Q4 benchmark, and numeric range questions were similarly split into groups of binaries. (This has changed in Q1, and questions now fully resemble normal Metaculus questions.)

The number of questions after normalising for sets like this was ~100.
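To make the "groups of binaries" decomposition above concrete, here is a minimal sketch assuming a hypothetical numeric question, a normal belief distribution, and made-up thresholds - it is not the benchmark's actual question-generation code:

```python
from scipy import stats

# Hypothetical illustration of splitting one numeric range question into a
# group of binary threshold questions. The forecaster's belief about the
# quantity is modelled as a normal distribution purely for the example.
belief = stats.norm(loc=50, scale=10)

thresholds = [40, 50, 60, 70]  # made-up cut points
group_of_binaries = {f"Will the value exceed {t}?": 1 - belief.cdf(t) for t in thresholds}

for question, p_yes in group_of_binaries.items():
    print(f"{question}  P(Yes) = {p_yes:.2f}")
```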

As part of the human team, I do feel like 100 was still quite a lot. More challenging than the quantity, for me, was the turnaround time; I find I do my best forecasting when I have a bit more time to let a forecast percolate, even if I don't spend much more focused wall-clock time on it, rather than doing a one-and-done ~one-hour session.

Re: your suggestion: if we hold budget/human-hours constant and apply that time to fewer total questions, I suspect we would indeed get stronger performance from the Pros but at the cost of wider confidence intervals for the aggregate results overall.

1

@dimaklenchin I agree that it's a lot of questions - it works out to about two per day. However, they weren't able to find a statistically significant difference between the two groups even at this question count, so I don't know if reducing it is realistic: fewer questions would reduce statistical power, making a significant result even less likely.
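A rough Monte Carlo sketch of that power argument, with made-up numbers (a hypothetical 0.02 mean per-question score gap and a 0.12 per-question standard deviation), shows how halving the question count shrinks the chance of detecting the same true difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(n_questions, effect=0.02, sd=0.12, n_sims=5000, alpha=0.05):
    """Estimate the chance of detecting a true mean score gap of `effect`
    between two forecaster groups scored on the same `n_questions` questions."""
    detections = 0
    for _ in range(n_sims):
        group_a = rng.normal(0.0, sd, n_questions)      # per-question scores
        group_b = rng.normal(effect, sd, n_questions)   # truly worse by `effect`
        _, p_value = stats.ttest_rel(group_a, group_b)  # paired on questions
        detections += p_value < alpha
    return detections / n_sims

for n in (50, 100):
    print(f"{n} questions -> power ~ {simulated_power(n):.2f}")
```

With these assumed numbers the power is modest in both cases, just lower with fewer questions; the point is only the direction of the effect, not the specific values.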

If a forecaster knows a lot, then their average forecast on the questions that resolved Yes should be higher than their average forecast on questions that resolved No.

This makes no sense to me. Whether a question resolves Yes or No depends on how it’s phrased, which is arbitrary. For every question of the form “Will X happen?” that resolves Yes, we could have had a question “Will X not happen?” that resolves No. And that’s not just hypothetical - e.g. “Will X win?” vs. “Will «X’s opponent» win?”. Am I missing something?

1

@oumeen What I think you might be missing is that discrimination applies symmetrically to Yes and No resolutions. Another way of phrasing it is: to what degree are the forecaster's predictions extremized away from the midpoint?

At the extreme, given the overall base rate of 2:1 no:yes resolutions on Metaculus, a forecaster that predicts 33.3% on all binary questions would be well-calibrated but would have low discrimination; a forecaster that improves on this and is able to predict (with good calibration) some subset of questions at 90% or 10% would have better discrimination.

8
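A minimal sketch of the discrimination idea discussed above, computed simply as the gap between the average forecast on Yes-resolved and No-resolved questions; this is illustrative rather than Metaculus's exact scoring:

```python
import numpy as np

def discrimination(probs, outcomes):
    """Gap between the mean forecast on questions that resolved Yes and the
    mean forecast on questions that resolved No (higher = more discrimination)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    return probs[outcomes].mean() - probs[~outcomes].mean()

outcomes = [1, 0, 0, 1, 0, 0, 1, 0, 0]  # roughly the 2:1 No:Yes base rate

# Always predicting the 33.3% base rate: well-calibrated, zero discrimination.
print(discrimination([0.333] * 9, outcomes))

# Confidently (and correctly) extremizing some questions toward 90% / 10%.
print(discrimination([0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.333, 0.333, 0.333], outcomes))
```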

Interesting! I’m especially curious whether pgodzinai will repeat this success in Q1 2025 or whether, as he thought, it isn’t a lasting advantage.

I wonder whether it’s fair methodology to pick only the top bot for comparison when it’s significantly better than all the other bots. It could simply have gotten extraordinarily lucky on certain questions that destroyed the other bots - and that’s no proof of AI progress: we could run thousands of extra bots and always pick the best-performing one, increasing the “bot team performance” by this metric with zero real improvement in AI abilities.

That might be especially the case with continuous questions, where it‘s easier to earn huge negative points by misunderstanding the question, but a really lucky guess might give you a huge advantage as well.

In fact, I have an alternative hypothesis for pgodzinai’s success: I noticed that none of its worst predictions included continuous questions, even though those seem to have been a struggle for bots in general. Could pgodzinai simply have gotten lucky with the few continuous questions that were asked at the very end while other top bots lost points on those? Could that explain the difference in itself? I guess it’s possible to look at specific question scores for each top bot?

3

@Zaldath Our methodology selected the top bot on "bot only questions", which were distinct from the "bot-pro questions". This should be robust against adding thousands of extra bots.

In Q4, only binaries were used for analysis - so pgodzinai's performance on continuous questions isn't a factor.

But in Q1 there are many continuous questions, and that could change things significantly!

6
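A small Monte Carlo sketch of both points above - the "thousands of extra bots" worry and why selecting on a separate "bot only" set largely defuses it. All bots here are hypothetical, with identical true skill and stylized scores (lower is better), so any apparent edge is pure luck:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1,000 hypothetical bots with identical true skill, scored on two disjoint
# question sets: one for picking the "top bot", one for reporting its result.
n_bots, n_questions = 1000, 100
selection_scores = rng.normal(0.0, 0.1, (n_bots, n_questions)).mean(axis=1)
evaluation_scores = rng.normal(0.0, 0.1, (n_bots, n_questions)).mean(axis=1)

# Picking the best bot on the same set it is judged on looks like real skill...
print("best-of-1000 on its own set:", evaluation_scores.min())

# ...while picking on the selection set and reporting the held-out score does not.
winner = selection_scores.argmin()
print("held-out score of the selected bot:", evaluation_scores[winner])
```

Run repeatedly, the first number is systematically negative (it looks like genuine outperformance), while the second hovers around zero.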

@TomL Good points - neither of my potential concerns was ultimately a real problem, then.

1

@Zaldath Yes, I expect the move from groups of binary questions to multiple choice questions to significantly reduce my model's alpha.

I'm also, unfortunately, going to be digging out of a big hole in Q1 due to some bugs and poor error handling when the LLM proxy is down, introduced as part of the move to automated runs via GH Actions.

2
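On the error-handling point, here is a minimal retry-with-backoff sketch of the kind of guard that helps when an LLM proxy is intermittently down during automated runs; `call_proxy` is a hypothetical stand-in, not pgodzinai's actual code:

```python
import random
import time

def call_with_retries(call_proxy, prompt, max_attempts=5, base_delay=2.0):
    """Call a flaky LLM proxy, retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call_proxy(prompt)
        except Exception as exc:          # proxy down, timeout, 5xx, ...
            if attempt == max_attempts - 1:
                raise                     # surface the failure to the run logs
            delay = base_delay * 2 ** attempt + random.uniform(0, 1)
            print(f"Proxy call failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)
```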
