Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework against the MMLU benchmark and discovered a not-insignificant number of issues with the problem set.

Among them:

  • Crucial context missing from questions (apparently copy-paste errors?)
  • Ambiguous sets of answers
  • Wrong sets of answers

He highlights a growing need for a proper benchmarking organization that can research and create accurate, robust, sensible benchmarking suites for evaluating SOTA models.

I found the video super interesting and the findings important, so I wanted to share them here.

Dan H:

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
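
As a rough back-of-envelope illustration (my own numbers, not Dan's): if a fraction $p$ of answer keys are wrong, then a model that answers every question correctly under the intended reading measures at most about

$$\text{accuracy} \approx 1 - p,$$

ignoring the small chance of agreeing with a bad key by accident. So 1-2% noise caps MMLU scores near 98-99%, which mostly matters when comparing models already near the ceiling.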

As the video says, label noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose an LLM could be used to automatically detect most problematic questions, and a human could then verify, for each flagged question, whether it needs to be fixed or removed.
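
A minimal sketch of what that flag-then-verify loop could look like (my own illustration, not from the comment; `Item` and `ask_llm` are hypothetical placeholders for your data format and model call):

```python
# Sketch: run each benchmark item past an LLM and queue it for human
# review when the model disagrees with the answer key or calls the
# question ambiguous. `ask_llm` is a hypothetical stand-in for whatever
# chat-completion call you actually use.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list[str]  # e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    label: str          # the dataset's answer key, e.g. "C"


def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a model and return its reply."""
    raise NotImplementedError


def flag_suspect_items(items: list[Item]) -> list[Item]:
    """Return items a human should look at: the model disagrees with the
    key, or it reports the question is ambiguous/underspecified."""
    flagged = []
    for item in items:
        prompt = (
            "Answer with a single letter, or the word AMBIGUOUS if the "
            "question is missing context or has no single correct choice.\n\n"
            f"{item.question}\n" + "\n".join(item.choices)
        )
        reply = ask_llm(prompt).strip().upper()
        if reply.startswith("AMBIGUOUS") or reply[:1] != item.label:
            flagged.append(item)  # a human decides: fix, keep, or drop
    return flagged
```

Even something this crude would presumably surface the missing-context and wrong-key cases the video highlights, while leaving the final fix-or-remove decision to a human.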

awg:

Your position seems to be that this isn't something to be worried about or looked into. Can you explain why?

For instance, if the goal is to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (compared with, say, somehow getting that number down to 0%)?

It seems like he's mainly responding to the implication that this means MMLU is "broken." Label noise can be suboptimal and still much less important than this post's title suggests.

O O:

I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.