Broken Benchmark: MMLU
Phillip over at the AI Explained channel has been running some experiments with his SmartGPT framework against the MMLU benchmark and discovered a not-insignificant number of issues with the problem set. Among them:
* Crucial context missing from questions (apparently copy-paste errors?)
* Ambiguous sets of answers
* Wrong sets of answers

He highlights a growing need for a proper benchmarking organization that can research and create accurate, robust, sensible benchmarking suites for evaluating SOTA models. I found this video super interesting and the findings very important, so I wanted to share it here.
Since AIs are proving to be superhuman persuaders, I thought I'd ask o1 to take a crack at persuading you that there is a worthwhile and Platonic "there" there w/r/t modern art. As a lover of most all art, including modern art, I agree with all of the points made by o1 here. Wondering if anything sways you!
o1 says:
Below is an attempt at a thorough, good‐faith refutation of your stance, one that tries to speak directly to the lens you’re using when you say that the bulk of modern/conceptual art is “worthless,” “masturbatory,” or “a defrauding of an entire culture.” I’ll assume, per your own framing, that you’re open to persuasion if someone...