All of MazevSchlong's Comments + Replies

I think this is a great point: 

None of us have ever managed an infinite army of untrained interns before
 

It's probable that AIs will force us to totally reformat workflows to stay competitive. Even as the tech progresses, it's likely there will remain things that humans are good at and AIs lag on. If intelligence can be represented by some sort of n-dimensional object, AIs are already superhuman on some subset of those n dimensions, but beating humans on all n seems unlikely in the near-to-mid term. 

In this case, we need to segment work, and have a good... (read more)

AnthonyC
There's an important reason to keep some of us around. This is also an important point.

Sure, fair point! But people gossiping online about missed benchmark questions, and then likely spoiling the answers, generally means that a question is now ~ruined for all future training runs. How many of these modest benchmark improvements over time can be attributed to this?

The fact that frontier AIs can basically see and regurgitate everything ever written on the entire internet is hard to fathom!

I could be really petty here and spoil these answers for all future training runs (and make all future models look modestly better), but I just joined this site so I’ll resist lmao … 

But isn’t this exactly the OPs point? These models are exceedingly good at self-contained, gimmicky questions that can be digested and answered in a few hundred tokens. No one is denying that! 

Secondly, there's a high chance that these benchmark questions are simply in these models' datasets already. They have superhuman memory of their training data; there's no denying that. Are we sure that these questions aren't in their datasets? I don't think we can be. First off, you just posted them online. But in a more conspiratorial light, can we really be ... (read more)

Steven Byrnes
Yup, I expected that OP would generally agree with my comment. They only posted three questions, out of at least 62 (= 1/(.2258 − .2097)), perhaps many more than 62. For all I know, they removed those three from the pool when they shared them. That's what I would do—probably some human will publicly post the answers soon enough. I dunno. But even if they didn't remove those three questions from the pool, it's a small fraction of the total.

You point out that all the questions would be in the LLM company user data, after kagi has run the benchmark once (unless kagi changes out all their questions each time, which I don't think they do, although they do replace easier questions with harder questions periodically). Well:

* If an LLM company is training on user data, they'll get the questions without the answers, which probably wouldn't make any appreciable difference to the LLM's ability to answer them;
* If an LLM company is sending user data to humans as part of RLHF or SFT or whatever, then yes, there's a chance for ground-truth answers to sneak in that way—but that's extremely unlikely to happen, because companies can only afford to send an extraordinarily small fraction of user data to actual humans.
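(The "at least 62" figure is a back-of-envelope inference from score granularity: if two reported accuracies differ by some smallest step, the question pool must hold at least 1/step questions. A quick sketch, using the two scores quoted above:)

```python
# Infer a lower bound on benchmark size from the smallest observed
# difference between two reported accuracy scores.
score_a = 0.2258
score_b = 0.2097

smallest_step = score_a - score_b   # ≈ 0.0161, i.e. one question's worth
min_questions = 1 / smallest_step   # pool must have at least this many

print(round(min_questions))  # → 62
```

This only gives a floor: if the two scores actually differ by several questions, the pool is correspondingly larger.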
gwern

Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online.

Questions being online is not a bad thing. Pretraining on the datapoints is very useful, and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren't). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can... still train on the new question then and there... (read more)