Thank you for posting this! I think it's well worth thinking through debate baselines more.
TL;DR:
Hi, author here.
Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.
FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.
There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:
Re: how to update based on benchmark progress in general, see my response to you above.
On the rest, the best way I can think of to explain this is in terms of alignment rather than correctness.
My naive interpretation is that we only use ML when we can't be bothered to write a traditional solution, but I don't think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
The bird example is good. My contention is basically that when it comes to making something like "recognizing birds" economically...
"actually it's quite gameable" = "actually it's quite easy" ;)
You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it is a domain in which humans normally exercise a wide range of capabilities. The problem is that in operationalizing the test (e.g., as trying to fool a human), it ends up being possible, even easy, to pass without necessarily using or requiring all of those capabilities. And this can happen ...
I'm not saying that all benchmarks are necessarily hard; I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that it is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable. This is why the Winograd Schema Challenge was proposed, ...
10x seems reasonable on its face, but honestly I have no idea. We haven't really dealt with scales and feature learners like this before. I assume a big part of what the model is doing is learning good representations that allow it to learn more/better from each example as training goes on. Given that, I can imagine arguments either way. On one hand, good representations could mean the model is discerning on its own what's important (so maybe data cleaning doesn't matter much). On the other, maybe noisy data (say, with lots of irreducible entropy—though th...
Oh yeah, one more thing. You say:
We have automated a lot of things in the past couple of hundred years with unaccountable machines and unaccountable software, and the main difference with ML seems to be that it's less interpretable.
I strongly disagree with this characterization. Traditional software is perfectly accountable. Its behavior can be understood and predicted exactly from its implementation and semantics. This is a huge part of what makes it so damn useful. It reifies the process of specifying rules for behavior; where bad behavior is found, the ...
I guess my main concern here (besides everything I wrote in my reply to you below) is basically that the reliability of GPT-N on simple multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we're to take SuperGLUE as representative, well, it's already basically solved.
One of the problems here is that when the noise ceiling is as low as it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren't. It means you w...
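To make that concrete, here's a toy simulation (my own illustration, with made-up numbers; nothing here is taken from SuperGLUE itself) of how a noisy gold standard caps measured "human performance" even when the underlying task is perfectly deterministic: score one annotator against a majority vote of the others, and the resulting "baseline" mostly reflects annotator noise rather than task difficulty.

```python
# Toy simulation (illustrative only): a fully deterministic binary task labeled
# by noisy annotators. "Human performance" here is one annotator scored against
# the majority vote of the other three.
import random

random.seed(0)

P_AGREE = 0.85       # assumed probability an annotator matches the true label
N_ITEMS = 100_000
N_ANNOTATORS = 4     # annotator 0 is scored against the majority of the rest

def annotate(true_label: int) -> int:
    """Return the true label with probability P_AGREE, otherwise flip it."""
    return true_label if random.random() < P_AGREE else 1 - true_label

hits = 0
for _ in range(N_ITEMS):
    true_label = random.randint(0, 1)
    labels = [annotate(true_label) for _ in range(N_ANNOTATORS)]
    gold = 1 if sum(labels[1:]) >= 2 else 0   # majority vote of annotators 1-3
    hits += labels[0] == gold                 # annotator 0 vs. the noisy gold

print(f"Apparent 'human performance': {hits / N_ITEMS:.3f}")
# Comes out around 0.81 with these numbers, even though the true task is
# deterministic and a model that learned the real labeling rule would be far
# more reliable than the reported "human baseline" suggests.
```

This is a cartoon, of course (real annotator errors aren't i.i.d., and this isn't necessarily how any given benchmark estimates its human baseline), but it's the basic mechanism I have in mind when I say the ceiling belongs to the humans, not the task.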
On 1a:
Insofar as humans like having reasons for failures, I'm willing to accept this as one reason that reliability standards could be a bit higher for ML, but I doubt they would be drastically higher. I'd love a real example (outside of criminal justice) where this is a bottleneck.
Take, for example, writing news/journalistic articles. Distinguishability from human-written articles is used as evidence for GPT's abilities. The abilities are impressive here, but the task at hand for the original writer is not to write an article that looks human, but one that ...
Oh yeah, one more thing, which I think might actually be the most important point. On a lot of these benchmarks (at the very least, on SuperGLUE), "human-level performance" is a much weaker requirement than "human equivalence." Human performance isn't necessarily an indicator of irreducible entropy in the underlying task. To a large extent, it just reflects ambiguity or coarseness in the dataset specification. A big part of this is the artificial setting of the data annotation, which is unfortunately kind of necessary in a lot of cases when the goal is cha...
Hi — new here. I'm an NLP researcher, and for background, I would guess I fall on the skeptical side of the scaling hypothesis by LW standards. I was pointed to this by abergal. Here is some feedback:
1. On the question of economic value:
AFAIK, there are many factors other than raw performance on a benchmark (relating to interpretability, accountability/explainability, and integration with surrounding software) which may dominate the question of economic viability of AI systems, at least in the sub-99.99% accuracy regime (depending on a...
This looks great! Let me see if I understand the big picture correctly (omitting some of your experiments to just get the main thrust):
- You finetuned GPT-3.5 in two conditions: 1) to output the correct answer to 500 MMLU questions directly, and 2) to output the correct answer given debates between copies of GPT-4o on those same 500 MMLU questions (roughly as sketched at the end of this comment). You found that validation accuracy when judging debates was better than when answering blind.
- This difference did not appear when using GPT-4o as a judge, suggesting that in the capability asymmetric case, the judge learned to rely on the debate, whereas in the capabili
...
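Assuming I have the setup right, here's roughly how I'm picturing the two fine-tuning datasets. Everything below (the chat format, the prompt wording, the file names) is my own guess for illustration, not code or prompts from your post:

```python
# Hypothetical sketch of the two fine-tuning conditions as chat-format JSONL
# records (all names and prompt wording are guesses, not the authors').
import json

LETTERS = "ABCD"

def blind_record(question, choices, answer):
    """Condition 1: the judge sees only the MMLU question and answers blind."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    return {"messages": [
        {"role": "user", "content": f"{question}\n{options}\nAnswer with A, B, C, or D."},
        {"role": "assistant", "content": answer},   # supervised target: correct letter
    ]}

def debate_record(question, choices, transcript, answer):
    """Condition 2: same question, but the judge also sees a debate between
    two copies of GPT-4o arguing for different answers."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    return {"messages": [
        {"role": "user", "content": (
            f"{question}\n{options}\n\nDebate transcript:\n{transcript}\n"
            "Based on the debate, answer with A, B, C, or D."
        )},
        {"role": "assistant", "content": answer},
    ]}

# One record per condition for the same (hypothetical) question; in the real
# setup there would be ~500 of these per file, used to fine-tune GPT-3.5
# separately for each condition, with held-out questions for the
# validation-accuracy comparison.
question, choices, answer = ("Which planet is the largest?",
                             ["Mars", "Jupiter", "Venus", "Mercury"], "B")
with open("blind.jsonl", "w") as f:
    f.write(json.dumps(blind_record(question, choices, answer)) + "\n")
with open("debate.jsonl", "w") as f:
    f.write(json.dumps(debate_record(question, choices,
                                     "<GPT-4o debate transcript>", answer)) + "\n")
```

(Again, just my reconstruction of the setup; happy to be corrected on the details.)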