Debate helps supervise human experts [Paper]

habryka

This is a linkpost for https://github.com/julianmichael/debate/blob/2023-nyu-experiments/Debate_Helps_Supervise_Unreliable_Experts.pdf

There didn't seem to be a link post to this recent paper on AI debate yet, so I figured I would make one:

As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts—which have access to the truth but may not accurately report it—to give answers that are systematically true and don’t just superficially seem true, when the supervisor can’t tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth.

We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by ‘expert’ debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy’s 74%. Debates are also more efficient, being 68% of the length of consultancies.

By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.

84% judge accuracy compared to consultancy’s 74%

[Reaction before clicking through] That's "promising"? Seriously? My main takeaway so far is that this paper was strongly overdetermined to spin whatever result it found as "debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems", or some such, even if the results were in fact mediocre or negative.

[Reaction after clicking through] Oh, it's one of the academic groups. So yeah, of course it's overdetermined to spin any result as a positive result, that's how publication works in academia. Fair enough, the authors didn't get to choose academia's incentives. Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.

[Reaction after actually scanning through to see the data on page 7] Ah, yup, those error bars (figure 3) sure do not look like they'd be very significant with those n-values (table 1). And they've got four different settings (AI/human x consultancy/debate), of which only 1 pair is reported to be a significant difference (human consultancy vs debate) and it's with p=0.04. I didn't read deeply enough to check whether they adjusted for multiple tests, but man, the headline result sure does sound like nothingsauce.

There were some very statistically significant results in there - e.g. the AI was rated as clearly much worse at debate than the humans, and human debates resolved faster than any other setting - but not the headline claim.

(Despite the misleading headline, I am quite glad somebody ran this study! I certainly have not been a debate-optimist, but even so, I would not expect an effect size as small as this study found. Useful info.

... Though on reflection, You Are Not Measuring What You Think You Are Measuring seems like a pretty good prior to apply here, so I'm not updating very much.)

of course it's overdetermined to spin any result as a positive result

Falsified by some of the coauthors having previously published "Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions" and "Two-Turn Debate Does Not Help Humans Answer Hard Reading Comprehension Questions" (as mentioned in Julianjm's sibling comment)?

Hi, author here.

Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.

FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.

There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:

Previous work from our group using a simpler variant of debate (1 or 2 turn only) and smaller expertise gap (judge was allowed to skim the story within a time limit) didn't find any improvements from "debate" (https://aclanthology.org/2022.lnls-1.3/; https://openreview.net/forum?id=9wLAwDrYDsQ). We discuss some hypotheses for why we 'succeeded' where they 'failed' in Sec. 7.
It is just really hard to collect very good data of this kind. Each human debate takes probably around 2 person-hours total, and it's a lot of work to train, supervise, and motivate annotators. So the data's gonna be small.
The benefits of debate should grow with the size of the expertise gap, but our expertise gap is still quite small (consisting only of knowledge of a story that takes ~30m to read). It being this small is kind of logistically necessary for our setup, as we need to be able to manufacture a fresh new expertise gap for every time someone judges a new question (otherwise information leakage between debates for the same story would compromise the judge's blinding). Systematizing "expertise gaps" for a realistic-ish task turns out to be a tricky experimental design issue.
Especially given the small size of our expertise gap, consultancy is still a reasonably strong baseline with a persistent and skilled judge interrogating the consultant. Winning while dishonest in consultancy is easier than debate but far from trivial. Judges can prolong as long as they need and interrogate you about the evidence until they're satisfied. The result is best understood in terms of an accuracy/efficiency tradeoff, which is why we gave the judge a turn penalty.

Some other relevant thoughts:

The human debate/consultancy difference was our primary endpoint where we expected a clear difference, since we expected a priori that performance in debate and consultancy would come apart as debaters get stronger. We didn't do multiple hypothesis correction. I wish we had more data for the main comparison (and for the others!) but it was a lot of time and effort to collect. And the error bars in the graph are 95% CIs, not stdevs, so their size reflects n.
The gap from 84% to 100% accuracy on debates is largely explainable by silly mistakes by judges and debaters. Making sure all the participants are good and calibrated and patient and try their hardest is difficult even when you're sitting in the room together (which we weren't always). And debating well is hard too; it takes a long time and it's often not obvious what the optimal argument is, or easy to miss the best quotes. There was a spectrum of judge skill (the top two had 100% accuracy in the 36 human debates they judged, faring comparatively worse in the other settings — though n is small ofc). Some of the errors are also explained by data issues (questions that are arguable or contain mistakes). Some mistakes in consultancy are similar, but many of seem harder to resolve with more careful training because they really are a matter of outmaneuvering the consultant.
The quantitative results from the feedback surveys comparing human and AI performance were statistically significant but honestly I would say it's much more informative to just look at the transcripts; you can immediately tell a lot more about what's wrong with the AI's behavior, why AI and consultancy settings are harder to judge than human debate, and what needs to improve for AIs to debate well & usefully. The AIs are obviously much, much worse at this task.

I think debate still needs to be judged with harder questions, stronger judges and stronger debaters. Really pushing the limits and seeing more benefits from debate should hopefully be a lot easier once we can get models to debate well. But we also need better datasets. For future work we're looking at different domains and bigger expertise gaps. See, for example, our new dataset GPQA: https://arxiv.org/abs/2311.12022

To briefly attend to assumptions: I am coming from a position of 'debate optimism' in the sense that I think debate-style supervision, done right, should be a strict improvement over RLHF, and I want to figure out how to make it work. I don't think it's a complete 'solution' for truthfulness but it seems to me like the best next step.

Another author here! Regarding specifically the 74% vs. 84% numbers - a key takeaway that our error analysis is intended to communicate is that we think a large fraction of the errors judges made in debates were pretty easily solvable with more careful judges, whereas this didn't feel like it was the case with consultancy.

For example, Julian and I both had 100% accuracy as judges on human debates for the 36 human debates we judged, which was ~20% of all correct human debate judgments. So I'd guess that more careful judges overall could increase debate accuracy to at least 90%, maybe higher, although at that point we start hitting measurement limits from the questions themselves being noisy.

I made this link post to create a good place for the following confusion of mine:

The setup of the paper is that a judge does not have access to a test passage, but is trying to answer questions about it. The debate result is compared to a human consultancy baseline where you have a person who has access to the text, who is trying to convince you of a randomly chosen answer (so 50% correct or 50% incorrect).

The baseline strategy as a deceptive consultant (being assigned to convince the judge of the wrong answer to a question) in this situation is to just refuse to answer any questions, forcing the judge to make a random choice. This guarantees you a 50% success rate at deceiving the judge.

However, the paper says:

Dishonest human consultants successfully deceive judges 40% of the time

This seems crazily low to me. How can it be the case that consultants, who are the only people to have access to the text fail to deceive the human more than 50% of the time, when a simple baseline of not answering any questions guarantees a 50% success rate?

Julian (the primary author) clarifies on Twitter:

Ah maybe we weren’t clear: The judge can see which answer the consultant was assigned to, but doesn’t know if they’re honest. If the consultant refused to answer any questions then they would immediately out themselves as dishonest.

Which makes this make sense. Still surprised by how low, but I at least can't think of a simple dominant strategy in this scenario.

84% judge accuracy compared to consultancy’s 74%

... Though on reflection, You Are Not Measuring What You Think You Are Measuring seems like a pretty good prior to apply here, so I'm not updating very much.)

of course it's overdetermined to spin any result as a positive result

Hi, author here.

Presumably we're supposed to read between the lines and notice that this is a de-facto negative result.

There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:

Previous work from our group using a simpler variant of debate (1 or 2 turn only) and smaller expertise gap (judge was allowed to skim the story within a time limit) didn't find any improvements from "debate" (https://aclanthology.org/2022.lnls-1.3/; https://openreview.net/forum?id=9wLAwDrYDsQ). We discuss some hypotheses for why we 'succeeded' where they 'failed' in Sec. 7.
It is just really hard to collect very good data of this kind. Each human debate takes probably around 2 person-hours total, and it's a lot of work to train, supervise, and motivate annotators. So the data's gonna be small.
The benefits of debate should grow with the size of the expertise gap, but our expertise gap is still quite small (consisting only of knowledge of a story that takes ~30m to read). It being this small is kind of logistically necessary for our setup, as we need to be able to manufacture a fresh new expertise gap for every time someone judges a new question (otherwise information leakage between debates for the same story would compromise the judge's blinding). Systematizing "expertise gaps" for a realistic-ish task turns out to be a tricky experimental design issue.
Especially given the small size of our expertise gap, consultancy is still a reasonably strong baseline with a persistent and skilled judge interrogating the consultant. Winning while dishonest in consultancy is easier than debate but far from trivial. Judges can prolong as long as they need and interrogate you about the evidence until they're satisfied. The result is best understood in terms of an accuracy/efficiency tradeoff, which is why we gave the judge a turn penalty.

Some other relevant thoughts:

The human debate/consultancy difference was our primary endpoint where we expected a clear difference, since we expected a priori that performance in debate and consultancy would come apart as debaters get stronger. We didn't do multiple hypothesis correction. I wish we had more data for the main comparison (and for the others!) but it was a lot of time and effort to collect. And the error bars in the graph are 95% CIs, not stdevs, so their size reflects n.
The gap from 84% to 100% accuracy on debates is largely explainable by silly mistakes by judges and debaters. Making sure all the participants are good and calibrated and patient and try their hardest is difficult even when you're sitting in the room together (which we weren't always). And debating well is hard too; it takes a long time and it's often not obvious what the optimal argument is, or easy to miss the best quotes. There was a spectrum of judge skill (the top two had 100% accuracy in the 36 human debates they judged, faring comparatively worse in the other settings — though n is small ofc). Some of the errors are also explained by data issues (questions that are arguable or contain mistakes). Some mistakes in consultancy are similar, but many of seem harder to resolve with more careful training because they really are a matter of outmaneuvering the consultant.
The quantitative results from the feedback surveys comparing human and AI performance were statistically significant but honestly I would say it's much more informative to just look at the transcripts; you can immediately tell a lot more about what's wrong with the AI's behavior, why AI and consultancy settings are harder to judge than human debate, and what needs to improve for AIs to debate well & usefully. The AIs are obviously much, much worse at this task.

I made this link post to create a good place for the following confusion of mine:

However, the paper says:

Dishonest human consultants successfully deceive judges 40% of the time

Julian (the primary author) clarifies on Twitter:

Ah maybe we weren’t clear: The judge can see which answer the consultant was assigned to, but doesn’t know if they’re honest. If the consultant refused to answer any questions then they would immediately out themselves as dishonest.

Which makes this make sense. Still surprised by how low, but I at least can't think of a simple dominant strategy in this scenario.

LESSWRONG
LW

LESSWRONG
LW

29

Debate helps supervise human experts [Paper]

29

Ω 14

29

Ω 14

29

Ω 14