Some context for this new arXiv paper from my group at NYU:
- We're working toward sandwiching experiments using our QuALITY long-document QA dataset, with reading time playing the role of the expertise variable. Roughly: Is there some way to get humans to reliably answer hard reading-comprehension questions about a ~5k-word text without ever having the participants or any other annotators take the ~20 minutes that it would require to actually read the text?
- This is an early writeup of some negative results. It's earlier in the project than I would usually write something like this up, but some authors had constraints that made it worthwhile, so I'm sharing what we have.
- Here, we tried to find out if single-turn debate leads to reliable question answering: If we give people high-quality arguments for and against each (multiple-choice) answer choice, supported by pointers to key quotes in the source text, can they reliably answer the questions under a time limit?
- We did this initial experiment in an oracle setting: We had (well-incentivized, skilled) humans write the arguments, rather than an LM. Given the limits of current LMs on long texts, we expect this to give us more information about whether this research direction is going anywhere than an LM-written version would.
- It didn't really work: Our human annotators answered at the same low accuracy with and without the arguments. The selected pointers to key quotes did help a bit, though.
- We're planning to keep pursuing the general strategy, with multi-turn debate—where debaters can rebut one another's arguments and evidence—as the immediate next step.
- Overall, I take this as a very slight update in the direction that debate is difficult to use in practice as an alignment strategy. Slight enough that this probably shouldn't change your view of debate unless you were, for some reason, interested in this exact constrained/trivial application of it.
Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative.
Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016
Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)
We're still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited setting. We're now doing a broader, more fail-fast search of other settings. Stay tuned for more methods and datasets.