The default outcome of debate does not look promising. But there's a good deal of room to improve on the default.
Maybe half the problem with public discourse is that people have social goals that distract them from reality. I'm not confident that AI researchers will be more truth-oriented, but I see plenty of room for hope.
Drexler's CAIS paper describes some approaches that are likely needed to make debate work. Section 25:
Optimized advice need not be optimized to induce its acceptance
Advice optimized to produce results may be manipulative, optimized to induce a client’s acceptance; advice optimized to produce results conditioned on its acceptance will be neutral in this regard.
Section 20:
Collusion among superintelligent oracles can readily be avoided
C1) To improve the quality of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics, any one of which could disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement both independent and adversarial objectives.
C4) It is natural to apply fixed (hence memory-free) system instantiations to multiple problems, incidentally yielding a series of history-blind, single-move decisions.
C5) It is natural to provide differentiated, task-relevant information to systems solving different problems, typically omitting knowledge of general circumstances.
Some of these approaches are costly to implement. That might doom debate.
Success with debate likely depends on the complexity of key issues to be settled by debate, and/or the difficulty of empirically checking proposals.
Eliezer sometimes talks as if we'd be stuck evaluating proposals that are way too complex for humans to fully understand. I expect alignment can be achieved by evaluating some relatively simple, high-level principles. I expect we can reject proposals from AI debaters that are too complex, and select simpler proposals until we can understand them fairly well. But I won't be surprised if we're still plagued by doubts at the key junctures.
As a Debater's capabilities increase, I expect it to become more able to convince a human of both true propositions and also of false propositions. Particularly when the propositions in question are about complex (real-world) things. And in order for Debate to be useful, I think it would indeed have to be able to handle very complex propositions like "running so-and-so AI software would be unsafe". For such a proposition P, in the limit of Debater capabilities, I think a Debater would have roughly as easy a time convincing a human of P as of ¬P. Hence: as Debater capabilities increase, if the judge is human and the questions being debated are complex, I'd tentatively expect the Debaters' arguments to mostly be determined by something other than "what is true".
I.e., the approximate opposite of
"in the limit of argumentative prowess, the optimal debate strategy converges to making valid arguments for true conclusions."
[This post is an almost direct summary of a conversation with John Wentworth.]
Note: This post assumes that the reader knows what AI Safety via Debate is. I don't spend any time introducing it or explaining how it works.
The proposal of AI safety via debate depends on a critical assumption: that "in the limit of argumentative prowess, the optimal debate strategy converges to making valid arguments for true conclusions."
If that assumption is true, then a sufficiently powerful AI-debate system is a safe oracle (barring other possible issues that I'm not thinking about right now). If this assumption is false, then the debate schema doesn't add any additional safety guarantee.
Is this assumption true?
That's hard to know, but one thing we can do is compare it to other, analogous, real-world cases, and see whether analogous versions of the "valid arguments for true conclusions" claim hold. Examining those analogous situations is unlikely to be definitive, but it might give us some hints, or more of a handle on what criteria must be satisfied for the assumption to hold.
In this essay I explore some analogs: an attempted counterexample and an attempted example.
Attempted counterexample: Public discourse
One analogous situation in which this assumption seems straightforwardly false is general public discourse (on Twitter, in "the media", around the water cooler, etc.). Very often, memes that are simple and well-fitted to human psychology, but false, have a fitness advantage over more complicated ideas that are true.
For instance, the core idea of the minimum wage can be pretty attractive: lots of people are suffering because they have to work very hard, but they don't make much money, sometimes barely enough to live on. That seems like an inhumane outcome, so we should mandate that everyone be paid at least a fair wage.
This simple argument doesn't hold [1], but the explanation for why it is false is a good deal longer than the initial argument. To show what's wrong with it, one has to back up and explain the basic principles of supply and demand, and demonstrate that those principles apply in this case. You might have to write a slim textbook to make the counterargument in a compelling way.
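To give a feel for the shape of that counterargument, here is a toy sketch of a labor market with a wage floor. The linear demand and supply curves and every number in it are illustrative assumptions of mine, not anything from the post or from an actual economics text: the point is only that once the floor binds, firms hire fewer hours than workers want to supply.

```python
# Toy linear labor market: all functional forms and numbers are illustrative.

def demand(wage):
    """Hours of labor firms want to hire at a given wage."""
    return max(0.0, 100 - 4 * wage)

def supply(wage):
    """Hours of labor workers want to offer at a given wage."""
    return max(0.0, 10 * wage - 40)

def equilibrium_wage():
    """Find the market-clearing wage by bisection on excess demand."""
    lo, hi = 0.0, 25.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if demand(mid) > supply(mid):
            lo = mid   # excess demand: the wage must rise
        else:
            hi = mid   # excess supply: the wage must fall
    return (lo + hi) / 2

w_star = equilibrium_wage()
print(f"market-clearing wage: {w_star:.2f}, hours traded: {demand(w_star):.1f}")

# Impose a binding wage floor above the market-clearing wage.
floor = w_star + 3
hired = demand(floor)     # firms only hire this much at the floor
offered = supply(floor)   # workers would like to supply this much
print(f"floor {floor:.2f}: hours hired {hired:.1f}, "
      f"hours offered {offered:.1f}, shortfall {offered - hired:.1f}")
```

Even this stripped-down version needs curves, an equilibrium, and a comparison of two regimes, which is exactly the "slim textbook" problem: the rebuttal carries far more moving parts than the slogan it rebuts.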
And in the actual real world, it seems like false-but-simple-and-attractive ideas very often win out over true ones.
This seems like it doesn't bode well for AI safety via debate, but it isn't decisive. It could be that the Debate mechanism is more truth-tracking than public discourse. At each round of Debate, one of the debaters makes the single argument that is most likely to be persuasive to the judge. Perhaps that continual zeroing in on the most alive line of argument is truth-tracking.
However, I'll observe that the fact that a longer explanation is sometimes needed to demonstrate a true conclusion than to (falsely) demonstrate a false one suggests that this isn't the case. The longer the explanation required to make a case, the more surface area there is to attack: an argument is only as strong as its weakest step, so its overall strength is a min over the argument steps, and that minimum tends to fall as the number of steps grows. If defending the truth requires writing an economics textbook, that means that the fraudulent debater has a whole textbook's worth of material to nit-pick. And if at every step the debater arguing for a true conclusion needs to produce a longer explanation than the fraudulent debater, this suggests that the fraudulent debater has a systematic advantage over the truthful one.
(Note that calling one of the AIs "the truthful debater" is already a bit of an overreach. It assumes that, when investigating a yes-or-no question, one side will adopt the true position and one side will adopt the false position, instead of both sides adopting different false positions; and, further, that the optimal strategy when defending a true position is to use correct arguments.)
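To make the min-over-steps intuition concrete, here is a toy simulation. The assumption that each step's strength is an independent uniform(0, 1) draw is mine, purely for illustration; the only point is how fast the expected weakest step degrades as arguments get longer.

```python
import random

def average_weakest_step(num_steps, trials=50_000):
    """Average strength of the weakest step in an argument with num_steps steps,
    assuming each step's strength is an independent uniform(0, 1) draw."""
    total = 0.0
    for _ in range(trials):
        total += min(random.random() for _ in range(num_steps))
    return total / trials

# A one-line slogan vs. progressively longer, textbook-style cases.
for steps in (1, 3, 10, 100):
    print(f"{steps:>3} steps -> average weakest step: {average_weakest_step(steps):.3f}")
```

Under these (made-up) assumptions, the expected weakest step of a one-step argument is about 0.5, while a hundred-step argument's weakest step averages about 0.01, which is the sense in which a longer case hands the attacker more to nit-pick.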
Attempted example: An efficient market
What about an example where there is some kind of truth-guarantee?
One thing that comes to mind is efficient markets. In the limit of efficiency, the price of a good is always equal to the marginal cost of production of that good.
This is pretty promising; there's a nice symmetry with AI Safety via Debate: the competing producers correspond to the competing debaters, the consumers correspond to the judge, and competitive pressure plays the role of the adversarial optimization.
It seems like the analogy is pretty tight. And it suggests some kind of guarantee. But a guarantee of what, exactly?
One question that is useful here is: what happens in an efficient market when the consumers are systematically biased?
For concreteness, let's say that many consumers systematically mis-predict their own enjoyment: each one will actually get n utility from a marginal lollypop, but robustly expects to get 3n utility instead.[2]
In this world, demand for lollypops is higher than it would otherwise be, and supply rises to meet that demand. Given time to equilibrate, the price of a lollypop equals the marginal cost of production of a lollypop.
But, because the consumers are mistaken about their own preferences, the market is not reflective of human values.
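Here is a minimal sketch of that equilibrium story. The constant marginal cost and the declining utility curve are illustrative assumptions I'm adding, not part of the post's setup: the price ends up at marginal cost either way, but the biased consumers overbuy, and their welfare measured by their true preferences suffers.

```python
# Toy model of the lollypop market.  All functional forms and numbers are
# illustrative assumptions, not taken from the post.

MARGINAL_COST = 6.0          # perfectly elastic supply: the price settles here

def true_marginal_utility(q):
    """True utility (in dollars) of the q-th lollypop: declines as you eat more."""
    return max(0.0, 10.0 - 0.1 * q)

def quantity_bought(bias):
    """Consumers keep buying while (bias * true marginal utility) >= price."""
    q = 0
    while bias * true_marginal_utility(q) >= MARGINAL_COST:
        q += 1
    return q

def true_consumer_welfare(q):
    """Total true utility of the first q lollypops, minus what was paid for them."""
    utility = sum(true_marginal_utility(i) for i in range(q))
    return utility - MARGINAL_COST * q

for bias in (1.0, 3.0):      # unbiased consumers vs. consumers expecting 3n
    q = quantity_bought(bias)
    print(f"bias {bias}: price {MARGINAL_COST}, quantity {q}, "
          f"true consumer welfare {true_consumer_welfare(q):.1f}")
```

In both runs the market is "efficient" in the price-equals-marginal-cost sense; the bias only shows up in the quantity purchased and the welfare the consumers actually get.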
Analogously, in Debate, having a biased judge doesn't change the basic fact that each debater is efficient, in the sense that the argument it makes in each round is the optimally persuasive argument.[3] If greater win probability can be gained by making argument B instead of argument A, then the debater will make argument B.
But in the same way that efficient market prices are not a guarantee that the market will reflect human wants if consumers are biased, efficient Debate arguments are not a guarantee of truth-tracking if the judge is biased. For sufficiently savvy debaters, any gap between what's true and what's persuasive will be exploited.
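The same point, sketched on the Debate side. The argument list and the judge's "persuasion scores" below are invented stand-ins for a human judge's reactions, not anything from the Debate proposal itself: the debater's move is an argmax over the judge's scores, with no term anywhere for validity, so whenever the judge scores a false argument highest, that is the argument that gets played.

```python
# Toy picture of a debater optimizing against a judge.  The arguments, their
# validity flags, and the judge's scores are all invented for illustration.

arguments = [
    # (argument, is_valid, judge_persuasion_score)
    ("careful multi-step case for the true answer",  True,  0.62),
    ("vivid but misleading anecdote",                False, 0.78),
    ("technically true but irrelevant statistic",    False, 0.55),
]

def debater_move(args):
    """An efficient debater plays the argmax of the judge's score;
    nothing in the objective rewards the argument being valid."""
    return max(args, key=lambda a: a[2])

chosen = debater_move(arguments)
print("debater plays:", chosen[0])
print("argument is valid?", chosen[1])
```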
In order for the crucial assumption of debate to hold robustly, there have to be no cases in which a human judge would be more receptive to a false or invalid line of argument than to a correct counterargument. In any situation where the judge is more receptive to the false argument, in the limit of capabilities, the Debate process will deceive or manipulate the human judge.
It might be the case that such situations are rare enough that they aren't of practical relevance. But from what I know of human psychology and human rationality, that doesn't sound right to me. And regardless, it doesn't seem knowably and reliably true enough that I want to make it a key assumption of an alignment proposal.
I only spent a few minutes thinking about debate, here. But on first pass, I feel pretty pessimistic about this working.
Though there may be more sophisticated arguments for the minimum wage that do.
This makes the most sense if n is a function of something, like how many lollypops you've eaten recently, rather than a static scalar.
Assuming that the debaters were trained with a biased judge, too. (Or have learned to adapt to whichever judge is judging their debate.) Otherwise they'll make arguments that are less than efficient, for failing to account for the judge's idiosyncrasies.