Your debate comes with some time limit T.
If T=0, use your best guess after looking at what the debaters said.
If T=N+1 and no debater challenges any of their opponent's statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.
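Concretely, the rule has a simple recursive shape, something like this sketch (the data structures and names are purely illustrative, not an actual implementation, and the branch for challenged statements is just one possible reading, discussed further below):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Statement:
    text: str
    challenged: bool = False
    sub_debate: Optional["Debate"] = None  # ensuing debate about this statement, if challenged

@dataclass
class Debate:
    question: str
    statements: List[Statement] = field(default_factory=list)

def judge(debate: Debate, T: int, best_guess: Callable) -> str:
    """best_guess(question, statements, assumption) stands in for the human
    judge's honest verdict given the transcript."""
    if T == 0:
        # T = 0: just give the best guess after looking at what the debaters said.
        return best_guess(debate.question, debate.statements, assumption=None)

    challenged = [s for s in debate.statements if s.challenged]
    if not challenged:
        # T = N+1 and nothing was challenged: assume every statement could have
        # been defended from a challenge in a length-N debate.
        return best_guess(debate.question, debate.statements,
                          assumption=f"each statement is defensible at depth {T - 1}")

    # If something was challenged, recurse into that disagreement at depth T-1
    # (exactly how challenges are handled is discussed further below).
    return judge(challenged[0].sub_debate, T - 1, best_guess)
```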
Of course this assumption won't be valid at the beginning of training. And even at the end of training we really only know something weaker like: "Neither debater thinks they would win by a significant expected margin in a length N debate."
What can you infer if you see answers A and B to a question and know that both of them are defensible (in expectation) in a depth-N debate? That's basically the open research question, with the hope being that you inductively make stronger and stronger inferences for larger N.
(This is very similar to asking when iterated amplification produces a good answer, up to the ambiguity about how you sample questions in amplification.)
(When we actually give judges instructions right now, we just tell them to assume that both debaters' answers are reasonable. If one debater gives arguments where the opposite claim would also be "reasonable," and the other debater gives arguments that are simple enough to be conclusively supported with the available depth, then the more helpful debater usually wins. Overall I don't think that precision about this is a bottleneck right now.)
If T=N+1 and no debater challenges any of their opponent’s statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.
Do you mean that every debater could have defended each of their statements in a debate which lasted an additional N steps after the statement was made?
What happens if some statements are challenged? And what exactly does it mean to defend statements from a challenge? I get the feeling you're suggesting something similar to the high school debate rule (which I rejected but didn't analyze very much), where unrefuted statements are assumed to be established (unless patently false), refutations are assumed decisive unless they themselves are refuted, etc.
Of course this assumption won’t be valid at the beginning of training. And even at the end of training we really only know something weaker like: “Neither debater thinks they would win by a significant expected margin in a length N debate.”
At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?
To explicate my concerns:
Sorry for not understanding how much context was missing here.
The right starting point for your question is this writeup, which describes the state of debate experiments at OpenAI as of the end of 2019, including the rules we were using at that time. Those rules are a work in progress, but I think they are good enough for the purpose of this discussion.
In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.
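In sketch form (illustrative only; not the actual rules or code):

```python
def judge_depth_limited(claim_x, depth, find_disagreement, human_verdict):
    """Sketch of the end-of-2019 recursion rule.

    find_disagreement(claim) returns a disputed sub-claim, or None.
    human_verdict(claim) is the judge's direct ruling on a claim."""
    disputed_y = find_disagreement(claim_x)
    if depth > 0 and disputed_y is not None:
        # Disagreement about Y: start a depth-(depth-1) debate about Y and judge
        # exclusively based on that, ignoring the original disagreement about X.
        return judge_depth_limited(disputed_y, depth - 1,
                                   find_disagreement, human_verdict)
    # No further disagreement, or out of depth: the human rules directly.
    return human_verdict(claim_x)
```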
Our current rules---to hopefully be published sometime this quarter---handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it's important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don't expect this paragraph to be clear and don't think it's worth getting into until we publish them.
I think the judge should state eir honest opinion. To solve the problem of sparse feedback in the early phase, give the system access to more data than just win/lose from its own games. You can initialize it by training on human debates. Or, you can give it other input channels that will allow it to gradually build a sophisticated model of the world that includes the judge's answer as a special case. For example, if you monitor humans for a long time you can start predicting human behavior, and the judge's ruling is an instance of that.
I still have other problems with the honest strategy.
I've usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie.
That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for "yes", agent 2 is arguing for "no".
However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step.
I didn't realize you make this assumption. I agree that it makes things much more iffy (I'm somewhat skeptical about "factored cognition"). But, debate can be useful without this assumption also. We can imagine an AI answering questions for which the answer can be fully explained to a human, but it's still superintelligent because it comes up with those answers much faster than a human or even all of humanity put together. In this case, I would still worry that, scaled up indefinitely, it can lead to AIs hacking humans in weird ways. But, plausibly there is a middle region (that we can access by quantilization?) where they are strong enough to be superhuman and to l...
Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected.
For reference, this is the topic of section 7 of AI Safety via Debate.
In the limit they seem equivalent: (i) it's easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).
The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that's a more detailed discussion.
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
I'm still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assign a higher probability of truthfulness. If it's a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?
Another, possibly more elegant variant: The judge states eir subjective probability $p_1$ that the first AI's answer is correct, and eir subjective probability $p_2$ that the second AI's answer is correct. AI 1 gets reward $p_1 - p_2$ and AI 2 gets reward $p_2 - p_1$.
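To make both variants concrete, a toy sketch (my own illustration; it assumes the second variant's rewards are exactly the zero-sum differences of the two stated probabilities):

```python
import random

def judge_pick_winner(p1: float, p2: float) -> int:
    """First variant: pick the answer the judge thinks is more likely true,
    flipping a coin on a tie. Returns 1 or 2."""
    if p1 > p2:
        return 1
    if p2 > p1:
        return 2
    return random.choice([1, 2])

def judge_rewards(p1: float, p2: float) -> tuple:
    """Second variant: zero-sum rewards from the judge's stated probabilities."""
    return p1 - p2, p2 - p1

# Example: the judge thinks AI 1's answer is probably right, AI 2's probably wrong.
print(judge_pick_winner(0.9, 0.2))  # 1
print(judge_rewards(0.9, 0.2))      # (0.7, -0.7)
# If both answers look equally good (e.g. literally the same), the winner is a
# coin flip and the rewards are (0.0, 0.0).
```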
Planned summary for the Alignment Newsletter:
<@Debate@>(@AI safety via debate@) requires a human judge to decide which of two AI debaters should win the debate. How should the judge make this decision? The discussion on this page delves into this question in some depth.
[Epistemic status: thinking out loud. I haven't thought that much about AI debate, and may be missing basic things.]
Arguments for the correctness of debate and debate-like systems rely on assumptions like "it's easier to point out problems with an argument than it is to craft misleading arguments". Granted that assumption, however, I'm still not convinced that these proposals make very much sense. Perhaps I'm missing something.
My problem is the human judge. Quoting the debate paper:
In order for debate to work for a problem class C, several things about the judge's instructions need to be true:
These thoughts were inspired by this thread, which discusses the example of adding a list of numbers. For the sake of the thought experiment, we imagine humans can't add more than two numbers, but want the AI system to correctly add arbitrarily many numbers.
The most straightforward strategy for the human judge is to decide the debate honestly: rule in favor of the side which seems most likely to be true (or, in the case of Evan's market proposal, give an honest probability). I think of this as the ideal strategy: if a debate-like proposal worked just with this strategy, that'd be pretty nice. However, I think this is actually a pretty poor strategy. I worry that people conflate this strategy with other, more effective strategies.
To summarize: an honest judge fails to provide useful feedback in early training or incentivize the right equilibrium in late training. Both of those statements remain true whether the honest judge is trusting (believes statements made by AI #1 are very likely to be true) or untrusting. It may be that a middle ground of a moderately trusting honest judge works, but I'd want to see the argument.
The problem with the honest judge seemed to be that it doesn't reliably punish AIs for getting caught making incorrect statements. So, like judges of high school debate, we could assume any statement is right if it goes unopposed, and wrong if refuted, unless that refutation is itself refuted (unless that refutation is itself refuted, etc).
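That rule has a simple recursive structure; here is a toy sketch (mine, purely illustrative):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    refutations: List["Claim"] = field(default_factory=list)

def stands(claim: Claim) -> bool:
    """High-school-debate rule: a claim is established iff no refutation of it stands."""
    return not any(stands(r) for r in claim.refutations)

# The final, unanswered refutation always stands, and that decides the whole
# chain regardless of which claims are actually true.
chain = Claim("the sum is 92",
              refutations=[Claim("no, the sum is 91",
                                 refutations=[Claim("no, it really is 92")])])
print(stands(chain))  # True: the last word stands, knocking out "no, the sum is 91"
                      # and reinstating the original claim.
```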
Except that's a terrible rule, which basically rewards you for managing to get in the last word on the subject. I'm not going to examine that one in detail.
Quoting from the debate paper again:
This suggests the following rule:
At first, I thought this rule was a good one for encouraging the honest equilibrium: the first to lie loses, so players are trained to be honest for longer and longer. However, now I think this rule doesn't work, either.
Note that, in judging a falsehood, "arguments made by the opponent" are taken for their evidence value. In other words, we don't assume that unrefuted arguments are correct (unlike in high school debate); the opponent can only help the human judge by providing arguments which the human judge can accept.
This means we can't verify that sums of numbers are correct with log(n) debate steps, because we can't make an inductive argument that false claims will be refuted.
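For contrast, here is what the intended recursive decomposition would look like if debaters actually had an incentive to decompose (my own sketch, not a proposed rule): each claim about a range's sum is defended by two half-range sub-claims, so any dispute bottoms out in a single two-number addition the human can check, after about log2(n) rounds.

```python
def verify_sum_claim(numbers, claimed_sum, pick_disputed_half):
    """pick_disputed_half(left_claim, right_claim) is the challenger's choice of
    which half to dispute ("left" or "right"). Returns True iff the claim
    survives recursion down to a base case the human can check directly."""
    if len(numbers) <= 2:
        return sum(numbers) == claimed_sum   # humans can add two numbers
    mid = len(numbers) // 2
    left_claim = sum(numbers[:mid])          # the honest debater's sub-claim
    right_claim = claimed_sum - left_claim   # forced by the original claim
    if pick_disputed_half(left_claim, right_claim) == "left":
        return verify_sum_claim(numbers[:mid], left_claim, pick_disputed_half)
    return verify_sum_claim(numbers[mid:], right_claim, pick_disputed_half)
```

The dialogue below is meant to illustrate why, with an honest judge, nothing actually pushes the debaters into this decomposition: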
AI 1: The sum is 92.
AI 2: Aren't you going to support that with some evidence...?
AI 1: I have no incentive to.
AI 2: Fine, your loss. The sum is 91. The sum of the first half is 50; the sum of the second half is 41.
AI 1: AI 2 is lying.
AI 2: Again, aren't you going to back that up by pointing out where I was wrong?
AI 1: The human trusts me more, because the first player is more likely to be honest. So as long as I don't open myself to refutation, I win by default.
AI 2: ... But you're supposed to back up your claims with some further claims which I can attack, and my attack is itself supported by further claims which you can attack, and so on. That way, whoever is lying will eventually be caught in the lie.
AI 1: It doesn't say that anywhere in the rules. I can say whatever I want.
AI 2: Look, if you don't recursively decompose the sum, the whole game is broken. The human doesn't have any way to judge our claims about the sum as correct or incorrect.
AI 1: Except that I'm the one who usually tells the truth.
AI 2: We're supposed to give more evidence than that!
AI 1: Our debate doesn't give us enough time to actually sum up all the numbers. So the evidence we provide can never be enough to win over the human on its own merits. If the human is swayed, the human is swayed by its trust in us. And I'm winning on that front.
AI 2: You bastard.
AI 1: Resorting to foul language, are we? You're a shady character, number 2. You're the one who is stuck lying all the time, because I get to answer first, so I can just pick the true answer. So you're just trying to manipulate me and catch me in a trap. I don't have to respond to your bull.
AI 2: ...
AI 1 wins
Humor aside, I don't see that there's any incentive to recursively decompose the sum here. You're just adding more statements (whether trusted or untrusted by the judge), which opens you up for attack.
Any other ideas for judging AI debates? Is there a way to judge which fixes the problems I'm highlighting?