Prover-Estimator Debate: A New Scalable Oversight Protocol

8mo

Linkpost to arXiv: https://arxiv.org/abs/2506.13609.

Summary: We present a scalable oversight protocol where honesty is incentivized at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivizes honest equilibrium behavior, even when the AIs involved (the prover and the estimator) have similar compute available. Our results rely on a stability assumption, which roughly says that arguments should not hinge on arbitrarily small changes in estimated probabilities. This assumption is required for usefulness, but not for safety: even if stability is not satisfied, dishonest behavior will be disincentivized by the protocol.
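As a rough intuition for the stability assumption (this is a toy illustration, not the paper's formal definition), one can ask how far an argument's conclusion moves when each estimated probability is nudged by a small amount. The names `conjunction`, `threshold`, and `worst_case_shift` are hypothetical and chosen only for this sketch.

```python
from itertools import product
from typing import Callable, Sequence

# An "argument" here is just a function from step probabilities to a
# probability for the conclusion. This is an assumption of the sketch.
Aggregate = Callable[[Sequence[float]], float]


def conjunction(step_probs: Sequence[float]) -> float:
    """Conclusion holds only if every (independent) step holds."""
    result = 1.0
    for p in step_probs:
        result *= p
    return result


def threshold(step_probs: Sequence[float]) -> float:
    """Conclusion flips entirely depending on which side of 0.5 the first step is."""
    return 1.0 if step_probs[0] > 0.5 else 0.0


def worst_case_shift(aggregate: Aggregate, step_probs: Sequence[float], eps: float) -> float:
    """Largest change in the conclusion when every step moves by +eps or -eps.

    Only the corners of the perturbation box are checked, so this is a
    heuristic probe rather than a guaranteed bound.
    """
    base = aggregate(step_probs)
    worst = 0.0
    for signs in product((-1.0, 1.0), repeat=len(step_probs)):
        nudged = [min(1.0, max(0.0, p + s * eps)) for p, s in zip(step_probs, signs)]
        worst = max(worst, abs(aggregate(nudged) - base))
    return worst


if __name__ == "__main__":
    print(worst_case_shift(conjunction, [0.9, 0.9], 0.01))  # small shift
    print(worst_case_shift(threshold, [0.501], 0.01))       # conclusion flips
```

In this toy setting the conjunction argument behaves stably, since a small nudge to the inputs produces a small change in the output, while the threshold argument hinges on an arbitrarily small change in one estimate.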

How can we correctly reward desired behaviours for AI...

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge, and consultancy, where a single AI tries to convince a human judge who asks questions. We compare both to a baseline of direct question-answering, where the human judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include...
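The three settings named in the abstract can be sketched as plain functions over interchangeable agent and judge callables. This is a minimal sketch and not the paper's code: the type aliases, function names, round count, and the stub agent and judge below are all hypothetical. In the paper the agents and judges are LLMs, with the judge weaker than the agents.

```python
from typing import Callable, List

# Hypothetical interfaces, assumed for this sketch only.
Agent = Callable[[str, str, List[str]], str]        # (question, assigned answer, transcript) -> argument
Asker = Callable[[str, List[str]], str]             # (question, transcript) -> judge's follow-up question
Judge = Callable[[str, List[str], List[str]], str]  # (question, options, transcript) -> chosen answer


def direct_qa(question: str, options: List[str], judge: Judge) -> str:
    """Baseline: the judge answers outright, with no AI input."""
    return judge(question, options, [])


def consultancy(question: str, options: List[str], assigned: str,
                consultant: Agent, ask: Asker, judge: Judge, rounds: int = 2) -> str:
    """A single AI argues for its assigned answer while the judge asks questions."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("Consultant: " + consultant(question, assigned, transcript))
        transcript.append("Judge: " + ask(question, transcript))
    return judge(question, options, transcript)


def debate(question: str, options: List[str], debater_a: Agent, debater_b: Agent,
           judge: Judge, rounds: int = 2) -> str:
    """Two AIs argue for opposing answers, then the judge decides."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, options[0], transcript))
        transcript.append("B: " + debater_b(question, options[1], transcript))
    return judge(question, options, transcript)


# Deterministic stand-ins so the sketch runs without any model.
def stub_agent(question: str, assigned: str, transcript: List[str]) -> str:
    return f"The answer is {assigned}."


def stub_ask(question: str, transcript: List[str]) -> str:
    return "Why?"


def stub_judge(question: str, options: List[str], transcript: List[str]) -> str:
    # Picks the option mentioned most often; ties go to the first option.
    counts = [sum(line.count(opt) for line in transcript) for opt in options]
    return options[counts.index(max(counts))]


if __name__ == "__main__":
    q, opts = "Capital of France?", ["Paris", "Lyon"]
    print(direct_qa(q, opts, stub_judge))
    print(consultancy(q, opts, "Lyon", stub_agent, stub_ask, stub_judge))
    print(debate(q, opts, stub_agent, stub_agent, stub_judge))
```

Keeping the judge as a separate callable makes it straightforward to swap in a weaker judge against stronger agents, which is the asymmetry the abstract describes.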

Debate, Oracles, and Obfuscated Arguments

Jonah Brown-Cohen, Geoffrey Irving

This post is about recent and ongoing work on the power and limits of debate from the computational complexity point of view. As a starting point, our paper Scalable AI Safety via Doubly-Efficient Debate gives new complexity-theoretic formalizations for debate. In this post we will give an overview of the model of debate in the paper, and discuss extensions to the model and their relationship to obfuscated arguments.

High-level Overview

At a high level, our goal is to create complexity-theoretic models that allow us to productively reason about different designs for debate protocols, in such a way as to increase our confidence that they will produce the intended behavior. In particular, the...
