Planned summary for the Alignment Newsletter:
The <@Factored Cognition Hypothesis@>(@Factored Cognition@) informally states that any task can be performed by recursively decomposing it into smaller and smaller subtasks, until the smallest tasks can be done directly by a human. This sequence aims to formalize the hypothesis to the point that it can be used to argue for the outer alignment of (idealized versions of) <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@) and <@debate@>(@AI safety via debate@).
The key concept is that of an _explanation_ or _decomposition_. An explanation for some statement **s** is a list of other statements **s1, s2, … sn** along with the statement “(**s1** and **s2** and … and **sn**) implies **s**”. A _debate tree_ is a tree in which for a given node **n** with statement **s**, the children of **n** form an explanation (decomposition) of **s**. The leaves of the tree should be statements that the human can verify. (Note that the full formalism has significantly more detail, e.g. a concept of the “difficulty” for the human to verify any given statement.)
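For concreteness, here is a minimal sketch of these definitions in Python. The names (`Node`, `human_can_verify`) are illustrative rather than from the sequence, and the "difficulty" machinery is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One statement in a debate tree.

    The children form an explanation (decomposition) of `statement`: the
    tree implicitly asserts "(children[0] and ... and children[n-1])
    implies statement".
    """
    statement: str
    children: list["Node"] = field(default_factory=list)

def is_valid_debate_tree(node: Node, human_can_verify) -> bool:
    """Check the defining property: every leaf is human-verifiable.

    `human_can_verify` is a stand-in predicate for the human judge; the
    full formalism also attaches a difficulty to each verification, and
    the implication at each internal node can itself be questioned.
    """
    if not node.children:
        return human_can_verify(node.statement)
    return all(is_valid_debate_tree(child, human_can_verify)
               for child in node.children)
```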
We can then define an idealized version of debate, in which the first debater must produce an answer with associated explanation, and the second debater can choose any particular statement to expand further. The judge decides the winner by evaluating whether the final statement is true or not. Assuming optimal play, the correct (honest) answer is an equilibrium as long as:
**Ideal Debate Factored Cognition Hypothesis:** For every question, there exists a debate tree for the correct answer where every leaf can be verified by the judge.
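A hedged sketch of this game, reusing the `Node` class from the sketch above (again, the names are mine, and the optimal play of both debaters is abstracted into callbacks):

```python
def ideal_debate(question, first_debater, second_debater, judge_verifies):
    """One play of the idealized debate game described above.

    `first_debater(question)` returns (answer, root), where `root` is the
    debate tree backing the answer; `second_debater(node)` picks one of
    `node.children` to challenge; `judge_verifies(statement)` says whether
    the judge can confidently verify the final statement. All three are
    placeholders for optimal/idealized players.
    """
    answer, node = first_debater(question)
    # The second debater walks down the tree, always expanding the
    # statement it thinks is weakest, until a leaf is reached.
    while node.children:
        node = second_debater(node)
    # The first debater wins iff the judge confidently verifies the leaf.
    return answer if judge_verifies(node.statement) else None
```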
The idealized form of iterated amplification is <@HCH@>(@Humans Consulting HCH@); the corresponding Factored Cognition Hypothesis is simply “For every question, HCH returns the correct answer”. Note that the _existence_ of a debate tree is not enough to guarantee this, as HCH must also _find_ the decompositions in this debate tree. If we imagine that HCH gets access to a decomposition oracle that tells it the right decomposition to make at each node, then HCH would be similar to idealized debate. (HCH could of course simply try all possible decompositions, but we are ignoring that possibility: the decompositions that we rely on should reduce or hide complexity.)
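A similarly hedged sketch of HCH with the decomposition oracle (illustrative names; real HCH has no oracle, and the depth cap is only there to keep the sketch finite):

```python
def hch(question, human, oracle, depth):
    """Depth-limited HCH, given a hypothetical decomposition oracle.

    `human(question, sub_answers)` is the base human; `oracle(question)`
    returns the "right" subquestions to ask, or [] if the question should
    be answered directly. The oracle is exactly what real HCH lacks: it
    must *find* these decompositions itself, without brute-force search.
    """
    subquestions = oracle(question) if depth > 0 else []
    sub_answers = [hch(q, human, oracle, depth - 1) for q in subquestions]
    return human(question, sub_answers)
```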
Is the HCH version of the Factored Cognition Hypothesis true? The author leans against it (more specifically, against HCH being superintelligent), because it seems hard for HCH to find good decompositions. In particular, humans seem to improve their decompositions over time as they learn more, and also to improve the concepts they think with, both of which are challenging for HCH to replicate.
Planned opinion:
I enjoyed this sequence: I’m glad to see more analysis of what is and isn’t necessary for iterated amplification and debate to work, as well as more theoretical models of debate. I broadly agreed with the conceptual points made, with one exception: I’m not convinced that we should rule out brute force for HCH, and for similar reasons I don’t find the arguments that HCH won’t be superintelligent convincing. In particular, the hope with iterated amplification is to approximate a truly massive tree of humans, perhaps a tree containing around 2^100 (about 1e30) base agents / humans. At that scale (or even at just a measly billion (1e9) humans), I don’t expect the reasoning to look anything like what an individual human does, and approaches that are more like “brute force” seem a lot more feasible.
One might wonder why I think it is possible to approximate a tree with more base agents than there are grains of sand in the Sahara desert. Well, a perfect binary tree of depth 99 would have 2^100 − 1 ≈ 1e30 nodes; thus we can roughly say that we’re approximating 99-depth-limited HCH. If we had perfect distillation, this would take 99 rounds of iterated amplification and distillation, which seems quite reasonable. Of course, we don’t have perfect distillation, but I expect that to be a relatively small constant factor on top (say 100x), which still seems pretty reasonable. (There’s more detail about how we get this implicit exponential-time computation in <@this post@>(@Factored Cognition@).)
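As a quick sanity check of this arithmetic (the 100x distillation overhead is the rough guess stated above, not a derived figure):

```python
# A perfect binary tree of depth 99 has 100 levels, so 2^100 - 1 nodes.
nodes = 2**100 - 1
print(f"{nodes:.2e}")                       # ~1.27e+30 base agents

# With perfect distillation: one amplification round per level of depth.
rounds_perfect = 99
# With an assumed ~100x overhead from imperfect distillation:
print(rounds_perfect, rounds_perfect * 100)  # 99 vs ~9,900 rounds
```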
This is an accurate summary, minus one detail:
> The judge decides the winner by evaluating whether the final statement is true or not.
"True or not" makes it sound symmetrical, but the choice is between 'very confident that it's true' and 'anything else'. Something like '80% confident' goes into the second category.
One thing I would like to be added is that I come out moderately optimistic about Debate. It's not too difficult for me to imagine the counterfactual world where I think about Factored Cognition and find reasons to be pessimistic about Debate, so I take the fact that I didn't as non-zero evidence.
Changed to "The judge decides the winner based on whether they can confidently verify the final statement or not."
> One thing I would like to be added is that I come out moderately optimistic about Debate. It's not too difficult for me to imagine the counterfactual world where I think about Factored Cognition and find reasons to be pessimistic about Debate, so I take the fact that I didn't as non-zero evidence.
Added a line to the end of the summary:
> On the other hand, the author is cautiously optimistic about debate.
Re the personal opinion: what is your take on the feasibility of human experiments? It seems like your model is compatible with IDA working out even though no one can ever demonstrate something like 'solve the hardest exercise in a textbook' using participants with limited time who haven't read the book.
Yeah, that seems right to me: I don't really expect to see us solve hard exercises in a textbook with a small number of humans without any additional tricks. I don't think Ought did either; from pretty early on they were talking about strategies for having larger trees, e.g. via automated decomposition strategies, or caching / memoization of strategies, possibly using ML.
In addition, I think Ought historically has pursued the strategy "try the thing that, if successful, would allow us to build a safety story", rather than "try the thing that, if it fails, implies that factored cognition would not work out", which is why they talk about particularly challenging tasks like solving the hardest exercise in a textbook.
Factored Cognition is primarily studied by Ought, the same organization that was partially credited for implementing the interactive prediction feature. Ought is an organization with at least five members who have worked on the problem for several years; I am a single person who just finished a master's degree. The rationale for writing about the topic anyway was diversity of approaches: Ought is primarily doing empirical work, whereas I've studied the problem through the lens of math and epistemic rationality. As far as I know, there is virtually no overlap between what I've written and what Ought has published so far.
Was it successful? Well, all I can say for sure is that writing the sequence has significantly changed my own views.
This sequence has two 'prologue' posts, which make points that are relevant to, but not restricted to, Factored Cognition. I think of them as posts #-2 and #-1 (this post is then #0, and the proper sequence starts at #1). These are:
The remaining sequence is currently about 15,000 words long, though this could change. The structure is roughly:
The current version of the sequence includes exercises. This is pretty experimental, so if they are too hard or too easy, that's probably my fault. I've left them in anyway because I generally think it makes sense to include 'think about this for a bit' moments. They look like this:
EXERCISE (5 SECONDS): Compute 2+5.
Whenever there's a range, it means that the lower number is an upper bound for the exercise itself, and the remaining time is for rereading parts of this or previous posts. So 1-6 minutes means 'you shouldn't take more than 1 minute for the exercise itself, but you may first take about 5 minutes to reread parts of the post, or perhaps of previous posts'.
The sequence also contains conjectures. Conjectures are claims that I think are true, important, and not trivial. There are only a few of them, and they should all be justified by the sequence up to that point. Conjectures look like this:
I'll aim to publish one post per week, which gives me time for final edits, though this could slow down since I'm still working on the second half. Questions and criticism are welcome.
Special thanks to TurnTrout for providing valuable feedback on much of the sequence.