This post is the second in what is likely to become a series of uncharitable rants about alignment proposals (previously: Godzilla Strategies). In general, these posts are intended to convey my underlying intuitions. They are not intended to convey my all-things-considered, reflectively-endorsed opinions. In particular, my all-things-considered reflectively-endorsed opinions are usually more kind. But I think it is valuable to make the underlying, not-particularly-kind intuitions publicly-visible, so people can debate underlying generators directly. I apologize in advance to all the people I insult in the process.
With that in mind, let's talk about problem factorization (a.k.a. task decomposition).
HCH
It all started with HCH, a.k.a. The Infinite Bureaucracy.
The idea of The Infinite Bureaucracy is that a human (or, in practice, human-mimicking AI) is given a problem. They only have a small amount of time to think about it and research it, but they can delegate subproblems to their underlings. The underlings likewise each have only a small amount of time, but can further delegate to their underlings, and so on down the infinite tree. So long as the humans near the top of the tree can “factorize the problem” into small, manageable pieces, the underlings should be able to get it done. (In practice, this would be implemented by training a question-answerer AI which can pass subquestions to copies of itself.)
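For concreteness, here's a toy sketch (in Python, with names of my own invention, not anything from an actual HCH implementation) of the recursive delegation pattern HCH relies on, applied to a problem that is unrealistically easy to factor; the point is the shape of the recursion, not a claim that real questions decompose this cleanly:

```python
# Toy sketch of the HCH recursion pattern. Each "worker" does a small,
# bounded amount of work (the analogue of one person's few minutes of
# thought) and delegates the rest to fresh copies of itself.

def hch_sum(numbers: list[int]) -> int:
    # Base case: small enough to handle within one worker's time budget.
    if len(numbers) <= 2:
        return sum(numbers)
    # Otherwise, factor the problem and delegate each half to a copy.
    mid = len(numbers) // 2
    left = hch_sum(numbers[:mid])
    right = hch_sum(numbers[mid:])
    # Combining the sub-answers is itself a small, bounded task.
    return left + right

print(hch_sum(list(range(100))))  # 4950
```

The entire bet of HCH is that interesting cognitive work can be carved up this cleanly; summing a list is about the most charitable possible case.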
At this point the ghost of George Orwell chimes in, not to say anything in particular, but just to scream. The ghost has a point: how on earth does an infinite bureaucracy seem like anything besides a terrible idea?
“Well,” says a proponent of the Infinite Bureaucracy, “unlike in a real bureaucracy, all the humans in the infinite bureaucracy are actually just trying to help you, rather than e.g. engaging in departmental politics.” So, ok, apparently this person has not met a lot of real-life bureaucrats. The large majority are decent people who are honestly trying to help. It is true that departmental politics are a big issue in bureaucracies, but those selection pressures apply regardless of the people's intentions. And also, man, it sure does seem like Coordination is a Scarce Resource and Interfaces are a Scarce Resource, and scarcity of those sorts of things sure would make bureaucracies incompetent in basically the ways bureaucracies are incompetent in practice.
Debate and Other Successors
So, ok, maybe The Infinite Bureaucracy is not the right human institution to mimic. What institution can use humans to produce accurate and sensible answers to questions, robustly and reliably? Oh, I know! How about the Extremely Long Jury Trial? Y’know, because juries are, in practice, known for their extremely high reliability in producing accurate and sensible judgements!
“Well,” says the imaginary proponent, “unlike in a real Jury Trial, in the Extremely Long Jury Trial, the lawyers are both superintelligent and the arguments are so long that no human could ever possibly check them all the way through; the lawyers instead read each other’s arguments and then try to point the Jury at the particular places where the holes are in the opponent’s argument without going through the whole thing end-to-end.”
I rest my case.
Anyway, HCH and debate have since been followed by various other successors, which improve on their predecessors mostly by adding more boxes and arrows and loops and sometimes even multiple colors of arrows to the diagram describing the setup. Presumably the strategy is to make it complicated enough that it no longer obviously corresponds to some strategy which already fails in practice, and then we can bury our heads in the sand and pretend that We Just Don’t Know whether it will work and therefore maybe it will work.
(Reminder: in general I don’t reflectively endorse everything in this post; it’s accurately conveying my underlying intuitions, not my all-things-considered judgement. That last bit in particular was probably overly harsh.)
The Ought Experiment
I have a hypothesis about problem factorization research. My guess is that, to kids fresh out of the ivory tower with minimal work experience at actual companies, it seems totally plausible that humans can factorize problems well. After all, we manufacture all sorts of things on production lines, right? Ask someone who’s worked in a non-academia cognitive job for a while (like e.g. a tech company), at a company with more than a dozen people, and they’ll be like “lolwut obviously humans don’t factorize problems well, have you ever seen an actual company?”. I’d love to test this theory; please give feedback in the comments about your own work experience and thoughts on problem factorization.
Anyway, for someone either totally ignorant of the giant onslaught of evidence provided by day-to-day economic reality, or trying to ignore the giant onslaught of evidence in order to avoid their hopes being crushed, it apparently seems like We Just Don’t Know whether humans can factorize cognitive problems well. Sort of like We Just Don’t Know whether a covid test works until after the FDA finishes its trials, even after the test has been approved in the EU... ok, that’s a little too harsh even for this post.
So Ought went out and tested it experimentally. (Which, sarcasm aside, was a great thing to do.)
The experiment setup: a group of people are given a Project Euler problem. The first person receives the problem, has five minutes to work on it, and records their thoughts in a google doc. The doc is then passed to the next person, who works on it for five minutes recording their thoughts in the doc, and so on down the line. (Note: I’m not sure it was 5 minutes exactly, but something like that.) As long as the humans are able to factor the problem into 5-minute-size chunks without too much overhead, they should be able to efficiently solve it this way.
So what actually happened?
The story I got from a participant is: it sucked. The google doc was mostly useless, you’d spend five minutes just trying to catch up and summarize, people constantly repeated work, and progress was mostly not cumulative. Then, eventually, one person would just ignore the google doc and manage to solve the whole problem in five minutes. (This was, supposedly, usually the same person.) So, in short, the humans utterly failed to factor the problems well, exactly as one would (very strongly) expect from seeing real-world companies in action.
This story basically matches the official write-up of the results.
So Ought said “Oops” and moved on to greener pastures... lol, no: last I heard, Ought is still trying to figure out if better interface design and some ML integration can make problem factorization work. Which, to their credit, would be insanely valuable if they could do it.
That said, I originally heard about HCH and the then-upcoming Ought experiment from Paul Christiano in the summer of 2019. It was immediately very obvious to me that HCH was hopeless (for basically the reasons discussed here); at the time I asked Paul “So when the Ought experiments inevitably fail completely, what’s the fallback plan?”. And he basically said “back to more foundational research”. And to Paul’s credit, three years and an Ought experiment later, he’s now basically moved on to more foundational research.
Sandwiching
About a year ago, Cotra proposed a different class of problem factorization experiments: “sandwiching”. We start with some ML model which has lots of knowledge from many different fields, like GPT-n. We also have a human who has a domain-specific problem to solve (like e.g. a coding problem, or a translation to another language) but lacks the relevant domain knowledge (e.g. coding skills, or language fluency). The problem, roughly speaking, is to get the ML model and the human to work as a team, and produce an outcome at least as good as a human expert in the domain would produce. In other words, we want to factorize the “expert knowledge” and the “having a use-case” parts of the problem.
(The actual sandwiching experiment proposal adds some pieces which I claim aren’t particularly relevant to the point here.)
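For concreteness, here's roughly the shape of the loop as I understand it, as a hedged sketch; `model`, `non_expert_feedback`, the "accept" convention, and the round count are my own illustrative stand-ins, not part of Cotra's actual proposal:

```python
from typing import Callable

def sandwich(task: str,
             model: Callable[[str], str],
             non_expert_feedback: Callable[[str, str], str],
             rounds: int = 3) -> str:
    """Sketch of a sandwiching loop: a non-expert human steers a
    knowledgeable model toward the outcome they actually want."""
    draft = model(f"Solve this task: {task}")
    for _ in range(rounds):
        # The non-expert can't check domain-level correctness; they can only
        # say whether the draft seems to match what they actually want.
        # (Returning the literal string "accept" signals they're satisfied.)
        critique = non_expert_feedback(task, draft)
        if critique == "accept":
            break
        draft = model(f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRevise.")
    return draft
```

The open question is whether anything like this can match a genuine domain expert when the non-expert can't verify the work, which is exactly the lawyer-and-business-owner situation discussed below.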
I love this as an experiment idea. It really nicely captures the core kind of factorization needed for factorization-based alignment to work. But Cotra makes one claim I don’t buy: that We Just Don’t Know how such experiments will turn out, or how hard sandwiching will be for cognitive problems in general. I claim that the results are very predictable, because things very much like this already happen all the time in practice.
For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn't know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.
In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the "sandwich problem" would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don't think we have a good solution in practice; I'd expect the expert business-owner to usually come up with a much better contract.
This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding) paired with an engineer (who knows lots about coding but doesn't understand what the designer wants), versus a product designer who's also a fluent coder and familiar with the code base. I've experienced this one first-hand; the expert product designer is way better. Or, consider a well-intentioned mortgage salesman, who wants to get their customer the best mortgage for them, and a customer who understands the specifics of their own life but knows nothing about mortgages. Will they end up with as good a mortgage as a customer who has expertise in mortgages themselves? Probably not. (I've seen this one first-hand too.)
There are tons of real-life sandwiching problems, and tons of economic incentive to solve them, yet we do not have good general-purpose solutions.
The Next Generation
Back in 2019, I heard Paul’s HCH proposal, heard about the Ought experiment, and concluded that this bad idea was already on track to self-correct via experimental feedback. Those are the best kind of bad ideas. I wrote up some of the relevant underlying principles (Coordination as a Scarce Resource and Interfaces as a Scarce Resource), but mostly waited for the problem to solve itself. And I think that mostly worked… for Paul.
But meanwhile, over the past year or so, the field has seen a massive influx of bright-eyed new alignment researchers fresh out of college/grad school, with minimal work experience in industry. And of course most of them don’t read through most of the enormous, undistilled, and very poorly indexed corpus of failed attempts from the past ten years. (And it probably doesn’t help that a plurality come through the AGI Safety Fundamentals course, which last time I checked had a whole section on problem factorization but, to my knowledge, didn’t even mention the Ought experiment or the massive pile of close real-world economic analogues. It does include two papers which got ok results by picking easy-to-decompose tasks and hard-coding the decompositions.) So we have a perfect recipe for people who will see problem factorization and think “oh, hey, that could maybe work!”.
If we’re lucky, hopefully some of the onslaught of bright-eyed new researchers will attempt their own experiments (like e.g. sandwiching) and manage to self-correct, but at this point new researchers are pouring in faster than any experiments are likely to proceed, so probably the number of people pursuing this particular dead end will go up over time.
One successful example of factorization working is the immune system. The immune system does its job of defending the body without needing intelligence; in fact, every individual member of the immune system is blind, naked, and unintelligent. Your body has no central knowledge of how many bacteria, viruses, or cancer cells it contains, what their doubling time is, or how many cells are infected. So the problem has to be factored for the immune system to do anything at all, and indeed it is.
So the factorization basically comes down to different cell types for different jobs, plus a few systems not tied to cells at all.
There are tens of different classes of cells, each of which can be divided into a few subclasses, plus five major classes of antibodies, and they all have different properties.
Now what does this tell us about factorizing a problem? Are there any lessons for factorizing problems more generally?
The biggest reason factorization works for the immune system is that the body can make billions of cells per day. One of bureaucracy's biggest problems is that we can't simply copy the skillsets (or brains) of the people we'd want running it, so we have to hire people instead; and even without Goodhart's law, that introduces interface problems between people. Which brings us to the next problem the immune system solves: coordination. In companies, the most taut constraint is the rarity of talented people.
Each cell of the immune system runs the same source code, so there are effectively no coordination problems at all: every cell has the same "attitude", dedication, and abilities. That also partially solves the interface problem, since everything shares the same understanding and ontology. Unfortunately, even if this solves the intra-organization problem, collaborating with outsiders remains unsolved. And it's something we simply can't do with humans: the best-case scenario is hiring relatively competent people, each running different source code, with different abilities, beliefs, and ontologies, and then dealing with the interfacing problems those differences create. Even relatively aligned people can't fully trust each other, so you need constant communication, which scales fairly poorly with the size of the team or bureaucracy.
So should we be optimistic about HCH/Debate/Factored Cognition? Yes! One of the most massive advantages an AGI will have over regular people early on is that, being digital, it can be copied very easily; the copies can cooperate fully and trust each other fully, since they reason in exactly the same ways, and their single-mindedness towards goals alleviates the most severe trust problems. The copies also share the same ontology. I don't think you realize just how much an AGI can solve exactly those problems, coordination and interface issues, that cripple human bureaucracies.
EDIT: I suspect a large part of the reason your intuition recoils against the HCH/Debate/Factored Cognition solutions is scope neglect. Our intuitions don't work well with extremely big or small numbers, and a commenter once claimed that 100 distillation steps could produce 2^100 agents. To put it lightly, this is a bureaucracy of roughly 10^30 humans, with a correspondingly near-infinite budget, single-mindedly trying to answer questions. To put it another way, that's more humans than have ever lived, by roughly nineteen orders of magnitude. And with perfect coordination, trust, and single-mindedness towards goals, and with smooth interfaces due to shared ontologies (all because it's digital), such a bureaucracy could plausibly solve every problem in the entire universe. That matters because I suspect a large part of the problem for alignment is that, at the end of the day, capabilities groups have much more money and many more researchers available to them than safety groups, and, unsurprisingly, capabilities researchers are winning the race.
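As a quick sanity check on those numbers (the ~117 billion figure for humans ever born is a commonly cited demographic estimate; the exact value doesn't change the conclusion):

```python
agents = 2 ** 100                    # agents after 100 distillation steps
humans_ever_born = 117_000_000_000   # ~1.17e11, a commonly cited estimate

print(f"{agents:.2e}")                     # ~1.27e+30 agents
print(f"{agents / humans_ever_born:.1e}")  # ~1.1e+19, i.e. ~19 orders of magnitude more
```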
Our intuitions fail us here, so they aren't a reliable guide to how a very large HCH tree would actually work.
It's not useless, but it is definitely risky, and the requirements for safety would mean distillation has to be very cheap. And here we come to the question: how doomed by default are we if AGI is created? If the chance of doom is low, I agree it's not a risk worth taking. If it's high, then you'll probably have to take it. The more doomed by default you think creating AGI makes us, the more risk you should take, especially with short timelines. So MIRI would probably want to do this, given their atypically high levels of doominess, but most other organizations probably won't, since they expect fairly low risk from AGI by default.