Epistemic Status: I don't particularly endorse this argument. Just curious to see whether people have any interesting or novel counterarguments.
- Suppose "Solve the Alignment Problem" is in the set of problems that an intelligent, motivated, conscientious group of human beings can potentially solve.
- Then it is also in the set of problems that a Bureaucracy of AIs can solve, since such a bureaucracy can at minimum emulate that group of humans.
- Now suppose it isn't.
- Then: "Let us eat and drink, for tomorrow we die."
The intended construction is to build a safer bureaucracy out of less safe parts/agents (or merely less robustly safe ones). The parts shouldn't break in most runs of the bureaucracy, and the bureaucracy as a whole should break even less often. If distilling such a bureaucracy yields a part/agent safer than the original part, that is an iterative improvement. This doesn't need to change the game in one step; it only needs to improve the situation with each step, in a direction that is hard to formulate without resorting to the device of a bureaucracy. Otherwise the same thing could be done with a more lightweight prompt/tuning setup, where the bureaucracy is just the prompt given to a single part/agent.
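To make that loop concrete, here is a minimal Python sketch of the amplify-then-distill iteration described above. Everything in it is a hypothetical placeholder: `Agent`, `run_bureaucracy`, `distill`, and the toy numbers stand in for real training machinery, and the sketch illustrates only the control flow and the "keep a step only if it improves safety" condition, not any actual method.

```python
# Schematic sketch of the amplify-then-distill loop described above.
# `Agent`, `run_bureaucracy`, and `distill` are hypothetical stand-ins
# for real machinery; the numbers are illustrative toy assumptions.

from dataclasses import dataclass


@dataclass
class Agent:
    safety: float      # stand-in for "how robustly safe this part is"
    capability: float  # stand-in for how much useful work it can do


def run_bureaucracy(part: Agent, n_copies: int = 10) -> Agent:
    """Amplify: compose many copies of `part` into a bureaucracy.

    Assumption from the text: the composite breaks less often than
    any individual part, so its effective safety is higher.
    """
    return Agent(
        safety=min(1.0, part.safety * 1.1),  # toy model of the safety gain
        capability=part.capability * n_copies,
    )


def distill(bureaucracy: Agent) -> Agent:
    """Distill: compress the bureaucracy back into a single part.

    Assumption: some capability is lost in compression, but the
    distilled part retains most of the bureaucracy's safety.
    """
    return Agent(
        safety=bureaucracy.safety * 0.99,
        capability=bureaucracy.capability * 0.2,
    )


def iterate(part: Agent, steps: int) -> Agent:
    """Each round: amplify into a bureaucracy, distill back into a part,
    and keep the result only if it is safer than what we started with
    (the 'iterative improvement' condition in the text)."""
    for _ in range(steps):
        candidate = distill(run_bureaucracy(part))
        if candidate.safety <= part.safety:
            break  # no improvement this round; stop iterating
        part = candidate
    return part


if __name__ == "__main__":
    final = iterate(Agent(safety=0.5, capability=1.0), steps=20)
    print(f"safety={final.safety:.3f} capability={final.capability:.1f}")
```

The `break` condition is the load-bearing part: the scheme only claims a monotone improvement per step, never a one-shot solution, which is exactly the weaker claim the paragraph above is making.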