Epistemic Status:  I don't particularly endorse this argument.  Just curious to see whether people have any interesting or novel counterarguments.

  • Suppose "Solve the Alignment Problem" is in the set of problems that an intelligent, motivated, conscientious group of human beings can potentially solve.
  • Now suppose it isn't.
    • "Let us eat and drink, for tomorrow we die"

If the problem is solvable by an anarchic group of humans that can improvise its organization ad hoc, it is not at all clear that a bureaucracy of AIs is fit for the same task.

The implication doesn't seem tight at all.

What if some groups of (intelligent, motivated, conscientious, etc.) humans could solve the alignment problem while other groups could not? As I understand it, your idea is to make the AIs mimic one of these groups in order to scale up its progress. But the group you mimic might be one of the ones that can't solve it, even while other groups can.

Also, I don't find it obvious that an AI trained to mimic a human's actions will produce the same consequences the human would, because the AI might not deploy those actions in a strategically appropriate way.

  • Sure, suppose that the alignment problem is in the set of problems that a Bureaucracy Of AIs can solve. This sounds helpful because you've ~defined said bureaucracy to be safe, but I doubt it's possible to build a safe bureaucracy out of unsafe parts - and if it is, we don't know how to do so!
  • I dislike the fatalism here, and would rather celebrate direct attacks on the problem even when they don't work. For example, I'd love to see a more detailed writeup on BoAI proposals across a range of scenarios and safety assumptions :-)

I doubt it's possible to build a safe bureaucracy out of unsafe parts

The intended construction is to build a safer bureaucracy out of less safe parts/agents (or just less robustly safe ones). The parts shouldn't break in most runs of the bureaucracy, and the bureaucracy as a whole should break even less frequently. If distilling such a bureaucracy then yields a part/agent safer than the original, that is an iterative improvement. This doesn't need to change the game in one step; it only needs to improve the situation with each step, in a direction that is hard to formulate without resorting to the device of a bureaucracy. If the direction were easy to formulate directly, the same thing could be done with a more lightweight prompt/tuning setup, where the bureaucracy is just the prompt given to a single part/agent.
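Here is a toy numerical sketch of that iteration (my own illustration, not part of the proposal itself). Everything in it is an assumption made up for the example: an "agent" is just a per-task failure probability, a "bureaucracy" of n agents fails only when a majority of its members fail, and "distillation" yields a single agent matching the bureaucracy's failure rate plus some imitation error. The names `bureaucracy_failure_rate`, `distill`, and `iterate` are hypothetical, not any real API.

```python
# Toy model of the amplify-then-distill loop described above.
# All modeling choices here (majority-vote failure, additive
# imitation error) are illustrative assumptions, nothing more.

from math import comb

def bureaucracy_failure_rate(p: float, n: int) -> float:
    """Probability that a majority of n independent parts, each
    failing with probability p, fail on the same task."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def distill(p_bureaucracy: float, imitation_error: float) -> float:
    """Distilling gives one agent slightly worse than the
    bureaucracy it imitates."""
    return min(1.0, p_bureaucracy + imitation_error)

def iterate(p: float, n: int = 5, imitation_error: float = 0.01,
            steps: int = 10) -> float:
    """Repeatedly amplify into a bureaucracy and distill back,
    keeping each distilled agent only if it is an improvement."""
    for _ in range(steps):
        candidate = distill(bureaucracy_failure_rate(p, n),
                            imitation_error)
        if candidate >= p:   # no longer an improvement; stop
            break
        p = candidate        # accept the safer distilled agent
    return p

print(iterate(0.2))  # failure rate falls from 0.2 toward ~0.01
```

In this toy model the failure rate drops each round until it hits the floor set by the imitation error, which matches the claim above: no single step changes the game, but each accepted step improves the situation.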