Epistemic status: confused.
I currently see two conceptualizations of an aligned post-AGI/foom world:
1. Surrender to the Will of the Superintelligence
Any ‘control’ in a world with a Superintelligence will have to be illusory. The creation of an AGI will be the last truly agentic thing humans do. A Superintelligence would be so far superior to any human or group of humans, and able to manipulate humans so well, that any “choice” humanity faces will be predetermined. If an AI understands you better than yourself, can predict you better than yourself, and understands the world and human psychology well enough that it can bring about whatever outcome it wants, then any sense of ‘control’ – any description of the universe putting humans in the driver’s seat – will be false.
This doesn’t mean that alignment is necessarily impossible. The humans creating the Superintelligence could still instil in it the goal of leaving the subjective human experience of free will intact. An aligned Superintelligence would still put humans into situations where the brain’s algorithm of deliberate decision making is needed, even if the choice itself as well as the outcome are ‘fake’ in some sense. The human experience of control should rank high in an aligned Superintelligence’s utility function. But only a faint illusory glimmer of human choice would remain, while the open-ended, agentic power over the Universe would have left humanity with the creation of the first Superintelligence. That’s why it’s so crucial to get the value-loading problem right on the first try.
2. Riding the Techno-Leviathan
The alternative view seems to be something like: it will be possible to retain human agency over a post-AGI world thanks to boxing, interpretability, oracle AI, or some other selective impairment scheme that would leave human agency unspoiled by AGI meddling. The goal appears to be to either limit an AI’s optimization power enough, or insulate the algorithms of human decision making (brain, institutions of group decision making...) well enough, such that humanity remains sovereign, or “in the loop,” in a meaningful sense, while the AI’s goals also point towards the outcomes of human decision making.
This view implies that the integrity of human decision making can be maintained even in the face of an AGI’s optimization power.
I currently associate 1. more with MIRI/Superintelligence-style thinking, and 2. more with most other prosaic alignment schemes (Christiano, Olah…). 1. requires you to bite some very weird, unsettling and hard-to-swallow bullets about the post-AGI world, while 2. seems to point towards a somewhat more normal future, though it might suffer from naivete and normalcy bias.
Am I understanding this correctly? Are there people holding a middle ground view?
I think either is technically possible with perfect knowledge - that is, I don’t think either option is so incoherent that you cannot make any logical sense of it.
This leaves the question of which is easier. 1. requires somehow getting a full, precise description of the human utility function. I don’t fully understand the arguments against 2., though MIRI seems to be pretty confident there are large issues.