When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."
I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.
To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.
AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually build a very good model of our values, as part of a total world model that is better than ours in every way.
So (outer) alignment is not about telling the AI our values; it will already know them. Alignment is giving the AI a utility function that strongly points to them.
That means that if we can specify a process, however intractable or even uncomputable, that we know would eventually output our CEV, the AI will know that too, and it will just infer what the process would output in a much smarter way and maximize that.
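To make that a bit more concrete, here is a minimal sketch in my own notation (not QACI's actual formalism). Suppose $Q$ is the idealized, arbitrarily expensive or even uncomputable deliberation process, and $U^*$ is the utility function it would eventually output. A formally aligned agent never needs to run $Q$; it just picks actions by their expected value under its beliefs about $U^*$, given its world model $M$:

$$a^* \;=\; \arg\max_{a}\; \mathbb{E}_{U \sim P_M(U^* \mid \text{evidence})}\big[\,U(\mathrm{outcome}(a))\,\big]$$

Evaluating $Q$ exactly is intractable by assumption, but forming a sharp posterior over what $Q$ would output is just inference, and a superintelligence will be very good at inference. The hard part for us is making sure the pointer to $Q$ actually picks out the process we mean.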
Say we have a formally aligned AI and give it something like QACI as its formal goal. If QACI works, the AI will quickly think "Oh. This utility function mostly just reduces to human values. Time to build utopia!" If it doesn't work, the AI will quickly think "LOL. These idiot humans tried to point to their values but failed! Time to maximize this other thing instead!"
A good illustration of the success scenario is Tammy's narrative of QACI.[1]
There are lots of problems with QACI (and with formal alignment in general), and I will probably write posts about them at some point, but "It's not computable" is not one of them.
I'm 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call "human values") if they were rational enough and had good enough decision theory.
If I'm wrong, (1) is a huge problem, and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily, the de facto nominees for that position are alignment researchers, who pretty strongly self-select for cosmopolitan, altruistic values.
I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist's curse. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely than an empowered human to destroy everything we value. (Also 60% confident; I would not want to stake the fate of the universe on this claim.)
I agree that moral uncertainty is a very hard problem, but I don't think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since that has to be built into the utility function itself; dealing with moral uncertainty is just part of expected utility maximization.
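To spell out that last claim (again in my own notation, and assuming the candidate utility functions can be put on a common scale, which is its own can of worms): maximizing under moral uncertainty is just one more layer of expectation,

$$\mathbb{E}[U(a)] \;=\; \sum_i P(v_i \mid \text{evidence})\;\mathbb{E}\big[U_{v_i}(a)\big],$$

where the $v_i$ are the candidate value systems the pointer might resolve to and $U_{v_i}$ their utility functions. The AI updates $P(v_i \mid \text{evidence})$ as part of its ordinary epistemics; nothing about this step requires humans to resolve the uncertainty first.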
To solve (2), I think we should try to adapt something like the Hippocratic principle to work for QACI, without requiring direct reference to a human's values and beliefs (the sidestepping of which is QACI's big advantage over PreDCA). I wonder if Tammy has thought about this.