The relatively easy problems:
The hard problem:
The set of all possible sequences of actions is astronomically large. Even if you have an AI that is really good at assigning the correct utilities[1] to any sequence of actions we test it with, its "near infinite sized"[2] learned model of our preferences is bound to come apart at the tails, or in some weird region we forgot to check up on.
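A toy picture of what "coming apart at the tails" looks like, with a simple curve fit standing in for the learned preference model (sin as the made-up "true" utility, a cubic as the learned one; every detail here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for our "true" preferences over some 1-D slice of action-space.
def true_utility(x):
    return np.sin(x)

# The first AI only ever gets tested on actions from a narrow region.
x_train = rng.uniform(-1.0, 1.0, size=200)
y_train = true_utility(x_train) + rng.normal(0.0, 0.01, size=200)

# Its learned preference model, played here by a simple cubic fit.
learned_utility = np.poly1d(np.polyfit(x_train, y_train, deg=3))

# In the region we checked, the model looks essentially perfect...
x_in = np.linspace(-1, 1, 201)
print("max error in tested region: ",
      np.max(np.abs(learned_utility(x_in) - true_utility(x_in))))

# ...but a maximizer searches everywhere, and in the tails the fit comes apart.
x_out = np.linspace(-4, 4, 801)
print("max error over wider region:",
      np.max(np.abs(learned_utility(x_out) - true_utility(x_out))))

peak = x_out[np.argmax(learned_utility(x_out))]
print("where the learned model thinks utility peaks:", peak)
print("learned utility there:", learned_utility(peak))
print("true utility there:   ", true_utility(peak))
```

The maximizer ends up at a point the model scores absurdly highly, purely because that point is far from anything the model was ever checked on.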
plug that utility function (the one the first AI wrote) into it
Could some team build a good AGI or ASI that someone could plug a utility function into? It would be very different from all the models being developed by the leading labs, and I'm not confident that humanity could do it in the time we have left.
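To make that interface concrete: the second AI would have to expose something like a slot that accepts an arbitrary utility function over action sequences and then optimizes it. A minimal sketch of that shape, assuming brute-force search and made-up actions and utilities (nothing like how a real AGI would plan):

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Action = str
Plan = Tuple[Action, ...]

def best_plan(utility: Callable[[Plan], float],
              actions: Sequence[Action],
              horizon: int) -> Plan:
    """Return the action sequence the plugged-in utility function likes most."""
    candidates = product(actions, repeat=horizon)
    return max(candidates, key=utility)

# Hypothetical stand-in for whatever utility function the first AI wrote.
def toy_utility(plan: Plan) -> float:
    return plan.count("help") - 10 * plan.count("harm")

print(best_plan(toy_utility, ["help", "harm", "wait"], horizon=3))
```

The hard part is everything this sketch waves away: a real system's "utility slot" is tangled up with how it represents the world, which is exactly why current models don't look like this.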
each one is annotated with how much utility we estimate it to have
How are these estimates obtained?
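Presumably from something like a learned model scored against human judgments. The closest thing in current practice is a reward model trained on pairwise human comparisons (Bradley-Terry style); a toy sketch of that, with every feature, label, and number invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each candidate action sequence is summarized by a
# feature vector, and humans have said which of two sequences they prefer.
n_pairs, n_features = 500, 8
feats_a = rng.normal(size=(n_pairs, n_features))
feats_b = rng.normal(size=(n_pairs, n_features))

# Pretend "true" human preferences come from a hidden weight vector.
true_w = rng.normal(size=n_features)
prefer_a = (feats_a @ true_w > feats_b @ true_w).astype(float)

# Bradley-Terry-style reward model: utility(x) = w . x, trained so that
# sigmoid(utility(a) - utility(b)) matches the human comparison labels.
w = np.zeros(n_features)
lr = 0.1
for _ in range(2000):
    margin = (feats_a - feats_b) @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    w += lr * (feats_a - feats_b).T @ (prefer_a - p) / n_pairs

def estimated_utility(features: np.ndarray) -> float:
    """The 'annotation' a candidate action sequence would get."""
    return float(features @ w)

print("agreement with the human labels:",
      np.mean((((feats_a - feats_b) @ w) > 0) == (prefer_a == 1)))
print("estimated utility of one candidate:", estimated_utility(feats_a[0]))
```

Whatever the real mechanism is, the estimates are only ever as good as the comparisons humans were actually shown, which feeds straight back into the tails problem above.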
the idea:
awesome singularity stuff happens yay we did it
if we're still scared of it doing something weird, we can additionally tell the second AI to minimize actions that don't affect (the first AI's model of) human values at all, to stop it from doing something really bad that current humanity can't comprehend and that the first AI would therefore never have been able to get humanity's opinion on
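One way to cash that out is a penalty term in the second AI's objective for steps the first AI's value model doesn't register at all. A toy sketch of that combined objective, with the relevance function, the actions, and all the numbers invented:

```python
from itertools import product
from typing import Callable, Tuple

Action = str
Plan = Tuple[Action, ...]

# Hypothetical combined objective: maximize the first AI's estimated utility,
# but subtract a penalty for steps the value model barely registers, on the
# theory that "invisible to the model" is where the incomprehensible stuff hides.
def penalized_score(plan: Plan,
                    utility: Callable[[Plan], float],
                    relevance: Callable[[Action], float],
                    lam: float = 5.0) -> float:
    neutral_steps = sum(1 for a in plan if relevance(a) < 0.1)
    return utility(plan) - lam * neutral_steps

# Made-up stand-ins for the first AI's outputs.
def toy_utility(plan: Plan) -> float:
    return plan.count("help")

def toy_relevance(action: Action) -> float:
    # How strongly the value model "sees" this action at all.
    return {"help": 0.9, "wait": 0.5, "weird": 0.0}.get(action, 0.0)

candidates = product(["help", "wait", "weird"], repeat=3)
best = max(candidates, key=lambda p: penalized_score(p, toy_utility, toy_relevance))
print(best)  # plans containing "weird" steps get penalized away
```

This obviously leans on the first AI's model being able to tell "value-neutral" apart from "value-relevant", which is itself the thing we were worried it gets wrong in strange regions.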