I don't understand your comment. [Edit: I probably do now.] You output something that the outer AGI uses to optimize the world as you intend; you don't "let the AGI in". You are living in its goal definition, and your decisions determine the AGI's values.
Are you perhaps referring to the idea that the AGI's actions control its goal state? But you are not its goal state; you are a principle that determines its goal state, just as the AGI is. You show the AGI where to find its goal state, and the AGI starts working on optimizing it.
What's the difference between the simulated humans outputting a utility function U' which the outer AGI will then try to maximize, and the simulated humans just running U' and the outer AGI trying to maximize the value returned by the whole simulation (and hence U')? In the case of the latter, you're "letting the AGI in" by including its definition (explicitly or implicitly via something like the universal prior) in the definition of U'.
A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.
(Posted at the request of cousin_it.)