A recent post on my blog may be of interest to LW. It is a high-level discussion of what a precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
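To make the shape of the definition concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in rather than part of the proposal itself: `Environment` plays the role of the hypothetical environment (T), `toy_emulation` plays the role of the emulated human (H), and the `max_steps` budget is only there so the example terminates. It shows nothing more than the control flow "let H deliberate inside T, then adopt whatever it outputs".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A "utility function" here is just a map from a described outcome to a score.
Utility = Callable[[str], float]


@dataclass
class Environment:
    """Hypothetical stand-in for T: resources the emulation can draw on."""
    steps_used: int = 0
    notes: List[str] = field(default_factory=list)

    def compute(self, thought: str) -> None:
        # Record one unit of deliberation.
        self.steps_used += 1
        self.notes.append(thought)


def toy_emulation(env: Environment) -> Optional[Utility]:
    """Hypothetical stand-in for H.

    Deliberates for one step; returns None while undecided, and a
    utility function once it has (arbitrarily) reflected three times.
    """
    env.compute("reflect")
    if env.steps_used < 3:
        return None
    return lambda outcome: 1.0 if outcome == "good" else 0.0


def extrapolated_values(
    H: Callable[[Environment], Optional[Utility]],
    T: Environment,
    max_steps: int = 1000,  # assumption: a budget so the toy always halts
) -> Utility:
    """Run H inside T and adopt whatever values it eventually outputs."""
    for _ in range(max_steps):
        decided = H(T)
        if decided is not None:
            return decided
    raise RuntimeError("H did not settle on values within the step budget")


env = Environment()
U = extrapolated_values(toy_emulation, env)
```

The point of the sketch is only that the outer definition is short and mechanical, while all of the philosophical work happens inside `H`, which the definition treats as a black box.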
You wouldn't want to run an AI with the values I lay out, but at least they are pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand and address the difficulties.
(Posted at the request of cousin_it.)
The general "goal" of this system is to make sure the world is controlled by the decisions of the program produced by the initial program, so simulating the initial program and yielding control to its output are subgoals of that. I don't see how to make that work, but I don't see why it can't be done either. The default problems of AGI instrumental drives get siphoned off into the possible initial destructiveness (but don't persist later), and the problem of figuring out human values gets delegated to the humans inside the initial program. That second problem is the part that seems potentially much harder to solve pre-WBE than whatever broadly goal-independent decision theory is needed to make this plan well-defined.
This seems to me like the third plausible winning plan, the other two being (1) figuring out and implementing pre-WBE FAI and (2) making sure the WBE shift (which shouldn't be hardware-limited for the first-runner advantage to be sufficient) is dominated by a FAI project. Unless this somehow turns out to be FAI-complete (which is probable, given that almost any plan is), it seems strictly easier than pre-WBE FAI, although it has the significant cost of possibly destroying the current world at the outset, a problem that the other two plans don't (by default) have.
"Destroy the world" doesn't seem to be a big problem to me. Paul's (proposed) AGI can be viewed as not directly caring about our world, but only about a world/computation defined by H and T (let's call that HT World). If it can figure out that its preferences for HT World can be best satisfied by it performing actions that (as a side effect) cause it to take over our world, then it seems likely that it can also figure out that it should take over our world in a non-destructive way. I'm more worried about whether (given realistic amounts of initial computing power) it would manage to do anything at all.