A recent post on my blog may be of interest to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I wrote most of the essay while visiting FHI.
The basic idea is that we can define extrapolated values by taking an emulation of a human, placing it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, there seems to be virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
You wouldn't want to run an AI with the values I lay out, but at least they are pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand and address the difficulties.
(Posted at the request of cousin_it.)
The approach I was taking was this: the initial program does whatever it likes (including, in particular, simulating the original AI), and then outputs a number, which the AI aims to control. This (hopefully) allows the initial program to control the AI's behavior in detail, by encouraging it to (say) thoroughly replace itself with a different sort of agent. Also, note that the ems can watch the AI reasoning about the ems when deciding how to administer reward.
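The scheme above can be caricatured in a few lines. This is only a toy sketch under loose assumptions: all the names (`overseer_score`, `agent_choose`, the policies) are hypothetical stand-ins, the "initial program" is a trivial scoring function rather than an emulation, and nothing here captures the real difficulty of the proposal. It is meant only to make the control flow concrete: the initial program may simulate the agent before emitting its number, and the agent's sole objective is that number.

```python
# Toy illustration (all names hypothetical): an "initial program" inspects a
# proposed policy -- including by running it, i.e. simulating the agent --
# and outputs a number; the agent simply maximizes that number.

def overseer_score(policy):
    """Stand-in for the initial program: it simulates the policy on some
    situations before deciding what number to output."""
    simulated_actions = [policy(situation) for situation in ("easy", "hard")]
    # Crude proxy for "replace yourself with a different sort of agent":
    # reward policies that defer on hard cases.
    return sum(1 for a in simulated_actions if a == "defer")

def agent_choose(policies):
    """The agent's only aim is to control (maximize) the overseer's output."""
    return max(policies, key=overseer_score)

reckless = lambda situation: "act"
cautious = lambda situation: "defer" if situation == "hard" else "act"

chosen = agent_choose([reckless, cautious])
# The agent ends up adopting the behavior the initial program rewards.
```

The point of the sketch is only that the agent's behavior is shaped entirely through the number, not through any direct specification of what it should do.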
I agree that the internal structure and the mechanism for pointing at the output should be thought about largely separately. (Although there are some interactions.)
I don't think this can be right. I expect it's impossible to create a self-contained abstract map of the world (or of human value, of which the world is an aspect); the process of making observations has to be part of the solution.
(But even if we are talking about a "number", what kind of number is it? Why would something as simple as a real number be sufficient to express the relevant counterfactual utility values that are to be compared? I don't know enough to make such assumptions.)