A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
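To make the shape of such a definition concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in, not the construction from the essay: `toy_emulation_step` plays the role of a deterministic human emulation, `hypothetical_process` plays the role of the hypothetical environment, and the constants are arbitrary. The only point is that once the emulation and the environment are fixed, the resulting value `U` takes no arguments and denotes a single number.

```python
# Toy illustration only: all names and constants are assumptions made up
# for this sketch, not part of the actual proposal.

MODULUS = 2**31


def toy_emulation_step(state: int) -> int:
    """Stand-in for one deterministic step of a human emulation."""
    return (state * 1103515245 + 12345) % MODULUS


def hypothetical_process(initial_state: int, steps: int) -> float:
    """Run the stand-in emulation in a fixed environment and return a number.

    In the real proposal the process would deliberate with access to
    powerful resources; here it just iterates a fixed update rule.
    """
    state = initial_state
    for _ in range(steps):
        state = toy_emulation_step(state)
    return state / MODULUS  # a number in [0, 1)


def U() -> float:
    """A parameterless definition: every input is fixed in advance."""
    return hypothetical_process(initial_state=42, steps=1000)
```

Because nothing is left free, calling `U()` twice gives the same number. In the real case the definition would be far too expensive to evaluate directly, so the AI would have to reason about its value rather than compute it.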
You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.
(Posted at the request of cousin_it.)
(Note that the hypothetical process probably doesn't even output a goal specification, it just outputs a number, which the AI tries to control.)
The hope is something like: "We can reason about the outputs of this process, so an AI as smart as us can reason about the outputs of this process (perhaps by realizing it can watch or ask us to learn about the behavior of the definition)." The bar the AI has to meet is to realize basically what is going on in the definition. This assumes, of course, not only that the process actually works, but that it works for the reasons we believe it works.
I have doubts about this, and it seems generally important to think about whether particular models of UDT could make these inferences. The sketchy part seems to be the AI looking out at the world and drawing mathematical inferences from the assumption that its environment is a draw from a universal distribution. There are two ways you could imagine this going wrong:
I expect UDT properly formulated avoids both issues, and if it doesn't we are going to have to take these things on anyway. But it needs some thought.
That's certainly a nice answer to the question "what's the domain of the probability distribution and the utility function?" You just say that the utility function is a parameterless definition of a single number. But that seems to lead to the danger of the AI using acausal control to make the hypothetical human output a definition that's easy to maximize. Do you think that's unlikely to happen?
ETA: on...