A recent post on my blog may be of interest to LW. It is a high-level discussion of what a precisely defined value extrapolation procedure might look like. I wrote most of the essay while visiting FHI.
The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.
You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.
(Posted at the request of cousin_it.)
Pre-WBE FAI can initially destroy the world too, if, for example, its utility function specification is as complex as CEV's.
Right, but it's not clear that this is a natural flaw of other possible FAI designs, in the way it seems to be for this one. Here, we start the AGI without any understanding of human values; only the output of the initial program, which becomes available some time in the future, is expected to have that understanding, so nothing morally guides the AGI in the meantime. By "solving FAI" I meant that we do have some technical understanding of human values when the thing is launched, which might be enough to avoid the carnage.
(This whole line of reasoning creates a motivation for thinking about Oracle AI boxing. Here we have AGIs that eventually become FAIs, but might initially be UFAI-level dangerous.)