Formalizing Value Extrapolation

paulfchristiano

A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.

The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.

You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.

(Posted at the request of cousin_it.)

A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.

(Posted at the request of cousin_it.)

"Destroy the world" doesn't seem to be a big problem to me. Paul's (proposed) AGI can be viewed as not directly caring about our world, but only about a world/computation defined by H and T (let's call that HT World). If it can figure out that its preferences for HT World can be best satisfied by it performing actions that (as a side effect) cause it to take over our world, then it seems likely that it can also figure out that it should take over our world in a non-destructive way. I'm more worried about whether (given realistic amounts of initial computing power) it would manage to do anything at all.

I'm not talking about Paul's proposal in particular, but about eventually-Friendly AIs in general. Their defining feature is that they have correct Friendly goal given by a complicated definition that leaves a lot of logical uncertainty about the goal until it's eventually made more explicit. So we might explore the neighborhood of normal FAIs, increasing the initial logical uncertainty about their goal, so that they become more and more prone to initial pursuit of generic instrumental gains at the expense of what they eventually realize to be their values.

26

Formalizing Value Extrapolation

26

26

26

Formalizing Value Extrapolation

26

26