Formalizing Value Extrapolation

paulfchristiano

A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.

The basic idea is that we can define extrapolated values by just taking an emulation of a human, putting it in a hypothetical environment with access to powerful resources, and then adopting whatever values it eventually decides on. You might want some philosophical insight before launching into such a definition, but since we are currently laboring under the threat of catastrophe, it seems that there is virtue in spending our effort on avoiding death and delegating whatever philosophical work we can to someone on a more relaxed schedule.

You wouldn't want to run an AI with the values I lay out, but at least it is pinned down precisely. We can articulate objections relatively concretely, and hopefully begin to understand/address the difficulties.

(Posted at the request of cousin_it.)

A recent post at my blog may be interesting to LW. It is a high-level discussion of what precisely defined value extrapolation might look like. I mostly wrote the essay while a visitor at FHI.

(Posted at the request of cousin_it.)

A U-maximizer faces a similar set of problems: it cannot understand the exact form of U, but it can still have well-founded beliefs about U, and about what sorts of actions are good according to U. For example, if we suppose that the U-maximizer can carry out any reasoning that we can carry out, then the U-maximizer knows to avoid anything which we suspect would be bad according to U (for example, torturing humans).

Here's a stronger version of my previous criticism of this argument. Suppose instead of giving neuroimaging data to the AI and defining H in terms of a brute force search for a model that can explain the neuroimaging data, we give it a cryptographic hash of the neuroimaging data (of sufficient length to avoid possible collisions), and modify the definition of H to first perform a brute force search to recover the neuroimaging data from the hash. In this case, we can still say that torturing is probably bad according to U, but the AI obviously can't arrive at this conclusion from the formal definition of U alone (assuming it can't break the cryptographic hash). It seems clear that we can't safely assume that "the U-maximizer can carry out any reasoning that we can carry out".

Even if the U-maximizer cannot carry out this reasoning, as long as it can recognize that humans have powerful predictive models for other humans, it can simply appropriate those models (either by carrying out reasoning inspired by human models, or by simply asking).

In order to "carry out reasoning inspired by human models", the AI has to first form a usable model of a human. I don't have a strong argument that the U-maximizer can't do this from the original definition of U (i.e., from plaintext neuroimaging data), but intuitively it seems implausible given an amount of computing power the U-maximizer might initially have access to (say, within a couple orders of magnitude of the amount needed to do standard WBE). I don't see how "simply asking" could work either. What kind of questions might the U-maximizer ask, and how can we answer it, given that we don't know how to formalize what "torture" means?

26

Formalizing Value Extrapolation

26

26

26

Formalizing Value Extrapolation

26

26