(This page is about extrapolated volition as a normative moral theory - that is, the theory that extrapolated volition captures the concept of value or what outcomes we should want. For the closely related proposal about what a sufficiently advanced self-directed AGI should be built to want/target/decide/do, see coherent extrapolated volition.)
Extrapolated volition is the notion that when we ask "What is right?", then insofar as we're asking something meaningful, we're asking about the result of running a certain logical function over possible states of the world, where this function is obtained by extrapolating our current decision-making process in directions such as "What if I knew more?", "What if I had time to consider more arguments (so long as the arguments weren't hacking my brain)?", or "What if I understood myself better and had more self-control?"
A very simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there's lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you "extrapolated" your friend's "volition": you took your model of their mind and decision process (your model of their "volition") and imagined a counterfactual version of that mind which had better information about the contents of your refrigerator, thereby "extrapolating" their volition.
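As a loose illustration only (none of these names or structures come from the theory itself), the refrigerator example can be written as a toy decision procedure: a "volition" is modeled as a function from beliefs to a choice, and extrapolating it along the "What if I knew more?" direction just means rerunning that same function on corrected beliefs.

```python
# Toy sketch (hypothetical names): a "volition" is a decision process that maps
# beliefs about the world to a choice; extrapolation reruns it on better beliefs.

def friends_decision(beliefs):
    """The friend's decision process: ask for the most-preferred drink
    that their beliefs say is available."""
    preference_order = ["orange juice", "lemonade", "water"]
    available = [drink for drink in preference_order if beliefs.get(drink, False)]
    return available[0] if available else None

# What the friend (mistakenly) believes about the refrigerator.
actual_beliefs = {"orange juice": True, "lemonade": False}

# What you learn by opening the refrigerator.
observed_facts = {"orange juice": False, "lemonade": True}

def extrapolate_knew_more(decision_process, beliefs, new_facts):
    """Extrapolate along 'What if I knew more?': rerun the unchanged
    decision process on beliefs corrected by the new facts."""
    return decision_process({**beliefs, **new_facts})

print(friends_decision(actual_beliefs))                                        # -> 'orange juice'
print(extrapolate_knew_more(friends_decision, actual_beliefs, observed_facts))  # -> 'lemonade'
```

The point of the sketch is only that the decision process itself is held fixed; what changes under extrapolation is the information it runs on.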
Having better information isn't the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you're currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you'd see there was a lot of bitterness directed at people who wear green shoes, and this would change your mind about your decision.
In Yudkowsky's version of extrapolated volition considered on an individual level, the three core directions of extrapolation are:

- Increased knowledge ("What if I knew more?")
- Increased consideration of arguments ("What if I had time to consider more arguments, so long as the arguments weren't hacking my brain?")
- Increased reflectivity and self-control ("What if I understood myself better and had more self-control?")
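Continuing the toy model above, the three core directions can be pictured as transformations that take one decision process and return an "extrapolated" one. The sketch below is purely illustrative; every function name in it is hypothetical rather than anything specified by the theory, and a real extrapolation would be vastly harder than wrapping a Python function.

```python
# Hypothetical sketch: each core direction as a wrapper that turns a decision
# process (beliefs -> choice) into an extrapolated decision process.

def knew_more(decide, extra_facts):
    """'What if I knew more?': decide as if extra_facts were also believed."""
    return lambda beliefs: decide({**beliefs, **extra_facts})

def considered_more_arguments(decide, update, arguments):
    """'What if I had time to consider more arguments?': let each argument
    (assumed not to be brain-hacking) update the beliefs before deciding."""
    def decide_after_reflection(beliefs):
        for argument in arguments:
            beliefs = update(beliefs, argument)
        return decide(beliefs)
    return decide_after_reflection

def understood_self_better(decide, self_knowledge):
    """'What if I understood myself better?': fold an accurate self-model
    into the beliefs the decision process runs on."""
    return lambda beliefs: decide({**beliefs, **self_knowledge})

# Composing the wrappers gives one crude "extrapolated volition", e.g.:
# extrapolated = understood_self_better(
#     considered_more_arguments(knew_more(decide, facts), update, arguments),
#     self_knowledge)
```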
Different people initially react to the question "Where should we point a superintelligence?" differently, and intuitively approach it from different angles (eventually we'll need an Arbital dispatching questionnaire on a page that handles this). These angles include:
Standard starting replies:
Arguendo (according to the theory's advocates), these conversations all eventually converge on EV and CEV by different roads.
(Work in progress. This page is a stub and doesn't try to write out the actual dialogues yet.)
Metaethics is the field of academic philosophy that deals not with the question "What is good?" but with the question "What sort of property is goodness?" Rather than arguing over what is good, we are, from the perspective of AI-grade philosophy, asking how to compute what is good, and why the output of that computation ought to be identified with the notion of shouldness. Theories that try to describe what should be done, rather than what is, are said to be normative rather than descriptive theories.
Within the standard terminology of academic metaethics, "extrapolated volition" as a normative theory is cognitivist (normative propositions can be true or false), naturalist (normative propositions are not irreducible or based on non-natural properties of the world), not internalist (it is not the case that all sufficiently powerful optimizers must act on what we consider to be moral propositions), and reductionist in a way that's more synthetic than analytic (we don't have a priori knowledge of what our own preferences are). Its closest antecedents in academic metaethics are Rawls and Goodman's reflective equilibrium, Harsanyi and Railton's ideal advisor theories, and Frank Jackson's moral functionalism / analytic descriptivism.
[comment:
== need to rewrite this at much greater length later, not try to hack it now ==
Responses to standard probes:
"Okay, it turns out that your AI says that eliminating malaria is what I would want to do if I knew everything, thought very fast without letting my brain be hacked, and fully understood myself, and I'm willing to believe your AI computed that correctly, but is it really good? It seems to me like I can imagine that eliminating malaria has this property, but that it still isn't really good, and that the true good is something else entirely."
Reply: "After doing some past updates, hopefully in a validly extrapolative direction, you noticed that it was possible for you to change what you'd previously computed about what to do or what was moral, and since your brain operates on a pretty weird architecture, it does this sort of thing by applying a 'rightness' tag to things. If it makes sense to identify this rightness tag with any referent, that referent is the property of what your brain would update to later. You can imagine the rightness tag just floating around and attaching to whatever,
]
[comment: == need to rewrite this at greater length later, not try to hack it now ==
Besides the three core directions, for purposes of a self-directed AGI alignment target, Yudkowsky further suggests that computing a collective EV might reasonably:
Relative to the three core directions, though, the latter two directions have a less clear normative status.
]
(in progress)