An overall schema for the friendly AI problems: self-referential convergence criteria

17 Stuart_Armstrong 13 July 2015 03:34PM

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:

Speaking very broadly, there are two features they all share:

  1. The convergence criteria are self-referential.
  2. Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).
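To make that structure concrete, here is a minimal toy sketch (my own illustration, with invented names like `revise` and `coupling`, not Armstrong's formalism): a value-revision loop whose stopping rule tests only the current state against itself, so the initial values appear nowhere in the convergence test.

```python
# Toy sketch of a self-referentially defined stopping point (my illustration).
# "Values" are just a vector; revise() nudges them toward internal consistency;
# we stop when a revision step no longer changes anything.

import numpy as np

def revise(values, coupling):
    """One revision step: pull each value toward the weighted average of
    the values it is coupled to, resolving inconsistencies locally."""
    return 0.5 * values + 0.5 * coupling @ values

def reflective_equilibrium(initial_values, coupling, tol=1e-9, max_steps=100_000):
    values = initial_values.copy()
    for _ in range(max_steps):
        new_values = revise(values, coupling)
        # Self-referential stopping criterion: the state is judged only
        # against itself.  `initial_values` appears nowhere in this test.
        if np.linalg.norm(new_values - values) < tol:
            return new_values
        values = new_values
    return values

# Example: three coupled values; the coupling matrix averages the other two.
coupling = np.array([[0.0, 0.5, 0.5],
                     [0.5, 0.0, 0.5],
                     [0.5, 0.5, 0.0]])
print(reflective_equilibrium(np.array([1.0, 0.0, -1.0]), coupling))
```

Whatever this converges to, the test that declares "done" knows nothing about where the process started; that is the worry.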

So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.

The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise these as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your belief in individual human freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.
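Here is a toy simulation of that failure mode - entirely made-up dynamics of my own, not a model from the post: two values are balanced against each other under noise, and any step that knocks a value to zero is absorbing, so the process "converges" having silently lost one of them.

```python
# Toy sketch (hypothetical dynamics): balancing two values under noise,
# where a value knocked down to zero is jettisoned and never recovered.

import random

def balance(culture=1.0, freedom=1.0, steps=1000, noise=0.05):
    dropped = set()
    for _ in range(steps):
        target = (culture + freedom) / 2  # naive compromise between the two values
        if "culture" not in dropped:
            culture += 0.1 * (target - culture) + random.gauss(0, noise)
        if "freedom" not in dropped:
            freedom += 0.1 * (target - freedom) + random.gauss(0, noise)
        # Absorbing error state: once a value hits zero, it stays jettisoned.
        if culture <= 0:
            culture = 0.0
            dropped.add("culture")
        if freedom <= 0:
            freedom = 0.0
            dropped.add("freedom")
    return culture, freedom, dropped

# More noise makes it more likely that a value is lost before "convergence":
print(balance(noise=0.01))
print(balance(noise=0.5))
```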

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.

 

Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

  1. Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
  2. Solve all or most of the problem ahead of time (eg traditional FAI approach by specifying the correct values).
  3. Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
  4. Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", general precautions that are done when defining the convergence process).

 

Holden's Objection 1: Friendliness is dangerous

11 PhilGoetz 18 May 2012 12:48AM

Nick_Beckstead asked me to link to posts I referred to in this comment.  I should put up or shut up, so here's an attempt to give an organized overview of them.

Since I wrote these, LukeProg has begun tackling some related issues.  He has accomplished the seemingly-impossible task of writing many long, substantive posts, none of which I recall disagreeing with.  And I have, irrationally, not read most of his posts.  So he may have dealt with more of these same issues.

I think that I only raised Holden's "objection 2" in comments, which I couldn't easily dig up; and in a critique of a book chapter, which I emailed to LukeProg and did not post to LessWrong.  So I'm only going to talk about "Objection 1:  It seems to me that any AGI that was set to maximize a "Friendly" utility function would be extraordinarily dangerous."  I've arranged my previous posts and comments on this point into categories.  (Much of what I've said on the topic has been in comments on LessWrong and Overcoming Bias, and in email lists including SL4, and isn't here.)

 

The concept of "human values" cannot be defined in the way that FAI presupposes

Human errors, human values:  Suppose all humans shared an identical set of values, preferences, and biases.  We cannot retain human values without retaining human errors, because there is no principled distinction between them.

A comment on this post:  There are at least three distinct levels of human values:  the values an evolutionary agent holds that maximize its reproductive fitness, the values a society holds that maximize its fitness, and the values held by a rational optimizer who has chosen to maximize social utility.  They often conflict.  Which of them are the real human values?

Values vs. parameters:  Eliezer has suggested using human values, but without time discounting (= changing the time-discounting parameter).  CEV presupposes that we can abstract human values and apply them in a different situation that has different parameters.  But the parameters are values.  There is no distinction between parameters and values.

A comment on "Incremental progress and the valley":  The "values" that our brains try to maximize in the short run are designed to maximize different values for our bodies in the long run.  Which are human values:  The motivations we feel, or the effects they have in the long term?  LukeProg's post Do Humans Want Things? makes a related point.

Group selection update:  The reason I harp on group selection, besides my outrage at the way it's been treated for the past 50 years, is that group selection implies that some human values evolved at the group level, not at the level of the individual.  This means that increasing the rationality of individuals may enable people to act more effectively in their own interests, rather than in the group's interest, and thus diminish the degree to which humans embody human values.  Identifying the values embodied in individual humans - supposing we could do so - would still not arrive at human values.  Transferring human values to a post-human world, which might contain groups at many different levels of a hierarchy, would be problematic.

I wanted to write about my opinion that human values can't be divided into final values and instrumental values, the way discussion of FAI presumes they can.  This is an idea that comes from mathematics, symbolic logic, and classical AI.  A symbolic approach would probably make proving safety easier.  But human brains don't work that way.  You can and do change your values over time, because you don't really have terminal values.

Strictly speaking, it is impossible for an agent whose goals are all indexical goals describing states involving itself to have preferences about a situation in which it does not exist.  Those of you who are operating under the assumption that we are maximizing a utility function with evolved terminal goals should, I think, admit that these terminal goals all involve either ourselves or our genes.  If they involve ourselves, then utility functions based on these goals cannot even be computed once we die.  If they involve our genes, then they are goals that our bodies are pursuing - goals that we, the conscious agents inside our bodies, call errors rather than goals when we evaluate them.  In either case, there is no logical reason for us to wish to maximize some utility function based on these after our own deaths.  Any action I wish to take regarding the distant future necessarily presupposes that the entire SIAI approach to goals is wrong.

My view, under which it does make sense for me to say I have preferences about the distant future, is that my mind has learned "values" that are not symbols, but analog numbers distributed among neurons.  As described in "Only humans can have human values", these values do not exist in a hierarchy with some at the bottom and some on the top, but in a recurrent network which does not have a top or a bottom, because the different parts of the network developed simultaneously.  These values therefore can't be categorized into instrumental or terminal.  They can include very abstract values that don't need to refer specifically to me, because other values elsewhere in the network do refer to me, and this will ensure that actions I finally execute incorporating those values are also influenced by my other values that do talk about me.

Even if human values existed, it would be pointless to preserve them

Only humans can have human values:

  • The only preferences that can be unambiguously determined are the preferences a person (mind+body) implements, which are not always the preferences expressed by their beliefs.
  • If you extract a set of consciously-believed propositions from an existing agent, then build a new agent to use those propositions in a different environment, with an "improved" logic, you can't claim that it has the same values, since it will behave differently.
  • Values exist in a network of other values.  A key ethical question is to what degree values are referential (meaning they can be tested against something outside that network); or non-referential (and hence relative).
  • Supposing that values are referential helps only by telling you to ignore human values.
  • You cannot resolve the problem by combining information from different behaviors, because the needed information is missing.
  • Today's ethical disagreements are largely the result of attempting to extrapolate ancestral human values into a changing world.
  • The future will thus be ethically contentious even if we accurately characterize and agree on present human values, because these values will fail to address the new important problems.


Human values differ as much as values can differ:  There are two fundamentally different categories of values:

  • Non-positional, mutually-satisfiable values (physical luxury, for instance)
  • Positional, zero-sum social values, such as wanting to be the alpha male or the homecoming queen

All mutually-satisfiable values have more in common with each other than they do with any non-mutually-satisfiable values, because mutually-satisfiable values are compatible with social harmony and non-problematic utility maximization, while non-mutually-satisfiable values require eternal conflict.  If you find an alien life form from a distant galaxy with non-positional values, it would be easier to integrate those values into a human culture with only human non-positional values, than to integrate already-existing positional human values into that culture.

It appears that some humans have mainly the one type, while other humans have mainly the other type.  So talking about trying to preserve human values is pointless - the values held by different humans have already passed the most-important point of divergence.

 

Enforcing human values would be harmful

The human problem:  This argues that the qualia and values we have now are only the beginning of those that could evolve in the universe, and that ensuring that we maximize human values - or any existing value set - from now on, will stop this process in its tracks, and prevent anything better from ever evolving.  This is the most-important objection of all.

Re-reading this, I see that the critical paragraph is painfully obscure, as if written by Kant; but it summarizes the argument: "Once the initial symbol set has been chosen, the semantics must be set in stone for the judging function to be "safe" for preserving value; this means that any new symbols must be defined completely in terms of already-existing symbols.  Because fine-grained sensory information has been lost, new developments in consciousness might not be detectable in the symbolic representation after the abstraction process.  If they are detectable via statistical correlations between existing concepts, they will be difficult to reify parsimoniously as a composite of existing symbols.  Not using a theory of phenomenology means that no effort is being made to look for such new developments, making their detection and reification even more unlikely.  And an evaluation based on already-developed values and qualia means that even if they could be found, new ones would not improve the score.  Competition for high scores on the existing function, plus lack of selection for components orthogonal to that function, will ensure that no such new developments last."

Averaging value systems is worse than choosing one:  This describes a neural network that encodes preferences, takes some input pattern, and computes a new pattern that optimizes those preferences.  Such a system is taken as an analogue of a value system together with an ethical system for attaining those values.  I then define a measure of the internal conflict produced by a set of values, and show that a system built by averaging together the parameters from many different systems will have higher internal conflict than any of the systems that were averaged together to produce it.  The point is that the CEV plan of "averaging together" human values will result in a set of values that is worse (more self-contradictory) than any of the value systems it was derived from.
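The post's argument uses a specific neural-network model and conflict measure; as a much simpler stand-in, here is a toy sketch of the same qualitative phenomenon.  Each value system is represented by pairwise preference strengths, "internal conflict" is counted as the number of cyclic (intransitive) triples, and averaging three individually consistent systems produces a system containing a cycle.  Both the representation and the conflict measure are my own simplifications, not the ones in the original post.

```python
# Toy illustration: averaging consistent preference systems yields a
# conflicted (cyclic) one.  M[i][j] > 0 means option i is preferred to j.

import numpy as np
from itertools import permutations

def margin_matrix(ranking, n=3):
    """Pairwise margins implied by a strict ranking, e.g. [0, 1, 2] = 0 > 1 > 2."""
    M = np.zeros((n, n))
    for pos, a in enumerate(ranking):
        for b in ranking[pos + 1:]:
            M[a, b], M[b, a] = 1.0, -1.0
    return M

def conflict(M):
    """Count cyclic triples i>j, j>k, k>i - zero for any consistent ranking."""
    n = len(M)
    return sum(1 for i, j, k in permutations(range(n), 3)
               if M[i, j] > 0 and M[j, k] > 0 and M[k, i] > 0)

A = margin_matrix([0, 1, 2])   # values 0 > 1 > 2
B = margin_matrix([1, 2, 0])   # values 1 > 2 > 0
C = margin_matrix([2, 0, 1])   # values 2 > 0 > 1

avg = (A + B + C) / 3
print(conflict(A), conflict(B), conflict(C))  # each system is internally consistent: 0 0 0
print(conflict(avg))                          # the averaged system contains a preference cycle
```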


A point I may not have made in these posts, but made in comments, is that the majority of humans today think that women should not have full rights, homosexuals should be killed or at least severely persecuted, and nerds should be given wedgies.  These are not incompletely-extrapolated values that will change with more information; they are values.  Opponents of gay marriage make it clear that they do not object to gay marriage based on a long-range utilitarian calculation; they directly value not allowing gays to marry.  Many human values horrify most people on this list, so they shouldn't be trying to preserve them.

Objections to Coherent Extrapolated Volition

11 XiXiDu 22 November 2011 10:32AM

In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

— Eliezer Yudkowsky, May 2004, Coherent Extrapolated Volition

Foragers versus industry era folks

Consider the difference between a hunter-gatherer, who cares about his hunting success and about becoming the new tribal chief, and a modern computer scientist who wants to determine whether a “sufficiently large randomized Conway board could turn out to converge to a barren ‘all off’ state.”

The utility of success in hunting down animals or in proving abstract conjectures about cellular automata is largely determined by factors such as your education, culture and environmental circumstances. The same forager who cared about killing a lot of animals, to get the best ladies in his clan, might under different circumstances have turned out to be a vegetarian mathematician caring solely about his understanding of the nature of reality. The two sets of values are to some extent mutually exclusive, or at least disjoint. Yet both sets of values are what the person wants, given the circumstances. Change the circumstances dramatically and you change the person’s values.

What do you really want?

You might conclude that what the hunter-gatherer really wants is to solve abstract mathematical problems; he just doesn’t know it. But there is no set of values that a person “really” wants. Humans are largely defined by the circumstances they reside in.

  • If you already knew a movie, you wouldn’t watch it.
  • To be able to get your meat from the supermarket changes the value of hunting.

If “we knew more, thought faster, were more the people we wished we were, and had grown up closer together”, then we would stop desiring what we had learnt, wish to think faster still, become yet more different people, and grow bored of, and apart from, the people similar to us.

A singleton is an attractor

A singleton will inevitably change everything by causing a feedback loop between itself as an attractor and humans and their values.

Many of our values and goals - what we want - are culturally induced or the result of our ignorance. Reduce our ignorance and you change our values. One trivial example is our intellectual curiosity: if we no longer need to figure out what we want on our own, our curiosity is impaired.

A singleton won’t extrapolate human volition but will instead implement an artificial set of values, arrived at by abstract, higher-order contemplation about rational conduct.

With knowledge comes responsibility, with wisdom comes sorrow

Knowledge changes and introduces terminal goals. The toolkit called ‘rationality’ - the rules and heuristics developed to help us achieve our terminal goals - also alters and deletes them. A stone age hunter-gatherer seems to possess very different values than we do. Learning about rationality and various ethical theories such as Utilitarianism would alter those values considerably.

Rationality was meant to help us achieve our goals, e.g. become a better hunter. Rationality was designed to tell us what we ought to do (instrumental goals) to achieve what we want to do (terminal goals). Yet what actually happens is that we are told - that we learn - what we ought to want.

If an agent becomes more knowledgeable and smarter, its goal and reward system will not stay intact unless it is specifically designed to be stable. An agent who originally wanted to become a better hunter and feed his tribe would end up wanting to eliminate poverty in Obscureistan. The question is: how much of this new “wanting” is the result of using rationality to achieve terminal goals, and how much is a side effect of using rationality? How much is left of the original values, versus the values induced by a feedback loop between the toolkit and its user?

Take for example an agent facing the Prisoner’s dilemma. Such an agent might originally tend to cooperate, and only after learning about game theory decide to defect to gain a greater payoff. Was it rational for the agent to learn about game theory, in the sense that it helped the agent achieve its goal - or in the sense that it deleted one of its goals in exchange for an allegedly more “valuable” goal?
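For concreteness, the standard payoff structure being appealed to here looks like this (illustrative numbers of my own, not taken from the post); once the agent knows the matrix, defection dominates whatever the other player does.

```python
# A standard one-shot Prisoner's Dilemma payoff matrix (illustrative numbers).
# Whatever the other player does, defecting yields a higher payoff.

PAYOFF = {  # (my_move, their_move) -> my payoff
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"):    0,
    ("defect",    "cooperate"): 5,
    ("defect",    "defect"):    1,
}

def best_response(their_move):
    return max(("cooperate", "defect"), key=lambda m: PAYOFF[(m, their_move)])

print(best_response("cooperate"))  # defect
print(best_response("defect"))     # defect -> defection dominates cooperation
```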

Beware rationality as a purpose in and of itself

It seems to me that becoming more knowledgeable and smarter is gradually altering our utility functions. But what is it that we are approaching if the extrapolation of our volition becomes a purpose in and of itself? Extrapolating our coherent volition will distort or alter what we really value by installing a new cognitive toolkit designed to achieve an equilibrium between us and other agents with the same toolkit.

Would a singleton be a tool that we can use to get what we want, or would the tool use us to do what it does? Would we be modeled, or would it create models? Would we be extrapolating our volition, or rather following our extrapolations?

(This post is a write-up of a previous comment, posted here to get feedback from a larger audience.)

Friendlier AI through politics

1 Jonathan_Graehl 16 August 2009 09:29PM

David Brin suggests that some kind of political system populated with humans and diverse but imperfectly rational and friendly AIs would evolve in a satisfactory direction for humans.

I don't know whether creating an imperfectly rational general AI is any easier, except that limited perceptual and computational resources obviously imply less-than-optimal outcomes; still, why shouldn't we hope for the best outcome achievable given those constraints?  I imagine the question will become more settled before anyone nears unleashing a self-improving superhuman AI.

An imperfectly friendly AI, perfectly rational or not, is a very likely scenario.  Is it sufficient to create diverse singleton value-systems (demographically representative of humans' values) rather than a single monolithic Friendly AI built on a consensus over all humans' values?

What kind of competitive or political system would make fragmented, squabbling AIs safer than an attempt to get the monolithic approach right?  Brin seems to have some hope of improving politics regardless of AI participation, but I'm not sure exactly what his dream is or how to get there - perhaps his "disputation arenas" would work if the participants were rational and altruistically honest.