Update 5-24-2013: A cleaned-up, citable version of this article is now available on MIRI's website.
Co-authored with crazy88
Summary: Yudkowsky's "coherent extrapolated volition" (CEV) concept shares much in common with Ideal Advisor theories in moral philosophy. Does CEV fall prey to the same objections that are raised against Ideal Advisor theories? Because CEV is an epistemic rather than a metaphysical proposal, it seems that at least one family of CEV approaches (inspired by Bostrom's parliamentary model) may escape the objections raised against Ideal Advisor theories. This is not a particularly ambitious post; it mostly aims to place CEV in the context of mainstream moral philosophy.
What is of value to an agent? Maybe it's just whatever they desire. Unfortunately, our desires are often the product of ignorance or confusion. I may desire to drink from the glass on the table because I think it is water when really it is bleach. So perhaps something is of value to an agent if they would desire that thing if fully informed. But here we crash into a different problem. It might be of value for an agent who wants to go to a movie to look up the session times, but the fully informed version of the agent will not desire to do so — they are fully-informed and hence already know all the session times. The agent and its fully-informed counterparts have different needs. Thus, several philosophers have suggested that something is of value to an agent if an ideal version of that agent (fully informed, perfectly rational, etc.) would advise the non-ideal version of the agent to pursue that thing.
This idea of idealizing or extrapolating an agent's preferences1 goes back at least as far as Sidgwick (1874), who considered the idea that "a man's future good" consists in "what he would now desire... if all the consequences of all the different [actions] open to him were accurately foreseen..." Similarly, Rawls (1971) suggested that a person's good is the plan "that would be decided upon as the outcome of careful reflection in which the agent reviewed, in the light of all the relevant facts, what it would be like to carry out these plans..." More recently, in an article about rational agents and moral theory, Harsanyi (1982) defined an agent's rational wants as "the preferences he would have if he had all the relevant factual information, always reasoned with the greatest possible care, and were in a state of mind most conducive to rational choice." Then, a few years later, Railton (1986) identified a person's good with "what he would want himself to want... were he to contemplate his present situation from a standpoint fully and vividly informed about himself and his circumstances, and entirely free of cognitive error or lapses of instrumental rationality."
Rosati (1995) calls these theories Ideal Advisor theories of value because they identify one's personal value with what an ideal version of oneself would advise the non-ideal self to value.
Looking not for a metaphysical account of value but for a practical solution to machine ethics (Wallach & Allen 2009; Muehlhauser & Helm 2012), Yudkowsky (2004) described a similar concept which he calls "coherent extrapolated volition" (CEV):
In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
In other words, the CEV of humankind is about the preferences that we would have as a species if our preferences were extrapolated in certain ways. Armed with this concept, Yudkowsky then suggests that we implement CEV as an "initial dynamic" for "Friendly AI." Tarleton (2010) explains that the intent of CEV is that "our volition be extrapolated once and acted on. In particular, the initial extrapolation could generate an object-level goal system we would be willing to endow a superintelligent [machine] with."
CEV theoretically avoids many problems with other approaches to machine ethics (Yudkowsky 2004; Tarleton 2010; Muehlhauser & Helm 2012). However, there are reasons it may not succeed. In this post, we examine one such reason: Resolving CEV at the level of humanity (Global CEV) might require at least partially resolving CEV at the level of individuals (Personal CEV)2, but Personal CEV is similar to ideal advisor theories of value,3 and such theories face well-explored difficulties. As such, these difficulties may undermine the possibility of determining the Global CEV of humanity.
Before doing so, however, it's worth noting one key difference between Ideal Advisor theories of value and Personal CEV. Ideal Advisor theories are typically linguistic or metaphysical theories, while the role of Personal CEV is epistemic. Ideal Advisor theorists attempt to define what it is for something to be of value for an agent. Because of this, their accounts need to give an unambiguous and plausible answer in all cases. Personal CEV, on the other hand, isn't intended to define what is of value for an agent. Rather, it is offered as a technique that can help an AI come to know, to some reasonable but not necessarily perfect level of accuracy, what is of value for the agent. To put it more precisely, Personal CEV is intended to allow an initial AI to determine what sort of superintelligence to create such that we end up with what Yudkowsky calls a "Nice Place to Live." Given this, certain arguments are likely to threaten Ideal Advisor theories but not Personal CEV, and vice versa.
With this point in mind, we now consider some objections to ideal advisor theories of value, and examine whether they threaten Personal CEV.
Sobel's First Objection: Too many voices
Four prominent objections to ideal advisor theories are due to Sobel (1994). The first of these, the “too many voices” objection, notes that the evaluative perspective of an agent changes over time and, as such, the views that would be held by the perfectly rational and fully informed version of the agent will also change. This implies that each agent will be associated not with one idealized version of themselves but with a set of such idealized versions (one at time t, one at time t+1, etc.), some of which may offer conflicting advice. Given this “discordant chorus,” it is unclear how the agent’s non-moral good should be determined.
Various responses to this objection run into their own challenges. First, privileging a single perspective (say, the idealized agent at time t+387) seems ad hoc. Second, attempting to aggregate the views of multiple perspectives runs into the question of how trade-offs should be made. That is, if two of the idealized viewpoints disagree about what is to be preferred, it’s unclear how an overall judgment should be reached.4 Finally, the claim that the idealized versions of the agent at different times will share the same perspective seems implausible, and it is surely a substantive claim requiring a substantive defense. So the obvious responses to Sobel’s first objection introduce serious new challenges which then need to be resolved.
One final point is worth noting: this objection seems equally problematic for Personal CEV. The extrapolated volition of the agent is likely to vary at different times, so how should we determine an overall account of the agent’s extrapolated volition?
Sobel’s Second and Third Objections: Amnesia
Sobel’s second and third objections build on two other claims (see Sobel 1994 for a defense of these). First: some lives can only be evaluated if they are experienced. Second: experiencing one life can leave you incapable of experiencing another in an unbiased way. Given these claims, Sobel presents an amnesia model as the most plausible way for an idealized agent to gain the experiences necessary to evaluate all the relevant lives. According to this model, an agent experiences each life sequentially but undergoes an amnesia procedure after each one so that they may experience the next life uncolored by their previous experiences. After experiencing all lives, the amnesia is then removed.
Following on from this, Sobel’s second objection is that suddenly recalling a life from one evaluative perspective may be a strongly dissimilar experience to living that life from a vastly different evaluative perspective. When the amnesia is removed, the agent has an evaluative perspective (informed by their memories of all the lives they’ve lived) that differs so much from the perspective they had while living each life, independently of such memories, that they might be incapable of adequately evaluating the lives they’ve experienced from their current, more knowledgeable, standpoint.
Sobel’s third objection also relates to the amnesia model: Sobel argues that the idealized agent might be driven insane by the entire amnesia process and hence might not be able to adequately evaluate what advice they ought to give the non-ideal agent. In response to this, there is some temptation to simply demand that the agent be idealized not just in terms of rationality and knowledge but also in terms of their sanity. However, perhaps any idealized agent that is similar enough to the original to serve as a standard for their non-moral good will be driven insane by the amnesia process and so the demand for a sane agent will simply mean that no adequate agent can be identified.
If we grant that an agent needs to experience some lives to evaluate them, and we grant that experiencing some lives leaves them incapable of experiencing others, then there seems to be a strong drive for Personal CEV to rely on an amnesia model to adequately determine what an agent’s volition would be if extrapolated. If so, however, then Personal CEV seems to face the challenges raised by Sobel.
Sobel’s Fourth Objection: Better Off Dead
Sobel’s final objection is that the idealized agent, having experienced such a level of perfection, might come to the conclusion that their non-ideal counterpart is so limited as to be better off dead. Further, the ideal agent might make this judgment because of the relative level of well-being of the non-ideal agent rather than the agent’s absolute level of well-being. (That is, the ideal agent may look upon the well-being of the non-ideal agent as we might look upon our own well-being after an accident that caused us severe mental damage. In such a case, we might be unable to objectively judge our life after the accident due to the relative difficulty of this life as compared with our life before the accident.) As such, this judgment may not capture what is actually in accordance with the agent’s non-moral good.
Again, this criticism seems to apply equally to Personal CEV: when the volition of an agent is extrapolated, it may turn out that this volition endorses killing the non-extrapolated version of the agent. If so, this seems to be a mark against the possibility that Personal CEV can play a useful part in a process that should eventually terminate in a "Nice Place to Live."
A model of Personal CEV
The seriousness of these challenges for Personal CEV is likely to vary depending on the exact nature of the extrapolation process. To give a sense of the impact, we will consider one family of methods for carrying out this process: the parliamentary model (inspired by Bostrom 2009). According to this model, we determine the Personal CEV of an agent by simulating multiple versions of them, extrapolated from various starting times and along different developmental paths. Some of these versions are then convened as a parliament, where they vote on various choices and make trades with one another.
Clearly this approach allows our account of Personal CEV to avoid the too many voices objection. After all, the parliamentary model provides us with an account of how we can aggregate the views of the agent at various times: we should simulate the various agents and allow them to vote and trade on the choices to be made. It is through this voting and trading that the various voices can be combined into a single viewpoint. While this process may not be adequate as a metaphysical account of value, it seems more plausible as an account of Personal CEV as an epistemic notion. Certainly, your authors would deem themselves to be more informed about what they value if they knew the outcome of the parliamentary model for themselves.
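To make the aggregation step concrete, here is a minimal sketch of how a parliament of extrapolated versions of one agent might combine its voices into a single choice. Everything here is a hypothetical illustration: the delegates, their preference orderings, the equal voting weights, and the Borda-style scoring rule are assumptions of ours, not part of Bostrom's or Yudkowsky's proposals (which also involve trading, not just voting).

```python
# Toy aggregation for a parliamentary model of Personal CEV.
# Each "delegate" is one extrapolated version of the same agent,
# extrapolated from a different starting time or developmental path.
from collections import defaultdict


def parliament_vote(delegates, options):
    """Combine delegates' rankings into one choice via a weighted Borda count.

    `delegates` is a list of (rank_fn, weight) pairs; rank_fn scores an
    option, and higher scores mean the delegate prefers it more.
    """
    scores = defaultdict(float)
    for rank_fn, weight in delegates:
        ranking = sorted(options, key=rank_fn, reverse=True)
        for position, option in enumerate(ranking):
            # An option ranked higher by this delegate earns more points.
            scores[option] += weight * (len(options) - position)
    return max(options, key=lambda o: scores[o])


# Three extrapolated versions of one agent, with equal voting weight.
# The numeric preferences are made up for illustration.
delegates = [
    (lambda o: {"study": 3, "travel": 1, "rest": 2}[o], 1.0),
    (lambda o: {"study": 1, "travel": 3, "rest": 2}[o], 1.0),
    (lambda o: {"study": 2, "travel": 1, "rest": 3}[o], 1.0),
]
choice = parliament_vote(delegates, ["study", "travel", "rest"])
# "rest" wins: no delegate ranks it last, so it collects the most points.
```

Note the design point: even this toy rule gives the "too many voices" objection a determinate answer, because disagreement between time-slices resolves into a score rather than a deadlock. Whether such a rule tracks what is genuinely of value is exactly the question the metaphysical reading would press, and the epistemic reading can set aside.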
This approach is also able to avoid Sobel’s second and third objections. The objections were specifically targeted at the amnesia model where one agent experienced multiple lives. As the parliamentary model does not utilize amnesia, it is immune to these concerns.
What of Sobel’s fourth objection? Sobel’s concern here is not simply that the idealized agent might advise the agent to kill themselves. After all, sometimes death may, in fact, be of value for an agent. Rather, Sobel’s concern is that the idealized agent, having experienced such heights of existence, will become biased against the limited lives of normal agents.
It's less clear how the parliamentary model deals with Sobel's fourth objection, which plausibly retains its initial force against this model of Personal CEV. However, we do not intend to solve Personal CEV entirely in this short post. Rather, we aim to demonstrate only that the force of Sobel's four objections depends on the model of Personal CEV selected. Reflection on the parliamentary model makes this point clear.
So the parliamentary model seems able to avoid at least three of the direct criticisms raised by Sobel. It is worth noting, however, that some concerns remain. Firstly, for those who accept Sobel’s claim that experience is necessary to evaluate some lives, it is clear that no member of the parliament will be capable of comparing their life to all other possible lives, as none will have all the required experience. As such, the agents may falsely judge a certain aspect of their life to be more or less valuable than it, in fact, is. For a metaphysical account of personal value, this problem might be fatal. Whether it is also fatal for the parliamentary model of Personal CEV depends on whether the knowledge of the various members of the parliament is enough to produce a “Nice Place to Live” regardless of its imperfection.
Two more issues might arise. First, the model might require careful selection of who to appoint to the parliament. For example, if most of the possible lives that an agent could live would drive them insane, then selecting which of these agents to appoint to the parliament at random might lead to a vote by the mad. Second, it might seem that this approach to determining Personal CEV will require a reasonable level of accuracy in simulation. If so, there might be concerns about the creation of, and responsibility to, potential moral agents.
Given these points, a full evaluation of the parliamentary model will require more detailed specification and further reflection. However, two points are worth noting in conclusion. First, the parliamentary model does seem to avoid at least three of Sobel’s direct criticisms. Second, even if this model eventually ends up being flawed on other grounds, the existence of one model of Personal CEV that can avoid three of Sobel’s objections gives us reason to expect other promising models of Personal CEV may be discovered.
Notes
1 Another clarification to make concerns the difference between idealization and extrapolation. An idealized agent is a version of the agent with certain idealizing characteristics (perhaps logical omniscience and infinite speed of thought). An extrapolated agent is a version of the agent that represents what they would be like if they underwent certain changes or experiences. Note two differences between these concepts. First, an extrapolated agent need not be ideal in any sense (though useful extrapolated agents often will be) and certainly need not be perfectly idealized. Second, extrapolated agents are determined by a specific type of process (extrapolation from the original agent), whereas no such restriction is placed on how the form of an idealized agent is determined. CEV utilizes extrapolation rather than idealization, as do some Ideal Advisor theories. In this post, we talk about "ideal" or "idealized" agents as a catch-all for both idealized agents and extrapolated agents.
2 Standard objections to ideal advisor theories of value are also relevant to some proposed variants of CEV, for example Tarleton's (2010) suggestion of "Individual Extrapolated Volition followed by Negotiation, where each individual human’s preferences are extrapolated by factual correction and reflection; once that process is fully complete, the extrapolated humans negotiate a combined utility function for the resultant superintelligence..." Furthermore, some objections to Ideal Advisor theories also seem relevant to Global CEV even if they are not relevant to a particular approach to Personal CEV, though that discussion is beyond the scope of this article. As a final clarification, see Dai (2010).
3 Ideal Advisor theories are not to be confused with "Ideal Observer theory" (Firth 1952). For more on Ideal Advisor theories of value, see Zimmerman (2003); Tanyi (2006); Enoch (2005); Miller (2013, ch. 9).
4 This is basically an intrapersonal version of the standard worries about interpersonal comparisons of well-being. The basis of these worries is that even if we can specify an agent’s preferences numerically, it’s unclear how we should compare the numbers assigned by one agent with the numbers assigned by the other. In the intrapersonal case, the challenge is to determine how to compare the numbers assigned by the same agent at different times. See Gibbard (1989).
It looks to me like Sobel's fourth objection may stem, in behavioral-economics terms, from prospect theory's position-relative evaluation of gains and losses, in which losses are more painful than corresponding gains are pleasurable (typically by an empirical factor of around 2 to 2.5).
These position-relative evaluations are already inconsistent, i.e., they can be reliably manipulated in laboratory settings to yield circular preferences. So construing a volition probably already requires (just to end up with a consistent utility function and coherent instrumental strategies) that we transform the position-relative evaluations into outcome evaluations somehow.
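The inconsistency being pointed at can be shown with a small worked example. The value-function shape below (exponent 0.88, loss-aversion coefficient 2.25) follows Tversky and Kahneman's 1992 estimates; the particular outcomes and reference points are made up for illustration, and this is only a sketch of the phenomenon, not anything from the CEV proposal itself.

```python
# Reference-dependent (position-relative) evaluation, prospect-theory style:
# the same pair of final outcomes is chosen differently depending on the
# reference point it is evaluated from.

def v(x, alpha=0.88, lam=2.25):
    """Value of a gain or loss x relative to the current reference point.

    Gains are evaluated as x^alpha; losses are amplified by the
    loss-aversion factor lam (~2.25 empirically).
    """
    return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)


def prospect_value(outcomes, reference):
    """Expected value of (probability, final_wealth) pairs, coded as
    gains/losses from `reference` rather than as final states."""
    return sum(p * v(w - reference) for p, w in outcomes)


sure = [(1.0, 100)]              # final wealth 100 for certain
gamble = [(0.5, 0), (0.5, 200)]  # final wealth 0 or 200, 50/50

# Framed as gains (reference point 0): the sure thing wins (risk aversion).
prefers_sure = prospect_value(sure, 0) > prospect_value(gamble, 0)

# Framed as losses (reference point 200): the gamble wins (risk seeking),
# even though the final wealths on offer are identical.
prefers_gamble = prospect_value(gamble, 200) > prospect_value(sure, 200)
```

Since both comparisons come out true, a position-relative evaluator flips its choice between identical sets of final outcomes as its reference point moves. That is the kind of incoherence that, on the view above, would already have to be transformed away before the construed volition could have a consistent utility function over outcomes.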
The 'Ideal Advisor' part would come in at the point where we handed this 'construed' volition a veridical copy of the original predictive model of the human. Thus, this new value system could still reliably predict the actual experiences and reactions of the original human, rather than falsely supposing that the actual human would react in the same way as the construed volition would.
So the construed volition would itself have some coherent utility function over experiences the original human could have - it would not see the human's current state as a huge loss relative to its own position, because it would no longer be evaluating gains and losses. It would also correctly be able to evaluate that the original human would experience various life-improvements as large, joyful gains.
So Sobel's fourth objection would probably not arise if the process of construing a volition proceeded in that particular fashion, which in turn is not ad-hoc since positional evaluation was already a large source of inconsistency that would have to be transformed into a coherent utility function somehow, and likewise giving the idealized process veridical knowledge of the original human is a basic paradigm of volition (the whole Ideal Advisor setup).
Sobel's third and second objections seem to revolve around how a construed volition operates over its (abstract) model of possible life experiences that could occur to the original human. (This model had better be abstract! We don't want to inadvertently create people by simulating them in full detail during the process of deciding whether or not to create them.) Suppose we have a construed volition with a coherent utility function, looking over a set of lives that the original human might experience. The amnesia problem is already dissipated if we can pull off this setup; the construed volition does not forget anything. The second problem - the supposed impossibility of choosing between two lives correctly without actually having led both, while the prospect of leading both introduces an ordering effect - gets us into much thornier territory. Let's first note that it's not obvious that the correct judgment is the one you'd make if you'd actually led a certain life. E.g., heroin!addict!Eliezer thinks that heroin is an absolutely great idea, but I don't want my volition to be construed such that its knowledge that this overpowering psychological motivation would counterfactually result from heroin addiction would actually constitute a reason to feed me heroin. I think this points in the direction of an Ideal Advisor ethics wherein construing a volition looks more like modeling how my current values judge future experiences (including my current values over having new desires fulfilled) than toward construing my volition to have direct empathy with future selves, i.e., translation of their own psychological impulses into volitional impetuses of equal strength. This doesn't so much deal with Sobel's second objection as pack it into the problem of construing a volition that shows an analogue of my care for my own (and others') future selves without experiencing 'direct empathy' or direct translation of forceful desires.
We're also dancing around the difficulty of having a construed volition which has values over predicted conscious experiences without that volition itself being a bearer of conscious experiences, mostly because I still don't have any good idea of how to solve that one. Resolving consciousness to be less mysterious hasn't yet helped me much on figuring out how to accurately model things getting wet without modeling any water.
Sobel's first problem was a to-do in CEV since day one (the original essay proposed evaluating a spread of possibilities) and I'm willing to point to Bostrom's parliament as the best model yet offered. There's no such thing as "too many voices", just the number of voices you can manage to model on available hardware.