Apparently this is being read by major philosophers now, which is good on the one hand, but on the other hand a really quick review of historical context:
The background problem here is that we want an effective decision procedure of bounded complexity which can actually be implemented in sufficiently advanced Artificial Intelligences.
The first difficulty is the "effective" part. Suppose you want to build a chess-playing program. A philosophy undergrad wisely informs you that you ought to instruct your chess-playing program to make "good moves". You reply that you need a more "effective" specification of what a good move is, so that you can get your program to do it. The undergrad tells you that a good move is one which is wise, highly informed, which will not later be revealed to be a bad move, and so on. What you actually need here is something along the lines of "A good move is one which, when combined with the other player's moves, results in a board state which the following computable predicate verifies as 'winning'". Once you realize the other player is trying to perform a symmetric but opposed procedure, you can model the chessboard's future using search trees. Pragmatically, you're still a long way off from beating Kasparov. But given unbounded finite computing power you could play perfect chess. In turn, this means you're able to get started on the problem of approximating good moves, now that you have an effectively specified definition of maximally good moves, even though you can't evaluate the latter definition using available computing power.
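To make the "effective specification" point concrete, here is a minimal sketch on a game far smaller than chess. The subtraction game, the move set, and the function names are all assumptions chosen for illustration. The structure is what matters: a computable "winning" predicate, plus the observation that the opponent runs the symmetric but opposed search.

```python
from functools import lru_cache

# Toy stand-in for chess (an assumption for illustration): players alternate
# removing 1, 2, or 3 stones from a pile; whoever takes the last stone wins.
MOVES = (1, 2, 3)


@lru_cache(maxsize=None)
def player_to_move_wins(stones: int) -> bool:
    """Computable predicate: can the player to move force a win?"""
    if stones == 0:
        # The previous player took the last stone, so the mover has lost.
        return False
    # A position is winning if some legal move leaves the opponent losing.
    return any(
        not player_to_move_wins(stones - m) for m in MOVES if m <= stones
    )


def good_move(stones: int):
    """Return a move that leaves the opponent losing, or None if none exists."""
    for m in MOVES:
        if m <= stones and not player_to_move_wins(stones - m):
            return m
    return None
```

The same definition applies to chess in principle. What changes is that the tree is far too large to search exhaustively, which is where approximation comes in.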
A lot of the motivation in CEV is that we're trying to describe a beneficial AI in terms that allow beneficial-ness to actually be computed or approximated. The AI observes a human and builds up an abstract predictive model of how that human makes decisions - this is an in-principle straightforward problem the way that playing perfect chess is straightforward; Solomonoff induction ideally says how to build good predictive models. What should the AI do with this predictive model, though? An accurate model will accurately predict that the human will choose to drink the glass of bleach, but in an intuitive sense, it seems like we'd want the AI to give the human water.
But suppose we can idealize this decision model in a way which separates terminal values from empirical beliefs. Then we can substitute the AI's world-model for the human's world-model and re-run the decision model. If the AI is much more intelligent than us, this takes care of the bleach-vs.-water case, since the AI knows that the glass contains bleach and that the human values water.
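A minimal sketch of that substitution, assuming the decision model really does factor cleanly into terminal values and empirical beliefs (which a real human model would not do so neatly). The option names, outcome names, and numbers are hypothetical.

```python
def choose(values, world_model, options):
    """Pick the option whose predicted outcome scores highest under `values`.

    values:      maps an outcome to how much the agent terminally cares about it
    world_model: maps an option to the outcome expected from taking it
    """
    return max(options, key=lambda option: values[world_model[option]])


# Terminal values stay fixed across both runs (hypothetical numbers).
values = {
    "drink water now": 1.0,
    "drink water after a delay": 0.9,
    "drink bleach": -100.0,
}
options = ["drink from glass", "fetch water"]

# The human wrongly believes the glass holds water; the AI knows it is bleach.
human_beliefs = {
    "drink from glass": "drink water now",
    "fetch water": "drink water after a delay",
}
ai_beliefs = {
    "drink from glass": "drink bleach",
    "fetch water": "drink water after a delay",
}

predicted_choice = choose(values, human_beliefs, options)  # what the human would do
construed_choice = choose(values, ai_beliefs, options)     # same values, better beliefs
```

Only the world-model argument differs between the two calls. That is the whole trick, and also why the hard part is getting a human decision model into this factored form in the first place.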
This is the basic paradigm of CEV - build up predictively accurate abstract models of a human decision process, then manipulate them in effectively specified ways to 'construe a volition'. (I would ordinarily say 'extrapolate', but the paper above gave a specific definition of 'extrapolate' that sounds more like surgery followed by prediction than a general, 'look over this accurate human decision model and do X with it').
The appeal of Rawls's reflective equilibrium / Ideal Advisor models is that they describe a construal procedure that sounds effectively computable and approximable: add more veridical knowledge to the decision process (the AI's knowledge, in the case where the AI is smarter than us), run the decision process for a longer time, and allow the model more veridical knowledge of itself and possibly even some set of choices for modifying itself. Similarly, the appeal of Bostrom's parliament is not so much that it sounds like a plausible ultimate metaethical theory but that it gives us an effective-sounding procedure for resolving multiple possible volitions (even within a single person) into a coherent output.
More generally, CEV is a case of what Bostrom termed an 'indirect normativity' strategy. If we think values are complex - see e.g. William Frankena's list of terminal values not obviously reducible to one another - a robust strategy would involve trying to teach the AI how to look at humans and absorb and idealize values from them, so as to avoid the problem of accidentally leaving out one value.
The motivation for indirect normativity - for delving into metaethics rather than giving a superintelligent AI a laundry list of cool-sounding wishes - is that we want to pick something close enough to a correct core metaethical structure that it will compactly cover everything human beings want, ought to want, or might later regret asking for, without relying on the ability of human programmers to visualize the outcome in advance. ("I wish you'd get me that glass!" cough cough dies)
Most of the empirical challenge in CEV would stem from the fact that a predictively accurate model of human decisions would be a highly messy structure, and 'construing a volition' suitable for coherent advice isn't a trivial problem. (It sounds to me on a first reading like neither 'idealization' nor 'extrapolation' as defined in the above document may be sufficient for this. Any rational agent needs a coherent utility function, but getting this out of a messy accurate predictive human model is not as simple as conducting a point surgery and extrapolating forward in time, nor as simple as supposing infinite knowledge.)
To compete with CEV in its intended ecological niche (useful advice to (designers of) sufficiently advanced AIs) means looking for alternate theories of how to produce reliable epistemic advice about what-to-do in the presence of messy human values, with sufficient indirection to automatically cover imaginable use-cases of things we didn't think to ask for or might later regret, which theories are close enough to being effectively specified that AI programmers can implement them (though perhaps as something requiring development work to imbue in an AI, rather than a direct computer program).
A lot of what you've said sounds like you're just reiterating what Luke says quite clearly near the beginning: Ideal Advisor theories are "metaphysical", and CEV is epistemic, i.e. Ideal Advisor theories are usually trying to give an account of what is good, whereas, as you say, CEV is just about trying to find a good effective approximation to the good. In that sense, this article is comparing apples to oranges. But the point is that some criticisms may carry over.
[EDIT: this comment is pretty off the mark, given that I appear to be unable to read the first sentence of comments I'm replying to. "historical context" facepalm]
It looks to me like Sobel's fourth objection may stem in behavioral-economics-style terms from prospect theory's position-relative evaluations of gains and losses, in which losses are more painful than corresponding gains are pleasurable (typically by an empirical factor of around 2 to 2.5).
These position-relative evaluations are already inconsistent, i.e., they can be reliably manipulated in laboratory settings to yield circular preferences. So construing a volition probably already requires (just to end up with a consistent utility function and coherent instrumental strategies) that we transform the position-relative evaluations into outcome evaluations somehow.
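A small sketch of the inconsistency, under loud assumptions: a piecewise-linear value function, a loss-aversion factor of 2.0, and two hypothetical items. It shows a preference reversal driven by the reference point, which is a simpler relative of the circular preferences mentioned above rather than a full cycle.

```python
LOSS_AVERSION = 2.0  # assumed; the text cites an empirical factor of about 2 to 2.5

WORTH = {"mug": 5.0, "pen": 6.0}  # hypothetical item values


def positional_value(change: float) -> float:
    """Value of a change measured from the current position (piecewise linear)."""
    return change if change >= 0 else LOSS_AVERSION * change


def accepts_swap(holding: str, offered: str) -> bool:
    """Gaining `offered` is coded as a gain, giving up `holding` as a loss."""
    net = positional_value(WORTH[offered]) + positional_value(-WORTH[holding])
    return net > 0


def outcome_prefers(a: str, b: str) -> bool:
    """Outcome evaluation: compare end states, ignoring the starting point."""
    return WORTH[a] > WORTH[b]
```

Under the positional evaluation the agent refuses to trade the mug for the pen and also refuses to trade the pen for the mug, so which item it "prefers" depends on which one it happens to hold. The outcome evaluation gives one answer (the pen) from either starting point.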
The 'Ideal Advisor' part would come in at the point where we handed this 'construed' volition a veridical copy of the original predictive model of the human. Thus, this new value system could still reliably predict the actual experiences and reactions of the original human, rather than falsely supposing that the actual human would react in the same way as the construed volition would.
So the construed volition would itself have some coherent utility function over experiences the original human could have - it would not see the human's current state as a huge loss relative to its own position, because it would no longer be evaluating gains and losses. It would also correctly be able to evaluate that the original human would experience various life-improvements as large, joyful gains.
So Sobel's fourth objection would probably not arise if the process of construing a volition proceeded in that particular fashion, which in turn is not ad-hoc since positional evaluation was already a large source of inconsistency that would have to be transformed into a coherent utility function somehow, and likewise giving the idealized process veridical knowledge of the original human is a basic paradigm of volition (the whole Ideal Advisor setup).
Sobel's third objection and second objection seem to revolve around how a construed volition operates over its (abstract) model of possible life experiences that could occur to the original human. (This model had better be abstract! We don't want to inadvertently create people by simulating them in full detail during the process of deciding whether or not to create them.) Suppose we have a construed volition with a coherent utility function, looking over a set of lives that the original human might experience. The amnesia problem is already dissipated if we can pull off this setup; the construed volition does not forget anything. The second problem - the supposed impossibility of choosing between two lives correctly without actually having led both, with the prospect of leading both introducing an ordering effect - gets us into much thornier territory. Let's first note that it's not obvious that the correct judgment is the one you'd make if you'd actually led a certain life, e.g., heroin!addict!Eliezer thinks that heroin is an absolutely great idea, but I don't want my volition to be construed such that its knowledge that this overpowering psychological motivation would counterfactually result from heroin addiction would actually constitute a reason to feed me heroin. I think this points in the direction of an Ideal Advisor ethics wherein construing a volition looks more like modeling how my current values judge future experiences, including my current values over having new desires being fulfilled, more than it points toward construing my volition to have direct empathy with future selves, i.e., translation of their own psychological impulses into volitional impetuses of equal strength. This doesn't so much deal with Sobel's second objection as pack it into the problem of construing a volition that shows an analogue of my care for my own (and others') future selves without experiencing 'direct empathy' or direct translation of forceful desires.
We're also dancing around the difficulty of having a construed volition which has values over predicted conscious experiences without that volition itself being a bearer of conscious experiences, mostly because I still don't have any good idea of how to solve that one. Resolving consciousness to be less mysterious hasn't yet helped me much on figuring out how to accurately model things getting wet without modeling any water.
Sobel's first problem was a to-do in CEV since day one (the original essay proposed evaluating a spread of possibilities) and I'm willing to point to Bostrom's parliament as the best model yet offered. There's no such thing as "too many voices", just the number of voices you can manage to model on available hardware.
I think Sobel's fourth objection is confused about what an idealized/extrapolated agent actually would want. If it had the potential for such perfect experience that the human condition looks worse than death in comparison, then the obvious advice is not suicide, but rather to uplift the ordinary human to its own level. This should always be possible, since we must already have achieved this to create the extrapolated agent that makes the decision, so we can just repeat the process at full resolution on the original human.
Great post! It's really nice to see some engagement with modern philosophy :)
I do wonder slightly how useful this particular topic is, though. CEV and Ideal Advisor theories are about quite different things. Furthermore, since Ideal Advisor theories are working very much with ideals, the "advisors" they consider are usually supposed to be very much like actual humans. CEV, on the other hand, is precisely supposed to be an effective approximation, and so it would seem surprising if it were to actually proceed by modelling a large number of instances of a person and then enhancing them cognitively. So if instead it proceeds by some more approximate (or alternatively, less brute-force) method, then it's not clear that we should be able to apply our usual reasoning about human beings to the "values advisor" that you'd get out of the end of CEV. That seems to undermine Sobel's arguments as applied to CEV.
What about moral objections to the creation of a multitude of agents for the purposes of evaluation?
Relevant excerpt from Iain M. Banks's new Culture novel, The Hydrogen Sonata:
The Simming Problem – in the circumstances, it was usually a bad sign when something was so singular and/or notorious it deserved to be capitalised – was of a moral nature, as the really meaty, chewy, most intractable problems generally were.
The Simming Problem boiled down to, How true to life was it morally justified to be?
Simulating the course of future events in a virtual environment to see what might happen back in reality, and tweaking one’s own actions accordingly in different runs of the simulated problem to see what difference these would make and to determine whether it was possible to refine those actions such that a desired outcome might be engineered, was hardly new; in a sense it long pre-dated AIs, computational matrices, substrates, computers and even the sort of mechanical or hydrological arrangements of ball-bearings, weights and springs or water, tubes and valves that enthusiastic optimists had once imagined might somehow model, say, an economy.
In a sense, indeed, such simulations first took place in the minds of only proto-sentient creatures, in the deep pre-historic age of any given species. If you weren’t being too strict about your definitions you could claim that the first simulations happened in the heads – or other appropriate body- or being-parts – of animals, or the equivalent, probably shortly after they developed a theory of mind and started to think about how to manipulate their peers to ensure access to food, shelter, mating opportunities or greater social standing.
Thoughts like, If I do this, then she does that … No; if I do that, making him do this … in creatures still mystified by fire, or unable to account for the existence of air, or ice, above their watery environment – or whatever – were arguably the start of the first simulations, no matter how dim, limited or blinded by ignorance and prejudice the whole process might be. They were, also, plausibly, the start of a line that led directly through discussions amongst village elders, through collegiate essays, flow charts, war games and the first computer programs to the sort of ultra-detailed simulations that could be shown – objectively, statistically, scientifically – to work.
Long before most species made it to the stars, they would be entirely used to the idea that you never made any significant societal decision with large-scale or long-term consequences without running simulations of the future course of events, just to make sure you were doing the right thing. Simming problems at that stage were usually constrained by not having the calculational power to run a sufficiently detailed analysis, or disagreements regarding what the initial conditions ought to be.
Later, usually round about the time when your society had developed the sort of processal tech you could call Artificial Intelligence without blushing, the true nature of the Simming Problem started to appear.
Once you could reliably model whole populations within your simulated environment, at the level of detail and complexity that meant individuals within that simulation had some sort of independent existence, the question became: how god-like, and how cruel, did you want to be?
Most problems, even seemingly really tricky ones, could be handled by simulations which happily modelled slippery concepts like public opinion or the likely reactions of alien societies by the appropriate use of some especially cunning and devious algorithms; whole populations of slightly different simulative processes could be bred, evolved and set to compete against each other to come up with the most reliable example employing the most decisive short-cuts to accurately modelling, say, how a group of people would behave; nothing more processor-hungry than the right set of equations would – once you’d plugged the relevant data in – produce a reliable estimate of how that group of people would react to a given stimulus, whether the group represented a tiny ruling clique of the most powerful, or an entire civilisation.
But not always. Sometimes, if you were going to have any hope of getting useful answers, there really was no alternative to modelling the individuals themselves, at the sort of scale and level of complexity that meant they each had to exhibit some kind of discrete personality, and that was where the Problem kicked in.
Once you’d created your population of realistically reacting and – in a necessary sense – cogitating individuals, you had – also in a sense – created life. The particular parts of whatever computational substrate you’d devoted to the problem now held beings; virtual beings capable of reacting so much like the back-in-reality beings they were modelling – because how else were they to do so convincingly without also hoping, suffering, rejoicing, caring, loving and dreaming? – that by most people’s estimation they had just as much right to be treated as fully recognised moral agents as did the originals in the Real, or you yourself.
If the prototypes had rights, so did the faithful copies, and by far the most fundamental right that any creature ever possessed or cared to claim was the right to life itself, on the not unreasonable grounds that without that initial right, all others were meaningless.
By this reasoning, then, you couldn’t just turn off your virtual environment and the living, thinking creatures it contained at the completion of a run or when a simulation had reached the end of its useful life; that amounted to genocide, and however much it might feel like serious promotion from one’s earlier primitive state to realise that you had, in effect, become the kind of cruel and pettily vengeful god you had once, in your ignorance, feared, it was still hardly the sort of mature attitude or behaviour to be expected of a truly civilised society, or anything to be proud of.
Some civs, admittedly, simply weren’t having any of this, and routinely bred whole worlds, even whole galaxies, full of living beings which they blithely consigned to oblivion the instant they were done with them, sometimes, it seemed, just for the glorious fun of it, and to annoy their more ethically angst-tangled co-civilisationalists, but they – or at least those who admitted to the practice, rather than doing it but keeping quiet about it – were in a tiny minority, as well as being not entirely welcome at all the highest tables of the galactic community, which was usually precisely where the most ambitious and ruthless species/civs most desired to be.
Others reckoned that as long as the termination was instant, with no warning and therefore no chance that those about to be switched off could suffer, then it didn’t really matter. The wretches hadn’t existed, they’d been brought into existence for a specific, contributory purpose, and now they were nothing again; so what?
Most people, though, were uncomfortable with such moral brusqueness, and took their responsibilities in the matter more seriously. They either avoided creating virtual populations of genuinely living beings in the first place, or only used sims at that sophistication and level of detail on a sustainable basis, knowing from the start that they would be leaving them running indefinitely, with no intention of turning the environment and its inhabitants off at any point.
Whether these simulated beings were really really alive, and how justified it was to create entire populations of virtual creatures just for your own convenience under any circumstances, and whether or not – if/once you had done so – you were sort of duty-bound to be honest with your creations at some point and straight out tell them that they weren’t really real, and existed at the whim of another order of beings altogether – one with its metaphorical finger hovering over an Off switch capable of utterly and instantly obliterating their entire universe … well, these were all matters which by general and even relieved consent were best left to philosophers.
It seems to me that prohibitions on mistreating sims might be the only example of a reasonable moral stricture with no apparent up-side: it's just avoiding a down-side.
Decent treatment of sentients at your own reality level increases opportunities for cooperation and avoids cycles of revenge, neither of which apply to sims.... unless you also have an obligation to let them join your society.
Not necessarily. Think for example of the controversy of linking FPS (the shooter variety, not the one with moving pictures per second) games and real life violence. Now, I'm not advocating such a link here at all, but it is conceivable that how you treat sims carries over to how you treat sentients at your own reality level to some extent, no matter how minor. Yielding a potential up-side.
At least in theory, this could be tested. We have the real world example of people who torture sims (something which seems more psychologically indicative to me than first-person shooter games). It might be possible to find out whether they're different from people who play Sim City but don't torture sims, and also whether torturing sims for the fun of it changes people.
Yes, although it would be really, really strange if there were no effect whatsoever, or indeed if there were any activity, period, that you could engage in long term without it shaping your brain in some way. This is anthropomorphizing, of course; who knows what will or won't affect far-future individuals. Still, we could measure the effect size for current humans and define a threshold above which we'd call the effect non-negligible.
I haven't read THS yet, but I'm surprised that even a civilization written by Banks didn't think that the correct response to finding oneself as a "vengeful god" is to create an afterlife.
They explicitly don't address that:
Second, it might seem that this approach to determining Personal CEV will require a reasonable level of accuracy in simulation. If so, there might be concerns about the creation of, and responsibility to, potential moral agents.
On Sobel, see previously my http://lesswrong.com/lw/9oa/against_utilitarianism_sobels_attack_on_judging/
This seems like a lot of words to address what appears to be some very flawed arguments. It seems to me that the more fundamental problem with Sobel's arguments is that it all relies on anthropomorphizing the ideal advisor (e.g. saying certain experiences might drive it mad, etc.).
(I expect some of the longer comments have already made this point, but I thought I should make it more succinctly.)
Given that a parliament of humans (where they vote on values) is not accepted as a (final) solution to the interpersonal value / well-being comparison problem, why would a parliament be acceptable for intrapersonal comparisons?
Sobel’s final objection is that the idealized agent, having experienced such a level of perfection, might come to the conclusion that their non-deal counterpart is so limited as to be better off dead.
'Non-Ideal', I think.
Ever since I worked, in the course of my PhD, with the Gödel metric, a solution of the equations of GR which contains closed timelike curves, I've been noticing how strange loops mess up arguments, calculations and intuition whenever they creep in. My approach has been to search and unwind them before proceeding any further. That's one way to resolve the grandfather paradox, for example.
The issue you are discussing is rife with loops. Notice them. Unwind them. Restate the problem without them. This is not always an easy task, some loops can be pretty insidious. Here is an example from your post:
an ideal version of that agent (fully informed, perfectly rational, etc.) would advise the non-ideal version
One of the ways of removing a potential loop is already suggested in your post:
"our volition be extrapolated once and acted on."
"Once" is what breaks the loop.
Now, to list several loops in Sobel's arguments. Some of these are not obvious, but they are there nonetheless, if you look carefully.
two of the idealized viewpoints disagree about what is to be preferred
experiencing one life can leave you incapable of experiencing another in an unbiased way.
the idealized agent, having experienced such a level of perfection, might come to the conclusion that their non-ideal counterpart is so limited as to be better off dead.
Some of these versions are then assigned as a parliament where they vote on various choices and make trades with one another.
Meditation. Find the loops in each of the above quotes and consider how they can be avoided.
This comment reads to me like: "Haha, I think there are problems with your argument, but I'm not going to tell you what they are, I'm just going to hint obliquely in a way that makes me look clever."
If you actually do have issues with Sobel's arguments, do you think you could actually say what they are?
Sorry if this came across as a status game. Let me give you one example.
experiencing one life can leave you incapable of experiencing another in an unbiased way.
This is a loop Sobel solves with the amnesia model. (A concurrent clone model would be a better description, to avoid any problems with influences between lives, such as physical changes). There is still however the issue of giving advice to your past self after removing amnesia, even though you "might be incapable of adequately evaluating the lives they’ve experienced based on their current, more knowledgeable, evaluative perspective." This loses the sight of the original purpose: the evaluating criteria should be acceptable to the original person, and no such criteria have been set in advance. Same with the parliament: the evaluation depends on the future experiences, feeding into the loop. To remedy the issue, you can decide to create and freeze the arbitration rules in advance. For example, you might choose as your utility function some weighted average of longevity, happiness, procreation, influence on the world around you, etc. Then score the utility of each simulated life, and then pick one of, say, top 10 as your "initial dynamic". Or the top life you find acceptable. (Not restricting to automatically picking the highest-utility one, in order to avoid the "literal genie" pitfall.) You can repeat as you see fit as you go on, adjusting the criteria (hence "dynamic").
While you are by no means guaranteed to end up with the "best life possible" after breaking the reasoning loop, you at least are spared problems like "better off dead" and "insane parliament", both of which result from a preference feedback loop.
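For concreteness, the frozen-criteria procedure described above might look like the sketch below. The weights, the 0-to-1 scoring of each criterion, and the `acceptable` callback (standing in for the original person's veto) are all hypothetical; the only point is that the criteria are fixed before any simulated life is scored.

```python
# Arbitration rules, frozen in advance (hypothetical weights summing to 1).
WEIGHTS = {
    "longevity": 0.3,
    "happiness": 0.4,
    "procreation": 0.1,
    "influence": 0.2,
}


def score(life: dict) -> float:
    """Weighted average of the pre-agreed criteria for one simulated life."""
    return sum(weight * life[name] for name, weight in WEIGHTS.items())


def shortlist(lives: list, k: int = 10) -> list:
    """Return the k highest-scoring lives, best first."""
    return sorted(lives, key=score, reverse=True)[:k]


def pick(lives: list, acceptable, k: int = 10):
    """Return the best shortlisted life the original person accepts, or None.

    The veto is there so the top-scoring life is not chosen automatically
    (the "literal genie" worry).
    """
    for life in shortlist(lives, k):
        if acceptable(life):
            return life
    return None
```

Adjusting `WEIGHTS` and re-running is the "dynamic" part. Nothing in the scoring step depends on what any simulated life would itself prefer, which is what breaks the feedback loop.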
Ooookay. The whole "loop" thing feels like a leaky abstraction to me. If you had to do that much work to explain the loopiness (which I'm still not sold on) and why it's a problem, perhaps saying it's "loopy" isn't adding much.
This loses the sight of the original purpose: the evaluating criteria should be acceptable to the original person
I think I may still be misunderstanding you, but this seems wrong. The whole point is that even if you're on some kind of weird drugs that make you think that drinking bleach would be great, the idealised version of you would not be under such an influence, etc. Hence it might well be that the idealised advisors evaluate things in ways that you would find unacceptable. That's WAD.
Also, I find your other proposal hard to follow: surely if you've got a well-defined utility function already, then none of this is necessary?
I wasn't trying to solve the whole CEV and FAI issue in 5 min, was only giving an example of how breaking a feedback loop avoids some of the complications.
CEV theoretically avoids many problems with other approaches to machine ethics (Yudkowsky 2004; Tarleton 2010; Muehlhauser & Helm 2012). However, there are reasons it may not succeed. In this post, we examine one such reason: Resolving CEV at the level of humanity (Global CEV) might require at least partially resolving CEV at the level of individuals (Personal CEV)2, but Personal CEV is similar to ideal advisor theories of value,3 and such theories face well-explored difficulties. As such, these difficulties may undermine the possibility of determining the Global CEV of humanity.
I know the focus of this post is on personal rather than global CEV. But since this choice of focus is ultimately motivated by a concern with global CEV, I think it is relevant to ask a question that was never answered to my satisfaction in this forum. The question is: what's so special about our species? In particular, what makes homo sapiens a more relevant moral category than, say, caucasians, on the one hand, and mammals, on the other?
One concern I have with the advisor idea (which probably doesn't apply to Eliezer's reinterpretation, if I understand that correctly, which I might not) is that it's not clear that extrapolated advisors in parliament would actually act in the interests of the original agent. For example, they might be selfish and choose something like prolonging their existence by debating as long as possible. Or each might trivially argue for the life that would lead the agent to resemble them as closely as possible on the theory that that would give their existence more measure (which probably wouldn't be too bad if the extrapolations are well-chosen, but likely not the best outcome). Or they might decide that this agent isn't really them in the first place, so they should just make the agent's life as amusing as possible.
A more general statement of the problem would be that there's no guarantee that the extrapolation of the agent would optimize something beneficial to the original agent, and in fact most of the work of coming up with good advice (or good outcomes as the case may be) is probably being done by the extrapolation/idealization process if it is being done at all.
Ideal Advisor theorists attempt to define what it is for something to be of value for an agent. Because of this, their accounts need to give an unambiguous and plausible answer in all cases.
Unambiguous? No, only an epistemic procedure of a certain sort needs to give an unambiguous answer. A metaphysical theory should give an answer that is no more precise than its subject matter. If we had a metaphysical theory of what it is to be "a heap of sand" that gave a perfectly precise yes/no answer in every case, that would be grounds for deep suspicion. A Theory of Heaps should be vague.
With theories of personal value, it's not clear that precision is a problem. I wouldn't be shocked if it turned out that "what's most beneficial to me" had a precise meaning and ranked every possible outcome unambiguously. But I would be a little surprised. (I know that Von Neumann–Morgenstern utility has been vigorously sold around here, but I haven't bought it, at least not yet.)
Weird, I'd always heard this referred to as the 'Ideal Observer' theory, and it seems there's way, way more material for that 'Ideal Advisor'....
EDIT: Just saw the footnote, my bad.
Update 5-24-2013: A cleaned-up, citable version of this article is now available on MIRI's website.
Co-authored with crazy88
Summary: Yudkowsky's "coherent extrapolated volition" (CEV) concept shares much in common with Ideal Advisor theories in moral philosophy. Does CEV fall prey to the same objections which are raised against Ideal Advisor theories? Because CEV is an epistemic rather than a metaphysical proposal, it seems that at least one family of CEV approaches (inspired by Bostrom's parliamentary model) may escape the objections raised against Ideal Advisor theories. This is not a particularly ambitious post; it mostly aims to place CEV in the context of mainstream moral philosophy.
What is of value to an agent? Maybe it's just whatever they desire. Unfortunately, our desires are often the product of ignorance or confusion. I may desire to drink from the glass on the table because I think it is water when really it is bleach. So perhaps something is of value to an agent if they would desire that thing if fully informed. But here we crash into a different problem. It might be of value for an agent who wants to go to a movie to look up the session times, but the fully informed version of the agent will not desire to do so, since they already know all the session times. The agent and its fully informed counterpart have different needs. Thus, several philosophers have suggested that something is of value to an agent if an ideal version of that agent (fully informed, perfectly rational, etc.) would advise the non-ideal version of the agent to pursue that thing.
This idea of idealizing or extrapolating an agent's preferences[1] goes back at least as far as Sidgwick (1874), who considered the idea that "a man's future good" consists in "what he would now desire... if all the consequences of all the different [actions] open to him were accurately foreseen..." Similarly, Rawls (1971) suggested that a person's good is the plan "that would be decided upon as the outcome of careful reflection in which the agent reviewed, in the light of all the relevant facts, what it would be like to carry out these plans..." More recently, in an article about rational agents and moral theory, Harsanyi (1982) defined an agent's rational wants as "the preferences he would have if he had all the relevant factual information, always reasoned with the greatest possible care, and were in a state of mind most conducive to rational choice." Then, a few years later, Railton (1986) identified a person's good with "what he would want himself to want... were he to contemplate his present situation from a standpoint fully and vividly informed about himself and his circumstances, and entirely free of cognitive error or lapses of instrumental rationality."
Rosati (1995) calls these theories Ideal Advisor theories of value because they identify one's personal value with what an ideal version of oneself would advise the non-ideal self to value.
Looking not for a metaphysical account of value but for a practical solution to machine ethics (Wallach & Allen 2009; Muehlhauser & Helm 2012), Yudkowsky (2004) described a similar concept which he calls "coherent extrapolated volition" (CEV):
In other words, the CEV of humankind is about the preferences that we would have as a species if our preferences were extrapolated in certain ways. Armed with this concept, Yudkowsky then suggests that we implement CEV as an "initial dynamic" for "Friendly AI." Tarleton (2010) explains that the intent of CEV is that "our volition be extrapolated once and acted on. In particular, the initial extrapolation could generate an object-level goal system we would be willing to endow a superintelligent [machine] with."
CEV theoretically avoids many problems with other approaches to machine ethics (Yudkowsky 2004; Tarleton 2010; Muehlhauser & Helm 2012). However, there are reasons it may not succeed. In this post, we examine one such reason: Resolving CEV at the level of humanity (Global CEV) might require at least partially resolving CEV at the level of individuals (Personal CEV)[2], but Personal CEV is similar to Ideal Advisor theories of value,[3] and such theories face well-explored difficulties. As such, these difficulties may undermine the possibility of determining the Global CEV of humanity.
Before doing so, however, it's worth noting one key difference between Ideal Advisor theories of value and Personal CEV. Ideal Advisor theories typically are linguistic or metaphysical theories, while the role of Personal CEV is epistemic. Ideal Advisor theorists attempt to define what it is for something to be of value for an agent. Because of this, their accounts need to give an unambiguous and plausible answer in all cases. On the other hand, Personal CEV's role is an epistemic one: it isn't intended to define what is of value for an agent. Rather, Personal CEV is offered as a technique that can help an AI to come to know, to some reasonable but not necessarily perfect level of accuracy, what is of value for the agent. To put it more precisely, Personal CEV is intended to allow an initial AI to determine what sort of superintelligence to create such that we end up with what Yudkowsky calls a "Nice Place to Live." Given this, certain arguments are likely to threaten Ideal Advisor theories and not Personal CEV, and vice versa.
With this point in mind, we now consider some objections to ideal advisor theories of value, and examine whether they threaten Personal CEV.
Sobel's First Objection: Too many voices
Four prominent objections to ideal advisor theories are due to Sobel (1994). The first of these, the “too many voices” objection, notes that the evaluative perspective of an agent changes over time and, as such, the views that would be held by the perfectly rational and fully informed version of the agent will also change. This implies that each agent will be associated not with one idealized version of themselves but with a set of such idealized versions (one at time t, one at time t+1, etc.), some of which may offer conflicting advice. Given this “discordant chorus,” it is unclear how the agent’s non-moral good should be determined.
Various responses to this objection run into their own challenges. First, privileging a single perspective (say, the idealized agent at time t+387) seems ad hoc. Second, attempting to aggregate the views of multiple perspectives runs into the question of how trade-offs should be made. That is, if two of the idealized viewpoints disagree about what is to be preferred, it’s unclear how an overall judgment should be reached.[4] Finally, the suggestion that the idealized versions of the agent at different times will all share the same perspective seems implausible, and at the very least it is a substantive claim requiring a substantive defense. So the obvious responses to Sobel’s first objection introduce serious new challenges which then need to be resolved.
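To make the aggregation worry concrete, here is a minimal sketch. It assumes (purely for illustration) that each idealized time-slice assigns numerical utilities to two options and that we aggregate by summing. The option names, the numbers, and the helper functions `rescale` and `aggregate_by_sum` are all hypothetical. The point is only that multiplying one time-slice's numbers by a positive constant leaves that time-slice's own preference ordering unchanged, yet can flip the aggregate verdict.

```python
def rescale(utilities, factor):
    """Multiply every utility by a positive factor.

    This leaves the perspective's own preference ordering untouched.
    """
    return {option: factor * value for option, value in utilities.items()}


def aggregate_by_sum(*perspectives):
    """Return the option with the highest total utility across perspectives."""
    options = perspectives[0].keys()
    totals = {o: sum(p[o] for p in perspectives) for o in options}
    return max(totals, key=totals.get)


# Illustrative numbers only (an assumption, not data from any source).
self_at_t = {"travel": 1.0, "career": 0.0}
self_at_t_plus_1 = {"travel": 0.0, "career": 0.6}

# With the numbers as given, "travel" has the larger total.
print(aggregate_by_sum(self_at_t, self_at_t_plus_1))

# Doubling the later self's numbers makes "career" the larger total,
# even though neither time-slice's ordering has changed.
print(aggregate_by_sum(self_at_t, rescale(self_at_t_plus_1, 2)))
```

Nothing in either time-slice's preferences tells us which scaling is the right one, which is one way of seeing why the trade-off question needs an independent answer.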
One final point is worth noting: it seems that this objection is equally problematic for Personal CEV. The extrapolated volition of the agent is likely to vary at different times, so how ought we determine an overall account of the agent’s extrapolated volition?
Sobel’s Second and Third Objections: Amnesia
Sobel’s second and third objections build on two other claims (see Sobel 1994 for a defense of these). First: some lives can only be evaluated if they are experienced. Second: experiencing one life can leave you incapable of experiencing another in an unbiased way. Given these claims, Sobel presents an amnesia model as the most plausible way for an idealized agent to gain the experiences necessary to evaluate all the relevant lives. According to this model, an agent experiences each life sequentially but undergoes an amnesia procedure after each one so that they may experience the next life uncolored by their previous experiences. After experiencing all lives, the amnesia is then removed.
Following on from this, Sobel’s second objection is that suddenly recollecting a life from one evaluative perspective and living that life from a vastly different evaluative perspective may be strongly dissimilar experiences. When the amnesia is removed, the agent has a particular evaluative perspective, informed by their memories of all the lives they’ve lived. That perspective may differ so much from the one they had while living each life, independently of such memories, that they are incapable of adequately evaluating the lives they’ve experienced from their current, more knowledgeable standpoint.
Sobel’s third objection also relates to the amnesia model: Sobel argues that the idealized agent might be driven insane by the entire amnesia process and hence might not be able to adequately evaluate what advice they ought to give the non-ideal agent. In response to this, there is some temptation to simply demand that the agent be idealized not just in terms of rationality and knowledge but also in terms of their sanity. However, perhaps any idealized agent that is similar enough to the original to serve as a standard for their non-moral good will be driven insane by the amnesia process and so the demand for a sane agent will simply mean that no adequate agent can be identified.
If we grant that an agent needs to experience some lives to evaluate them, and we grant that experiencing some lives leaves them incapable of experiencing others, then there seems to be a strong drive for Personal CEV to rely on an amnesia model to adequately determine what an agent’s volition would be if extrapolated. If so, however, then Personal CEV seems to face the challenges raised by Sobel.
Sobel’s Fourth Objection: Better Off Dead
Sobel’s final objection is that the idealized agent, having experienced such a level of perfection, might come to the conclusion that their non-ideal counterpart is so limited as to be better off dead. Further, the ideal agent might make this judgment because of the relative level of well-being of the non-ideal agent rather than the agent’s absolute level of well-being. (That is, the ideal agent may look upon the well-being of the non-ideal agent as we might look upon our own well-being after an accident that caused us severe mental damage. In such a case, we might be unable to objectively judge our life after the accident due to the relative difficulty of this life as compared with our life before the accident.) As such, this judgment may not capture what is actually in accordance with the agent’s non-moral good.
Again, this criticism seems to apply equally to Personal CEV: when the volition of an agent is extrapolated, it may turn out that this volition endorses killing the non-extrapolated version of the agent. If so, this seems to be a mark against the possibility that Personal CEV can play a useful part in a process that should eventually terminate in a "Nice Place to Live."
A model of Personal CEV
The seriousness of these challenges for Personal CEV is likely to vary depending on the exact nature of the extrapolation process. To give a sense of the impact, we will consider one family of methods for carrying out this process: the parliamentary model (inspired by Bostrom 2009). According to this model, we determine the Personal CEV of an agent by simulating multiple versions of them, extrapolated from various starting times and along different developmental paths. Some of these versions are then assigned to a parliament, where they vote on various choices and make trades with one another.
Clearly this approach allows our account of Personal CEV to avoid the too many voices objection. After all, the parliamentary model provides us with an account of how we can aggregate the views of the agent at various times: we should simulate the various agents and allow them to vote and trade on the choices to be made. It is through this voting and trading that the various voices can be combined into a single viewpoint. While this process may not be adequate as a metaphysical account of value, it seems more plausible as an account of Personal CEV as an epistemic notion. Certainly, your authors would deem themselves to be more informed about what they value if they knew the outcome of the parliamentary model for themselves.
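As a rough illustration of how several voices might be combined into one, the following sketch has a handful of extrapolated versions of a single agent rank some options and then aggregates the rankings with a Borda count. Everything here is an assumption made for the sake of the example: the post does not specify a voting rule, the delegate labels and options are made up, and trading between delegates is not modeled at all.

```python
from collections import defaultdict


def borda_winner(rankings):
    """Return (winner, scores) under a Borda count.

    Each ranking lists options from most to least preferred; the option
    in position i of an n-item ranking earns n - 1 - i points.
    Ties go to whichever tied option was scored first.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return max(scores, key=scores.get), dict(scores)


# Hypothetical delegates: extrapolations of one agent from different
# starting times and along different developmental paths.
parliament = {
    "extrapolated_from_age_20": ["music", "research", "business"],
    "extrapolated_from_age_40": ["research", "music", "business"],
    "extrapolated_after_travel": ["business", "research", "music"],
}

winner, scores = borda_winner(parliament.values())
print(winner, scores)
```

Even in this toy case the outcome depends on choices the model itself must justify: which versions are seated, and which voting rule is used. A different rule or a different selection of delegates could produce a different single viewpoint.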
This approach is also able to avoid Sobel’s second and third objections. The objections were specifically targeted at the amnesia model where one agent experienced multiple lives. As the parliamentary model does not utilize amnesia, it is immune to these concerns.
What of Sobel’s fourth objection? Sobel’s concern here is not simply that the idealized agent might advise the agent to kill themselves. After all, sometimes death may, in fact, be of value for an agent. Rather, Sobel’s concern is that the idealized agent, having experienced such heights of existence, will become biased against the limited lives of normal agents.
It's less clear how the parliamentary model deals with Sobel's fourth objection, which plausibly retains its initial force against this model of Personal CEV. However, we're not intending to solve Personal CEV entirely in this short post. Rather, we aim to demonstrate only that the force of Sobel's four objections will depend on the model of Personal CEV selected. Reflection on the parliamentary model makes this point clear.
So the parliamentary model seems able to avoid at least three of the direct criticisms raised by Sobel. It is worth noting, however, that some concerns remain. Firstly, for those that accept Sobel’s claim that experience is necessary to evaluate some lives, it is clear that no member of the parliament will be capable of comparing their life to all other possible lives, as none will have all the required experience. As such, the agents may falsely judge a certain aspect of their life to be more or less valuable than it, in fact, is. For a metaphysical account of personal value, this problem might be fatal. Whether it is also fatal for the parliamentary model of Personal CEV depends on whether the knowledge of the various members of the parliament is enough to produce a “Nice Place to Live” regardless of its imperfection.
Two more issues might arise. First, the model might require careful selection of who to appoint to the parliament. For example, if most of the possible lives that an agent could live would drive them insane, then selecting which of these agents to appoint to the parliament at random might lead to a vote by the mad. Second, it might seem that this approach to determining Personal CEV will require a reasonable level of accuracy in simulation. If so, there might be concerns about the creation of, and responsibility to, potential moral agents.
Given these points, a full evaluation of the parliamentary model will require more detailed specification and further reflection. However, two points are worth noting in conclusion. First, the parliamentary model does seem to avoid at least three of Sobel’s direct criticisms. Second, even if this model eventually ends up being flawed on other grounds, the existence of one model of Personal CEV that can avoid three of Sobel’s objections gives us reason to expect other promising models of Personal CEV may be discovered.
Notes
1 Another clarification to make concerns the difference between idealization and extrapolation. An idealized agent is a version of the agent with certain idealizing characteristics (perhaps logical omniscience and infinite speed of thought). An extrapolated agent is a version of the agent that represents what they would be like if they underwent certain changes or experiences. Note two differences between these concepts. First, an extrapolated agent need not be ideal in any sense (though useful extrapolated agents often will be) and certainly need not be perfectly idealized. Second, extrapolated agents are determined by a specific type of process (extrapolation from the original agent) whereas no such restriction is placed on how the form of an idealized agent is determined. CEV utilizes extrapolation rather than idealization, as do some Ideal Advisor theories. In this post, we talk about "ideal" or "idealized" agents as a catch-all for both idealized agents and extrapolated agents.
2 Standard objections to ideal advisor theories of value are also relevant to some proposed variants of CEV, for example Tarleton (2010)'s suggestion of "Individual Extrapolated Volition followed by Negotiation, where each individual human’s preferences are extrapolated by factual correction and reflection; once that process is fully complete, the extrapolated humans negotiate a combined utility function for the resultant superintelligence..." Furthermore, some objections to Ideal Advisor theories also seem relevant to Global CEV even if they are not relevant to a particular approach to Personal CEV, though that discussion is beyond the scope of this article. As a final clarification, see Dai (2010).
3 Ideal Advisor theories are not to be confused with "Ideal Observer theory" (Firth 1952). For more on Ideal Advisor theories of value, see Zimmerman (2003); Tanyi (2006); Enoch (2005); Miller (2013, ch. 9).
4 This is basically an intrapersonal version of the standard worries about interpersonal comparisons of well-being. The basis of these worries is that even if we can specify an agent’s preferences numerically, it’s unclear how we should compare the numbers assigned by one agent with the numbers assigned by the other. In the intrapersonal case, the challenge is to determine how to compare the numbers assigned by the same agent at different times. See Gibbard (1989).