I've been involved in a recent thread where discussion of coherent extrapolated volition (CEV) came up. The general consensus was that CEV might - or might not - do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round - it was too soon to tell.

Ok, that's an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be useful, as long as we can extract concrete predictions from them. Bad models can be improved; undefined models are just intuition pumps for whatever people already feel about them. I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe - but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.

So I'd like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.

To start it off, here's my (simplistic) suggestion:

Volition

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide if actually faced with a given choice.
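
To make this concrete, here's a minimal Python sketch of the idea; `choice_model` is a purely hypothetical stand-in for the AI's simulation of the person, not anything defined elsewhere:

```python
from itertools import combinations

def revealed_preferences(options, choice_model):
    """Map each ordered pair (a, b) to True iff the simulated person picks a over b."""
    prefers = {}
    for a, b in combinations(options, 2):
        chosen = choice_model(a, b)  # hypothetical: what would the person decide, faced with a vs b?
        prefers[(a, b)] = chosen == a
        prefers[(b, a)] = chosen == b
    return prefers

# Toy "person" who always picks the alphabetically later option.
print(revealed_preferences(["apples", "bananas", "cherries"], lambda a, b: max(a, b)))
```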

Extrapolation

Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the non-transitivity or non-independence (for people who don't know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you've never heard of non-standard reals, lexicographical preferences, refusal to choose and related issues) everyone's preferences are now expressible as utility functions.
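
Continuing the sketch above (and hand-waving just as hard): detect the intransitive cycles, and fall back on a meta-preference query when they appear. `ask_meta_preference` is hypothetical - it stands in for presenting the person with candidate transitive orderings and letting them choose:

```python
from itertools import permutations

def is_transitive(prefers, options):
    """True iff the pairwise relation has no a > b and b > c but not a > c pattern."""
    for a, b, c in permutations(options, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and not prefers.get((a, c)):
            return False
    return True

def extrapolate(prefers, options, ask_meta_preference):
    if is_transitive(prefers, options):
        return prefers
    # In a real model we'd offer orderings close to the revealed data;
    # for the sketch, just offer every total ordering and let the person pick.
    return ask_meta_preference(list(permutations(options)))
```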

Coherence

Normalise each existing person's utility function and add them together to get your CEV. At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.
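
As a toy illustration of that max-min normalisation (names and numbers are mine, purely for the sketch):

```python
def normalise(expected_utility, policies):
    """Rescale one person's expected utilities so their best policy scores 1, worst 0."""
    values = [expected_utility(p) for p in policies]
    lo, hi = min(values), max(values)
    if hi == lo:
        return lambda p: 0.5  # the person is indifferent; any constant will do
    return lambda p: (expected_utility(p) - lo) / (hi - lo)

def aggregate(people, policies):
    """Sum the normalised utilities and pick the policy that maximises the total."""
    normalised = [normalise(u, policies) for u in people]
    return max(policies, key=lambda p: sum(n(p) for n in normalised))

# Two people, three candidate AI policies.
policies = ["A", "B", "C"]
alice = lambda p: {"A": 10, "B": 0, "C": 6}[p]
bob   = lambda p: {"A": 1, "B": 3, "C": 2.5}[p]
print(aggregate([alice, bob], policies))  # -> "C", a compromise policy
```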


At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

Unfortunately, this normalisation scheme isn't game-theoretically stable; if you expect your utility to be analyzed this way, you have an incentive to modify your utility function, clipping or flattening its ends to give it a steeper gradient around the amount of utility you expect to receive.
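
To make the worry concrete, here's a toy example (policies and numbers are mine, not the commenter's): Bob does strictly better by reporting a flattened version of his utility function than by reporting it honestly.

```python
def normalise(u, policies):
    lo, hi = min(u.values()), max(u.values())
    return {p: (u[p] - lo) / (hi - lo) for p in policies}

def winner(reported_utilities, policies):
    totals = {p: sum(normalise(u, policies)[p] for u in reported_utilities)
              for p in policies}
    return max(policies, key=totals.get)

policies  = ["A", "B", "C", "D"]
alice     = {"A": 1.0, "B": 0.8, "C": 0.3,  "D": 0.0}
bob_true  = {"A": 0.0, "B": 0.3, "C": 0.7,  "D": 1.0}
bob_gamed = {"A": 0.0, "B": 0.3, "C": 0.95, "D": 1.0}  # top end flattened

honest = winner([alice, bob_true], policies)   # -> "B"
gamed  = winner([alice, bob_gamed], policies)  # -> "C"
print(bob_true[honest], bob_true[gamed])       # 0.3 vs 0.7: misreporting pays
```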

That seems like it may be true of every scheme that doesn't consider the causal origins of people's utility functions. Does something like Gibbard-Satterthwaite apply?

Not specifically causal origins (there's evolution); instead, I suppose there might be a way of directly negating most effects resulting from such strategic considerations (that is, from decisions that were influenced by their expected effect on the very decision that now wants to negate that effect).

"Easy" way of doing this: see what would have happened if everyone believed that you were using a model where liars don't prosper (ie random dictator), but actually use a Pareto method.

Considering the causal origins of people's utility functions is a nice hack, thanks for pointing it out! How far back do we need to go, though? Should my children benefit if I manipulate their utility function genetically while they're in the womb?

Another way to aggregate utility functions is by simulated bargaining, but it's biased in favor of rich and powerful people.

How far back do we need to go, though?

As far as needed to understand (the dependence of current agent's values on (the dependence of (expected benefit from value extraction) on current agent's values)). (Sorry, adding parens was simpler!)

This involves currently confusing "benefit" (to whom?) and assumed-mistaken "expected" (by whom?), supposedly referring to aspects of past agents (that built/determined the current agent) deciding on the strategic value bargaining. (As usual, ability to parse the world and see things that play the roles of elements of agents' algorithms seems necessary to get anything of this sort done.)

If I'm rich it's because I delayed consumption, allowing others to invest the capital that I had earned. Should we not allow these people some return on their investment?

To be clear, I'm not very sure the answer is yes; but nor do I think it's clear that 'wealth' falls into the category of 'things that should not influence CEV', where things like 'race', 'eye colour' etc. live.

Fair point about delayed gratification, but you may also be rich because your parents were rich, or because you won the lottery, or because you robbed someone. Judging people by their bargaining power conflates all those possible reasons.

No; if you didn't delay gratification you'd spend the money quickly, regardless of how you got it.

The funniest counterexample I know is Jefri Bolkiah =)

If you didn't delay gratification and had expensive tastes, you'd spend the money quickly, regardless of how you got it.

Even if everyone did have expensive tastes, people who started off with less money would need to delay their gratification more. A very poor person might need to delay gratification an average of 80% of the time, since they could afford almost nothing. A sufficiently rich person might only need to delay gratification 10% of the time without running into financial trouble. So if you wanted to reward delaying of gratification, then on average the poorer a person was, the more you'd want to reward him.

Another way to aggregate utility functions is by simulated bargaining, but it's biased in favor of rich and powerful people.

The same rich and powerful people who are most likely to be funding the research, maybe?

Today, to resolve their differences, people mostly just bargain I.R.L.

They do simulate bargains in their heads, but only to help them with the actual bargaining.

You can't be Pareto and game-theoretically stable at the same time (I have a nice picture proof of that, which I'll post some time). You can be stable without being Pareto - we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.

Edit: Picture-proof now posted at: http://lesswrong.com/r/discussion/lw/8qv/in_the_pareto_world_liars_prosper/
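
A sketch of that 50-50 mechanism (random dictator for two people): your report only matters in the branch where you're picked, and in that branch you want your true favourite implemented, so honesty is a dominant strategy. The outcomes below are illustrative placeholders.

```python
import random

def random_dictator(reported_favourites):
    """Implement one participant's reported favourite outcome, chosen uniformly at random."""
    return random.choice(reported_favourites)

print(random_dictator(["all cookies to Alice", "all cookies to Bob"]))
```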


You can be stable without being Pareto - we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.

I seem to have an incentive to lie in that scenario.

[This comment is no longer endorsed by its author.]

You can estimate where the others' favoured outcomes are and go a ways in the opposite direction to try to balance it out. Of course, if one of you takes this to the second level and the others are honest, then no one is happy except by coincidence (one of the honest people deviated from the mean more than you in the same way, and your overshoot happened to land on them).

Upvoted for trying to say something useful about CEV.

Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue.

It seems odd that this is the only step where you're using meta-preferences: I would have presumed that any theory would start off from giving a person's approved preferences considerably stronger weight than non-approved ones. (Though since approved desires are often far and non-approved ones near, one's approved ideal self might be completely unrealistic and not what they'd actually want. So non-approved ones should also be taken into account somehow.)

What do you mean by "actually want"? You seem to be coming dangerously close to the vomit fallacy: "Humans sometimes vomit. By golly, the future must be full of vomit!"

What do you mean by "actually want"?

Would not actually want X = would not endorse X after finding out the actual consequences of X; would not have X as a preference after reaching reflective equilibrium.

Oh I see, by "approved ideal self" you meant something different than "self after reaching reflective equilibrium". So instead of fiddling around with revealed preferences, why not just simulate the person reaching reflective equilibrium and then ask the person what preferences he or she endorses?

That was my first thought on reading the "revealed preferences" part of the post. Extrapolation first - then volition.

Could be done - but it is harder to define (what counts as a reflective equilibrium?) and harder to model (what do you expect your reflective equilibrium to be?).

In a previous thread I suggested starting by explicitly defining something like a CEV for a simple worm. After thinking about it, I think perhaps a norn, or some other simple hypothetical organism might be better. To make the situation as simple as possible, start with a universe where the norn are the most intelligent life in existence.

A norn (or something simpler than a norn) has explicitly defined drives, meaning the utility functions of individual norns could potentially be approximated very accurately.

The biggest weakness of this idea is that a norn, or worm, or cellular automaton, can't really participate in the process of approving or rejecting the resulting set of extrapolated solutions. For some people, I think this indicates that you can't do CEV on something that isn't sentient. It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us? I don't think this is unlikely.

It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us?

I think this doesn't matter, if we can

1) successfully define the CEV concept itself,

2) define a suitable reference class,

3) build a superintelligence, and

4) ensure that the superintelligence continues to pursue the best CEV it can find for the appropriate reference class.

Well, it would be helpful if we could also:
2.5) work out a reliable test for whether a given X really is an instance of the CEV concept for the given reference class

Which seems to depend on having some kind of understanding.

Lacking that, we are left with having to trust that whatever the SI we've built is doing is actually what we "really want" it to do, even if we don't seem to want it to do that, which is an awkward place to be.

You're the first to suggest something approaching a model on this thread :-)

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0

If Alice likes cookies, and Bob likes cookies but hates whippings, this method gives Alice more cookies than Bob. Moreover, the number of bonus cookies Alice gets depends on the properties of whips that nobody ever uses.
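
Rough numbers (mine, not the commenter's) to illustrate: because Bob's worst case includes whippings, his utility range is wider, so each cookie moves his normalised utility less than one of Alice's does, and a sum-maximising AI gets more total normalised utility per cookie by giving it to Alice.

```python
def normalised_value_per_cookie(cookie_value, worst_case, best_case):
    # How much one cookie adds to a person's utility after max-min normalisation.
    return cookie_value / (best_case - worst_case)

alice = normalised_value_per_cookie(1, worst_case=0,    best_case=10)  # 0.1
bob   = normalised_value_per_cookie(1, worst_case=-100, best_case=10)  # ~0.009
print(alice / bob)  # each of Alice's cookies counts ~11x as much as Bob's
```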

(In general, it's proper for properties of counterfactuals to have impact on which decisions are correct in reality, so this consideration alone isn't sufficient to demonstrate that there's a problem.)


It feels intuitively like it's a problem in this specific case.

[This comment is no longer endorsed by its author.]

You can restrict to a Pareto boundary before normalising - not as mathematically elegant, but indifferent to effects "that nobody ever wants/uses".

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in these particular situations.

There seems to be a feedback loop missing. Provide people with a broad range of choices, let them select a few, provide a range of alternatives within that selection, repeat. Allow for going back a step or ten. That's what happens IRL when you make a major purchase, like a TV, a car or a house.

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

"if the AI did exactly what they wanted" as opposed to "if the universe went exactly as they wanted" to avoid issues with unbounded utility functions? This seems like it might not be enough if the universe itself were unbounded in the relivant sense.

For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you're going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.

For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you're going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.

Yep.

So most likely a certain proportion of the universe will become paperclips.

What about recursive CEV?

Start off with CEV-0. I won't go into how that is generated, but it will have a lot of arbitrary decisions and stuff that seems vaguely sensible.

Then ask CEV-0 the following questions:

  • How should CEV-1 go about aggregating people's preferences?
  • How should CEV-1 deal with non-transitive or non-independent preferences?
  • How should CEV-1 determine preferences between outcomes that the subject could never have imagined?
  • What should CEV-1 do if people lack the expertise to judge the long-term consequences of their preferences?
  • Should CEV-1 consider people's stated or revealed preferences, or both?
  • Should CEV-1 consider preferences of non-human animals, people in comas, etc.?
  • How should CEV-1 deal with people who seem to be trying to modify their own preferences in order to game the system? (utility monsters/tactical voting)

... and so on. The answers to these questions then make up CEV-1. And then CEV-1 is asked the same questions to produce CEV-2.

Various different things could happen here. It could converge to a single stable fixed point. It could oscillate. It could explode, diverging wildly from anything we'd consider reasonable. Or its behavior could depend on the initial choice of CEV-0 (e.g. multiple attractive fixed points).
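
A sketch of the recursion, treating "ask CEV-n the questions" as an abstract update step; `refine` and the equality test are placeholders, the point is just the fixed-point / oscillation / divergence trichotomy:

```python
def iterate_cev(cev_0, refine, max_steps=1000):
    """Repeatedly apply the update step until a fixed point, a cycle, or the budget runs out."""
    seen = [cev_0]
    current = cev_0
    for _ in range(max_steps):
        current = refine(current)      # placeholder: CEV-n answers the questions, giving CEV-(n+1)
        if current == seen[-1]:
            return ("fixed point", current)
        if current in seen:
            return ("oscillation", current)
        seen.append(current)
    return ("no convergence within budget", current)
```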

Explosion could (possibly) be avoided by requiring CEV-n to pass some basic sanity checks, though that has problems too (the sanity checks may not be valid, i.e. they may just reflect our own biases; or they may not be enough - they act as constraints on the evolution of the system, but it could still end up insane in respects we haven't anticipated).

Some other problems could be resolved by asking CEV-n how to resolve them.

I'm not sure how to deal with the multiple stable fixed points case. That would seem to correspond to different cultures or special interest groups all trying to push whichever meta-level worldview benefits them the most.

Once CEV-n becomes a utility function, it will generally (but not always) get stuck there for ever.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

Is the crucial difference about CEV the fact that it doesn't attempt to maximise the utility of humanity, but rather to extract the volition of humanity by treating each person's input equally, without claiming that utility is being compared between people? Or does CEV involve interpersonal comparison of utility and, if so, why is this not considered problematic?

I thought interpersonal comparisons of utility were generally considered to be impossible.

This is true of aggregating ordinal utilities (see Arrow's theorem), but doesn't hold for cardinal utilities. If you are talking about comparing utilities (i.e. choosing a normalization method), I'm not aware of a general consensus that this is impossible.

[This comment is no longer endorsed by its author.]

Economists generally regard interpersonal utility comparisons as impossible; hence the focus on Pareto, and then Kaldor-Hicks, optimality. See for example this, though any decent economics textbook will cover it.

The problem, of course, is that utility functions are only defined up to a positive affine transformation.

The problem, of course, is that utility functions are only defined up to a positive affine transformation.

Which is why I normalise them first before adding them up.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

Not impossible, just challenging.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

It's hard. You can do it, in many ways, but most of the properties you'd want cannot all be had. The max-min method of normalisation I mentioned satisfies the most of the intuitive properties (despite not being very intuitive itself).

If you have the time, I'd be interested to know what these desirable properties are (or would be happy to read a paper on the topic if you have one to suggest).

We're working on those at the moment, so they're still in flux; but we'll put them out there once we've firmed them up.

Cool, I'll keep my eye out.