I've been involved in a recent thread where discussion of coherent extrapolated volition (CEV) came up. The general consensus was that CEV might - or might not - do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round - it was too soon to tell.

Ok, that's an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be useful, as long as we can extract concrete predictions from them. Bad models can be improved; undefined models are just intuition pumps for whatever people already feel about them. I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe - but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.

So I'd like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.

To start it off, here's my (simplistic) suggestion:

Volition

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide if actually faced with a given choice.
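
To make this concrete, here's a minimal Python sketch of the idea; `choice_model` is a purely hypothetical stand-in for the AI's simulation of the person, not anything defined elsewhere:

```python
from itertools import combinations

def revealed_preferences(options, choice_model):
    """Map each ordered pair (a, b) to True iff the simulated person picks a over b."""
    prefers = {}
    for a, b in combinations(options, 2):
        chosen = choice_model(a, b)  # hypothetical: what would the person decide, faced with a vs b?
        prefers[(a, b)] = chosen == a
        prefers[(b, a)] = chosen == b
    return prefers

# Toy "person" who always picks the alphabetically later option.
print(revealed_preferences(["apples", "bananas", "cherries"], lambda a, b: max(a, b)))
```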

Extrapolation

Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the non-transitivity or non-independence (for people who don't know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you've never heard of non-standard reals, lexicographical preferences, refusal to choose and related issues) everyone's preferences are now expressible as utility functions.
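
Continuing the sketch above (and hand-waving just as hard): detect the intransitive cycles, and fall back on a meta-preference query when they appear. `ask_meta_preference` is hypothetical - it stands in for presenting the person with candidate transitive orderings and letting them choose:

```python
from itertools import permutations

def is_transitive(prefers, options):
    """True iff the pairwise relation has no a > b and b > c but not a > c pattern."""
    for a, b, c in permutations(options, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and not prefers.get((a, c)):
            return False
    return True

def extrapolate(prefers, options, ask_meta_preference):
    if is_transitive(prefers, options):
        return prefers
    # In a real model we'd offer orderings close to the revealed data;
    # for the sketch, just offer every total ordering and let the person pick.
    return ask_meta_preference(list(permutations(options)))
```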

Coherence

Normalise each existing person's utility function and add them together to get your CEV. At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.
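
As a toy illustration of that max-min normalisation (names and numbers are mine, purely for the sketch):

```python
def normalise(expected_utility, policies):
    """Rescale one person's expected utilities so their best policy scores 1, worst 0."""
    values = [expected_utility(p) for p in policies]
    lo, hi = min(values), max(values)
    if hi == lo:
        return lambda p: 0.5  # the person is indifferent; any constant will do
    return lambda p: (expected_utility(p) - lo) / (hi - lo)

def aggregate(people, policies):
    """Sum the normalised utilities and pick the policy that maximises the total."""
    normalised = [normalise(u, policies) for u in people]
    return max(policies, key=lambda p: sum(n(p) for n in normalised))

# Two people, three candidate AI policies.
policies = ["A", "B", "C"]
alice = lambda p: {"A": 10, "B": 0, "C": 6}[p]
bob   = lambda p: {"A": 1, "B": 3, "C": 2.5}[p]
print(aggregate([alice, bob], policies))  # -> "C", a compromise policy
```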


At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

Unfortunately, this normalisation scheme isn't game-theoretically stable; if you expect your utility to be analyzed this way, you have an incentive to modify your utility function, clipping or flattening its ends to give it a steeper gradient around the amount of utility you expect to receive.
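
To make the worry concrete, here's a toy example (policies and numbers are mine, not the commenter's): Bob does strictly better by reporting a flattened version of his utility function than by reporting it honestly.

```python
def normalise(u, policies):
    lo, hi = min(u.values()), max(u.values())
    return {p: (u[p] - lo) / (hi - lo) for p in policies}

def winner(reported_utilities, policies):
    totals = {p: sum(normalise(u, policies)[p] for u in reported_utilities)
              for p in policies}
    return max(policies, key=totals.get)

policies  = ["A", "B", "C", "D"]
alice     = {"A": 1.0, "B": 0.8, "C": 0.3,  "D": 0.0}
bob_true  = {"A": 0.0, "B": 0.3, "C": 0.7,  "D": 1.0}
bob_gamed = {"A": 0.0, "B": 0.3, "C": 0.95, "D": 1.0}  # top end flattened

honest = winner([alice, bob_true], policies)   # -> "B"
gamed  = winner([alice, bob_gamed], policies)  # -> "C"
print(bob_true[honest], bob_true[gamed])       # 0.3 vs 0.7: misreporting pays
```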

That seems like it may be true of every scheme that doesn't consider the causal origins of people's utility functions. Does something like Gibbard-Satterthwaite apply?

Not specifically causal origins (there's evolution); instead, I suppose there might be a way of directly negating most effects resulting from such strategic considerations (that is, from decisions that were influenced by their expected effect on the very decision that now wants to negate that effect).

"Easy" way of doing this: see what would have happened if everyone believed that you were using a model where liars don't prosper (ie random dictator), but actually use a Pareto method.

Considering the causal origins of people's utility functions is a nice hack, thanks for pointing it out! How far back do we need to go, though? Should my children benefit if I manipulate their utility function genetically while they're in the womb?

Another way to aggregate utility functions is by simulated bargaining, but it's biased in favor of rich and powerful people.

How far back do we need to go, though?

As far as needed to understand (the dependence of current agent's values on (the dependence of (expected benefit from value extraction) on current agent's values)). (Sorry, adding parens was simpler!)

This involves currently confusing "benefit" (to whom?) and assumed-mistaken "expected" (by whom?), supposedly referring to aspects of past agents (that built/determined the current agent) deciding on the strategic value bargaining. (As usual, ability to parse the world and see things that play the roles of elements of agents' algorithms seems necessary to get anything of this sort done.)

If I'm rich it's because I delayed consumption, allowing others to invest the capital that I had earned. Should we not allow these people some return on their investment?

To be clear, I'm not very sure the answer is yes; but nor do I think it's clear that 'wealth' falls into the category of 'things that should not influence CEV', where things like 'race', 'eye colour' etc. live.

Fair point about delayed gratification, but you may also be rich because your parents were rich, or because you won the lottery, or because you robbed someone. Judging people by their bargaining power conflates all those possible reasons.

No; if you didn't delay gratification you'd spend the money quickly, regardless of how you got it.

The funniest counterexample I know is Jefri Bolkiah =)

If you didn't delay gratification and had expensive tastes, you'd spend the money quickly, regardless of how you got it.

Even if everyone did have expensive tastes, people who started off with less money would need to delay their gratification more. A very poor person might need to delay gratification an average of 80% of the time, since they could afford almost nothing. A sufficiently rich person might only need to delay gratification 10% of the time without running into financial trouble. So if you wanted to reward delaying of gratification, then on average the poorer a person was, the more you'd want to reward him.

Another way to aggregate utility functions is by simulated bargaining, but it's biased in favor of rich and powerful people.

The same rich and powerful people who are most likely to be funding the research, maybe?

Today, to resolve their differences, people mostly just bargain I.R.L.

They do simulate bargains in their heads, but only to help them with the actual bargaining.

You can't be Pareto and game-theoretically stable at the same time (I have a nice picture proof of that, which I'll post some time). You can be stable without being Pareto - we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.

Edit: Picture-proof now posted at: http://lesswrong.com/r/discussion/lw/8qv/in_the_pareto_world_liars_prosper/
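
A sketch of that 50-50 mechanism (random dictator for two people): your report only matters in the branch where you're picked, and in that branch you want your true favourite implemented, so honesty is a dominant strategy. The outcomes below are illustrative placeholders.

```python
import random

def random_dictator(reported_favourites):
    """Implement one participant's reported favourite outcome, chosen uniformly at random."""
    return random.choice(reported_favourites)

print(random_dictator(["all cookies to Alice", "all cookies to Bob"]))
```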


You can be stable without being Pareto - we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.

I seem to have an incentive to lie in that scenario.

[This comment is no longer endorsed by its author.]

You can estimate where the others' favoured outcomes are and go a ways in the opposite direction to try to balance it out. Of course, if one of you takes this to the second level and the others are honest, then no one is happy except by coincidence (one of the honest people deviated from the mean more than you in the same way, and your overshoot happened to land on them).

Upvoted for trying to say something useful about CEV.

Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue.

It seems odd that this is the only step where you're using meta-preferences: I would have presumed that any theory would start off from giving a person's approved preferences considerably stronger weight than non-approved ones. (Though since approved desires are often far and non-approved ones near, one's approved ideal self might be completely unrealistic and not what they'd actually want. So non-approved ones should also be taken into account somehow.)

What do you mean by "actually want"? You seem to be coming dangerously close to the vomit fallacy: "Humans sometimes vomit. By golly, the future must be full of vomit!"

What do you mean by "actually want"?

Would not actually want X = would not endorse X after finding out the actual consequences of X; would not have X as a preference after reaching reflective equilibrium.

Oh I see, by "approved ideal self" you meant something different than "self after reaching reflective equilibrium". So instead of fiddling around with revealed preferences, why not just simulate the person reaching reflective equilibrium and then ask the person what preferences he or she endorses?

That was my first thought on reading the "revealed preferences" part of the post. Extrapolation first - then volition.

Could be done - but it is harder to define (what counts as a reflective equilibrium?) and harder to model (what do you expect your reflective equilibrium to be?).

In a previous thread I suggested starting by explicitly defining something like a CEV for a simple worm. After thinking about it, I think perhaps a norn, or some other simple hypothetical organism might be better. To make the situation as simple as possible, start with a universe where the norn are the most intelligent life in existence.

A norn (or something simpler than a norn) has explicitly defined drives, meaning the utility functions of individual norns could potentially be approximated very accurately.

The biggest weakness of this idea is that a norn, or worm, or cellular automaton, can't really participate in the process of approving or rejecting the resulting set of extrapolated solutions. For some people, I think this indicates that you can't do CEV on something that isn't sentient. It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us? I don't think this is unlikely.

It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us?

I think this doesn't matter, if we can

1) successfully define the CEV concept itself,

2) define a suitable reference class,

3) build a superintelligence, and

4) ensure that the superintelligence continues to pursue the best CEV it can find for the appropriate reference class.

Well, it would be helpful if we could also:
2.5) work out a reliable test for whether a given X really is an instance of the CEV concept for the given reference class

Which seems to depend on having some kind of understanding.

Lacking that, we are left with having to trust that whatever the SI we've built is doing is actually what we "really want" it to do, even if we don't seem to want it to do that, which is an awkward place to be.

You're the first to suggest something approaching a model on this thread :-)

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0

If Alice likes cookies, and Bob likes cookies but hates whippings, this method gives Alice more cookies than Bob. Moreover, the number of bonus cookies Alice gets depends on the properties of whips that nobody ever uses.
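
Rough numbers (mine, not the commenter's) to illustrate: because Bob's worst case includes whippings, his utility range is wider, so each cookie moves his normalised utility less than one of Alice's does, and a sum-maximising AI gets more total normalised utility per cookie by giving it to Alice.

```python
def normalised_value_per_cookie(cookie_value, worst_case, best_case):
    # How much one cookie adds to a person's utility after max-min normalisation.
    return cookie_value / (best_case - worst_case)

alice = normalised_value_per_cookie(1, worst_case=0,    best_case=10)  # 0.1
bob   = normalised_value_per_cookie(1, worst_case=-100, best_case=10)  # ~0.009
print(alice / bob)  # each of Alice's cookies counts ~11x as much as Bob's
```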

(In general, it's proper for properties of counterfactuals to have impact on which decisions are correct in reality, so this consideration alone isn't sufficient to demonstrate that there's a problem.)


It feels intuitively like it's a problem in this specific case.

[This comment is no longer endorsed by its author.]

You can restrict to a Pareto boundary before normalising - not as mathematically elegant, but indifferent to effects "that nobody ever wants/uses".

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in these particular situations.

There seems to be a feedback loop missing. Provide people with a broad range of choices, let them select a few, provide a range of alternatives within that selection, repeat. Allow for going back a step or ten. That's what happens IRL when you make a major purchase, like a TV, a car or a house.

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

"if the AI did exactly what they wanted" as opposed to "if the universe went exactly as they wanted" to avoid issues with unbounded utility functions? This seems like it might not be enough if the universe itself were unbounded in the relivant sense.

For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you're going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.

For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you're going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.

Yep.

So most likely a certain proportion of the universe will become paperclips.

What about recursive CEV?

Start off with CEV-0. I won't go into how that is generated, but it will have a lot of arbitrary decisions and stuff that seems vaguely sensible.

Then ask CEV-0 the following questions:

  • How should CEV-1 go about aggregating people's preferences?
  • How should CEV-1 deal with non-transitive or non-independent preferences?
  • How should CEV-1 determine preferences between outcomes that the subject could never have imagined?
  • What should CEV-1 do if people lack the expertise to judge the long-term consequences of their preferences?
  • Should CEV-1 consider people's stated or revealed preferences, or both?
  • Should CEV-1 consider preferences of non-human animals, people in comas, etc.?
  • How should CEV-1 deal with people who seem to be trying to modify their own preferences in order to game the system? (utility monsters/tactical voting)

... and so on. The answers to these questions then make up CEV-1. And then CEV-1 is asked the same questions to produce CEV-2.

Various different things could happen here. It could converge to a single stable fixed point. It could oscillate. It could explode, diverging wildly from anything we'd consider reasonable. Or its behavior could depend on the initial choice of CEV-0 (e.g. multiple attractive fixed points).
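
A sketch of the recursion, treating "ask CEV-n the questions" as an abstract update step; `refine` and the equality test are placeholders, the point is just the fixed-point / oscillation / divergence trichotomy:

```python
def iterate_cev(cev_0, refine, max_steps=1000):
    """Repeatedly apply the update step until a fixed point, a cycle, or the budget runs out."""
    seen = [cev_0]
    current = cev_0
    for _ in range(max_steps):
        current = refine(current)      # placeholder: CEV-n answers the questions, giving CEV-(n+1)
        if current == seen[-1]:
            return ("fixed point", current)
        if current in seen:
            return ("oscillation", current)
        seen.append(current)
    return ("no convergence within budget", current)
```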

Explosion could (possibly) be avoided by requiring CEV-n to pass some basic sanity checks, though that has problems too (the sanity checks may not be valid, i.e. they may just reflect our own biases; or they may not be enough - they act as constraints on the evolution of the system, but it could still end up insane in respects we haven't anticipated).

Some other problems could be resolved by asking CEV-n how to resolve them.

I'm not sure how to deal with the multiple stable fixed points case. That would seem to correspond to different cultures or special interest groups all trying to push whichever meta-level worldview benefits them the most.

Once CEV-n becomes a utility function, it will generally (but not always) get stuck there for ever.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

Is the crucial difference about CEV the fact that it doesn't attempt to maximise the utility of humanity, but rather to extract the volition of humanity by treating each person's input equally, without claiming that utility is being compared between people? Or does CEV involve interpersonal comparison of utility and, if so, why is this not considered problematic?

I thought interpersonal comparisons of utility were generally considered to be impossible.

This is true of aggregating ordinal utilities (see Arrow's theorem), but doesn't hold for cardinal utilities. If you are talking about comparing utilities (i.e. choosing a normalization method), I'm not aware of a general consensus that this is impossible.

[This comment is no longer endorsed by its author.]

Economists generally regard interpersonal utility comparisons as impossible; hence the focus on Pareto, and then Kaldor-Hicks, optimality. See for example this, though any decent economics textbook will cover it.

The problem, of course, is that utility functions are only defined up to a positive affine transformation.

The problem, of course, is that utility functions are only defined up to a positive affine transformation.

Which is why I normalise them first before adding them up.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

Not impossible, just challenging.

Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

It's hard. You can do it, in many ways, but most of the properties you'd want cannot all be had. The max-min method of normalisation I mentioned satisfies the most of the intuitive properties (despite not being very intuitive itself).

If you have the time, I'd be interested to know what these desirable properties are (or would be happy to read a paper on the topic if you have one to suggest).

We're working on those at the moment, so they're still in flux; but we'll put them out there once we've firmed them up.

Cool, I'll keep my eye out.