
Proper value learning through indifference

17 Stuart_Armstrong 19 June 2014 09:39AM

A putative new idea for AI control; index here.

Many designs for creating AGIs (such as OpenCog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is closely analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies upon an AGI that already has an interim system of values being able and willing to correctly update that system. Generally speaking, humans are unwilling to easily update their values, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which values should be kept stable by the AGI, and conditions when they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't) the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its program, attempt to manipulate the value updating rules according to its current values.

How could it do that? A very powerful AGI could pull the time-honoured trick of "take control of your reward channel", either by threatening humans into giving it the moral answers it wants, or by replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questions to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than its programmers would.

The problems with value loading have been cast into the various "Cake or Death" problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.


Gains from trade: Slug versus Galaxy - how much would I give up to control you?

34 Stuart_Armstrong 23 July 2013 07:06PM

Edit: Moved to main at ThrustVectoring's suggestion.

A suggestion as to how to split the gains from trade in some situations.

The problem of Power

A year or so ago, people at the FHI embarked on a grand project: to try and find out if there was a single way of resolving negotiations, or a single way of merging competing moral theories. This project made a lot of progress in finding out how hard this was, but very little in terms of solving it. It seemed evident that the correct solution was to weigh the different utility functions and then have everyone maximise the weighted sum, but all ways of weighting had their problems (the weighting with the best properties was a very silly one: the "min-max" weighting that sets your maximal attainable utility to 1 and your minimal to 0).
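As a toy illustration (my own sketch, not the FHI's actual formalism; the outcome names and numbers are invented), the min-max weighting just described can be written in a few lines: rescale each party's utility over the feasible outcomes so that its worst maps to 0 and its best to 1, then pick the outcome maximising the sum.

```python
def minmax_normalize(utilities):
    """Affinely rescale a dict of outcome -> utility onto [0, 1]."""
    lo, hi = min(utilities.values()), max(utilities.values())
    return {o: (u - lo) / (hi - lo) for o, u in utilities.items()}

def minmax_merge(parties):
    """parties: list of dicts mapping the same outcomes to utilities."""
    normalized = [minmax_normalize(p) for p in parties]
    return max(parties[0], key=lambda o: sum(n[o] for n in normalized))

# Hypothetical numbers: the Republic strongly prefers "b", the slug "a".
republic = {"a": 0.0, "b": 100.0, "c": 90.0}
slug = {"a": 5.0, "b": 0.0, "c": 4.9}
print(minmax_merge([republic, slug]))  # "c": the compromise wins the summed score
```

Note how each party's stakes get stretched to the same [0, 1] range regardless of how much is actually at stake for them, which is exactly why the slug's preferences can weigh in far too much.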

One thing that we didn't get close to addressing is the concept of power. If two partners in the negotiation have very different levels of power, then abstractly comparing their utilities seems the wrong solution (more to the point: it wouldn't be accepted by the powerful party).

The New Republic spans the Galaxy, with Jedi knights, battle fleets, armies, general coolness, and the manufacturing and human resources of countless systems at its command. The dull slug, ARthUrpHilIpDenu, moves very slowly around a plant, and possibly owns one leaf (or not - he can't produce the paperwork). Both these entities have preferences, but if they meet up, and their utilities are normalised abstractly, then ARthUrpHilIpDenu's preferences will weigh in far too much: a sizeable fraction of the galaxy's production will go towards satisfying the slug. Even if you think this is "fair", consider that the New Republic is the merging of countless individual preferences, so it doesn't make any sense that the two utilities get weighted equally.


We Don't Have a Utility Function

43 [deleted] 02 April 2013 03:49AM

Related: Pinpointing Utility

If I ever say "my utility function", you could reasonably accuse me of cargo-cult rationality; trying to become more rational by superficially imitating the abstract rationalists we study makes about as much sense as building an air traffic control station out of grass to summon cargo planes.

There are two ways an agent could be said to have a utility function:

  1. It could behave in accordance with the VNM axioms; always choosing in a sane and consistent manner, such that "there exists a U". The agent need not have an explicit representation of U.

  2. It could have an explicit utility function that it tries to expected-maximize. The agent need not perfectly follow the VNM axioms all the time. (Real bounded decision systems will take shortcuts for efficiency and may not achieve perfect rationality, like how real floating point arithmetic isn't associative).

Neither of these is true of humans. Our behaviour and preferences are not consistent and sane enough to be VNM, and we are generally quite confused about what we even want, never mind having reduced it to a utility function. Nevertheless, you still see the occasional reference to "my utility function".

Sometimes "my" refers to "abstract me who has solved moral philosophy and or become perfectly rational", which at least doesn't run afoul of the math, but is probably still wrong about the particulars of what such an abstract idealized self would actually want. But other times it's a more glaring error like using "utility function" as shorthand for "entire self-reflective moral system", which may not even be VNMish.

But this post isn't really about all the ways people misuse terminology, it's about where we're actually at on the whole problem for which a utility function might be the solution.

As above, I don't think any of us have a utility function in either sense; we are not VNM, and we haven't worked out what we want enough to make a convincing attempt at trying. Maybe someone out there has a utility function in the second sense, but I doubt that it actually represents what they would want.

Perhaps then we should speak of what we want in terms of "terminal values"? For example, I might say that it is a terminal value of mine that I should not murder, or that freedom from authority is good.

But what does "terminal value" mean? Usually, it means that the value of something is not contingent on or derived from other facts or situations, like for example, I may value beautiful things in a way that is not derived from what they get me. The recursive chain of valuableness terminates at some set of values.

There's another connotation, though, which is that your terminal values are akin to axioms; not subject to argument or evidence or derivation, and simply given, that there's no point in trying to reconcile them with people who don't share them. This is the meaning people are sometimes getting at when they explain failure to agree with someone as "terminal value differences" or "different set of moral axioms". This is completely reasonable, if and only if that is in fact the nature of the beliefs in question.

About two years ago, it very much felt like freedom from authority was a terminal value for me. Those hated authoritarians and fascists were simply wrong, probably due to some fundamental neurological fault that could not be reasoned with. The very prototype of "terminal value differences".

And yet here I am today, having been reasoned out of that "terminal value", such that I even appreciate a certain aesthetic in bowing to a strong leader.

If that was a terminal value, I'm afraid the term has lost much of its meaning to me. If it was not, if even the most fundamental-seeming moral feelings are subject to argument, I wonder if there is any coherent sense in which I could be said to have terminal values at all.

The situation here with "terminal values" is a lot like the situation with "beliefs" in other circles. Ask someone what they believe in most confidently, and they will take the opportunity to differentiate themselves from the opposing tribe on uncertain controversial issues; god exists, god does not exist, racial traits are genetic, race is a social construct. The pedant answer of course is that the sky is probably blue, and that that box over there is about a meter long.

Likewise, ask someone for their terminal values, and they will take the opportunity to declare that those hated greens are utterly wrong on morality, and blueness is wired into their very core, rather than the obvious things like beauty and friendship being valuable, and paperclips not.

So besides not having a utility function, those aren't your terminal values. I'd be surprised if even the most pedantic answer weren't subject to argument; I don't seem to have anything like a stable and non-negotiable value system at all, and I don't think that I am even especially confused relative to the rest of you.

Instead of a nice consistent value system, we have a mess of intuitions and heuristics and beliefs that often contradict each other, fail to give an answer, and change with time and mood and memes. And that's all we have. One of the intuitions is that we want to fix this mess.

People have tried to do this "Moral Philosophy" thing before, myself included, but it hasn't generally turned out well. We've made all kinds of overconfident leaps to what turn out to be unjustified conclusions (utilitarianism, egoism, hedonism, etc), or just ended up wallowing in confused despair.

The zeroth step in solving a problem is to notice that we have a problem.

The problem here, in my humble opinion, is that we have no idea what we are doing when we try to do Moral Philosophy. We need to go up a meta-level and get a handle on Moral MetaPhilosophy. What's the problem? What are the relevant knowns? What are the unknowns? What's the solution process?

Ideally, we could do for Moral Philosophy approximately what Bayesian probability theory has done for Epistemology. My moral intuitions are a horrible mess, but so are my epistemic intuitions, and yet we more-or-less know what we are doing in epistemology. A problem like this has been solved before, and this one seems solvable too, if a bit harder.

It might be that when we figure this problem out to the point where we can be said to have a consistent moral system with real terminal values, we will end up with a utility function, but on the other hand, we might not. Either way, let's keep in mind that we are still on rather shaky ground, and at least refrain from believing the confident declarations of moral wisdom that we so like to make.

Moral Philosophy is an important problem, but the way is not clear yet.

Pinpointing Utility

57 [deleted] 01 February 2013 03:58AM

Following Morality is Awesome. Related: Logical Pinpointing, VNM.

The eternal question, with a quantitative edge: A wizard has turned you into a whale, how awesome is this?

"10.3 Awesomes"

Meditate on this: What does that mean? Does that mean it's desirable? What does that tell us about how awesome it is to be turned into a whale? Explain. Take a crack at it for real. What does it mean for something to be labeled as a certain amount of "awesome" or "good" or "utility"?

What is This Utility Stuff?

Most of us agree that the VNM axioms are reasonable, and that they imply that we should be maximizing this stuff called "expected utility". We know that expectation is just a weighted average, but what's this "utility" stuff?

Well, to start with, it's a logical concept, which means we need to pin it down with the axioms that define it. For the moment, I'm going to conflate utility and expected utility for simplicity's sake. Bear with me. Here are the conditions that are necessary and sufficient to be talking about utility:

  1. Utility can be represented as a single real number.
  2. Each outcome has a utility.
  3. The utility of a probability distribution over outcomes is the expected utility.
  4. The action that results in the highest utility is preferred.
  5. No other operations are defined.
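To make conditions 3 and 4 concrete, here is a minimal sketch (my own illustration; the outcome names and utility numbers are invented) of a decision procedure that uses only the licensed operations: taking expectations over outcomes, and preferring the larger result.

```python
def expected_utility(lottery, utility):
    """lottery: dict of outcome -> probability. Condition 3."""
    return sum(p * utility[o] for o, p in lottery.items())

def choose(actions, utility):
    """actions: dict of name -> lottery. Condition 4: prefer the max."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))

utility = {"whale_day": 10.3, "normal_day": 0.0}
actions = {
    "drink_potion": {"whale_day": 0.5, "normal_day": 0.5},  # EU = 5.15
    "stay_home": {"normal_day": 1.0},                       # EU = 0.0
}
print(choose(actions, utility))  # drink_potion
```

Nothing else is done with the numbers: no interpreting the 10.3, no adding utilities across agents, no treating the magnitude as meaning anything beyond the comparison.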

I hope that wasn't too esoteric. The rest of this post will be explaining the implications of those statements. Let's see how they apply to the awesomeness of being turned into a whale:

  1. "10.3 Awesomes" is a real number.
  2. We are talking about the outcome where "A wizard has turned you into a whale".
  3. There are no other outcomes to aggregate with, but that's OK.
  4. There are no actions under consideration, but that's OK.
  5. Oh. Not even taking the value?

Note 5 especially. You can probably look at the number without causing trouble, but if you try to treat it as meaningful for something other than conditions 3 and 4, even accidentally, that's a type error.

Unfortunately, you do not have a finicky compiler that will halt and warn you if you break the rules. Instead, your error will be silently ignored, and you will go on, blissfully unaware that the invariants in your decision system no longer pinpoint VNM utility. (Uh oh.)

Unshielded Utilities, and Cautions for Utility-Users

Let's imagine that utilities are radioactive; if we are careful with our containment procedures, we can safely combine and compare them, but if we interact with an unshielded utility, it's over: we've committed a type error.

To even get a utility to manifest itself in this plane, we have to do a little ritual. We have to take the ratio between two utility differences. For example, if we want to get a number for the utility of being turned into a whale for a day, we might take the difference between that scenario and what we would otherwise expect to do, and then take the ratio between that difference and the difference between a normal day and a day where we also get a tasty sandwich. (Make sure you take the absolute value of your unit, or you will reverse your utility function, which is a bad idea.)

So the form that the utility of being a whale manifests as might be "500 tasty sandwiches better than a normal day". We have chosen "a normal day" for our datum, and "tasty sandwiches" for our units. Of course we could have just as easily chosen something else, like "being turned into a whale" as our datum, and "orgasms" for our units. Then it would be "0 orgasms better than being turned into a whale", and a normal day would be "-400 orgasms from the whale-day".
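The difference-and-ratio ritual can be written out explicitly. In this sketch the internal numbers are arbitrary placeholders (chosen to reproduce the post's "500 tasty sandwiches"); only ratios of differences survive, so any positive affine transform of them summons the same value.

```python
def in_units(u, outcome, datum, unit_lo, unit_hi):
    """(U(outcome) - U(datum)) / |U(unit_hi) - U(unit_lo)|."""
    return (u[outcome] - u[datum]) / abs(u[unit_hi] - u[unit_lo])

# Internal numbers are arbitrary; shifting and scaling them all changes nothing.
u = {"normal_day": 7.0, "sandwich_day": 7.25, "whale_day": 132.0}
v = {o: 3.0 * x + 42.0 for o, x in u.items()}  # a positive affine transform

print(in_units(u, "whale_day", "normal_day", "normal_day", "sandwich_day"))
print(in_units(v, "whale_day", "normal_day", "normal_day", "sandwich_day"))
# both print 500.0: "500 tasty sandwiches better than a normal day"
```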

You say: "But you shouldn't define your utility like that, because then you are experiencing huge disutility in the normal case."

Wrong, and radiation poisoning, and type error. You tried to "experience" a utility, which is not in the defined operations. Also, you looked directly at the value of an unshielded utility (also known as numerology).

We summoned the utilities into the real numbers, but they are still utilities, and we still can only compare and aggregate them. The summoning only gives us a number that we can numerically do those operations on, which is why we did it. This is the same situation as time, position, velocity, etc, where we have to select units and datums to get actual quantities that mathematically behave like their ideal counterparts.

Sometimes people refer to this relativity of utilities as "positive affine structure" or "invariant up to a scale and shift". That phrasing confuses me: it makes me think of an equivalence class of utility functions with numbers coming out, which don't agree on the actual numbers but can be made to agree with a linear transform, rather than of a utility function as a space I can measure distances in. I'm an engineer, not a mathematician, so I find it much more intuitive and less confusing to think in terms of units and datums, even though it's basically the same thing. This way, the utility function can scale and shift all it wants, and my numbers will always be the same. Equivalently, all agents that share my preferences will always agree that a day as a whale is "400 orgasms better than a normal day", even if they use another basis themselves.

So what does it mean that being a whale for a day is 400 orgasms better than a normal day? Does it mean I would prefer 400 orgasms to a day as a whale? Nope. Orgasms don't add up like that; I'd probably be quite tired of it by 15. (Remember that "orgasms" were defined as the difference between a day without an orgasm and a day with one, not as the utility of a marginal orgasm in general.) What it means is that I'd be indifferent between a normal day with a 1/400 chance of being a whale, and a normal day with a guaranteed extra orgasm.

That is, utilities are fundamentally about how your preferences react to uncertainty. For example, you don't have to think that each marginal year of life is as valuable as the last if you don't think you should take a gamble that will double your remaining lifespan with 60% certainty and kill you otherwise. After all, all that such a utility assignment even means is that you would take such a gamble. In the words of VNM:

We have practically defined numerical utility as being that thing for which the calculus of mathematical expectations is legitimate.

But suppose there are very good arguments that have nothing to do with uncertainty for why you should value each marginal life-year as much as the last. What then?

Well, "what then" is that we spend a few weeks in the hospital dying of radiation poisoning, because we tried to interact with an unshielded utility again (utilities are radioactive, remember?). The specific error is that we tried to manipulate the utility function with something other than comparison and aggregation. Touching a utility directly is just as much an error as observing it directly.

But if the only way to define your utility function is with thought experiments about what gambles you would take, and the only use for it is deciding what gambles you would take, then isn't it doing no work as a concept?

The answer is no, but this is a good question because it gets us closer to what exactly this utility function stuff is about. The utility of utility is that defining how you would behave in one gamble puts a constraint on how you would behave in some other related gambles. As with all math, we put in some known facts, and then use the rules to derive some interesting but unknown facts.

For example, if we have decided that we would be indifferent between a tasty sandwich and a 1/500 chance of being a whale for tomorrow, and that we'd be indifferent between a tasty sandwich and a 30% chance of sun instead of the usual rain, then we should also be indifferent between a certain sunny day and a 1/150 chance of being a whale.
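That inference is just arithmetic on utility differences. A sketch, taking the baseline day as 0 and the whale-day difference as the unit:

```python
# The two stated indifferences pin down the third.
whale = 1.0               # the whale-day difference, chosen as our unit
sandwich = whale / 500    # indifferent: sandwich ~ 1/500 chance of whale
sun = sandwich / 0.30     # indifferent: sandwich ~ 30% chance of sun
p = sun / whale           # chance of whale matching a certain sunny day
print(round(1 / p))       # 150, i.e. p = 1/150
```

This is the work the concept does: two gambles you have committed to constrain a third you never explicitly considered.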

Monolithicness and Marginal (In)Dependence

If you are really paying attention, you may be a bit confused, because it seems to you that money or time or some other consumable resource can force you to assign utilities even if there is no uncertainty in the system. That issue is complex enough to deserve its own post, so I'd like to delay it for now.

Part of the solution is that as we defined them, utilities are monolithic. This is the implication of "each outcome has a utility". What this means is that you can't add and recombine utilities by decomposing and recombining outcomes. Being specific, you can't take a marginal whale from one outcome and staple it onto another outcome, and expect the marginal utilities to be the same. For example, maybe the other outcome has no oceans for your marginal whale.

For a bigger example, what we have said so far about the relative value of sandwiches and sunny days and whale-days does not necessarily imply that we are indifferent between a 1/250 chance of being a whale and any of the following:

  • A day with two tasty sandwiches. (Remember that a tasty sandwich was defined as a specific difference, not a marginal sandwich in general, which has no reason to have a consistent marginal value.)

  • A day with a 30% chance of sun and a certain tasty sandwich. (Maybe the tasty sandwich and the sun at the same time is horrifying for some reason. Maybe someone drilled into you as a child that "bread in the sun" was bad bad bad.)

  • etc. You get the idea. Utilities are monolithic and fundamentally associated with particular outcomes, not marginal outcome-pieces.

However, as in probability theory, where each possible outcome technically has its very own probability, in practice it is useful to talk about a concept of independence.

So for example, even though the axioms don't guarantee in general that it will ever be the case, it may work out in practice that given some conditions, like there being nothing special about bread in the sun, and my happiness not being near saturation, the utility of a marginal tasty sandwich is independent of a marginal sunny day, meaning that sun+sandwich is as much better than just sun as just a sandwich is better than baseline, ultimately meaning that I am indifferent between {50%: sunny+sandwich; 50% baseline} and {50%: sunny; 50%: sandwich}, and other such bets. (We need a better solution for rendering probability distributions in prose).
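Under that independence assumption, the claimed indifference is a one-line check (illustrative numbers of my own; the sun+sandwich utility is additive by construction, i.e. no bread-in-the-sun penalty):

```python
u_base, u_sun, u_sandwich = 0.0, 3.0, 1.0
u_both = u_sun + u_sandwich   # independence: no interaction between the two

lottery_a = 0.5 * u_both + 0.5 * u_base       # {50%: sunny+sandwich; 50%: baseline}
lottery_b = 0.5 * u_sun + 0.5 * u_sandwich    # {50%: sunny; 50%: sandwich}
print(lottery_a == lottery_b)  # True: equal expected utility, so indifferent
```

If the sandwich and the sun did interact (u_both != u_sun + u_sandwich), the two lotteries would come apart, which is exactly the monolithicness point.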

Notice that the independence of marginal utilities can depend on conditions and that independence is with respect to some other variable, not a general property. The utility of a marginal tasty sandwich is not independent of whether I am hungry, for example.

There is a lot more to this independence thing (and linearity, and risk aversion, and so on), so it deserves its own post. For now, the point is that the monolithicness thing is fundamental, but in practice we can sometimes look inside the black box and talk about independent marginal utilities.

Dimensionless Utility

I liked this quote from the comments of Morality is Awesome:

Morality needs a concept of awfulness as well as awesomeness. In the depths of hell, good things are not an option and therefore not a consideration, but there are still choices to be made.

Let's develop that second sentence a bit more. If all your options suck, what do you do? You still have to choose. So let's imagine we are in the depths of hell and see what our theories have to say about it:

Day 78045. Satan has presented me with three options:

  1. Go on a date with Satan Himself. This will involve romantically torturing souls together, subtly steering mortals towards self-destruction, watching people get thrown into the lake of fire, and some very unsafe, very nonconsensual sex with the Adversary himself.

  2. Paperclip the universe.

  3. Satan's court wizard will turn me into a whale and release me into the lake of fire, to roast slowly for the next month, kept alive by twisted black magic.

Wat do?

They all seem pretty bad, but "pretty bad" is not a utility. We could quantify paperclipping as a couple hundred billion lives lost. Being a whale in the lake of fire would be awful, but a bounded sort of awful. A month of endless horrible torture. The "date" is having to be on the giving end of what would more or less happen anyway, and then getting savaged by Satan. Still none of these are utilities.

Coming up with actual utility numbers for these in terms of tasty sandwiches and normal days is hard; it would be like measuring the microkelvin temperatures of your physics experiment with a Fahrenheit kitchen thermometer: in principle it might work, but it isn't the best tool for the job. Instead, we'll use a different scheme this time.

Engineers (and physicists?) sometimes transform problems into a dimensionless form that removes all redundant information from the problem. For example, for a heat conduction problem, we might define an isomorphic dimensionless temperature so that real temperatures between 78 and 305 C become dimensionless temperatures between 0 and 1. Transforming a problem into dimensionless form is nearly always helpful, often in really surprising ways. We can do this with utility too.

Back to depths of hell. The date with Satan is clearly the best option, so it gets dimensionless utility 1. The paperclipper gets 0. On that scale, I'd say roasting in the lake of fire is like 0.999 or so, but that might just be scope insensitivity. We'll take it for now.
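The dimensionless transformation itself is the same min-max rescaling used for dimensionless temperature. A sketch with invented raw numbers (any positive affine transform of them yields the same output, which is the point):

```python
def dimensionless(utilities):
    """Map the worst available option to 0 and the best to 1."""
    lo, hi = min(utilities.values()), max(utilities.values())
    return {o: (u - lo) / (hi - lo) for o, u in utilities.items()}

# Raw numbers are invented stand-ins; only their ordering and ratios of
# differences matter, and the dimensionless form keeps exactly that.
hell = {"satan_date": -50.0, "fire_whale": -5.0e9, "paperclip": -5.0e12}
for option, x in dimensionless(hell).items():
    print(option, round(x, 3))
# satan_date 1.0, fire_whale 0.999, paperclip 0.0
```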

The advantages with this approach are:

  1. The numbers are more intuitive. -5e12 QALYs, -1 QALY, and -50 QALYs from a normal day, or the equivalent in tasty sandwiches, just doesn't have the same feeling of clarity as 0, 1 and .999. (For me at least. And yes I know those numbers don't quite match.)

  2. Not having to relate the problem quantities to far-away datums or drastically misappropriated units (tasty sandwiches, for this problem) makes the numbers easier and more direct to come up with. Also, we have to come up with fewer of them. The problem is self-contained.

  3. If defined right, the connection between probability and utility becomes extra-clear. For example: What chance between a Satan-date and a paperclipper would make me indifferent with a lake-of-fire-whale-month? 0.999! Unitless magic!

  4. All confusing redundant information (like negative signs) is removed, which makes it harder to accidentally do numerology or commit a type error.

  5. All redundant information is removed, which means you find many more similarities between problems. The value of this in general cannot be overstated. Just look at the generalizations made about Reynolds number! "[vortex shedding] occurs for any fluid, size, and speed, provided that Re between ~40 and 10^3". What! You can just say that in general? Magic! I haven't actually done enough utility problems to know that we'll find stuff like that, but I trust dimensionless form.

Anyways, it seems that going on that date is what I ought to do. So did we need a concept of awfulness? Did it matter that all the options sucked? Nope; the decision was isomorphic in every way to choosing lunch between a BLT, a turkey club, and a handful of dirt.

There are some assumptions in that lunch bit, and they're worth discussing. It seems counterintuitive, or even wrong, to say that your decision process when faced with lunch should be the same as when faced with a decision involving torture, rape, and paperclips. The latter seems somehow more important. Where does that come from? Is it right?

This may deserve a bigger discussion, but basically, if you have finite resources (thought-power, money, energy, stress) that are conserved or even related across decisions, you get coupling of "different" decisions in a way that we didn't have here. Your intuitions are calibrated for that case. Once you have decoupled the decision by coming up with the actual candidate options, the depths-of-hell decision and the lunch decision really are totally isomorphic. I'll probably address this properly later, if I discuss the instrumental utility of resources.

Anyways, once you put the problem in dimensionless form, a lot of decisions that seemed very different become almost the same, and a lot of details that seemed important or confusing just disappear. Bask in the clarifying power of a good abstraction.

Utility is Personal

So far we haven't touched the issue of interpersonal utility. That's because that topic isn't actually about VNM utility! There was nothing in the axioms above about there being a utility for each {person, outcome} pair, only for each outcome.

It turns out that if you try to compare utilities between agents, you have to touch unshielded utilities, which means you get radiation poisoning and go to type-theory hell. Don't try it.

And yet, it seems like we ought to care about what others prefer, and not just our own self-interest. But that caring belongs inside the utility function, in moral philosophy, not out here in decision theory.

VNM has nothing to say on the issue of utilitarianism besides the usual preference-uncertainty interaction constraints, because VNM is about the preferences of a single agent. If that single agent cares about the preferences of other agents, that goes inside the utility function.

Conversely, because VNM utility is out here, axiomatized for the sovereign preferences of a single agent, we don't much expect it to show up in there, in a discussion of utilitarian preference aggregation. In fact, if we do encounter it in there, it's probably a sign of a failed abstraction.

Living with Utility

Let's go back to how much work utility does as a concept. I've spent the last few sections hammering on the work that utility does not do, so you may ask "It's nice that utility theory can constrain our bets a bit, but do I really have to define my utility function by pinning down the relative utilities of every single possible outcome?".

Sort of. You can take shortcuts. We can, for example, wonder all at once whether, for all possible worlds where such is possible, you are indifferent between saving n lives and {50%: saving 2*n; 50%: saving 0}.

If that seems reasonable and doesn't break in any case you can think of, you might keep it around as heuristic in your ad-hoc utility function. But then maybe you find a counterexample where you don't actually prefer the implications of such a rule. So you have to refine it a bit to respond to this new argument. This is OK; the math doesn't want you to do things you don't want to.
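One way to see what such a sweeping rule commits you to (a hypothetical check of my own, not a procedure from the post): a utility linear in lives saved accepts the saving-n-lives indifference for every n, while a strictly concave one refuses it.

```python
import math

def indifferent(u, n):
    """Is u(n) the expected utility of {50%: 2n lives; 50%: 0 lives}?"""
    return math.isclose(u(n), 0.5 * u(2 * n) + 0.5 * u(0))

linear = lambda lives: 3.0 * lives + 7.0   # any positive affine function works
concave = lambda lives: math.sqrt(lives)   # diminishing value per life saved

print(all(indifferent(linear, n) for n in (1, 10, 1000)))   # True
print(all(indifferent(concave, n) for n in (1, 10, 1000)))  # False
```

Finding a counterexample where you don't want the linear rule's implications is exactly the refinement step described above.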

So you can save a lot of small thought experiments by doing the right big ones, like above, but the more sweeping of a generalization you make, the more probable it is that it contains an error. In fact, conceptspace is pretty huge, so trying to construct a utility function without inside information is going to take a while no matter how you approach it. Something like disassembling the algorithms that produce your intuitions would be much more efficient, but that's probably beyond science right now.

In any case, in the interim, before we figure out how to formally reason the whole thing out in advance, we have to get by with some good heuristics and our current intuitions, with a pinch of last-minute sanity checking against the VNM rules. Ugly, but better than nothing.

The whole project is made quite a bit harder in that we are not just trying to reconstruct an explicit utility function from revealed preference; we are trying to construct a utility function for a system that doesn't even currently have consistent preferences.

At some point, either the concept of utility isn't really improving our decisions, or it will come in conflict with our intuitive preferences. In some cases it's obvious how to resolve the conflict, in others, not so much.

But if VNM contradicts our current preferences, why do we think it's a good idea at all? Surely it's not wise to be tampering with our very values?

The reason we like VNM is that we have a strong meta-intuition that our preferences ought to be internally consistent, and VNM seems to be the only way to satisfy that. But it's good to remember that this is just another intuition, to be weighed against the rest. Are we ironing out garbage inconsistencies, or losing valuable information?

At this point I'm dangerously out of my depth. As far as I can tell, the great project of moral philosophy is an adult problem, not suited for mere mortals like me. Besides, I've rambled long enough.


What a slog! Let's review:

  • Maximize expected utility, where utility is just an encoding of your preferences that ensures a sane reaction to uncertainty.

  • Don't try to do anything else with utilities, or demons may fly out of your nose. This especially includes looking at the sign or magnitude, and comparing between agents. I call these things "numerology" or "interacting with an unshielded utility".

  • The default for utilities is that utilities are monolithic and inseparable from the entire outcome they are associated with. It takes special structure in your utility function to be able to talk about the marginal utility of something independently of particular outcomes.

  • We have to use the difference-and-ratio ritual to summon the utilities into the real numbers. Record utilities using explicit units and datum, and use dimensionless form for your calculations, which will make many things much clearer and more robust.

  • If you use a VNM basis, you don't need a concept of awfulness, just awesomeness.

  • If you want to do philosophy about the shape of your utility function, make sure you phrase it in terms of lotteries, because that's what utility is about.

  • The desire to use VNM is just another moral intuition in the great project of moral philosophy. It is conceivable that you will have to throw it out if it causes too much trouble.

  • VNM says nothing about your utility function. Consequentialism, hedonism, utilitarianism, etc are up to you.
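The "explicit units and datum" recommendation above can be made concrete with a small sketch. The outcome names and numbers here are invented for illustration; the point is that utilities are only defined up to a positive affine transform, so only the dimensionless form is meaningful:

```python
def dimensionless(u, datum, unit):
    """Express utilities relative to `datum`, in units of (unit - datum)."""
    scale = u[unit] - u[datum]
    return {k: (v - u[datum]) / scale for k, v in u.items()}

# Raw utility numbers (arbitrary, made-up outcomes):
raw = {"status_quo": 0.0, "vacation": 2.0, "promotion": 5.0}
d = dimensionless(raw, "status_quo", "promotion")
# d == {'status_quo': 0.0, 'vacation': 0.4, 'promotion': 1.0}

# A positive affine rescaling (3u + 7) leaves the dimensionless form unchanged:
rescaled = {k: 3 * v + 7 for k, v in raw.items()}
d2 = dimensionless(rescaled, "status_quo", "promotion")
print(d == d2)  # True
```

Any "numerology" done on the raw numbers (signs, magnitudes, cross-agent comparison) would give different answers for `raw` and `rescaled`, which is exactly why it's forbidden.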

A fungibility theorem

21 Nisan 12 January 2013 09:27AM

Restatement of: If you don't know the name of the game, just tell me what I mean to you. Alternative to: Why you must maximize expected utility. Related to: Harsanyi's Social Aggregation Theorem.

Summary: This article describes a theorem, previously described by Stuart Armstrong, that tells you to maximize the expectation of a linear aggregation of your values. Unlike the von Neumann-Morgenstern theorem, this theorem gives you a reason to behave rationally.1

continue reading »

Why you must maximize expected utility

20 Benja 13 December 2012 01:11AM

This post explains von Neumann-Morgenstern (VNM) axioms for decision theory, and what follows from them: that if you have a consistent direction in which you are trying to steer the future, you must be an expected utility maximizer. I'm writing this post in preparation for a sequence on updateless anthropics, but I'm hoping that it will also be independently useful.

The theorems of decision theory say that if you follow certain axioms, then your behavior is described by a utility function. (If you don't know what that means, I'll explain below.) So you should have a utility function! Except, why should you want to follow these axioms in the first place?

A couple of years ago, Eliezer explained how violating one of them can turn you into a money pump — how, at time 11:59, you will want to pay a penny to get option B instead of option A, and then at 12:01, you will want to pay a penny to switch back. Either that, or the game will have ended and the option won't have made a difference.
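The money pump is easy to simulate. Here is a toy version with a three-option preference cycle rather than Eliezer's two-option timed example; the options, prices, and cycle are invented for illustration:

```python
# An agent with cyclic preferences A < B < C < A pays a penny for each
# "upgrade" and ends up back where it started, strictly poorer.
prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}

def trade(held, offered):
    """Agent pays 1 penny to swap whenever it prefers the offered option."""
    if prefers.get((held, offered)) == offered:
        return offered, 1
    return held, 0

held, pennies_paid = "A", 0
for offered in ["B", "C", "A", "B", "C", "A"]:  # cycle the offers twice
    held, cost = trade(held, offered)
    pennies_paid += cost

print(held, pennies_paid)  # A 6 -- back to A, six pennies poorer
```

Each individual trade looks like an improvement to the agent, which is what makes the exploit possible.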

When I read that post, I was suitably impressed, but not completely convinced: I would certainly not want to behave one way if behaving differently always gave better results. But couldn't you avoid the problem by violating the axiom only in situations where it doesn't give anyone an opportunity to money-pump you? I'm not saying that would be elegant, but is there a reason it would be irrational?

It took me a while, but I have since come around to the view that you really must have a utility function, and really must behave in a way that maximizes the expectation of this function, on pain of stupidity (or at least that there are strong arguments in this direction). But I don't know any source that comes close to explaining the reason, the way I see it; hence, this post.

I'll use the von Neumann-Morgenstern axioms, which assume probability theory as a foundation (unlike the Savage axioms, which actually imply that anyone following them has not only a utility function but also a probability distribution). I will assume that you already accept Bayesianism.


Epistemic rationality is about figuring out what's true; instrumental rationality is about steering the future where you want it to go. The way I see it, the axioms of decision theory tell you how to have a consistent direction in which you are trying to steer the future. If my choice at 12:01 depends on whether at 11:59 I had a chance to decide differently, then perhaps I won't ever be money-pumped; but if I want to save as many human lives as possible, and I must decide between different plans that have different probabilities of saving different numbers of people, then it starts to at least seem doubtful that which plan is better at 12:01 could genuinely depend on my opportunity to choose at 11:59.

So how do we formalize the notion of a coherent direction in which you can steer the future?

continue reading »

A (small) critique of total utilitarianism

37 Stuart_Armstrong 26 June 2012 12:36PM

In total utilitarianism, it is a morally neutral act to kill someone (in a painless and unexpected manner) and creating/giving birth to another being of comparable happiness (or preference satisfaction or welfare). In fact if one can kill a billion people to create a billion and one, one is morally compelled to do so. And this is true for real people, not just thought experiment people - living people with dreams, aspirations, grudges and annoying or endearing quirks. To avoid causing extra pain to those left behind, it is better that you kill off whole families and communities, so that no one is left to mourn the dead. In fact the most morally compelling act would be to kill off the whole of the human species, and replace it with a slightly larger population.

We have many real world analogues to this thought experiment. For instance, it seems that there is only a small difference between the happiness of richer nations and poorer nations, while the former consume many more resources than the latter. Hence to increase utility we should simply kill off all the rich, and let the poor multiply to take their place (continually bumping off any of the poor who get too rich). Of course, the rich world also produces most of the farming surplus and the technological innovation that allow us to support a larger population. So we should aim to kill everyone in the rich world apart from farmers and scientists - and enough support staff to keep these professions running (Carl Shulman correctly points out that we may require most of the rest of the economy as "support staff". Still, it's very likely that we could kill off a significant segment of the population - those with the highest consumption relative to their impact on farming and science - and still "improve" the situation).

Even if it turns out to be problematic to implement in practice, a true total utilitarian should be thinking: "I really, really wish there was a way to do targeted killing of many people in the USA, Europe and Japan, large parts of Asia and Latin America and some parts of Africa - it makes me sick to the stomach to think that I can't do that!" Or maybe: "I really, really wish I could make everyone much poorer without affecting the size of the economy - I wake up at night with nightmares because these people remain above the poverty line!"

I won't belabour the point. I find those actions personally repellent, and I believe that nearly everyone finds them somewhat repellent or at least did so at some point in their past. This doesn't mean that it's the wrong thing to do - after all, the accepted answer to the torture vs dust speck dilemma feels intuitively wrong, at least the first time. It does mean, however, that there must be very strong countervailing arguments to balance out this initial repulsion (maybe even a mathematical theorem). For without that... how to justify all this killing?

Hence for the rest of this post, I'll be arguing that total utilitarianism is built on a foundation of dust, and thus provides no reason to go against your initial intuitive judgement in these problems. The points will be:

continue reading »

Is risk aversion really irrational?

42 kilobug 31 January 2012 08:34PM
Disclaimer: this started as a comment to Risk aversion vs. concave utility function but it grew way too big, so I turned it into a full-blown article. I posted it to Main since I believe it to be useful enough, and since it replies to an article in Main.


When you have to choose between two options, one with a certain (or almost certain) outcome and another that involves more risk, there is always a cost to the gamble, even if in terms of utilons (paperclips, money, ...) the gamble has a higher expectation: between the time you make your decision and the time you learn whether the gamble failed or succeeded (between the time you bought your lottery ticket and the time the winning number is called), you have less precise information about the world than if you had taken the "safe" option. That uncertainty may force you to make suboptimal choices during the period of doubt, meaning that "risk aversion" is not totally irrational.

Even shorter: knowledge has value since it allows you to optimize; taking a risk temporarily lowers your knowledge, and this is a cost.

Where does risk aversion come from?

In his (or her?) article, dvasya gave one possible reason for it: risk aversion comes from a concave utility function. Take food, for example. When you're really hungry, having not eaten for days, a bit of food has a very high value. But when you've just eaten and have a stock of food at home, food has low value. Many other things follow, more or less strongly, a non-linear utility function.

But if you adjust the bets for the utility, then, if you're a perfect utility maximizer, you should choose the highest expectation, regardless of the risk involved. Between being sure of getting 10 utilons and having a 0.1 chance of getting 101 utilons (and a 0.9 chance of getting nothing), you should take the bet. Or you're not rational, says dvasya.

My first objection to this is that we aren't perfect utility maximizers. We run on limited (and flawed) hardware, with limited power for computation. The first problem with taking a risk is that it makes all further computations much harder. You buy a lottery ticket, and until you know whether you won, every time you decide what to do you'll have to ponder things like "if I win the lottery, I'll buy a new house, so is it really worth fixing that broken door now?" Asking yourself all those questions means you're less Free to Optimize, and will use your limited hardware to ponder those issues, leading to stress, fatigue and less efficient decision making.

For us humans with limited and buggy hardware, those problems are significant, and they are the main reason I am personally (slightly) risk-averse. I don't like uncertainty: it makes planning harder, and it makes me waste precious computing power pondering what to do. But that doesn't seem to apply to a perfect utility maximizer with infinite computing power. So risk aversion seems to be a consequence of biases, if not a bias in itself. Is it really?

The double-bet of Clippy

So, let's take Clippy. Clippy is a pet paperclip optimizer, using the utility function proposed by dvasya: u = sqrt(p), where p is the number of paperclips in the room he lives in. In addition to being cute and loving paperclips, our Clippy has lots of computing power, so much that he has no trouble tracking probabilities. Now we'll offer Clippy some bets, and see what he should do.

Timeless double-bet

At the beginning, we put 9 paperclips in the room. Clippy has 3 utilons. He purrs a bit to show us he's happy with those 9 paperclips, looks at us with his lovely eyes, and hopes we'll give him more.

But we offer him a bet : either we give him 7 paperclips, or we flip a coin. If the coin comes up heads, we give him 18 paperclips. If it comes up tails, we give him nothing.

If Clippy doesn't take the bet, he gets 16 paperclips in total, so u=4. If Clippy takes the bet, he has 9 paperclips (u=3) with p=0.5 or 9+18=27 paperclips (u=5.20) with p=0.5. His expected utility is u=4.10, so he should take the bet.

Now, regardless of whether he took the first bet (called B1 from now on), we offer him a second bet (B2): this time, he has to pay us 9 paperclips to enter. Then we roll a 10-sided die. If it shows 1 or 2, we give him a jackpot of 100 paperclips; otherwise, nothing. Clippy can be in three states when offered the second deal:

  1. He didn't take B1. Then he has 16 clips. If he doesn't take B2, he stays with 16 clips, and u=4. If he takes B2, he'll have 7 clips with p=0.8 or 107 clips with p=0.2, for an expected utility of u=4.19.
  2. He did take B1, and lost it. He has 9 clips. If he doesn't take B2, he stays with 9 clips, and u=3. If he takes B2, he'll have 0 clips with p=0.8 or 100 clips with p=0.2, for an expected utility of u=2.
  3. He did take B1, and won it. He has 27 clips. If he doesn't take B2, he stays with 27 clips, and u=5.20. If he takes B2, he'll have 18 clips with p=0.8 or 118 clips with p=0.2, for an expected utility of u=5.57.

So, if Clippy didn't take the first bet, or if he took it and won, he should take the second bet. If he took the first bet and lost, he can't afford the second bet, since he risks a very bad outcome: no more paperclips, not even a single tiny one!

And the devil "time" comes in...

Now, let's make things a bit more complicated, and more realistic. Before, we ran things fully sequentially: first we resolved B1, and then we offered and resolved B2. But let's change B1 a tiny bit. We don't flip the coin and hand over the clips immediately. Clippy tells us whether he takes B1 or not, but we wait one day before giving him the clips if he didn't take the bet, or before flipping the coin and then giving him the clips if he did.

Clippy's utility function doesn't involve time, and we'll assume it doesn't change whether he gets the clips tomorrow or today. So for him, the new B1 is exactly like the old B1.

But now we offer him B2 after Clippy has made his choice on B1 (taking the bet or not) but before the B1 coin is flipped, if he took the bet.

Now, for Clippy, there are only two situations: he took B1 or he didn't. If he didn't take B1, we are in the same situation as before, with an expected utility of u=4.19.

If he did take B1, we have to consider 4 possibilities :

  1. He loses both bets. Then he ends up with no paperclips (9+0-9), and is very unhappy: u=0 utilons. That arises with p=0.4.
  2. He wins B1 and loses B2. Then he ends up with 9+18-9 = 18 paperclips, so u=4.24, with p=0.4.
  3. He loses B1 and wins B2. Then he ends up with 9-9+100 = 100 paperclips, so u=10, with p=0.1.
  4. He wins both bets. Then he gets 9+18-9+100 = 118 paperclips, so u=10.86, with p=0.1.

In the end, if he takes B2, he ends up with an expected utility of u=3.78.

So, if Clippy takes B1, he then shouldn't take B2. Since he doesn't know whether he won or lost B1, he can't afford the risk of taking B2.

But should he take B1 in the first place? If, when offered B1, he knows he'll be offered B2 later on, then he should refuse B1 and take B2, for an expected utility of 4.19. If, when offered B1, he doesn't know about B2, then taking B1 seems the more rational choice. But once he has taken B1, until he knows whether he won, he cannot afford to take B2.

The Python code

For people interested in these issues, here is a simple Python script I used to fine-tune the numerical parameters of the double-bet issue so that my numbers lead to the problem I was pointing at. Feel free to play with it ;)
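The original script isn't included in this excerpt; a minimal reconstruction of the calculations above might look like this:

```python
# Reconstruction (not the author's original script) of the double-bet numbers.
from itertools import product

def u(clips):
    return clips ** 0.5  # Clippy's utility: u = sqrt(p)

# Starting stock: 9 clips.
# B1: +7 for sure, or a coin flip for +18 / +0.
# B2: pay 9 clips to roll a d10; a 1 or 2 pays out 100 clips.

def eu_b2(clips, take):
    """Expected utility of (not) taking B2 from a known clip count."""
    if not take:
        return u(clips)
    return 0.8 * u(clips - 9) + 0.2 * u(clips - 9 + 100)

# Sequential case: B1 is resolved before B2 is offered.
print(round(eu_b2(16, True), 2))  # 4.19 > 4.00: take B2
print(round(eu_b2(9, True), 2))   # 2.0  < 3.00: refuse B2
print(round(eu_b2(27, True), 2))  # 5.57 > 5.20: take B2

# Delayed case: B2 must be decided before B1's coin is flipped.
eu_both = sum(p1 * p2 * u(9 + w1 - 9 + w2)
              for (p1, w1), (p2, w2) in product([(0.5, 0), (0.5, 18)],
                                                [(0.8, 0), (0.2, 100)]))
print(round(eu_both, 2))  # 3.78 < 4.19: better to skip B1 and take only B2
```

The four products in `eu_both` correspond to the four possibilities listed above (lose-lose, win-lose, lose-win, win-win).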

A hunter-gatherer tale

If you didn't like my Clippy, despite him being cute, and purring of happiness when he sees paperclips, let's shift to another tale.

Daneel is a young hunter-gatherer. He's smart, but his father committed a crime when Daneel was still a baby, and was exiled from the tribe. Daneel doesn't know much about the crime - no one speaks of it, and he doesn't dare bring up the topic himself. He has a low social status in the tribe because of that story. Nonetheless, he's attracted to Dors, the daughter of the chief. And he knows Dors likes him back, for she always smiles at him when she sees him, never makes fun of him, and gave him a nice knife after his coming-of-age ceremony.

According to the laws of the tribe, Dors can choose her husband freely, and the husband will become the new chief. But Dors also has to choose a husband who is accepted by the rest of the tribe; if the tribe doesn't accept the leadership, they could revolt, or fail to obey. And that could lead to disaster for the whole tribe. Daneel knows he has to raise his status in the tribe if he wants Dors to be able to choose him.

So Daneel wanders further and further into the forest. He wants to find something new to show the tribe his usefulness. One day, going a bit further than usual, he finds a place more humid than the forest the tribe usually roams. It has a new kind of tree he has never seen before. Lots of them. And they carry a yellow-red fruit that looks yummy. "I could tell the others about this place, and bring them a few fruits. But then, what if the fruit makes them sick? They'll blame me, I'll lose all my chances... they may even banish me. But I can do better. I'll eat one of the fruits myself. If I'm not sick tomorrow, then I'll bring fruits to the tribe, and show them where I found them. They'll praise me for it. And maybe Dors will then be able to take me more seriously... and if I get sick, well, everyone gets sick every now and then; just one fruit shouldn't kill me, it won't be a big deal." So Daneel makes his utility calculation (I told you he was smart!), and finds a positive expected outcome. So he takes the risk: he picks one fruit, and eats it. Sweet, a bit acidic but not too much. Nice!

Now Daneel goes back to the tribe. On the way back he got a rabbit, a few roots and some plants for the shaman - an average day. But then he sees the tribe gathered around the central totem. In the middle of the tribe, Dors with... no... not him... Eto! Eto is the strongest lad of Daneel's age. He wants Dors too. And he's strong, and very skilled with the bow. The other hunters like him; he's a real man. And Eto's father died proudly, defending the tribe's stock of dried meat against hungry wolves two winters ago. But no! Not that! Eto is asking Dors to marry him. In public. Dors can refuse, but if she does so with no reason, she'll turn half of the tribe against her, and she can't afford that. Eto is way too popular.

"Hey, Daneel ! You want Dors ? Challenge Eto ! He's strong and good with the bow, but in unarmed combat, you can defeat him, I know it.", whispers Hari, one of the few friends of Daneel.

Daneel starts thinking faster than he ever has. "OK, I can challenge Eto to unarmed combat. If I lose, I'll be wounded; Eto won't be nice to me. But he won't kill or cripple me - that would make half the tribe hate him. If I lose, it'll confirm I'm physically weak, but I'll also win prestige for daring to defy the strong Eto, so it shouldn't change much. And if I win, Dors will be able to refuse Eto, since he lost a fight against someone weaker than him; that's a huge win. So I should take that gamble... but then, there is the fruit. If the fruit makes me sick, on top of my wounds from Eto, I may die. Even if I win! And if I lose, get beaten, and then get sick... they'll probably let me die. They won't take care of a fatherless lad who loses a fight and then gets sick. Too weak to be worth it. So... should I take the gamble? If only Eto had waited just one day more... Or if only I knew whether I'll get sick or not..."

The key : information loss

If only Clippy knew? If only Daneel knew? That's the key to risk aversion, and why a perfect utility maximizer, if his utility function is concave in at least some respects, should still have some risk aversion. Because risk comes with information loss. That's the difference between the timeless double-bet and the one with a day's delay for Clippy. Or the problem Daneel got stuck in.

If you take a bet, then until you know its outcome, you have less information about the state of the world, and especially about the part of it that directly concerns you, than if you had chosen the safe option (a situation with a lower deviation). Having less information means you're less free to optimize.

Even a perfect utility maximizer can't know what bets he'll be offered and what decisions he'll have to take, unless he's omniscient (and then he wouldn't take bets or risks at all, since he would know the future: probability only reflects lack of information). So he has to account for the information loss involved in taking a bet.

In real life, the most common case is the non-linearity of bad effects: you can lose 0.5L of blood without many side effects (drink lots of water, sleep well, and the next day you're fine; that's what happens when you donate blood), but if you lose 2L, you'll likely die. Or if you lose some money, you'll be in trouble, but if you lose the same amount again, you may end up kicked out of your house because you can't pay the rent - and that'll be more than twice as bad as the initial loss.

So when you've taken a bet, risking a bad effect, you can't afford to take another bet (even one with, in absolute terms, a higher expected gain) until you know whether you won or lost the first one - because losing both means death, or being kicked out of your house, or the ultimate pain of not having a single paperclip.

Taking a bet always has a cost: it costs you part of your ability to predict, and therefore to optimize.

A possible solution

A possible solution to this problem would be to consider all the possible decisions you may have to take during the period when you don't know whether you won or lost your first bet, weight them by the probability of being offered those decisions, and compare their possible outcomes depending on whether you take the first bet or not. But how do you compute "their possible outcomes"? That requires considering all the possible bets you could be offered during the time needed to resolve your second bet, and their possible outcomes. So you need to... stack overflow: maximum recursion depth exceeded.

Since taking a bet will affect your ability to evaluate possible outcomes in the future, you have a "strange loop to the meta-level", an infinite recursion. Your decision algorithm has to consider the impact the decision will have on future instances of your decision algorithm.

I don't know whether there is a mathematical solution to that infinite recursion that makes it converge (as there is in some cases). But the problem looks really hard, and may not even be computable.
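One crude way to cut off the recursion is to bound the lookahead depth, treating deeper futures as if no more bets will arrive. This sketch is only an illustration of that idea; the offer distribution and the depth are invented assumptions, not a real solution to the convergence problem:

```python
import math

def u(clips):
    return math.sqrt(max(clips, 0))  # Clippy-style concave utility

# Each offer is (cost, win_probability, payout); the agent may take or refuse.
OFFERS = [(9, 0.2, 100), (0, 0.5, 18)]

def value(clips, depth):
    """Expected utility of holding `clips`, looking `depth` offers ahead."""
    if depth == 0:
        return u(clips)
    total = 0.0
    for cost, p, payout in OFFERS:  # one offer arrives, uniformly at random
        refuse = value(clips, depth - 1)
        take = -math.inf if clips < cost else (
            p * value(clips - cost + payout, depth - 1)
            + (1 - p) * value(clips - cost, depth - 1))
        total += max(take, refuse)
    return total / len(OFFERS)

print(value(9, 2) > value(9, 0))  # True: the lookahead changes the valuation
```

Note that the cost of this truncation is exactly the kind of error the article is describing: decisions beyond the horizon are invisible, so the agent may still walk into a Clippy-style trap just past the depth limit.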

Just factoring in an average "risk aversion" that penalizes outcomes involving risk (with a higher penalty the longer you have to wait to learn whether you won or lost) sounds more like a fix for that problem than like a bias.

Risk aversion vs. concave utility function

1 dvasya 31 January 2012 06:25AM

In the comments to this post, several people independently stated that being risk-averse is the same as having a concave utility function. There is, however, a subtle difference here. Consider the example proposed by one of the commenters: an agent with a utility function

u = sqrt(p) utilons for p paperclips.

The agent is being offered a choice between making a bet with a 50/50 chance of receiving a payoff of 9 or 25 paperclips, or simply receiving 16.5 paperclips. The expected payoff of the bet is a full 9/2 + 25/2 = 17 paperclips, yet its expected utility is only 3/2 + 5/2 = 4 = sqrt(16) utilons which is less than the sqrt(16.5) utilons for the guaranteed deal, so our agent goes for the latter, losing 0.5 expected paperclips in the process. Thus, it is claimed that our agent is risk averse in that it sacrifices 0.5 expected paperclips to get a guaranteed payoff.
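The arithmetic here is easy to check with a few lines (a sketch, not part of the original post):

```python
# u = sqrt(p); the bet is a 50/50 chance of 9 or 25 paperclips.
expected_clips = 0.5 * 9 + 0.5 * 25                   # 17.0 paperclips
expected_utility = 0.5 * 9 ** 0.5 + 0.5 * 25 ** 0.5   # 0.5*3 + 0.5*5 = 4.0 utilons

print(expected_clips)                  # 17.0
print(16.5 ** 0.5 > expected_utility)  # True: sqrt(16.5) ~ 4.06 beats the bet
```

So the agent gives up 17 - 16.5 = 0.5 expected paperclips, but gains about 0.06 expected utilons, which is the only currency it actually cares about.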

Is this a good model for the cognitive bias of risk aversion? I would argue that it's not. Our agent ultimately cares about utilons, not paperclips, and in the current case it does perfectly fine at rationally maximizing expected utilons. A cognitive bias should be, instead, some irrational behavior pattern that can be exploited to take utility (rather than paperclips) away from the agent. Consider now another agent, with the same utility function as before, but who just has this small additional trait that it would strictly prefer a sure payoff of 16 paperclips to the above bet. Given our agent's utility function, 16 is the point of indifference, so could there be any problem with his behavior? Turns out there is. For example, we could follow the post on Savage's theorem (see Postulate #4). If the sure payoff of

16 paperclips = 4 utilons

is strictly preferred to the bet

{P(9 paperclips) = 0.5; P(25 paperclips) = 0.5} = 4 utilons,

then there must also exist some finite δ > 0 such that the agent must strictly prefer a guaranteed 4 utilons to betting on

{P(9) = 0.5 - δ; P(25) = 0.5 + δ} = 4 + 2δ utilons

- all at the loss of 2δ expected utilons! This is also equivalent to our agent being willing to pay a finite amount of paperclips to substitute the bet with the sure deal of the same expected utility.

What we have just seen falls pretty nicely within the concept of a bias. Our agent has a perfectly fine utility function, but it also has this other thing - let's name it "risk aversion" - that makes the agent's behavior fall short of being perfectly rational, and is independent of its concave utility function for paperclips. (Note that our agent has linear utility for utilons, but is still willing to pay some amount of those to achieve certainty) Can we somehow fix our agent? Let's see if we can redefine our utility function u'(p) in some way so that it gives us a consistent preference of

guaranteed 16 paperclips

over the

 {P(9) = 0.5; P(25) = 0.5}

bet, but we would also like to request that the agent would still strictly prefer the bet

{P(9 + δ) = 0.5; P(25 + δ) = 0.5}

to {P(16) = 1} for some finite δ > 0, so that our agent is not infinitely risk-averse. Can we say anything about this situation? Well, if u'(p) is continuous, there must also exist some number δ' such that 0 < δ' < δ and our agent will be indifferent between {P(16) = 1} and

{P(9 + δ') = 0.5; P(25 + δ') = 0.5}.

And, of course, being risk-averse (in the above-defined sense), our supposedly rational agent will prefer - no harm done - the guaranteed payoff to the bet of the same expected utility u'... Sounds familiar, doesn't it?

I would like to stress again that, although our first agent does have a concave utility function for paperclips, which causes it to reject bets with a higher expected payoff in paperclips in favor of guaranteed payoffs of fewer paperclips, it still maximizes its expected utilons, for which it has linear utility. Our second agent, however, has this extra property that causes it to sacrifice expected utilons to achieve certainty. And it turns out that with this property it is impossible to define a well-behaved utility function! It therefore seems natural to distinguish being rational with a concave utility function, on the one hand, from being risk-averse and unable to have a well-behaved utility function at all, on the other. The latter case seems much more subtle at first sight, but causes a more fundamental kind of problem. Which is why I feel that a clear, even if minor, distinction between the two situations is worth making explicit.

A rational agent can have a concave utility function. A risk-averse agent cannot be rational.

(Of course, even in the first case the question of whether we want a concave utility function is still open.)

The Human's Hidden Utility Function (Maybe)

44 lukeprog 23 January 2012 07:39PM

Suppose it turned out that humans violate the axioms of VNM rationality (and therefore don't act like they have utility functions) because there are three valuation systems in the brain that make conflicting valuations, and all three systems contribute to choice. And suppose that upon reflection we would clearly reject the outputs of two of these systems, whereas the third system looks something more like a utility function we might be able to use in CEV.

What I just described is part of the leading theory of choice in the human brain.

Recall that human choices are made when certain populations of neurons encode expected subjective value (in their firing rates) for each option in the choice set, with the final choice being made by an argmax or reservation price mechanism.

Today's news is that our best current theory of human choices says that at least three different systems compute "values" that are then fed into the final choice circuit:

  • The model-based system "uses experience in the environment to learn a model of the transition distribution, outcomes and motivationally-sensitive utilities." (See Sutton & Barto 1998 for the meanings of these terms in reinforcement learning theory.) The model-based system also "infers choices by... building and evaluating the search decision tree to work out the optimal course of action." In short, the model-based system is responsible for goal-directed behavior. However, making all choices with a goal-directed system using something like a utility function would be computationally prohibitive (Daw et al. 2005), so many animals (including humans) first evolved much simpler methods for calculating the subjective values of options (see below).

  • The model-free system also learns a model of the transition distribution and outcomes from experience, but "it does so by caching and then recalling the results of experience rather than building and searching the tree of possibilities. Thus, the model-free controller does not even represent the outcomes... that underlie the utilities, and is therefore not in any position to change the estimate of its values if the motivational state changes. Consider, for instance, the case that after a subject has been taught to press a lever to get some cheese, the cheese is poisoned, so it is no longer worth eating. The model-free system would learn the utility of pressing the lever, but would not have the informational wherewithal to realize that this utility had changed when the cheese had been poisoned. Thus it would continue to insist upon pressing the lever. This is an example of motivational insensitivity."

  • The Pavlovian system, in contrast, calculates values based on a set of hard-wired preparatory and consummatory "preferences." Rather than calculate value based on what is likely to lead to rewarding and punishing outcomes, the Pavlovian system calculates values consistent with automatic approach toward appetitive stimuli, and automatic withdrawal from aversive stimuli. Thus, "animals cannot help but approach (rather than run away from) a source of food, even if the experimenter has cruelly arranged things in a looking-glass world so that the approach appears to make the food recede, whereas retreating would make the food more accessible (Hershberger 1986)."

Or, as Jandila put it:

  • Model-based system: Figure out what's going on, and what actions maximize returns, and do them.
  • Model-free system: Do the thingy that worked before again!
  • Pavlovian system: Avoid the unpleasant thing and go to the pleasant thing. Repeat as necessary.
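The three-systems-into-argmax picture above might be sketched as follows. The numbers, the options, and the weighted-sum aggregation rule are invented for illustration only, not taken from the theory:

```python
def model_based(option):
    # Plans ahead: it knows the cheese is now poisoned, so the lever is worthless.
    return {"lever": 0.9 * 0.0, "forage": 0.6 * 1.0}[option]

def model_free(option):
    # Cached value from past experience; blind to the poisoning.
    return {"lever": 1.0, "forage": 0.3}[option]

def pavlovian(option):
    # Hard-wired approach/withdrawal tendencies.
    return {"lever": 0.2, "forage": 0.5}[option]

def choose(options, weights=(0.5, 0.3, 0.2)):
    """Final choice circuit: argmax over a weighted sum of the three systems."""
    wb, wf, wp = weights
    def total(o):
        return wb * model_based(o) + wf * model_free(o) + wp * pavlovian(o)
    return max(options, key=total)

print(choose(["lever", "forage"]))  # forage: the model-based valuation wins here
```

Shifting the weights toward the model-free system would make the agent keep pressing the lever despite the poisoned cheese, which is the motivational insensitivity described above.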

continue reading »
