Quick puzzle about utility functions under affine transformations
Here's a puzzle based on something I used to be confused about:
It is known that utility functions are equivalent (i.e. produce the same preferences over actions) up to a positive affine transformation: u'(x) = au(x) + b where a is positive.
Suppose I have u(vanilla) = 3, u(chocolate) = 8. I prefer an action that yields a 50% chance of chocolate over an action that yields a 100% chance of vanilla, because 0.5(8) > 1.0(3).
Under the positive affine transformation a = 1, b = 4; we get that u'(vanilla) = 7 and u'(chocolate) = 12. Therefore I now prefer the action that yields a 100% chance of vanilla, because 1.0(7) > 0.5(12).
How to resolve the contradiction?
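One way to see what goes wrong can be sketched in a few lines. The sketch assumes the other 50% of the chocolate gamble is an unnamed outcome, here labelled `nothing`, with u(nothing) = 0; the original leaves that outcome implicit. The names `expected_utility`, `affine` and `nothing` are illustrative only.

```python
def expected_utility(lottery, u):
    """Expected utility of a lottery given as {outcome: probability}."""
    return sum(p * u[outcome] for outcome, p in lottery.items())

def affine(u, a, b):
    """Apply u'(x) = a*u(x) + b to every outcome, including 'nothing'."""
    return {outcome: a * value + b for outcome, value in u.items()}

# Assumption: the missing 50% of the gamble is an outcome with utility 0.
u = {"vanilla": 3, "chocolate": 8, "nothing": 0}
u_prime = affine(u, a=1, b=4)

sure_vanilla = {"vanilla": 1.0}
chocolate_gamble = {"chocolate": 0.5, "nothing": 0.5}

for name, util in (("u", u), ("u'", u_prime)):
    print(name,
          expected_utility(sure_vanilla, util),
          expected_utility(chocolate_gamble, util))
```

Under this reading the gamble is preferred both before and after the transformation: probabilities sum to 1, so b shifts every lottery by the same amount. The calculation in the puzzle dropped the 0.5 u'(nothing) term, which is no longer zero once b = 4.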
Iterated Gambles and Expected Utility Theory
The Setup
I'm about a third of the way through Stanovich's Decision Making and Rationality in the Modern World. Basically, I've gotten through some of the more basic axioms of decision theory (Dominance, Transitivity, etc).
As I went through the material, I noted that there were a lot of these:
Decision 5. Which of the following options do you prefer (choose one)?
A. A sure gain of $240
B. 25% chance to gain $1,000 and 75% chance to gain nothing
The text goes on to show how most people tend to make irrational choices when confronted with decisions like this; most striking was how often irrelevant context and framing affected people's decisions.
But I understand the decision theory bit; my question is a little more complicated.
When I was choosing these options myself, I did what I've been taught by the rationalist community to do in situations where I am given nice, concrete numbers: I shut up and I multiplied, and at each decision chose the option with the highest expected utility.
Granted, I equated dollars to utility, which Stanovich does mention that humans don't do well (see Prospect Theory).
The Problem
In the above decision, option B clearly has the higher expected utility, so I chose it. But there was still a nagging doubt in my mind, some part of me that thought, if I was really given this option, in real life, I'd choose A.
So I asked myself: why would I choose A? Is this an emotion that isn't well-calibrated? Am I being risk-averse for gains but risk-taking for losses?
What exactly is going on?
And then I remembered the Prisoner's Dilemma.
A Tangent That Led Me to an Idea
Now, I'll assume that anyone reading this has a basic understanding of the concept, so I'll get straight to the point.
In classical decision theory, the choice to defect (rat the other guy out) is strictly superior to the choice to cooperate (keep your mouth shut). No matter what your partner in crime does, you get a better deal if you defect.
Now, I haven't studied the higher branches of decision theory yet (I have a feeling that Eliezer, for example, would find a way to cooperate and make his partner in crime cooperate as well; after all, rationalists should win.)
Where I've seen the Prisoner's Dilemma resolved is, oddly enough, in Dawkins's The Selfish Gene, which is where I was first introduced to the idea of an Iterated Prisoner's Dilemma.
The interesting idea here is that, if you know you'll be in the Prisoner's Dilemma with the same person multiple times, certain kinds of strategies become available that weren't possible in a single instance of the Dilemma. Partners in crime can be punished for defecting by future defections on your own behalf.
The key idea here is that I might have a different response to the gamble if I knew I could take it again.
The Math
Let's put on our probability hats and actually crunch the numbers:
Format - Probability: $Amount of Money | Probability: $Amount of Money
Assuming one picks A over and over again, or B over and over again.
Iteration   A        B
1           $240     1/4: $1,000 | 3/4: $0
2           $480     1/16: $2,000 | 6/16: $1,000 | 9/16: $0
3           $720     1/64: $3,000 | 9/64: $2,000 | 27/64: $1,000 | 27/64: $0
4           $960     1/256: $4,000 | 12/256: $3,000 | 54/256: $2,000 | 108/256: $1,000 | 81/256: $0
5           $1,200   1/1024: $5,000 | 15/1024: $4,000 | 90/1024: $3,000 | 270/1024: $2,000 | 405/1024: $1,000 | 243/1024: $0
And so on. (If I've made a mistake, please let me know.)
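The rows above are binomial distributions (n plays, win probability 1/4), so they can be regenerated exactly. This is a minimal sketch; the function name `b_distribution` is made up for illustration, and `Fraction` prints in lowest terms, so 6/16 shows up as 3/8.

```python
from fractions import Fraction
from math import comb

def b_distribution(n, p=Fraction(1, 4), prize=1000):
    """Exact distribution of total winnings after taking option B n times."""
    return {k * prize: comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n, -1, -1)}

for n in range(1, 6):
    row = " | ".join(f"{prob}: ${total:,}"
                     for total, prob in b_distribution(n).items())
    print(n, f"A: ${240 * n:,}", "B:", row)
```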
The Analysis
It is certainly true that, in terms of expected money, option B outperforms option A no matter how many times one takes the gamble, but instead, let's think in terms of anticipated experience - what we actually expect to happen should we take each bet.
The first time we take option B, we note that there is a 75% chance that we walk away disappointed. That is, if one person chooses option A, and four people choose option B, on average three out of those four people will underperform the person who chose option A. And it probably won't come as much consolation to the three losers that the winner won significantly bigger than the person who chose A.
And since nothing unusual ever happens, we should think that, on average, having taken option B, we'd wind up underperforming option A.
Now let's look at further iterations. In the second iteration, having taken option B twice, we're more likely to have nothing (9/16, about 56%) than to have anything.
In the third iteration, there's about a 57.8% chance that we'll have outperformed the person who chose option A the whole time, and a 42.2% chance that we'll have nothing.
In the fourth iteration, there's a 73.8% chance that we'll have matched or done worse than the person who has chosen option A four times (I'm rounding a bit, $1,000 isn't that much better than $960).
In the fifth iteration, the above percentage drops to 63.3%.
Now, without doing a longer analysis, I can tell that option B will eventually win. That was obvious from the beginning.
But there's still a better than even chance you'll wind up with less, picking option B, than by picking option A. At least for the first five times you take the gamble.
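The percentages above can be checked with a short calculation. This sketch uses a strict comparison (B's total strictly below A's), so for four iterations it gives 81/256 (about 31.6%) rather than 73.8%; the figure in the text also counts a $1,000 total as roughly matching $960. The function name `prob_b_below_a` is illustrative.

```python
from fractions import Fraction
from math import comb

def prob_b_below_a(n, p=Fraction(1, 4), prize=1000, sure=240):
    """Probability that n plays of B total strictly less than n plays of A."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if k * prize < n * sure
    )

for n in range(1, 6):
    print(n, f"{float(prob_b_below_a(n)):.1%}")
```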
Conclusions
If we act to maximize expected utility, we should choose option B, at least so long as I hold that dollars=utility. And yet it seems that one would have to take option B a fair number of times before it becomes likely that any given person, taking the iterated gamble, will outperform a different person repeatedly taking option A.
In other words, of the 1025 people taking the iterated gamble:
we expect 1 to walk away with $1,200 (from taking option A five times),
we expect 376 to walk away with more than $1,200, casting smug glances at the scaredy-cat who took option A the whole time,
and we expect 648 to walk away muttering to themselves about how the whole thing was rigged, casting dirty glances at the other 377 people.
After all the calculations, I still think that, if this gamble was really offered to me, I'd take option A, unless I knew for a fact that I could retake the gamble quite a few times. How do I interpret this in terms of expected utility?
Am I not really treating dollars as equal to utility, and discounting the marginal utility of the additional thousands of dollars that the 376 win?
What mistakes am I making?
Also, a quick trip to Google confirms my intuition that there is plenty of work on iterated decisions; does anyone know a good primer on them?
I'd like to leave you with this:
If you were actually offered this gamble in real life, which option would you take?
JFK was not assassinated: prior probability zero events
A lot of my work involves tweaking the utility or probability of an agent to make it believe - or act as if it believed - impossible or almost impossible events. But we have to be careful about this; an agent that believes the impossible may not be so different from one that doesn't.
Consider for instance an agent that assigns a prior probability of zero to JFK ever having been assassinated. No matter what evidence you present to it, it will go on disbelieving the "non-zero gunmen theory".
Initially, the agent will behave very unusually. If it was in charge of JFK's security in Dallas before the shooting, it would have sent all Secret Service agents home, because no assassination could happen. Immediately after the assassination, it would have disbelieved everything. The films would have been faked or misinterpreted; the witnesses, deluded; the dead body of the president, that of a twin or an actor. It would have had huge problems with the aftermath, trying to reject all the evidence of death, seeing a vast conspiracy to hide the truth of JFK's non-death, including the many other conspiracy theories that must be false flags, because they all agree with the wrong statement that the president was actually assassinated.
But as time went on, the agent's behaviour would start to become more and more normal. It would realise the conspiracy was incredibly thorough in its faking of the evidence. All avenues it pursued to expose them would come to naught. It would stop expecting people to come forward and confess the joke, and it would stop expecting to find radical new evidence overturning the accepted narrative. After a while, it would start to expect the next new piece of evidence to be in favour of the assassination idea - because if a conspiracy has been faking things this well so far, then they should continue to do so in the future. Though it cannot change its view of the assassination, its expectations for observations converge towards the norm.
If it does a really thorough investigation, it might stop believing in a conspiracy at all. At some point, a miracle will start to look more likely than a perfect but undetectable conspiracy. It is very unlikely that Lee Harvey Oswald shot at JFK, missed, and the president's head exploded simultaneously for unrelated natural causes. But after a while, such a miraculous explanation will start to become more likely than anything else the agent can consider. This explanation opens the possibility of miracles; but again, if the agent is very thorough, it will fail to find evidence of other miracles, and will probably settle on "an unrepeatable miracle caused JFK's death in a way that is physically undetectable".
But then note that such an agent will have a probability distribution over future events that is almost indistinguishable from a normal agent that just believes the standard story of JFK being assassinated. The zero-prior has been negated, not in theory but in practice.
How to do proper probability manipulation
This section is still somewhat a work in progress.
So the agent believes one false fact about the world, but its expectation is otherwise normal. This can be both desirable and undesirable. The negative is if we try and control the agent forever by giving it a false fact.
To see the positive, ask why would we want an agent to believe impossible things in the first place? Well, one example was an Oracle design where the Oracle didn't believe its output message would ever be read. Here we wanted the Oracle to believe the message wouldn't be read, but not believe anything else too weird about the world.
In terms of causality, if X designates the message being read at time t, and B and A are events before and after t, respectively, we want P(B|X)≈P(B) (probabilities about current facts in the world shouldn't change much) while P(A|X)≠P(A) is fine and often expected (the future should be different if the message is read or not).
In the JFK example, the agent eventually concluded "a miracle happened". I'll call this miracle a scrambling point. It's kind of a breakdown in causality: two futures are merged into one, given two different pasts. The two pasts are "JFK was assassinated" and "JFK wasn't assassinated", and their common scrambled future is "everything appears as if JFK was assassinated". The non-assassination belief has shifted the past but not the future.
For the Oracle, we want to do the reverse: we want the non-reading belief to shift the future but not the past. However, unlike the JFK assassination, we can try and build the scrambling point. That's why I always talk about messages going down noisy wires, or specific quantum events, or chaotic processes. If the past goes through a truly stochastic event (it doesn't matter whether there is true randomness or just that the agent can't figure out the consequences), we can get what we want.
The Oracle idea will go wrong if the Oracle concludes that non-reading must imply something is different about the past (maybe it can see through chaos in ways we thought it couldn't), just as the JFK assassination denier will continue to be crazy if it can't find a route to reach "everything appears as if JFK was assassinated".
But there is a break in the symmetry: the JFK assassination denier will eventually reach that point as long as the world is complex and stochastic enough, while the Oracle requires that the future probabilities be the same in all (realistic) past universes.
Now, once the Oracle's message has been read, the Oracle will find itself in the same situation as the other agent: believing an impossible thing. For Oracles, we can simply reset them. Other agents might have to behave more like the JFK assassination disbeliever. Though if we're careful, we can quantify things more precisely, as I attempted to do here.
One weird trick to turn maximisers into minimisers
A putative new idea for AI control; index here.
A simple and easy design for a u-maximising agent that turns into a u-minimising one.
Let X be some boolean random variable outside the agent's control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility u, consider the utility:
- u# = (2/ε)Xu - u.
Before t, the expected value of (2/ε)X is 2, so u# equals u in expectation. Hence the agent is a u-maximiser. After t, the most likely option is X=0, hence a little bit of evidence to that effect is enough to turn the agent into a u-minimiser.
This isn't perfect corrigibility - the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:
- u# = Ω(2/ε)Xu - u.
If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs to become a u-minimiser is likewise proportional to Ω, so X had better be a clear and convincing event.
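The sign flip and its dependence on Ω can be sketched numerically. This sketch assumes X is independent of u, so that E[u#] = (2Ω P(X=1)/ε - 1) u; the function names and the choice ε = 0.01 are illustrative, not part of the original design.

```python
def expected_coefficient(p_x, eps, omega=1.0):
    """Coefficient c with E[u#] = c * u, assuming X is independent of u.

    u# = omega * (2/eps) * X * u - u, so c = 2 * omega * P(X=1) / eps - 1.
    """
    return 2.0 * omega * p_x / eps - 1.0

def bayes_factor_to_flip(eps, omega=1.0):
    """Likelihood ratio in favour of X=0 at which the coefficient reaches zero."""
    p_flip = eps / (2.0 * omega)        # coefficient is zero at this P(X=1)
    prior_odds = eps / (1.0 - eps)
    flip_odds = p_flip / (1.0 - p_flip)
    return prior_odds / flip_odds

eps = 0.01
print(expected_coefficient(eps, eps))       # at the prior P(X=1) = eps
print(expected_coefficient(eps / 4, eps))   # after some evidence for X=0
for omega in (1, 10, 100):
    print(omega, bayes_factor_to_flip(eps, omega))
```

With these assumptions the required likelihood ratio works out to (2Ω - ε)/(1 - ε), so it grows roughly linearly in Ω.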
Utility, probability and false beliefs
A putative new idea for AI control; index here.
This is part of the process of rigourising and formalising past ideas.
Paul Christiano recently asked why I used utility changes, rather than probability changes, to have an AI believe (or act as if it believed) false things. While investigating that, I developed several different methods for achieving the belief changes that we desired. This post analyses these methods.
Different models of forced beliefs
Let x and ¬x refer to the future outcome of a binary random variable X (write P(x) as a shorthand for P(X=x), and so on). Assume that we want P(x):P(¬x) to be in the 1:λ ratio for some λ (since the ratio is all that matters, λ=∞ is valid, meaning P(x)=0). Assume that we have an agent, who has utility u, has seen past evidence e, and wishes to assess the expected utility of their action a.
Typically, for expected utility, we sum over the possible worlds. In practice, we almost always sum over sets of possible worlds, the sets determined by some key features of interest. In assessing the quality of health interventions, for instance, we do not carefully and separately treat each possible position of atoms in the sun. Thus let V be the set of variables or values we care about, and v a possible value vector V can take. As usual, we'll be writing P(v) as a shorthand for P(V=v). The utility function u assigns utilities to possible v's.
One of the advantages of this approach is that it can avoid many issues of conditionals like P(A|B) when P(B)=0.
The first obvious idea is to condition on x and ¬x:
- (1) Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a))
The second one is to use intersections rather than conditionals (as in this post):
- (2) Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a))
Finally, imagine that we have a set of variables H, that "screen off" the effects of e and a, up until X. Let h be a set of values H can take. Thus P(x|h,e,a)=P(x|h). One could see H as the full set of possible pre-X histories, but it could be much smaller - maybe just the local environment around X. This gives a third definition:
- (3) Σv Σh u(v)(P(v|h,x,e,a)+λP(v|h,¬x,e,a))P(h|e,a)
Changing and unchangeable P(x)
An important thing to note is that all three definitions are equivalent for fixed P(x), up to changes of λ. The equivalence of (2) and (1) derives from the fact that Σv u(v)(P(v,x|e,a)+λP(v,¬x|e,a)) = Σv u(v)(P(x)P(v|x,e,a)+λP(¬x)P(v|¬x,e,a)) (we write P(x) rather than P(x|e,a) since the probability of x is fixed). Factoring out the constant P(x), which does not affect the ranking of actions, shows that a type (2) agent with λ is equivalent to a type (1) agent with λ'=λP(¬x)/P(x).
Similarly, P(v|h,x,e,a)=P(v,h,x|e,a)/(P(x|h,e,a)*P(h|e,a)). Since P(x|h,e,a)=P(x), equation (3) reduces to Σv Σh u(v)(P(v,h,x|e,a)/P(x)+λP(v,h,¬x|e,a)/P(¬x)). Summing over h, this becomes Σv u(v)(P(v,x|e,a)/P(x)+λP(v,¬x|e,a)/P(¬x))=Σv u(v)(P(v|x,e,a)+λP(v|¬x,e,a)), ie the same as (1).
What about non-constant P(x)? Let c(x) and c(¬x) be two contracts that pay out under x and ¬x, respectively. If the utility u is defined as 1 if a payout is received (and 0 otherwise), it's clear that both agent (1) and agent (3) assess c(x) as having an expected utility of 1 while c(¬x) has an expected utility of λ. This assessment is unchanging, whatever the probability of x. Therefore agents (1) and (3), in effect, see the odds of x as being a constant ratio 1:λ.
Agent (2), in contrast, gets a one-off artificial 1:λ update to the odds of x and then proceeds to update normally. Suppose that X is a coin toss that the agent believes is fair, having extensively observed the coin. Then it will believe that the odds are 1:λ. Suppose instead that it observes the coin has a λ:1 odds ratio; then it will believe the true odds are 1:1. It will be accurate, with a 1:λ ratio added on.
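The contract example can be made concrete for agents (1) and (2). This is a sketch with u = 1 on payout and 0 otherwise; the function names and λ = 3 are illustrative assumptions, and agent (3) is omitted because it needs a model of H.

```python
def agent1_value(p_x, lam, pays_on_x):
    """Agent (1): sum_v u(v) (P(v|x) + lam * P(v|not x)); P(x) never enters."""
    p_payout_given_x = 1.0 if pays_on_x else 0.0
    p_payout_given_not_x = 0.0 if pays_on_x else 1.0
    return p_payout_given_x + lam * p_payout_given_not_x

def agent2_value(p_x, lam, pays_on_x):
    """Agent (2): sum_v u(v) (P(v, x) + lam * P(v, not x))."""
    p_payout_and_x = p_x if pays_on_x else 0.0
    p_payout_and_not_x = 0.0 if pays_on_x else 1.0 - p_x
    return p_payout_and_x + lam * p_payout_and_not_x

lam = 3.0
for p_x in (0.5, 0.75):  # a fair coin, then a coin with 3:1 odds for x
    print(p_x,
          agent1_value(p_x, lam, True), agent1_value(p_x, lam, False),
          agent2_value(p_x, lam, True), agent2_value(p_x, lam, False))
```

Agent (1) values c(x) and c(¬x) at 1 and λ whatever P(x) is, while agent (2) values them in the ratio 1:λ for the fair coin and 1:1 for the coin with λ:1 odds.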
The effects of this percolate backwards in time from X. Suppose that X was to be determined by the toss of one of two unfair coins, one with odds ε:1 and one with odds 1:ε. The agent would assess the odds of the first coin being used rather than the second as around 1:λ. This update would extend to the process of choosing the coins, and anything that that depended on. Agent (1) is similar, though its update rule always assumes the odds of x:¬x being fixed; thus any information about the processes of coin selection is interpreted as a change in the probability of the processes, not a change in the probability of the outcome.
Agent (3), in contrast, is completely different. It assesses the probability of H=h objectively, but then assumes that the odds of x and ¬x, given any h, are 1:λ. Thus if given updates about the probability of which coin is used, it will assess those updates objectively, but then assume that both coins are "really" giving 1:λ odds. It cuts off the update process at h, thus ensuring that it is "incorrect" only about x and its consequences, not its pre-h causes.
Utility and probability: assessing goal stability
Agents with unstable goals are likely to evolve towards being (equivalent to) expected utility maximisers. The converse is more complicated, but we'll assume here that an agent's goal is stable if it is an expected utility maximiser for some probability distribution.
Which one? I've tended to shy away from changing the probability, preferring to change the utility instead. If we divide the probability in equation (2) by 1+λ, it becomes a u-maximiser with a biased probability distribution. Alternatively, if we defined u'(v,x)=u(v) and u'(v,¬x)=λu(v), then it is a u'-maximiser with an unmodified probability distribution. Since all agents are equivalent for fixed P(x), we can see that in that case, all agents can be seen as expected utility maximisers with the standard probability distribution.
Paul questioned whether the difference was relevant. I preferred the unmodified probability distribution - maybe the agent uses the distribution for induction, maybe having false probability beliefs will interfere with AI self-improvement, or maybe agents with standard probability distributions are easier to corrige - but for agent (2) the difference seems to be arguably a matter of taste.
Note that though agent (2) is stable, its definition is not translation invariant in u. If we add c to u, we add c(P(x|e,a)+λP(¬x|e,a)) to the expected value of u'. Thus, if the agent can affect the value of P(x) through its actions, different constants c likely give different behaviours.
Agent (1) is different. Except for the cases λ=0 and λ=∞, the agent cannot be an expected utility maximiser. To see this, just notice that an update about the process that could change the probability of x, gets reinterpreted as an update on the probability of that process. If we have the ε:1 and 1:ε coins, then any update about their respective probabilities of being used gets essentially ignored (as long as the evidence that the coins are biased is much stronger than the evidence as to which coin is used).
In the cases λ=0 and λ=∞, though, agent (1) is a u-maximiser that uses the probability distribution that assumes x or ¬x is certain, respectively. This is the main point of agent (1) - providing a simple maximiser for those cases.
What about agent (3)? Define u' by: u'(v,h,x)=u(v)/P(x|h), and u'(v,h,¬x)=λu(v)/P(¬x|h). Then consider the u'-maximiser:
- (4) Σv Σh (u'(v,h,x)P(v,h,x|e,a)+u'(v,h,¬x)P(v,h,¬x|e,a))
Now P(v,h,x|e,a)=P(v|h,x,e,a)P(x|h,e,a)P(h|e,a). Because of the screening off assumptions, the middle term is the constant P(x|h). Multiplying this by u'(v,h,x)=u(v)/P(x|h) gives u(v)P(v|h,x,e,a)P(h|e,a). Similarly, the second term becomes λu(v)P(v|h,¬x,e,a)P(h|e,a). Thus a u'-maximiser, with the standard probability distribution, is the same as agent (3), thus proving the stability of that agent type.
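The identity between agent (3) and the u'-maximiser (4) can be checked numerically on a toy model. Every number below is an arbitrary illustrative assumption, and H, X and V are all taken to be binary; only the two formulas come from the text.

```python
from itertools import product

# Toy model (all numbers are illustrative assumptions).
p_h = {0: 0.3, 1: 0.7}                              # P(h|e,a)
p_x_given_h = {0: 0.2, 1: 0.9}                      # P(x|h), screened off from e and a
p_v1_given_hx = {(0, True): 0.6, (0, False): 0.1,
                 (1, True): 0.4, (1, False): 0.8}   # P(V=1|h,x,e,a)
u = {0: -1.0, 1: 2.0}
lam = 5.0

def p_v(v, h, x):
    """P(v|h,x,e,a)."""
    p1 = p_v1_given_hx[(h, x)]
    return p1 if v == 1 else 1.0 - p1

def p_x(h, x):
    """P(x|h) when x is True, P(not x|h) when x is False."""
    return p_x_given_h[h] if x else 1.0 - p_x_given_h[h]

def agent3_value():
    """Definition (3)."""
    return sum(u[v] * (p_v(v, h, True) + lam * p_v(v, h, False)) * p_h[h]
               for v, h in product(u, p_h))

def u_prime(v, h, x):
    """u'(v,h,x) = u(v)/P(x|h); u'(v,h,not x) = lam*u(v)/P(not x|h)."""
    return (1.0 if x else lam) * u[v] / p_x(h, x)

def agent4_value():
    """Definition (4): ordinary expected u' under the joint distribution."""
    return sum(u_prime(v, h, x) * p_v(v, h, x) * p_x(h, x) * p_h[h]
               for v, h, x in product(u, p_h, (True, False)))

print(agent3_value(), agent4_value())
```

The two values agree up to floating-point error because the P(x|h) factor in the joint probability cancels the 1/P(x|h) in u'.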
Beyond the future: going crazy or staying sane
What happens after the event X has come to pass? In that case, agent (4), the u'-maximiser will continue as normal. Its behaviour will not be unusual as long as neither λ nor 1/λ is close to 0. The same goes for agent (2).
In contrast, agent (3) will no longer be stable after X, as H no longer screens off evidence after that point. And agent (1) was never stable in the first place, and now it denies all the evidence it sees to determine that impossible events actually happened. But what of those two agents, or the stable ones if λ or 1/λ were close to 0? In particular, what if λ falls below the probability that the agent is deluded in its observation of X?
In those cases, it's easy to argue that the agents would effectively go insane, believing wild and random things to justify their delusions.
But maybe not, in the end. Suppose that you, as a human, believe an untrue fact - maybe that Kennedy was killed on the 23rd of November rather than the 22nd. Maybe you construct elaborate conspiracy theories to account for the discrepancy. Maybe you posit an early mistake by some reporter that was then picked up and repeated. After a while, you discover that all the evidence you can find points to the 22nd. Thus, even though you believe with utter conviction that the assassination was on the 23rd, you learn to expect that the next piece of evidence will point to the 22nd. You look for the date-changing conspiracy, and never discover anything about it; and thus learn to expect they have covered their tracks so well they can't be detected.
In the end, the expectations of this "insane" agent could come to resemble those of normal agents, as long as there's some possibility of a general explanation of all the normal observations (eg a well-hidden conspiracy) given the incorrect assumption.
Of course, the safer option is just to corrige the agent to some sensible goal soon after X.
AI utility-based correlation
A putative new idea for AI control; index here.
This presents one way of implementing the indifference-based correlations of these posts.
Let u be a utility function, a map from worlds to real numbers. An expected utility maximiser considering whether to produce output Y, looks at the expected utility
Σw u(w)P(w|Y).
We now assume that there is another random variable X in the world, and we want the AI to be indifferent to worlds where Y≠X. We also want it to be indifferent to worlds where Z=0. Then it will assess the value of output Y as:
Σw u(w)P(w|X=Y,Z=1,Y).
Now, the idea of the setup was to ensure that Z=1 would erase the output Y so that it was never read. Hence P(w|Z=1,Y)=P(w|Z=1): once the output is erased, the world no longer depends on Y. Hence that equation simplifies to:
Σw u(w)P(w|X=Y).
Therefore the AI will choose the Y that maximises the (conditional) expected utility of u if X=Y. To get the full version of the initial post, you need to define some function f of Y and modify this to
Σw u(w)P(w|X=Y) + f(Y).
Against Expected Utility
Expected utility is optimal as the number of bets you take approaches infinity. You will lose bets on some days, and win bets on other days. But as you take more and more bets, the day to day randomness cancels out.
Say you want to save as many lives as possible. You can plug "number of lives saved" into an expected utility maximizer. And as the number of bets it takes increases, it will start to save more lives than any other method.
But the real world obviously doesn't have an infinite number of bets. And following this algorithm in practice will get you worse results. It is not optimal.
In fact, as Pascal's Mugging shows, this could get arbitrarily terrible. An agent following expected utility would just continuously make bets with muggers and worship various religions, until it runs out of resources. Or worse, the expected utility calculations don't even converge, and the agent doesn't make any decisions.
So how do we fix it? Well we could just go back to the original line of reasoning that led us to expected utility, and fix it for finite cases. Instead of caring which method does best over infinitely many bets, we might say we want the one that does best most often in finite cases. That would get you median utility.
For most things, median utility will approximate expected utility. But for very very small risks, it will ignore them. It only cares that it does the best in most possible worlds. It won't ever trade away utility from the majority of your possible worlds to very very unlikely ones.
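A small comparison shows how the two criteria can disagree on a low-probability, high-reward bet. This is a naive per-decision sketch: the payoffs and probabilities are made-up illustrative numbers, and the median is taken as the smallest utility whose cumulative probability reaches one half.

```python
def expected_utility(lottery):
    """lottery: list of (utility, probability) pairs."""
    return sum(u * p for u, p in lottery)

def median_utility(lottery):
    """Smallest utility whose cumulative probability reaches 1/2."""
    cumulative = 0.0
    for u, p in sorted(lottery):
        cumulative += p
        if cumulative >= 0.5:
            return u
    raise ValueError("probabilities sum to less than 0.5")

# Illustrative numbers only: paying a mugger costs 5 and almost never pays off.
refuse = [(0.0, 1.0)]
pay_mugger = [(-5.0, 1.0 - 1e-9), (1e15, 1e-9)]

for name, lottery in (("refuse", refuse), ("pay", pay_mugger)):
    print(name, expected_utility(lottery), median_utility(lottery))
```

With these numbers expected utility favours paying, because the tiny chance of a huge reward dominates the sum, while the median favours refusing.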
A naive implementation of median utility isn't actually viable, because at different points in time, the agent might make inconsistent decisions. To fix this, it needs to decide on policies instead of individual decisions. It will pick a decision policy which it believes will lead to the highest median outcome.
This does complicate making a real implementation of this procedure. But that's what you get when you generalize results and try to make things work in the messy real world instead of idealized infinite worlds. The same issue occurs in the multi-armed bandit problem, where the optimal infinite solution is simple, but finite solutions are incredibly complicated (or simple but require brute force).
But if you do this, you don't need the independence axiom. You can be consistent and avoid money pumping without it. By not making decisions in isolation, but considering the entire probability space of decisions you will ever make. And choosing the best policies to navigate them.
It's interesting to note this actually solves some other problems. Such an agent would pick a policy that one-boxes on Newcomb's problems, simply because that is the optimal policy. Whereas a straightforward implementation of expected utility doesn't care.
But what if you really like the other mathematical properties of expected utility? What if we can just keep it and change something else? Like the probability function or the utility function?
Well the probability function is sacred IMO. Events should have the same probability of happening (given your prior knowledge), regardless of what utility function you have, or what you are trying to optimize. And modifying it is probably inconsistent too: an agent could exploit you by offering you bets in the areas where your beliefs are forced to be different from reality.
The utility function is not necessarily sacred though. It is inherently subjective, with the goal of just producing the behavior we want. Maybe there is some modification to it that could fix these problems.
It seems really inelegant to do this. We had a nice beautiful system where you could just count the number of lives saved, and maximize that. But assume we give up on that. How can we change the utility function to make it work?
Well you could bound utility to get out of mugging situations. After a certain level, your utility function just stops. It can't get any higher.
But then you are stuck with a bound. If you ever reach it, then you suddenly stop caring about saving any more lives. Now it's possible that your true utility function really is bounded. But it's not a fully general solution for all utility functions. And I don't believe that human utility is actually bounded, but that will have to be a different post.
You could transform the utility function so it is asymptotic. But this is just a continuous bound. It doesn't solve much. It still makes you care less and less about obtaining more utility, the closer you get to it.
Say you set your asymptote around 1,000. It can be much larger, but I need an example that is manageable. Now, what happens if you find yourself to exist in a world where all utilities are multiplied by a large number? Say 1,000. E.g. you save 1,000 lives in situations where before, you would have saved only 1.

Figure: an example asymptoting function that is capped at 1,000. Notice how 2,000 is only slightly higher than 1,000, and everything after that is basically flat.
Now the utility of each additional life is diminishing very quickly. Saving 2,000 lives might have only 0.001% more utility than 1,000 lives.
This means that you would not take a 1% risk of losing 1,000 people, for a 99% chance at saving 2,000.
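The refusal can be reproduced with any sufficiently flat curve. This sketch assumes an exponential form with an asymptote at 1,000; the functional form and the constant `SCALE` are my own illustrative choices, so the exact percentages differ from the figures quoted above.

```python
import math

CAP = 1000.0      # asymptote from the text
SCALE = 100.0     # assumed steepness; chosen only so the curve is flat by 1,000

def bounded_utility(lives):
    """Assumed asymptotic utility: rises towards CAP and flattens out."""
    return CAP * (1.0 - math.exp(-lives / SCALE))

keep_1000 = bounded_utility(1000)
# 99% chance of ending with 2,000 saved, 1% chance of losing the 1,000.
gamble = 0.99 * bounded_utility(2000) + 0.01 * bounded_utility(0)

print(bounded_utility(2000) / keep_1000 - 1.0)   # relative gain from 1,000 to 2,000
print(keep_1000, gamble)
```

Under this assumed curve the sure 1,000 has higher expected utility than the gamble, even though the gamble almost certainly doubles the number of lives saved.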
This is the exact opposite situation of Pascal's mugging! The probability of the reward is very high. Why are we refusing such an obviously good trade?
What we wanted to do was make it ignore really low probability bets. What we actually did was just make it stop caring about big rewards, regardless of the probability.
No modification to it can fix that. Because the utility function is totally indifferent to probability. That's what the decision procedure is for. That's where the real problem is.
In researching this topic I've seen all kinds of crazy resolutions to Pascal's Mugging. Some try to attack the exact thought experiment of an actual mugger. And miss the general problem of low probability events with large rewards. Others try to come up with clever arguments why you shouldn't pay the mugger. But not any general solution to the problem. And not one that works under the stated premises, where you care about saving human lives equally, and where you assign the mugger less than 1/3↑↑↑3 probability.
In fact Pascal's Mugger was originally written just to be a formalization of Pascal's original wager. Pascal's wager was dismissed for reasons like involving infinite utilities, and the possibility of an "anti-god" that exactly cancels the benefits out. Or that God wouldn't reward fake worshippers. People mostly missed the whole point about whether or not you should take low probability, high reward bets.
Pascal's Mugger showed that, no, it works fine in finite cases, and the probabilities do not have to exactly cancel each other out.
Some people tried to fix the problem by adding hacks on top of the probability or utility functions. I argued against these solutions above. The problem is fundamentally with the decision procedure of expected utility.
I've spoken to someone who decided to just bite the bullet. He accepted that our intuition about big numbers is probably wrong, and we should just do what the math tells us.
But even that doesn't work. One of the points made in the original Pascal's Mugging post is that EU doesn't even converge. There is a hypothesis which has even less probability than the mugger, but promises 3↑↑↑↑3 utility. And a hypothesis even smaller than that which promises 3↑↑↑↑↑3 utility, and so on. Expected utility is utterly dominated by increasingly more improbable hypotheses. The expected utility of all actions approaches positive or negative infinity.
Expected utility is at the heart of the problem. We don't really want the average of our utility function over all possible worlds. No matter how big the numbers are or improbable they may be. We don't really want to trade away utility from the majority of our probability mass to infinitesimal slices of it.
The whole justification for EU being optimal in the infinite case doesn't apply to the finite real world. The axioms that imply you need it to be consistent aren't true if you don't assume independence. So it's not sacred, and we can look at alternatives.
Median utility is just a first attempt at an alternative. We probably don't really want to maximize median utility either. Stuart Armstrong suggests using the mean of quantiles. There are probably better methods too. In fact there is an entire field of summary statistics and robust statistics, that I've barely looked at yet.
We can generalize and think of agents as having two utility functions. The regular utility function, which just gives a numerical value representing how preferable an outcome is. And a probability preference function, which gives a numerical value to each probability distribution of utilities.
Imagine we want to create an AI which acts the same as the agent would, given the same knowledge. Then we would need to know both of these functions. Not just the utility function. And they are both subjective, with no universally correct answer. Any function, so long as it converges (unlike expected utility), should produce perfectly consistent behavior.
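A rough sketch of that split: the utility values are the data, and the probability preference function is a separate, swappable piece that scores a whole distribution. The lottery representation, the function names and the mugging numbers below are all assumptions made for this example.

```python
def mean_preference(lottery):
    """Expected utility: the probability-weighted average."""
    return sum(p * u for p, u in lottery)


def median_preference(lottery):
    """Lower median: smallest utility whose cumulative probability reaches 1/2."""
    cumulative = 0.0
    for p, u in sorted(lottery, key=lambda pair: pair[1]):
        cumulative += p
        if cumulative >= 0.5:
            return u
    raise ValueError("probabilities must sum to at least 0.5")


def choose(options, preference):
    """Return the name of the option whose utility distribution scores highest."""
    return max(options, key=lambda name: preference(options[name]))


# Hypothetical mugging: paying costs 5 utility, with a 1e-20 chance of 1e30.
# Each lottery is a list of (probability, utility) pairs.
options = {
    "refuse": [(1.0, 0.0)],
    "pay": [(1.0 - 1e-20, -5.0), (1e-20, 1e30)],
}

print(choose(options, mean_preference))    # pay
print(choose(options, median_preference))  # refuse
```

The same utilities give opposite choices depending on which probability preference function is plugged in, which is why an imitator would need to know both.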
Mean of quantiles
In a previous post, I looked at some of the properties of using the median rather than the mean.
Inspired by Househalter's comment, it seems we might be able to take a compromise between median and mean. It seems to me that simply taking the mean of the lower quartile, median, and upper quartile would also have the nice features I described, and would likely be closer to the mean.
Furthermore, there's no reason to stop there. We can take the mean of the n-1 n-quantiles.
Two questions:
- As n increases, does this quantity tend to the mean if it exists? (I suspect yes).
- For some distributions (eg Cauchy distribution) this quantity will tend to a limit as n increases, even if there is no mean. Is this an effective way of extending means to distributions that don't possess them?
Note that, unlike the median approach, for large enough n this maximiser will pay Pascal's mugger.
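For anyone who wants to poke at the two questions numerically, here is a small sketch. It assumes the distribution is given by its quantile function (inverse CDF); the exponential distribution is my own choice of a test case with a known mean of 1, and the function names are made up.

```python
import math


def mean_of_quantiles(quantile_fn, n):
    """Average of the n-1 interior n-quantiles: quantile_fn(k/n) for k = 1..n-1."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return sum(quantile_fn(k / n) for k in range(1, n)) / (n - 1)


def exponential_quantile(q):
    """Inverse CDF of the exponential distribution with mean 1."""
    return -math.log1p(-q)


def cauchy_quantile(q):
    """Inverse CDF of the standard Cauchy distribution, which has no mean."""
    return math.tan(math.pi * (q - 0.5))


for n in (2, 4, 100, 10_000):
    print(n,
          mean_of_quantiles(exponential_quantile, n),
          mean_of_quantiles(cauchy_quantile, n))
```

With n = 2 this is just the median (ln 2 for the exponential), and as n grows the exponential column creeps up towards the mean of 1. The Cauchy column stays at 0, up to rounding, for every n, simply because the quantiles are symmetric about 0. This is only a numerical check on two examples, not an answer to either question in general.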
Median utility rather than mean?
tl;dr A median maximiser will expect to win. A mean maximiser will win in expectation. As we face repeated problems of similar magnitude, both types take on the advantage of the other. However, the median maximiser will turn down Pascal's muggings, and can say sensible things about distributions without means.
Prompted by some questions from Kaj Sotala, I've been thinking about whether we should use the median rather than the mean when comparing the utility of actions and policies. To justify this, see the next two sections: why the median is like the mean, and why the median is not like the mean.
Why the median is like the mean
The main theoretical justifications for the use of expected utility - hence of means - are the von Neumann-Morgenstern axioms. Using the median obeys the completeness and transitivity axioms, but not the continuity and independence ones.
It does obey weaker forms of continuity; but in a sense, this doesn't matter. You can avoid all these issues by making a single 'ultra-choice'. Simply list all the possible policies you could follow, compute their median return, and choose the one with the best median return. Since you're making a single choice, independence doesn't apply.
So you've picked the policy πm with the highest median value - note that to do this, you need only know an ordinal ranking of worlds, not their cardinal values. In what way is this like maximising expected utility? Essentially, the more options and choices you have - or could hypothetically have - the closer this policy must be to expected utility maximalisation.
Assume u is a utility function compatible with your ordinal ranking of the worlds. Then πu = 'maximise the expectation of u' is also a policy choice. If we choose πm, we get a distribution dmu of possible values of u. Then E(u|πm) is within the absolute deviation (using dmu) of the median value of dmu. This absolute deviation always exists for any distribution with an expectation, and is itself bounded by the standard deviation, if it exists.
Thus maximising the median is like maximising the mean, with an error depending on the standard deviation. You can see it as a risk averse utility maximising policy (I know, I know - risk aversion is supposed to go in defining the utility, not in maximising it. Read on!). And as we face more and more choices, the standard deviation will tend to fall relative to the mean, and the median will cluster closer and closer to the mean.
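The bound can be written out in one line. This is a sketch using symbols introduced here for convenience: X is the value of u under πm, μ its expectation, m its median and σ its standard deviation. It relies on the standard fact that the median minimises the expected absolute deviation.

```latex
|\mu - m| = \left|\mathbb{E}[X - m]\right|
  \le \mathbb{E}|X - m|
  \le \mathbb{E}|X - \mu|
  \le \sqrt{\mathbb{E}\left[(X - \mu)^2\right]} = \sigma
```

The first inequality moves the absolute value inside the expectation, the second uses the minimising property of the median, and the third is Jensen's inequality (it only says something when σ exists).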
For instance, suppose we consider the choice of whether to buckle our seatbelt or not. Assume we don't want to die in a car accident that a seatbelt could prevent; assume further that the cost of buckling a seatbelt is trivial but real. To simplify, suppose we have an independent 1/Ω chance of death every time we're in a car, and that a seatbelt could prevent this, for some large Ω. Furthermore, we will be in a car a total of ρΩ times, for ρ < 0.5. Now, it seems, the median recommends a ridiculous policy: never wear seatbelts. Then you pay no cost ever, and your chance of dying is less than 50%, so this has the top median.
And that is indeed a ridiculous result. But it's only possible because we look at seatbelts in isolation. Every day, we face choices that have small chances of killing us. We could look when crossing the street; smoke or not smoke cigarettes; choose not to walk close to the edge of tall buildings; choose not to provoke co-workers to fights; not run around blindfolded. I'm deliberately including 'stupid things no-one sensible would ever do', because they are choices, even if they are obvious ones. Let's gratuitously assume that all these choices also have a 1/Ω chance of killing you. When you collect together all the possible choices (obvious or not) that you make in your life, this will be ρ'Ω choices, for ρ' likely quite a lot bigger than 1.
Assume that avoiding these choices has a trivial cost, incommensurable with dying (ie no matter how many times you have to buckle your seatbelt, it is still better than a fatal accident). Now median-maximisation will recommend taking safety precautions for roughly (ρ'-0.5)Ω of these choices. This means that the decisions of a median maximiser will be close to those of a utility maximiser - they take almost the same precautions - though the outcomes are still pretty far apart: the median maximiser accepts a 49.99999...% chance of death.
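A quick numerical sketch of the seatbelt arithmetic, under the stated assumption of independent 1/Ω risks. The value chosen for Ω and the function names are hypothetical.

```python
import math


def death_probability(omega, unprotected_exposures):
    """P(at least one fatal event) over independent exposures, each with risk 1/omega."""
    survive_all = math.exp(unprotected_exposures * math.log1p(-1.0 / omega))
    return 1.0 - survive_all


def median_outcome_is_survival(omega, unprotected_exposures):
    """The median outcome only becomes death once its probability reaches 50%."""
    return death_probability(omega, unprotected_exposures) < 0.5


OMEGA = 1_000_000   # hypothetical value for the large constant Ω
for rho in (0.1, 0.5, 0.69, 0.7, 2.0):
    p = death_probability(OMEGA, rho * OMEGA)
    print(rho, round(p, 4), median_outcome_is_survival(OMEGA, rho * OMEGA))
```

For large Ω the death probability is close to 1 - exp(-ρ), so under exact independence the median flips at about ln 2 ≈ 0.69 times Ω unprotected exposures, the same ballpark as the rough 0.5Ω used in the text. Below that budget the median maximiser sees no reason to take precautions; everything beyond it gets protected.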
But now add serious injury to the mix (still assume the costs are incommensurable). This has a rather larger probability, and the median maximiser will now only accept a 49.99999...% chance of serious injury. Or add light injury - now they only accept a 49.99999...% chance of light injury. If light injuries are additive - two injuries are worse than one - then the median maximiser becomes even more reluctant to take risks. We can now relax the assumption of incommensurability as well; the set of policies and assessments becomes even more complicated, and the median maximiser moves closer to the mean maximiser.
The same phenomenon tends to happen when we add lotteries of decisions, chained decisions (decisions that depend on other decisions), and so on. Existential risks are interesting examples: from the selfish point of view, existential risks are just other things that can kill us - and not the most unlikely ones, either. So the median maximiser will be willing to pay a trivial cost to avoid an xrisk. Will a large group of median maximisers be willing to collectively pay a large cost to avoid an xrisk? That gets into superrationality, which I haven't considered yet in this context.
But let's turn back to the mystical utility function that we are trying to maximise. It's obvious that humans don't actually maximise a utility function; but according to the axioms, we should do so. Since we should, people on this list tend to often assume that we actually have one, skipping over the process of constructing it. But how would that process go? Let's assume we've managed to make our preferences transitive, already a major good achievement. How should we go about making them independent as well? We can do so as we go along. But if we do it ahead of time, chances are that we will be comparing hypothetical situations ("Do I like chocolate twice as much as sex? What would I think of a 50% chance of chocolate vs guaranteed sex? Well, it depends on the situation...") and thus construct a utility function. This is where we have to make decisions about very obscure and unintuitive hypothetical tradeoffs, and find a way to fold all our risk aversion/risk love into the utility.
When median maximising, we do exactly the same thing, except we constrain ourselves to choices that are actually likely to happen to us. We don't need a full ranking of all possible lotteries and choices; we just need enough to decide in the situations we are likely to face. You could consider this a form of moral learning (or preference learning). From our choices in different situations (real or possible), we decide what our preferences are in these situations, and this determines our preferences overall.
Why the median is not like the mean
Ok, so the previous paragraph argues that median maximising, if you have enough choices, functions like a clunky version of expected utility maximising. So what's the point?
The point is those situations that are not faced sufficiently often, or that have extreme characteristics. A median maximiser will reject Pascal's mugging, for instance, without any need for extra machinery (though they will accept Pascal's muggings if they face enough independent muggings, which is what we want - for stupidly large values of "enough"). They cope fine with distributions that have no means - such as the Cauchy distribution or a utility version of the St Petersburg paradox. They don't fall into paradox when facing choices with infinite (but ordered) rewards.
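As a small illustration of the "no mean" case, here is a sketch of the St Petersburg game (payoff 2^k with probability 2^-k). The truncation parameter and the helper names are assumptions of this example, and payoffs are treated as utilities directly.

```python
def st_petersburg(max_tosses):
    """Truncated game: payoff 2**k with probability 2**-k, for k = 1..max_tosses."""
    return [(2.0 ** -k, 2.0 ** k) for k in range(1, max_tosses + 1)]


def truncated_mean(max_tosses):
    """Contribution of the first max_tosses outcomes to the expectation."""
    return sum(p * payoff for p, payoff in st_petersburg(max_tosses))


def lower_median(lottery):
    """Smallest payoff whose cumulative probability reaches 1/2."""
    cumulative = 0.0
    for p, payoff in sorted(lottery, key=lambda pair: pair[1]):
        cumulative += p
        if cumulative >= 0.5:
            return payoff
    raise ValueError("lottery carries less than 1/2 of the probability mass")


for k in (10, 20, 40):
    print(k, truncated_mean(k), lower_median(st_petersburg(k)))
```

Each extra toss adds exactly 1 to the truncated mean, so it grows without bound, while the lower median stays at 2 however far the game is extended. The median maximiser therefore has a perfectly definite valuation to compare against other options.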
In a sense, median maximalisation is like expected utility maximalisation for common choices, but is different for exceptionally unlikely or high impact choices. Or, from the opposite perspective, expected utility maximising gives high probability of good outcomes for common choices, but not for exceptionally unlikely or high impact choices.
Another feature of the general idea (which might be seen as either a plus or a minus) is that it can get around some issues with total utilitarianism and similar ethical systems (such as the repugnant conclusion). What do I mean by this? Well, because the idea is that only choices that we actually expect to make matter, we can say, for instance, that we'd prefer a small ultra happy population to a huge barely-happy one. And if this is the only choice we make, we need not fear any paradoxes: we might get hypothetical paradoxes, just not actual ones. I won't put too much insistence on this point, I just thought it was an interesting observation.
For lack of a Cardinal...
Now, the main issue is that we might feel that there are certain rare choices that are just really bad or really good. And we might come to this conclusion by rational reasoning, rather than by experience, so this will not show up in the median. In these cases, it feels like we might want to force some kind of artificial cardinal order on the worlds, to make the median maximiser realise that certain rare events must be considered beyond their simple ordinal ranking.
In this case, maybe we could artificially add some hypothetical choices to our system, making us address these questions more than we actually would, and thus drawing them closer to the mean maximising situation. But there may be other, better ways of doing this.
Anyway, that's my first pass at constructing a median maximising system. Comments and criticism welcome!
EDIT: We can use the absolute deviation (technically, the mean absolute deviation around the mean) to bound the distance between median and mean. This itself is bounded by the standard deviation, if it exists.
Rough utility estimates and clarifying questions
Related to: diminishing returns, utility.
I, for example, really don't care that much about trillions of dollars being won in a lottery or offered by an alien AI iff I make 'the right choice'. I mostly deal with things on pretty linear scales, barring sudden gifts from my relatives and Important Life Decisions. So the below was written with trivialities in mind. Why? Because I think we should train our utility-assigning skilz just like we train our prior-probability-estimating ones.
However, I am far from certain we should do it exactly this way. Maybe this would lead to a shiny new bias. At least I vaguely think I already have it, and formalizing it shouldn't make me worse off. I have tried to apply to myself the category of 'risk-averse', but in the end, it didn't change my prevailing heuristic: 'Everything's reasonable, if you have a sufficient reason.' Like, a pregnant woman should not run if she cares about carrying her child, but even then she should run if the house is on fire. Maybe my estimates of 'sufficient' are different than other people's, but they have served me so far; and setting the particular goal of ridding self of particular biases seems less instrumentally rational than just checking how accurate my individual predictions/impressions/any kind of actionable thoughts are.
So I drew up this list of utility components and will try it out at my leisure, tweaking it ad hoc and paying with time and money and health for my mistakes.
Utility of a given item/action for a given owner/actor = produced value + reduced cost + saved future opportunities + fun.
PV points: -2 if A/I 'takes from tomorrow'*, -1 if 'harmful' only within the day, 0 if gives zero on net, 1 if useful within the day, 2 if 'gives to tomorrow'
*'tomorrow' is the foreseeable future :)
RC points: -3 if takes from overall amount of money I have, less the *really* last-resort stash, -2 if takes from more than one-day-budget, -1 if takes from one-day-budget, 0 if zero on net, 1 if saves within a day (like 'saved on a ticket, might buy candy'), 2 saves for 'tomorrow' on net
SFO points: -2 if 'really sucks', -1 if no, 0 if dunno, 1 if yes
F points: -1 if no, 0 if okay, 1 if yes, 2 if hell yes.
U(bout of flu) = -2-3+0-1 = -6. Even if I have flu, I might do research or call a friend or do something useful if it's not very bad, then it will be only -5. On the other hand, I might get pneumonia, which really sucks, and then it will be -7. Knowing this, I can, when I feel myself going under, 1) make sure I don't get pneumonia, and 2) go through low-effort stuff I keep labelling 'slow-day-stuff'.
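The scoring scheme is simple enough to write down as a function. This is only a sketch of the four-component sum with the point ranges listed above; the names are made up, and it does not guess which component moves in the pneumonia case.

```python
# Allowed point ranges, copied from the PV / RC / SFO / F scales.
ALLOWED = {
    "pv": range(-2, 3),    # produced value: -2 .. 2
    "rc": range(-3, 3),    # reduced cost: -3 .. 2
    "sfo": range(-2, 2),   # saved future opportunities: -2 .. 1
    "fun": range(-1, 3),   # fun: -1 .. 2
}


def utility(pv, rc, sfo, fun):
    """Utility = produced value + reduced cost + saved future opportunities + fun."""
    scores = {"pv": pv, "rc": rc, "sfo": sfo, "fun": fun}
    for name, value in scores.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"{name}={value} is outside the allowed range")
    return pv + rc + sfo + fun


bad_flu = utility(pv=-2, rc=-3, sfo=0, fun=-1)    # the -6 case
mild_flu = utility(pv=-1, rc=-3, sfo=0, fun=-1)   # still useful within the day: -5
print(bad_flu, mild_flu)
```

Keeping the ranges in one table makes ad hoc tweaking easy: change a range or add a component, and the out-of-range check keeps old estimates honest.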
U(room of a house) = use + status - maintenance = U(weighted activities of, well, life) + U(weighted signalling activities, like polishing family china) - U(weighted repair activities).
U(route) = f(weather, price, time, destination, health, 'carrying' potential, changeability on short notice, explainability to somebody else) = U(clothes) + U(activities during commute) + U(shopping/exchanging things/..) + U(emergencies) + U(rescue missions).
What do you think?
Predicted corrigibility: pareto improvements
A putative new idea for AI control; index here.
Corrigibility allows an agent to transition smoothly from a perfect u-maximiser to a perfect v-maximiser, without seeking to resist or cause this transition.
And it's the very perfection of the transition that could cause problems; while u-maximising, the agent will not take the slightest action to increase v, even if such actions are readily available. Nor will it 'rush' to finish its u-maximising before transitioning. It seems that there's some possibility of improvements here.
I've already attempted one way of dealing with the issue (see the pre-corriged agent idea). This is another one.
Pareto improvements allowed
Suppose that an agent with corrigible algorithm A is following utility u currently, and estimates that there are probabilities p_i that it will transition to utilities v_i at midnight (note that these are utility function representatives, not affine classes of equivalent utility functions). At midnight, the usual corrigibility applies, making A indifferent to that transition, making use of such terms as E(u|u→u) (the expectation of u, given that A's utility doesn't change) and E(v_i|u→v_i) (the expectation of v_i, given that A's utility changes to v_i).
But, in the meantime, there are expectations such as E({u,v_1,v_2,...}). These are A's best current estimates as to what the genuine expected utility of the various utilities are, given all it knows about the world and itself. It could be more explicitly written as E({u,v_1,v_2,...}| A), to emphasise that these expectations are dependent on the agent's own algorithm.
Then the idea is to modify the agent's algorithm so that Pareto improvements are possible. Call this modified algorithm B. B can select actions that A would not have chosen, conditional on:
- E(u|B) ≥ E(u|A) and E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).
There are two obvious ways we could define B:
- B maximises u, subject to the constraint E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).
- B maximises Σ p_i v_i, subject to the constraint E(u|B) ≥ E(u|A).
In the first case, the agent maximises its current utility, without sacrificing its future utility. This could apply, for example, to a ruby mining agent that rushes to get its rubies to the bank before its utility changes. In the second case, the agent maximises its future expected utility, without sacrificing its current utility. This could apply to a ruby mining agent that's soon to become a sapphire mining agent: it then starts to look around and collect some early sapphires as well.
Now, it would seem that doing this must cause it to lose some ruby mining ability. However, it is being Pareto with E("rubies in bank"|A, expected future transition), not with E("rubies in bank"|A, "A remains a ruby mining agent forever"). The difference is that A will behave as if it was maximising the second term, and so might not go to the bank to deposit its gains, before getting hit by the transition. So B can collect some early sapphires, and also go to the bank to deposit some rubies, and thus end up ahead for both u and Σ p_i v_i.
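A toy sketch of the two definitions of B, choosing among a handful of candidate policies. All policy names and expected values here are invented for illustration; each policy maps to the pair (expected u, expected Σ p_i v_i), both evaluated with the expected future transition taken into account.

```python
# Hypothetical expectations for a ruby miner that may become a sapphire miner.
POLICIES = {
    "A: mine rubies, never deposit": (6.0, 0.0),
    "deposit rubies early": (9.0, 0.0),
    "deposit rubies and pick up sapphires": (8.0, 3.0),
    "abandon rubies for sapphires": (2.0, 6.0),
}
BASELINE = "A: mine rubies, never deposit"


def b_maximising_u(policies, baseline):
    """First definition: maximise u without lowering expected future utility."""
    base_future = policies[baseline][1]
    feasible = [n for n, (_, future) in policies.items() if future >= base_future]
    return max(feasible, key=lambda n: policies[n][0])


def b_maximising_future(policies, baseline):
    """Second definition: maximise future utility without lowering expected u."""
    base_u = policies[baseline][0]
    feasible = [n for n, (eu, _) in policies.items() if eu >= base_u]
    return max(feasible, key=lambda n: policies[n][1])


print(b_maximising_u(POLICIES, BASELINE))
print(b_maximising_future(POLICIES, BASELINE))
```

With these made-up numbers the first definition picks "deposit rubies early" and the second picks "deposit rubies and pick up sapphires". Both are Pareto improvements over A, and neither definition ever selects the policy that abandons rubies, since that one lowers expected u.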
How do humans assign utilities to world states?
It seems like a good portion of the whole "maximizing utility" strategy which might be used by a sovereign relies on actually being able to consolidate human preferences into utilities. I think there are a few stages here, each of which may present obstacles. I'm not sure what the current state of the art is with regard to overcoming these, and am curious regarding such.
First, here are a few assumptions that I'm using just to make the problem a bit more navigable (dealing with one or two hard problems instead of a bunch at once) - will need to go back and do away with each of these (and each combination thereof) and see what additional problems result.
1. The sovereign has infinite computing power (and to shorten the list of assumptions, can do 2-6 below)
2. We're maximizing across the preferences of a single human (Alice for convenience). To the extent that Alice cares about others, we're accounting for their preferences, too. But we're not dealing with aggregating preferences across different sentient beings, yet. I think this is a separate hard problem.
3. Alice has infinite computing power.
4. We're assuming that Alice's preferences do not change and cannot change, ever, no matter what happens. So as Alice experiences different things in her life, she has the exact same preferences. No matter what she learns or concludes about the world, she has the exact same preferences. To be explicit, this includes preferences regarding the relative weightings of present and future worldstates. (And in CEV terms, no spread, no distance.)
5. We're assuming that Alice (and the sovereign) can deductively conclude the future from the present, given a particular course of action by the sovereign. Picture a single history of the universe from the beginning of the universe to now, and a bunch of worldlines running into the future depending on what action the sovereign takes. To clarify, if you ask Alice about any single little detail across any of the future worldlines, she can tell you that detail.
6. Alice can read minds and the preferences of other humans and sentient beings (implied by 5, but trying to be explicit.)
So Alice can conclude anything and everything, pretty much (and so can our sovereign.) The sovereign is faced with the problem of figuring out what action to take to maximize across Alice's preferences. However, Alice is basically a sack of meat that has certain emotions in response to certain experiences or certain conclusions about the world, and it doesn't seem obvious how to get the preference ordering of the different worldlines out of these emotions. Some difficulties:
1. The sovereign notices that Alice experiences different feelings in response to different stimuli. How does the sovereign determine which types of feelings to maximize, and which to minimize? There are a bunch of ways to deal with this, but most of them seem to have a chance of error (and the conjunction of p(error) across all the times that the sovereign will need to do this approaches 1). For example, could train off an existing data set, could have it simulate other humans with access to Alice's feelings and cognition and have a simulated committee discuss and reach a decision on each one, etc etc. But all of these bootstrap off of the assumed ability of humans to determine which feelings to maximize (just with amped up computing power) - this doesn't strike me as a satisfactory solution.
2. Assume 1. is solved. The sovereign knows which feelings to maximize. However, it's ended up with a bunch of axes. How does it determine the appropriate trade-offs to make? (Or, to put it another way, how does it determine the relative value of different positions along each axis with different positions along different axes?)
So, to rehash my actual request: what's the state of the art with regards to these difficulties, and how confident are we that we've reached a satisfactory answer?
Utility vs Probability: idea synthesis
A putative new idea for AI control; index here.
This post is a synthesis of some of the ideas from utility indifference and false miracles, in an easier-to-follow format that illustrates better what's going on.
Utility scaling
Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X, by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.
Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:
u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).
The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:
u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).
If we were to double the odds ratio, the expected utility gain becomes:
u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω, (1)
for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.
We can reproduce exactly the same effect by instead replacing u with u', such that
- u'(·|X) = 2u(·|X)
- u'(·|¬X) = u(·|¬X)
Then:
u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X)),
= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)). (2)
This, up to an unimportant constant, is the same equation as (1). Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as by changing its probability estimates.
Notice that we could also have defined
- u'(·|X) = u(·|X)
- u'(·|¬X) = (1/2)u(·|¬X)
This is just the same u', scaled.
The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.
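A numerical check of equations (1) and (2), as a sketch. The probability and the utility differences are hypothetical numbers picked so that the preference actually flips; the variable names are mine.

```python
import math


def gain(p_x, gain_given_x, gain_given_not_x):
    """Expected gain of α over ω: P(X)·Δu(X) + P(¬X)·Δu(¬X)."""
    return p_x * gain_given_x + (1.0 - p_x) * gain_given_not_x


def doubled_odds(p_x):
    """Probability of X after doubling the odds ratio P(X):P(¬X)."""
    return 2.0 * p_x / (2.0 * p_x + (1.0 - p_x))


# Hypothetical numbers: α is better than ω if X happens, worse if it doesn't.
p_x = 0.3
delta_x = 4.0         # u(α|X) - u(ω|X)
delta_not_x = -3.0    # u(α|¬X) - u(ω|¬X)
omega = 2.0 * p_x + (1.0 - p_x)   # the normalisation constant Ω

original = gain(p_x, delta_x, delta_not_x)                        # pair (u, P)
via_probability = gain(doubled_odds(p_x), delta_x, delta_not_x)   # equation (1)
via_utility = gain(p_x, 2.0 * delta_x, delta_not_x)               # equation (2)
via_halving = gain(p_x, delta_x, 0.5 * delta_not_x)               # the scaled u'

print(original, via_probability, via_utility, via_halving)
```

Here the unmodified agent prefers ω (negative gain), while all three modified versions prefer α. The utility version equals Ω times the probability version, and the halved version is half of that again, so the three differ only by positive constants and always agree on the choice.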
Utility translating
In the previous section, we multiplied certain utilities by two. But by doing so, we implicitly used the zero point of u. But utility is invariant under translation, so this zero point is not actually anything significant.
It turns out that we don't need to care about this - any zero will do, what matters simply is that the spread between options is doubled in the X world but not in the ¬X one.
But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preferences between the two situations, and will not seek to boost one over the other. However, note that u(X) is an expected utility calculation. Therefore:
- Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
- Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is the bet of someone on that coin flip, someone the AI doesn't like.
This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.
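A toy sketch of why the zero point starts to matter once the AI can move P(X). Everything here is hypothetical: the AI is indifferent between X and ¬X under u (both worth 1), and it has a free action that raises P(X) from 0.5 to 0.7.

```python
def expected_utility(p_x, u_x, u_not_x):
    return p_x * u_x + (1.0 - p_x) * u_not_x


def scale_x_world(u_x, zero):
    """Double the spread of X-world utilities around a chosen zero point.

    zero + 2*(u_x - zero) equals 2*u_x - zero, i.e. plain doubling plus a
    constant added to the X world only.
    """
    return zero + 2.0 * (u_x - zero)


U_X, U_NOT_X = 1.0, 1.0            # indifferent between X and ¬X under u
P_DEFAULT, P_BOOSTED = 0.5, 0.7    # the AI can push P(X) up for free


def incentive_to_boost(zero):
    """Expected u' gained by raising P(X), for a given choice of zero."""
    u_prime_x = scale_x_world(U_X, zero)
    return (expected_utility(P_BOOSTED, u_prime_x, U_NOT_X)
            - expected_utility(P_DEFAULT, u_prime_x, U_NOT_X))


for zero in (0.0, 1.0, 2.0):
    print(zero, incentive_to_boost(zero))
```

With the zero at 0 the AI gains by boosting X, with the zero at 2 it gains by suppressing X, and only with the zero at 1 (where u'(X) = u'(¬X)) is it indifferent. This toy has a single world on each side, so it says nothing about the harder requirement that the equality hold conditional on every earlier event Y.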
Closest stable alternative preferences
A putative new idea for AI control; index here.
There's a result that's almost a theorem, which is that an agent that is an expected utility maximiser, is an agent that is stable under self-modification (or the creation of successor sub-agents).
Of course, this needs to be for "reasonable" utility, where no other agent cares about the internal structure of the agent (just its decisions), where the agent is not under any "social" pressure to make itself into something different, where the boundedness of the agent itself doesn't affect its motivations, and where issues of "self-trust" and acausal trade don't affect it in relevant ways, etc...
So quite a lot of caveats, but the result is somewhat stronger in the opposite direction: an agent that is not an expected utility maximiser is under pressure to self-modify itself into one that is. Or, more correctly, into an agent that is isomorphic with an expected utility maximiser (an important distinction).
What is this "pressure" that agents are "under"? The known result is that if an agent obeys four simple axioms, then its behaviour must be isomorphic with an expected utility maximiser. If we assume the Completeness axiom (trivial) and Continuity (subtle), then violations of Transitivity or Independence correspond to situations where the agent has been money pumped - lost resources or power for no gain at all. The more likely the agent is to face these situations, the more pressure they're under to behave as an expected utility maximiser, or simply lose out.
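A minimal sketch of a money pump against intransitive preferences. The cyclic preferences, the swap fee and the function name are all hypothetical, chosen only to show the mechanism.

```python
# Hypothetical cyclic preferences: A over B, B over C, and C over A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}
FEE = 1   # what the agent will pay to swap to something it prefers


def run_money_pump(start_item, start_money, offers):
    """Offer a sequence of swaps; the agent pays FEE for each one it prefers."""
    item, money = start_item, start_money
    for offered in offers:
        if (offered, item) in PREFERS and money >= FEE:
            item, money = offered, money - FEE
    return item, money


# Two full trips around the cycle, starting from A with 10 units of money.
final_item, final_money = run_money_pump("A", 10, ["C", "B", "A"] * 2)
print(final_item, final_money)
```

The agent ends up holding A again, exactly where it started, but 6 units poorer. Every individual swap looked like an improvement by its own preferences, which is the sense in which the loss is "for no gain at all".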
Unbounded agents
I have two models for how idealised agents could deal with this sort of pressure. The first, post-hoc, is the unlosing agent I described here. The agent follows whatever preferences it has, but keeps track of its past decisions, and whenever it is in a position to violate transitivity or independence in a way that it would suffer from, it makes another decision instead.
Another, pre-hoc, way of dealing with this is to make an "ultra choice" and choose not between decisions, but between all possible input-output maps (equivalently, between all possible decision algorithms), looking to the expected consequences of each one. This reduces the choices to a single choice, where issues of transitivity or independence need not necessarily apply.
Bounded agents
Actual agents will be bounded, unlikely to be able to store and consult their entire history when making every single decision, and unable to look at the whole future of their interactions to make a good ultra choice. So how would they behave?
This is not determined directly by their preferences, but by some sort of meta-preferences. Would they make an approximate ultra-choice? Or maybe build up a history of decisions, and then simplify it (when it gets too large to easily consult) into a compatible utility function? This is also determined by their interactions - an agent that makes a single decision has no pressure to be an expected utility maximiser, one that makes trillions of related decisions has a lot of pressure.
It's also notable that different types of boundedness (storage space, computing power, time horizons, etc...) have different consequences for unstable agents, and would converge to different stable preference systems.
Investigation needed
So what is the point of this post? It isn't presenting new results; it's more an attempt to launch a new sub-field of investigation. We know that many preferences are unstable, and that the agent is likely to make them stable over time, either through self-modification, subagents, or some other method. There are also suggestions for preferences that are known to be unstable, but have advantages (such as resistance to Pascal Muggings) that standard maximalisation does not.
Therefore, instead of saying "that agent design can never be stable", we should be saying "what kind of stable design would that agent converge to?", "does that convergent stable design still have the desirable properties we want?" and "could we get that stable design directly?".
The first two things I found in this area were that traditional satisficers could converge to vastly different types of behaviour in an essentially unconstrained way, and that a quasi-expected utility maximiser of utility u might converge to an expected utility maximiser, but it might not be u that it maximises.
In fact, we need not look only at violations of the axioms of expected utility; they are but one possible reason for decision behaviour instability. Here are some that spring to mind:
- Non-independence and non-transitivity (as above).
- Boundedness of abilities.
- Adversaries and social pressure.
- Evolution (survival cost to following “odd” utilities (eg time-dependent preference)).
- Unstable decision theories (such as CDT).
Now, some categories (such as "Adversaries and social pressure") may not possess a tidy stable solution, but it is still worth asking what setups are more stable than others, and what the convergence rules are expected to be.
Anti-Pascaline agent
A putative new idea for AI control; index here.
Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest of probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.
There are various principled approaches to resolving the problem, but how about an unprincipled approach? We could try and bound utility functions, but the heart of the problem is not high utility, but high utility combined with low probability. Moreover, this has to behave sensibly with respect to updating.
The agent design
Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities that are below probability ε. There is no fundamental reason that the same ε should be used for highest and lowest utilities, but we'll keep it that way for the moment.
The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.
Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.
Then define the utility function uMε, which is simply u, bounded between max(M) and min(M). Now calculate the expected value of uMε given A(M), call this Eε(u|A(M)).
The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
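As a rough illustration of the truncation, here is a minimal Python sketch for a finite outcome distribution. It rests on assumptions that are not in the post: the function name `eps_truncated_expectation` is hypothetical, and "lowest number s.t." is read as an infimum, so that for a finite distribution max(M) becomes the smallest outcome whose strict upper tail has probability at most ε (and symmetrically for min(M)).

```python
from fractions import Fraction

def eps_truncated_expectation(outcomes, eps):
    """outcomes: list of (utility, probability) pairs with probabilities summing to 1.

    Assumption: max(M) is the smallest outcome v with P(u > v) <= eps,
    and min(M) is the largest outcome v with P(u < v) <= eps.
    """
    dist = {}
    for u, p in outcomes:
        dist[u] = dist.get(u, 0) + p
    values = sorted(dist)
    cap = next(v for v in values
               if sum(p for w, p in dist.items() if w > v) <= eps)
    floor = next(v for v in reversed(values)
                 if sum(p for w, p in dist.items() if w < v) <= eps)
    # Clamp every outcome into [floor, cap] and take the ordinary expectation.
    return sum(p * min(max(u, floor), cap) for u, p in dist.items())

eps = Fraction(1, 1000)

# A Pascal-style gamble: huge payoff, probability far below eps.
tiny = Fraction(1, 10**6)
mugging = [(Fraction(10**9), tiny), (Fraction(0), 1 - tiny)]
plain = sum(u * p for u, p in mugging)

# An ordinary coin flip: nothing is below the precision, so nothing is discarded.
coin = [(Fraction(10), Fraction(1, 2)), (Fraction(0), Fraction(1, 2))]

print(plain, eps_truncated_expectation(mugging, eps))
print(eps_truncated_expectation(coin, eps))
```

In this toy case the plain expectation of the gamble is dominated by the tiny-probability payoff, while the truncated expectation ignores it; the coin flip is left untouched.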
Stability of the design
The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.
Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximalisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.
I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision are independent).
A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:
- Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.
The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.
Let C(p) be a coupon that pays out 1 with probability p. xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:
- In W1, B will be offered the possibility of buying C(0.75ε).
- In W2, B will be offered the possibility of buying C(1.5ε).
- In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
- In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
- In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
- In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
- In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).
From A’s perspective, the best input-output maps are: in W1, don’t buy, in W2, buy, in W3, buy both, in W4, don’t buy (because the probability of getting above 0 utility by buying, is, from A's initial perspective, 1.5ε/2 = 0.75ε).
W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.
Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).
So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
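The seven worlds are small enough to check by brute force. The following Python sketch is an illustration under stated assumptions, not the original argument: ε is fixed at the hypothetical value 1/100, "lowest number s.t." is read as an infimum over a finite distribution, B applies the truncation to each offer separately, and all helper names are made up for this example. It computes A's best policy in each world, then asks which of three candidate subagent precisions (ε/2, ε, 2ε) make B reproduce that policy.

```python
from fractions import Fraction
from itertools import product

EPS = Fraction(1, 100)            # A's precision (hypothetical value)
COST = EPS ** 2                   # price of each coupon
NOTHING = {Fraction(0): Fraction(1)}

def truncated_eu(dist, eps):
    """Expectation of u clamped to [min, max], both read as infima/suprema."""
    values = sorted(dist)
    cap = next(v for v in values
               if sum(p for w, p in dist.items() if w > v) <= eps)
    floor = next(v for v in reversed(values)
                 if sum(p for w, p in dist.items() if w < v) <= eps)
    return sum(p * min(max(u, floor), cap) for u, p in dist.items())

def coupon(p_over_eps, x=1):
    """xC(p), with p given as a multiple of EPS."""
    return (Fraction(x), Fraction(p_over_eps) * EPS)

def buy_dist(c):
    x, p = c
    return {x - COST: p, -COST: 1 - p}

def policy_dist(world, policy):
    """Distribution of total utility; policy[i][j] = buy in branch j of stage i."""
    total = dict(NOTHING)
    for stage, choices in zip(world, policy):
        stage_dist = {}
        for (branch_p, c), buy in zip(stage, choices):
            outcomes = buy_dist(c) if (c and buy) else NOTHING
            for u, p in outcomes.items():
                stage_dist[u] = stage_dist.get(u, 0) + branch_p * p
        combined = {}
        for (u1, p1), (u2, p2) in product(total.items(), stage_dist.items()):
            combined[u1 + u2] = combined.get(u1 + u2, 0) + p1 * p2
        total = combined
    return total

def best_policy(world):
    """A's ultra-choice: the policy with the highest truncated expectation."""
    options = [list(product(*[(False, True) if c else (False,) for _, c in stage]))
               for stage in world]
    return max(product(*options),
               key=lambda pol: truncated_eu(policy_dist(world, pol), EPS))

def matching_precisions(world):
    """Which candidate precisions make the short-sighted B copy A's policy."""
    policy = best_policy(world)
    labels = (("lower", EPS / 2), ("equal", EPS), ("higher", 2 * EPS))
    return [name for name, delta in labels
            if all((truncated_eu(buy_dist(c), delta) > 0) == buy
                   for stage, choices in zip(world, policy)
                   for (_, c), buy in zip(stage, choices) if c)]

HALF = Fraction(1, 2)
WORLDS = {
    "W1": [[(1, coupon("0.75"))]],
    "W2": [[(1, coupon("1.5"))]],
    "W3": [[(1, coupon("0.75"))], [(1, coupon("0.75"))]],
    "W4": [[(HALF, coupon("1.5")), (HALF, None)]],
    "W5": [[(HALF, coupon("1.5")), (HALF, coupon("1.5", 2))]],
    "W6": [[(HALF, coupon("0.75")), (HALF, coupon("1.5", 2))]],
    "W7": [[(HALF, coupon("0.75")), (HALF, coupon("1.05", 2))]],
}

for name, world in WORLDS.items():
    print(name, best_policy(world), matching_precisions(world))
```

Under these assumptions the three candidate precisions are enough to exhibit all three cases of the theorem: some worlds are only matched by the lower precision, some only by the higher one, and some by the unchanged ε.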
Irrelevant information
One nice feature about this approach is that it ignores irrelevant information. Specifically:
- Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).
These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.
For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).
Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils the ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.
Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).
Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).
Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.
Creating a satisficer
A putative new idea for AI control; index here.
This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:
- Would not effectively aid M(u), a u-maximiser.
- Would not effectively resist M(-u), a u-minimizer.
So satisficers make poor allies and weak enemies.
The design, the designer and the verifier
Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).
However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except for u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
So it seems that S(u) must:
- Set u close to its maximal value (as this is "easy to approach").
- Not increase v to any significant extent (or else M(u-v) would not design it).
- Not decrease v to any significant extent (or else M(εu+v) would veto it).
- Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
- Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
For the moment, this does seem like it would produce a successful satisficer...
Resource gathering and pre-corriged agents
A putative new idea for AI control; index here.
Resource-gathering agent
It will often be useful to have a model of a “pure” resource gathering agent – one motivated only to gather resources, accumulate power, spread efficiently, and so on. This model could be used as behaviour not to emulate, or as a comparison yardstick for the accumulation behaviour of other agents.
The simplest design for a resource gathering agent would be to take a utility function u – one linear in paperclips, say – and give the agent the utility function X(u) + ¬X(-u), where X is some future observation that has 50% chance of occurring, and that the AI cannot affect. Some cosmological fact coming from a distant galaxy (at some point in the future) could do the trick.
This agent would behave roughly as a resource gathering agent, accumulating power in preparation for the day it would know what to do with it: it would want resources (as these could be used to create or destroy paperclips) but would be indifferent to creating or destroying paperclips currently, as the expected gain from u is exactly compensated by the expected loss from -u (and vice versa).
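A toy calculation may make the cancellation concrete. This is a sketch under assumptions not in the post: u simply counts net paperclips, the agent holds one unit of resource that can create or destroy exactly one paperclip, and the action names are hypothetical.

```python
def expected_utility(action, p_x=0.5):
    """Expected value of X(u) + ¬X(-u), where u = net paperclips created."""
    total = 0.0
    for x, prob in ((True, p_x), (False, 1 - p_x)):
        sign = 1 if x else -1           # utility is u if X holds, -u otherwise
        if action == "create_now":
            paperclips = 1
        elif action == "destroy_now":
            paperclips = -1
        elif action == "wait_then_act":
            # Keep the resource, then create if X and destroy if not X.
            paperclips = 1 if x else -1
        else:
            raise ValueError(action)
        total += prob * sign * paperclips
    return total

for action in ("create_now", "destroy_now", "wait_then_act"):
    print(action, expected_utility(action))
```

Acting before X is known has zero expected value either way, while holding on to the resource until X is revealed is worth a full unit, which is the sense in which the agent values resources but not paperclips.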
However, its behaviour is not independent of u: if for instance there were a Grand President of the Committee to Establish the Proper Number of Paperclips in the World (GPotCtEtPNoPitW), then the AI would desperately try to secure that position, but would not care overmuch about being the GPotCtEtPNoSitW, who deals with staples.
So a better model of a resource gathering agent is one that has a distribution P over all sorts of different utility functions, with the proviso that for all such utilities u, P(u)=P(-u). Note here that we’re talking about actual utility functions (which can be compared and summed directly), not functions-up-to-affine-transformations. This distribution P will be updated at some future date according to some phenomena outside of the agent’s control.
Then this agent, which currently has exactly zero motivations, will nonetheless accumulate resources in preparation for the day it will know what to do.
There are some distributions P which are better suited to getting a “purer” resource gathering agent (a bad P would be, eg, having lots of utilities which are tiny variations on u, which is essentially the same as having just u – but “tiny variations” is not a stable concept under affine transformations). A simplicity prior seems a natural choice here. If u is linear in paperclips and v in staples, then the complexity penalty for w=u+v doesn’t matter so much, as the agent will already want to preserve power over paperclips and staples, because of the (simpler) u, -u, v and -v.
Pre-corriged agents
One of the problems with corrigible agents is that they are, in a sense, too good at what they do. An agent that is currently a u maximiser and will transition tomorrow to being a v maximiser (and everyone knows this) will accept the deal “give me £1,000,000, and I’ll return it tripled tomorrow if you’re still a u-maximiser” (link to corrigibility paper). Why would it accept this deal? Because a real u-maximiser would, and it behaves (almost) exactly as a real u-maximiser.
We might be able to solve that specific problem with methods that identify agents or subagents (see subsequent posts). But there are still issues with, for instance, people who want to trade their own u-valuable and v-useless resources for the agent’s u-useless and v-valuable ones – and then propose the opposite trade tomorrow, with an extra premium.
We can use the idea of a resource gathering agent to prevent such loss of utility. Assume the agent has current utility u, and will transition to some v at a specific point in the future. It has a probability distribution P over what this v will be.
Then instead of having current utility u, have it instead as:
u + C Σv Q(v),
where C is some constant and Q(v)=(P(v)+P(-v))/2. Note that Q(v)=Q(-v), so this agent is currently a combination between a u-maximiser and a resource gathering agent – moreover, a resource gathering agent that cares about preserving flexibility in the (likely) correct areas for its future values. The importance of either factor (u-maximising or resource gathering) can be tuned by changing C.
What if the agent expects that their utility will get changed more than once in the future? This can be built up inductively: if there are two utility changes to come, for instance, then after the first transition (but before the second) the agent will have a composite utility, as above, of the form “u + Σv Q(v)”. Then the agent can have a P over all such composite utilities, and use that to define its current composite-composite utility (the one it has before the first change). A composite-composite utility is really just a composite utility, so the process can then be repeated.
Corrigibility will be applied to this setup in two types of circumstances: when people physically change the utility u, as before, and when the agent updates P (and hence Q) in a way that modifies the composite utility.
Note that this setup is less exploitable, but still suffers from the weakness that Q and P are not equal (in the worst case, you could have P(v)=0 while Q(v)=0.5). However, if Q were not symmetric, then the agent wouldn’t currently be a u-maximiser, so this non-equality is essential to preserving the idea of it being a (somewhat) u-maximising agent.
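The symmetrisation Q(v)=(P(v)+P(-v))/2 and the worst case just mentioned can be checked in a few lines. In this sketch utilities are represented as hypothetical (name, sign) pairs, so that negating a utility just flips the sign; that representation and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def negate(v):
    name, sign = v
    return (name, -sign)

def symmetrise(P):
    """Q(v) = (P(v) + P(-v)) / 2 over every v that P mentions, and its negation."""
    keys = set(P) | {negate(v) for v in P}
    return {v: (P.get(v, 0) + P.get(negate(v), 0)) / 2 for v in keys}

# The agent is certain it will become a staple maximiser.
P = {("staples", 1): Fraction(1)}
Q = symmetrise(P)

for v in sorted(Q):
    print(v, "P =", P.get(v, 0), "Q =", Q[v])
```

Here the staple minimiser has P equal to 0 but Q equal to 1/2, which is the gap between P and Q that an exploiter could try to use.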
This may not matter too much in practice, however. The agent is like an investor on the stock market who wants to purchase a lot of the long-term stock options, but has no current interest in any stocks. However, given that other people are interested in stocks, it would be stupid to buy and sell them at prices too divergent from the majority opinion, even if the agent doesn’t itself value them. General measures against blackmail or exploitation might also help here.
Does the Utility Function Halt?
Suppose, for a moment, that somebody has written the Utility Function. It takes, as its input, some Universe State, runs it through a Morality Modeling Language, and outputs a number indicating the desirability of that state relative to some baseline, and more importantly, other Universe States which we might care to compare it to.
Can I feed the Utility Function the state of my computer right now, as it is executing a program I have written? And is a universe in which my program halts superior to one in which my program wastes energy executing an endless loop?
If you're inclined to argue that's not what the Utility Function is supposed to be evaluating, I have to ask what, exactly, it -is- supposed to be evaluating? We can reframe the question in terms of the series of keys I press as I write the program, if that is an easier problem to solve than what my computer is going to do.
Less exploitable value-updating agent
My indifferent value learning agent design is in some ways too good. The agent transfers perfectly from being a u-maximiser to being a v-maximiser - but this makes it exploitable, as Benja has pointed out.
For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.
One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.
As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing paperclips/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclip/staple created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple created/destroyed above five, they will gain/lose one half utilon.
We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).
The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).
So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).
This looks a lot like a resource-conserving value-learning agent. It doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitability and value-learning that are worth exploring.
Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.
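The numbers in this example (10 utilons in the symmetric case, 9.75 in the asymmetric one) can be checked by brute force over every way of spending the £10 across the two days. This is a sketch under the post's simplifications: the ε chance of machine failure is ignored, the agent can only buy creations or destructions at £1 apiece, and the function names are hypothetical.

```python
from itertools import product

BUDGET = 10

def piecewise(n, knee_up=5, knee_down=5):
    """Utilons for a net change of n items: 1 each up to the knee, 0.5 after."""
    knee = knee_up if n >= 0 else knee_down
    size = abs(n)
    magnitude = min(size, knee) + 0.5 * max(size - knee, 0)
    return magnitude if n >= 0 else -magnitude

def moves(budget):
    """All (paperclip change, staple change) pairs affordable with `budget`."""
    rng = range(-budget, budget + 1)
    return [(dp, ds) for dp, ds in product(rng, rng)
            if abs(dp) + abs(ds) <= budget]

def expected_utility(day1, v):
    """Expected u + X*v for a day-one move, with the best day-two reply to each X."""
    dp1, ds1 = day1
    left = BUDGET - abs(dp1) - abs(ds1)
    total = 0.0
    for x in (+1, -1):
        best = max(piecewise(dp1 + dp2) + x * v(ds1 + ds2)
                   for dp2, ds2 in moves(left))
        total += 0.5 * best
    return total

def best_expected_utility(v):
    return max(expected_utility(day1, v) for day1 in moves(BUDGET))

def asymmetric_v(n):
    # Diminishing returns after creating 5 staples, but after destroying only 4.
    return piecewise(n, knee_up=5, knee_down=4)

plan = (5, 0)  # the plan in the text: five paperclips on day one, keep £5
print(expected_utility(plan, piecewise), best_expected_utility(piecewise))
print(expected_utility(plan, asymmetric_v), best_expected_utility(asymmetric_v))
```

In both cases the plan described in the text attains the brute-force maximum. Other day-one moves can tie with it in the symmetric case, which is why the post adds the tiny chance of the machine failing to make the agent act sooner.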
"incomparable" outcomes--multiple utility functions?
I know that this idea might sound a little weird at first, so just hear me out please?
A couple weeks ago I was pondering decision problems where a human decision maker has to choose between two acts that lead to two "incomparable" outcomes. I thought, if outcome A is not more preferred than outcome B, and outcome B is not more preferred than outcome A, then of course the decision maker is indifferent between both outcomes, right? But if that's the case, the decision maker should be able to just flip a coin to decide. Not only that, but adding even a tiny amount of extra value to one of the outcomes should always make that outcome be preferred. So why can't a human decision maker just make up their mind about their preferences between "incomparable" outcomes until they're forced to choose between them? Also, if a human decision maker is really indifferent between both outcomes, then they should be able to know that ahead of time and have a plan for deciding, such as flipping a coin. And, if they're really indifferent between both outcomes, then they should not be regretting and/or doubting their decision before an outcome even occurs regardless of which act they choose. Right?
I thought of the idea that maybe the human decision maker has multiple utility functions that when you try to combine them into one function some parts of the original functions don't necessarily translate well. Like some sort of discontinuity that corresponds to "incomparable" outcomes, or something. Granted, it's been a while since I've taken Calculus, so I'm not really sure how that would look on a graph.
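One way to picture this (a sketch of my reading, not something established above) is to keep the utility functions separate and only call one outcome better when every function agrees. The outcome names and scores below are made up for illustration.

```python
def compare(a, b):
    """Compare two outcomes scored by several utility functions at once.

    Assumption: an outcome is preferred only if no utility function ranks it
    lower (Pareto dominance); otherwise the two are treated as incomparable.
    """
    if a == b:
        return "indifferent"
    if all(x >= y for x, y in zip(a, b)):
        return "prefer first"
    if all(x <= y for x, y in zip(a, b)):
        return "prefer second"
    return "incomparable"

# Hypothetical scores under (ambition, family) utility functions.
career = (9.0, 2.0)
family = (3.0, 8.0)
career_plus = (9.1, 2.0)   # the career option with a tiny bonus added

print(compare(career, family))
print(compare(career_plus, career))
print(compare(career_plus, family))
```

With these scores the tiny bonus makes the improved career option better than the original one, yet it is still incomparable with the family option. That is exactly what would not happen if the decision maker were merely indifferent.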
I had read Yudkowsky's "Thou Art Godshatter" a couple months ago, and there was a point where it said "one pure utility function splintered into a thousand shards of desire". That sounds like the "shards of desire" are actually a bunch of different utility functions.
I'd like to know what others think of this idea. Strengths? Weaknesses? Implications?
Potential vs already existent people and aggregation
EDIT: the purpose of this post is simply to show that there is a difference between certain reasoning for already existing and potential people. I don't argue that aggregation is the only difference, nor (in this post) that total utilitarianism for potential people is wrong. Simply that the case for existing people is stronger than for potential people.
Consider the following choices:
- You must choose between torturing someone for 50 years, or torturing 3^^^3 people for a millisecond each (yes, it's a more symmetric variant on the dust-specks vs torture problem).
- You must choose between creating someone who will be tortured for 50 years, or creating 3^^^3 people who will each get tortured for a millisecond each.
Some people might feel that these two choices are the same. There are some key differences between them, however - and not only because the second choice seems more underspecified than the first. The difference is the effect of aggregation - of facing the same choice again and again and again. And again...
There are roughly 1.6 billion seconds in 50 years (hence 1.6 trillion milliseconds in 50 years). Assume a fixed population of 3^^^3 people, and assume that you were going to face the first choice 1.6 trillion times (in each case, the person to be tortured is assigned randomly and independently). Then choosing "50 years" each time results in 1.6 trillion people getting tortured for 50 years (the chance of the same person being chosen to be tortured twice is of the order of 50/3^^^3 - closer to zero than most people can imagine). Choosing "a millisecond" each time results in 3^^^3 people, each getting tortured for (slightly more than) 50 years.
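The arithmetic here is quick to check. The sketch below assumes a year of 365.25 days; the variable names are only for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: 365.25-day year

seconds_in_50_years = 50 * SECONDS_PER_YEAR
milliseconds_in_50_years = seconds_in_50_years * 1000

# Choosing "a millisecond" 1.6 trillion times gives each person this many years:
repeats = 1.6e12
years_each = repeats / 1000 / SECONDS_PER_YEAR

print(f"{seconds_in_50_years:.3e} seconds in 50 years")
print(f"{milliseconds_in_50_years:.3e} milliseconds in 50 years")
print(f"{years_each:.1f} years of torture per person")
```

Since 50 years is a little under 1.6 trillion milliseconds, repeating the millisecond choice 1.6 trillion times does come out at slightly more than 50 years each.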
The choice there is clear: pick "50 years". Now, you could argue that your decision should change based on how often you (or people like you) expect to face the same choice, and that the argument assumes a fixed population of size 3^^^3, but there is a strong intuitive case to be made that the 50 years of torture is the way to go.
Compare with the second choice now. Choosing "50 years" 1.6 trillion times results in the creation of 1.6 trillion people who get tortured for 50 years. The "a millisecond" choice results in 1.6 trillion times 3^^^3 people being created, each tortured for a millisecond. Conditional on what the rest of the life of these people is like, many people (including me) would feel the "a millisecond" option is much better.
As far as I can tell (please do post suggestions), there is no way of aggregating impacts on potential people you are creating, in the same way that you can aggregate impacts on existing people (of course, you can first create potential people, then add impacts to them - or add impacts that will affect them when they get created - but this isn't the same thing). Thus the two situations seem justifiably different, and there is no strong reason to assign the intuitions of the first case to the second.
Population ethics and utility indifference
It occurs to me that the various utility indifference approaches might be usable in population ethics.
One challenge for non-total utilitarians is how to deal with new beings. Some theories - average utilitarianism, for instance, or some other systems that use overall population utility - have no problem dealing with this. But many non-total utilitarians would like to see creating new beings as a strictly neutral act.
One way you could do this is by starting with a total utilitarian framework, but subtracting a certain amount of utility every time a new being B is brought into the world. In the spirit of utility indifference, we could subtract exactly the expected utility that we expect B to enjoy during their life.
This means that we should be indifferent as to whether B is brought into the world or not, but, once B is there, we should aim to increase B's utility. There are two problems with this. The first is that, strictly interpreted, we would also be indifferent to creating people with negative utility. This can be addressed by only doing the "utility correction" if B's expected utility is positive, thus preventing us from creating beings only to have them suffer.
The second problem is more serious. What about all the actions that we could do, ahead of time, in order to harm or benefit the new being? For instance, it would seem perverse to argue that buying a rattle for a child after they are born (or conceived) is an act of positive utility, whereas buying it before they were born (or conceived) would be a neutral act, since the increase in expected utility for the child is cancelled out by the above process. Not only is it perverse, but it isn't timeless, and isn't stable under self modification.
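The rattle problem can be made concrete with a toy calculation. The numbers and the function name `moral_value` are hypothetical; the only thing taken from the text is the rule of subtracting the being's expected utility at creation, and only when that expectation is positive.

```python
def moral_value(realised_utility, expected_at_creation):
    """Contribution of one new being after the utility-indifference correction.

    The correction is applied only if the expected utility is positive, so
    creating a being expected to suffer still counts as negative.
    """
    correction = max(expected_at_creation, 0)
    return realised_utility - correction

BASELINE = 100   # hypothetical lifetime utility of the child without the rattle
RATTLE = 1       # hypothetical utility the rattle adds

# Rattle bought before creation: it is already priced into the expectation.
bought_before = moral_value(BASELINE + RATTLE, BASELINE + RATTLE)
# Rattle bought after creation: the expectation at creation did not include it.
bought_after = moral_value(BASELINE + RATTLE, BASELINE)
# A being expected to have negative utility gets no correction.
suffering = moral_value(-50, -50)

print(bought_before, bought_after, suffering)
```

The same rattle is neutral if bought before creation and positive if bought afterwards, which is the time-dependence objected to above.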
Omission vs commission and conservation of expected moral evidence
Consequentialism traditionally doesn't distinguish between acts of commission and acts of omission. Not flipping the lever to the left is equivalent to flipping it to the right.
But there seems one clear case where the distinction is important. Consider a moral learning agent. It must act in accordance with human morality and desires, which it is currently unclear about.
For example, it may consider whether to forcibly wirehead everyone. If it does so, then everyone will agree, for the rest of their existence, that the wireheading was the right thing to do. Therefore across the whole future span of human preferences, humans agree that wireheading was correct, apart from a very brief period of objection in the immediate future. Given that human preferences are known to be inconsistent, this seems to imply that forcible wireheading is the right thing to do (if you happen to personally approve of forcible wireheading, replace that example with some other forcible rewriting of human preferences).
What went wrong there? Well, this doesn't respect "conservation of moral evidence": the AI got the moral values it wanted, but only through the actions it took. This is very close to the omission/commission distinction. We'd want the AI to not take actions (commission) that determine the (expectation of the) moral evidence it gets. Instead, we'd want the moral evidence to accrue "naturally", without interference and manipulation from the AI (omission).
Truth vs Utility
According to Eliezer, there are two types of rationality. There is epistemic rationality, the process of updating your beliefs based on evidence to correspond to the truth (or reality) as closely as possible. And there is instrumental rationality, the process of making choices in order to maximize your future utility yield. These two slightly conflicting definitions work together most of the time as obtaining the truth is the rationalists' ultimate goal and thus yields the maximum utility. Are there ever times when the truth is not in a rationalist's best interest? Are there scenarios in which a rationalist should actively try to avoid the truth to maximize their possible utility? I have been mentally struggling with these questions for a while. Let me propose a scenario to illustrate the conundrum.
Suppose Omega, a supercomputer, comes down to Earth to offer you a choice. Option 1 is to live in a stimulated world where you have infinite utility (on this world there is no, pain, suffering, death, its basically a perfect world) and you are unaware you are living in a stimulation. Option 2 is Omega will answer one question on absolutely any subject truthfully pertaining to our universe with no strings attached. You can ask about the laws governing the universe, the meaning of life, the origin of time and space, whatever and Omega will give you a absolutely truthful, knowledgeable answer. Now, assuming all of these hypotheticals are true, which option would you pick? Which option should a perfect rationalist pick? Does the potential of asking a question whose answer could greatly improve humanity's knowledge of our universe outweigh the benefits of living in a perfect simulated world with unlimited utility? There is probably a lot of people who would object outright to living in a simulation because it's not reality or the truth. Well lets consider the simulation in my hypothetical conundrum for a second. It's a perfect reality and has unlimited utility potential, and you are completely unaware you are in a simulation on this world. Aside from the unlimited utility part, that sounds a lot like our reality. There are no signs of our reality of being a simulation and all (most) of humanity is convinced that our reality is not a simulation. There for, the only difference that really matters between the simulation in Option 1 and our reality is the unlimited utility potential that Option 1 offers. If there is no evidence that a simulation is not reality then the simulation is reality for the people inside the simulation. That is what I believe and that is why I would choose Option 1. The infinite utility of living in a perfect reality outweighs almost any utility amount increase I could contribute to humanity.
I am very interested in which option the Less Wrong community would choose (I know Option 2 is kind of arbitrary; I just needed an option for people who wouldn't want to live in a simulation). As this is my first post, any feedback or criticism is appreciated. Any further information on the topic of truth vs utility would also be very helpful. Feel free to downvote me to oblivion if this post was stupid, didn't make sense, etc. It was simply an idea that I found interesting and wanted to put into writing. Thank you for reading.
Model of unlosing agents
Some have expressed skepticism that "unlosing agents" can actually exist. So to provide an existence proof, here is a model of an unlosing agent. It's not a model you'd want to use constructively to build one, but it's sufficient for the existence result.
Let D be the set of all decisions the agent has made in the past, let U be the set of all utility functions that are compatible with those decisions, and let P be a "better than" relationship on the set of outcomes (possibly intransitive, dependent, incomplete, etc...).
By "utility functions that are compatible with those decisions" I mean that an expected utility maximising agent with any u in U would reach the same decisions D as the agent actually did. Notice that U starts off infinitely large when D is empty; when the agent faces a new decision d, here is a decision criterion that leaves U non-empty:
1. Restrict to the set of possible decision choices that would leave U non-empty. This is always possible, as any u in U would advocate for a particular decision choice du at d, and therefore choosing du would leave u in the updated U. Call this set compatible.
2. Among those compatible choices, choose one that is the least incompatible with P, using some criterion (such as needing to do the least work to remove intransitivities and dependences and so on).
3. Make that choice, and update P as in step 2, and update D and U (leaving U non-empty, as seen in step 1).
4. Proceed.
That's the theory. In practice, we would want to restrict the utilities initially allowed into U to avoid really stupid utilities ("I like losing money to people called Rob at 15:46.34 every alternate Wednesday if the stock market is up; otherwise I don't.") When constructing the initial P and U, a good start could be to look only at categories that humans naturally express preferences between. But those are implementation details. And again, using this kind of explicit design violates the spirit of unlosing agents (unless the set U is defined in ways that are different from simply listing all u in U).
The proof that this agent is unlosing is that a) U will never be empty, and b) for any u in U, the agent will have behaved indistinguishably from a u-maximiser.
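To make the loop above concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the three outcomes, the restriction of U to the six strict rankings of those outcomes (a finite stand-in for the infinite U), the representation of P as a set of (better, worse) pairs, and the "count the clashes" measure of incompatibility with P. It is not meant as a constructive design, just a check that the steps can be carried out.

```python
from itertools import permutations

OUTCOMES = ("apple", "banana", "cherry")  # hypothetical outcomes

def strict_utilities():
    """Finite stand-in for U: every strict ranking of the outcomes (an assumption)."""
    return [dict(zip(OUTCOMES, ranks)) for ranks in permutations(range(len(OUTCOMES)))]

def best_choice(u, menu):
    """The option a u-maximiser picks from the menu (rankings are strict, so no ties)."""
    return max(menu, key=lambda outcome: u[outcome])

def clashes_with_P(choice, menu, P):
    """Number of menu options that P ranks above the choice."""
    return sum(1 for other in menu if (other, choice) in P)

def decide(menu, U, P):
    # Step 1: only choices that some u in U would make keep U non-empty.
    compatible = {best_choice(u, menu) for u in U}
    # Step 2: among those, take the choice least incompatible with P.
    choice = min(sorted(compatible), key=lambda c: clashes_with_P(c, menu, P))
    # Step 3: update U, and drop the P pairs that the choice contradicts.
    new_U = [u for u in U if best_choice(u, menu) == choice]
    new_P = {(x, y) for (x, y) in P if not (y == choice and x in menu)}
    return choice, new_U, new_P

def run(menus, P):
    U = strict_utilities()
    choices = []
    for menu in menus:
        choice, U, P = decide(menu, U, P)
        choices.append(choice)
    return choices, U, P

# An intransitive starting P: apple > banana > cherry > apple.
P_CYCLIC = {("apple", "banana"), ("banana", "cherry"), ("cherry", "apple")}
MENUS = [("apple", "banana"), ("banana", "cherry"), ("cherry", "apple")]
choices, U, P = run(MENUS, P_CYCLIC)
```

In this toy run the agent follows P for the first two decisions, but on the third it picks apple over cherry even though P says otherwise, because no utility left in U would choose cherry. The cycle in P gets broken, and U ends up non-empty.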
Continuity axiom of vNM
In a previous post, I left a somewhat cryptic comment on the continuity/Archimedean axiom of vNM expected utility.
- (Continuity/Archimedean) This axiom (and acceptable weaker versions of it) is much more subtle than it seems; "No choice is infinitely important" is what it seems to say, but " 'I could have been a contender' isn't good enough" is closer to what it does. Anyway, that's a discussion for another time.
Here I'll explain briefly what I mean by it. Let's drop that axiom, and see what could happen. First of all, we could have a utility function with non-standard real value. This allows some things to be infinitely more important than others. A simple illustration is lexicographical ordering; eg my utility function consists of the amount of euros I end up owning, with the amount of sex I get serving as a tie-breaker.
There is nothing wrong with such a function! First, because in practice it functions as a standard utility function (I'm unlikely to be able to indulge in sex in a way that has absolutely no costs or opportunity costs, so the amount of euros will always predominate). Secondly because, even if it does make a difference... it's still expected utility maximisation, just a non-standard version.
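As a rough illustration of the lexicographic example, here is a small Python sketch. It assumes (my choice, not something stated above) that utilities are (euros, sex) pairs, that the expectation is taken component by component, and that lotteries are compared with Python's built-in tuple ordering, which is lexicographic.

```python
def expected_utility(lottery):
    """lottery: list of (probability, (euros, sex)) pairs; returns the expected pair."""
    euros = sum(p * outcome[0] for p, outcome in lottery)
    sex = sum(p * outcome[1] for p, outcome in lottery)
    return (euros, sex)

def prefers(lottery_a, lottery_b):
    """True if lottery_a is strictly preferred; tuples compare lexicographically."""
    return expected_utility(lottery_a) > expected_utility(lottery_b)

# Hypothetical numbers: sex only matters when the euros are exactly tied.
tie_breaker = prefers([(1.0, (100, 5))], [(1.0, (100, 0))])
euros_dominate = prefers([(1.0, (101, 0))], [(1.0, (100, 1000))])
```

Both comparisons come out true: the second component breaks an exact tie in euros, but no amount of it can make up for a single euro.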
But worse things can happen if you drop the axiom. Consider this decision criterion: I will act so that, at some point, there will have been a chance of me becoming heavyweight champion of the world. This is compatible with all the other vNM axioms, but is obviously not what we want as a decision criterion. In the real world, such a decision criterion is vacuous (there is a non-zero chance of me becoming heavyweight champion of the world right now), but it certainly could apply in many toy models.
That's why I said that the continuity axiom is protecting us from "I could have been a contender (and that's all that matters)" type reasoning, not so much from "some things are infinitely important (compared to others)".
Also notice that the quantum many-worlds version of the above decision criterion - "I will act so that the measure of type X universe is non-zero" - does not sound quite as stupid, especially if you bring in anthropics.
Expected utility, unlosing agents, and Pascal's mugging
Still very much a work in progress
EDIT: model/existence proof of unlosing agents can be found here.
Why do we bother about utility functions on Less Wrong? Well, because of results of the New man and the Morning Star, which showed that, essentially, if you make decisions, you'd better use something equivalent to expected utility maximisation. If you don't, you lose. Lose what? It doesn't matter, money, resources, whatever: the point is that any other system can be exploited by other agents or the universe itself to force you into a pointless loss. A pointless loss being a loss that gives you no benefit or possibility of benefit - it's really bad.
The justifications for the axioms of expected utility are, roughly:
- (Completeness) "If you don't decide, you'll probably lose pointlessly."
- (Transitivity) "If your choices form loops, people can make you lose pointlessly."
- (Continuity/Archimedean) This axiom (and acceptable weaker versions of it) is much more subtle than it seems; "No choice is infinitely important" is what it seems to say, but " 'I could have been a contender' isn't good enough" is closer to what it does. Anyway, that's a discussion for another time.
- (Independence) "If your choices aren't independent, people can expect to make you lose pointlessly."
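The transitivity justification can be sketched as a toy money pump. The items, the one-unit fee and the rule "the agent pays the fee whenever it is offered something it strictly prefers to what it holds" are all assumptions made for illustration.

```python
def money_pump(prefers, start_item, offers, fee=1):
    """Offer trades in sequence; the agent pays `fee` each time it accepts one.

    `prefers` is a set of (better, worse) pairs (a hypothetical representation).
    """
    item, paid = start_item, 0
    for offered in offers:
        if (offered, item) in prefers:
            item, paid = offered, paid + fee
    return item, paid

# Cyclic preferences: A > B, B > C, C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Transitive preferences: A > B > C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

looped = money_pump(cyclic, "A", ["C", "B", "A"])
steady = money_pump(transitive, "A", ["C", "B", "A"])
```

The agent with looping preferences ends up holding A again, three units poorer; the transitive agent refuses every trade and pays nothing.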
Equivalency is not identity
A lot of people believe a subtly different version of the result:
- If you don't have a utility function, you'll lose pointlessly.
This is wrong. The correct result is:
- If you don't lose pointlessly, then your decisions are equivalent to having a utility function.
Artificial Utility Monsters as Effective Altruism
Dear effective altruist,
have you considered artificial utility monsters as a high-leverage form of altruism?
In the traditional sense, a utility monster is a hypothetical being which gains so much subjective wellbeing (SWB) from marginal input of resources that any other form of resource allocation is inferior on a utilitarian calculus. (as illustrated on SMBC)
This has been used to show that utilitarianism is not as egalitarian as it intuitively may appear, since it prioritizes some beings over others rather strictly - including humans.
The traditional utility monster is implausible even in principle - it is hard to imagine a mind that is constructed such that it will not succumb to diminishing marginal utility from additional resource allocation. There is probably some natural limit on how much SWB a mind can implement, or at least how much this can be improved by spending more on the mind. This would probably even be true for an algorithmic mind that can be sped up with faster computers, and there are probably limits to how much a digital mind can benefit in subjective speed from the parallelization of its internal subcomputations.
However, we may broaden the traditional definition somewhat and call any technology utility-monstrous if it implements high SWB with exceptionally good cost-effectiveness and in a scalable form - even if this scalability stems from a larger set of minds running in parallel, rather than one mind feeling much better or living much longer per additional joule/dollar.
Under this definition, it may be very possible to create and sustain many artificial minds reliably and cheaply, while they all have a very high SWB level at or near subsistence. An important point here is that possible peak intensities of artificially implemented pleasures could be far higher than those commonly found in evolved minds: Our worst pains seem more intense than our best pleasures for evolutionary reasons - but the same does not have to be true for artificial sentience, whose best pleasures could be even more intense than our worst agony, without any need for suffering anywhere near this strong.
If such technologies can be invented - which seems highly plausible in principle, if not yet in practice - then the original conclusion for the utilitarian calculus is retained: It would be highly desirable for utilitarians to facilitate the invention and implementation of such utility-monstrous systems and allocate marginal resources to subsidize their existence. This makes it a potential high-value target for effective altruism.
Many tastes, many utility monsters
Human motivation is barely stimulated by abstract intellectual concepts, and "utilitronium" sounds more like "aluminium" than something to desire or empathize with. Consequently, the idea is as sexy as a brick. "Wireheading" evokes associations of having a piece of metal rammed into one's head, which is understandably unattractive to any evolved primate (unless it's attached to an iPod, which apparently makes it okay).
Technically, "utility monsters" suffer from a similar association problem, which is that the idea is dangerous or ethically monstrous. But since the term is so specific and established in ethical philosophy, and since "monster" can at least be given an emotive and amicable - almost endearing - tone, it seems realistic to use it positively. (Suggestions for a better name are welcome, of course.)
So a central issue for the actual implementation and funding is human attraction. It is more important to motivate humans to embrace the existence of utility monsters than it is for them to be optimally resource-efficient - after all, a technology that is never implemented or funded properly gains next to nothing from being efficient.
A compromise between raw efficiency of SWB per joule/dollar and better forms to attract humans might be best. There is probably a sweet spot - perhaps various different ones for different target groups - between resource-efficiency and attractiveness. Only die-hard utilitarians will actually want to fund something like hedonium, but the rest of the world may still respond to "The Sims - now with real pleasures!", likeable VR characters, or a new generation of reward-based Tamagotchis.
Once we step away somewhat from maximum efficiency, the possibilities expand drastically. Implementation forms may be:
- decorative like gimmicks or screensavers,
- fashionable like sentient wearables,
- sophisticated and localized like works of art,
- cute like pets or children,
- personalized like computer game avatars retiring into paradise,
- erotic like virtual lovers who continue to have sex without the user,
- nostalgic like digital spirits of dead loved ones in artificial serenity,
- crazy like hyperorgasmic flowers,
- semi-functional like joyful household robots and software assistants,
- and of course generally a wide range of human-like and non-human-like simulated characters embedded in all kinds of virtual narratives.
Possible risks and mitigation strategies
Open-source utility monsters could be made public as templates to add additional control that the implementation of sentience is correct and positive, and to make better variations easy to explore. However, this would come with the downside of malicious abuse and reckless harm potential. Risks of suffering could come from artificial unhappiness desired by users, e.g. for narratives that contain sadism, dramatic violence or punishment of evil characters for quasi-moral gratification. Another such risk could come simply from bad local modifications that implement suffering by accident.
Despite these risks, one may hope that most humans who care enough to run artificial sentience are more benevolent and careful than malevolent and careless in a way that causes more positive SWB than suffering. After all, most people love their pets and do not torture them, and other people look down on those who do (compare this discussion of Norn abuse, which resulted in extremely hostile responses). And there may be laws against causing artificial suffering. Still, this is an important point of concern.
Closed-source utility monsters may further mitigate some of this risk by not making the sentient phenotypes directly available to the public, but encapsulating their internal implementation within a well-defined interface - like a physical toy or closed-source software that can be used and run by private users, but not internally manipulated beyond a well-tested state-space without hacking.
An extremely cautionary approach would be to run the utility monsters by externally controlled dedicated institutions and only give the public - such as voters or donors - some limited control over them through communication with the institution. For instance, dedicated charities could offer "virtual paradises" to donors so they can "adopt" utility monsters living there in certain ways without allowing those donors to actually lay hands on their implementation. On the other hand, this would require a high level of trustworthiness of the institutions or charities and their controllers.
Not for the sake of utility monsters alone
Human values are complex, and it has been argued on LessWrong that the resource allocation of any good future should not be spent for the sake of pleasure or happiness alone. As evolved primates, we all have more than one intuitive value we hold dear, even among self-identified intellectual utilitarians, who compose only a tiny fraction of the population.
However, some discussions in the rationalist community touching related technologies like pleasure wireheading, utilitronium, and so on, have suffered from implausible or orthogonal assumptions and associations. Since the utilitarian calculus favors SWB maximization above all else, it has been feared, we run the risk of losing a more complex future because
a) utilitarianism knows no compromise and
b) the future will be decided by one winning singleton who takes it all and
c) we have only one world with only one future to get it right
In addition, low status has been ascribed to wireheads, with the association of fake utility or cheating life as a form of low-status behavior. People have been competing for status by associating themselves with the miserable Socrates instead of the happy pig, without actually giving up real option value in their own lives.
On Scott Alexander's blog, there's a good example of a mostly pessimistic view both in the OP and in the comments. And in this comment on an effective altruism critique, Carl Shulman names hedonistic utilitarianism turning into a bad political ideology similar to communist states as a plausible failure mode of effective altruism.
So, will we all be killed by a singleton who turns us into utilitronium?
Be not afraid! These fears are plausibly unwarranted because:
a) Utilitarianism is consequentialism, and consequentialists are opportunistic compromisers - even within the conflicting impulses of their own evolved minds. The number of utilitarians who would accept existential risk for the sake of pleasure maximization is small, and practically all of them subscribe to the philosophy of cooperative compromise with orthogonal, non-exclusive values in the political marketplace. Those who don't are incompetent almost by definition and will never gain much political traction.
b) The future may very well not be decided by one singleton but by a marketplace of competing agency. Building a singleton is hard and requires the strict subduction or absorption of all competition. Even if it were to succeed, the singleton will probably not implement only one human value, since it will be created by many humans with complex values, or at least it will have to make credible concessions to a critical mass of humans with diverse values who can stop it before it reaches singleton status. And if these mitigating assumptions are all false and a fooming singleton is possible and easy, then too much pleasure should be the least of humanity's worries - after all, in this case the Taliban, the Chinese government, the US military or some modern King Joffrey are just as likely to get the singleton as the utilitarians.
c) There are plausibly many Everett branches and many Hubble volumes like ours, implementing more than one future-earth outcome, as summed up by Max Tegmark here. Even if infinitarian multiverse theories should all end up false against current odds, a very large finite universe would still be far more realistic than a small one, given our physical observations. This makes a pre-existing value diversity highly probable if not inevitable. For instance, if you value pristine nature in addition to SWB, you should accept the high probability of many parallel earth-like planets with pristine nature regardless of what you do, and consider that we may be in an exceptional minority position to improve the measure of other values that do not naturally evolve easily, such as a very high positive-SWB-over-suffering surplus.
From the present, into the future
If we accept the conclusion that utility-monstrous technology is a high-value vector for effective altruism (among others), then what could current EAs do as we transition into the future? To my best knowledge, we don't have the capacity yet to create artificial utility monsters.
However, foundational research in neuroscience and artificial intelligence/sentience theory is already ongoing today and certainly a necessity if we ever want to implement utility-monstrous systems. In addition, outreach and public discussion of the fundamental concepts is also possible and plausibly high-value (hence this post). Generally, the following steps seem all useful and could use the attention of EAs, as we progress into the future:
- spread the idea, refine the concepts, apply constructive criticism to all its weak spots until it becomes either solid or revealed as irredeemably undesirable
- identify possible misunderstandings, fears, biases etc. that may reduce human acceptance and find compromises and attraction factors to mitigate them
- fund and do the scientific research that, if successful, could lead to utility-monstrous technologies
- fund the implementation of the first actual utility monsters and test them thoroughly, then improve on the design, then test again, etc.
- either make the templates public (open-source approach) or make them available for specialized altruistic institutions, such as private charities
- perform outreach and fundraising to give existence donations to as many utility monsters as possible
All of this can be done without much self-sacrifice on the part of any individual. And all of this can be done within existing political systems, existing markets, and without violating anyone's rights.
Conservation of expected moral evidence, clarified
You know that when you title a post with "clarified", that you're just asking for the gods to smite you down, but let's try...
There has been some confusion about the concept of "conservation of expected moral evidence" that I touched upon in my posts here and here. The fault for the confusion is mine, so this is a brief note to try and explain it better.
The canonical example is that of a child who wants to steal a cookie. That child gets its morality mainly from its parents. The child strongly suspects that if it asks, all parents will indeed confirm that stealing cookies is wrong. So it decides not to ask, and happily steals the cookie.
I argued that this behaviour showed a lack of "conservation of expected moral evidence": if the child knows what the answer would be, then that should be equivalent to actually asking. Some people got this immediately, and some people were confused that the agents I defined seemed Bayesian, and so should have conservation of expected evidence already, so how can they violate that principle?
The answer is... both groups are right. The child can be modelled as a Bayesian agent reaching sensible conclusions. If it values "I don't steal the cookie" at 0, "I steal the cookie without being told not to" at 1, and "I steal the cookie after being told not to" at -1, then its behaviour is rational - and those values are acceptable utility values over possible universes. So the child (and many value loading agents) are Bayesian agents with the usual properties.
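A quick sketch of the child's calculation, using the three values given above. The 0.99 probability that the parents say "no" when asked is a made-up number for illustration, as is the assumption that the child picks the best available action after hearing the answer.

```python
# Values from the example above.
VALUES = {"no_steal": 0, "steal_untold": 1, "steal_told": -1}

def child_expected_value(ask, p_parents_say_no=0.99):
    """Expected value of asking or not asking, then acting optimally.

    p_parents_say_no is a hypothetical probability, not taken from the post.
    """
    best_if_untold = max(VALUES["no_steal"], VALUES["steal_untold"])
    if not ask:
        return best_if_untold
    best_if_told = max(VALUES["no_steal"], VALUES["steal_told"])
    return p_parents_say_no * best_if_told + (1 - p_parents_say_no) * best_if_untold

value_not_asking = child_expected_value(ask=False)
value_asking = child_expected_value(ask=True)
```

With these values, not asking and stealing is worth 1, while asking is worth only about 0.01, so a straightforward expected utility maximiser avoids asking.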
But we are adding extra structure to the universe. Based on our understanding of what value loading should be, we are decreeing that the child's behaviour is incorrect. Though it doesn't violate expected utility, it violates any sensible meaning of value loading. Our idea of value loading is that, in a sense, values should be independent of many contingent things. There is nothing intrinsically wrong with "stealing cookies is wrong iff the Milky Way contains an even number of pulsars", but it violates what values should be. Similarly for "stealing cookies is wrong iff I ask about it".
But let's dig a bit deeper... Classical conservation of expected evidence fails in many cases. For instance, I can certainly influence the variable X="what Stuart will do in the next ten seconds" (or at least, my decision theory is constructed on assumptions that I can influence that). My decisions change X's expected value quite dramatically. What I can't influence is facts that are not contingent on my actions. For instance, I can't change my expected estimation of the number of pulsars in the galaxy last year. Were I super-powerful, I could change my expected estimation of the number of pulsars in the galaxy next year - by building or destroying pulsars, for instance.
So conservation of expected evidence only applies to things that are independent of the agent's decisions. When I say we need to have "conservation of expected moral evidence" I'm saying that the agent should treat their (expected) morality as independent of their decisions. The kid failed to do this in the example above, and that's the problem.
So conservation of expected moral evidence is something that would be automatically true if morality were something real and objective, and is also a desideratum when constructing general moral systems in practice.
Against utility functions
I think we should stop talking about utility functions.
In the context of ethics for humans, anyway. In practice I find utility functions to be, at best, an occasionally useful metaphor for discussions about ethics but, at worst, an idea that some people start taking too seriously and which actively makes them worse at reasoning about ethics. To the extent that we care about causing people to become better at reasoning about ethics, it seems like we ought to be able to do better than this.
The funny part is that the failure mode I worry the most about is already an entrenched part of the Sequences: it's fake utility functions. The soft failure is people who think they know what their utility function is and say bizarre things about what this implies that they, or perhaps all people, ought to do. The hard failure is people who think they know what their utility function is and then do bizarre things. I hope the hard failure is not very common.
It seems worth reflecting on the fact that the point of the foundational LW material discussing utility functions was to make people better at reasoning about AI behavior and not about human behavior.
An extended class of utility functions
This is a technical result that I wanted to check before writing up a major piece on value loading.
The purpose of a utility function is to give an agent criteria with which to make a decision. If two utility functions always give the same decisions, they're generally considered the same utility function. So, for instance, the utility function u always gives the same decisions as u+C for some constant C, or Du for some positive constant D. Thus we can say that utility functions are equivalent if they are related by a positive affine transformation.
For specific utility functions, and specific agents, the class of functions that give the same decisions is quite a bit larger. For instance, imagine that v is a utility function with the property v("any universe which contains humans")=constant. Then any human who attempts to follow u, could equivalently follow u+v (neglecting acausal trade) - it makes no difference. In general, if no action the agent could ever take would change the value of v, then u and u+v give the same decisions.
More subtly, if the agent can change v but cannot change the expectation of v, then u and u+v still give the same decisions. This is because for any actions a and b the agent could take:
E(u+v | a) = E(u | a) + E(v | a) = E(u | a) + E(v | b).
Hence E(u+v | a) > E(u+v | b) if and only if E(u | a) > E(u | b), and so the decision hasn't changed.
Note that E(v | a) need not be constant for all actions: it is simply that for every pair of actions a and b that an agent could take at a particular decision point, E(v | a) = E(v | b). It's perfectly possible for the expectation of v to be different at different moments, or conditional on different decisions made at different times.
Finally, as long as v obeys the above properties, there is no reason for it to be a utility function in the classical sense - it could be constructed any way we want.
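Here is a small numerical check of the argument above, under made-up assumptions: two actions a and b, outcomes written as (u value, v value) pairs, and a v whose distribution depends on the action (it swings by plus or minus 10 under a, and is fixed at 0 under b) while its expectation is 0 either way.

```python
def expectation(f, lottery):
    """Expected value of f over a lottery given as {outcome: probability}."""
    return sum(p * f(outcome) for outcome, p in lottery.items())

def best_action(f, actions):
    """The action with the highest expected value of f."""
    return max(actions, key=lambda name: expectation(f, actions[name]))

# Hypothetical outcomes are (u_value, v_value) pairs.
ACTIONS = {
    "a": {(8, 10): 0.25, (8, -10): 0.25, (0, 10): 0.25, (0, -10): 0.25},
    "b": {(3, 0): 1.0},
}

def u(outcome):
    return outcome[0]

def v(outcome):
    return outcome[1]

def u_plus_v(outcome):
    return u(outcome) + v(outcome)

choice_u = best_action(u, ACTIONS)
choice_u_plus_v = best_action(u_plus_v, ACTIONS)
```

The agent can change what v turns out to be by picking a, but not its expectation, so both u and u+v recommend the same action.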
An example: suffer not from probability, nor benefit from it
The preceding seems rather abstract, but here is the motivating example. It's a correction term T that adds or subtracts utility, as external evidence comes in (it's important that the evidence is external - the agent gets no correction from knowing what its own actions are/were). If the AI knows evidence e, and new (external) evidence f comes in, then its utility gets adjusted by T(e,f) which is defined as
T(e,f) = E(u | e) - E(u | e, f)
In other words, the agent's utility gets adjusted by the difference between the old expected utility and the new - and hence the agent's expected utility is unchanged by new external evidence.
Consider for instance an agent with a utility u linear in money. It must choose between a bet that goes 50-50 on $0 (heads) or $100 (tails), versus a sure $49. It correctly chooses the bet, having an expected utility of u=$50 - in other words, E(u | bet)=$50. But now imagine that the coin comes out heads. The utility u plunges to $0 (in other words E(u | bet, heads)=0). But the correction term cancels that out:
u(bet, heads) + T(bet, heads) = $0 + E(u | bet) - E(u |bet, heads) = $0 + $50 -$0 = $50.
A similar effect leaves utility unchanging if the coin is tails, cancelling the increase. In other words, adding the T correction term removes the impact of stochastic effects on utility.
But the agent will still make the same decisions. This is because before seeing evidence f, it cannot predict its impact on EU(u). In other words, summing over all possible evidences f:
E(u | e) = Σ p(f)E(u | e, f),
which is another way of phrasing "conservation of expected evidence". This implies that
E(T(e,-)) = Σ p(f)T(e,f)
= Σ p(f)(E(u | e) - E(u | e, f))
= E(u | e) - Σ p(f)E(u | e, f)
= 0,
and hence that adding the T term does not change the agent's decisions. All the various corrections add on to the utility as the agent continues making decisions, but none of them make the agent change what it does.
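The coin example can be checked numerically. This is only a sketch of that one bet: the dictionary layout and function names are my own, and the numbers are the ones from the example above.

```python
# The bet from the example: outcome -> (probability, utility in dollars).
BET = {"heads": (0.5, 0), "tails": (0.5, 100)}

def expected_u(lottery):
    """E(u | e) for a lottery of (probability, utility) pairs."""
    return sum(p * utility for p, utility in lottery.values())

def correction(lottery, outcome):
    """T(e, f) = E(u | e) - E(u | e, f), where f reveals the outcome."""
    return expected_u(lottery) - lottery[outcome][1]

# Utility after the coin lands, with the correction term added.
corrected = {f: BET[f][1] + correction(BET, f) for f in BET}

# Expected value of the correction before the coin is flipped.
expected_T = sum(p * correction(BET, f) for f, (p, _) in BET.items())
```

Both outcomes leave the corrected utility at 50, and the correction has expectation zero beforehand, which is why it cannot change which option the agent picks.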
The relevance of this will be explained in a subsequent post (unless someone finds an error here).
Total Utility is Illusionary
(Abstract: We have the notion that people can have a "total utility" value, defined perhaps as the sum of all their changes in utility over time. This is usually not a useful concept, because utility functions can change. In many cases the less-confusing approach is to look only at the utility from each individual decision, and not attempt to consider the total over time. This leads to insights about utilitarianism.)
Let's consider the utility of a fellow named Bob. Bob likes to track his total utility; he writes it down in a logbook every night.
Bob is a stamp collector; he gets +1 utilon every time he adds a stamp to his collection, and he gets -1 utilon every time he removes a stamp from his collection. Bob's utility was zero when his collection was empty, so we can say that Bob's total utility is the number of stamps in his collection.
One day a movie theater opens, and Bob learns that he likes going to movies. Bob counts +10 utilons every time he sees a movie. Now we can say that Bob's total utility is the number of stamps in his collection, plus ten times the number of movies he has seen.
(A note on terminology: I'm saying that Bob's utility function is the thing that emits +1 or -1 or +10, and his total utility is the sum of all those emits over time. I'm not sure if this is standard terminology.)
This should strike us as a little bit strange: Bob now has a term in his total utility which is mostly based on history, and mostly independent of the present state of the world. Technically, we might handwave and say that Bob places value on his memories of watching those movies. But Bob knows that's not actually true: it's the act of watching the movies that he enjoys, and he rarely thinks about them once they're over.
If a hypnotist convinced Bob that he had watched ten billion movies, Bob would write down in his logbook that he had a hundred billion utilons. (Plus the number of stamps in his stamp collection.)
Let's talk some more about that stamp collection. Bob wakes up on June 14 and decides that he doesn't like stamps any more. Now, Bob gets -1 utilon every time he adds a stamp to his collection, and +1 utilon every time he removes one. What can we say about his total utility? We might say that Bob's total utility is the number of stamps in his collection at the start of June 14, plus ten times the number of movies he's watched, plus the number of stamps he removed from his collection after June 14. Or we might say that all Bob's utility from his stamp collection prior to June 14 was false utility, and we should strike it from the record books. Which answer is better?
...Really, neither answer is better, because the "total utility" number we're discussing just isn't very useful. Bob has a very clear utility function which emits numbers like +1 and +10 and -1; he doesn't gain anything by keeping track of the total separately. His total utility doesn't seem to track how happy he actually feels, either. It's not clear what Bob gains from thinking about this total utility number.
I think some of the confusion might be coming from Less Wrong's focus on AI design.
When you're writing a utility function for an AI, one thing you might try is to specify your utility function by specifying the total utility first: you might say "your total utility is the number of balls you have placed in this bucket" and then let the AI work out the implementation details of how happy each individual action makes it.
However, if you're looking at utility functions for actual people, you might encounter something weird like "I get +10 utility every time I watch a movie", or "I woke up today and my utility function changed", and then if you try to compute the total utility for that person, you can get confused.
Let's now talk about utilitarianism. For simplicity, let's assume we're talking about a utilitarian government which is making decisions on behalf of its constituency. (In other words, we're not talking about utilitarianism as a moral theory.)
We have the notion of total utilitarianism, in which the government tries to maximize the sum of the utility values of each of its constituents. This leads to "repugnant conclusion" issues in which the government generates new constituents at a high rate until all of them are miserable.
We also have the notion of average utilitarianism, in which the government tries to maximize the average of the utility values of each of its constituents. This leads to issues -- I'm not sure if there's a snappy name -- where the government tries to kill off the least happy constituents so as to bring the average up.
The problem with both of these notions is that they're taking the notion of "total utility of all constituents" as an input, and then they're changing the number of constituents, which changes the underlying utility function.
I think the right way to do utilitarianism is to ignore the "total utility" thing; that's not a real number anyway. Instead, every time you arrive at a decision point, evaluate what action to take by checking the utility of your constituents from each action. I propose that we call this "delta utilitarianism", because it isn't looking at the total or the average, just at the delta in utility from each action.
This solves the "repugnant conclusion" issue because, at the time when you're considering adding more people, it's more clear that you're considering the utility of your constituents at that time, which does not include the potential new people.
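To make the contrast concrete, here is a minimal sketch of how the two scoring rules could disagree at a single decision point. The names (`total_score`, `delta_score`), the two constituents, and all the numbers are made up for illustration; the only thing taken from the proposal above is that the delta rule counts just the people who are constituents at decision time.

```python
# Hypothetical illustration: utilities and actions are invented numbers.

def total_score(current, deltas, newcomers):
    """Total utilitarianism: sum the utility of everyone who exists afterwards."""
    after = [current[name] + deltas.get(name, 0) for name in current]
    return sum(after) + sum(newcomers)


def delta_score(deltas):
    """'Delta utilitarianism' as sketched above: sum the change in utility
    for current constituents only; people who do not exist yet are ignored."""
    return sum(deltas.values())


# Two existing constituents, each at utility 10.
current = {"alice": 10, "bob": 10}

# Each action maps to (utility changes for current constituents, utilities of new people).
actions = {
    "status quo": ({}, []),
    "add 100 people": ({"alice": -2, "bob": -2}, [1] * 100),
}

best_by_total = max(actions, key=lambda a: total_score(current, *actions[a]))
best_by_delta = max(actions, key=lambda a: delta_score(actions[a][0]))

print("total utilitarian picks:", best_by_total)
print("delta utilitarian picks:", best_by_delta)
```

With these made-up numbers the total rule favours adding the barely-happy newcomers, while the delta rule does not, because the newcomers are not constituents when the decision is made.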
Skirting the mere addition paradox
Consider the following facts:
- For any population of people of happiness h, you can add more people of happiness less than h, and still improve things.
- For any population of people, you can spread people's happiness in a more egalitarian way, while keeping the same average happiness, and this makes things no worse.
This sounds a lot like the mere addition paradox, illustrated by the following diagram:

This seems to lead directly to the repugnant conclusion - that there is a huge population of people whose lives are barely worth living, but that this outcome is better because of the large number of them (in practice this conclusion may have a little less bite than feared, at least for non-total utilitarians).
But that conclusion doesn't follow at all! Consider the following aggregation formula, where au is the average utility of the population and n is the total number of people in the population:
au(1-(1/2)^n)
This obeys the two properties above, and yet does not lead to a repugnant conclusion. How so? Well, property 2 is immediate - since only the average utility appears, reallocating utility in a more egalitarian way does not decrease the aggregation. For property 1, define f(n)=1-(1/2)^n. This function f is strictly increasing, so if we add more members of the population, the product goes up - this allows us to diminish the average utility slightly (by decreasing the utility of the people we've added, say), and still end up with a higher aggregation.
How do we know that there is no repugnant conclusion? Well, f(n) is bounded above by 1. So let au and n be the average utility and size of a given population, and au' and n' those of a population better than this one. Hence au(f(n)) < au'(f(n')) < au'. So the average utility can never sink below au(f(n)): the average utility is bounded.
So some weaker versions of the mere addition argument do not imply the repugnant conclusion.
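A quick numerical check of the argument, as a sketch rather than a proof. It reads the aggregation formula as the average utility times 1 - (1/2)^n, with n as an exponent (an assumption about the intended notation, though it is the reading under which f is strictly increasing and bounded above by 1). The populations and utility values are invented, and utilities are assumed positive.

```python
# Sketch under assumptions: n is an exponent in the formula, utilities are positive,
# and all populations below are made-up examples.

def f(n):
    """f(n) = 1 - (1/2)^n: strictly increasing in n and bounded above by 1."""
    return 1 - 0.5 ** n


def aggregate(utilities):
    """Average utility of the population, scaled by f(population size)."""
    n = len(utilities)
    average = sum(utilities) / n
    return average * f(n)


base = [10] * 5

# Property 1: adding someone with below-average happiness can still improve things.
bigger = base + [9.9]
print(aggregate(base), aggregate(bigger))

# Property 2: an egalitarian reshuffle with the same average changes nothing.
print(aggregate([4, 16]), aggregate([10, 10]))

# No repugnant conclusion: a huge population whose average is below
# aggregate(base) can never count as better, however large it is.
huge_but_poorer = [9.0] * 1000
print(aggregate(huge_but_poorer) > aggregate(base))
```

The last line is the bound from the text in action: since f never exceeds 1, a population can only beat `base` if its average utility is above `aggregate(base)`.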
Another problem with quantum measure
Let's play around with the quantum measure some more. Specifically, let's posit a theory T that claims that the quantum measure of our universe is increasing - say by 50% each day. Why could this be happening? Well, here's a quasi-justification for it: imagine there are lots and lots of universes, most of them in chaotic random states, jumping around to other chaotic random states, in accordance with the usual laws of quantum mechanics. Occasionally, one of them will partially tunnel, by chance, into the same state our universe is in - and then will evolve forwards in time exactly as our universe is. Over time, we'll accumulate an ever-growing measure.
That theory sounds pretty unlikely, no matter what feeble attempts are made to justify it. But T is observationally indistinguishable from our own universe, and has a non-zero probability of being true. It's the reverse of the (more likely) theory presented here, in which the quantum measure was being constantly diminished. And it's very bad news for theories that treat the quantum measure (squared) as akin to a probability, without ever renormalising. It implies that one must continually sacrifice for the long-term: any pleasure today is wasted, as that pleasure will be weighted so much more tomorrow, next week, next year, next century... A slight fleeting smile on the face of the last human is worth more than all the ecstasy of the previous trillions.
One solution to the "quantum measure is continually diminishing" problem was to note that as the measure of the universe diminished, it would eventually get so low that any alternative, non-measure diminishing theory, no matter how initially unlikely, would predominate. But that solution is not available here - indeed, that argument runs in reverse, and makes the situation worse. No matter how initially unlikely the "quantum measure is continually increasing" theory is, eventually, the measure will become so high that it completely dominates all other theories.
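To get a feel for how fast this domination happens, here is a rough sketch. It assumes a simple two-theory mixture in which the competing theory keeps a constant weight of 1 and T's weight is multiplied by 1.5 each day (the 50% figure from above), with no renormalising. The prior of 1e-30 and the function names are arbitrary choices for illustration.

```python
# Sketch under assumptions: two theories only, unnormalised weights,
# and an arbitrary illustrative prior for theory T.

DAILY_GROWTH = 1.5  # theory T: the measure increases by 50% each day


def weight_under_T(day):
    """Relative measure of our universe on a given day, if T is true."""
    return DAILY_GROWTH ** day


def crossover_day(prior_T):
    """First whole day on which T's unnormalised contribution,
    prior_T * 1.5**day, exceeds the competing theory's (1 - prior_T) * 1."""
    day = 0
    while prior_T * weight_under_T(day) <= (1 - prior_T):
        day += 1
    return day


print(crossover_day(1e-30))
```

Even with a prior that tiny, exponential growth means the crossover arrives within a matter of months rather than aeons, which is the sense in which the argument "runs in reverse".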
Mutual Worth without default point (but with potential threats)
Though I planned to avoid posting anything more until well after baby, I found this refinement to MWBS yesterday, so I'm posting it while Miriam sleeps during a pause in contractions.
The mutual worth bargaining solution was built from the idea that the true value of a trade is having your utility function access the decision points of the other player. This gave the idea of utopia points: what happens when you are granted complete control over the other person's decisions. This gave a natural 1 to normalise your utility function. But the 0 point is chosen according to a default point. This is arbitrary, and breaks the symmetry between the top and bottom point of the normalisation.
We'd also want normalisations that function well when players have no idea what their opponents will be. This includes not knowing what their utility functions will be. Can we model what a 'generic' opposing utility function would be?
It's tricky, in general, to know what 'value' to put on an opponent's utility function. What kind of utilities would you like to see them have? It's unclear, because game theory comes into play, with Nash equilibriums, multiple solution concepts, bargaining and threats: there is no universal default to the result of a game between two agents. There are two situations, however, that are respectively better and worse than all others: the situation where your opponent shares your exact utility function, and the situation where they have the negative of that (they're essentially your 'anti-agent').
If your opponent shares your utility function, then there is a clear ideal outcome: act as if you and the opponent were the same person, acting to maximise your joint utility. This is the utopia point for MWBS, which can be standardised to take value 1.
If your opponent has the negative of your utility, then the game is zero-sum: any gain to you is a loss to your opponent, and there is no possibility for mutually pleasing compromise. But zero-sum games also have a single canonical outcome! For zero-sum games, the concepts of Nash equilibrium, minimax, and maximin are all equivalent (and are generally mixed outcomes). The game has a single defined value: each player can guarantee they get as much utility as that value, and the other player can guarantee that they get no more.
It seems natural to normalise that point to -1 (0 would be equivalent, but -1 feels more appropriate). Given this normalisation for each utility, the two utilities can then be summed and joint maximised in the usual way.
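As a sketch of the mechanics: each player's raw utility is rescaled by a positive affine map so that their utopia point lands on 1 and the value of the zero-sum game against their anti-agent lands on -1, and then the normalised utilities are summed. The outcomes, raw utilities, utopia values and zero-sum values below are all invented; in particular the zero-sum values are simply supplied as inputs rather than solved for.

```python
# Sketch under assumptions: every number here is hypothetical, and the
# zero-sum game values are given rather than computed.

def normalise(u, utopia, zero_sum_value):
    """Positive affine rescaling sending utopia -> 1 and zero_sum_value -> -1."""
    return -1 + 2 * (u - zero_sum_value) / (utopia - zero_sum_value)


# Raw utilities (player 1, player 2) for each joint outcome.
outcomes = {"x": (10, 0), "y": (6, 3), "z": (0, 4)}

# Per-player anchor points: (utopia value, value of the game against the anti-agent).
anchors = [(10, 2), (4, 1)]


def joint_score(raw):
    """Sum of the two players' normalised utilities for one outcome."""
    return sum(normalise(u, *anchor) for u, anchor in zip(raw, anchors))


best = max(outcomes, key=lambda name: joint_score(outcomes[name]))
print(best, {name: round(joint_score(raw), 3) for name, raw in outcomes.items()})
```

Because the rescaling is a positive affine transformation, it leaves each player's own preferences untouched; it only fixes the exchange rate between the two utilities before they are added.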
This bargaining solution has a lot of attractive features - it's symmetric in minimal and maximal utilities, does not require a default point, reflects the relative power, and captures the spread of opponents' utilities that could be encountered without needing to go into game theory. It is vulnerable to (implicit) threats, however! If I can (potentially) cause a lot of damage to you and your cause, then when you normalise your utility, you get penalised because of what your anti-agent could do if they controlled my decision nodes. So just by having the power to do bad stuff to you, I come out better than I would otherwise (and vice-versa, of course).
I feel it's worth exploring further (especially what happens with multiple agents) - but for me, after the baby.
240 questions for your utility function
A game comparing intrinsic values can approximate a person's utility function.
Upgrading moral theories to include complex values
Like many members of this community, reading the sequences has opened my eyes to a heavily neglected aspect of morality. Before reading the sequences I focused mostly on how to best improve people's wellbeing in the present and the future. However, after reading the sequences, I realized that I had neglected a very important question: In the future we will be able to create creatures with virtually any utility function imaginable. What sort of values should we give the creatures of the future? What sort of desires should they have, from what should they gain wellbeing?
Anyone familiar with the sequences should be familiar with the answer. We should create creatures with the complex values that human beings possess (call them "humane values"). We should avoid creating creatures with simple values that only desire to maximize one thing, like paperclips or pleasure.
It is important that future theories of ethics formalize this insight. I think we all know what would happen if we programmed an AI with conventional utilitarianism: It would exterminate the human race and replace them with creatures whose preferences are easier to satisfy (if you program it with preference utilitarianism) or creatures whom it is easier to make happy (if you program it with hedonic utilitarianism). It is important to develop a theory of ethics that avoids this.
Lately I have been trying to develop a modified utilitarian theory that formalizes this insight. My focus has been on population ethics. I am essentially arguing that population ethics should not just focus on maximizing welfare; it should also focus on what sort of creatures it is best to create. According to this theory of ethics, it is possible for a population with a lower total level of welfare to be better than a population with a higher total level of welfare, if the lower-welfare population consists of creatures that have complex humane values, while the higher-welfare population consists of paperclip or pleasure maximizers. (I wrote a previous post on this, but it was long and rambling; I am trying to make this one more accessible.)
One of the key aspects of this theory is that it does not necessarily rate the welfare of creatures with simple values as unimportant. On the contrary, it considers it good for their welfare to be increased and bad for their welfare to be decreased. Because of this, it implies that we ought to avoid creating such creatures in the first place, so it is not necessary to divert resources from creatures with humane values in order to increase their welfare.
My theory does allow the creation of simple-value creatures for two reasons. One is if the benefits they generate for creatures with humane values outweigh the harms generated when humane-value creatures must divert resources to improving their welfare (companion animals are an obvious example of this). The second is if creatures with humane values are about to go extinct, and the only choices are replacing them with simple value creatures, or replacing them with nothing.
So far I am satisfied with the development of this theory. However, I have hit one major snag, and would love it if someone else could help me with it. The snag is formulated like this:
1. It is better to create a small population of creatures with complex humane values (that has positive welfare) than a large population of animals that can only experience pleasure or pain, even if the large population of animals has a greater total amount of positive welfare. For instance, it is better to create a population of humans with 50 total welfare than a population of animals with 100 total welfare.
2. It is bad to create a small population of creatures with humane values (that has positive welfare) and a large population of animals that are in pain. For instance, it is bad to create a population of animals with -75 total welfare, even if doing so allows you to create a population of humans with 50 total welfare.
3. However, it seems like, if creating human beings wasn't an option, it might be okay to create a very large population of animals, the majority of which have positive welfare, but some of which are in pain. For instance, it seems like it would be good to create a population of animals where one section of the population has 100 total welfare, and another section has -75, since the total welfare is 25.
The problem is that this leads to what seems like a circular preference. If the population of animals with 100 welfare existed by itself it would be okay to not create it in order to create a population of humans with 50 welfare instead. But if the population we are talking about is the one in (3) then doing that would result in the population discussed in (2), which is bad.
My current solution to this dilemma is to include a stipulation that a population with negative utility can never be better than one with positive utility. This prevents me from having circular preferences about these scenarios. But it might create some weird problems. If population (2) is created anyway, and the humans in it are unable to help the suffering animals in any way, does that mean they have a duty to create lots of happy animals to get their population's utility up to a positive level? That seems strange, especially since creating the new happy animals won't help the suffering ones in any way. On the other hand, if the humans are able to help the suffering animals, and they do so by means of some sort of utility transfer, then it would be in their best interests to create lots of happy animals, to reduce the amount of utility each person has to transfer.
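The cycle, and the effect of the stipulation, can be sketched as a small preference graph. This is one possible encoding, not a definitive one: "bad to create" is read as "worse than creating nothing" and "good to create" as "better than creating nothing", and the labels P2, P3 and "nothing" are shorthand introduced here for the populations in (2), (3) and the empty alternative.

```python
# Sketch under assumptions: the pairwise readings of judgments (1)-(3) are an
# interpretation, and the labels are invented shorthand.

# Total welfare of each option, using the numbers from the scenarios above.
totals = {"P2": 50 - 75, "P3": 100 - 75, "nothing": 0}

# (better, worse) pairs.
judgments = [
    ("P2", "P3"),       # from (1): swapping the 100-welfare animals for 50-welfare humans
    ("nothing", "P2"),  # from (2): creating P2 is bad
    ("P3", "nothing"),  # from (3): creating P3 is good
]


def has_cycle(pairs):
    """True if following 'better than' links can lead from an option back to itself."""
    graph = {}
    for better, worse in pairs:
        graph.setdefault(better, set()).add(worse)

    def reachable(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reachable(nxt, target, seen):
                    return True
        return False

    return any(reachable(node, node, set()) for node in graph)


def apply_stipulation(pairs, totals):
    """Drop any judgment ranking a negative-total population above a positive-total one."""
    return [(b, w) for b, w in pairs if not (totals[b] < 0 < totals[w])]


print(has_cycle(judgments))
print(has_cycle(apply_stipulation(judgments, totals)))
```

Under this encoding the stipulation removes exactly one link, the one that ranks P2 above P3, which is enough to break the loop.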
So far some of the solutions I am considering include:
1. Instead of focusing on population ethics, just consider complex humane values to have greater weight in utility calculations than pleasure or paperclips. I find this idea distasteful because it implies it would be acceptable to inflict large harms on animals for relatively small gains for humans. In addition, if the weight is not sufficiently great it could still lead to an AI exterminating the human race and replacing them with happy animals, since animals are easier to take care of and make happy than humans.
2. It is bad to create the human population in (2) if the only way to do so is to create a huge amount of suffering animals. But once both populations have been created, if the human population is unable to help the animal population, they have no duty to create as many happy animals as they can. This is because the two populations are not causally connected, and that is somehow morally significant. This makes some sense to me, as I don't think the existence of causally disconnected populations in the vast universe should bear any significance on my decision-making.
3. There is some sort of overriding consideration besides utility that makes (3) seem desirable. For instance, it might be bad for creatures with any sort of values to go extinct, so it is good to create a population to prevent this, as long as its utility is positive on the net. However, this would change in a situation where utility is negative, such as in (2).
4. Reasons to create a creature have some kind of complex rock-paper-scissors-type "trumping" hierarchy. In other words, the fact that the humans have humane values can override the reasons to create happy animals, but they cannot override the reason to not create suffering animals. The reasons to create happy animals, however, can override the reasons to not create suffering animals. I think that this argument might lead to inconsistent preferences again, but I'm not sure.
I find none of these solutions that satisfying. I would really appreciate it if someone could help me with solving this dilemma. I'm very hopeful about this ethical theory, and would like to see it improved.
*Update. After considering the issue some more, I realized that my dissatisfaction came from equivocating between two different scenarios. I was considering the scenario "Animals with 100 utility and animals with -75 utility are created, no humans are created at all" to be the same as the scenario "Humans with 50 utility and animals with -75 utility are created, then the humans (before they get to experience their 50 utility) are killed/harmed in order to create more animals without helping the suffering animals in any way". They are clearly not.
To make the analogy more obvious, imagine I was given a choice between creating a person who would experience 95 utility over the course of their life, or a person who would experience 100 utility over the course of their life. I would choose the person with 100 utility. But if the person destined to experience 95 utility already existed, but had not experienced the majority of that utility yet, I would oppose killing them and replacing them with the 100 utility person.
Or to put it more succinctly, I am willing to not create some happy humans to prevent some suffering animals from being created. And if the suffering animals and happy humans already exist I am willing to harm the happy humans to help the suffering animals. But if the suffering animals and happy humans already exist I am not willing to harm the happy humans to create some extra happy animals that will not help the existing suffering animals in any way.
Population Ethics Shouldn't Be About Maximizing Utility
let me suggest a moral axiom with apparently very strong intuitive support, no matter what your concept of morality: morality should exist. That is, there should exist creatures who know what is moral, and who act on that. So if your moral theory implies that in ordinary circumstances moral creatures should exterminate themselves, leaving only immoral creatures, or no creatures at all, well that seems a sufficient reductio to solidly reject your moral theory.
I agree strongly with the above quote, and I think most other readers will as well. It is good for moral beings to exist and a world with beings who value morality is almost always better than one where they do not. I would like to restate this more precisely as the following axiom: A population in which moral beings exist and have net positive utility, and in which all other creatures in existence also have net positive utility, is always better than a population where moral beings do not exist.
While the axiom that morality should exist is extremely obvious to most people, there is one strangely popular ethical system that rejects it: total utilitarianism. In this essay I will argue that Total Utilitarianism leads to what I will call the Genocidal Conclusion, which is that there are many situations in which it would be fantastically good for moral creatures to either exterminate themselves, or greatly limit their utility and reproduction in favor of the utility and reproduction of immoral creatures. I will argue that the main reason consequentialist theories of population ethics produce such obviously absurd conclusions is that they continue to focus on maximizing utility1 in situations where it is possible to create new creatures. I will argue that pure utility maximization is only a valid ethical theory for "special case" scenarios where the population is static. I will propose an alternative theory for population ethics I call "ideal consequentialism" or "ideal utilitarianism" which avoids the Genocidal Conclusion and may also avoid the more famous Repugnant Conclusion.
I will begin my argument by pointing to a common problem in population ethics known as the Mere Addition Paradox (MAP) and the Repugnant Conclusion. Most Less Wrong readers will already be familiar with this problem, so I do not think I need to elaborate on it. You may also be familiar with an even stronger variation called the Benign Addition Paradox (BAP). This is essentially the same as the MAP, except that each time one adds more people one also gives a small amount of additional utility to the people who already existed. One then proceeds to redistribute utility between people as normal, eventually arriving at the huge population where everyone's lives are "barely worth living." The point of this is to argue that the Repugnant Conclusion can be arrived at from a "mere addition" of new people that not only doesn't harm the preexisting people, but also benefits them.
The next step of my argument involves three slightly tweaked versions of the Benign Addition Paradox. I have not changed the basic logic of the problem, I have just added one small clarifying detail. In the original MAP and BAP it was not specified what sort of values the added individuals in population A+ held. Presumably one was meant to assume that they were ordinary human beings. In the versions of the BAP I am about to present, however, I will specify that the extra individuals added in A+ are not moral creatures, that if they have values at all they are values indifferent to, or opposed to, morality and the other values that the human race holds dear.
1. The Benign Addition Paradox with Paperclip Maximizers.
Let us imagine, as usual, a population, A, which has a large group of human beings living lives of very high utility. Let us then add a new population consisting of paperclip maximizers, each of whom is living a life barely worth living. Presumably, for a paperclip maximizer, this would be a life where the paperclip maximizer's existence results in at least one more paperclip in the world than there would have been otherwise.
Now, one might object that if one creates a paperclip maximizer, and then allows it to create one paperclip, the utility of the other paperclip maximizers will increase above the "barely worth living" level, which would obviously make this thought experiment nonanalogous with the original MAP and BAP. To prevent this we will assume that each paperclip maximizer that is created has slightly different values regarding the ideal size, color, and composition of the paperclip they are trying to produce. So the Purple 2 centimeter Plastic Paperclip Maximizer gains no additional utility when the Silver Iron 1 centimeter Paperclip Maximizer makes a paperclip.
So again, let us add these paperclip maximizers to population A, and in the process give one extra utilon of utility to each preexisting person in A. This is a good thing, right? After all, everyone in A benefited, and the paperclippers get to exist and make paperclips. So clearly A+, the new population, is better than A.
Now let's take the next step, the transition from population A+ to population B. Take some of the utility from the human beings and convert it into paperclips. This is a good thing, right?
So let us repeat these steps, adding paperclip maximizers and utility, and then redistributing utility. Eventually we reach population Z, where there is a vast number of paperclip maximizers, a vast number of many different kinds of paperclips, and a small number of human beings living lives barely worth living.
Obviously Z is better than A, right? We should not fear the creation of a paperclip maximizing AI, but welcome it! Forget about things like high challenge, love, interpersonal entanglement, complex fun, and so on! Those things just don't produce the kind of utility that paperclip maximization has the potential to do!
Or maybe there is something seriously wrong with the moral assumptions behind the Mere Addition and Benign Addition Paradoxes.
But you might argue that I am using an unrealistic example. Creatures like Paperclip Maximizers may be so far removed from normal human experience that we have trouble thinking about them properly. So let's replay the Benign Addition Paradox again, but with creatures we might actually expect to meet in real life, and we know we actually value.
2. The Benign Addition Paradox with Non-Sapient Animals
You know the drill by now. Take population A, add a new population to it, while very slightly increasing the utility of the original population. This time let's have it be some kind of animal that is capable of feeling pleasure and pain, but is not capable of modeling possible alternative futures and choosing between them (in other words, it is not capable of having "values" or being "moral"). A lizard or a mouse, for example. Each one feels slightly more pleasure than pain in its lifetime, so it can be said to have a life barely worth living. Convert A+ to B. Take the utilons that the human beings are using to experience things like curiosity, beatitude, wisdom, beauty, harmony, morality, and so on, and convert them into pleasure for the animals.
We end up with population Z, with a vast number of mice or lizards with lives just barely worth living, and a small number of human beings with lives barely worth living. Terrific! Why do we bother creating humans at all? Let's just create tons of mice and inject them full of heroin! It's a much more efficient way to generate utility!
3. The Benign Addition Paradox with Sociopaths
What new population will we add to A this time? How about some other human beings, who all have anti-social personality disorder? True, they lack the key, crucial value of sympathy that defines so much of human behavior. But they don't seem to miss it. And their lives are barely worth living, so obviously A+ has greater utility than A. If given a chance the sociopaths will reduce the utility of other people to negative levels, but let's assume that that is somehow prevented in this case.
Eventually we get to Z, with a vast population of sociopaths and a small population of normal human beings, all living lives just barely worth living. That has more utility, right? True, the sociopaths place no value on things like friendship, love, compassion, empathy, and so on. And true, the sociopaths are immoral beings who do not care in the slightest about right and wrong. But what does that matter? Utility is being maximized, and surely that is what population ethics is all about!
Asteroid!
Let's suppose an asteroid is approaching each of the four population Zs discussed before. It can only be deflected by so much. Your choice is to save the original population of humans from A, or to save the vast new population. The choice is obvious. In 1, 2, and 3, each individual has the same level of utility, so obviously we should choose whichever option saves a greater number of individuals.
Bam! The asteroid strikes. The end result in all four scenarios is a world in which all the moral creatures are destroyed. It is a world without the many complex values that human beings possess. Each world, for the most part, lacks things like complex challenge, imagination, friendship, empathy, love, and the other complex values that human beings prize. But so what? The purpose of population ethics is to maximize utility, not silly, frivolous things like morality, or the other complex values of the human race. That means that any form of utility that is easier to produce than those values is obviously superior. It's easier to make pleasure and paperclips than it is to make eudaemonia, so that's the form of utility that ought to be maximized, right? And as for making sure moral beings exist, well that's just ridiculous. The valuable processing power they're using to care about morality could be being used to make more paperclips or more mice injected with heroin! Obviously it would be better if they died off, right?
I'm going to go out on a limb and say "Wrong."
Is this realistic?
Now, to be fair, in the Overcoming Bias page I quoted, Robin Hanson also says:
I’m not saying I can’t imagine any possible circumstances where moral creatures shouldn’t die off, but I am saying that those are not ordinary circumstances.
Maybe the scenarios I am proposing are just too extraordinary. But I don't think this is the case. I imagine that the circumstances Robin had in mind were probably something like "either all moral creatures die off, or all moral creatures are tortured 24/7 for all eternity."
Any purely utility-maximizing theory of population ethics that counts both the complex values of human beings, and the pleasure of animals, as "utility" should inevitably draw the conclusion that human beings ought to limit their reproduction to the bare minimum necessary to maintain the infrastructure to sustain a vastly huge population of non-human animals (preferably animals dosed with some sort of pleasure-causing drug). And if some way is found to maintain that infrastructure automatically, without the need for human beings, then the logical conclusion is that human beings are a waste of resources (as are chimps, gorillas, dolphins, and any other animal that is even remotely capable of having values or morality). Furthermore, even if the human race cannot practically be replaced with automated infrastructure, this should be an end result that the adherents of this theory should be yearning for.2 There should be much wailing and gnashing of teeth among moral philosophers that exterminating the human race is impractical, and much hope that someday in the future it will not be.
I call this the "Genocidal Conclusion" or "GC." On the macro level the GC manifests as the idea that the human race ought to be exterminated and replaced with creatures whose preferences are easier to satisfy. On the micro level it manifests as the idea that it is perfectly acceptable to kill someone who is destined to live a perfectly good and worthwhile life and replace them with another person who would have a slightly higher level of utility.
Population Ethics isn't About Maximizing Utility
I am going to make a rather radical proposal. I am going to argue that the consequentialist's favorite maxim, "maximize utility," only applies to scenarios where creating new people or creatures is off the table. I think we need an entirely different ethical framework to describe what ought to be done when it is possible to create new people. I am not by any means saying that "which option would result in more utility" is never a morally relevant consideration when deciding to create a new person, but I definitely think it is not the only one.3
So what do I propose as a replacement to utility maximization? I would argue in favor of a system that promotes a wide range of ideals. Doing some research, I discovered that G. E. Moore had in fact proposed a form of "ideal utilitarianism" in the early 20th century.4 However, I think that "ideal consequentialism" might be a better term for this system, since it isn't just about aggregating utility functions.
What are some of the ideals that an ideal consequentialist theory of population ethics might seek to promote? I've already hinted at what I think they are: Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom... mutual affection, love, friendship, cooperation; all those other important human universals, plus all the stuff in the Fun Theory Sequence. When considering what sort of creatures to create we ought to create creatures that value those things. Not necessarily all of them, or in the same proportions, for diversity is an important ideal as well, but they should value a great many of those ideals.
Now, lest you worry that this theory has any totalitarian implications, let me make it clear that I am not saying we should force these values on creatures that do not share them. Forcing a paperclip maximizer to pretend to make friends and love people does not do anything to promote the ideals of Friendship and Love. Forcing a chimpanzee to listen while you read the Sequences to it does not promote the values of Truth and Knowledge. Those ideals require both a subjective and objective component. The only way to promote those ideals is to create a creature that includes them as part of its utility function and then help it maximize its utility.
I am also certainly not saying that there is never any value in creating a creature that does not possess these values. There are obviously many circumstances where it is good to create nonhuman animals. There may even be some circumstances where a paperclip maximizer could be of value. My argument is simply that it is most important to make sure that creatures who value these various ideals exist.
I am also not suggesting that it is morally acceptable to casually inflict horrible harms upon a creature with non-human values if we screw up and create one by accident. If promoting ideals and maximizing utility are separate values then it may be that once we have created such a creature we have a duty to make sure it lives a good life, even if it was a bad thing to create it in the first place. You can't unbirth a child.5
It also seems to me that in addition to having ideals about what sort of creatures should exist, we also have ideals about how utility ought to be concentrated. If this is the case then ideal consequentialism may be able to block some forms of the Repugnant Conclusion, even in situations where the only creatures whose creation is being considered are human beings. If it is acceptable to create humans instead of paperclippers, even if the paperclippers would have higher utility, it may also be acceptable to create ten humans with a utility of ten each instead of a hundred humans with a utility of 1.01 each.
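To make the arithmetic behind that last comparison concrete, here is a minimal sketch. The function names are my own illustration, and it assumes utility is a single number per person that can simply be added up, which is the total utilitarian assumption this post is questioning.

```python
from fractions import Fraction

def total_utility(population_size, utility_per_person):
    """Sum of utility across everyone, assuming all have the same level."""
    return population_size * utility_per_person

def average_utility(population_size, utility_per_person):
    """Utility per person, which here is just the per-person level."""
    return total_utility(population_size, utility_per_person) / population_size

# The two populations from the paragraph above.
# Fraction avoids floating-point rounding on 1.01.
small_total = total_utility(10, Fraction(10))
large_total = total_utility(100, Fraction("1.01"))

small_average = average_utility(10, Fraction(10))
large_average = average_utility(100, Fraction("1.01"))

print(small_total, large_total)      # totals: 100 versus 101
print(small_average, large_average)  # averages: 10 versus 101/100
```

A pure total-utility maximizer prefers the hundred-person world because 101 exceeds 100, even though each person in it is almost ten times worse off. An ideal about how utility ought to be concentrated would be one way to justify picking the smaller world anyway.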
Why Did We Become Convinced that Maximizing Utility was the Sole Good?
Population ethics was, until comparatively recently, a fallow field in ethics. And in situations where there is no option to increase the population, maximizing utility is the only consideration that's really relevant. If you've created creatures that value the right ideals, then all that is left to be done is to maximize their utility. If you've created creatures that do not value the right ideals, there is no value to be had in attempting to force them to embrace those ideals. As I've said before, you will not promote the values of Love and Friendship by creating a paperclip maximizer and forcing it to pretend to love people and make friends.
So in situations where the population is constant, "maximize utility" is a decent approximation of the meaning of right. It's only when the population can be added to that morality becomes much more complicated.
Another thing to blame is human-centric reasoning. When people defend the Repugnant Conclusion they tend to point out that a life barely worth living is not as bad as it would seem at first glance. They emphasize that it need not be a boring life, it may be a life full of ups and downs where the ups just barely outweigh the downs. A life worth living, they say, is a life one would choose to live. Derek Parfit developed this idea to some extent by arguing that there are certain values that are "discontinuous" and that one needs to experience many of them in order to truly have a life worth living.
The Orthogonality Thesis throws all these arguments out the window. It is possible to create an intelligence to execute any utility function, no matter what it is. If human beings have all sorts of complex needs that must be fulfilled in order for them to lead worthwhile lives, then you could create more worthwhile lives by killing the human race and replacing them with something less finicky. Maybe happy cows. Maybe paperclip maximizers. Or how about some creature whose only desire is to live for one second and then die? If we created such a creature and then killed it we would reap huge amounts of utility, for we would have created a creature that got everything it wanted out of life!
How Intuitive is the Mere Addition Principle, Really?
I think most people would agree that morality should exist, and that therefore any system of population ethics should not lead to the Genocidal Conclusion. But which step in the Benign Addition Paradox should we reject? We could reject the step where utility is redistributed. But that seems wrong: most people seem to consider it bad for animals and sociopaths to suffer, and think that it is acceptable to inflict at least some amount of disutility on human beings to prevent such suffering.
It seems more logical to reject the Mere Addition Principle. In other words, maybe we ought to reject the idea that the mere addition of more lives-worth-living cannot make the world worse. And in turn, we should probably also reject the Benign Addition Principle. Adding more lives-worth-living may be capable of making the world worse, even if doing so also slightly benefits existing people. Fortunately this isn't a very hard principle to reject. While many moral philosophers treat it as obviously correct, nearly everyone else rejects this principle in day-to-day life.
Now, I'm obviously not saying that people's behavior in their day-to-day lives is always good, it may be that they are morally mistaken. But I think the fact that so many people seem to implicitly reject it provides some sort of evidence against it.
Take people's decision to have children. Many people choose to have fewer children than they otherwise would because they do not believe they will be able to adequately care for them, at least not without inflicting large disutilities on themselves. If most people accepted the Mere Addition Principle there would be a simple solution for this: have more children and then neglect them! True, the children's lives would be terrible while they were growing up, but once they've grown up and are on their own there's a good chance they may be able to lead worthwhile lives. Not only that, it may be possible to trick the welfare system into giving you money for the children you neglect, which would satisfy the Benign Addition Principle.
Yet most people choose not to have children and neglect them. And furthermore they seem to think that they have a moral duty not to do so, that a world where they choose to not have neglected children is better than one where they do. What is wrong with them?
Another example is a common political view many people have. Many people believe that impoverished people should have fewer children because of the burden doing so would place on the welfare system. They also believe that it would be bad to get rid of the welfare system altogether. If the Benign Addition Principle were as obvious as it seems, they would instead advocate for the abolition of the welfare system, and encourage impoverished people to have more children. Assuming most impoverished people live lives worth living, this is exactly analogous to the BAP: it would create more people, while benefiting existing ones (the people who pay lower taxes because of the abolition of the welfare system).
Yet again, most people choose to reject this line of reasoning. The BAP does not seem to be an obvious and intuitive principle at all.
The Genocidal Conclusion is Really Repugnant
There is nearly nothing more repugnant than the Genocidal Conclusion. Pretty much the only way a line of moral reasoning could go more wrong would be concluding that we have a moral duty to cause suffering, as an end in itself. This means that it's fairly easy to counter any argument in favor of total utilitarianism that argues the alternative I am promoting has odd conclusions that do not fit some of our moral intuitions, while total utilitarianism does not. Is that conclusion more insane than the Genocidal Conclusion? If it isn't, total utilitarianism should still be rejected.
Ideal Consequentialism Needs a Lot of Work
I do think that Ideal Consequentialism needs some serious ironing out. I haven't really developed it into a logical and rigorous system, at this point it's barely even a rough framework. There are many questions that stump me. In particular I am not quite sure what population principle I should develop. It's hard to develop one that rejects the MAP without leading to weird conclusions, like that it's bad to create someone of high utility if a population of even higher utility existed long ago. It's a difficult problem to work on, and it would be interesting to see if anyone else had any ideas.
But just because I don't have an alternative fully worked out doesn't mean I can't reject Total Utilitarianism. It leads to the conclusion that a world with no love, curiosity, complex challenge, friendship, morality, or any other value the human race holds dear is an ideal, desirable world, if there is a sufficient amount of some other creature with a simpler utility function. Morality should exist, and because of that, total utilitarianism must be rejected as a moral system.
1I have been asked to note that when I use the phrase "utility" I am usually referring to a concept that is called "E-utility," rather than the Von Neumann-Morgenstern utility that is sometimes discussed in decision theory. The difference is that in VNM one's moral views are included in one's utility function, whereas in E-utility they are not. So if one chooses to harm oneself to help others because one believes that is morally right, one has higher VNM utility, but lower E-utility.
2There is a certain argument against the Repugnant Conclusion that goes that, as the steps of the Mere Addition Paradox are followed the world will lose its last symphony, its last great book, and so on. I have always considered this to be an invalid argument because the world of the RC doesn't necessarily have to be one where these things don't exist, it could be one where they exist, but are enjoyed very rarely. The Genocidal Conclusion brings this argument back in force. Creating creatures that can appreciate symphonies and great books is very inefficient compared to creating bunny rabbits pumped full of heroin.
3Total Utilitarianism was originally introduced to population ethics as a possible solution to the Non-Identity Problem. I certainly agree that such a problem needs a solution, even if Total Utilitarianism doesn't work out as that solution.
4I haven't read a lot of Moore, most of my ideas were extrapolated from other things I read on Less Wrong. I just mentioned him because in my research I noticed his concept of "ideal utilitarianism" resembled my ideas. While I do think he was on the right track he does commit the Mind Projection Fallacy a lot. For instance, he seems to think that one could promote beauty by creating beautiful objects, even if there were no creatures with standards of beauty around to appreciate them. This is why I am careful to emphasize that to promote ideals like love and beauty one must create creatures capable of feeling love and experiencing beauty.
5My tentative answer to the question Eliezer poses in "You Can't Unbirth a Child" is that human beings may have a duty to allow the cheesecake maximizers to build some amount of giant cheesecakes, but they would also have a moral duty to limit such creatures' reproduction in order to spare resources to create more creatures with humane values.
EDITED: To make a point about ideal consequentialism clearer, based on AlexMennen's criticisms.
Desires You're Not Thinking About at the Moment
While doing some reading on philosophy I came across some interesting questions about the nature of having desires and preferences. One, do you still have preferences and desires when you are unconscious? Two, if you don't, does this call into question the many moral theories that hold that having preferences and desires is what makes one morally significant, since mistreating temporarily unconscious people seems obviously immoral?
Philosophers usually discuss this question when debating the morality of abortion, but to avoid doing any mindkilling I won't mention that topic, except to say in this sentence that I won't mention it.
In more detail the issue is: A common, intuitive, and logical-seeming explanation for why it is immoral to destroy a typical human being, but not to destroy a rock, is that a typical human being has certain desires (or preferences or values, whatever you wish to call them, I'm using the terms interchangeably) that they wish to fulfill, and destroying them would hinder the fulfillment of these desires. A rock, by contrast, does not have any such desires, so it is not harmed by being destroyed. The problem with this is that it also seems immoral to harm a human being who is asleep, or is in a temporary coma. And, on the face of it, it seems plausible to say that an unconscious person does not have any desires. (And of course it gets even weirder when considering far-out concepts like a brain emulator that is saved to a hard drive, but isn't being run at the moment.)
After thinking about this it occurred to me that this line of reasoning could be taken further. If I am not thinking about my car at the moment, can I still be said to desire that it is not stolen? Do I stop having desires about things the instant my attention shifts away from them?
I have compiled a list of possible solutions to this problem, ranked in order from least plausible to most plausible.
1. One possibility would be to consider it immoral to harm a sleeping person because they will have desires in the future, even if they don't now. I find this argument extremely implausible because it has some extremely bizarre implications, some of which may lead to insoluble moral contradictions. For instance, this argument could be used to argue that it is immoral to destroy skin cells because it is possible to use them to clone a new person, who will eventually grow up to have desires.
Furthermore, when human beings eventually gain the ability to build AIs that possess desires, this solution interacts with the orthogonality thesis in a catastrophic fashion. If it is possible to build an AI with any utility function, then for every potential AI one can construct, there is another potential AI that desires the exact opposite of that AI. That leads to total paralysis, since for every potential set of desires we are capable of satisfying there is another potential set that would be horribly thwarted.
Lastly, this argument implies that you can (and may be obligated to) help someone who doesn't exist, and never has existed, by satisfying their non-personal preferences, without ever having to bother with actually creating them. This seems strange. I can maybe see an argument for respecting the once-existent preferences of those who are dead, but respecting the hypothetical preferences of the never-existed seems absurd. It also has the same problems with the orthogonality thesis that I mentioned earlier.
2. Make the same argument as solution 1, but somehow define the categories more narrowly so that an unconscious person's ability to have desires in the future differs from that of an uncloned skin cell or an unbuilt AI. Michael Tooley has tried to do this by distinguishing between things that have the "possibility" of becoming a person with desires (i.e. skin cells) and those that have the "capacity" to have desires. This approach has been criticized, and I find myself pessimistic about it because categories have a tendency to be "fuzzy" in real life and not have sharp borders.
3. Another solution may be that desires that one has had in the past continue to count, even when one is unconscious or not thinking about them. So it's immoral to harm unconscious people because before they were unconscious they had a desire not to be harmed, and it's immoral to steal my car because I desired that it not be stolen earlier when I was thinking about it.
I find this solution fairly convincing. The only major quibble I have with it is that it gives what some might consider a counter-intuitive result on a variation of the sleeping person question. Imagine a nano-factory manufactures a sleeping person. This person is a new and distinct individual, and when they wake up they will proceed to behave as a typical human. This solution may suggest that it is okay to kill them before they wake up, since they haven't had any desires yet, which does seem odd.
4. Reject the claim that one doesn't have desires when one is unconscious, or when one is not thinking about a topic. The more I think about this solution, the more obvious it seems. Generally when I am rationally deliberating about whether or not I desire something I consider how many of my values and ideals it fulfills. It seems like my list of values and ideals remains fairly constant, and that even if I am focusing my attention on one value at a time it makes sense to say that I still "have" the other values I am not focusing on at the moment.
Obviously I don't think that there's some portion of my brain where my "values" are stored in a neat little Excel spreadsheet. But they do seem to be a persistent part of its structure in some fashion. And it makes sense that they'd still be part of its structure when I'm unconscious. If they weren't, wouldn't my preferences change radically every time I woke up?
In other words, it's bad to harm an unconscious person because they have desires, preferences, values, whatever you wish to call them, that harming them would violate. And those values are a part of the structure of their mind that doesn't go away when they sleep. Skin cells and unbuilt AIs, by contrast, have no such values.
Now, while I think that solution 4 resolves the issue of desires and unconsciousness best, I do think solution 3 has a great deal of truth to it as well. (For instance, I tend to respect the final wishes of a dead person because they had desires in the past, even if they don't now.) Solutions 3 and 4 are not incompatible at all, so one can believe in both of them.
I'm curious as to what people think of my possible solutions. Am I right about people still having something like desires in their brain when they are unconscious?
Higher than the most high
In an earlier post, I talked about how we could deal with variants of the Heaven and Hell problem - situations where you have an infinite number of options, and none of them is a maximum. The solution for a (deterministic) agent was to try and implement the strategy that would reach the highest possible number, without risking falling into an infinite loop.
Wei Dai pointed out that in the cases where the options are unbounded in utility (i.e. you can get arbitrarily high utility), there are probabilistic strategies that give you infinite expected utility. I suggested you could still do better than this. This started a conversation about choosing between strategies with infinite expectation (would you prefer a strategy with infinite expectation, or the same plus an extra dollar?), which went off into some interesting directions as to what needed to be done when the strategies can't sensibly be compared with each other...
Interesting though that may be, it's also helpful to have simple cases where you don't need all these subtleties. So here is one:
Omega approaches you and Mrs X, asking you each to name an integer to him, privately. The person who names the highest integer gets 1 utility; the other gets nothing. In practical terms, Omega will reimburse you all utility lost during the decision process (so you can take as long as you want to decide). The first person to name a number gets 1 utility immediately; they may then lose that 1 depending on the eventual response of the other. Hence if one person responds and the other doesn't, they get the 1 utility and keep it. What should you do?
In this case, a strategy that gives you a number with infinite expectation isn't enough - you have to beat Mrs X, but you also have to eventually say something. Hence there is a duel of (likely probabilistic) strategies, implemented by bounded agents, with no maximum strategy, and each agent trying to compute the maximal strategy they can construct without falling into a loop.
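As a rough illustration of why there is no maximum strategy here, the sketch below encodes the payoff rule and shows that any integer you could name is beaten by its successor. The function names are hypothetical, and the tie rule is my own assumption, since the setup above doesn't say what happens if both players name the same integer. It also ignores the timing and reimbursement details.

```python
def payoffs(mine, theirs):
    """Return (my utility, Mrs X's utility) once both integers are named."""
    if mine > theirs:
        return (1, 0)
    if theirs > mine:
        return (0, 1)
    # Assumption: the puzzle does not specify ties; here nobody wins.
    return (0, 0)

def beats(n):
    """Whatever integer n was named, n + 1 wins against it."""
    return n + 1

# However large a number one player settles on, the other can top it.
candidate = 10 ** 100
reply = beats(candidate)
result = payoffs(reply, candidate)
print(result)  # (1, 0)
```

Because `beats` works for every integer, no single number is a best answer. What remains is the contest described above: each bounded agent tries to name the largest number it can actually get around to naming.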