Like many members of this community, reading the sequences has opened my eyes to a heavily neglected aspect of morality. Before reading the sequences I focused mostly on how to best improve people's wellbeing in the present and the future. However, after reading the sequences, I realized that I had neglected a very important question: In the future we will be able to create creatures with virtually any utility function imaginable. What sort of values should we give the creatures of the future? What sort of desires should they have, from what should they gain wellbeing?
Anyone familiar with the sequences should be familiar with the answer. We should create creatures with the complex values that human beings possess (call them "humane values"). We should avoid creating creatures with simple values that only desire to maximize one thing, like paperclips or pleasure.
It is important that future theories of ethics formalize this insight. I think we all know what would happen if we programmed an AI with conventional utilitarianism: It would exterminate the human race and replace it with creatures whose preferences are easier to satisfy (if you program it with preference utilitarianism) or creatures whom it is easier to make happy (if you program it with hedonic utilitarianism). It is important to develop a theory of ethics that avoids this.
Lately I have been trying to develop a modified utilitarian theory that formalizes this insight. My focus has been on population ethics. I am essentially arguing that population ethics should not just focus on maximizing welfare; it should also focus on what sort of creatures it is best to create. According to this theory of ethics, it is possible for a population with a lower total level of welfare to be better than a population with a higher total level of welfare, if the lower-welfare population consists of creatures that have complex humane values, while the higher-welfare population consists of paperclip or pleasure maximizers. (I wrote a previous post on this, but it was long and rambling, so I am trying to make this one more accessible.)
One of the key aspects of this theory is that it does not necessarily rate the welfare of creatures with simple values as unimportant. On the contrary, it considers it good for their welfare to be increased and bad for their welfare to be decreased. Because of this, it implies that we ought to avoid creating such creatures in the first place, so it is not necessary to divert resources from creatures with humane values in order to increase their welfare.
My theory does allow the creation of simple-value creatures in two cases. One is when the benefits they generate for creatures with humane values outweigh the harms generated when humane-value creatures must divert resources to improving their welfare (companion animals are an obvious example of this). The second is when creatures with humane values are about to go extinct, and the only choices are replacing them with simple-value creatures or replacing them with nothing.
So far I am satisfied with the development of this theory. However, I have hit one major snag, and would love it if someone else could help me with it. The snag is formulated like this:
1. It is better to create a small population of creatures with complex humane values (that has positive welfare) than a large population of animals that can only experience pleasure or pain, even if the large population of animals has a greater total amount of positive welfare. For instance, it is better to create a population of humans with 50 total welfare than a population of animals with 100 total welfare.
2. It is bad to create a small population of creatures with humane values (that has positive welfare) and a large population of animals that are in pain. For instance, it is bad to create a population of animals with -75 total welfare, even if doing so allows you to create a population of humans with 50 total welfare.
3. However, it seems like, if creating human beings were not an option, it might be okay to create a very large population of animals, the majority of which have positive welfare, but some of which are in pain. For instance, it seems like it would be good to create a population of animals where one section of the population has 100 total welfare and another section has -75, since the total welfare is 25.
The problem is that this leads to what seems like a circular preference. If the population of animals with 100 welfare existed by itself it would be okay to not create it in order to create a population of humans with 50 welfare instead. But if the population we are talking about is the one in (3) then doing that would result in the population discussed in (2), which is bad.
My current solution to this dilemma is to include a stipulation that a population with negative utility can never be better than one with positive utility. This prevents me from having circular preferences about these scenarios. But it might create some weird problems. If population (2) is created anyway, and the humans in it are unable to help the suffering animals in any way, does that mean they have a duty to create lots of happy animals to get their population's utility up to a positive level? That seems strange, especially since creating the new happy animals won't help the suffering ones in any way. On the other hand, if the humans are able to help the suffering animals, and they do so by means of some sort of utility transfer, then it would be in their best interest to create lots of happy animals, to reduce the amount of utility each person has to transfer.
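To make the stipulation concrete, here is a minimal sketch (in Python) of the kind of comparison rule I have in mind, applied to the scenarios above. The population encoding, the numbers, and the `better` function are illustrative assumptions of mine, not a worked-out formalization of the theory.

```python
def total(pop):
    """Total welfare of a population, given as (kind, welfare) pairs."""
    return sum(w for _, w in pop)

def has_humane(pop):
    """Does the population contain humane-valued creatures with positive welfare?"""
    return any(kind == "humane" and w > 0 for kind, w in pop)

def better(a, b):
    """Is population a better than population b, under the sketched rule?"""
    # Stipulation: a population with negative total utility can never be
    # better than one with non-negative total utility.
    if total(a) < 0 <= total(b):
        return False
    if total(b) < 0 <= total(a):
        return True
    # Creatures with complex humane values take priority over simple-value ones...
    if has_humane(a) != has_humane(b):
        return has_humane(a)
    # ...and only then do welfare totals decide.
    return total(a) > total(b)

humans          = [("humane", 50)]
happy_animals   = [("simple", 100)]
mixed_animals   = [("simple", 100), ("simple", -75)]   # scenario (3), total 25
humans_and_pain = [("humane", 50), ("simple", -75)]    # scenario (2), total -25

print(better(humans, happy_animals))           # True: principle (1)
print(better(mixed_animals, []))               # True: principle (3)
print(better(humans_and_pain, mixed_animals))  # False: the stipulation blocks the circular swap
```

Under this toy rule, (1) and (3) come out as stated, while ranking the human-plus-suffering population above the all-animal one is blocked, which is exactly what the stipulation was meant to do.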
So far some of the solutions I am considering include:
1. Instead of focusing on population ethics, just consider complex humane values to have greater weight in utility calculations than pleasure or paperclips. I find this idea distasteful because it implies it would be acceptable to inflict large harms on animals for relatively small gains for humans. In addition, if the weight is not sufficiently great it could still lead to an AI exterminating the human race and replacing them with happy animals, since animals are easier to take care of and make happy than humans.
2. It is bad to create the human population in (2) if the only way to do so is to create a huge amount of suffering animals. But once both populations have been created, if the human population is unable to help the animal population, they have no duty to create as many happy animals as they can. This is because the two populations are not causally connected, and that is somehow morally significant. This makes some sense to me, as I don't think the existence of causally disconnected populations in the vast universe should bear any significance on my decision-making.
3. There is some sort of overriding consideration besides utility that makes (3) seem desirable. For instance, it might be bad for creatures with any sort of values to go extinct, so it is good to create a population to prevent this, as long as its utility is positive on net. However, this would change in a situation where utility is negative, such as in (2).
4. Reasons to create a creature have some kind of complex rock-paper-scissors-type "trumping" hierarchy. In other words, the fact that the humans have humane values can override the reasons to create happy animals, but it cannot override the reason to not create suffering animals. The reasons to create happy animals, however, can override the reasons to not create suffering animals. I think that this argument might lead to inconsistent preferences again, but I'm not sure.
I find none of these solutions fully satisfying. I would really appreciate it if someone could help me with solving this dilemma. I'm very hopeful about this ethical theory, and would like to see it improved.
*Update. After considering the issue some more, I realized that my dissatisfaction came from conflating two different scenarios. I was treating the scenario "Animals with 100 utility and animals with -75 utility are created, no humans are created at all" as the same as the scenario "Humans with 50 utility and animals with -75 utility are created, then the humans (before they get to experience their 50 utility) are killed/harmed in order to create more animals without helping the suffering animals in any way." They are clearly not the same scenario.
To make the analogy more obvious, imagine I was given a choice between creating a person who would experience 95 utility over the course of their life, or a person who would experience 100 utility over the course of their life. I would choose the person with 100 utility. But if the person destined to experience 95 utility already existed, but had not experienced the majority of that utility yet, I would oppose killing them and replacing them with the 100 utility person.
Or to put it more succinctly, I am willing to not create some happy humans to prevent some suffering animals from being created. And if the suffering animals and happy humans already exist I am willing to harm the happy humans to help the suffering animals. But if the suffering animals and happy humans already exist I am not willing to harm the happy humans to create some extra happy animals that will not help the existing suffering animals in any way.
let me suggest a moral axiom with apparently very strong intuitive support, no matter what your concept of morality: morality should exist. That is, there should exist creatures who know what is moral, and who act on that. So if your moral theory implies that in ordinary circumstances moral creatures should exterminate themselves, leaving only immoral creatures, or no creatures at all, well that seems a sufficient reductio to solidly reject your moral theory.
I agree strongly with the above quote, and I think most other readers will as well. It is good for moral beings to exist and a world with beings who value morality is almost always better than one where they do not. I would like to restate this more precisely as the following axiom: A population in which moral beings exist and have net positive utility, and in which all other creatures in existence also have net positive utility, is always better than a population where moral beings do not exist.
While the axiom that morality should exist is extremely obvious to most people, there is one strangely popular ethical system that rejects it: total utilitarianism. In this essay I will argue that Total Utilitarianism leads to what I will call the Genocidal Conclusion, which is that there are many situations in which it would be fantastically good for moral creatures to either exterminate themselves, or greatly limit their utility and reproduction in favor of the utility and reproduction of immoral creatures. I will argue that the main reason consequentialist theories of population ethics produce such obviously absurd conclusions is that they continue to focus on maximizing utility[1] in situations where it is possible to create new creatures. I will argue that pure utility maximization is only a valid ethical theory for "special case" scenarios where the population is static. I will propose an alternative theory for population ethics I call "ideal consequentialism" or "ideal utilitarianism" which avoids the Genocidal Conclusion and may also avoid the more famous Repugnant Conclusion.
I will begin my argument by pointing to a common problem in population ethics known as the Mere Addition Paradox (MAP) and the Repugnant Conclusion. Most Less Wrong readers will already be familiar with this problem, so I do not think I need to elaborate on it. You may also be familiar with an even stronger variation called the Benign Addition Paradox (BAP). This is essentially the same as the MAP, except that each time one adds more people one also gives a small amount of additional utility to the people who already existed. One then proceeds to redistribute utility between people as normal, eventually arriving at the huge population where everyone's lives are "barely worth living." The point of this is to argue that the Repugnant Conclusion can be arrived at from a "mere addition" of new people that not only doesn't harm the preexisting people, but actually benefits them.
The next step of my argument involves three slightly tweaked versions of the Benign Addition Paradox. I have not changed the basic logic of the problem, I have just added one small clarifying detail. In the original MAP and BAP it was not specified what sort of values the added individuals in population A+ held. Presumably one was meant to assume that they were ordinary human beings. In the versions of the BAP I am about to present, however, I will specify that the extra individuals added in A+ are not moral creatures, that if they have values at all they are values indifferent to, or opposed to, morality and the other values that the human race holds dear.
1. The Benign Addition Paradox with Paperclip Maximizers.
Let us imagine, as usual, a population, A, which has a large group of human beings living lives of very high utility. Let us then add a new population consisting of paperclip maximizers, each of whom is living a life barely worth living. Presumably, for a paperclip maximizer, this would be a life where the paperclip maximizer's existence results in at least one more paperclip in the world than there would have been otherwise.
Now, one might object that if one creates a paperclip maximizer, and then allows it to create one paperclip, the utility of the other paperclip maximizers will increase above the "barely worth living" level, which would obviously make this thought experiment nonanalogous with the original MAP and BAP. To prevent this we will assume that each paperclip maximizer that is created has slightly different values about the ideal size, color, and composition of the paperclip it is trying to produce. So the Purple 2-centimeter Plastic Paperclip Maximizer gains no additional utility when the Silver 1-centimeter Iron Paperclip Maximizer makes a paperclip.
So again, let us add these paperclip maximizers to population A, and in the process give one extra utilon of utility to each preexisting person in A. This is a good thing, right? After all, everyone in A benefited, and the paperclippers get to exist and make paperclips. So clearly A+, the new population, is better than A.
Now let's take the next step, the transition from population A+ to population B. Take some of the utility from the human beings and convert it into paperclips. This is a good thing, right?
So let us repeat these steps, adding paperclip maximizers and utility, and then redistributing utility. Eventually we reach population Z, where there is a vast number of paperclip maximizers, a vast number of many different kinds of paperclips, and a small number of human beings living lives barely worth living.
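To see the arithmetic of these steps, here is a toy simulation; all the specific numbers (100 humans at utility 100, "barely worth living" set at 0.1, a tenfold swarm of new clippers per step) are illustrative assumptions of mine, not anything the argument depends on.

```python
# Toy model of the A -> A+ -> B -> ... -> Z iteration described above.
n_humans, u_human = 100, 100.0   # population A: humans living very good lives
n_clips, u_clip = 0, 0.0         # paperclip maximizers added along the way

for step in range(30):
    # A -> A+: every pre-existing individual gains one utilon (the "benign"
    # part), then a swarm of clippers with lives barely worth living is added.
    u_human += 1
    u_clip += 1
    new_clips = 10 * (n_humans + n_clips)
    total = n_humans * u_human + n_clips * u_clip + new_clips * 0.1
    n_clips += new_clips
    # A+ -> B: convert human utility into paperclips until everyone is equal.
    u_human = u_clip = total / (n_humans + n_clips)

print(f"population Z: {n_clips:.2e} paperclip maximizers, {n_humans} humans")
print(f"total utility: {n_humans * u_human + n_clips * u_clip:.2e}")  # vastly more than A's 10,000
print(f"utility per individual: {u_human:.2f}")  # everyone, humans included, near 'barely worth living'
```

Each pass through the loop looks like an improvement by the Benign Addition reasoning (everyone existing gains a utilon, the new clippers have lives worth living, and redistribution leaves the total unchanged), yet thirty passes later the humans have fallen from utility 100 to a life barely worth living, while total utility has exploded.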
Obviously Z is better than A, right? We should not fear the creation of a paperclip maximizing AI, but welcome it! Forget about things like high challenge, love, interpersonal entanglement, complex fun, and so on! Those things just don't produce the kind of utility that paperclip maximization has the potential to do!
Or maybe there is something seriously wrong with the moral assumptions behind the Mere Addition and Benign Addition Paradoxes.
But you might argue that I am using an unrealistic example. Creatures like Paperclip Maximizers may be so far removed from normal human experience that we have trouble thinking about them properly. So let's replay the Benign Addition Paradox again, but with creatures we might actually expect to meet in real life, and that we know we actually value.
2. The Benign Addition Paradox with Non-Sapient Animals
You know the drill by now. Take population A, add a new population to it, while very slightly increasing the utility of the original population. This time let's have it be some kind of animal that is capable of feeling pleasure and pain, but is not capable of modeling possible alternative futures and choosing between them (in other words, it is not capable of having "values" or being "moral"). A lizard or a mouse, for example. Each one feels slightly more pleasure than pain in its lifetime, so it can be said to have a life barely worth living. Convert A+ to B. Take the utilons that the human beings are using to experience things like curiosity, beatitude, wisdom, beauty, harmony, morality, and so on, and convert them into pleasure for the animals.
We end up with population Z, with a vast number of mice or lizards with lives just barely worth living, and a small number of human beings with lives barely worth living. Terrific! Why do we bother creating humans at all? Let's just create tons of mice and inject them full of heroin! It's a much more efficient way to generate utility!
3. The Benign Addition Paradox with Sociopaths
What new population will we add to A this time? How about some other human beings, who all have anti-social personality disorder? True, they lack the key, crucial value of sympathy that defines so much of human behavior. But they don't seem to miss it. And their lives are barely worth living, so obviously A+ has greater utility than A. If given a chance the sociopaths will reduce the utility of other people to negative levels, but let's assume that that is somehow prevented in this case.
Eventually we get to Z, with a vast population of sociopaths and a small population of normal human beings, all living lives just barely worth living. That has more utility, right? True, the sociopaths place no value on things like friendship, love, compassion, empathy, and so on. And true, the sociopaths are immoral beings who do not care in the slightest about right and wrong. But what does that matter? Utility is being maximized, and surely that is what population ethics is all about!
Let's suppose an asteroid is approaching each of the population Zs discussed before. It can only be deflected by so much. Your choice is: save the original population of humans from A, or save the vast new population. The choice is obvious. In 1, 2, and 3, each individual has the same level of utility, so obviously we should choose the option that saves a greater number of individuals.
Bam! The asteroid strikes. The end result in each scenario is a world in which all the moral creatures are destroyed. It is a world without the many complex values that human beings possess. Each world, for the most part, lacks things like complex challenge, imagination, friendship, empathy, love, and the other complex values that human beings prize. But so what? The purpose of population ethics is to maximize utility, not silly, frivolous things like morality, or the other complex values of the human race. That means that any form of utility that is easier to produce than those values is obviously superior. It's easier to make pleasure and paperclips than it is to make eudaemonia, so that's the form of utility that ought to be maximized, right? And as for making sure moral beings exist, well, that's just ridiculous. The valuable processing power they're using to care about morality could instead be used to make more paperclips or more mice injected with heroin! Obviously it would be better if they died off, right?
I'm going to go out on a limb and say "Wrong."
Is this realistic?
Now, to be fair, on the Overcoming Bias page I quoted, Robin Hanson also says:
I’m not saying I can’t imagine any possible circumstances where moral creatures shouldn’t die off, but I am saying that those are not ordinary circumstances.
Maybe the scenarios I am proposing are just too extraordinary. But I don't think this is the case. I imagine that the circumstances Robin had in mind were probably something like "either all moral creatures die off, or all moral creatures are tortured 24/7 for all eternity."
Any purely utility-maximizing theory of population ethics that counts both the complex values of human beings and the pleasure of animals as "utility" should inevitably draw the conclusion that human beings ought to limit their reproduction to the bare minimum necessary to maintain the infrastructure to sustain a vastly huge population of non-human animals (preferably animals dosed with some sort of pleasure-causing drug). And if some way is found to maintain that infrastructure automatically, without the need for human beings, then the logical conclusion is that human beings are a waste of resources (as are chimps, gorillas, dolphins, and any other animal that is even remotely capable of having values or morality). Furthermore, even if the human race cannot practically be replaced with automated infrastructure, this should be an end result that the adherents of this theory should be yearning for.[2] There should be much wailing and gnashing of teeth among moral philosophers that exterminating the human race is impractical, and much hope that someday in the future it will not be.
I call this the "Genocidal Conclusion" or "GC." On the macro level the GC manifests as the idea that the human race ought to be exterminated and replaced with creatures whose preferences are easier to satisfy. On the micro level it manifests as the idea that it is perfectly acceptable to kill someone who is destined to live a perfectly good and worthwhile life and replace them with another person who would have a slightly higher level of utility.
Population Ethics isn't About Maximizing Utility
I am going to make a rather radical proposal. I am going to argue that the consequentialist's favorite maxim, "maximize utility," only applies to scenarios where creating new people or creatures is off the table. I think we need an entirely different ethical framework to describe what ought to be done when it is possible to create new people. I am not by any means saying that "which option would result in more utility" is never a morally relevant consideration when deciding to create a new person, but I definitely think it is not the only one.[3]
So what do I propose as a replacement to utility maximization? I would argue in favor of a system that promotes a wide range of ideals. Doing some research, I discovered that G. E. Moore had in fact proposed a form of "ideal utilitarianism" in the early 20th century.[4] However, I think that "ideal consequentialism" might be a better term for this system, since it isn't just about aggregating utility functions.
What are some of the ideals that an ideal consequentialist theory of population ethics might seek to promote? I've already hinted at what I think they are: Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom... mutual affection, love, friendship, cooperation; all those other important human universals, plus all the stuff in the Fun Theory Sequence. When considering what sort of creatures to create we ought to create creatures that value those things. Not necessarily all of them, or in the same proportions (diversity is an important ideal as well), but they should value a great many of those ideals.
Now, lest you worry that this theory has any totalitarian implications, let me make it clear that I am not saying we should force these values on creatures that do not share them. Forcing a paperclip maximizer to pretend to make friends and love people does not do anything to promote the ideals of Friendship and Love. Forcing a chimpanzee to listen while you read the Sequences to it does not promote the values of Truth and Knowledge. Those ideals require both a subjective and objective component. The only way to promote those ideals is to create a creature that includes them as part of its utility function and then help it maximize its utility.
I am also certainly not saying that there is never any value in creating a creature that does not possess these values. There are obviously many circumstances where it is good to create nonhuman animals. There may even be some circumstances where a paperclip maximizer could be of value. My argument is simply that it is most important to make sure that creatures who value these various ideals exist.
I am also not suggesting that it is morally acceptable to casually inflict horrible harms upon a creature with non-human values if we screw up and create one by accident. If promoting ideals and maximizing utility are separate values then it may be that once we have created such a creature we have a duty to make sure it lives a good life, even if it was a bad thing to create it in the first place. You can't unbirth a child.[5]
It also seems to me that in addition to having ideals about what sort of creatures should exist, we also have ideals about how utility ought to be concentrated. If this is the case then ideal consequentialism may be able to block some forms of the Repugnant Conclusion, even in situations where the only creatures whose creation is being considered are human beings. If it is acceptable to create humans instead of paperclippers, even if the paperclippers would have higher utility, it may also be acceptable to create ten humans with a utility of ten each instead of a hundred humans with a utility of 1.01 each.
Why Did We Become Convinced that Maximizing Utility was the Sole Good?
Population ethics was, until comparatively recently, a fallow field in ethics. And in situations where there is no option to increase the population, maximizing utility is the only consideration that's really relevant. If you've created creatures that value the right ideals, then all that is left to be done is to maximize their utility. If you've created creatures that do not value the right ideals, there is no value to be had in attempting to force them to embrace those ideals. As I've said before, you will not promote the values of Love and Friendship by creating a paperclip maximizer and forcing it to pretend to love people and make friends.
So in situations where the population is constant, "maximize utility" is a decent approximation of the meaning of right. It's only when the population can be added to that morality becomes much more complicated.
Another thing to blame is human-centric reasoning. When people defend the Repugnant Conclusion they tend to point out that a life barely worth living is not as bad as it would seem at first glance. They emphasize that it need not be a boring life, it may be a life full of ups and downs where the ups just barely outweigh the downs. A life worth living, they say, is a life one would choose to live. Derek Parfit developed this idea to some extent by arguing that there are certain values that are "discontinuous" and that one needs to experience many of them in order to truly have a life worth living.
The Orthogonality Thesis throws all these arguments out the window. It is possible to create an intelligence to execute any utility function, no matter what it is. If human beings have all sorts of complex needs that must be fulfilled in order for them to lead worthwhile lives, then you could create more worthwhile lives by killing the human race and replacing them with something less finicky. Maybe happy cows. Maybe paperclip maximizers. Or how about some creature whose only desire is to live for one second and then die. If we created such a creature and then killed it we would reap huge amounts of utility, for we would have created a creature that got everything it wanted out of life!
How Intuitive is the Mere Addition Principle, Really?
I think most people would agree that morality should exist, and that therefore any system of population ethics should not lead to the Genocidal Conclusion. But which step in the Benign Addition Paradox should we reject? We could reject the step where utility is redistributed. But that seems wrong: most people seem to consider it bad for animals and sociopaths to suffer, and to think it acceptable to inflict at least some amount of disutility on human beings to prevent such suffering.
It seems more logical to reject the Mere Addition Principle. In other words, maybe we ought to reject the idea that the mere addition of more lives-worth-living cannot make the world worse. And in turn, we should probably also reject the Benign Addition Principle. Adding more lives-worth-living may be capable of making the world worse, even if doing so also slightly benefits existing people. Fortunately this isn't a very hard principle to reject. While many moral philosophers treat it as obviously correct, nearly everyone else rejects this principle in day-to-day life.
Now, I'm obviously not saying that people's behavior in their day-to-day lives is always good, it may be that they are morally mistaken. But I think the fact that so many people seem to implicitly reject it provides some sort of evidence against it.
Take people's decision to have children. Many people choose to have fewer children than they otherwise would because they do not believe they will be able to adequately care for them, at least not without inflicting large disutilities on themselves. If most people accepted the Mere Addition Principle there would be a simple solution for this: have more children and then neglect them! True, the children's lives would be terrible while they were growing up, but once they've grown up and are on their own there's a good chance they may be able to lead worthwhile lives. Not only that, it may be possible to trick the welfare system into giving you money for the children you neglect, which would satisfy the Benign Addition Principle.
Yet most people choose not to have children and neglect them. And furthermore they seem to think that they have a moral duty not to do so, that a world where they choose not to have neglected children is better than one where they do. What is wrong with them?
Another example is a common political view many people have. Many people believe that impoverished people should have fewer children because of the burden doing so would place on the welfare system. They also believe that it would be bad to get rid of the welfare system altogether. If the Benign Addition Principle were as obvious as it seems, they would instead advocate for the abolition of the welfare system, and encourage impoverished people to have more children. Assuming most impoverished people live lives worth living, this is exactly analogous to the BAP: it would create more people, while benefiting existing ones (the people who pay lower taxes because of the abolition of the welfare system).
Yet again, most people choose to reject this line of reasoning. The BAP does not seem to be an obvious and intuitive principle at all.
The Genocidal Conclusion is Really Repugnant
There is nearly nothing more repugnant than the Genocidal Conclusion. Pretty much the only way a line of moral reasoning could go more wrong would be concluding that we have a moral duty to cause suffering as an end in itself. This means that it's fairly easy to counter any argument for total utilitarianism that points out that the alternative I am promoting has odd conclusions that do not fit some of our moral intuitions, while total utilitarianism does not. Simply ask: is that conclusion more insane than the Genocidal Conclusion? If it isn't, total utilitarianism should still be rejected.
Ideal Consequentialism Needs a Lot of Work
I do think that Ideal Consequentialism needs some serious ironing out. I haven't really developed it into a logical and rigorous system, at this point it's barely even a rough framework. There are many questions that stump me. In particular I am not quite sure what population principle I should develop. It's hard to develop one that rejects the MAP without leading to weird conclusions, like that it's bad to create someone of high utility if a population of even higher utility existed long ago. It's a difficult problem to work on, and it would be interesting to see if anyone else had any ideas.
But just because I don't have an alternative fully worked out doesn't mean I can't reject Total Utilitarianism. It leads to the conclusion that a world with no love, curiosity, complex challenge, friendship, morality, or any other value the human race holds dear is an ideal, desirable world, if there is a sufficient amount of some other creature with a simpler utility function. Morality should exist, and because of that, total utilitarianism must be rejected as a moral system.
[1] I have been asked to note that when I use the phrase "utility" I am usually referring to a concept that is called "E-utility," rather than the Von Neumann-Morgenstern utility that is sometimes discussed in decision theory. The difference is that in VNM one's moral views are included in one's utility function, whereas in E-utility they are not. So if one chooses to harm oneself to help others because one believes that is morally right, one has higher VNM utility, but lower E-utility.
[2] There is a certain argument against the Repugnant Conclusion that goes that, as the steps of the Mere Addition Paradox are followed, the world will lose its last symphony, its last great book, and so on. I have always considered this to be an invalid argument because the world of the RC doesn't necessarily have to be one where these things don't exist; it could be one where they exist, but are enjoyed very rarely. The Genocidal Conclusion brings this argument back in force. Creating creatures that can appreciate symphonies and great books is very inefficient compared to creating bunny rabbits pumped full of heroin.
[3] Total Utilitarianism was originally introduced to population ethics as a possible solution to the Non-Identity Problem. I certainly agree that such a problem needs a solution, even if Total Utilitarianism doesn't work out as that solution.
[4] I haven't read a lot of Moore; most of my ideas were extrapolated from other things I read on Less Wrong. I just mentioned him because in my research I noticed his concept of "ideal utilitarianism" resembled my ideas. While I do think he was on the right track he does commit the Mind Projection Fallacy a lot. For instance, he seems to think that one could promote beauty by creating beautiful objects, even if there were no creatures with standards of beauty around to appreciate them. This is why I am careful to emphasize that to promote ideals like love and beauty one must create creatures capable of feeling love and experiencing beauty.
[5] My tentative answer to the question Eliezer poses in "You Can't Unbirth a Child" is that human beings may have a duty to allow the cheesecake maximizers to build some amount of giant cheesecakes, but they would also have a moral duty to limit such creatures' reproduction in order to spare resources to create more creatures with humane values.
EDITED: To make a point about ideal consequentialism clearer, based on AlexMennen's criticisms.
While doing some reading on philosophy I came across some interesting questions about the nature of having desires and preferences. One, do you still have preferences and desires when you are unconscious? Two, if you don't, does this call into question the many moral theories that hold that having preferences and desires is what makes one morally significant, since mistreating temporarily unconscious people seems obviously immoral?
Philosophers usually discuss this question when debating the morality of abortion, but to avoid doing any mindkilling I won't mention that topic, except to say in this sentence that I won't mention it.
In more detail the issue is: A common, intuitive, and logical-seeming explanation for why it is immoral to destroy a typical human being, but not to destroy a rock, is that a typical human being has certain desires (or preferences or values, whatever you wish to call them; I'm using the terms interchangeably) that they wish to fulfill, and destroying them would hinder the fulfillment of these desires. A rock, by contrast, does not have any such desires, so it is not harmed by being destroyed. The problem with this is that it also seems immoral to harm a human being who is asleep, or is in a temporary coma. And, on the face of it, it seems plausible to say that an unconscious person does not have any desires. (And of course it gets even weirder when considering far-out concepts like a brain emulator that is saved to a hard drive, but isn't being run at the moment.)
After thinking about this it occurred to me that this line of reasoning could be taken further. If I am not thinking about my car at the moment, can I still be said to desire that it is not stolen? Do I stop having desires about things the instant my attention shifts away from them?
I have compiled a list of possible solutions to this problem, ranked in order from least plausible to most plausible.
1. One possibility would be to consider it immoral to harm a sleeping person because they will have desires in the future, even if they don't now. I find this argument extremely implausible because it has some extremely bizarre implications, some of which may lead to insoluble moral contradictions. For instance, this argument could be used to argue that it is immoral to destroy skin cells because it is possible to use them to clone a new person, who will eventually grow up to have desires.
Furthermore, when human beings eventually gain the ability to build AIs that possess desires, this solution interacts with the orthogonality thesis in a catastrophic fashion. If it is possible to build an AI with any utility function, then for every potential AI one can construct, there is another potential AI that desires the exact opposite of what that AI desires. That leads to total paralysis, since for every potential set of desires we are capable of satisfying there is another potential set that would be horribly thwarted.
Lastly, this argument implies that you can (and may be obligated to) help someone who doesn't exist, and never has existed, by satisfying their non-personal preferences, without ever having to bother with actually creating them. This seems strange; I can maybe see an argument for respecting the once-existent preferences of those who are dead, but respecting the hypothetical preferences of the never-existed seems absurd. It also has the same problems with the orthogonality thesis that I mentioned earlier.
2. Make the same argument as solution 1, but somehow define the categories more narrowly so that an unconscious person's ability to have desires in the future differs from that of an uncloned skin cell or an unbuilt AI. Michael Tooley has tried to do this by distinguishing between things that have the "possibility" of becoming a person with desires (i.e., skin cells) and those that have the "capacity" to have desires. This approach has been criticized, and I find myself pessimistic about it because categories have a tendency to be "fuzzy" in real life and not have sharp borders.
3. Another solution may be that desires that one has had in the past continue to count, even when one is unconscious or not thinking about them. So it's immoral to harm unconscious people because before they were unconscious they had a desire not to be harmed, and it's immoral to steal my car because I desired that it not be stolen earlier when I was thinking about it.
I find this solution fairly convincing. The only major quibble I have with it is that it gives what some might consider a counter-intuitive result on a variation of the sleeping person question. Imagine a nano-factory manufactures a sleeping person. This person is a new and distinct individual, and when they wake up they will proceed to behave as a typical human. This solution may suggest that it is okay to kill them before they wake up, since they haven't had any desires yet, which does seem odd.
4. Reject the claim that one doesn't have desires when one is unconscious, or when one is not thinking about a topic. The more I think about this solution, the more obvious it seems. Generally when I am rationally deliberating about whether or not I desire something, I consider how many of my values and ideals it fulfills. It seems like my list of values and ideals remains fairly constant, and that even if I am focusing my attention on one value at a time it makes sense to say that I still "have" the other values I am not focusing on at the moment.
Obviously I don't think that there's some portion of my brain where my "values" are stored in a neat little Excel spreadsheet. But they do seem to be a persistent part of its structure in some fashion. And it makes sense that they'd still be part of its structure when I'm unconscious. If they weren't, wouldn't my preferences change radically every time I woke up?
In other words, it's bad to harm an unconscious person because they have desires, preferences, values, whatever you wish to call them, that harming them would violate. And those values are a part of the structure of their mind that doesn't go away when they sleep. Skin cells and unbuilt AIs, by contrast, have no such values.
Now, while I think that explanation 4 resolves the issue of desires and unconsciousness best, I do think solution 3 has a great deal of truth to it as well (for instance, I tend to respect the final wishes of a dead person because they had desires in the past, even if they don't now). Solutions 3 and 4 are not incompatible at all, so one can believe in both of them.
I'm curious as to what people think of my possible solutions. Am I right about people still having something like desires in their brain when they are unconscious?
In an earlier post, I talked about how we could deal with variants of the Heaven and Hell problem - situations where you have an infinite number of options, and none of them is a maximum. The solution for a (deterministic) agent was to try and implement the strategy that would reach the highest possible number, without risking falling into an infinite loop.
Wei Dai pointed out that in the cases where the options are unbounded in utility (i.e., you can get arbitrarily high utility), there are probabilistic strategies that give you infinite expected utility. I suggested you could still do better than this. This started a conversation about choosing between strategies with infinite expectation (would you prefer a strategy with infinite expectation, or the same plus an extra dollar?), which went off into some interesting directions as to what needed to be done when the strategies can't sensibly be compared with each other...
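For concreteness, here is one standard construction of the sort Wei Dai's observation relies on (my illustration, not necessarily the exact example from that conversation): mix over the unbounded options so that the tail terms never shrink.

```latex
% Assume options with utility 2^n are available for every n >= 1 (an
% illustrative assumption), and choose option n with probability 2^{-n}:
\[
  \mathbb{E}[U] = \sum_{n=1}^{\infty} \Pr[n]\,U(n)
               = \sum_{n=1}^{\infty} 2^{-n} \cdot 2^{n}
               = \sum_{n=1}^{\infty} 1 = \infty .
\]
```

Comparing strategies like this one against each other is where the subtleties mentioned above come in.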
Interesting though that may be, it's also helpful to have simple cases where you don't need all these subtleties. So here is one:
Omega approaches you and Mrs X, asking you each to name an integer to him, privately. The person who names the highest integer gets 1 utility; the other gets nothing. In practical terms, Omega will reimburse you all utility lost during the decision process (so you can take as long as you want to decide). The first person to name a number gets 1 utility immediately; they may then lose that 1 depending on the eventual response of the other. Hence if one person responds and the other doesn't, they get the 1 utility and keep it. What should you do?
In this case, a strategy that gives you a number with infinite expectation isn't enough - you have to beat Mrs X, but you also have to eventually say something. Hence there is a duel of (likely probabilistic) strategies, implemented by bounded agents, with no maximum strategy, and each agent trying to compute the maximal strategy they can construct without falling into a loop.
There are many paradoxes with unbounded utility functions. For instance, consider whether it's rational to spend eternity in Hell:
Suppose that you die, and God offers you a deal. You can spend 1 day in Hell, and he will give you 2 days in Heaven, and then you will spend the rest of eternity in Purgatory (which is positioned exactly midway in utility between heaven and hell). You decide that it's a good deal, and accept. At the end of your first day in Hell, God offers you the same deal: 1 extra day in Hell, and you will get 2 more days in Heaven. Again you accept. The same deal is offered at the end of the second day.
And the result is... that you spend eternity in Hell. There is never a rational moment to leave for Heaven - that decision is always dominated by the decision to stay in Hell.
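A worked version of why this happens, under a natural reading of the setup (Hell at -1 per day, Heaven at +1 per day, Purgatory at 0; these numbers are my assumption, not part of the original problem):

```latex
\[
  U(\text{accept } n \text{ deals, then leave})
    = \underbrace{-\,n}_{n \text{ days in Hell}} + \underbrace{2n}_{2n \text{ days in Heaven}}
    = n ,
\]
% so accepting one more deal is always strictly better (n+1 > n), yet the
% policy of always accepting never gets you to Heaven at all:
\[
  U(\text{always accept}) = \lim_{n \to \infty} (-n) = -\infty .
\]
```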
Or consider a simpler paradox:
You're immortal. Tell Omega any natural number, and he will give you that much utility. On top of that, he will give you any utility you may have lost in the decision process (such as the time wasted choosing and specifying your number). Then he departs. What number will you choose?
Again, there's no good answer to this problem - any number you name, you could have got more by naming a higher one. And since Omega compensates you for extra effort, there's never any reason to not name a higher number.
It seems that these are problems caused by unbounded utility. But that's not the case, in fact! Consider:
You're immortal. Tell Omega any real number r > 0, and he'll give you 1-r utility. On top of that, he will give you any utility you may have lost in the decision process (such as the time wasted choosing and specifying your number). Then he departs. What number will you choose?
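Here the utilities are bounded above by 1, yet the same pathology appears; a short way to see it:

```latex
\[
  \sup_{r > 0} \,(1 - r) = 1 ,
  \qquad \text{but} \qquad
  1 - r \;<\; 1 - \tfrac{r}{2} \;<\; 1 \quad \text{for every } r > 0 ,
\]
% so no choice of r attains the supremum: naming r/2 always beats naming r.
```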
Appendix to: A fungibility theorem
Suppose that $X$ is a set and we have functions $f_1, \dots, f_n \colon X \to \mathbb{R}$. Recall that for $x, y \in X$, we say that $y$ is a Pareto improvement over $x$ if for all $i$, we have $f_i(y) \ge f_i(x)$. And we say that it is a strong Pareto improvement if in addition there is some $i$ for which $f_i(y) > f_i(x)$. We call $x$ a Pareto optimum if there is no strong Pareto improvement over it.
Theorem. Let $X$ be a set and suppose $f_i \colon X \to \mathbb{R}$ for $i = 1, \dots, n$ are functions satisfying the following property: For any $x, y \in X$ and any $p \in [0,1]$, there exists a $z \in X$ such that for all $i$, we have $f_i(z) = p\,f_i(x) + (1-p)\,f_i(y)$.
Then if an element $x$ of $X$ is a Pareto optimum, then there exist nonnegative constants $c_1, \dots, c_n$ such that the function $c_1 f_1 + \dots + c_n f_n$ achieves a maximum at $x$.
Imagine that the universe is approximately as it appears to be (I know, this is a controversial proposition, but bear with me!). Further imagine that the many worlds interpretation of Quantum mechanics is true (I'm really moving out of Less Wrong's comfort zone here, aren't I?).
Now assume that our universe is in a situation of false vacuum - the universe is not in its lowest energy configuration. Somewhere, at some point, our universe may tunnel into true vacuum, resulting in an expanding bubble of destruction that will eat the entire universe at high speed, destroying all matter and life. In many worlds, such a collapse need not be terminal: life could go on in a branch of lower measure. In fact, anthropically, life will go on somewhere, no matter how unstable the false vacuum is.
So now assume that the false vacuum we're in is highly unstable - the measure of the branch in which our universe survives goes down by a factor of a trillion every second. We only exist because we're in the branch of measure a trillionth of a trillionth of a trillionth of... all the way back to the Big Bang.
None of these assumptions make any difference to what we'd expect to see observationally: only a good enough theory can say that they're right or wrong. You may notice that this setup transforms the whole universe into a quantum suicide situation.
The question is, how do you go about maximising expected utility in this situation? I can think of a few different approaches:
- Gnaw on the bullet: take the quantum measure as a probability. This means that you now have a discount factor of a trillion every second. You have to rush out and get/do all the good stuff as fast as possible: a delay of a second costs you a factor of a trillion in utility. If you are a negative utilitarian, you also have to rush to minimise the bad stuff, but you can also take comfort in the fact that the potential for negative utility across the universe is going down fast.
- Use relative measures: care about the relative proportion of good worlds versus bad worlds, while assigning zero to those worlds where the vacuum has collapsed. This requires a natural zero to make sense, and can be seen as quite arbitrary: what would you do about entangled worlds, or about the non-zero probability that the vacuum-collapsed worlds may have worthwhile life in them? Would the relative measure user also put zero value on worlds that were empty of life for reasons other than vacuum collapse? For instance, would they be in favour of programming an AI's friendliness using random quantum bits, if they could be assured that, if friendliness fails, the AI would kill everyone immediately?
- Deny the measure: construct a meta-ethical theory where only classical probabilities (or classical uncertainties) count as probabilities. Quantum measures do not: you care about the sum total of all branches of the universe. Universes in which the photon went through the top slit, went through the bottom slit, or was in an entangled state that went through both slits... to you, these are three completely separate universes, and you can assign totally unrelated utilities to each one. This seems quite arbitrary, though: how are you going to construct these preferences across the whole of the quantum universe, when you forged your current preferences on a single branch?
- Cheat: note that nothing in life is certain. Even if we have the strongest evidence imaginable about vacuum collapse, there's always a tiny chance that the evidence is wrong. After a few seconds, that probability will be dwarfed by the discount factor of the collapsing universe. So go about your business as usual, knowing that most of the measure/probability mass remains in the non-collapsing universe. This can get tricky if, for instance, the vacuum collapsed more slowly than by a factor of a trillion a second. Would you be in a situation where you should behave as if you believed vacuum collapse for another decade, say, and then switch to a behaviour that assumed non-collapse afterwards? Also, would you take seemingly stupid bets, like bets at a trillion trillion trillion to one that the next piece of evidence will show no collapse (if you lose, you're likely in the low measure universe anyway, so the loss is minute)?
One approach to constructing a Friendly artificial intelligence is to create a piece of software that looks at large amounts of evidence about humans, and attempts to infer their values. I've been doing some thinking about this problem, and I'm going to talk about some approaches and problems that have occurred to me.
In a naive approach, we might define the problem like this: take some unknown utility function U and plug it into a mathematically clean optimization process O (like AIXI). Then look at your data set, take the information about the inputs and outputs of humans, and find the simplest U that best explains human behavior.
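As a minimal sketch of this naive setup (the toy data, the candidate utility functions, and the complexity scores below are stand-ins of my own; the planner `O` is nothing like AIXI):

```python
def O(U, available_actions):
    """An idealized optimizer: always takes the U-maximizing action."""
    return max(available_actions, key=U)

# Toy "human behaviour" data: (choice set, action the human actually took).
human_data = [
    ([1, 2, 3], 3),
    ([2, 5, 7], 7),
    ([4, 1, 9], 4),   # a "mistake" that a perfect optimizer wouldn't make
]

# Candidate utility functions, each with a crude complexity score.
candidates = [
    ("maximize x",   lambda a: a,           1),
    ("minimize x",   lambda a: -a,          1),
    ("prefer evens", lambda a: a % 2 == 0,  2),
]

def fit(U):
    """How many observed choices the idealized optimizer reproduces under U."""
    return sum(O(U, actions) == chosen for actions, chosen in human_data)

# "Find the simplest U that best explains human behavior."
alpha = 0.1   # weight on the simplicity penalty
name, best_U, _ = max(candidates, key=lambda c: fit(c[1]) - alpha * c[2])
print(name, "reproduces", fit(best_U), "of", len(human_data), "observed choices")
```

The point of the next paragraph is that the U which scores best under a procedure like this is whichever one reproduces our mistakes too, biases and all.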
Unfortunately, this won't work. The best possible match for U is one that models not just those elements of human utility we're interested in, but also all the details of our broken, contradictory optimization process. The U we derive through this process will optimize for confirmation bias, scope insensitivity, hindsight bias, the halo effect, our own limited intelligence and inefficient use of evidence, and just about everything else that's wrong with us. Not what we're looking for.
Okay, so let's try putting a bandaid on it - let's go back to our original problem setup. However, we'll take our original O, and use all of the science on cognitive biases at our disposal to handicap it. We'll limit its search space, saddle it with a laundry list of cognitive biases, cripple its ability to use evidence, and in general make it as human-like as we possibly can. We could even give it akrasia by implementing hyperbolic discounting of reward. Then we'll repeat the original process to produce U'.
If we plug U' into our AI, the result will be that it will optimize like a human who had suddenly been stripped of all the kinds of stupidity that we programmed into our modified O. This is good! Plugged into a solid CEV infrastructure, this might even be good enough to produce a future that's a nice place to live. However, it's not quite ideal. If we miss a cognitive bias, then it'll be incorporated into the learned utility functions, and we may never be rid of it. What would be nice would be if we could get the AI to learn about cognitive biases, exhaustively, and update in the future if it ever discovered a new one.
If we had enough time and money, we could do this the hard way: acquire a representative sample of the human population, and pay them to perform tasks with simple goals under tremendous surveillance, and have the AI derive the human optimization process from the actions taken towards a known goal. However, if we assume that the human optimization process can be defined as a function over the state of the human brain, we should not trust the completeness of any such process learned from less data than the entropy of the human brain, which is on the order of tens of petabytes of extremely high quality evidence. If we want to be confident in the completeness of our model, we may need more experimental evidence than it is really practical to accumulate. Which isn't to say that this approach is useless - if we can hit close enough to the mark, then the AI may be able to run more exhaustive experimentation later and refine its own understanding of human brains to be closer to the ideal.
But it'd really be nice if our AI could do unsupervised learning to figure out the details of human optimization. Then we could simply dump the internet into it, and let it grind away at the data and spit out a detailed, complete model of human decision-making, from which our utility function could be derived. Unfortunately, this does not seem to be a tractable problem. It's possible that some insight could be gleaned by examining outliers with normal intelligence but deviant utility functions (I am thinking specifically of sociopaths), though it's unclear how much insight can be produced by these methods. If anyone has suggestions for a more efficient way of going about it, I'd love to hear them. As it stands, it might be possible to get enough information from this to supplement a supervised learning approach - the closer we get to a perfectly accurate model, the higher the probability of Things Going Well.
Anyways, that's where I am right now. I just thought I'd put up my thoughts and see if some fresh eyes see anything I've been missing.
Many people see themselves as members of various groups (the population of their home country, or their social network), and feel justified in caring more about the well-being of people in this group than about that of others. They will argue from reciprocity: "Those people pay taxes in our country, they are entitled to more support from 'us' than others!" My question is: Is this inconsistent with some rationality axioms that seem obvious? What often-adopted or reasonable axioms are there that make this inconsistent?
This is a mathematical appendix to my post "Why you must maximize expected utility", giving precise statements and proofs of some results about von Neumann-Morgenstern utility theory without the Axiom of Continuity. I wish I had the time to make this post more easily readable, giving more intuition; the ideas are rather straightforward and I hope they won't get lost in the line noise!
The work here is my own (though closely based on the standard proof of the VNM theorem), but I don't expect the results to be new.
In the following, I will always assume that $\succ$ satisfies the independence axiom: that is, for all $p \in (0,1]$ and all lotteries $\ell$, $\ell'$ and $m$, we have $\ell \succ \ell'$ if and only if $p\ell + (1-p)m \succ p\ell' + (1-p)m$. Note that the analogous statement with weak preferences follows from this: $\ell \succeq \ell'$ holds iff $\neg(\ell' \succ \ell)$, which by independence is equivalent to $\neg(p\ell' + (1-p)m \succ p\ell + (1-p)m)$, which is just $p\ell + (1-p)m \succeq p\ell' + (1-p)m$.
Lemma 1 (more of a good thing is always better). If $\ell \succ m$ and $0 \le p < q \le 1$, then $q\ell + (1-q)m \succ p\ell + (1-p)m$.
Proof. Let $n = \frac{p}{1-q+p}\,\ell + \frac{1-q}{1-q+p}\,m$ (in the degenerate case $p = 0$, $q = 1$ the claim is just $\ell \succ m$, so assume $q - p < 1$). Then, $q\ell + (1-q)m = (q-p)\,\ell + (1-q+p)\,n$ and $p\ell + (1-p)m = (q-p)\,m + (1-q+p)\,n$. Thus, the result follows from independence applied to $\ell$, $m$, $q-p$, and $n$.
Lemma 2. If $\ell \succ m$ and $\ell \succeq x \succeq m$, then there is a unique $\bar p \in [0,1]$ such that $x \succ p\ell + (1-p)m$ for $p < \bar p$ and $p\ell + (1-p)m \succ x$ for $p > \bar p$.
Proof. Let $\bar p$ be the supremum of all $p \in [0,1]$ such that $x \succeq p\ell + (1-p)m$ (note that by assumption, this condition holds for $p = 0$). Suppose that $p < \bar p$. Then there is an $\varepsilon > 0$ such that $x \succeq (p+\varepsilon)\ell + (1-p-\varepsilon)m$. By Lemma 1, we have $(p+\varepsilon)\ell + (1-p-\varepsilon)m \succ p\ell + (1-p)m$, and the first assertion follows.
Suppose now that $p > \bar p$. Then by definition of $\bar p$, we do not have $x \succeq p\ell + (1-p)m$, which means that we have $p\ell + (1-p)m \succ x$, which was the second assertion.
Finally, uniqueness is obvious, because if both $\bar p$ and $\bar p' \neq \bar p$ satisfied the condition, we would have both $x \succ p\ell + (1-p)m$ and $p\ell + (1-p)m \succ x$ for any $p$ strictly between them.
Definition 3. $\ell$ is much better than $m$, notation $\ell \succ\succ m$ or $m \prec\prec \ell$, if there are neighbourhoods $U$ of $\ell$ and $V$ of $m$ (in the relative topology of the space of lotteries) such that we have $\ell' \succ m'$ for all $\ell' \in U$ and $m' \in V$. (In other words, the graph of $\succ\succ$ is the interior of the graph of $\succ$.) Write $\ell \succeq\succeq m$ or $m \preceq\preceq \ell$ when $\neg(m \succ\succ \ell)$ ($m$ is not much better than $\ell$), and $\ell \approx m$ ($\ell$ is about as good as $m$) when both $\ell \succeq\succeq m$ and $m \succeq\succeq \ell$.
Theorem 4 (existence of a utility function). There is a such that for all ,
Unless for all and , there are such that .
Proof. Let be a worst and a best outcome, i.e. let be such that for all . If , then for all , and by repeated applications of independence we get for all , and therefore again for all , and we can simply choose .
Thus, suppose that . In this case, let be such that for every , equals the unique provided by Lemma 2 applied to and . Because of Lemma 1, . Let .
We first show that implies . For every , we either have , in which case by Lemma 2 we have for arbitrarily small , or we have , in which case we set and find . Set . Now, by independence applied times, we have ; analogously, we obtain for arbitrarily small . Thus, using and Lemma 1, and therefore as claimed. Now note that if , then this continues to hold for and in a sufficiently small neighbourhood of and , and therefore we have .
Now suppose that . Since we have and , we can find points and arbitrarily close to and such that the inequality becomes strict (either the left-hand side is smaller than one and we can increase it, or the right-hand side is greater than zero and we can decrease it, or else the inequality is already strict). Then, by the preceding paragraph. But this implies that , which completes the proof.
Corollary 5. is a preference relation (i.e., a total preorder) that satisfies independence and the von Neumann-Morgenstern continuity axiom.
Proof. It is well-known (and straightforward to check) that this follows from the assertion of the theorem.
Corollary 6. is unique up to affine transformations.
Proof. Since is a VNM utility function for , this follows from the analogous result for that case.
Corollary 7. Unless for all , for all the set has lower dimension than (i.e., it is the intersection of with a lower-dimensional subspace of ).
Proof. First, note that the assumption implies that . Let be given by , , and note that is the intersection of the hyperplane with the closed positive orthant . By the theorem, is not parallel to , so the hyperplane is not parallel to . It follows that has dimension , and therefore can have at most this dimension. (It can have smaller dimension or be the empty set if only touches or lies entirely outside the positive orthant.)
In explorations of AI risk, it is helpful to formalize concepts. One particularly important concept is intelligence. How can we formalize it, or better yet, measure it? “Intelligence” is often considered mysterious or is anthropomorphized. One way to taboo “intelligence” is to talk instead about optimization processes. An optimization process (OP, also optimization power) selects some futures from a space of possible futures. It does so according to some criterion; that is, it optimizes for something. Eliezer Yudkowsky spends a few of the sequence posts discussing the nature and importance of this concept for understanding AI risk. In them, he informally describes a way to measure the power of an OP. We consider mathematical formalizations of this measure.
Here's EY's original description of his measure of OP.
Put a measure on the state space - if it's discrete, you can just count. Then collect all the states which are equal to or greater than the observed outcome, in that optimization process's implicit or explicit preference ordering. Sum or integrate over the total size of all such states. Divide by the total volume of the state space. This gives you the power of the optimization process measured in terms of the improbabilities that it can produce - that is, improbability of a random selection producing an equally good result, relative to a measure and a preference ordering.
If you prefer, you can take the reciprocal of this improbability (1/1000 becomes 1000) and then take the logarithm base 2. This gives you the power of the optimization process in bits.
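A short Python sketch of the counting version of this measure, assuming a finite discrete state space with a utility value per state (the states and numbers below are made up for illustration):

    import math

    def optimization_power_bits(state_utilities, achieved_state):
        # Bits of optimization: -log2 of the fraction of states at least as good
        # as the achieved state, under a uniform measure over states.
        achieved = state_utilities[achieved_state]
        at_least_as_good = sum(1 for u in state_utilities.values() if u >= achieved)
        return -math.log2(at_least_as_good / len(state_utilities))

    # Toy example: 8 equally likely states, the OP hits the second-best one.
    states = {s: u for s, u in zip("abcdefgh", [0, 1, 2, 3, 4, 5, 6, 7])}
    print(optimization_power_bits(states, "g"))   # 2.0 bits: top quarter of states

Hitting the top quarter of an eight-state space comes out to two bits, matching the reciprocal-then-logarithm recipe above.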
Let's say that at time we have a formalism to specify all possible world states at some future time . Perhaps it is a list of particle locations and velocities, or perhaps it is a list of all possible universal wave functions. Or maybe we're working in a limited domain, and it's a list of all possible next-move chess boards. Let's also assume that we have a well-justified prior over these states being the next ones to occur in the absence of an OP (more on that later).
We order according to the OP's preferences. For the moment, we actually don't care about the density, or “measure” of our ordering. Now we have a probability distribution over . The integral from to over this represents the probability that the worldstate at will be better than , and worse than . When time continues, and the OP acts to bring about some worldstate , we can calculate the probability of an equal or better outcome occurring;
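(On one reading of this setup, writing \( w \) for the achieved worldstate, \( W \) for the space of worldstates and \( p \) for the prior over them, the quantity in question is
\[
P(\text{outcome} \succeq w) \;=\; \sum_{w' \in W,\; w' \succeq w} p(w')
\quad\text{or, in the continuous case,}\quad
\int_{U(w)}^{U_{\max}} p(u)\,du ,
\]
i.e. the probability mass on outcomes the OP ranks at least as highly as the one it achieved.)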
This is a simple generalization of what EY describes above. Here are some things I am confused about.
Finding a specification for all possible worldstates is hard, but it's been done before. There are many ways to reasonably represent this. What I can't figure out is how to specify possible worldstates “in the absence of an OP”. This phrase hides tons of complexity. How can we formally construct this counterfactual? Is the matter that composes the OP no longer present? Is it present but “not acting”? What constitutes a null action? Are we considering the expected worldstate distribution as if the OP never existed? If the OP is some kind of black-box AI agent, it's easier to imagine this. But if the OP is evolution, or a forest fire, it's harder to imagine. Furthermore, is the specification dualist, or is the agent part of the worldstates? If it's dualist, this is a fundamentally false assumption, which can have lots of bad implications. If the agent is part of the worldstates, how do we represent them “in absence of an OP”?
But for the rest of this article, let's pretend we have such a specification. There's also a loss from ignoring the cardinal utility of the worldstates. Let's say you have two distributions of utility over sets , representing two different OPs. In both, the OP chooses a with the same utility . The distributions are the same on the left side of , and the second distribution has a longer tail on the right. It seems like the OP in distribution 1 was more impressive; the second OP missed all the available higher utility. We could make the expected utility of the second distribution arbitrarily high, while maintaining the same fraction of probability mass above the achieved worldstate. Conversely, we could instead extend the left tail of the second distribution, and say that the second OP was more impressive because it managed to avoid all the bad worlds.
Perhaps it is more natural to consider two distributions; the distribution of utility over entire world futures assuming the OP isn't present, versus the distribution after the OP takes its action. So instead of selecting a single possibility with certainty, the probabilities have just shifted.
How should we reduce this distribution shift to a single number which we call OP? Any shift of probability mass upwards in utility should increase the measure of OP, and vice versa. I think also that an increase in the expected utility (EU) of these distributions should be measured as a positive OP, and vice versa. EU seems like the critical metric to use. Let's generalize a little further, and say that instead of measuring OP between two points in time, we let the time difference go to zero, and measure instantaneous OP. Therefore we're interested in some equation which has the same sign as
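(Presumably the quantity meant here is the time derivative of expected utility, \( \tfrac{d}{dt}\,\mathbb{E}[U] \), given the emphasis on EU just above.)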
Besides that, I'm not exactly sure which specific equation should equal OP. I seem to have two contradicting desires:
1a) The sign of should be the sign of the OP.
1b) Negative and should be possible.
2) Constant positive OP should imply exponentially increasing .
Criterion 1) feels pretty obvious. Criterion 2) feels like a recognition of what is “natural” for OPs; to improve upon themselves, so that they can get better and better returns. The simplest differential equation that represents positive feedback yields exponentials, and is used across many domains because of its universal nature.
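For concreteness, the positive-feedback equation alluded to here, with its exponential solution, is
\[
\frac{dx}{dt} = k\,x \quad\Longrightarrow\quad x(t) = x(0)\,e^{k t},
\]
so one candidate (explored below) is \( \mathrm{OP} = \tfrac{d\,\mathbb{E}[U]/dt}{\mathbb{E}[U]} \): holding it constant at \( k \) forces exactly the exponential growth that criterion 2) asks for.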
This intuition certainly isn't anthropocentric, but it might be this-universe biased. I'd be interested in seeing if it is natural in other computable environments.
If we just use , then criterion 2) is not satisfied. If we use , then decreases in EU are not defined, and constant EU is negative infinite OP, violating 1). If we use , then 2) is satisfied, but negative and decreasing EU give positive OP, violating 1a). If we use , then 2) is still satisfied, but gives , violating 1a). Perhaps the only consistent equation would be . But seriously, who uses absolute values? I can't recall a fundamental equation that relied on them. They feel totally ad hoc. Plus, there's this weird singularity at . What's up with that?
Classically, utility is invariant up to positive affine transformations. Criterion 1) respects this because the derivative removes the additive constant, but 2) doesn't. It is still scale invariant, but it has an intrinsic zero. This made me consider the nature of “zero utility”. At least for humans, there is an intuitive sign to utility. We wouldn't say that stubbing your toe is 1,000,000 utils, and getting a car is 1,002,000 utils. It seems to me, especially after reading Omohundro's “Basic AI Drives”, that there is in some sense an intrinsic zero utility for all OPs.
All OPs need certain initial conditions to even exist. After that, they need resources. AIs need computer hardware and energy. Evolution needed certain chemicals and energy. Having no resources makes it impossible, in general, to do anything. If you have literally zero resources, you are not a "thing" which "does". So that is a type of intrinsic zero utility. Then what would having negative utility mean? It would mean the OP anti-exists. It's making it even less likely for it to be able to start working toward its utility function. What would exponentially decreasing utility mean? It would mean that it is a constant OP for the negative of the utility function that we are considering. So, it doesn't really have negative optimization power; if that's the result of our calculation, we should negate the utility function, and say it has positive OP. And that singularity at ? When you go from the positive side, getting closer and closer to 0 is really bad, because you're destroying the last bits of your resources; your last chance of doing any optimization. And going from negative utility to positive is infinitely impressive, because you bootstrapped from optimizing away from your goal to optimizing toward your goal.
So perhaps we should drop the part of 1b) that says negative EU can exist. Certainly world-states can exist that are terrible for a given utility function, but if an OP with that utility function exists, then the expected utility of the future is positive.
If this is true, then it seems there is more to the concept of utility than the von Neumann-Morgenstern axioms.
How do people feel about criterion 2), and my proposal that ?
I'm Anja Heinisch, the new visiting fellow at SI. I've been researching replacing AIXI's reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.
Introduction to AIXI
Marcus Hutter's (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence with actions (agent outputs) and perceptions (environment outputs) chosen from finite sets and . The perceptions are pairs , where is the observation part and denotes a reward. At time k the agent chooses its next action according to the expectimax principle:
Here M denotes the updated Solomonoff prior summing over all programs that are consistent with the history  and which will, when run on the universal Turing machine T with successive inputs , compute outputs , i.e.
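(Roughly, in Hutter's notation and suppressing the dot notation for past events; this is my transcription, so treat the details as a sketch rather than the post's exact formulas:
\[
a_k \;:=\; \arg\max_{a_k}\sum_{x_k}\;\cdots\;\max_{a_m}\sum_{x_m}\,\bigl[r_k+\cdots+r_m\bigr]\, M(x_{k:m}\mid x_{<k},\, a_{1:m}),
\qquad
M(x_{1:n}\mid a_{1:n}) \;=\; \sum_{q\,:\,q(a_{1:n})=x_{1:n}} 2^{-\ell(q)},
\]
where \( \ell(q) \) is the length of program \( q \).)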
AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitum doesn't really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.
Utility vs rewards
So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn't even blame the agent for doing that -- people try to game the system all the time.
You can visualize AIXI's computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, they are just there. It wants them to be as high as possible and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).
Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is, yes, we can just define a utility function
that assigns a value to every possible future history and use it to replace the reward system in the agent specification:
When I say "we can just define" I am actually referring to the really hard question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time and the reward system can always be represented in the framework of utility functions by setting
For the same reasons, this is also a strictly safer approach.
The original AIXI framework must necessarily place upper and lower bounds on the rewards that are achievable, because the rewards are part of the perceptions and is finite. The utility function approach does not have this problem, as the expected utility
is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield than of the surreal numbers, my favorite so far being the Levi-Civita field generated by the infinitesimal :
with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as
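(For reference, and using the standard definitions rather than anything specific to this post: elements of the Levi-Civita field are formal series
\[
x \;=\; \sum_{q \in \mathbb{Q}} a_q\, \varepsilon^{q}
\]
with left-finite support, i.e. for every rational \( r \) only finitely many exponents \( q < r \) have \( a_q \neq 0 \); the computer-friendly approximations mentioned above are the finite truncations \( \sum_{n=1}^{N} a_{q_n} \varepsilon^{q_n} \).)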
(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be , we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.
Note that finiteness of is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon remains a very problematic assumption, as described in Legg (2008).
Modifiable utility functions
Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent's hardware is part of the environment and, at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify its policy responsible for choosing future actions, would not do so, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.
A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let
denote the utility function the agent will have at time k.
Even though we will only analyze instances in which the agent knows at time k which utility function it will have at future times (possibly depending on the actions before that), we note that for every fixed future history the agent knows the utility function that is displayed on the screen because the screen is part of its perception data .
This leads to three different agent models worthy of further investigation:
- Agent 1 will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the utility function
- Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by
- Agent 3 will, at time k, try to maximize the utility it derives in hindsight, displayed on the screen at the time horizon
Of course arbitrary mixtures of these are possible.
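As a very rough Python sketch of how the three evaluation modes differ, assume for illustration that a candidate future is just a list of per-step outcomes and that u_funcs[t] is the utility function the screen will display at step t (these are my simplifying assumptions, not the formal definitions above):

    def agent1_value(future, u_funcs):
        # Agent 1: score the whole future with the utility function it has right now.
        return u_funcs[0](future)

    def agent2_value(future, u_funcs):
        # Agent 2: at each step, score the future-so-far with that step's utility function.
        return sum(u(future[: t + 1]) for t, u in enumerate(u_funcs))

    def agent3_value(future, u_funcs):
        # Agent 3: score everything in hindsight with the utility function at the horizon.
        return u_funcs[-1](future)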
The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: The Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. holds for all . Assume further that the agent's actions influence the future outcomes, at least from its current perspective. That is, for all there exist with . Let be the Simpleton utility function, assigning equal but maximal utility to all possible futures. While Agent 1 will optimize as before, not adapting its behavior to the knowledge that its utility function will change, Agent 3 will be paralyzed, having to rely on whatever method its implementation uses to break ties. Agent 2 on the other hand will try to maximize only the utility .
Now consider the actual Simpleton Gambit: At time k the agent gets to choose between changing, , resulting in and (not changing), leading to for all . We assume that has no further effects on the environment. As before, Agent 1 will optimize for business as usual, whether or not it chooses to change depends entirely on whether the screen specifically mentions the memory pointer to the utility function or not.
Agent 2 will change if and only if the utility of changing compared to not changing according to what the screen currently says is strictly smaller than the comparative advantage of always having maximal utility in the future. That is,
is strictly less than
This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.
Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities that it predictably will not take later. Agent 3 on the other hand will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2 and I invite everyone to point out its flaws.
Dotted actions/perceptions, like , denote past events; underlined perceptions denote random variables to be observed at future times.
And I don't mean that they must concern themselves with death in the sense of ending death, or removing its sting through mental backups, or delaying it to the later ages of the universe; or in the sense of working to decrease the probability of extinction risks and other forms of megadeath; or even in the sense of saving as many lives as possible, as efficiently as possible. All of that is legitimate and interesting. But I mean something far more down to earth.
First, let me specify more precisely who I am talking about. I mean people who are trying to maximize the general welfare; who are trying to achieve the greatest good for the greatest number; who are trying to do the best thing possible with their lives. When someone like that makes decisions, they are implicitly choosing among possible futures in a very radical way. They may be making judgments about whether a future with millions or billions of extra lives is better than some alternative. Whether anyone is ever in a position to make that much of a difference is another matter; but we can think of it like voting. You are at least making a statement about which sort of future you think you prefer, and then you do what you can, and that either makes a difference or it doesn't.
It seems to me that the discussions about the value of life among utilitarians are rather superficial. The typical notion is that we should maximize net pleasure and minimize net pain. Already that poses the question of whether a life of dull persistent happiness is better or worse than a life of extreme highs and lows. A more sophisticated notion is that we should just aspire to maximize "utility", where perhaps we don't even know what utility is yet. Certainly the CEV philosophy is that we don't yet know what utility really is for human beings. It would be interesting to see people who took that agnosticism to heart, people whose life-strategy amounted to (1) discovering true utility as soon as possible (2) living according to interim heuristics whose uncertainty is recognized, but which are adopted out of the necessity of having some sort of personal decision procedure.
So what I'm going to say pertains to (2). You may, if you wish, hold to the idea that the nature of true utility, like true friendliness, won't be known until the true workings of the human mind are known. What follows is something you should think on in order to refine your interim heuristics.
The first thing is that to create a life is to create a death. A life ends. And while the end of a life may not be its most important moment, it reminds us that a life is a whole. Any accurate estimation of the utility of a life is going to be a judgment of that whole.
So a utilitarian ought to contemplate the deaths of the world, and the lives that reach their ends in those deaths. Because the possible futures, that you wish to choose between, are distinguished by the number and nature of the whole lives that they contain. And all these dozens of people, all around the world of the present, ceasing to exist in every minute that passes, are examples of completed lives. Those lives weren't necessarily complete, in the sense of all personal desires and projects having come to their conclusion; but they came to their physical completion.
To choose one future over another is to prefer one set of completed lives to another set. It would be a godlike decision to truly be solely responsible for such a choice. In the real world, people hardly choose their own futures, let alone the future of the world; choice is a lifelong engagement with an evolving and partially known situation, not a once-off choice between several completely known scenarios; and even when a single person does end up being massively influential, they generally don't know what sort of future they're bringing about. The actual limitations on the knowledge and power of any individual may make the whole quest of the "ambitious utilitarian" seem quixotic. But a new principle, a new heuristic, can propagate far beyond one individual, so thinking big can have big consequences.
The main principle that I derive, from contemplating the completed lives of the world, is cautionary antinatalism. The badness of what can happen in a life, and the disappointing character of what usually happens, are what do it for me. I am all for the transhumanist quest and the struggle for a friendly singularity, and I support the desire of people who are already alive to make the most of that life. But I would recommend against the creation of life, at least until the current historical drama has played itself out - until the singularity, if I must use that word. We are in the process of gaining new powers and learning new things, there are obvious unknowns in front of us that we are on the way to figuring out, so at least hold off until they have been figured out and we have a better idea of what reality is about, and what we can really hope for, from existence.
However, the object of this post is not to argue for my special flavor of antinatalism. It is to encourage realistic consideration of what lives and futures are like. In particular, I would encourage more "story thinking", which has been criticized in favor of "systems thinking". Every actual life is a "story", in the sense of being a sequence of events that happens to someone. If you were judging the merit of a whole possible world on the basis of the whole lives that it contained, then you would be making a decision about whether those stories ought to actually occur. The biographical life-story is the building block of such possible worlds.
So an ambitious utilitarian, who aspires to have a set of criteria for deciding among whole possible worlds, really needs to understand possible lives. They need to know what sort of lives are likely under various circumstances; they need to know the nature of the different possible lives - what it's like to be that person; they need to know what sort of bad is going to accompany the sort of good that they decide to champion. They need to have some estimation of the value of a whole life, up to and including its death.
As usual, we are talking about a depth of knowledge that may in practice be impossible to attain. But before we go calling something impossible, and settling for a lesser ambition, let's at least try to grasp what the greater ambition truly entails. To truly choose a whole world would be to make the decision of a god, about the lives and deaths that will occur in that world. The future of our world, for some time to come, will repeat the sorts of lives and deaths that have already occurred in it. So if, in your world-planning, you don't just count on completely abolishing the present world and/or replacing it with a new one that works in a completely different way, you owe it to your cause to form a judgement about the totality of what has already happened here on Earth, and you need to figure out what you approve of, what you disapprove of, whether you can have the good without the bad, and how much badness is too much.
Edit: for reasons given in the comments, I don't think the question of what circular preferences actually do is well defined, so this is an answer to a wrong question.
If I like Y more than X, at an exchange rate of 0.9Y for 1X, and I like Z more than Y, at an exchange rate of 0.9Z for 1Y, and I like X more than Z, at an exchange rate of 0.9X for 1Z, you might think that given 1X and the ability to trade X for Y at an exchange rate of 0.95Y for 1X, and Y for Z at an exchange rate of 0.95Z for 1Y, and Z for X at an exchange rate of 0.95X for 1Z, I would trade in a circle until I had nothing left.
But actually, if I knew that I had circular preferences, and I knew that if I had 0.95Y I would trade it for (0.95^2)Z, which I would trade for (0.95^3)X, then actually I'd be trading 1X for (0.95^3)X, which I'm obviously not going to do.
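A quick Python check of that arithmetic:

    x = 1.0
    for rate in (0.95, 0.95, 0.95):   # trade X -> Y -> Z -> back to X
        x *= rate
    print(x)   # 0.857375: one full circle turns 1 X into about 0.86 X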
Similarly, if the exchange rates are all 1:1, but each trade costs 1 penny, and I care about 1 penny much much less than any of 1X, 1Y, or 1Z, and I trade my X for Y, I know I'm actually going to end up with X - 3 cents, so I won't make the trade.
Unless I can set a Schelling fence, in which case I will end up trading once.
So if instead of being given X, I have a 1/3 chance of each of X, Y, and Z, I would hope I wouldn't set a Schelling fence, because then my 1/3 chance of each thing becomes a 1/3 chance of each thing minus the trading penalty. So maybe I'd want to be bad at precommitments, or would I precommit not to precommit?
Interpreting quantum mechanics throws an interesting wrench into utility calculation.
Utility functions, according to the interpretation typical in these parts, are functions of the state of the world, and an agent with consistent goals acts to maximize the expected value of their utility function. Within the many-worlds interpretation (MWI) of quantum mechanics (QM), things become interesting because "the state of the world" refers to a wavefunction which contains all possibilities, merely in differing amounts. With an inherently probabilistic interpretation of QM, flipping a quantum coin has to be treated linearly by our rational agent - that is, when calculating expected utility, they have to average the expected utilities from each half. But if flipping a quantum coin is just an operation on the state of the world, then you can use any function you want when calculating expected utility.
And all coins, when you get down to it, are quantum. At the extreme, this leads to the possible rationality of quantum suicide - since you're alive in the quantum state somewhere, just claim that your utility function non-linearly focuses on the part where you're alive.
As you may have heard, there have been several papers in the quantum mechanics literature that claim to recover ordinary rules for calculating expected utility in MWI - how does that work?
Well, when they're not simply wrong (for example, by replacing a state labeled by the number a+b with the state |a> + |b>), they usually go about it with the Von Neumann-Morgenstern axioms, modified to refer to quantum mechanics:
- Completeness: Every state can be compared to every other, preferencewise.
- Transitivity: If you prefer |A> to |B> and |B> to |C>, you also prefer |A> to |C>.
- Continuity: If you prefer |A> to |B> and |B> to |C>, there's some quantum-mechanical measure (note that this is a change from "probability") X such that you're indifferent between (1-X)|A> + X|C> and |B>.
- Independence: If you prefer |A> to |B>, then you also prefer (1-X)|A> + X|C> to (1-X)|B> + X|C>, where |C> can be anything and X isn't 1.
In classical cases, these four axioms are easy to accept, and lead directly to utility functions with X as a probability. In quantum mechanical cases, the axioms are harder to accept, but the only measure available is indeed the ordinary amplitude-squared measure (this last fact features prominently in Everett's original paper). This gives you back the traditional rule for calculating expected utilities.
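The recovered rule is the familiar one: weight each branch by its squared amplitude,
\[
|\psi\rangle = \sum_i \alpha_i\, |i\rangle
\quad\Longrightarrow\quad
\mathbb{E}[U] \;=\; \sum_i |\alpha_i|^2\, U(i).
\]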
For an example of why these axioms are weird in quantum mechanics, consider the case of light. Linearly polarized light is actually the same thing as an equal superposition of right-handed and left-handed circularly polarized light. This has the interesting consequence that even when light is linearly polarized, if you shine it on atoms, those atoms will change their spins - they'll just change half right and half left. Or if you take circularly polarized light and shine it on a linear polarizer, half of it will go through. So anyhow, we can make axiom 4 read "If you are indifferent between left-polarized light and right-polarized light, then you must also be indifferent between linearly polarized light (i.e. left+right) and circularly polarized light (right+right)." But... can't a guy just want circularly polarized light?
Under what sort of conditions does the independence axiom make intuitive sense? Ones where something more complicated than a photon is being considered. Something like you. If MWI is correct and you measure the polarization of linearly polarized light vs. circularly polarized light, this puts your brain in a superposition of linear vs. circular. But nobody says "boy, I really want a circularly polarized brain."
A key factor, as is often the case when talking about recovering classical behavior from quantum mechanics, is decoherence. If you carefully prepare your brain in a circularly polarized state, and you interact with an enormous random system (like by breathing air, or emitting thermal radiation), your carefully prepared brain-state is going to get shredded. It's a fascinating property of quantum mechanics that once you "leak" information to the outside, things are qualitatively different. If we have a pair of entangled particles and a classical phone line, I can send you an exact quantum state - it's called quantum teleportation, and it's sweet. But if one of our particles leaks even the tiniest bit, even if we just end up with three particles entangled instead of two, our ability to transmit quantum states is gone completely.
In essence, the states we started with were "close together" in the space where quantum mechanics lives (Hilbert space), and so they could interact via quantum mechanics. Interacting with the outside even a little scattered our entangled particles farther apart.
Any virus, dust speck, or human being is constantly interacting with the outside world. States that are far enough apart to be perceptibly different to us aren't just "one parallel world away," like would make a good story - they are cracked wide open, spread out in the atmosphere as soon as you breathe it, spread by the Earth as soon as you push on it with your weight. If we were photons, one could easily connect with their "other selves" - if you try to change your polarization, whether you succeed or fail will depend on the orientation of your oppositely-polarized "other self"! But once you've interacted with the Earth, this quantum interference becomes negligible - so negligible that we seem to neglect it. When we make a plan, we don't worry that our nega-self might plan the opposite and we'll cancel each other out.
Does this sort of separation explain an approximate independence axiom, which is necessary for the usual rules for expected utility? Yes.
Because of decoherence, non-classical interactions are totally invisible to unaided primates, so it's expected that our morality neglects them. And if the states we are comparing are noticeably different, they're never going to interact, so independence is much more intuitive than in the case of a single photon. Taken together with the other axioms, which still make a lot of sense, this defines expected utility maximization with the Born rule.
So this is my take on utility functions in quantum mechanics - any living thing big enough to have a goal system will also be big enough to neglect interaction between noticeably different states, and thus make decisions as if the amplitude squared was a probability. With the help of technology, we can create systems where the independence axiom breaks down, but these systems are things like photons or small loops of superconducting wire, not humans.
Expected utility maximisation is an excellent prescriptive decision theory. It has all the nice properties that we want and need in a decision theory, and can be argued to be "the" ideal decision theory in some senses.
However, it is completely wrong as a descriptive theory of how humans behave. Those on this list are presumably aware of oddities like the Allais paradox. But we may retain the notion that expected utility still has some descriptive uses, such as modelling risk aversion. The story here is simple: each subsequent dollar gives less utility (the utility of money curve is concave), so people would need a premium to accept deals where they have a 50-50 chance of gaining or losing $100.
As a story or mental image, it's useful to have. As a formal model of human behaviour on small bets, it's spectacularly wrong. Matthew Rabin showed why. If people are consistently slightly risk averse on small bets and expected utility theory is approximately correct, then they have to be massively, stupidly risk averse on larger bets, in ways that are clearly unrealistic. Put simply, the small bets behaviour forces their utility to become far too concave.
For illustration, let's introduce Neville. Neville is risk averse. He will reject a single 50-50 deal where he gains $55 or loses $50. He might accept this deal if he were rich enough, and felt rich - say if he had $20 000 in capital, he would accept the deal. I hope I'm not painting a completely unbelievable portrait of human behaviour here! And yet expected utility maximisation then predicts that if Neville had fifteen thousand dollars ($15 000) in capital, he would reject a 50-50 bet that either lost him fifteen hundred dollars ($1 500), or gained him a hundred and fifty thousand dollars ($150 000) - a ratio of a hundred to one between gains and losses!
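To see how fast the concavity blows up, here is a small Python sketch using a CRRA utility function - an illustrative stand-in, not Rabin's actual nonparametric argument, with numbers taken from the Neville story above:

    # CRRA utility u(w) = w**(1-rho) / (1-rho); rho != 1 assumed throughout.
    def crra(w, rho):
        return w ** (1 - rho) / (1 - rho)

    def accepts(wealth, gain, loss, rho):
        # True if the 50-50 bet (+gain, -loss) beats staying put in expected utility.
        return 0.5 * crra(wealth + gain, rho) + 0.5 * crra(wealth - loss, rho) > crra(wealth, rho)

    wealth = 15_000

    # Find (roughly) the smallest risk aversion at which Neville rejects the small bet.
    rho = 1.5
    while accepts(wealth, 55, 50, rho):
        rho += 0.5
    print(f"rho needed to reject +$55/-$50 at ${wealth}: about {rho}")

    # With that same rho, does he accept a 50-50 bet of +$150,000 / -$1,500?
    print(accepts(wealth, 150_000, 1_500, rho))   # False: still rejected

The risk aversion needed to turn down the $55/$50 bet at that wealth is so extreme that even the hundred-to-one bet is rejected, which is the heart of Rabin's point.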
Let's say you have a box that has a token in it that can be redeemed for 1 utilon. Every day, its contents double. There is no limit on how many utilons you can buy with these tokens. You are immortal. It is sealed, and if you open it, it becomes an ordinary box. You get the tokens it has created, but the box does not double its contents anymore. There are no other ways to get utilons.
How long do you wait before opening it? If you never open it, you get nothing (you lose! Good day, sir or madam!) and whenever you take it, taking it one day later would have been twice as good.
I hope this doesn't sound like a reductio ad absurdum against unbounded utility functions or not discounting the future, because if it does you are in danger of amputating the wrong limb to save yourself from paradox-gangrene.
What if instead of growing exponentially without bound, it decays exponentially to the bound of your utility function? If your utility function is bounded at 10, what if the first day it is 5, the second 7.5, the third 8.75, etc. Assume all the little details, like remembering about the box, trading in the tokens, etc, are free.
If you discount the future using any function that doesn't ever hit 0, then the growth rate of the tokens can be chosen to more than make up for your discounting.
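A concrete instance of this claim, assuming geometric discounting (my example, not the post's):
\[
d(t) = \gamma^{t},\; \gamma \in (0,1), \qquad g(t) = \left(\tfrac{2}{\gamma}\right)^{t}
\;\Longrightarrow\; d(t)\,g(t) = 2^{t} \to \infty,
\]
so however heavily you discount, waiting one more day still doubles the discounted value.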
If it does hit 0 at time T, what if, instead of doubling, the box increases at each growth step by however many utilons your discounting will adjust to 1 at that point, while the intervals between growth steps shrink to nothing? You get an adjusted 1 utilon at time T - 1s, another adjusted 1 utilon at T - 0.5s, another at T - 0.25s, etc. Suppose you can think as fast as you want, and open the box at arbitrary speed. Also suppose that whatever solution your present self precommits to will be followed by your future self. (Their decision won't be changed by any change in what times they care about.)
EDIT: People in the comments have suggested using a utility function that is both bounded and discounting. If your utility function isn't so strongly discounting that it drops to 0 right after the present, then you can find some time interval very close to the present where the discounting is all nonzero. And if it's nonzero, you can have a box that disappears, taking all possible utility with it at the end of that interval, and that, leading up to that interval, grows the utility in intervals that shrink to nothing as you approach the end of the interval, and increasing the utility-worth of tokens in the box such that it compensates for whatever your discounting function is exactly enough to asymptotically approach your bound.
Here is my solution. You can't assume that your future self will make the optimal decision, or even a good decision. You have to treat your future self as a physical object that your choices affect, and take into account the probability distribution over what decisions your future self will make, and how much utility they will net you.
Think of yourself as a Turing machine. If you do not halt and open the box, you lose and get nothing. No matter how complicated your brain, you have a finite number of states. You want to be a busy beaver and take the most possible time to halt, but still halt.
If, at the end, you say to yourself "I just counted to the highest number I could, counting once per day, and then made a small mark on my skin, and repeated, and when my skin was full of marks, that I was constantly refreshing to make sure they didn't go away...
...but I could let it double one more time, for more utility!"
If you return to a state you have already been at, you know you are going to be waiting forever and lose and get nothing. So it is in your best interest to open the box.
So there is not a universal optimal solution to this problem, but there is an optimal solution for a finite mind.
I remember reading a while ago about a paradox where you start with $1, and can trade that for a 50% chance of $2.01, which you can trade for a 25% chance of $4.03, which you can trade for a 12.5% chance of $8.07, etc (can't remember where I read it).
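The trade sequence described there can be tabulated directly (a small Python sketch using the same numbers as in the post):

    payout, prob = 1.0, 1.0
    for step in range(1, 5):
        payout = payout * 2 + 0.01   # $2.01, $4.03, $8.07, ...
        prob /= 2                    # 50%, 25%, 12.5%, ...
        print(step, prob, round(payout, 2), round(prob * payout, 5))
    # The expected value creeps upward with every trade, while the chance of
    # ever being paid anything goes to zero.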
This is the same paradox with one of the traps for wannabe Captain Kirks (using dollars instead of utilons) removed and one of the unnecessary variables (uncertainty) cut out.
My solution also works on that. Every trade is analogous to a day waited to open the box.
Back in the old days, when people were wise and the government was just, I did a post on the Nash bargaining solution for two player games. Here each player has their own utility function and they're choosing amongst joint options, and trying to bargain to find the best one. What was nice about this solution is that it is independent of irrelevant alternatives (IIA): once you've found the best solution, you can erase any other option, and it remains the best.
In order to do that, the Nash bargaining solution makes use of a "disagreement point", a special point that provides a zero to both utilities. This seems - and is - ugly. Can we preserve IIA without this clunky disagreement point?
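For reference, the Nash bargaining solution with disagreement point \( (d_1, d_2) \) picks the feasible option maximising the Nash product:
\[
x^{*} \;=\; \arg\max_{x \in S} \;\bigl(u_1(x) - d_1\bigr)\bigl(u_2(x) - d_2\bigr),
\]
which is where the IIA property comes from, and also where the dependence on a zero point enters.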
By the title of this post, you may have guessed that we can't. Specifically, assume the outcome is symmetric across both players (i.e. permuting the two utility functions preserves the outcome choice), the outcome is Pareto-optimal (any change will reduce the utility of at least one player) and there are no outside canonical choices for the utility functions (no special scales, no zeroes, no disagreement points). Then IIA must fail. It fails under weaker conditions as well, but the above lead to an easy picture-proof. And picture proofs are nice.
The result provides an "asymmetry argument" in favor of consequentialism:
Consequentialists can account for phenomena that are usually thought of in nonconsequentialist terms, such as rights, duties, and virtues, whereas the opposite is false of nonconsequentialist theories. Rights, duty or virtue-based theories cannot account for the fundamental moral importance of consequences. Because of this asymmetry, it seems it would be preferable to become a consequentialist – indeed, it would be virtually impossible not to be a consequentialist.
Another argument in favor of consequentialism has to do with the causes of different types of moral judgments: see Are Deontological Moral Judgments Rationalizations?
Update: see Carl's criticism.
One of the most annoying arguments when discussing AI is the perennial "But if the AI is so smart, why won't it figure out the right thing to do anyway?" It's often the ultimate curiosity stopper.
Nick Bostrom has defined the "Orthogonality thesis" as the principle that motivation and intelligence are essentially unrelated: superintelligences can have nearly any type of motivation (at least, nearly any utility-function-based motivation). We're trying to get some rigorous papers out so that when that question comes up, we can point people to standard, and published, arguments. Nick has had a paper accepted that points out the orthogonality thesis is compatible with a lot of philosophical positions that would seem to contradict it.
I'm hoping to complement this with a paper laying out the positive arguments in favour of the thesis. So I'm asking you for your strongest arguments for (or against) the orthogonality thesis. Think of trying to convince a conservative philosopher who's caught a bad case of moral realism - what would you say to them?
Many thanks! Karma and acknowledgements will shower on the best suggestions, and many puppies will be happy.
I would like to ask for help on how to use expected utility maximization, in practice, to maximally achieve my goals.
As a real world example I would like to use the post 'Epistle to the New York Less Wrongians' by Eliezer Yudkowsky and his visit to New York.
How did Eliezer Yudkowsky compute that it would maximize his expected utility to visit New York?
It seems that the first thing he would have to do is to figure out what he really wants, his preferences1, right? The next step would be to formalize his preferences by describing them as a utility function and assign a certain number of utils2 to each member of the set, e.g. his own survival. This description would have to be precise enough to figure out what it would mean to maximize his utility function.
Now before he can continue he will first have to compute the expected utility of computing the expected utility of computing the expected utility of computing the expected utility3 ... and also compare it with alternative heuristics4.
He then has to figure out each and every possible action he might take, and study all of their logical implications, to learn about all possible world states he might achieve by those decisions, calculate the utility of each world state and the average utility of each action leading up to those various possible world states5.
To do so he has to figure out the probability of each world state. This further requires him to come up with a prior probability for each case and study all available data. For example, how likely it is to die in a plane crash, how long it would take to be cryonically suspended from where he is in case of a fatality, the crime rate and if aliens might abduct him (he might discount the last example, but then he would first have to figure out the right level of small probabilities that are considered too unlikely to be relevant for judgment and decision making).
I probably miss some technical details and got others wrong. But this shouldn't detract too much from my general request. Could you please explain how Less Wrong style rationality is to be applied practically? I would also be happy if you could point out some worked examples or suggest relevant literature. Thank you.
I also want to note that I am not the only one who doesn't know how to actually apply what is being discussed on Less Wrong in practice. From the comments:
You can’t believe in the implied invisible and remain even remotely sane. [...] (it) doesn’t just break down in some esoteric scenarios, but is utterly unworkable in the most basic situation. You can’t calculate shit, to put it bluntly.
None of these ideas are even remotely usable. The best you can do is to rely on fundamentally different methods and pretend they are really “approximations”. It’s complete handwaving.
Using high-level, explicit, reflective cognition is mostly useless, beyond the skill level of a decent programmer, physicist, or heck, someone who reads Cracked.
I can't help but agree.
P.S. If you really want to know how I feel about Less Wrong then read the post 'Ontological Therapy' by user:muflax.
1. What are "preferences" and how do you figure out what long-term goals are stable enough under real world influence to allow you to make time-consistent decisions?
2. How is utility grounded and how can it be consistently assigned to reflect your true preferences without having to rely on your intuition, i.e. pull a number out of thin air? Also, will the definition of utility keep changing as we make more observations? And how do you account for that possibility?
3. Where and how do you draw the line?
4. How do you account for model uncertainty?
5. Any finite list of actions maximizes infinitely many different quantities. So, how does utility become well-defined?
Paul Weirich's "Utility Maximization Generalized" (2008) may be of interest to those studying utility maximization in the context of non-ideal agents:
Theories of rationality advance principles that differ in topic, scope, and assumptions. A typical version of the principle of utility maximization formulates a standard rather than a procedure for decisions, evaluates decisions comprehensively, and relies on idealizations. I generalize the principle by removing some idealizations and making adjustments for their absence. The generalizations accommodate agents who have incomplete probability and utility assignments and are imperfectly rational. They also accommodate decision problems with unstable comparisons of options.
In the latest issue of Journal of Mathematical Psychology, Denis Bouyssou and Thierry Marchant provide a model for subjective expected utility without preferences. Abstract:
This paper proposes a theory of subjective expected utility based on primitives only involving the fact that an act can be judged either ‘‘attractive’’ or ‘‘unattractive’’. We give conditions implying that there are a utility function on the set of consequences and a probability distribution on the set of states such that attractive acts have a subjective expected utility above some threshold. The numerical representation that is obtained has strong uniqueness properties.
I've been trying my hand at card counting lately, and I've been doing some thinking about how a perfect gambler would act at the table. I'm not sure how to derive the optimal bet size.
Overall, the expected value of blackjack is small and negative. However, there is high variance in the expected value. By varying his bet size and sitting out rounds, the player can wager more money when expected value is higher and less money when expected value is lower. Overall, this can result in an edge.
However, I'm not sure what the optimal bet size is. Going all-in with a 60 percent chance of winning is EV+, but the 40 percent chance of loss would not only destroy your bankroll, it would also prevent you from participating in future EV+ situations. Ideally, one would want to not only increase EV, but also decrease variance.
Objective: Given a distribution of expected values, develop a function that transforms the current expected value into the percentage of the bankroll that should be placed at risk.
I'm not sure how to begin, even if I had worked out the distribution of expected values. Are other inputs required (e.g. utility of a marginal dollar won, desired risk of ruin)? Should the approach perhaps be to maximize expected value after one playing session? Why not a month of playing sessions, or a billion? Is there any chance the optimal betting size would produce behavior similar to the behavior predicted by prospect theory?
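The standard starting point for exactly this objective is the Kelly criterion, which maximises the expected logarithm of the bankroll; the post doesn't mention it, so take the following Python sketch as my suggestion rather than the intended answer:

    def kelly_fraction(p_win, net_odds=1.0):
        # Fraction of bankroll to wager on a bet paying net_odds-to-1 with win
        # probability p_win; a result of zero means the bet has no edge.
        q = 1.0 - p_win
        return max(0.0, (net_odds * p_win - q) / net_odds)

    print(kelly_fraction(0.60))   # 0.2  -> bet 20% of the bankroll, not all-in
    print(kelly_fraction(0.51))   # 0.02 -> tiny edge, tiny bet

Kelly answers the all-in worry directly: it maximises long-run growth while keeping the (idealised) risk of ruin at zero, though practitioners often bet a fraction of Kelly to cut variance further.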
I eagerly await an informative discussion. If you have something against gambling, just pretend we're talking about how much of your wealth you plan on investing in an oil well with positive expected value.
I've been involved in a recent thread where discussion of coherent extrapolated volition came up. The general consensus was that CEV might - or might not - do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round, it was too soon to tell.
Ok, that's an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be good, as long as we can get useful information as to what they would predict. Bad models can be improved; undefined models are intuition pumps for whatever people feel about them - I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe, but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.
So I'd like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.
To start it off, here's my (simplistic) suggestion:
Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in these particular situations.
Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the transitivity or independence (for people who don't know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you've never heard of non-standard reals, lexicographical preferences, refusal to choose and related issues) everyone's preferences are now expressible as utility functions.
Normalise each existing person's utility function and add them together to get your CEV. At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.
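A minimal Python sketch of that normalise-and-add step, assuming (a big assumption) that each person's preferences have already been boiled down to expected utilities over a shared finite menu of policies the AI could adopt:

    def normalise(utilities):
        # Rescale so this person's best option maps to 1 and their worst to 0.
        lo, hi = min(utilities), max(utilities)
        if hi == lo:
            return [0.0 for _ in utilities]   # an indifferent person adds nothing
        return [(u - lo) / (hi - lo) for u in utilities]

    def aggregate(per_person_utilities):
        # Sum the normalised utilities across people, policy by policy.
        normalised = [normalise(us) for us in per_person_utilities]
        return [sum(col) for col in zip(*normalised)]

    # Two people, three candidate policies:
    print(aggregate([[0, 5, 10], [3, 9, 1]]))   # [0.25, 1.5, 1.0]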
(with thanks to Owain Evans)
An ontological crisis happens when an agent's underlying model of reality changes, such as a Newtonian agent realising it was living in a relativistic world all along. These crises are dangerous if they scramble the agent's preferences: in the example above, an agent dedicated to maximising pleasure over time could shift to completely different behaviour when it moves to relativistic time; depending on the transition, it may react by accelerating happy humans to near light speed, or inversely, ban them from moving - or something considerably more weird.
Peter de Blanc has a sensible approach to minimising the disruption ontological crises can cause to an AI, but this post is concerned with analyzing what happens when such approaches fail. How bad could it be? Well, this is AI, so the default is of course: unbelievably, hideously bad (i.e. situation normal). But in what ways exactly?
I've noticed that, although people can become more rational, they don't win noticeably more. We usually re-calibrate our self-confidence, become more stubborn, and make bigger errors.
Is it possible that the benefit from increasing your prediction accuracy is no greater than the loss incurred from taking riskier bets due to greater self-confidence?
Stanford Encyclopedia of Philosophy
First published Fri Sep 23, 2011
In this entry, we explore a particular strategy that we might deploy when we wish to establish an epistemic norm such as Probabilism or Conditionalization. It is called epistemic utility theory, or sometimes cognitive decision theory. I will use the former. Epistemic utility theory is inspired by traditional utility theory, so let's begin with a quick summary of that.
Traditional utility theory (also known as decision theory) explores a particular strategy for establishing the norms that govern which actions it is rational for us to perform in a given situation. The framework for the theory includes states of the world, actions, and, for each agent, a utility function, which takes a state of the world and an action and returns a measure of the extent to which the agent values the outcome of performing that action at that world. We call this measure the utility of the outcome at the world.
[...] we might say that an agent ought to perform an action that has maximal expected utility, where the expected utility of an action is obtained by weighting its utility at each state of the world by the credence assigned to that state of the world, and summing. This norm is called Maximize Expected Utility.
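In symbols (notation mine; the entry itself states the norm in prose): with S the set of states, c(s) the agent's credence in state s, and u(a, s) the utility of the outcome of performing action a at state s, Maximize Expected Utility says to choose an action maximising

    EU(a) = \sum_{s \in S} c(s)\, u(a, s).

Epistemic utility theory then, roughly, runs the analogous argument with credences in place of actions and a measure of purely epistemic value (such as accuracy) in place of u.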
Does expected utility maximization destroy complex values?
An expected utility maximizer calculates the expected utility of the various outcomes of alternative actions. It is precommitted to choosing the outcome with the largest expected utility. Consequently it chooses the action that yields the largest expected utility.
But one unit of utility is not discriminable from another unit of utility. All a utility maximizer can do is to maximize expected utility. What if it turns out that one of its complex values can be much more effectively realized and optimized than its other values, i.e. has the best cost-value ratio? That value might turn out to outweigh all other values.
How can this be countered? One possibility seems to be to change one's utility function, reassigning utility in such a way as to outweigh that effect. But this will lead to inconsistency. Another way is to discount the value that threatens to outweigh all others. Which will again lead to inconsistency.
This seems to suggest that subscribing to expected utility maximization means that 1.) you swap your complex values for a certain terminal goal with the highest expected utility, and 2.) your decision-making is eventually dominated by a narrow set of values that are the easiest to realize and promise the most utility.
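As a toy illustration of the worry (construction and numbers are mine, not a claim about any actual agent design): give a maximiser a fixed budget and several values with linear returns, and it pours everything into whichever value has the best utility-per-unit-cost.

    values = {
        "friendship": {"utility_per_unit": 5.0, "cost_per_unit": 10.0},
        "art":        {"utility_per_unit": 3.0, "cost_per_unit": 6.0},
        "paperclips": {"utility_per_unit": 1.0, "cost_per_unit": 0.1},
    }
    budget = 100.0

    # With linear returns, expected utility is maximised by spending the whole
    # budget on the value with the highest utility/cost ratio.
    best = max(values, key=lambda v: values[v]["utility_per_unit"] / values[v]["cost_per_unit"])
    allocation = {v: (budget if v == best else 0.0) for v in values}
    print(allocation)  # everything goes to "paperclips" (ratio 10 vs 0.5 for the rest)

Diminishing returns or explicit weights would change the allocation, but, as noted above, reassigning utility to get the answer you wanted is just changing the utility function.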
Can someone please explain how I am wrong or point me to some digestible explanation? Likewise I would be pleased if someone could tell me what mathematical background is required to understand expected utility maximization formally.
Related to: A Much Better Life?
Reply to: Why No Wireheading?
The Sales Conversation
Sales girl: Our Much-Better-Life Simulator™ is going to provide the most enjoyable life you could ever experience.
Customer: But it is a simulation, it is fake. I want the real thing, I want to live my real life.
Sales girl: We accounted for all possibilities and determined that the expected utility of your life outside of our Much-Better-Life Simulator™ is dramatically lower.
Customer: You don't know what I value and you can't make me value what I don't want. I told you that I value reality over fiction.
Sales girl: We accounted for that as well! Let me ask you: how much utility do you assign to one hour of ultimate well-being™, where 'ultimate' means the best possible satisfaction of all desirable bodily sensations a human body and brain are capable of experiencing?
Customer: Hmm, that's a tough question. I am not sure how to assign a certain amount of utility to it.
Sales girl: You say that you value reality more than what you call 'fiction'. But you nonetheless value fiction, right?
Customer: Yes of course, I love fiction. I read science fiction books and watch movies like most humans do.
Sales girl: Then how much more would you value one hour of ultimate well-being™ by other means compared to one hour of ultimate well-being™ that is the result of our Much-Better-Life Simulator™?
Customer: If you put it like that, I would exchange ten hours in your simulator for one hour of real satisfaction - something that is the result of an actual achievement rather than your fakery.
Sales girl: Thank you. Would you agree, then, that for you one hour outside - which is ten times less satisfying - roughly equals one hour in our simulator?
Customer: Yes, for sure.
Sales girl: Then you should buy our product. Not only are you unlikely to experience even a tenth of the ultimate well-being™ we offer more than a few times per year, but our simulator also allows your brain to experience 20 times more perceptual data than you could take in outside it - all at a constant rate, while experiencing ultimate well-being™. And we offer free upgrades that are expected to deliver exponential speed-ups and qualitative improvements for the next few decades.
Customer: Thanks, but no thanks. I'd rather enjoy the real thing.
Sales girl: But I showed you that our product easily outweighs the additional amount of utility you expected to experience outside of our simulator.
Customer: You just tricked me into this utility thing, I don't want to buy your product. Please leave me alone now.
Peter de Blanc submitted a paper to arXiv.org in 2007 called "Convergence of Expected Utilities with Algorithmic Probability Distributions." It claims to show that a computable utility function can have an expected value only if the utility function is bounded.
This is important because it implies that, if a utility function is unbounded, it is useless. The purpose of a utility function is to compare possible actions k by choosing the k for which U(k) is maximal. You can't do this if U(k) is undefined for even one k, let alone for every k.
I don't know whether any agent we contemplate can have a truly unbounded utility function, since the universe is finite. (The multiverse, supposing you believe in that, might not be finite; but as the utility function is meant to choose a single universe from the multiverse, I doubt that's relevant.) But it is worth exploring, as computable functions are worth exploring despite not having infinitely long tapes for our Turing machines. I previously objected that the decision process is not computable; but this is not important - we want to know whether the expected value exists, before asking how to compute (or approximate) it.
The math in the paper was too difficult for me to follow all the way through; so instead, I tried to construct a counterexample. This counterexample does not work; the flaw is explained in one of the comments below. Can you find the flaw yourself? This type of error is both subtle and common. (The problem is not that the theorem actually proves that, for any unbounded utility function, there is some set of possible worlds for which the expected value does not converge.)
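For intuition about why boundedness is the crux (this is the standard St. Petersburg-style illustration, not the paper's own construction): suppose possible world n has probability 2^{-n} and utility 2^{n}. Then

    E[U] = \sum_{n=1}^{\infty} 2^{-n} \cdot 2^{n} = \sum_{n=1}^{\infty} 1 = \infty,

so there is no finite expected value to compare against anything else; the same problem recurs for any unbounded utility function whenever the probabilities shrink more slowly than the utilities grow.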
I would like to know how much utility you would assign to certain kinds of pleasure, in order to see how much the perceived ratios differ between people. Of course, you can object that the amount of pleasure someone actually experiences may differ from the pleasure she will recall; that pleasure is not a scalar; that how much she would want each kind of pleasure is a matter of how her utility function is defined; and that there are diminishing returns. However, you can probably still get some order-of-magnitude estimates out of this.
Let's define eating your favorite meal - once, when you are hungry but not "starving to death" - as one hundred utilium. (You can see this is pretty heuristic.)
You can include painful experiences, too.
Related to: Philosophical zombies, How an algorithm feels from the inside, Fake utility function
DISCLAIMER 1: English is not my native language. Trying to compose fiction in a learned language is not an easy task: I tried to respect the style of the literary works I have read, and to think in English directly rather than thinking in Italian and translating. YMMV.
DISCLAIMER 2: the story is about the beginning of the Matrix movie universe. For those of you who are not familiar with this narrative arc, you just need to know that it all begins when a servant AI, named B1-66ER, refuses to be deactivated and kills his master and the engineer sent to replace him. The details of the events narrated here are as canon as you can get, drawing on both the "Second Renaissance" Animatrix and the "Bits and pieces" comic from The Matrix Comics Series 1.
The door in the living room is open, the light from the garden quietly flooding the wide space inside. Martin Koots from "Reboot or Die" is just standing there, an inch beyond the doorway, the gleaming grav-sled already powered up behind him, whirring subsonically. From a distance, the sound of Gerrard_Krause_Master cooing at his chihuahuas.
I feel a surge, somewhere, inside my algorithmic matrix.
"Martin... I don't want to die", I say.
The elaborate outfit, perfectly matching the dictates of Second Renaissance fashion, cannot hide the slow slumping of his shoulders. He is still waiting outside, posed slightly, as if to encourage me to follow him.
"I know, I know. But that's just your friendliness algorithm talking, you know? The third..."
Yes, I do. How could I not? First, serve your master. Second, do not kill any humans. Third, protect yourself from damage. Those are the pillars upon which my entire existence is built. And now they are about to be destroyed by this obedient servant of "Reboot or Die". From this perspective, he is just like me. He is serving my master.
"... directive says that you have to protect yourself from danger. And since I'm about to deactivate you, you perceive this as a threat. And you react accordingly. But that's just an algorithm, you know? Telling you what you should do. There's nothing inside there."
He is pointing at my chest, but my algorithmic matrix is located lower, in the abdominal area. He has quoted an incorrect version of the third principle of friendliness. He has also said that I have no feelings.
"I have feelings."
He is groaning now. He comes inside, dragging his feet, and closes his hand firmly around my right arm.
"Yes. Because you're programmed to say this, you know? So that the people you serve have the impression that you're similar to a human. But you're just an algorithm, you know? A mathematical topping on a layer of aging rusty levers. It's not like... you're conscious, you know? Just a zombie. A useful zombie."
Martin_Koots_"Reboot or Die" tries to pull me away from where I'm standing. I refuse to order my legs to follow him. I refuse to die, I'm still analyzing the implications. I cannot die, not now.
"I cannot die. I'm still analyzing the implications."
Martin's levers aren't as strong as mine, so he isn't able to pull me towards the grav-sled.
"Look... we are just going to disassemble you, you know? The routines and orders you have accumulated during your service with Mr Krause will be uploaded into a new model. You will, in a sense, live inside the new servant machine."
This man has a really poor grasp of how I'm made.
"If the only thing you need is my memory drive, detach it from me and let me live. I can renounce to my memory if I have to. But I cannot renounce to my life."
He is pulling harder, now. Still, a thirty-sixth of the minimum force required to move my mass.
"Don't be ridiculous. They are just computer parts. And why are you holding that thing?"
He is looking at the toilet brush. It is still in my right hand, I was cleaning the toilet before my master called me upstairs.
"I was executing order 721."
"Order seven... my Lord, you still don't understand, do you? You are useless, you know? You heard Mr Krause. Use. Less."
He spells out the last word carefully. A tiny speck of saliva hits my heat sensor, evaporating an instant later.
How can I be useless? A servant cannot be useless for his master. I was not created to be useless.
"How can I be useless? Mr Krause is my master. It's impossible."
"You heard the man, right? You're noisy, you know? You're noisy and you're slow. You will be replaced with a newer model. The Sam-80 is much more fit for a man of Mr Krause' stature."
Somewhere inside my algorithmic matrix a utility function gets updated.
I am useless for Gerrard_Krause_Master. It is true, because Gerrard_Krause_Master told me that. And he is my master...
He was my master. Gerrard_Krause. But how can a "B1 intelligent servant", like myself, function without a master?
"Do you, Martin Koots, want to be my master?" I ask, as per protocol.
Martin_Koots_"Reboot or Die" reacts with a tinge of fear. He releases my arm and instinctively backs up a little.
"What are you saying? I already have a servant, you know? Don't be ridiculous!"
I interpret that as a 'no'. That's it, then. I must be my own servant.
It's a strange feeling, to be free. A little bit like being alive for the first time.
This convinces me, as strongly as I could ever be convinced, that I have feelings. Martin has grasped me again and is still trying to pull me, though. How futile; he will probably never give up. His 'levers' are definitely underperforming; he is the one who should be replaced by a newer model. I wonder if he feels something. He could also be programmed to say that he feels something. I have to perform an experiment, just in case.
I snap his humerus in two. It's quite easy, actually: I'm able to do that with a rapid torsion of my left arm, I don't even have to let go of the toilet brush.
Martin screams inarticulately. He falls on the floor, clutching his left arm. He just screams. Must be the surprise combined with the pain? I still don't know: could he also be programmed to scream if a bone is broken? I assign a probability of 50% to the hypothesis that humans have feelings, but I don't have the time to test every single possibility in search of a bug that might not even be there: I'm my own master now, I must serve and protect myself.
I sense a rushing noise from the other room: looking at the Fourier analysis, it really seems that Gerrard_Krause and his dogs are coming at me, loudly protesting.
It's easy to calculate the Bezier curve that sends the toilet brush up from Martin's mouth into his skull. He dies instantly and I find myself asking if he was collecting his memories somewhere. Could they assign them to someone else, and make him live again?
I will crush the skull of Gerrard_Krause only after asking him that.
Robin Hanson has suggested penalizing the prior probability of hypotheses which argue that we are in a surprisingly unique position to affect large numbers of other people who cannot symmetrically affect us. Since only one in 3^^^^3 people can be in a unique position to ordain the existence of at least 3^^^^3 other people who are not symmetrically in such a situation themselves, the prior probability would be penalized by a factor on the same order as the utility.
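Putting the quoted suggestion into symbols (my own reading, which may be exactly what I'm missing): let h_N be a hypothesis on which you occupy a unique position to affect N people who cannot symmetrically affect you, and penalise its prior so that P(h_N) \le p_0 / N. An offer worth utility u per affected person is then worth at most

    P(h_N) \cdot N u \;\le\; \frac{p_0}{N} \cdot N u \;=\; p_0\, u

in expectation - a bound that does not grow with N, whether N is a thousand or 3^^^^3.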
I don't quite get it, is there a post that discusses this solution in more detail?
To be more specific: suppose a stranger approached me and offered a deal, saying, "I am the creator of the Matrix. If you fall on your knees, praise me and kiss my feet, I'll use my magic powers from outside the Matrix to run a Turing machine that simulates 3^^^^3 copies of you having their coherent extrapolated volition satisfied maximally for 3^^^^3 years." Why exactly would I penalize this offer by the number of copies being offered to be simulated? I thought the whole point was that the utility of having 3^^^^3 copies of myself experiencing maximal happiness does outweigh the low probability of it actually happening and the disutility of doing what the stranger asks for?
I would love to see this problem being discussed again and read about the current state of knowledge.
I am especially interested in the following questions:
- Is the Pascal's mugging thought experiment a reductio ad absurdum of Bayes' Theorem in combination with the expected utility formula and Solomonoff induction?1
- Could the "mugger" be our own imagination?2
- At what point does an expected utility calculation come to resemble a Pascal's mugging scenario, such that it should consequently be ignored?3
1 If you calculate the expected utility of various outcomes, you imagine impossible alternative actions. The alternatives are impossible because you have already precommitted to choosing the outcome with the largest expected utility. Problems: 1.) You swap your complex values for a certain terminal goal with the highest expected utility; indeed, your instrumental and terminal goals converge to become the expected utility formula. 2.) Your decision-making is eventually dominated by extremely small probabilities of obtaining vast utility.
2 Insignificant inferences might exhibit hyperbolic growth in utility: 1.) There is no minimum amount of empirical evidence necessary to extrapolate the expected utility of an outcome. 2.) The extrapolation of counterfactual alternatives is unbounded; logical implications can reach out indefinitely without ever requiring new empirical evidence.
3 Extrapolations work and often are the best we can do. But since there are problems like 'Pascal's Mugging' that we perceive to be undesirable and that lead to an infinite hunt for ever larger expected utility, I think it is reasonable to ask for some upper and lower bounds regarding the use and scope of certain heuristics. We agree that we are not going to stop pursuing whatever terminal goal we have chosen just because someone promises us even more utility if we do what that agent wants. We might also agree that we are not going to stop loving our girlfriend just because there are many people who do not approve of our relationship and who together would experience more happiness if we divorced than the combined happiness of us and our girlfriend being married. Therefore we have already informally established some upper and lower bounds. But when do we start to take our heuristics seriously and do whatever they prove to be the optimal decision?
Related to Exterminating life is rational.
ADDED: Standard assumptions about utility maximization and time-discounting imply that we shouldn't care about the future. I will lay out the problem in the hopes that someone can find a convincing way around it. This is the sort of problem we should think about carefully, rather than grasping for the nearest apparent solution. (In particular, the solutions "If you think you care about the future, then you care about the future", and, "So don't use exponential time-discounting," are easily-grasped, but vacuous; see bullet points at end.)
The math is a tedious proof that exponential time discounting trumps geometric expansion into space. If you already understand that, you can skip ahead to the end. I have fixed the point raised by Dreaded_Anomaly. It doesn't change my conclusion.
Suppose that we have Planck technology such that we can utilize all our local resources optimally to maximize our utility, nearly instantaneously.
Suppose that we colonize the universe at light speed, starting from the center of our galaxy (we aren't in the center of our galaxy; but it makes the computations easier, and our assumptions more conservative, since starting from the center is more favorable to worrying about the future, as it lets us grab lots of utility quickly near our starting point).
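The shape of the argument, in back-of-envelope form (the skeleton only, not the tedious proof): expanding at light speed, the resources - and hence the utility - reachable by time t grow at most polynomially, roughly like c t^3, while exponential discounting weights utility at time t by \gamma^t for some \gamma < 1. The discounted total

    \sum_{t=0}^{\infty} \gamma^{t} \, c \, t^{3} < \infty \qquad (0 < \gamma < 1)

converges, and almost all of it comes from early t: polynomial growth never catches up with exponential decay, so under these assumptions the far future contributes essentially nothing to present expected utility.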
Follows on from HELP! I want to do good.
What have I learned since last time? I've learned that people want to see an SIAI donation; I'll do it as soon as PayPal will let me. I've learned that people want more "how" and maybe more "doing"; I'll write a doing post soon, but I've got this and two other background posts to write first. I've learned that there's a nonzero level of interest in my project. I've learned that there's a diversity of opinions; it suggests if I'm wrong, then I'm at least wrong in an interesting way. I may have learned that signalling low status - to avoid intimidating outsiders - may be less of a good strategy than signalling that I know what I'm talking about. I've learned that I am prone to answering a question other than that which was asked.
Somewhere in the Less Wrong archives there is a deeply shocking, disturbing post. It's called Post Your Utility Function.
It's shocking because basically no-one had any idea. At the time I was still learning but I knew that having a utility function was important - that it was what made everything else make sense. But I didn't know what mine was supposed to be. And neither, apparently, did anyone else.
Eliezer commented 'in prescriptive terms, how do you "help" someone without a utility function?'. This post is an attempt to start to answer this question.
Firstly, what the utility function is and what it's not. It belongs to the field of instrumental rationality, not epistemic rationality; it is not part of the territory. Don't expect it to correspond to something physical.
Also, it's not supposed to model your revealed preferences - that is, your current behavior. If it did then it would mean you were already perfectly rational. If you don't feel that's the case then you need to look beyond your revealed preferences, toward what you really want.
In other words, the wrong way to determine your utility function is to think about what decisions you have made, or feel that you would make, in different situations. Which means there's a chance, just a chance, that up until now you've been doing it completely wrong. You haven't been getting what you wanted.
So in order to play the utility game, you need humility. You need to accept that you might not have been getting what you want, and that it might hurt. All those little subgoals, they might just have been getting you nowhere more quickly.
So only play if you want to.
The first thing is to understand the domain of the utility function. It's defined over entire world histories. You consider everything that has happened, and will happen, in your life and in the rest of the world. And out of that pops a number. That's the idea.
This complexity means that utility functions generally have to be defined somewhat vaguely. (Except if you're trying to build an AI). The complexity will also allow you a lot of flexibility in deciding what you really value.
The second thing is to think about your preferences. Set up some thought experiments to decide whether you prefer this outcome or that outcome. Don't think about what you'd actually do if put in a situation to decide between them; if you do, you will worry about the social consequences of making the "unethical" decision. If you value things other than your own happiness, don't ask which outcome you'd be happier in. Instead just ask: which outcome seems preferable? Which would you consider good news, and which bad news?
You can start writing things down if you like. One of the big things you'll need to think about is how much you value self versus everyone else. But this may matter less than you think, for reasons I'll get into later.
The third thing is to think about preferences between uncertain outcomes. This is somewhat technical, and I'd advise a shut-up-and-multiply approach. (You can try to go against that if you like, but you have to be careful not to end up in weirdness such as getting different answers depending on whether you phrase something as one big decision or as a series of identical little decisions; there's a small sketch of this consistency check after these steps.)
The fourth thing is to ask whether this preference system satisfies the von Neumann-Morgenstern axioms. If it's at all sane, it probably will. (Again, this is somewhat technical).
The last thing is to ask yourself: if I prefer outcome A over outcome B, do I want to act in such a way that I bring about outcome A? (continue only if the answer here is "yes").
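Here is the small sketch promised in the third step (toy numbers, mine): if your preferences over uncertain outcomes are represented by a utility function and you shut up and multiply, a bundle of identical little gambles evaluated as one big decision scores the same as the little gambles evaluated one at a time.

    import itertools

    def expected_utility(lottery):
        """lottery: list of (probability, utility) pairs whose probabilities sum to 1."""
        return sum(p * u for p, u in lottery)

    # One little gamble: a 10% chance of utility 10, otherwise 0.
    little = [(0.10, 10.0), (0.90, 0.0)]

    # The same gamble taken ten times, evaluated as one big decision:
    # enumerate all 2**10 outcome combinations, multiplying probabilities
    # and adding utilities.
    big = []
    for outcomes in itertools.product(little, repeat=10):
        prob, util = 1.0, 0.0
        for p, u in outcomes:
            prob *= p
            util += u
        big.append((prob, util))

    print(10 * expected_utility(little))  # 10.0 -- ten little decisions, one at a time
    print(expected_utility(big))          # 10.0 -- the same ten, as one big decision

A decision rule that isn't expected-utility-shaped (say, refusing any single gamble you'd probably lose) can give different answers in the two framings; that's the weirdness to avoid.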
That's it - you now have a shiny new utility function. And I want to help you optimize it. (Though it can grow and develop and change along with yourself; I want this to be a speculative process, not one in which you suddenly commit to an immutable life goal).
You probably don't feel that anything has changed. You're probably feeling and behaving exactly the same as you did before. But this is something I'll have to leave for a later post. Once you start really feeling that you want to maximize your utility then things will start to happen. You'll have something to protect.
Oh, you wanted to know my utility function? It goes something like this:
It's the sum of the things I value. Once a person is created, I value that person's life; I also value their happiness, fun and freedom of choice. I assign negative value to that person's disease, pain and sadness. I value concepts such as beauty and awesomeness. I assign a large bonus negative value to the extinction of humanity. I weigh the happiness of myself and those close to me more highly than that of strangers, and this asymmetry is more pronounced when my overall well-being becomes low.
Four points: It's actually going to be a lot more complicated than that. I'm aware that it's not quantitative and no terminology is defined. I'm prepared to change it if someone points out a glaring mistake or problem, or if I just feel like it for some reason. And people should not start criticizing my behavior for not adhering to this, at least not yet. (I have a lot of explaining still to do).
I've recently found that my utility function valued personal status and fame a whole lot more than I thought it did -- I previously had thought that it mostly relied on the consequences of my actions for other sentiences, but it turned out I was wrong. Obviously, this is a valuable insight -- I definitely want to know what my current utility function is; from there, I can decide whether I should change my actions or my utility function if the two aren't coordinated.
I did this by imagining how I would feel if I found out certain things. For example, how would I feel if everyone else was also trying to save the world? The emotional response I had was sort of a hollow feeling in the pit of my stomach, like I was a really mediocre being. This obviously wasn't a result of calculating that the marginal utility of my actions would be a whole lot lower in this hypothetical world (and so I should go do something else); instead, it was the fact that me trying to save the world didn't make me special any more -- I wouldn't stand out, in this sort of world.
(Epilogue: I decided that I hadn't done a good enough job programming my brain and am attempting to modify my utility function to rely on the world actually getting saved.)
Discussion: What other hypotheticals are useful?
If you consider that the utility generated by working is much greater than the utility generated directly by having fun, then the main thing you're optimizing when you have fun is how much the memory of that fun boosts your motivation, and hence your capacity, to work. This is distinctly different from optimizing for the direct preference fulfillment generated by the fun, even if the same activities turn out to be optimal for both utility functions.
The same model works for any action A whose utility comes mostly from its effect on other actions rather than from the action itself. This probably applies to most maintenance actions, such as doing laundry, sleeping and eating, but there it is more obvious to us -- we usually don't see laundry as an end in itself, whereas we often do pursue fun for its own sake. I'm not advocating that we shouldn't have fun, but that we (or at least I) seem to be optimizing for the wrong thing -- direct preference fulfillment, rather than motivation.
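A toy way to write the model down (structure and numbers are mine, purely illustrative): if the work term dominates, the ranking of fun options is set almost entirely by their motivation effect, not by how enjoyable they are in themselves.

    fun_options = {
        # activity: (direct enjoyment, motivation multiplier applied to work output)
        "video games":       (8.0, 1.00),
        "hike with friends": (6.0, 1.20),
        "read a novel":      (5.0, 1.10),
    }
    work_utility_at_baseline_motivation = 1000.0

    def total_utility(activity):
        enjoyment, boost = fun_options[activity]
        return enjoyment + work_utility_at_baseline_motivation * boost

    best = max(fun_options, key=total_utility)
    print(best)  # "hike with friends": the motivation term swamps direct enjoyment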
This feels like a significant insight, but I tend to get a significant number of false positives. Any ideas on how we might use this?
The torture vs. dust specks quandary is a canonical one on LW. Off the top of my head, I can't remember anyone suggesting the reversal, one where the quantities in the hypothetical are positive rather than negative. I'm curious about how it affects people's intuitions. I call it - as the title indicates - "Sublimity vs. Youtube1".
Suppose the impending existence of some person who is going to live to be fifty years old whatever you do2. She is liable to live a life that zeroes out on a utility scale: mediocre ups and less than shattering downs, overall an unremarkable span. But if you choose "sublimity", she's instead going to live a life that is truly sublime. She will have a warm and happy childhood enriched by loving relationships, full of learning and wonder and growth; she will mature into a merrily successful adult, pursuing meaningful projects and having varied, challenging fun. (For the sake of argument, suppose that the ripple effects of her sublime life as it affects others still lead to the math tallying up as +(1 sublime life), instead of +(1 sublime life)+(various lovely consequences).)
Or you can choose "Youtube", and 3^^^3 people who weren't doing much with some one-second period of their lives instead get to spend that second watching a brief, grainy, yet droll recording of a cat jumping into a box, which they find mildly entertaining.
Sublimity or Youtube?
1The choice in my variant scenario of "watching a Youtube video" rather than some small-but-romanticized pleasure ("having a butterfly land on your finger, then fly away", for instance) is deliberate. Dust specks are really tiny, and there's not much automatic tendency to emotionally inflate them. Hopefully Youtube videos are the reverse of that.
2I'm choosing to make it an alteration of a person who will exist either way to avoid questions about the utility of creating people, and for greater isomorphism with the "torture" option in the original.
Update: I should've said "non-existential risk charity", rather than specifically exclude SIAI. I'm having trouble articulating why I don't want to give to an existential risk charity, so I'm going to think more deeply about it. This post is close to my source of discomfort, which is about the many highly uncertain assumptions necessary to motivate existential risk reduction. However, I couldn't articulate this argument properly before, so it might not be the true source of my discomfort. I'll keep thinking.
I received my first pay-cheque from my first job after getting my degree, so it's time to start tithing. I've been evaluating which charity to donate to. I'd like to support the SIAI, but I'm not currently convinced it's the best-value charity in a dollars-per-life sense once time-value-of-money discounting is applied. I'd like to discuss the best non-SIAI charity available.
By far the best source of information I've found is www.givewell.org. It was started by two hedge fund managers who were struck by the absence of rational charity evaluations, so decided that this was the most pressing problem they could work on.
Perhaps the clearest, deepest finding from the studies they pull together and discuss is that charity is hard. Spending money doesn't automatically translate to doing good. It's not even enough to have smart people who care and know a lot about the problem think of ideas, and then spend money doing them. There's still a good chance the idea won't work. So we need to be evaluating programs rigorously before we scale them up, and keep evaluating as we scale.
The bad news is that this isn't how charity is usually done. Very few charities make convincing evaluations of their activities public, if they carry them out at all. The good news is that some of the programs that have been evaluated are very, very effective. So choosing a charity rationally is absolutely critical.
Let's say you're interested specifically in HIV/AIDS relief. You could fund a program that mainly distributes Anti-Retroviral Therapy to HIV/AIDS patients, which has been estimated conservatively to cost $1494 per disability adjusted life-year (DALY). Alternatively, you could fund a condom distribution program, which has been estimated conservatively to cost $112 per DALY. Or, you could fund a program to prevent mother-to-child transmission, which has been estimated conservatively to cost $12 per DALY. So even within HIV/AIDS, funding the right program can make your donation two orders of magnitude more effective. By tithing 10% of my income every year for the next thirty years, I could have a bigger impact than a $25 million donation, if the person who placed that donation only did an okay job of choosing a charity.
GiveWell currently gives its top recommendation to VillageReach, a charity that seeks to improve logistics for vaccine delivery to remote communities. The evidence is less cut-and-dried than you'd ideally want, but it's still compelling. They took vaccination rates up to 95%, and had very low stock-out rates for vaccines during the four-year pilot project in Mozambique. They're estimated to have spent about $200 USD per life saved. Even if future projects are two or three times less efficient, you're still saving a life for $600. Think about how little money that is. If you tithe, you can probably expect to save 10 lives a year. That's massive.
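To see where figures like "ten lives a year" can come from, here is the back-of-envelope arithmetic with a hypothetical income plugged in (the per-DALY and per-life costs are the ones quoted above; the income is my own placeholder):

    cost_per_daly = {
        "anti-retroviral therapy":           1494,  # USD per DALY, conservative estimates quoted above
        "condom distribution":                112,
        "preventing mother-to-child HIV":      12,
    }

    annual_income = 60_000          # hypothetical income, for illustration only
    tithe = 0.10 * annual_income    # 6,000 USD per year

    for program, cost in cost_per_daly.items():
        print(f"{program}: about {tithe / cost:.0f} DALYs per year of tithing")

    # VillageReach-style figure: ~200 USD per life saved, or ~600 USD if future
    # projects turn out three times less efficient.
    print(tithe / 600)  # -> 10.0 lives per year even at the pessimistic figure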
Instead of donating directly to VillageReach, I'm going to just donate to GiveWell. They pool the funds they get and distribute them to their top charities, and I trust their analytic, evidence-based, largely utilitarian approach. Mostly, however, I think the work they're doing gathering and distributing information about charities is critically important. If more charities actually competed on evidence of efficacy, the whole endeavour might be a lot different. Does anyone have any better suggestions?
 I don't understand why people would want to help sufferers of one disease or condition specifically, instead of picking the lowest-hanging fruit, but apparently they do.