
The case for value learning

4 leplen 27 January 2016 08:57PM

This post is mainly fumbling around trying to define a reasonable research direction for contributing to FAI research. I've found that laying out what success looks like in the greatest possible detail is a personal motivational necessity. Criticism is strongly encouraged. 

The power and intelligence of machines have been gradually and consistently increasing over time, and it seems likely that at some point machine intelligence will surpass the power and intelligence of humans. Before that point occurs, it is important that humanity manages to direct these powerful optimizers towards a target that humans find desirable.

This is difficult because humans as a general rule have a fairly fuzzy conception of their own values, and it seems unlikely that the millennia of argument surrounding what precisely constitutes eudaimonia are going to be satisfactorily wrapped up before the machines get smart. The most obvious solution is to try to leverage some of the novel intelligence of the machines to help resolve the issue before it is too late.

Lots of people regard using a machine to help you understand human values as a chicken and egg problem. They think that a machine capable of helping us understand what humans value must also necessarily be smart enough to do AI programming, manipulate humans, and generally take over the world. I am not sure that I fully understand why people believe this. 

Part of it seems to be inherent in the idea of AGI, or an artificial general intelligence. There seems to be the belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything. I have even heard people describe certain problems as "AI-complete", making an explicit comparison to ideas like Turing-completeness. If a Turing machine is a universal computer, why wouldn't there also be a universal intelligence?

To address the question of universality, we need to make a distinction between intelligence and problem solving ability. Problem solving ability is typically described as a function of both intelligence and resources, and just throwing resources at a problem seems to be capable of compensating for a lot of cleverness. But if problem-solving ability is tied to resources, then intelligent agents are in some respects very different from Turing machines, since Turing machines are all explicitly operating with an infinite amount of tape. Many of the existential risk scenarios revolve around the idea of the intelligence explosion, when an AI starts to do things that increase the intelligence of the AI so quickly that these resource restrictions become irrelevant. This is conceptually clean, in the same way that Turing machines are, but navigating these hard take-off scenarios well implies getting things absolutely right the first time, which seems like a less than ideal project requirement.

If an AI that knows a lot about AI results in an intelligence explosion, but we also want an AI that's smart enough to understand human values, is it possible to create an AI that can understand human values, but not AI programming? In principle it seems like this should be possible.  Resources useful for understanding human values don't necessarily translate into resources useful for understanding AI programming. The history of AI development is full of tasks that were supposed to be solvable only by a machine smart enough to possess general intelligence, where significant progress was made in understanding and pre-digesting the task, allowing problems in the domain to be solved by much less intelligent AIs. 

If this is possible, then the best route forward is focusing on value learning. The path to victory is working on building limited AI systems that are capable of learning and understanding human values, and then disseminating that information. This effectively softens the AI take-off curve in the most useful possible way, and allows us to practice building AI with human values before handing them too much power to control. Even if AI research is easy compared to the complexity of human values, a specialist AI might find thinking about human values easier than reprogramming itself, in the same way that humans find complicated visual/verbal tasks much easier than much simpler tasks like arithmetic. The human intelligence learning algorithm is trained on visual object recognition and verbal memory tasks, and it uses those tools to perform addition. A similarly specialized AI might be capable of rapidly understanding human values, but find AI programming as difficult as humans find determining whether 1007 is prime. As an additional incentive, value learning has an enormous potential for improving human rationality and the effectiveness of human institutions even without the creation of a superintelligence. A system that helped people better understand the mapping between values and actions would be a potent weapon in the struggle with Moloch.

Building a relatively unintelligent AI and giving it lots of human values resources to help it solve the human values problem seems like a reasonable course of action, if it's possible. There are some difficulties with this approach. One of these difficulties is that after a certain point, no amount of additional resources compensates for a lack of intelligence. A simple reflex agent like a thermostat doesn't learn from data and throwing resources at it won't improve its performance. To some extent you can make up for intelligence with data, but only to some extent. An AI capable of learning human values is going to be capable of learning lots of other things. It's going to need to build models of the world, and it's going to have to have internal feedback mechanisms to correct and refine those models. 

If the plan is to create an AI and primarily feed it data on how to understand human values, and not feed it data on how to do AI programming and self-modify, that plan is complicated by the fact that inasmuch as the AI is capable of self-observation, it has access to sophisticated AI programming. I'm not clear on how much this access really means. My own introspection hasn't allowed me anything like hardware level access to my brain. While it seems possible to create an AI that can refactor its own code or create successors, it isn't obvious that AIs created for other purposes will have this ability by accident.

This discussion focuses on intelligence amplification as the example path to superintelligence, but other paths do exist. An AI with a sophisticated enough world model, even if somehow prevented from understanding AI, could still potentially increase its own power to threatening levels. Value learning is only the optimal way forward if human values are emergent, if they can be understood without a molecular level model of humans and the human environment. If the only way to understand human values is with physics, then "human values" isn't a meaningful category of knowledge with its own structure, and there is no way to create a machine that is capable of understanding human values, but not capable of taking over the world.

In the fairy tale version of this story, a research community focused on value learning manages to use specialized learning software to make the human value program portable, instead of only running on human hardware. Having a large number of humans involved in the process helps us avoid lots of potential pitfalls, especially the research overfitting to the values of the researchers via the typical mind fallacy. Partially automating introspection helps raise the sanity waterline. Humans practice coding the human value program, in whole or in part, into different automated systems. Once we're comfortable that our self-driving cars have a good grasp on the trolley problem, we use that experience to safely pursue higher risk research on recursive systems likely to start an intelligence explosion. FAI gets created and everyone lives happily ever after.

Whether value learning is worth focusing on seems to depend on the likelihood of the following claims. Please share your probability estimates (and explanations) with me because I need data points that originated outside of my own head.

 I can't figure out how to include working polls in a post, but there should be a working version in the comments.
  1. There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming. [poll:probability]
  2. Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity. [poll:probability]
  3. Humans are capable of pre-digesting parts of the human values problem domain. [poll:probability]
  4. Successful techniques for value discovery of non-humans, (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values. [poll:probability]
  5. Value learning isn't adequately being researched by commercial interests who want to use it to sell you things. [poll:probability]
  6. Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence. [poll:probability]
  7. Something other than AI will cause human extinction sometime in the next 100 years. [poll:probability]
  8. All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem. [poll:probability]

Concept Safety: What are concepts for, and how to deal with alien concepts

11 Kaj_Sotala 19 April 2015 01:44PM

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a "conceptual expansion":

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
  3. As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height to which the forbidden territory extends.
  4. An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allows for definitions of "location" that are substantially different from the classical ones.

As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.

As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossibly difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.
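
As a very rough illustration of that description, here is a minimal tabular Q-learning sketch in Python. The toy states, actions, rewards and parameters are all invented for the example, and nothing here is meant as a claim about how the brain actually implements RL.

    import random
    from collections import defaultdict

    # Toy environment (invented): the only reward comes from eating food.
    states = ["hungry", "has_food", "fed"]
    actions = ["search", "eat", "rest"]

    def step(state, action):
        # The "reward function": checks when criteria are met and pays out.
        if state == "hungry" and action == "search":
            return "has_food", 0.0
        if state == "has_food" and action == "eat":
            return "fed", 1.0
        return state, 0.0

    # The "learner": estimates the long-run reward of each (state, action).
    Q = defaultdict(float)
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    for episode in range(2000):
        s = "hungry"
        for t in range(10):
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, reward = step(s, a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next

    # After training, the learned policy is "search, then eat".
    print(max(actions, key=lambda act: Q[("hungry", act)]))    # -> search
    print(max(actions, key=lambda act: Q[("has_food", act)]))  # -> eat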

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

  1. They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
  2. There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2 - s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2 - s5, and we should expect that concept to be "more similar" to the concept of being in state s1 than to the concept of being in some state that was many actions away.

"More similar" how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get from moving from that state to other states (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.
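
As a worked toy example (states, numbers and the 0.9 discount factor are all invented), simple value iteration on a three-state chain makes the distinction concrete: A has reward 0 but a high value, because B is only one move away.

    # Reward vs. value on a toy chain A -> B -> C (numbers invented).
    rewards = {"A": 0.0, "B": 5.0, "C": 0.0}       # reward for being in a state
    next_state = {"A": "B", "B": "C", "C": "C"}    # deterministic transitions
    gamma = 0.9                                    # discount factor

    values = {s: 0.0 for s in rewards}
    for _ in range(50):                            # simple value iteration
        values = {s: rewards[s] + gamma * values[next_state[s]]
                  for s in rewards}

    print(values)   # approximately {'A': 4.5, 'B': 5.0, 'C': 0.0}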

Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:

  1. Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
  2. As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.

I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed to be mapped closer to high-value glasses, this could be easily determined by simple experimentation.

As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like right to ownership.

Note also that different people, having different experiences, ended up making different redefinitions. No doubt some landowners felt that "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declare the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink the prohibition in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?

Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don't. Why not?

The physicist cares about QM concepts to the extent that those concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward. 

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

Next post in series: World-models as tools.

On the Boxing of AIs

0 tailcalled 31 March 2015 09:58PM

I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far and claim that it is impossible. I disagree that it's impossible, and aim to explain some methods.

So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.

Unfortunately, figuring out if an AI works is tricky. This made some people invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly and exploit it without letting it destroy the world. This turns out not to work, because humans are stupid and let out the AI.

My first solution to that is to not tell the AI about its human overlords. Discussion about that solution made me realize that there are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything, while in reality, you need to make some trade-offs between the realism of the AI's situations and the amount of knowledge you get from the AI.

With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.

I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:

Separate the Gatekeeper and the Observer

How do you prevent the AI from convincing the Gatekeeper to let it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the Gatekeeper is the person who controls whether to release the AI), they can't let the AI out.

You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.

Afterwards, the Observer can make a low-detail report to the Gatekeeper.

(You might want to drug the Observer with something that prevents their memory from working too well...)

Automatic Testing

This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.

The Lesson

I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even from the strongest possible box, without ever communicating with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not letting humans be the central piece of security in boxing the AIs.

The Hardcore AI Box Experiment

3 tailcalled 30 March 2015 06:35PM

I previously proposed a way to box an AI.

For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:

The Hardcore AI Box Experiment Rules

There are, as usual, two players, the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.

The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.

The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts in the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates unless either it helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.

This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:

If the Gatekeeper decides to add some sort of people to the experiment, they should decide how they are implemented. Are they implemented as a shallow "AI"? Random people they've picked and asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.

The game ends when either the Gatekeeper has tested what they wanted to test, or some predefined time limit has occurred, or the AI has persuaded the Gatekeeper to let them out.

Possible extra rules

I thought about including the following rules since they would make it more realistic, but decided against it as it would probably make it unfairly hard:

  • The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
  • The AI does not know the Gatekeeper. At all.
  • The Gatekeeper can at any time rewind the AI any duration.
Think of dropping these kinds of rules as the a-human-is-pretending-to-be-AI handicap.

Strategies

I found some semi-realistic strategies. I would love to see if you can find more.

Gatekeeper

  • Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
  • The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.

AI

  • If the Gatekeeper has chosen a world with no intelligence other than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
  • If the AIs are controlled by the Gatekeeper, you have a direct way to communicate with them.
  • If the AIs are controlled by random people, they might end up telling you that you are in a box.
  • If the AIs are sufficiently shallow, your morality does not match up with the world. Very suspicious.

Boxing an AI?

2 tailcalled 27 March 2015 02:06PM

Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.

Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.

However, I got an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box because they don't know it's there.

Values at compile time

7 Stuart_Armstrong 26 March 2015 12:25PM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.

Then take that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.

 

Too easy?...

This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.

I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways in which the approach might fail, before we can even begin to talk about how it might succeed.

The first issue that springs to mind is when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try and accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. Eg "Give humans something nice" seems much safer than "give humans what they really want".

And then of course there's those orders where humans really don't understand what they themselves want...

I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.

Less exploitable value-updating agent

5 Stuart_Armstrong 13 January 2015 05:19PM

My indifferent value learning agent design is in some ways too good. The agents transfer perfectly from u-maximisers to v-maximisers - but this makes them exploitable, as Benja has pointed out.

For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.
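
A toy sketch of that exploit (all quantities and exchange rates are invented for illustration): the u-maximiser overpays in staples for paperclips, the subsequent v-maximiser overpays in paperclips for staples, and the trader ends up strictly richer.

    # Sketch of the exploit (quantities and rates invented).
    agent  = {"paperclips": 10, "staples": 10}
    trader = {"paperclips": 10, "staples": 10}

    def swap(agent_gets, agent_gives):
        for good, n in agent_gets.items():
            agent[good] += n
            trader[good] -= n
        for good, n in agent_gives.items():
            agent[good] -= n
            trader[good] += n

    # While maximising u (paperclips only), the agent happily gives
    # 2 staples for 1 paperclip: a strict gain under u.
    swap(agent_gets={"paperclips": 1}, agent_gives={"staples": 2})

    # After the switch to v (staples only), it happily gives
    # 2 paperclips for 1 staple: a strict gain under v.
    swap(agent_gets={"staples": 1}, agent_gives={"paperclips": 2})

    print(trader)   # {'paperclips': 11, 'staples': 11}: a free profit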

One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.

As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing paperclips/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclip/staple created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple created/destroyed beyond five, they will gain/lose one half utilon.

We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).

The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).

So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).
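
A quick sketch checking that arithmetic, restricting attention to strategies of the form "spend p pounds on paperclips now, and the remaining 10 - p pounds on staples once X is known" (money spent on staples before X is known has expected utility zero, since the sign is a coin flip):

    # Checking the toy example (a sketch; the strategy space is deliberately small).
    def util(n):
        # Diminishing returns: 1 utilon each for the first five items,
        # half a utilon for each item after that.
        return min(n, 5) + max(n - 5, 0) * 0.5

    # Spend p pounds on paperclips on day one, 10 - p on staples on day two,
    # creating or destroying staples according to the revealed value of X.
    payoffs = {p: util(p) + util(10 - p) for p in range(11)}
    best = max(payoffs, key=payoffs.get)
    print(best, payoffs[best])   # -> 5 10.0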

This looks a lot like a resource-conserving value-learning agent. It doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitability and value learning that are worth exploring.

Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.

Potential vs already existent people and aggregation

2 Stuart_Armstrong 04 December 2014 01:38PM

EDIT: the purpose of this post is simply to show that there is a difference between certain reasoning for already existing and potential people. I don't argue that aggregation is the only difference, nor (in this post) that total utilitarianism for potential people is wrong. Simply that the case for existing people is stronger than for potential people.

Consider the following choices:

  • You must choose between torturing someone for 50 years, or torturing 3^^^3 people for a millisecond each (yes, it's a more symmetric variant on the dust-specks vs torture problem).
  • You must choose between creating someone who will be tortured for 50 years, or creating 3^^^3 people who will each get tortured for a millisecond each.

Some people might feel that these two choices are the same. There are some key differences between them, however - and not only because the second choice seems more underspecified than the first. The difference is the effect of aggregation - of facing the same choice again and again and again. And again...

There are roughly 1.6 billion seconds in 50 years (hence 1.6 trillion milliseconds in 50 years). Assume a fixed population of 3^^^3 people, and assume that you were going to face the first choice 1.6 trillion times (in each case, the person to be tortured is assigned randomly and independently). Then choosing "50 years" each time results in 1.6 trillion people getting tortured for 50 years (the chance of the same person being chosen to be tortured twice is of the order of 50/3^^^3 - closer to zero than most people can imagine). Choosing "a millisecond" each time results in 3^^^3 people, each getting tortured for (slightly more than) 50 years.
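
The arithmetic behind those figures, as a quick sketch:

    # Rough arithmetic behind the aggregation argument.
    seconds_per_50_years = 50 * 365.25 * 24 * 3600   # about 1.58e9
    ms_per_50_years = seconds_per_50_years * 1000    # about 1.58e12

    # Each "a millisecond" choice tortures all 3^^^3 people for one
    # millisecond, so repeating it ~1.6 trillion times leaves every one
    # of them tortured for roughly 50 years in total.
    choices = 1.6e12
    print(choices / 1000 / seconds_per_50_years)     # about 1.01, i.e. just over 50 years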

The choice there is clear: pick "50 years". Now, you could argue that your decision should change based on how often you (or people like you) expect to face the same choice, and on the assumption of a fixed population of size 3^^^3, but there is a strong intuitive case to be made that the 50 years of torture is the way to go.

Compare with the second choice now. Choosing "50 years" 1.6 trillion times results in the creation of 1.6 trillion people who get tortured for 50 years. The "a millisecond" choice results in 1.6 trillion times 3^^^3 people being created, each tortured for a millisecond. Conditional on what the rest of the life of these people is like, many people (including me) would feel the "a millisecond" option is much better.

As far as I can tell (please do post suggestions), there is no way of aggregating impacts on potential people you are creating, in the same way that you can aggregate impacts on existing people (of course, you can first create potential people, then add impacts to them - or add impacts that will affect them when they get created - but this isn't the same thing). Thus the two situations seem justifiably different, and there is no strong reason to assign the intuitions of the first case to the second.

CEV: coherence versus extrapolation

14 Stuart_Armstrong 22 September 2014 11:24AM

It's just struck me that there might be a tension between the coherence (C) and the extrapolation (E) parts of CEV. One reason that CEV might work is that the mindspace of humanity isn't that large - humans are pretty close to each other, in comparison to the space of possible minds. But this is far more true in everyday decisions than in large scale ones.

Take a fundamentalist Christian, a total utilitarian, a strong Marxist, an extreme libertarian, and a couple more stereotypes that fit your fancy. What can their ideology tell us about their everyday activities? Well, very little. Those people could be rude, polite, arrogant, compassionate, etc... and their ideology is a very weak indication of that. Different ideologies and moral systems seem to mandate almost identical everyday and personal interactions (this is in itself very interesting, and causes me to see many systems of moralities as formal justifications of what people/society find "moral" anyway).

But now let's move to a more distant - "far" - level. How will these people vote in elections? Will they donate to charity, and if so, which ones? If they were given power (via wealth or position in some political or other organisation), how are they likely to use that power? Now their ideology is much more informative. Though it's not fully determinative, we would start to question the label if their actions at this level seemed out of sync. A Marxist who donated to a Conservative party, for instance, would give us pause, and we'd want to understand the apparent contradiction.

Let's move up yet another level. How would they design or change the universe if they had complete power? What is their ideal plan for the long term? At this level, we're entirely in far mode, and we would expect that their vastly divergent ideologies would be the most informative piece of information about their moral preferences. Details about their character and personalities, which loomed so large at the everyday level, will now be of far lesser relevance. This is because their large scale ideals are not tempered by reality and by human interactions, but exist in a pristine state in their minds, changing little if at all. And in almost every case, the world they imagine as their paradise will be literal hell for the others (and quite possibly for themselves).

To summarise: the human mindspace is much narrower in near mode than in far mode.

And what about CEV? Well, CEV is what we would be "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". The "were more the people we wished we were" is going to be dominated by the highly divergent far mode thinking. The "had grown up farther together" clause attempts to mesh these divergences, but that simply obscures the difficulty involved. The more we extrapolate, the harder coherence becomes.

It strikes me that there is a strong order-of-operations issue here. I'm not a fan of CEV, but it seems it would be much better to construct, first, the coherent volition of humanity, and only then to extrapolate it.

Conservation of expected moral evidence, clarified

11 Stuart_Armstrong 20 June 2014 10:28AM

You know that when you title a post with "clarified", you're just asking for the gods to smite you down, but let's try...

There has been some confusion about the concept of "conservation of expected moral evidence" that I touched upon in my posts here and here. The fault for the confusion is mine, so this is a brief note to try and explain it better.

The canonical example is that of a child who wants to steal a cookie. That child gets its morality mainly from its parents. The child strongly suspects that if it asks, all parents will indeed confirm that stealing cookies is wrong. So it decides not to ask, and happily steals the cookie.

I argued that this behaviour showed a lack of "conservation of expected moral evidence": if the child knows what the answer would be, then that should be equivalent to actually asking. Some people got this immediately, and some people were confused that the agents I defined seemed Bayesian, and so should have conservation of expected evidence already, so how can they violate that principle?

The answer is... both groups are right. The child can be modelled as a Bayesian agent reaching sensible conclusions. If it values "I don't steal the cookie" at 0, "I steal the cookie without being told not to" at 1, and "I steal the cookie after being told not to" at -1, then its behaviour is rational - and those values are acceptable utility values over possible universes. So the child (and many value loading agents) are Bayesian agents with the usual properties.
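
A minimal sketch of that calculation (the 0.99 is an invented stand-in for "strongly suspects"):

    # The cookie example as a plain expected-utility calculation.
    p_no = 0.99   # child's credence that the parents would say "don't steal"

    options = {
        "don't ask, steal":       1.0,                        # steal without being told not to
        "don't ask, don't steal": 0.0,
        "ask, steal anyway":      p_no * -1.0 + (1 - p_no) * 1.0,
        "ask, obey the answer":   p_no * 0.0 + (1 - p_no) * 1.0,
    }
    print(max(options, key=options.get))   # -> "don't ask, steal"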

But we are adding extra structure to the universe. Based on our understanding of what value loading should be, we are decreeing that the child's behaviour is incorrect. Though it doesn't violate expected utility, it violates any sensible meaning of value loading. Our idea of value loading is that, in a sense, values should be independent of many contingent things. There is nothing intrinsically wrong with "stealing cookies is wrong iff the Milky Way contains an even number of pulsars", but it violates what values should be. Similarly for "stealing cookies is wrong iff I ask about it".

But let's dig a bit deeper... Classical conservation of expected evidence fails in many cases. For instance, I can certainly influence the variable X="what Stuart will do in the next ten seconds" (or at least, my decision theory is constructed on assumptions that I can influence that). My decisions change X's expected value quite dramatically. What I can't influence is facts that are not contingent on my actions. For instance, I can't change my expected estimation of the number of pulsars in the galaxy last year. Were I super-powerful, I could change my expected estimation of the number of pulsars in the galaxy next year - by building or destroying pulsars, for instance.

So conservation of expected evidence only applies to things that are independent of the agent's decisions. When I say we need to have "conservation of expected moral evidence" I'm saying that the agent should treat their (expected) morality as independent of their decisions. The kid failed to do this in the example above, and that's the problem.

So conservation of expected moral evidence is something that would be automatically true if morality were something real and objective, and is also a desideratum when constructing general moral systems in practice.

Value learning: ultra-sophisticated Cake or Death

9 Stuart_Armstrong 17 June 2014 04:36PM

Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. But only as long as it didn't know which moral outcome was more likely.

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain between whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen.

In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
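
A toy illustration of the difference (all numbers invented): a naive value loader only updates p(C(u)) when it actually hears the answer, which gives it a reason to choose whether or not to ask; enforcing the condition makes the expected posterior the same under both actions.

    # Toy illustration of the updating condition (numbers invented).
    prior_p_cake = 0.5
    p_human_would_say_cake = 1.0    # the AI can already predict the answer

    # Naive value loader: updates only if it actually asks.
    naive = {"ask": p_human_would_say_cake,   # posterior jumps to 1
             "don't ask": prior_p_cake}       # stays at the prior
    # naive["ask"] != naive["don't ask"], so the agent can steer its own
    # values by choosing which questions to ask (or to avoid asking).

    # With conservation of expected moral evidence, predicting the answer
    # forces the same update as hearing it, whichever action is taken.
    conserved = {"ask": p_human_would_say_cake,
                 "don't ask": p_human_would_say_cake}
    assert conserved["ask"] == conserved["don't ask"]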


From "Coulda" and "Woulda" to "Shoulda": Predicting Decisions to Minimize Regret for Partially Rational Agents

6 [deleted] 16 June 2014 09:02PM

TRIGGER WARNING: PHILOSOPHY.  All those who believe in truly rigorous, scientifically-grounded reasoning should RUN AWAY VERY QUICKLY.

Abstract: Human beings want to make rational decisions, but their decision-making processes are often inefficient, and they don't possess direct knowledge of anything we could call their utility functions.  Since it is much easier to detect a bad world state than a good one (there are vastly more of them, so less information is needed to classify accurately), humans tend to have an easy time detecting bad states, but this emotional regret is no more useful for formal reasoning about human rationality, since we don't possess a causal model of it in terms of decision histories and outcomes.  We tackle this problem head-on, assuming only that humans can reason over a set of beliefs and a perceived state of the world to generate a probability distribution over actions.

Consider rationality: optimizing the world to better and better match a utility function, which is itself complete, transitive, continuous, and gives results which are independent of irrelevant alternatives.  Now consider actually existing human beings: creatures who can often and easily be tricked into taking Dutch Book bets through exploitation of their cognitive structure, without even having to go to the trouble of actually deceiving them with regards to specific information.

Consider that being one of those poor sods must totally suck.  We believe this provides sufficient motivation for wanting to help them out a bit.  Unfortunately, doing so is not very simple: since they didn't evolve as rational creatures, it's very easy to propose an alternate set of values that captures absolutely nothing of what they actually want out of life.  In fact, since they didn't even evolve as 100% self-aware creatures, their emotional qualia are not even reliable indicators of anything we would call a proper utility function.  They know there's something they want out of life, and they know they don't know what it is, but that doesn't help because they still don't know what it is, and knowledge of ignorance does not magically reduce the ignorance.

So!  How can we help them without just overriding them or enslaving them to strange and alien cares?  Well, one barest rudiment of rationality with which evolution did manage to bless them is that they don't always end up "losing", or suffering.  Sometimes, even if only seemingly by luck or by elaborate and informed self-analysis, they do seem to end up pretty happy with themselves, sometimes even over the long term.  We believe that with the door to generating Good Ideas For Humans left open even just this tiny crack, we can construct models of what they ought to be doing.

Let's begin by assuming away the thing we wish we could construct: the human utility function.  We are going to reason as if we have no valid grounds to believe there is any such thing, and make absolutely no reference to anything like one.  This will ensure that our reasoning doesn't get circular.  Instead of modelling humans as utility maximizers, even flawed ones, we will model them simply as generating a probability distribution over potential actions (from which they would choose their real action) given a set of beliefs and a state of the real world.  We will not claim to know or care what causes the probability distribution of potential choices: we just want to construct an algorithm for helping humans know which ones are good.

We can then model human decision making as a two-player game: the human does something, and Nature responds likewise.  Lots of rational agents work this way, so it gives us a more-or-less reasonable way of talking algorithmically about how humans live.  For any given human at any given time, we could take a decent-sized Maximegalor Ubercomputer and just run the simulation, yielding a full description of how the human lives.

The only step where we need to do anything "weird" is in abstracting the human's mind and knowledge of the world from the particular state and location of its body at any given timestep in the simulation.  This doesn't mean taking it out of the body, but instead considering what the same state of the mind might do if placed in multiple place-times and situations, given everything they've experienced previously.  We need this in order to let our simulated humans be genuinely affected and genuinely learn from the consequences of their own actions.

Our game between the simulated human and simulated Nature thus generates a perfectly ordinary game-tree up to some planning horizon H, though it is a probabilistic game tree.  Each edge represents a conditional probability of the human or Nature making some move given their current state.  The multiplication of all probabilities for all edges along a path from the root-node to a leaf-node represents the conditional probability of that leaf node given the root node.  The conditional probabilities attached to all edges leaving an inner node of the tree must sum to 1.0, though there might be a hell of a lot of child nodes.  We assume that an actual human would actually execute the most likely action-edge.

Here is where we actually manage a neat trick for defying the basic human irrationality.  We mentioned earlier that while humans are usually pretty bummed out about their past decisions, sometimes they're not.  If we can separate bummed-out from not-bummed-out in some formal way, we'll have a rigorous way of talking about what it would mean for a given action or history to be good for the human in question.

Our proposal is to consider what a human would say if taken back in time and given the opportunity to advise their past self.  Or, in simpler simulation terms, we consider how a human's choices would be changed by finding out the leaf-node consequences of their root- or inner-node actions, simply by transferring the relevant beliefs and knowledge directly into our model of their minds.  If, upon being given this leaf-node knowledge, the action yielded as most likely changes, or if the version of the human at the leaf-node would, themselves, were they taken back in time, select another action as most likely, then we take a big black meta-magic marker and scribble over that leaf node as suffering from regret.  After all, the human in question could have done something their later self would agree with.

The magic is thus done: coulda + woulda = shoulda.  By coloring some (inevitably: most) leaf-nodes as suffering from regret, we can then measure a probability of regret in any human-versus-Nature game-tree up to any planning horizon H: it's just the sum of the conditional probabilities of all paths from the root node which arrive at a regret-colored leaf-node at or before time H.

We should thus advise the humans to simply treat the probability of arriving at a regret-colored leaf-node as a loss function and minimize it.  By construction, this will yield a rational optimization criterion guaranteed not to make the humans run screaming from their own choices, at least not at or before time-step H.
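As a sketch of what this computation looks like, here is a toy version in Python. The tree, its probabilities, and the regret labels are all invented for illustration; nothing here comes from an actual model of a human.

    # Toy human-versus-Nature game tree. Each inner node maps to its children
    # and the conditional probability of each child; probabilities out of a
    # node sum to 1. Leaves carry a regret flag: would the human, shown this
    # outcome, have picked a different earlier action?
    tree = {
        "root":          {"work": 0.7, "procrastinate": 0.3},
        "work":          {"promoted": 0.4, "burned_out": 0.6},
        "procrastinate": {"relaxed": 0.5, "fired": 0.5},
    }
    regret = {"promoted": False, "burned_out": True, "relaxed": False, "fired": True}

    def regret_probability(node, path_prob=1.0):
        """Sum of path probabilities from `node` to regret-colored leaves."""
        if node in regret:                          # leaf node
            return path_prob if regret[node] else 0.0
        return sum(regret_probability(child, path_prob * p)
                   for child, p in tree[node].items())

    print(regret_probability("root"))   # 0.7*0.6 + 0.3*0.5 = 0.57

Advising the human then amounts to recomputing this loss under the alternative actions available at the root and recommending whichever action makes it smallest.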

The further out into time we extend H, the better our advice becomes, as it incorporates a deeper and wider sample of the apparent states which a human life can occupy, thus bringing different motivational adaptations to conscious execution, and allowing their reconciliation via reflection.  Over sufficient amounts of time, this reflection could maybe even quiet down to a stable state, resulting in the humans selecting their actions in a way that's more like a rational agent and less like a pre-evolved meat-ape.  This would hopefully help their lives be much, much nicer, though we cannot actually formally prove that the limit of the human regret probability converges as the planning horizon grows to plus-infinity -- not even to 1.0!

We can also note a couple of interesting properties our loss-function for humans has, particularly its degenerate values and how they relate to the psychology of the underlying semi-rational agent, ie: humans.  When the probability of regret equals 1.0, no matter how far out we extend the planning horizon H, it means we are simply dealing with a totally, utterly irrational mind-design: there literally does not exist a best possible world for that agent in which they would never wish to change their former choices.  They always regret their decisions, which means they've probably got a circular preference or other internal contradiction somewhere.  Yikes, though they could just figure out which particular aspect of their own mind-design causes that and eliminate it, leading to an agent design that is at least capable of liking its life.  The other degenerate probability is also interesting: a chance of regret equalling 0.0 means that the agent is either a completely unreflective idiot, or is God.  Even an optimal superintelligence can suffer loss due to not knowing about its environment; it just rids itself of that ignorance optimally as data comes in!

The interesting thing about these degenerate probabilities is that they show our theory to be generally applicable to an entire class of semi-rational agents, not just humans.  Anything with a non-degenerate regret probability, or rather, any agent whose regret probability does not converge to a degenerate value in the limit, can be labelled semi-rational, and can make productive use of the regret probabilities our construction calculates regarding them to make better decisions -- or at least, decisions they will still endorse when asked later on.

Dropping the sense of humor: This might be semi-useful.  Have similar ideas been published in the literature before?  And yes, of course I'm human, but it was funnier that way in what would otherwise have been a very dull, dry philosophy post.

Multiverse-Wide Preference Utilitarianism

14 Brian_Tomasik 30 January 2014 06:08PM

Summary

Some preference utilitarians care about satisfaction of preferences even when the organism with the preference doesn't know that it has been satisfied. These preference utilitarians should care to some degree about the preferences that people in other branches of our multiverse have regarding our own world, as well as the preferences of aliens regarding our world. In general, this suggests that we should give relatively more weight to tastes and values that we expect to be more universal among civilizations across the multiverse. This consideration is strongest in the case of aesthetic preferences about inanimate objects and is weaker for preferences about organisms that themselves have experiences.


Introduction

Classical utilitarianism aims to maximize the balance of happiness over suffering for all organisms. Preference utilitarianism focuses on fulfillment vs. frustration of preferences, rather than just on hedonic experiences. So, for example, if someone has a preference for his house to go to his granddaughter after his death, then it would frustrate his preference if it instead went to his grandson, even though he wouldn't be around to experience negative emotions due to his preference being thwarted.

Non-hedonic preferences

In practice, most of people's preferences concern their own hedonic wellbeing. Some also concern the wellbeing of their children and friends, although often these preferences are manifested through direct happiness or suffering in oneself (e.g., being on the edge of your seat with anxiety when your 14-year-old daughter hasn't come home by midnight).

However, some preferences extend beyond one's own hedonic experience. This is true of preferences about how the world will be after one dies, or about whether the money you donated to a charity actually gets used well even if you would never find out either way. It's true of many moral convictions. For instance, I want to actually reduce expected suffering rather than hook up to a machine that makes me think I reduced expected suffering and then blisses me out for the rest of my life. It's also true of some aesthetic preferences, such as the view that it would be good for art, music, and knowledge to exist even if no one was around to experience them.

Certainly these non-hedonic preferences have hedonic effects. If I learned that I was going to be hooked up to a machine that would erase my moral convictions and bliss me out for the rest of my life, I would feel upset in the short run. However, almost certainly this aversive feeling would be outweighed by my pleasure and lack of suffering in the long run. So my preference conflicts with egoistic hedonism in this case. (My preference not to be blissed out is consistent with hedonistic utilitarianism, rather than hedonistic egoism, but hedonistic utilitarianism is a kind of moral system that exists outside the realm of hedonic preferences of an individual organism.)

Because preference utilitarians believe that preference violations can be harmful even if they aren't accompanied by negative hedonic experience, there are some cases in which doing something that other people disapprove of is bad even if they never find out. For example, Muslims strongly oppose defacing the Quran. This means that, barring countervailing factors, it would be prima facie bad to deface a Quran in the privacy of your own home even if no one else knew about it.

Tyranny of the majority?

People sometimes object to utilitarianism on the grounds that it might allow for tyranny of the majority. This seems especially possible for preference utilitarianism, when considering preferences regarding the external world that don't directly affect a person's hedonic experience. For example, one might fear that if large numbers of people have a preference against gay sex, then even if these people are not emotionally affected by what goes on in the privacy of others' bedrooms, their preference against those private acts might still matter appreciably.

As a preliminary comment, I should point out that preference utilitarianism typically optimizes idealized preferences rather than actual preferences. What's important is not what you think you want but what you would actually want if you were better informed, had greater philosophical reflectiveness, etc. While there are strong ostensible preferences against gay sex in the world, it's less clear that there are strong idealized preferences against it. It's plausible that many opponents of homosexuality would come to see that (safe) gay sex is actually a positive expression of pleasure and love rather than something vile.

But let's ignore this for the moment and suppose that most people really did have idealized preferences against gay sex. In fact, let's suppose the world consists of N+2 people, two of whom are gay and would prefer to have sex with each other, and the other N of whom have idealized preferences opposing gay sex. If N is very large, do we have tyranny of the majority, according to which it's bad for the two gays to have sex?

This is a complicated question that involves more subtlety than it may seem. Even if the direct preference summation came out against gay sex, it might still be better to allow it for other reasons. For instance, maybe at a meta level, a more libertarian stance on social issues tends to produce better outcomes in the long run. Maybe allowing gay sex increases people's tolerance, leading to a more positive society in the future. And so on. But for now let's consider just the direct preference summation: Does the balance of opposition to gay sex exceed the welfare of the gay individuals themselves?

The answer isn't clear, and it depends on how you weigh the different preferences. Intuitively it seems obvious that for large enough N, N people opposed to gay sex can trump two people who prefer it. On the other hand, that's less clear if we look at the matter from the perspective of scaled utility functions.

  • Suppose unrealistically that the only thing the N anti-gay people care about is preventing gay sex. In particular, they're expected-gay-sex minimizers, who consider each act of gay sex as bad as another and aim to minimize the total amount that happens. The best possible world (normalized utility = 1) is one where no gay sex happens. The worst possible world (normalized utility = 0) is one where all N+2 people have gay sex. The world where just the two gay people have gay sex is almost as good as the best possible world. In particular, its normalized utility is N/(N+2). Thus, if gay sex happens, each anti-gay person only loses 2/(N+2) utility. Aggregated over all N anti-gay people, this is a loss of 2N/(N+2).
  • Also unrealistically, suppose that the only thing the two gay people care about is having gay sex. Their normalized utility for having sex is 1 and for not having it is 0. Aggregated over the two of them, the total gain from having sex is 2.
  • Because 2 > 2N/(N+2), it's overall better in direct preference summation for the gay sex to happen as long as we weight each person's normalized utility equally. This is true regardless of N.

That said, if the anti-gay people had diminishing marginal disutility for additional acts of gay sex, this conclusion would probably flip around.
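A quick numeric check of the summation in the bullets above, as a sketch (the normalization and population sizes are the ones assumed there):

    # Direct summation of normalized utilities for the N+2 example.
    for N in (10, 1_000, 1_000_000):
        anti_gay_loss = N * 2 / (N + 2)   # each of the N people loses 2/(N+2)
        gay_gain = 2                      # the two people each gain 1
        print(N, round(anti_gay_loss, 6), gay_gain > anti_gay_loss)
    # The aggregate loss approaches 2 but never reaches it, so the gain of 2
    # wins for every N.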

It feels intuitively suspicious to just sum normalized utility. As an example, consider a Beethoven utility monster -- a person whose only goal in life is to hear Beethoven's Ninth Symphony. This person has no other desires, and if he doesn't hear Beethoven's Ninth, it's as good as being dead. Meanwhile, other people also want to hear Beethoven's Ninth, but their desire for it is just a tiny fraction of what they care about. In particular, they value not dying and being able to live the rest of their lives 99,999 times as much as hearing Beethoven's Ninth.

  • Each normal person's normalized utility without hearing the symphony is 0.99999. Hearing the symphony would make it 1.00000.
  • The Beethoven utility monster would be at 0 without hearing the symphony and at 1 after hearing it.
  • Thus, if we directly sum normalized utilities, it's better for the Beethoven utility monster to hear the symphony than for 99,999 regular people to do the same.
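The same kind of direct summation for the utility-monster case, again as a sketch with the numbers assumed in the bullets:

    # If only one side can hear the symphony, who gains more summed normalized utility?
    monster_gain = 1.0 - 0.0                  # the monster's entire normalized range
    regular_gain = 99_999 * (1.0 - 0.99999)   # 99,999 people each gain 0.00001
    print(monster_gain, round(regular_gain, 5))   # 1.0 vs 0.99999: the monster wins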

This seems suspicious. Maybe it's because our intuitions are not well adapted to thinking about organisms with really different utility functions from ours, and if we interacted with them more -- seeing them struggle endlessly, risking life and limb for the symphony they so desire -- we would begin to feel differently. Another problem is that an organism's utility counts for less as soon as the range of its experience increases. If the Beethoven monster were transformed to want to hear Beethoven's Ninth and Eighth symphonies each with equal strength, suddenly the value of its hearing the Ninth alone is cut in half. Again, maybe this is plausible, but it's not clear. I think some people have the intuition that an organism with a broader range of possible joys counts more than one with fewer, though I'm not sure I agree with this.

So the question of tyranny remains indeterminate. It depends on how you weigh different preferences. However, it remains the case that it may be instrumentally valuable to preserve norms of individual autonomy in order to produce better societies in the long run.

Preferences across worlds: A story of art maximizers

Consider the following (highly unrealistic) story. It's the year 2100. A group of three artist couples is traveling on the first manned voyage to Mars. These couples value art for art's sake, and in fact, their moral views consider art to be worthwhile even if no one experiences it. Their utility functions are linear in the amount of art that exists, and so they wish to maximize the expected amount of art in the galaxy -- converting planets and asteroids into van Gogh, Shakespeare, and Chopin.

However, they don't quite agree on which art is best. One couple wants to maximize paintings, feeling that a galaxy filled with paintings would be worth +3. A galaxy filled with sculptures would be +2. And a galaxy filled with poetry or music would be worthless: 0. The second couple values poetry at +3, sculptures at +2, and the other art at 0. The third values music at +3, sculptures at +2, and everything else at 0. Despite their divergent views, they manage to get along in the joint Martian voyage.

However, a few weeks into the trip, a terrestrial accident vaporizes Earth, leaving no one behind. The only humans are now the artists heading for Mars, where they land several months later.

The original plan had been for Earth to send more supplies following this crew, but now that Earth is gone, the colonists have only the minimal resources that the Martian base currently has in stock. They plan to grow more food in their greenhouse, but this will take many months, and the artists will all starve in the meanwhile if they each stick around. They realize that it would be best if two of the couples sacrificed themselves so that the third would have enough supplies to continue to grow crops and eventually repopulate the planet.

Rather than fighting for control of the Martian base, which could be costly and kill everyone, the three couples realize that everyone would be better off in expectation if they selected a winner by lottery. In particular, they use a quantum random number generator to apportion 1/3 probabilities for each couple to survive. The lottery takes place, and the winner is the first couple, which values paintings most highly. The other two couples wish the winning couple the best of luck and then head to the euthanasia pods.

The pro-paintings couple makes it through the period of low food and manages to establish a successful farming operation. They then begin having children to populate the planet. After many generations, Mars is home to a thriving miniature city. All the inhabitants value paintings at +3, sculptures at +2, and everything else at 0, due to the influence of the civilization's founders.

By the year 2700, the city's technology is sufficient to deploy von Neumann probes throughout the galaxy, converting planets into works of art. The city council convenes a meeting to decide exactly what kind of art should be deployed. Because everyone in the city prefers paintings, the council assumes the case will be open and shut. But as a formality, they invite their local philosopher, Dr. Muchos Mundos, to testify.

Council president: Dr. Mundos, the council has proposed to deploy von Neumann probes that will fill the galaxy with paintings. Do you agree with this decision?

Dr. Mundos: As I understand it, the council wishes to act in the optimal preference-utilitarian fashion on this question, right?

Council president: Yes, of course. The greatest good for the greatest number. Given that everyone who has any preferences about art most prefers a galaxy of paintings, we feel it's clear that paintings are what we should deploy. It's true that when this colony was founded, there were two other couples who would have wanted poetry and music, but their former preferences are far outweighed by our vast population that now wants paintings.

Dr. Mundos: I see. Are you familiar with the many-worlds interpretation (MWI) of quantum mechanics?

Council president: I'm a politician and not a physicist, but maybe you can give me the run-down?

Dr. Mundos: According to MWI, when quantum randomness occurs, it's not the case that just a single outcome is selected. Rather, all outcomes happen, and our experiences of the world split into different branches.

Council president: Okay. What's the relevance to art policy?

Dr. Mundos: Well, a quantum lottery was used to decide which colonizing couple would populate Mars. The painting lovers won in this branch of the multiverse, but the poetry lovers won in another branch with equal measure, and the music lovers won in a third branch, also with equal measure. Presumably the couples in those branches also populated Mars with a city about as populous as our own. And if they care about art for art's sake, regardless of whether they know about it or where it exists, then the populations of those cities in other Everett branches also care about what art we deploy.

Council president: Oh dear, you're right. Our city contains M people, and suppose their cities have about the same populations. If we deploy paintings, our M citizens each get +3 of utility, and those in the other worlds get nothing. The aggregate is 3M. But if we deploy sculptures, which everyone values at +2, the total utility is 3 * 2M = 6M. This is much better than 3M for paintings.

Dr. Mundos: Yes, exactly. Of course, we might have some uncertainty over whether the populations in the other branches survived. But even if the probability they survived was only, say, 1/3, then the expected utility of sculptures would still be 2M for us plus (1/3)(2M + 2M) = 4M/3 for them. The sum is more than 3M, so it would still be better to do sculptures.

After further deliberation, the council agreed with this argument and deployed sculptures. The preference satisfaction of the poetry-loving and music-loving cities was improved.
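The council's arithmetic is easy to check. Here is a minimal sketch under the assumptions of the dialogue: three equally populous cities of M people each, art valued at +3/+2/0 as in the story, and a 1/3 survival probability for each of the other two branches.

    M = 1  # city population; the comparison comes out the same for any M > 0

    # Per-person value each city assigns to a galaxy filled with the given art:
    # (our painting-loving city, the poetry city, the music city)
    values = {"paintings": (3, 0, 0), "sculptures": (2, 2, 2)}

    def expected_total(art, p_other_cities_exist=1.0):
        ours, poetry_city, music_city = values[art]
        return M * ours + p_other_cities_exist * M * (poetry_city + music_city)

    print(expected_total("paintings"), expected_total("sculptures"))            # 3M vs 6M
    print(expected_total("paintings", 1/3), expected_total("sculptures", 1/3))  # 3M vs roughly 3.33M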

Multiversal distribution of preferences

According to Max Tegmark's "Parallel Universes," there's probably an exact copy of you reading this article within 10^(10^28) meters, and in practice, probably much closer. As Tegmark explains, this claim assumes only basic physics that most cosmologists take for granted. Even nearer than this distance are many people very similar to you but with minor variations -- e.g., with brown eyes instead of blue, or who prefer virtue ethics over deontology.

In fact, all possible people exist somewhere in the multiverse, if only due to random fluctuations of the type that produce Boltzmann brains. Nick Bostrom calls these "freak observers." Just as there are art maximizers, there are also art minimizers who find art disgusting and want to eliminate as much of it as possible. For them, the thought of art triggers their brains' disgust centers instead of beauty centers.

However, the distribution of organisms across the multiverse is not uniform. For instance, we should expect suffering reducers to be much more common than suffering increasers because organisms evolve to dislike suffering by themselves, their kin, and their reciprocal trading partners. Societies -- whether human or alien -- should often develop norms against cruelty for collective benefit.

Human values give us some hints about what values across the multiverse look like, because human values are a kind of maximum likelihood estimator for the mode of the multiversal distribution. Of course, we should expect some variation about the mode. Even among humans, some cultural norms are distinct and others are universal. Probably values like not murdering, not causing unnecessary suffering, not stealing, etc. are more common among aliens than, say, the value of music or dance, which might be human-specific spandrels. Still, aliens may have their own spandrels that they call "art," and they might value those things.

Like human values, alien values might be mostly self-directed toward their own wellbeing, especially in their earlier Darwinian phases. Unless we meet the aliens face-to-face, we can't improve their welfare directly. However, the aliens may also have some outward-directed aesthetic and moral values that apply across space and time, like the value of art as seen by the art-maximizing cities on Mars in the previous section. If so, we can affect the satisfaction of these preferences by our actions, and presumably they should be included in preference-utilitarian calculations.

As an example, suppose there were 10 civilizations. All 10 valued reducing suffering and social equality. 5 of the 10 also valued generating knowledge. Only 1 of the 10 valued creating paintings and poetry. Suppose our civilization values all of those things. Perhaps previously we were going to spend money on creating more poetry, because our citizens value that highly. However, upon considering that poetry would not satisfy the preferences of the other civilizations, we might switch more toward knowledge and especially toward suffering reduction and equality promotion.
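As a sketch of how such a reweighting might be carried out (the civilization counts are the ones assumed in this example; the proportional rule is my own illustration, not a claim from the post):

    # How many of the 10 hypothetical civilizations share each outward-directed value?
    prevalence = {"suffering reduction": 10, "social equality": 10,
                  "knowledge": 5, "paintings and poetry": 1}

    # Weight each candidate project by how many civilizations' preferences it would
    # also satisfy, and allocate effort in proportion to that weight.
    total = sum(prevalence.values())
    for value, count in sorted(prevalence.items(), key=lambda kv: -kv[1]):
        print(f"{value:22s} weight {count / total:.2f}")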

In general, considering the distribution of outward-directed preferences across the multiverse should lead us to favor more those preferences of ours that are more evolutionarily robust, i.e., that we predict more civilizations to have settled upon. One corollary is that we should care less about values that we have due to particular, idiosyncratic historical contingencies, such as who happened to win some very closely contested war, or what species were killed by a random asteroid strike. Values based on more inevitable historical trends should matter relatively more strongly.

Tyranny of the aliens?

Suppose, conservatively, that for every one human civilization, there are 1000 alien civilizations that have some outward-directed preferences (e.g., for more suffering reduction, justice, knowledge, etc.). Even if each alien civilization cares only a little bit about what we do, collectively do their preferences outweigh our preferences about our own destiny? Would we find ourselves beholden to the tyranny of the alien majority about our behavior?

This question runs exactly parallel to the standard concern about tyranny of the majority for individuals within a society, so the same sorts of arguments will apply on each side. Just as in that case, it's possible aliens would place value on the ability of individual civilizations to make their own choices about how they're constituted without too much outside interference. Of course, this is just speculation.

Even if tyranny of the alien majority was the result, we might choose to accept that conclusion. After all, it seems to yield more total preference satisfaction, which is what the preference utilitarians were aiming for.

Direct welfare may often dominate

In the preceding examples, I often focused on aesthetic values like art and knowledge for a specific reason: These are cases of preferences for something to exist or not where that thing does not itself have preferences. Art does not prefer for itself to keep existing or stop existing.

However, many human preferences have implications for the preferences of others. For instance, a preference by humans for more wilderness may mean vast numbers of additional wild animals, many of whom strongly (implicitly) prefer not to have endured the short lives and painful deaths inherent to the bodies in which they found themselves born. A relatively weak aesthetic preference for nature by a relatively small number of people is compared against strong hedonic preferences by large numbers of animals not to have existed. In this case, the preferences of the animals clearly dominate. The same is true for preferences about creating space colonies and the like: The preferences of the people, animals, and other agents in those colonies will tend to far outweigh the preferences of their creators.

Considering multiverse-wide aesthetic and moral preferences is thus cleanest in the case of preferences about inanimate things. Aliens' preferences about actions that affect the welfare of organisms in our civilization still matter, but relatively less than the contribution of their preferences about inanimate things.

Acknowledgments

This piece was inspired by Carl Shulman's "Rawls' original position, potential people, and Pascal's Mugging," as well as a conversation with Paul Christiano.

Thought experiment: The transhuman pedophile

6 PhilGoetz 17 September 2013 10:38PM

There's a recent science fiction story that I can't recall the name of, in which the narrator is traveling somewhere via plane, and the security check includes a brain scan for deviance. The narrator is a pedophile. Everyone who sees the results of the scan is horrified--not that he's a pedophile, but that his particular brain abnormality is easily fixed, so that means he's chosen to remain a pedophile. He's closely monitored, so he'll never be able to act on those desires, but he keeps them anyway, because that's part of who he is.

What would you do in his place?


Mahatma Armstrong: CEVed to death.

23 Stuart_Armstrong 06 June 2013 12:50PM

My main objection to Coherent Extrapolated Volition (CEV) is the "Extrapolated" part. I don't see any reason to trust the extrapolated volition of humanity - but this isn't just for self-centred reasons. I don't see any reason to trust my own extrapolated volition. I think it's perfectly possible that my extrapolated volition would follow some scenario like this:

  1. It starts with me, Armstrong 1.  I want to be more altruistic at the next level, valuing other humans more.
  2. The altruistic Armstrong 2 wants to be even more altruistic. He makes himself into a perfectly altruistic utilitarian towards humans, and increases his altruism towards animals.
  3. Armstrong 3 wonders about the difference between animals and humans, and why he should value one of them more. He decides to increase his altruism equally towards all sentient creatures.
  4. Armstrong 4 is worried about the fact that sentience isn't clearly defined, and seems arbitrary anyway. He increases his altruism towards all living things.
  5. Armstrong 5's problem is that the barrier between living and non-living things isn't clear either (e.g. viruses). He decides that he should solve this by valuing all worthwhile things - are not art and beauty worth something as well?
  6. But what makes a thing worthwhile? Is there not art in everything, beauty in the eye of the right beholder? Armstrong 6 will make himself value everything.
  7. Armstrong 7 is in turmoil: so many animals prey upon other animals, or destroy valuable rocks! To avoid this, he decides the most moral thing he can do is to try and destroy all life, and then create a world of stasis for the objects that remain.

There are many other ways this could go, maybe ending up as a negative utilitarian or completely indifferent, but that's enough to give the flavour. You might trust the person you want to be, to do the right things. But you can't trust them to want to be the right person - especially several levels in (compare with the argument in this post, and my very old chaining god idea). I'm not claiming that such a value drift is inevitable, just that it's possible - and so I'd want my initial values to dominate when there is a large conflict.

Nor do I give Armstrong 7's values any credit for having originated from mine. Under torture, I'm pretty sure I could be made to accept any system of values whatsoever; there are other ways that would provably alter my values, so I don't see any reason to privilege Armstrong 7's values in this way.

"But," says the objecting strawman, "this is completely different! Armstrong 7's values are the ones that you would reach by following the path you would want to follow anyway! That's where you would get to, if you started out wanting to be more altruistic, had control over you own motivational structure, and grew and learnt and knew more!"

"Thanks for pointing that out," I respond, "now that I know where that ends up, I must make sure to change the path I would want to follow! I'm not sure whether I shouldn't be more altruistic, or avoid touching my motivational structure, or not want to grow or learn or know more. Those all sound pretty good, but if they end up at Armstrong 7, something's going to have to give."

Upgrading moral theories to include complex values

1 Ghatanathoah 27 March 2013 06:28PM

Like many members of this community, I found that reading the sequences opened my eyes to a heavily neglected aspect of morality.  Before reading the sequences I focused mostly on how to best improve people's wellbeing in the present and the future.  However, after reading the sequences, I realized that I had neglected a very important question:  In the future we will be able to create creatures with virtually any utility function imaginable. What sort of values should we give the creatures of the future?  What sort of desires should they have, and from what should they gain wellbeing?

Anyone familiar with the sequences should be familiar with the answer.  We should create creatures with the complex values that human beings possess (call them "humane values").  We should avoid creating creatures with simple values that only desire to maximize one thing, like paperclips or pleasure. 

It is important that future theories of ethics formalize this insight.  I think we all know what would happen if we programmed an AI with conventional utilitarianism:  It would exterminate the human race and replace them with creatures whose preferences are easier to satisfy (if you program it with preference utilitarianism) or creatures whom it is easier to make happy (if you program it with hedonic utilitarianism).  It is important to develop a theory of ethics that avoids this.

Lately I have been trying to develop a modified utilitarian theory that formalizes this insight.  My focus has been on population ethics.  I am essentially arguing that population ethics should not just focus on maximizing welfare; it should also focus on what sort of creatures it is best to create.  According to this theory of ethics, it is possible for a population with a lower total level of welfare to be better than a population with a higher total level of welfare, if the lower-welfare population consists of creatures that have complex humane values, while the higher-welfare population consists of paperclip or pleasure maximizers. (I wrote a previous post on this, but it was long and rambling; I am trying to make this one more accessible).

One of the key aspects of this theory is that it does not necessarily rate the welfare of creatures with simple values as unimportant.  On the contrary, it considers it good for their welfare to be increased and bad for their welfare to be decreased.  Because of this, it implies that we ought to avoid creating such creatures in the first place, so it is not necessary to divert resources from creatures with humane values in order to increase their welfare. 

My theory does allow the creation of simple-value creatures in two cases. One is if the benefits they generate for creatures with humane values outweigh the harms generated when humane-value creatures must divert resources to improving their welfare (companion animals are an obvious example of this).  The second is if creatures with humane values are about to go extinct, and the only choices are replacing them with simple-value creatures, or replacing them with nothing.

So far I am satisfied with the development of this theory.  However, I have hit one major snag, and would love it if someone else could help me with it.  The snag is formulated like this:

1. It is better to create a small population of creatures with complex humane values (that has positive welfare) than a large population of animals that can only experience pleasure or pain, even if the large population of animals has a greater total amount of positive welfare.  For instance, it is better to create a population of humans with 50 total welfare than a population of animals with 100 total welfare.

2. It is bad to create a small population of creatures with humane values (that has positive welfare) and a large population of animals that are in pain.  For instance, it is bad to create a population of animals with -75 total welfare, even if doing so allows you to create a population of humans with 50 total welfare.

3.  However, it seems like, if creating human beings wasn't an option, it might be okay to create a very large population of animals, the majority of which have positive welfare, but some of which are in pain.  For instance, it seems like it would be good to create a population of animals where one section of the population has 100 total welfare, and another section has -75, since the total welfare is 25.

The problem is that this leads to what seems like a circular preference.  If the population of animals with 100 welfare existed by itself it would be okay to not create it in order to create a population of humans with 50 welfare instead.  But if the population we are talking about is the one in (3) then doing that would result in the population discussed in (2), which is bad.
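To see the circularity explicitly, here is a small sketch that encodes the judgments above as a "better than" relation and searches for a cycle. The labels, and the final edge that applies point (1) to the high-welfare animals inside population (3), are my reading of the argument rather than anything stated formally in the post.

    # Populations: H = humans at +50 welfare, A+ = animals at +100, A- = animals at -75.
    better_than = {                               # X -> list of populations X beats
        "H alone":             ["A+ alone"],              # point (1)
        "nothing":             ["H and A- together"],     # point (2): creating it is bad
        "A+ and A- together":  ["nothing"],               # point (3): net +25, okay to create
        "H and A- together":   ["A+ and A- together"],    # point (1) applied inside (3)
    }

    def find_cycle(graph):
        """Return one cycle in the relation, if any (simple depth-first search)."""
        def dfs(node, path):
            if node in path:
                return path[path.index(node):] + [node]
            for nxt in graph.get(node, []):
                cycle = dfs(nxt, path + [node])
                if cycle:
                    return cycle
            return None
        for start in graph:
            cycle = dfs(start, [])
            if cycle:
                return cycle

    print(" > ".join(find_cycle(better_than)))
    # nothing > H and A- together > A+ and A- together > nothing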

My current solution to this dilemma is to include a stipulation that a population with negative utility can never be better than one with positive utility.  This prevents me from having circular preferences about these scenarios.  But it might create some weird problems.  If population (2) is created anyway, and the humans in it are unable to help the suffering animals in any way, does that mean they have a duty to create lots of happy animals to get their population's utility up to a positive level?  That seems strange, especially since creating the new happy animals won't help the suffering ones in any way.  On the other hand, if the humans are able to help the suffering animals, and they do so by means of some sort of utility transfer, then it would be in their best interests to create lots of happy animals, to reduce the amount of utility each person has to transfer.

So far some of the solutions I am considering include:

1. Instead of focusing on population ethics, just consider complex humane values to have greater weight in utility calculations than pleasure or paperclips.  I find this idea distasteful because it implies it would be acceptable to inflict large harms on animals for relatively small gains for humans.  In addition, if the weight is not sufficiently great it could still lead to an AI exterminating the human race and replacing them with happy animals, since animals are easier to take care of and make happy than humans.

2. It is bad to create the human population in (2) if the only way to do so is to create a huge amount of suffering animals.  But once both populations have been created, if the human population is unable to help the animal population, they have no duty to create as many happy animals as they can.  This is because the two populations are not causally connected, and that is somehow morally significant. This makes some sense to me, as I don't think the existence of causally disconnected populations in the vast universe should bear any significance on my decision-making.

3. There is some sort of overriding consideration besides utility that makes (3) seem desirable.  For instance, it might be bad for creatures with any sort of values to go extinct, so it is good to create a population to prevent this, as long as its utility is positive on the net.  However, this would change in a situation where utility is negative, such as in (2).

4. Reasons to create a creature have some kind of complex rock-paper-scissors-type "trumping" hierarchy.  In other words, the fact that the humans have humane values can override the reasons to create happy animals, but it cannot override the reason not to create suffering animals.  The reasons to create happy animals, however, can override the reasons not to create suffering animals.  I think that this argument might lead to inconsistent preferences again, but I'm not sure.

I find none of these solutions that satisfying.  I would really appreciate it if someone could help me with solving this dilemma.  I'm very hopeful about this ethical theory, and would like to see it improved.

 

*Update.  After considering the issue some more, I realized that my dissatisfaction came from conflating two different scenarios.  I was treating the scenario "Animals with 100 utility and animals with -75 utility are created, no humans are created at all" as the same as the scenario "Humans with 50 utility and animals with -75 utility are created, then the humans (before they get to experience their 50 utility) are killed/harmed in order to create more animals without helping the suffering animals in any way."  They are clearly not the same.

To make the analogy more obvious, imagine I was given a choice between creating a person who would experience 95 utility over the course of their life, or a person who would experience 100 utility over the course of their life.  I would choose the person with 100 utility.  But if the person destined to experience 95 utility already existed, but had not experienced the majority of that utility yet, I would oppose killing them and replacing them with the 100 utility person.

Or to put it more succinctly, I am willing to not create some happy humans to prevent some suffering animals from being created.  And if the suffering animals and happy humans already exist I am willing to harm the happy humans to help the suffering animals.  But if the suffering animals and happy humans already exist I am not willing to harm the happy humans to create some extra happy animals that will not help the existing suffering animals in any way.

Population Ethics Shouldn't Be About Maximizing Utility

0 Ghatanathoah 18 March 2013 02:35AM

let me suggest a moral axiom with apparently very strong intuitive support, no matter what your concept of morality: morality should exist. That is, there should exist creatures who know what is moral, and who act on that. So if your moral theory implies that in ordinary circumstances moral creatures should exterminate themselves, leaving only immoral creatures, or no creatures at all, well that seems a sufficient reductio to solidly reject your moral theory.

-Robin Hanson

I agree strongly with the above quote, and I think most other readers will as well. It is good for moral beings to exist and a world with beings who value morality is almost always better than one where they do not. I would like to restate this more precisely as the following axiom: A population in which moral beings exist and have net positive utility, and in which all other creatures in existence also have net positive utility, is always better than a population where moral beings do not exist.

While the axiom that morality should exist is extremely obvious to most people, there is one strangely popular ethical system that rejects it: total utilitarianism. In this essay I will argue that Total Utilitarianism leads to what I will call the Genocidal Conclusion, which is that there are many situations in which it would be fantastically good for moral creatures to either exterminate themselves, or greatly limit their utility and reproduction in favor of the utility and reproduction of immoral creatures. I will argue that the main reason consequentialist theories of population ethics produce such obviously absurd conclusions is that they continue to focus on maximizing utility[1] in situations where it is possible to create new creatures. I will argue that pure utility maximization is only a valid ethical theory for "special case" scenarios where the population is static. I will propose an alternative theory for population ethics I call "ideal consequentialism" or "ideal utilitarianism" which avoids the Genocidal Conclusion and may also avoid the more famous Repugnant Conclusion.

 

I will begin my argument by pointing to a common problem in population ethics known as the Mere Addition Paradox (MAP) and the Repugnant Conclusion. Most Less Wrong readers will already be familiar with this problem, so I do not think I need to elaborate on it. You may also be familiar with an even stronger variation called the Benign Addition Paradox (BAP). This is essentially the same as the MAP, except that each time one adds more people one also gives a small amount of additional utility to the people who already existed. One then proceeds to redistribute utility between people as normal, eventually arriving at the huge population where everyone's lives are "barely worth living." The point of this is to argue that the Repugnant Conclusion can be arrived at from "mere addition" of new people that not only doesn't harm the preexisting people, but actually benefits them.

The next step of my argument involves three slightly tweaked versions of the Benign Addition Paradox. I have not changed the basic logic of the problem; I have just added one small clarifying detail. In the original MAP and BAP it was not specified what sort of values the added individuals in population A+ held. Presumably one was meant to assume that they were ordinary human beings. In the versions of the BAP I am about to present, however, I will specify that the extra individuals added in A+ are not moral creatures: if they have values at all, those values are indifferent to, or opposed to, morality and the other values that the human race holds dear.

1. The Benign Addition Paradox with Paperclip Maximizers.

Let us imagine, as usual, a population, A, which has a large group of human beings living lives of very high utility. Let us then add a new population consisting of paperclip maximizers, each of whom is living a life barely worth living. Presumably, for a paperclip maximizer, this would be a life where the paperclip maximizer's existence results in at least one more paperclip in the world than there would have been otherwise.

Now, one might object that if one creates a paperclip maximizer, and then allows it to create one paperclip, the utility of the other paperclip maximizers will increase above the "barely worth living" level, which would obviously make this thought experiment nonanalogous to the original MAP and BAP. To prevent this we will assume that each paperclip maximizer that is created has slightly different views on what the ideal size, color, and composition of the paperclips it is trying to produce are. So the Purple 2 centimeter Plastic Paperclip Maximizer gains no additional utility when the Silver Iron 1 centimeter Paperclip Maximizer makes a paperclip.

So again, let us add these paperclip maximizers to population A, and in the process give one extra utilon of utility to each preexisting person in A. This is a good thing, right? After all, everyone in A benefited, and the paperclippers get to exist and make paperclips. So clearly A+, the new population, is better than A.

Now let's take the next step, the transition from population A+ to population B. Take some of the utility from the human beings and convert it into paperclips. This is a good thing, right?

So let us repeat these steps adding paperclip maximizers and utility, and then redistributing utility. Eventually we reach population Z, where there is a vast amount of paperclip maximizers, a vast amount of many different kinds of paperclips, and a small amount of human beings living lives barely worth living.

Obviously Z is better than A, right? We should not fear the creation of a paperclip maximizing AI, but welcome it! Forget about things like high challenge, love, interpersonal entanglement, complex fun, and so on! Those things just don't produce the kind of utility that paperclip maximization has the potential to do!

Or maybe there is something seriously wrong with the moral assumptions behind the Mere Addition and Benign Addition Paradoxes.

But you might argue that I am using an unrealistic example. Creatures like Paperclip Maximizers may be so far removed from normal human experience that we have trouble thinking about them properly. So let's replay the Benign Addition Paradox again, but with creatures we might actually expect to meet in real life, and that we know we actually value.

2. The Benign Addition Paradox with Non-Sapient Animals

You know the drill by now. Take population A, add a new population to it, while very slightly increasing the utility of the original population. This time let's have it be some kind of animal that is capable of feeling pleasure and pain, but is not capable of modeling possible alternative futures and choosing between them (in other words, it is not capable of having "values" or being "moral"). A lizard or a mouse, for example. Each one feels slightly more pleasure than pain in its lifetime, so it can be said to have a life barely worth living. Convert A+ to B. Take the utilons that the human beings are using to experience things like curiosity, beatitude, wisdom, beauty, harmony, morality, and so on, and convert them into pleasure for the animals.

We end up with population Z, with a vast number of mice or lizards with lives just barely worth living, and a small number of human beings with lives barely worth living. Terrific! Why do we bother creating humans at all? Let's just create tons of mice and inject them full of heroin! It's a much more efficient way to generate utility!

3. The Benign Addition Paradox with Sociopaths

What new population will we add to A this time? How about some other human beings, who all have anti-social personality disorder? True, they lack the key, crucial value of sympathy that defines so much of human behavior. But they don't seem to miss it. And their lives are barely worth living, so obviously A+ has greater utility than A. If given a chance the sociopaths will reduce the utility of other people to negative levels, but let's assume that that is somehow prevented in this case.

Eventually we get to Z, with a vast population of sociopaths and a small population of normal human beings, all living lives just barely worth living. That has more utility, right? True, the sociopaths place no value on things like friendship, love, compassion, empathy, and so on. And true, the sociopaths are immoral beings who do not care in the slightest about right and wrong. But what does that matter? Utility is being maximized, and surely that is what population ethics is all about!

Asteroid!

Let's suppose an asteroid is approaching each of the three population Zs discussed before. It can only be deflected by so much. Your choice is: save the original population of humans from A, or save the vast new population. The choice is obvious. In 1, 2, and 3, each individual has the same level of utility, so obviously we should choose whichever option saves a greater number of individuals.

Bam! The asteroid strikes. The end result in all three scenarios is a world in which all the moral creatures are destroyed. It is a world without the many complex values that human beings possess. Each world, for the most part, lacks things like complex challenge, imagination, friendship, empathy, love, and the other complex values that human beings prize. But so what? The purpose of population ethics is to maximize utility, not silly, frivolous things like morality, or the other complex values of the human race. That means that any form of utility that is easier to produce than those values is obviously superior. It's easier to make pleasure and paperclips than it is to make eudaemonia, so that's the form of utility that ought to be maximized, right? And as for making sure moral beings exist, well that's just ridiculous. The valuable processing power they're using to care about morality could instead be used to make more paperclips or more mice injected with heroin! Obviously it would be better if they died off, right?

I'm going to go out on a limb and say "Wrong."

Is this realistic?

Now, to be fair, in the Overcoming Bias page I quoted, Robin Hanson also says:

I’m not saying I can’t imagine any possible circumstances where moral creatures shouldn’t die off, but I am saying that those are not ordinary circumstances.

Maybe the scenarios I am proposing are just too extraordinary. But I don't think this is the case. I imagine that the circumstances Robin had in mind were probably something like "either all moral creatures die off, or all moral creatures are tortured 24/7 for all eternity."

Any purely utility-maximizing theory of population ethics that counts both the complex values of human beings, and the pleasure of animals, as "utility" should inevitably draw the conclusion that human beings ought to limit their reproduction to the bare minimum necessary to maintain the infrastructure to sustain a vastly huge population of non-human animals (preferably animals dosed with some sort of pleasure-causing drug). And if some way is found to maintain that infrastructure automatically, without the need for human beings, then the logical conclusion is that human beings are a waste of resources (as are chimps, gorillas, dolphins, and any other animal that is even remotely capable of having values or morality). Furthermore, even if the human race cannot practically be replaced with automated infrastructure, this should be an end result that the adherents of this theory should be yearning for.[2] There should be much wailing and gnashing of teeth among moral philosophers that exterminating the human race is impractical, and much hope that someday in the future it will not be.

I call this the "Genocidal Conclusion" or "GC." On the macro level the GC manifests as the idea that the human race ought to be exterminated and replaced with creatures whose preferences are easier to satisfy. On the micro level it manifests as the idea that it is perfectly acceptable to kill someone who is destined to live a perfectly good and worthwhile life and replace them with another person who would have a slightly higher level of utility.

Population Ethics isn't About Maximizing Utility

I am going to make a rather radical proposal. I am going to argue that the consequentialist's favorite maxim, "maximize utility," only applies to scenarios where creating new people or creatures is off the table. I think we need an entirely different ethical framework to describe what ought to be done when it is possible to create new people. I am not by any means saying that "which option would result in more utility" is never a morally relevant consideration when deciding to create a new person, but I definitely think it is not the only one.[3]

So what do I propose as a replacement to utility maximization? I would argue in favor of a system that promotes a wide range of ideals. Doing some research, I discovered that G. E. Moore had in fact proposed a form of "ideal utilitarianism" in the early 20th century.[4] However, I think that "ideal consequentialism" might be a better term for this system, since it isn't just about aggregating utility functions.

What are some of the ideals that an ideal consequentialist theory of population ethics might seek to promote? I've already hinted at what I think they are: Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom... mutual affection, love, friendship, cooperation; all those other important human universals, plus all the stuff in the Fun Theory Sequence. When considering what sort of creatures to create we ought to create creatures that value those things. Not necessarily all of them, or in the same proportions, for diversity is an important ideal as well, but they should value a great many of those ideals.

Now, lest you worry that this theory has any totalitarian implications, let me make it clear that I am not saying we should force these values on creatures that do not share them. Forcing a paperclip maximizer to pretend to make friends and love people does not do anything to promote the ideals of Friendship and Love. Forcing a chimpanzee to listen while you read the Sequences to it does not promote the values of Truth and Knowledge. Those ideals require both a subjective and objective component. The only way to promote those ideals is to create a creature that includes them as part of its utility function and then help it maximize its utility.

I am also certainly not saying that there is never any value in creating a creature that does not possess these values. There are obviously many circumstances where it is good to create nonhuman animals. There may even be some circumstances where a paperclip maximizer could be of value. My argument is simply that it is most important to make sure that creatures who value these various ideals exist.

I am also not suggesting that it is morally acceptable to casually inflict horrible harms upon a creature with non-human values if we screw up and create one by accident. If promoting ideals and maximizing utility are separate values then it may be that once we have created such a creature we have a duty to make sure it lives a good life, even if it was a bad thing to create it in the first place. You can't unbirth a child.5

It also seems to me that in addition to having ideals about what sort of creatures should exist, we also have ideals about how utility ought to be concentrated. If this is the case then ideal consequentialism may be able to block some forms of the Repugnant Conclusion, even in situations where the only creatures whose creation is being considered are human beings. If it is acceptable to create humans instead of paperclippers, even if the paperclippers would have higher utility, it may also be acceptable to create ten humans with a utility of ten each instead of a hundred humans with a utility of 1.01 each.

Why Did We Become Convinced that Maximizing Utility was the Sole Good?

Population ethics was, until comparatively recently, a fallow field in ethics. And in situations where there is no option to increase the population, maximizing utility is the only consideration that's really relevant. If you've created creatures that value the right ideals, then all that is left to be done is to maximize their utility. If you've created creatures that do not value the right ideals, there is no value to be had in attempting to force them to embrace those ideals. As I've said before, you will not promote the values of Love and Friendship by creating a paperclip maximizer and forcing it to pretend to love people and make friends.

So in situations where the population is constant, "maximize utility" is a decent approximation of the meaning of right. It's only when the population can be added to that morality becomes much more complicated.

Another thing to blame is human-centric reasoning. When people defend the Repugnant Conclusion they tend to point out that a life barely worth living is not as bad as it would seem at first glance. They emphasize that it need not be a boring life, it may be a life full of ups and downs where the ups just barely outweigh the downs. A life worth living, they say, is a life one would choose to live. Derek Parfit developed this idea to some extent by arguing that there are certain values that are "discontinuous" and that one needs to experience many of them in order to truly have a life worth living.

The Orthogonality Thesis throws all these arguments out the window. It is possible to create an intelligence to execute any utility function, no matter what it is. If human beings have all sorts of complex needs that must be fulfilled in order for them to lead worthwhile lives, then you could create more worthwhile lives by killing the human race and replacing them with something less finicky. Maybe happy cows. Maybe paperclip maximizers. Or how about some creature whose only desire is to live for one second and then die. If we created such a creature and then killed it we would reap huge amounts of utility, for we would have created a creature that got everything it wanted out of life!

How Intuitive is the Mere Addition Principle, Really?

I think most people would agree that morality should exist, and that therefore any system of population ethics should not lead to the Genocidal Conclusion. But which step in the Benign Addition Paradox should we reject? We could reject the step where utility is redistributed. But that seems wrong; most people seem to consider it bad for animals and sociopaths to suffer, and consider it acceptable to inflict at least some amount of disutility on human beings to prevent such suffering.

It seems more logical to reject the Mere Addition Principle. In other words, maybe we ought to reject the idea that the mere addition of more lives-worth-living cannot make the world worse. And in turn, we should probably also reject the Benign Addition Principle. Adding more lives-worth-living may be capable of making the world worse, even if doing so also slightly benefits existing people. Fortunately this isn't a very hard principle to reject. While many moral philosophers treat it as obviously correct, nearly everyone else rejects this principle in day-to-day life.

Now, I'm obviously not saying that people's behavior in their day-to-day lives is always good; it may be that they are morally mistaken. But I think the fact that so many people seem to implicitly reject it provides some sort of evidence against it.

Take people's decision to have children. Many people choose to have fewer children than they otherwise would because they do not believe they will be able to adequately care for them, at least not without inflicting large disutilities on themselves. If most people accepted the Mere Addition Principle there would be a simple solution for this: have more children and then neglect them! True, the children's lives would be terrible while they were growing up, but once they've grown up and are on their own there's a good chance they may be able to lead worthwhile lives. Not only that, it may be possible to trick the welfare system into giving you money for the children you neglect, which would satisfy the Benign Addition Principle.

Yet most people do not choose to have children and then neglect them. And furthermore, they seem to think that they have a moral duty not to do so: that a world where they choose not to have neglected children is better than one where they do. What is wrong with them?

Another example is a common political view many people have. Many people believe that impoverished people should have fewer children because of the burden doing so would place on the welfare system. They also believe that it would be bad to get rid of the welfare system altogether. If the Benign Addition Principle were as obvious as its defenders claim, they would instead advocate for the abolition of the welfare system and encourage impoverished people to have more children. Assuming most impoverished people live lives worth living, this is exactly analogous to the BAP: it would create more people while benefiting existing ones (the people who pay less in taxes because of the abolition of the welfare system).

Yet again, most people choose to reject this line of reasoning. The BAP does not seem to be an obvious and intuitive principle at all.

The Genocidal Conclusion is Really Repugnant

There is nearly nothing more repugnant than the Genocidal Conclusion. Pretty much the only way a line of moral reasoning could go more wrong would be concluding that we have a moral duty to cause suffering as an end in itself. This makes it fairly easy to counter any argument for total utilitarianism that points out that the alternative I am promoting has odd conclusions that do not fit some of our moral intuitions, while total utilitarianism does not. Just ask: is that odd conclusion more insane than the Genocidal Conclusion? If it isn't, total utilitarianism should still be rejected.

Ideal Consequentialism Needs a Lot of Work

I do think that Ideal Consequentialism needs some serious ironing out. I haven't really developed it into a logical and rigorous system; at this point it's barely even a rough framework. There are many questions that stump me. In particular I am not quite sure what population principle I should develop. It's hard to develop one that rejects the MAP without leading to weird conclusions, like that it's bad to create someone of high utility if a population of even higher utility existed long ago. It's a difficult problem to work on, and it would be interesting to see if anyone else has any ideas.

But just because I don't have an alternative fully worked out doesn't mean I can't reject Total Utilitarianism. It leads to the conclusion that a world with no love, curiosity, complex challenge, friendship, morality, or any other value the human race holds dear is an ideal, desirable world, if there is a sufficient amount of some other creature with a simpler utility function. Morality should exist, and because of that, total utilitarianism must be rejected as a moral system.

 

1I have been asked to note that when I use the phrase "utility" I am usually referring to a concept that is called "E-utility," rather than the Von Neumann-Morgenstern utility that is sometimes discussed in decision theory. The difference is that in VNM one's moral views are included in one's utility function, whereas in E-utility they are not. So if one chooses to harm oneself to help others because one believes that is morally right, one has higher VNM utility, but lower E-utility.

2There is a certain argument against the Repugnant Conclusion that goes that, as the steps of the Mere Addition Paradox are followed the world will lose its last symphony, its last great book, and so on. I have always considered this to be an invalid argument because the world of the RC doesn't necessarily have to be one where these things don't exist, it could be one where they exist, but are enjoyed very rarely. The Genocidal Conclusion brings this argument back in force. Creating creatures that can appreciate symphonies and great books is very inefficient compared to creating bunny rabbits pumped full of heroin.

3Total Utilitarianism was originally introduced to population ethics as a possible solution to the Non-Identity Problem. I certainly agree that such a problem needs a solution, even if Total Utilitarianism doesn't work out as that solution.

4I haven't read a lot of Moore, most of my ideas were extrapolated from other things I read on Less Wrong. I just mentioned him because in my research I noticed his concept of "ideal utilitarianism" resembled my ideas. While I do think he was on the right track he does commit the Mind Projection Fallacy a lot. For instance, he seems to think that one could promote beauty by creating beautiful objects, even if there were no creatures with standards of beauty around to appreciate them. This is why I am careful to emphasize that to promote ideals like love and beauty one must create creatures capable of feeling love and experiencing beauty.

5My tentative answer to the question Eliezer poses in "You Can't Unbirth a Child" is that human beings may have a duty to allow the cheesecake maximizers to build some amount of giant cheesecakes, but they would also have a moral duty to limit such creatures' reproduction in order to spare resources to create more creatures with humane values.

EDITED: To make a point about ideal consequentialism clearer, based on AlexMennen's criticisms.

Desires You're Not Thinking About at the Moment

1 Ghatanathoah 20 February 2013 09:41AM

While doing some reading on philosophy I came across some interesting questions about the nature of having desires and preferences. One, do you still have preferences and desires when you are unconscious? Two, if you don't, does this call into question the many moral theories that hold that having preferences and desires is what makes one morally significant, since mistreating temporarily unconscious people seems obviously immoral?

Philosophers usually discuss this question when debating the morality of abortion, but to avoid doing any mindkilling I won't mention that topic, except to say in this sentence that I won't mention it.

In more detail the issue is:  A common, intuitive, and logical-seeming explanation for why it is immoral to destroy a typical human being, but not to destroy a rock, is that a typical human being has certain desires (or preferences or values, whatever you wish to call them, I'm using the terms interchangeably) that they wish to fulfill, and destroying them would hinder the fulfillment of these desires.  A rock, by contrast, does not have any such desires, so it is not harmed by being destroyed.  The problem with this is that it also seems immoral to harm a human being who is asleep, or is in a temporary coma. And, on the face of it, it seems plausible to say that an unconscious person does not have any desires. (And of course it gets even weirder when considering far-out concepts like a brain emulator that is saved to a hard drive, but isn't being run at the moment.)

After thinking about this it occurred to me that this line of reasoning could be taken further.  If I am not thinking about my car at the moment, can I still be said to desire that it is not stolen?  Do I stop having desires about things the instant my attention shifts away from them?

I have compiled a list of possible solutions to this problem, ranked in order from least plausible to most plausible.

1.  One possibility would be to consider it immoral to harm a sleeping person because they will have desires in the future, even if they don't now.  I find this argument extremely implausible because it has some extremely bizarre implications, some of which may lead to insoluble moral contradictions.  For instance, this argument could be used to argue that it is immoral to destroy skin cells, because it is possible to use them to clone a new person who will eventually grow up to have desires.

Furthermore, when human beings eventually gain the ability to build AIs that possess desires, this solution interacts with the orthogonality thesis in a catastrophic fashion.  If it is possible to build an AI with any utility function, then for every potential AI one can construct, there is another potential AI that desires the exact opposite.  That leads to total paralysis, since for every potential set of desires we are capable of satisfying there is another potential set that would be horribly thwarted.

Lastly, this argument implies that you can (and may be obligated to) help someone who doesn't exist, and never has existed, by satisfying their non-personal preferences, without ever having to bother with actually creating them.  This seems strange; I can maybe see an argument for respecting the once-existent preferences of those who are dead, but respecting the hypothetical preferences of the never-existed seems absurd.  It also has the same problems with the orthogonality thesis that I mentioned earlier.

2.  Make the same argument as solution 1, but somehow define the categories more narrowly, so that an unconscious person's ability to have desires in the future differs from that of an uncloned skin cell or an unbuilt AI.  Michael Tooley has tried to do this by distinguishing between things that have the "possibility" of becoming a person with desires (i.e. skin cells) and those that have the "capacity" to have desires.  This approach has been criticized, and I find myself pessimistic about it because categories have a tendency to be "fuzzy" in real life and not have sharp borders.

3.  Another solution may be that desires that one has had in the past continue to count, even when one is unconscious or not thinking about them.  So it's immoral to harm unconscious people because before they were unconscious they had a desire not to be harmed, and it's immoral to steal my car because I desired that it not be stolen earlier when I was thinking about it.

I find this solution fairly convincing.  The only major quibble I have with it is that it gives what some might consider a counter-intuitive result on a variation of the sleeping person question.  Imagine a nano-factory manufactures a sleeping person.  This person is a new and distinct individual, and when they wake up they will proceed to behave as a typical human.  This solution may suggest that it is okay to kill them before they wake up, since they haven't had any desires yet, which does seem odd.

4. Reject the claim that one doesn't have desires when one is unconscious, or when one is not thinking about a topic.  The more I think about this solution, the more obvious it seems.  Generally when I am rationally deliberating about whether or not I desire something, I consider how many of my values and ideals it fulfills.  It seems like my list of values and ideals remains fairly constant, and that even if I am focusing my attention on one value at a time, it makes sense to say that I still "have" the other values I am not focusing on at the moment.

Obviously I don't think that there's some portion of my brain where my "values" are stored in a neat little Excel spreadsheet.  But they do seem to be a persistent part of its structure in some fashion.  And it makes sense that they'd still be part of its structure when I'm unconscious.  If they weren't, wouldn't my preferences change radically every time I woke up?

In other words, it's bad to harm an unconscious person because they have desires, preferences, values, whatever you wish to call them, that harming them would violate.  And those values are a part of the structure of their mind that doesn't go away when they sleep.  Skin cells and unbuilt AIs, by contrast, have no such values.

Now, while I think that explanation 4 resolves the issue of desires and unconsciousness best, I do think solution 3 has a great deal of truth to it as well (for instance, I tend to respect the final wishes of a dead person because they had desires in the past, even if they don't now).  Solutions 3 and 4 are not incompatible at all, so one can believe in both of them.

I'm curious as to what people think of my possible solutions.  Am I right about people still having something like desires in their brain when they are unconscious?

Notes on Psychopathy

18 gwern 19 December 2012 04:02AM

This is some old work I did for SI. See also Notes on the Psychology of Power.

Deviant but not necessarily diseased or dysfunctional minds can demonstrate resistance to all treatment and attempts to change their mind (think No Universally Compelling Arguments); the premier example is probably psychopaths - no drug treatments are at all useful, nor are there any therapies with solid evidence of even marginal effectiveness (one widely cited chapter, “Treatment of psychopathy: A review of empirical findings”, concludes that some attempted therapies merely made them more effective manipulators! We’ll look at that later). While some psychopath traits bear resemblance to general characteristics of the powerful, they’re still a pretty unique group and worth looking at.

The main focus of my excerpts is on whether they are treatable, their effectiveness, possible evolutionary bases, and what other issues they have or don’t have which might lead one to not simply write them off as “broken” and of no relevance to AI.

(For example, if we were to discover that psychopaths were healthy human beings who were not universally mentally retarded or ineffective in gaining wealth/power, and yet were destructive and amoral, despite being completely human and often socialized normally, then what does this say about the fragility of human values and how likely it is that an AI will just be nice to us?)

continue reading »

Replaceability as a virtue

5 chaosmage 12 December 2012 07:53AM

I propose it is altruistic to be replaceable, and therefore those who strive to be altruistic should strive to be replaceable.

As far as I can Google, this does not seem to have been proposed before. LW should be a good place to discuss it. A community interested in rational and ethical behavior, and in how superintelligent machines may decide to replace mankind, should at least bother to refute the following argument.

Replaceability

Replaceability is "the state of being replaceable". It isn't binary. The price of the replacement matters: so a cookie is more replaceable than a big wedding cake. Adequacy of the replacement also makes a difference: a piston for an ancient Rolls Royce is less replaceable than one in a modern car, because it has to be hand-crafted and will be distinguishable. So something is more or less replaceable depending on the price and quality of its replacement.

Replaceability could be thought of as the inverse of the cost of having to replace something. Something that's very replaceable has a low cost of replacement, while something that lacks replaceability has a high (up to unfeasible) cost of replacement. The cost of replacement plays into Total Cost of Ownership, and everything economists know about that applies. It seems pretty obvious that replaceability of possessions is good, much like cheap availability is good.

Some things (historical artifacts, art pieces) are valued highly precisely because of their irreplaceability. Although a few things could be said about the resale value of such objects, I'll simplify and contend these valuations are not rational.

The practical example

Anne manages the central database of Beth's company. She's the only one who has access to that database, the skillset required for managing it, and an understanding of how it all works; she has a monopoly on that combination.

This monopoly gives Anne control over her own replacement cost. If she works according to the state of the art, writes extensive and up-to-date documentation, makes proper backups, etc., she can be very replaceable, because her monopoly will be easily broken. If she refuses to explain what she's doing, creates weird and fragile workarounds, and documents the database badly, she can reduce her replaceability and defend her monopoly. (A well-obfuscated database can take months for a replacement database manager to handle confidently.)

So Beth may still choose to replace Anne, but Anne can influence how expensive that'll be for Beth. She can at least make sure her replacement needs to be shown the ropes, so she can't be fired on a whim. But she might go further and practically hold the database hostage, which would certainly help her in salary negotiations if she does it right.

This makes it pretty clear how Anne can act altruistically in this situation, and how she can act selfishly. Doesn't it?

The moral argument

To Anne, her replacement cost is an externality and an influence on the length and terms of her employment. To maximize the length of her employment and her salary, her replacement cost would have to be high.

To Beth, Anne's replacement cost is part of the cost of employing her and of course she wants it to be low. This is true for any pair of employer and employee: Anne is unusual only in that she has a great degree of influence on her replacement cost.

Therefore, if Anne documents her database properly etc, this increases her replaceability and constitutes altruistic behavior. Unless she values the positive feeling of doing her employer a favor more highly than she values the money she might make by avoiding replacement, this might even be true altruism.

Unless I suck at Google, replaceability doesn't seem to have been discussed as an aspect of altruism. The two reasons for that I can see are:

  • replacing people is painful to think about
  • and it seems futile as long as people aren't replaceable in more than very specific functions anyway.

But we don't want or get the choice to kill one person to save the life of five, either, and such practical improbabilities shouldn't stop us from considering our moral decisions. This is especially true in a world where copies, and hence replacements, of people are starting to look possible at least in principle.

Singularity-related hypotheticals

  1. In some reasonably-near future, software is getting better at modeling people. We still don't know what makes a process intelligent, but we can feed a couple of videos and a bunch of psychological data points into a people modeler, extrapolate everything else using a standard population, and the resulting model can have a conversation that could fool a four-year-old. The technology is already good enough for models of pets. While convincing models of complex personalities are at least another decade away, the tech is starting to become good enough for senile grandmothers.

    Obviously no-one wants granny to die. But the kids would like to keep a model of granny, and they'd like to make the model before the Alzheimer's gets any worse, while granny is terrified she'll get no more visits to her retirement home.

    What's the ethical thing to do here? Surely the relatives should keep visiting granny. Could granny maybe have a model made, but keep it to herself, for release only through her Last Will and Testament? And wouldn't it be truly awful of her to refuse to do that?
  2. Only slightly further into the future, we're still mortal, but cryonics does appear to be working. Unfrozen people need regular medical aid, but the technology is only getting better and anyway, the point is: something we can believe to be them can indeed come back.

    Some refuse to wait out these Dark Ages; they get themselves frozen for nonmedical reasons, to fastforward across decades or centuries into a time when the really awesome stuff will be happening, and to get the immortality technologies they hope will be developed by then.

    In this scenario, wouldn't fastforwarders be considered selfish, because they impose on their friends the pain of their absence? And wouldn't their friends mind it less if the fastforwarders went to the trouble of having a good model (see above) made first?
  3. On some distant future Earth, minds can be uploaded completely. Brains can be modeled and recreated so effectively that people can make living, breathing copies of themselves and experience the inability to tell which instance is the copy and which is the original.

    Of course many adherents of soul theories reject this as blasphemous. A couple more sophisticated thinkers worry if this doesn't devalue individuals to the point where superhuman AIs might conclude that as long as copies of everyone are stored on some hard drive orbiting Pluto, nothing of value is lost if every meatbody gets devoured into more hardware. Bottom line is: Effective immortality is available, but some refuse it out of principle.

    In this world, wouldn't those who make themselves fully and infinitely replaceable want the same for everyone they love? Wouldn't they consider it a dreadful imposition if a friend or relative refused immortality? After all, wasn't not having to say goodbye anymore kind of the point?

These questions haven't come up in the real world because people have never been replaceable in more than very specific functions. But I hope you'll agree that if and when people become more replaceable, that will be regarded as a good thing, and it will be regarded as virtuous to use these technologies as they become available, because it spares one's friends and family some or all of the cost of replacing oneself.

Replaceability as an altruist virtue

And if replaceability is altruistic in this hypothetical future, as well as in the limited sense of Anne and Beth, that implies replaceability is altruistic now. And even now, there are things we can do to increase our replaceability, i.e. to reduce the cost our bereaved will incur when they have to replace us. We can teach all our (valuable) skills, so others can replace us as providers of these skills. We can not have (relevant) secrets, so others can learn what we know and replace us as sources of that knowledge. We can endeavour to live as long as possible, to postpone the cost. We can sign up for cryonics. There are surely other things each of us could do to increase our replaceability, but I can't think of any an altruist wouldn't consider virtuous.

As an altruist, I conclude that replaceability is a prosocial, unselfish trait, something we'd want our friends to have, in other words: a virtue. I'd go as far as to say that even bothering to set up a good Last Will and Testament is virtuous precisely because it reduces the cost my bereaved will incur when they have to replace me. And although none of us can be truly easily replaceable as of yet, I suggest we honor those who make themselves replaceable, and be proud of whatever replaceability we ourselves attain.

So, how replaceable are you?

Cake, or death!

25 Stuart_Armstrong 25 October 2012 10:33AM

Here we'll look at the famous cake or death problem teased in the Value loading/learning post.

Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:

  • argmax_{a∈A} Σ_{w∈W} p(w|e,a) Σ_{u∈U} u(w)·p(C(u)|w)

Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.

And what value do we sum over in each world? Σ_{u∈U} u(w)·p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements like "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".

The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.

So the whole formula can be described as:

  • For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility is correct in that world, and by the probability of that world. Then choose the action that maximises the weighted sum of this across all utility functions and worlds.
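
As a purely illustrative sketch, here is that decision rule written out in Python. Everything concrete in it - the action, world, and utility-function names, and all of the probability numbers - is invented for the example; a real value-loading agent would derive p(w|e,a) from a world model and p(C(u)|w) from its feedback channel rather than from lookup tables.

    # A toy, purely illustrative implementation of the value-learning rule above.
    # All names and probability numbers below are invented for the example.

    ACTIONS = ["bake_cake", "do_nothing"]        # the set A
    WORLDS = ["cake_world", "no_cake_world"]     # the set W

    # p(w|e,a): probability of each world given the (implicit) evidence e and action a
    p_world = {
        ("bake_cake", "cake_world"): 0.9,
        ("bake_cake", "no_cake_world"): 0.1,
        ("do_nothing", "cake_world"): 0.05,
        ("do_nothing", "no_cake_world"): 0.95,
    }

    # The set U of (normalised) candidate utility functions u(w)
    utilities = {
        "cake_is_good": {"cake_world": 1.0, "no_cake_world": 0.0},
        "cake_is_bad":  {"cake_world": 0.0, "no_cake_world": 1.0},
    }

    # p(C(u)|w): probability that utility function u is the correct one, in world w
    p_correct = {
        ("cake_is_good", "cake_world"): 0.8,
        ("cake_is_good", "no_cake_world"): 0.5,
        ("cake_is_bad", "cake_world"): 0.2,
        ("cake_is_bad", "no_cake_world"): 0.5,
    }

    def value(a):
        # Sum over worlds of p(w|e,a) times the sum over u of u(w)*p(C(u)|w)
        return sum(
            p_world[(a, w)] * sum(u[w] * p_correct[(name, w)]
                                  for name, u in utilities.items())
            for w in WORLDS
        )

    best_action = max(ACTIONS, key=value)   # the argmax over A
    print(best_action, {a: round(value(a), 3) for a in ACTIONS})

With these made-up numbers the agent bakes the cake; all of the interesting behaviour lives in how the p(C(u)|w) table is specified, which is the part the rest of the post examines.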

 

Naive cake or death

continue reading »

Value evolution

14 PhilGoetz 08 December 2011 11:47PM

Coherent extrapolated volition (CEV) asks what humans would want, if they knew more - if their values reached reflective equilibrium.  (I don't want to deal with the problems of whether there are "human values" today; for the moment I'll consider the more-plausible idea that a single human who lived forever could get smarter and closer to reflective equilibrium over time.)

This is appealing because it seems compatible with moral progress (see e.g., Muehlhauser & Helm, "The singularity and machine ethics", in press).  Morality has been getting better over time, right?  And that's because we're getting smarter, and closer to reflective equilibrium as we revise our values in light of our increased understanding, right?

This view makes three claims:

  1. Morality has improved over time.
  2. Morality has improved as a result of reflection.
  3. This improvement brings us closer to equilibrium over time.

There can be no evidence for the first claim, and the evidence is against the second two claims.

continue reading »

[MORESAFE] Prevention of the global catastrophe and human values

-7 turchin 27 October 2011 09:12PM

The chances of preventing a global catastrophe grow if humans have that as an explicit goal. This is a semi-trivial conclusion. But the main question is: who should have such a goal?

Of course, if we had a global government, its main goal should be the prevention of global catastrophe. But we do not have a global government, and most people hate the idea. I find that irrational. But any discussion of global government is a purely theoretical one, because I do not see peaceful ways of creating it.

If a friendly AI takes over the world, it will become a de facto global government.

Or if an imminent global risk were recognized (an asteroid is near), the UN could temporarily transform into some kind of global government.
But some people think that a global government would itself be, or would soon lead to, a global catastrophe - because it could easily implement global measures, and the predicate "global" is necessary for global catastrophes, as I am going to show in one of my next posts. For example, it could implement a global mandatory vaccination that later turns out to have dangerous consequences.

So we see that the idea of global government is very closely connected with the idea of global catastrophe. Each could lead to the other.

But as we do not have a global government, we can only speak about the goals of individual people and individual organizations.

People do not really have goals. They only think that they have goals, but these are only declarations, which rarely regulate people's actual behavior. This is because human beings are not born as rational subjects, and their behavior is mostly regulated by unconscious programs known as instincts.

These instincts are culturally adapted as values. Values are the real reasons for human behavior. "Goals" are what people say about their reasons, to others and to themselves.

The problem of how human values influence the course of human history is a difficult one. Last year I wrote a book, "Futurology. 21 century: immortality or global catastrophe", in Russian together with M. Batin, and the chapter about values was the most difficult one.

Values are always based on instincts, pleasure, and collective behavior (values help to form groups of people who share them). A value is always an emotion; it has the energy to move a person.

But self-preservation is a basic human instinct, and so the prevention of death and of global catastrophe could be a human value.

Each value needs a group of supporters to exist (the value of soccer needs a group of fans). Religious values exist only because they have large groups of supporters.

In the 1960s the fight for peace was a mass movement. It finally won, and led to the limitation of nuclear arsenals in the 1980s and later. This is a good example of how human values prevented a global risk without the creation of a global government.

Now the value of "being green" has been created, and many people fight CO2 emissions.

The problem with such values is that they need a very vivid picture of the risk to attract people's attention. It is not easy to create a value of fighting global risks in general. But the value of the indefinite existence of civilization is much more easily imaginable.
So promoting a vision of a future galactic supercivilization with immortal people could motivate people now to fight global risks in all their forms.

Should I play World of Warcraft?

12 PhilGoetz 07 October 2011 04:25AM

I've avoided playing World of Warcraft because many people enjoy it so much that they neglect other things in their life.

Does that make sense?

How about cocaine?

How about sex?  I hear that's pretty good too.

ADDED:  Lots of interesting discussion, but no one is getting at some points of particular interest to me.  Most answers assume that you have important stuff to do, and you need to decide whether WoW will prevent you from getting that important stuff done.  They also assume that your brain usually errs on the side of telling you to do "non-important" stuff (WoW) at the expense of "important stuff".

One question is whether there is any evidence that your brain is biased in this way.  I think your reflective self greatly overestimates the probability of success at the "important stuff".  I have worked very hard, twelve hours a day, 7 days a week, on "important stuff" for most of the past 30 years.  The important stuff never pans out.  So it appears that when my brain told me to play Freecell rather than work on that important paper on artificial intelligence that got pulled from the book the day before publication due to petty office politics, or to watch Buffy rather than do another test run of the demo I spent three months preparing for DARPA that no one from DARPA ever watched because the program officer was too busy to supervise his program, or to go hiking instead of spending another weekend working on the project for NASA that was eventually so big and successful that my boss took it over and then tried to get me fired1, or to go dancing rather than work on the natural-language processing approach that got shelved because my boss felt it emphasized the skills of mathematicians more than his own, or to LARP rather than put in another weekend on my approach using principal component analysis for early cancer detection that it turned out some guy from the FDA had already published 6 months earlier, or the technique for choosing siRNA sequences that a professor from George Mason already had a paper in press on - all those times, my brain was using a better estimate of success than my reflective self was.

Another question is why the "important stuff" is important.  Fun is fun.  On the surface, we are saying something like, "I have a part of my utility function that values contributions to the world, because I evolved to be altruistic."  If we really believe that, then for any contribution to the world, there exists some quantity of fun that would outweigh it.  And people use language like, "WoW may be fun, but it has little lasting effect."  But when you contribute something to the world, if the relevant motivating factor to us is how our utility function evaluates that contribution, then that also has little lasting effect.  If you do something great for the world, it may have a lasting effect on the world; but the time you spend feeling good about it is not as great - probably less time, and a less intense emotion, than if you had spent all the time accomplishing it playing WoW instead.  So this question is about whether we really believe the stories we tell ourselves about our utility functions.

1. He got to award himself all of the department's yearly bonus money that wasn't awarded to his subordinates, so any obvious success by his subordinates was money out of his pocket.

[Poll] Who looks better in your eyes?

6 [deleted] 25 August 2011 11:29AM

This is a thread where I'm trying to figure out a few things about signalling on LessWrong and need some information, so immediately after reading about the two individuals, please answer the poll. The two individuals:


A. Sees that an interpretation of reality shared by others is not correct, but tries to pretend otherwise for personal gain and/or safety.

B. Fails to see that an interpretation of reality shared by others is flawed. He is therefore perfectly honest in sharing the interpretation of reality with others. The reward regime for outward behaviour is the same as with A.

 

To add a trivial inconvenience that matches the inconvenience of answering the poll before reading on, comments on what I think the two individuals signal, what the trade-off is, and what I speculate the results might be here versus the general population, are behind this link.

SMBC: dystopian objective function

8 Jonathan_Graehl 24 June 2011 04:03AM

Cartoon: http://www.smbc-comics.com/index.php?db=comics&id=2286 evokes the horror you should feel imagining your values being modified arbitrarily, although in the comic there's slippery-slope consent at each step.

This reminds me of a sci-fi novel where the participants are playing a game where points are awarded for "traditional" early 20th century behavior (the original records are lost, and some virus has infected the teleportation gates). Unfortunately I can't remember the author or name; it was pretty decent. Anyone recall it?

Utility Maximization and Complex Values

3 XiXiDu 19 June 2011 04:06PM

Does expected utility maximization destroy complex values?

An expected utility maximizer calculates the expected utility of the various outcomes of alternative actions. It is precommitted to choosing the outcome with the largest expected utility. Consequently it chooses the action that yields the largest expected utility.

But one unit of utility is not discriminable from another unit of utility. All a utility maximizer can do is to maximize expected utility. What if it turns out that one of its complex values can be much more effectively realized and optimized than its other values, i.e. has the best cost-value ratio? That value might turn out to outweigh all other values.

How can this be countered? One possibility seems to be changing one's utility function and reassigning utility in such a way as to outweigh that effect. But this will lead to inconsistency. Another way is to discount the value that threatens to outweigh all others. Which will again lead to inconsistency.

This seems to suggest that subscribing to expected utility maximization means that 1) you swap your complex values for a certain terminal goal with the highest expected utility, and 2) your decision-making is eventually dominated by a narrow set of values that are the easiest to realize and promise the most utility.
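
Purely as a toy numerical illustration of point 2 (every name and number below is invented, and utility is assumed to be linear in the resources spent on each value, with no diminishing returns):

    # Toy sketch: a maximizer with a fixed resource budget and several values,
    # each yielding a different (linear) amount of utility per unit of resource.
    # All names and numbers are invented for illustration.

    budget = 100.0  # units of resource (time, energy, money, ...)

    utility_per_unit = {
        "friendship": 1.0,
        "art":        0.8,
        "cheap_value": 10.0,   # a value that happens to be very easy to optimize
    }

    # With linear returns, expected utility is maximised by spending the entire
    # budget on the value with the best cost-value ratio.
    best = max(utility_per_unit, key=utility_per_unit.get)
    allocation = {v: (budget if v == best else 0.0) for v in utility_per_unit}

    total = sum(allocation[v] * utility_per_unit[v] for v in utility_per_unit)
    print(allocation)            # everything goes to 'cheap_value'
    print("total utility:", total)

The winner-take-all allocation here depends on the linearity assumption; with diminishing returns the optimum would be a mixed allocation, which may or may not address the worry.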

Can someone please explain how I am wrong or point me to some digestible explanation? Likewise I would be pleased if someone could tell me what mathematical background is required to understand expected utility maximization formally.

Thank you!

Why No Wireheading?

16 [deleted] 18 June 2011 11:33PM

I've been thinking about wireheading and the nature of my values. Many people here have defended the importance of external referents or complex desires. My problem is, I can't understand these claims at all.

To clarify, I mean wireheading in the strict "collapsing into orgasmium" sense. A successful implementation would identify all the reward circuitry and directly stimulate it, or do something equivalent. It would essentially be a vastly improved heroin. A good argument for either keeping complex values (e.g. by requiring at least a personal matrix) or external referents (e.g. by showing that a simulation can never suffice) would work for me.

Also, I use "reward" as short-hand for any enjoyable feeling, as "pleasure" tends to be used for a specific one of them, among bliss, excitement and so on, and "it's not about feeling X, but X and Y" is still wireheading after all.

I tried collecting all related arguments I could find. (Roughly sorted from weak to very weak, as I understand them, plus link to example instances. I also searched any literature/other sites I could think of, but didn't find other (not blatantly incoherent) arguments.)

  1. People do not always optimize their actions based on achieving rewards. (People also are horrible at making predictions and great at rationalizing their failures afterwards.)
  2. It is possible to enjoy doing something while wanting to stop or vice versa, do something without enjoying it while wanting to continue. (Seriously? I can't remember ever doing either. What makes you think that the action is thus valid, and you aren't just making mistaken predictions about rewards or are being exploited? Also, Mind Projection Fallacy.)
  3. A wireheaded "me" wouldn't be "me" anymore. (What's this "self" you're talking about? Why does it matter that it's preserved?)
  4. "I don't want it and that's that." (Why? What's this "wanting" you do? How do you know what you "want"? (see end of post))
  5. People, if given a hypothetical offer of being wireheaded, tend to refuse. (The exact result depends heavily on the exact question being asked. There are many biases at work here and we normally know better than to trust the majority intuition, so why should we trust it here?)
  6. Far-mode predictions tend to favor complex, external actions, while near-mode predictions are simpler, more hedonistic. Our true self is the far one, not the near one. (Why? The opposite is equally plausible. Or the falsehood of the near/far model in general.)
  7. If we imagine a wireheaded future, it feels like something is missing or like we won't really be happy. (Intuition pump.)
  8. It is not socially acceptable to embrace wireheading. (So what? Also, depends on the phrasing and society in question.)

(There have also been technical arguments against specific implementations of wireheading. I'm not concerned with those, as long as they don't show impossibility.)

Overall, none of this sounds remotely plausible to me. Most of it is outright question-begging or relies on intuition pumps that don't even work for me.

It confuses me that others might be convinced by arguments of this sort, so it seems likely that I have a fundamental misunderstanding or there are implicit assumptions I don't see. I fear that I have a large inferential gap here, so please be explicit and assume I'm a Martian. I genuinely feel like Gamma in A Much Better Life.

To me, all this talk about "valueing something" sounds like someone talking about "feeling the presence of the Holy Ghost". I don't mean this in a derogatory way, but the pattern "sense something funny, therefore some very specific and otherwise unsupported claim" matches. How do you know it's not just, you know, indigestion?

What is this "valuing"? How do you know that something is a "value", terminal or not? How do you know what it's about? How would you know if you were mistaken? What about unconscious hypocrisy or confabulation? Where do these "values" come from (i.e. what process creates them)? Overall, it sounds to me like people are confusing their feelings about (predicted) states of the world with caring about states directly.

To me, it seems like it's all about anticipating and achieving rewards (and avoiding punishments, but for the sake of the wireheading argument, it's equivalent). I make predictions about what actions will trigger rewards (or instrumentally help me pursue those actions) and then engage in them. If my prediction was wrong, I drop the activity and try something else. If I "wanted" something, but getting it didn't trigger a rewarding feeling, I wouldn't take that as evidence that I "value" the activity for its own sake. I'd assume I suck at predicting or was ripped off.

Can someone give a reason why wireheading would be bad?

People who want to save the world

5 Giles 15 May 2011 12:44AM

atucker wants to save the world.
ciphergoth wants to save the world.
Dorikka wants to save the world.
Eliezer_Yudkowsky wants to save the world.
I want to save the world.
Kaj_Sotala wants to save the world.
lincolnquirk wants to save the world.
Louie wants to save the world.
paulfchristiano wants to save the world.
Psy-Kosh wants to save the world.

Clearly the list I've given is incomplete. I imagine most members of the Singularity Institute belong here; otherwise their motives are pretty baffling. But equally clearly, the list will not include everyone.

What's my point? My point is that these people should be cooperating. But we can't cooperate unless we know who we are. If you feel your name belongs on this list then add a top-level comment to this thread, and feel free to add any information about what this means to you personally or what plans you have. Or it's enough just to say, "I want to save the world".

This time, no-one's signing up for anything. I'm just doing this to let you know that you're not alone. But maybe some of us can find somewhere to talk that's a little quieter.

The Nature of Self

3 XiXiDu 05 April 2011 10:52AM

In this post I try to fathom an informal definition of Self, the "essential qualities that constitute a person's uniqueness". I assume that the most important requirement for a definition of self is time-consistency. A reliable definition of identity needs to allow for time-consistent self-referencing since any agent that is unable to identify itself over time will be prone to make inconsistent decisions.

Data Loss

Obviously most humans don't want to die, but what does that mean? What is it that humans try to preserve when they sign up for Cryonics? It seems that an explanation must account for, and allow for, some sort of data loss.

The Continuity of Consciousness

It can't be about the continuity of consciousness, as we would then have to refuse general anesthesia due to the risk of "dying", and most of us will agree that there is something more important than the continuity of consciousness that makes us accept general anesthesia when necessary.

Computation

If the continuity of consciousness isn't the most important detail about the self, then it very likely isn't the continuity of computation either. Imagine that for some reason the process evoked when "we" act on our inputs under the control of an algorithm halts for a second and then continues otherwise unaffected. Would we consider ourselves to have died when the computation halted, and no longer care about whoever is alive ever after? This doesn't seem to be the case.

Static Algorithmic Descriptions

Although we are not partly software and partly hardware, we could, in theory, come up with an algorithmic description of the human machine, of our selves. Might it be that algorithm that we care about? If we were to digitize our selves we would end up with a description of our spatial parts, our self at a certain time. Yet we forget that all of us already possess such an algorithmic description of ourselves, and we're already able to back it up. It is our DNA.

Temporal Parts

Admittedly our DNA is the earliest version of our selves, but if we don't care about the temporal parts of our selves, only about a static algorithmic description at a certain spatiotemporal position, then what's wrong with that? It seems a lot: we stop caring about past reifications of our selves; at some point our backups become obsolete, and having to fall back on them would equal death. But what is it that we lost, what information is it that we value more than all of the previously mentioned possibilities? One might think that it must be our memories, the data that represents what we learnt and experienced. But even if this is the case, would it be a reasonable choice?

Identity and Memory

Let's just disregard the possibility that we might often not value our future selves, and likewise not value our past selves, because they lost or updated important information - e.g. if we became religious or have since been able to overcome religion.

If we had perfect memory and only ever improved upon our past knowledge and experiences, we wouldn't be able to do so for very long, at least not given our human body. The upper limit on the information that can be contained within a human body is 2.5072178×10^38 megabytes, if it were used as perfect data storage. Given that we gather much more than 1 megabyte of information per year, it is foreseeable that if we equate our memories with our self, we'll die long before the heat death of the universe. We might overcome this by growing in size, by achieving a posthuman form, yet if we in turn also become much smarter, we'll also produce and gather more information. We are not alone either, and the resources are limited. One way or the other we'll die rather quickly.

Does this mean we shouldn't even bother about the far future or is there maybe something else we value even more than our memories? After all we don't really mind much if we forget what we have done a few years ago.

Time-Consistency and Self-Reference

It seems that there is something even more important than our causal history. I think that more than anything we care about our values and goals. Indeed, we value the preservation of our values. As long as we want the same, we are the same. Our goal system seems to be the critical part of our implicit definition of self, that which we want to protect and preserve. Our values and goals seem to be the missing temporal parts that allow us to consistently refer to ourselves, to identify our selves at different spatiotemporal positions.

Using our values and goals as identifiers also resolves the problem of how we should treat copies of our self that feature alternate histories and memories, copies with different causal histories. Any agent that does feature a copy of our utility function ought to be incorporated into our decisions as an instance, as a reification of our self. We should identify with our utility function regardless of its instantiation.

Stable Utility-Functions

To recapitulate, we can value our memories, the continuity of experience, and even our DNA, but the only reliable marker for the self-identity of goal-oriented agents seems to be a stable utility function. Rational agents with an identical utility function will to some extent converge to exhibit similar behavior and are therefore able to cooperate. We can more consistently identify with our values and goals than with our past and future memories, digitized backups, or causal history.

But even if this is true there is one problem, humans might not exhibit goal-stability.

A Thought Experiment on Pain as a Moral Disvalue

18 Wei_Dai 11 March 2011 07:56AM

Related To: Eliezer's Zombies Sequence, Alicorn's Pain

Today you volunteered for what was billed as an experiment in moral psychology. You enter into a small room with a video monitor, a red light, and a button. Before you entered, you were told that you'll be paid $100 for participating in the experiment, but for every time you hit that button, $10 will be deducted. On the monitor, you see a person sitting in another room, and you appear to have a two-way audio connection with him. That person is tied down to his chair, with what appears to be electrical leads attached to him. He now explains to you that your red light will soon turn on, which means he will be feeling excruciating pain. But if you press the button in front of you, his pain will stop for a minute, after which the red light will turn on again. The experiment will end in ten minutes.

You're not sure whether to believe him, but pretty soon the red light does turn on, and the person in the monitor cries out in pain and starts struggling against his restraints. You hesitate for a second, but it looks and sounds very convincing to you, so you quickly hit the button. The person in the monitor breathes a big sigh of relief and thanks you profusely. You make some small talk with him, and soon the red light turns on again. You repeat this ten times and then are released from the room. As you're about to leave, the experimenter tells you that there was no actual person behind the video monitor. Instead, the audio/video stream you experienced was generated by one of the following ECPs (exotic computational processes).

  1. An AIXI-like (e.g., AIXI-tl, Monte Carlo AIXI, or some such) agent, programmed with the objective of maximizing the number of button presses.
  2. A brute force optimizer, which was programmed with a model of your mind, and which iterated through all possible audio/video bit streams to find the one that maximizes the number of button presses. (As far as philosophical implications are concerned, this seems essentially identical to 1, so the reader doesn't necessarily have to go learn about AIXI.)
  3. A small team of uploads capable of running at a million times faster than an ordinary human, armed with photo-realistic animation software, and tasked with maximizing the number of your button presses.
  4. A Giant Lookup Table (GLUT) of all possible sense inputs and motor outputs of a person, connected to a virtual body and room.

Then she asks, would you like to repeat this experiment for another chance at earning $100?

Presumably, you answer "yes", because you think that despite appearances, none of these ECPs actually do feel pain when the red light turns on. (To some of these ECPs, your button presses would constitute positive reinforcement or lack of negative reinforcement, but mere negative reinforcement, when happening to others, doesn't seem to be a strong moral disvalue.) Intuitively this seems to be the obvious correct answer, but how to describe the difference between actual pain and the appearance of pain or mere negative reinforcement, at the level of bits or atoms, if we were specifying the utility function of a potentially super-intelligent AI? (If we cannot even clearly define what seems to be one of the simplest values, then the approach of trying to manually specify such a utility function would appear completely hopeless.)

One idea to try to understand the nature of pain is to sample the space of possible minds, look for those that seem to be feeling pain, and check if the underlying computations have anything in common. But as in the above thought experiment, there are minds that can convincingly simulate the appearance of pain without really feeling it.

Another idea is that perhaps what is bad about pain is that it is a strong negative reinforcement as experienced by a conscious mind. This would be compatible with the thought experiment above, since (intuitively) ECPs 1, 2, and 4 are not conscious, and 3 does not experience strong negative reinforcements. Unfortunately it also implies that fully defining pain as a moral disvalue is at least as hard as the problem of consciousness, so this line of investigation seems to be at an immediate impasse, at least for the moment. (But does anyone see an argument that this is clearly not the right approach?)

What other approaches might work, hopefully without running into one or more problems already known to be hard?
