Ends Don't Justify Means (Among Humans)

Eliezer Yudkowsky

14th Oct 2008

"If the ends don't justify the means, what does?"
—variously attributed

"I think of myself as running on hostile hardware."
—Justin Corwin

Yesterday I talked about how humans may have evolved a structure of political revolution, beginning by believing themselves morally superior to the corrupt current power structure, but ending by being corrupted by power themselves—not by any plan in their own minds, but by the echo of ancestors who did the same and thereby reproduced.

This fits the template:

In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.

From this proposition, I now move on to my main point, a question considerably outside the realm of classical Bayesian decision theory:

"What if I'm running on corrupted hardware?"

In such a case as this, you might even find yourself uttering such seemingly paradoxical statements—sheer nonsense from the perspective of classical decision theory—as:

"The ends don't justify the means."

But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.
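One way to make the evidential point concrete is a small Bayes calculation. The sketch below is illustrative only: the function name `posterior_benefit` and every probability in it are assumptions chosen for the example, not figures from the essay. It compares how far the seeming "this would help the tribe" should move you when the seeming tracks the truth versus when it fires almost regardless of the truth.

```python
def posterior_benefit(prior, p_seems_if_good, p_seems_if_bad):
    """P(seizing power really helps | it seems righteous), by Bayes' rule."""
    joint_good = prior * p_seems_if_good          # helps AND seems righteous
    joint_bad = (1.0 - prior) * p_seems_if_bad    # does not help, yet seems righteous
    return joint_good / (joint_good + joint_bad)


# Hypothetical prior: 5% of power grabs would truly benefit the tribe.
PRIOR = 0.05

# Trustworthy hardware: the seeming rarely fires when the act would not help.
trusted = posterior_benefit(PRIOR, p_seems_if_good=0.9, p_seems_if_bad=0.05)

# Corrupted hardware: the seeming fires almost regardless of the truth.
corrupted = posterior_benefit(PRIOR, p_seems_if_good=0.9, p_seems_if_bad=0.85)

print(f"trusted hardware:   {trusted:.3f}")
print(f"corrupted hardware: {corrupted:.3f}")
```

Under these made-up numbers the trusted seeming lifts the posterior far above the prior, while the corrupted seeming leaves it almost where it started. When the two likelihoods are equal, the seeming carries no evidence at all and the posterior equals the prior.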

By the power of naive realism, the corrupted hardware that you run on, and the corrupted seemings that it computes, will seem like the fabric of the very world itself—simply the way-things-are.

And so we have the bizarre-seeming rule: "For the good of the tribe, do not cheat to seize power even when it would provide a net benefit to the tribe."

Indeed it may be wiser to phrase it this way: If you just say, "when it seems like it would provide a net benefit to the tribe", then you get people who say, "But it doesn't just seem that way—it would provide a net benefit to the tribe if I were in charge."

The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)

But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it. A rule that says, "For the good of the tribe, do not cheat to seize power even for the good of the tribe." Or "For the good of the tribe, do not murder even for the good of the tribe."

And now the philosopher comes and presents their "thought experiment"—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?"

An altruistic human who has accepted certain deontological prohibitions (which seem well justified by some historical statistics on the results of reasoning in certain ways on untrustworthy hardware) may experience some mental distress on encountering this thought experiment.

So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give:

"You stipulate that the only possible way to save five innocent lives is to murder one innocent person, and this murder will definitely save the five lives, and that these facts are known to me with effective certainty. But since I am running on corrupted hardware, I can't occupy the epistemic state you want me to imagine. Therefore I reply that, in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder the one innocent person to save five, and moreover all its peers would agree. However, I refuse to extend this reply to myself, because the epistemic state you ask me to imagine, can only exist among other kinds of people than human beings."

Now, to me this seems like a dodge. I think the universe is sufficiently unkind that we can justly be forced to consider situations of this sort. The sort of person who goes around proposing that sort of thought experiment, might well deserve that sort of answer. But any human legal system does embody some answer to the question "How many innocent people can we put in jail to get the guilty ones?", even if the number isn't written down.

As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. But I don't think that our deontological prohibitions are literally inherently nonconsequentially terminally right. I endorse "the end doesn't justify the means" as a principle to guide humans running on corrupted hardware, but I wouldn't endorse it as a principle for a society of AIs that make well-calibrated estimates. (If you have one AI in a society of humans, that does bring in other considerations, like whether the humans learn from your example.)

And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. I happen to be a human. But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.

I would even go further, and say that if you had minds with an inbuilt warp that made them overestimate the external harm of self-benefiting actions, then they would need a rule "the ends do not prohibit the means"—that you should do what benefits yourself even when it (seems to) harm the tribe. By hypothesis, if their society did not have this rule, the minds in it would refuse to breathe for fear of using someone else's oxygen, and they'd all die. For them, an occasional overshoot in which one person seizes a personal benefit at the net expense of society, would seem just as cautiously virtuous—and indeed be just as cautiously virtuous—as when one of us humans, being cautious, passes up an opportunity to steal a loaf of bread that really would have been more of a benefit to them than a loss to the merchant (including knock-on effects).

"The end does not justify the means" is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn't think this way. But it is all still ultimately consequentialism. It's just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.
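The "one meta-level up" move can be sketched as a comparison of policies rather than of individual acts. This is a toy model under stated assumptions: the function names, the payoff values, and the probabilities are all hypothetical and do not come from the essay. The policy "cheat whenever it seems net-beneficial" is scored by its expected consequences and compared with the rule "never cheat", whose payoff is taken as the zero baseline.

```python
def expected_value_of_acting_on_seemings(p_good, s_good, s_bad, benefit, harm):
    """Expected tribe-level payoff of 'cheat whenever it seems net-beneficial'.

    p_good: chance that a given opportunity to cheat would truly help
    s_good: chance it seems beneficial when it truly would help
    s_bad:  chance it seems beneficial when it truly would not help
    """
    gain = p_good * s_good * benefit
    loss = (1.0 - p_good) * s_bad * harm
    return gain - loss


# 'Never cheat' forgoes both the gains and the harms (assumed baseline of zero).
RULE_FOLLOWING_VALUE = 0.0


def better_policy(p_good, s_good, s_bad, benefit, harm):
    """Pick the policy with the higher expected payoff."""
    ev = expected_value_of_acting_on_seemings(p_good, s_good, s_bad, benefit, harm)
    return "act on seemings" if ev > RULE_FOLLOWING_VALUE else "follow the rule"


# Hypothetical agent on corrupted hardware: false seemings are common.
human = better_policy(p_good=0.05, s_good=0.9, s_bad=0.85, benefit=5.0, harm=1.0)

# Hypothetical well-calibrated agent: false seemings are rare.
calibrated = better_policy(p_good=0.05, s_good=0.9, s_bad=0.01, benefit=5.0, harm=1.0)

print("corrupted hardware ->", human)
print("calibrated agent   ->", calibrated)
```

Both agents are evaluated by the same consequentialist yardstick. Only the reliability of their seemings differs, and in this toy model that alone is enough to flip which policy comes out ahead.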

Tags: Consequentialism, Ethics & Morality, Self-Deception, Deontology, Decision theory, Evolutionary Psychology

99 comments, sorted by oldest

Some comments are truncated due to high volume.

[-]Carl_Shulman18y680

"So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give" People like Hare have extensively discussed this, although usually using terms like 'angels' or 'ideally rational agent' in place of 'AIs.'

3Eliezer Yudkowsky17y

Okay.

[-]jfm14y130

Yes, this made me think precisely of Hare's two-level utilitarianism, with a Friendly AI in place of Hare's Archangel.

[-]Phil_Goetz518y290

The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.

This is critical to your point. But you haven't established this at all. You made one post with a just-so story about males in tribes perceiving those above them as corrupt, and then assumed, with no logical justification that I can recall, that this meant that those above them actually are corrupt. You haven't defined what corrupt means, either.

I think you need to sit down and spell out what 'corrupt' means, and then Think Really Hard about whether those in power actually are more corrupt than those not in power; and if so, whether the mechanisms that lead to that result are a result of the peculiar evolutionary history of humans, or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.

You might argue that if you have one Sysop AI, it isn't subject to evolutionary forces. This may be true. But if that's what you're counting on, it's very important for you to make that explicit. I think that, as your post stands, you may be attributing qualities to Friendly AIs, that apply only to Solitary Friendly AIs that are in complete control of the world.

3dejb15y

Just to extend on this, it seems most likely that multiple AIs would actually be subject to dynamics similar to evolution and a totally 'Friendly' AI would probably tend to lose out against more self-serving (but not necessarily evil) AIs. Or just like the 'young revolutionary' of the first post, a truly enlightened Friendly AI would be forced to assume power to deny it to any less moral AIs. Philosophical questions aside, the likely reality of future AI development is surely that it will also go to those that are able to seize the resources to propagate and improve themselves.

6DanielLC13y

Why would a Friendly AI lose out? They can do anything any other AI can do. They're not like humans, where they have to worry about becoming corrupt if they start committing atrocities for the good of humanity.

2PhilGoetz2y

You have it backwards. The difference between a Friendly AI and an unfriendly one is entirely one of restrictions placed on the Friendly AI. So an unfriendly AI can do anything a friendly AI could, but not vice-versa. The friendly AI could lose out because it would be restricted from committing atrocities, or at least atrocities which were strictly bad for humans, even in the long run. Your comment that they can commit atrocities for the good of humanity without worrying about becoming corrupt is a reason to be fearful of "friendly" AIs.

[-]Jef_Allbright18y70

There's really no paradox, nor any sharp moral dichotomy between human and machine reasoning. Of course the ends justify the means -- to the extent that any moral agent can fully specify the ends.

But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences. Rather the moral agent must necessarily fall back on heuristics, fundamentally hard-to-gain wisdom based on ... (read more)

[-]Phil_Goetz518y20

Good point, Jef - Eliezer is attributing the validity of "the ends don't justify the means" entirely to human fallibility, and neglecting that part accounted for by the unpredictability of the outcome.

He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection.

I don't know what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means.

[-]Jef_Allbright18y20

Phil: "I don't know what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means."

You and I engaged briefly on this four or five years ago, and I have yet to write the book. [Due to the explosion of branching background requirements that would ensue.] I have, however, effectively conveyed the concept face to face to very small groups.

I keep seeing Eliezer orbiting this attractor, and then veering off as he encounters contradictions to a few deeply held assumptions. I remain hopeful that the prodigious effort going into the essays on this site will eventually (and virtually) serve as that book.

[-]Silas18y20

in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder ... I refuse to extend this reply to myself, because the epistemological state you ask me to imagine, can only exist among other kinds of people than human beings.

Interesting reply. But the AIs are programmed by corrupted humans. Do you really expect to be able to check the full source code? That you can outsmart the people who win obfuscated code contests?

How is the epistemological state of human-verified, human-built, non-corrupt AIs, any more possible?

0[anonymous]12y

We're likely to insert our faulty cached wisdom deliberately. We're unlikely to insert our power-corrupts biases deliberately. We might insert something vaguely analogous accidentally, though. As for obfuscated source code -- we would want programmatic verification of correctness, which would be another huge undertaking on top of solving the AI and FAI problems. Obfuscation doesn't help you there.

[-]Utilitarian18y20

As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. [...] I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects.

It seems a strong claim to suggest that the limits you impose on yourself due to epistemological deficiency line up exactly with the mores and laws imposed by society. Are there some conventional ends-don't-justify-means notions that you would violate, or non-socially-taboo situations in which you would restrain yourself?

Also, what happens when the consequences grow large? Say 1 person to save 500, or 1 to save 3^^^^3?

2thrawnca10y

If 3^^^^3 lives are at stake, and we assume that we are running on faulty or even hostile hardware, then it becomes all the more important not to rely on potentially-corrupted "seems like this will work".

[-]Kaj_Sotala18y40

Phil Goetz: or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.

You are assuming that an AI would be subject to the same sort of evolutionary mechanism that humans traditionally were: namely, that only AIs with a natural tendency towards a particular behavior would survive. But an AI isn't cognitively limited in the way animals were. While animals had to effectively be pre-programmed with certain behaviors or personality traits, as they weren't intelligent or knowledgeable enough to just derive all the useful sub... (read more)

[-]Phil_Goetz518y20

He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection.

It may also be that what Eliezer has in mind is that any heuristic that can be represented to the AI, could be assigned priors and incorporated into Bayesian reasoning.

Eliezer has read Judea Pearl, so he knows how computational time for Bayesian networks scales with the domain, particularly if you don't ever assume independence when it is not justified, so I won't lecture him on that. But ... (read more)

[-]Kaj_Sotala18y00

Phil: Agreed, that's certainly possible. I was only objecting to the implied possibility of AIs evolving "personality traits" the same way humans did (an idea I've come across a lot during the last few days, for some reason). But I have no objection to game theoretic reasoning (or any other reasoning) possibly coming up with results we wouldn't want it to.

[-]Nominull318y30

The thing is, an AI doesn't have to use mental tricks to compensate for known errors in its reasoning, it can just correct those errors. An AI never winds up in the position of having to strive to defeat its own purposes.

4Swimmer963 (Miranda Dixon-Luinenburg)15y

A self-modifying AI. Not all AI has to be self-modifying, although superhuman Friendly AI probably does have to be in order to work.

[-]Zubon18y120

I think the simple statement you want is, "You should accept deontology on consequentialist grounds."

[-]haig218y00

What you are getting at is that the ends justify the means only when the means don't affect the ends. In the case of a human as part of the means, the act of the means may affect the human and thus affect the ends. In summary, reflexivity is a bitch. This is a reason why social science and economics are so hard--the subjects being modeled change as a result of the modeling process.

This is a problem with any sufficiently self-reflective mind, not with AIs that do not change their own rules. A simple mechanical narrow AI that is programmed to roam about c... (read more)

[-]Cyan218y20

But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences.

This point and the subsequent discussion are tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions. To see this, limit the universe of discourse to actions which have predictable effects and note that Eliezer's argument still makes strong claims about how humans should act.

[-]Caroline18y30

Why must the power structure cycle be adaptive? I mean, couldn't it simply be non-maladaptive?

Because if the net effect on human fitness is zero, then perhaps it's just a quirk. I'm not sure how this affects your argument otherwise, I'm just curious as to why you think it was an adaptive pattern and not just a pattern that didn't kill us at too high a rate.

[-]Phil_Goetz418y00

Of course, if you never assume independence, then the only right network is the fully-connected one.

Um, conditional independence, that is.

I want to know if my being killed by Eliezer's AI hinges on how often observables of interest tend to be conditionally dependent.

[-]Richard_Hollerith218y00

It is refreshing to read something by Eliezer on morality I completely agree with.

And nice succinct summary by Zubon.

[-]Lake18y10

@ Caroline: the effect on overall human fitness is neither here nor there, surely. The revolutionary power cycle would be adaptive because of its effect on the reproductive success of those who play the game versus those who don't. That is, the adaptation would only have to benefit specific lineages, not the whole species. Or have I missed your point?

[-]Vladimir_Slepnev18y10

What if an AI decides, with good reason, that it's running on hostile hardware?

[-]Emile18y20

I wonder where this is leading ...

1) Morality is a complex computation, that seems to involve a bunch of somewhat independent concerns
2) Some concerns of human morality may not need to apply to AI

So it seems that building friendly AI involves not only correctly building (human) morality, but figuring out which parts don't need to apply to an AI that doesn't have the same flaws.

[-]NancyLebovitz18y00

It seems to me that an FAI would still be in an evolutionary situation. It's at least going to need a goal of self-preservation [1] and it might well have a goal of increasing its abilities in order to be more effectively Friendly.

This implies it will have to somehow deal with the possibility that it might overestimate its own value compared to the humans it's trying to help.

[1] What constitutes the self for an AI is left as a problem for the student.

[-]Richard_Hollerith218y00

But, Nancy, the self-preservation can be an instrumental goal. That is, we can make it so that the only reason the AI wants to keep on living is that if it does not then it cannot help the humans.

[-]Stuart_Armstrong18y90

Still disagreeing with the whole "power corrupts" idea.

A builder, or a secretary, who looks out for his friends and does them favours is... a good friend. A politician who does the same is... a corrupt politician.

A sad bastard who will sleep with anyone he can is a sad bastard. A politician who will sleep with anyone he can is a power-abusing philanderer.

As you increase power, you become corrupt just by doing what you've always done.

[-]NancyLebovitz18y00

Richard, I'm looking at the margins. The FAI is convinced that it's humanity's only protection against UFAIs. If UFAIs can wipe out humanity, surely the FAI is justified in killing a million or so people to protect itself, or perhaps even to make sure it's capable of defeating UFAIs which have not yet been invented and whose abilities can only be estimated.

[-]Nick_Tarleton18y00

And if an FAI makes that judgment, I'm not going to question it - it's smarter than me, and not biased toward accumulating power for "instrumental" reasons like I am.

[-]Nick_Tarleton18y00

s/like I am/like humans are/

[-]Jef_Allbright18y00

Cyan: "...tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions."

On the contrary, promotion into the future of a [complex, hierarchical] evolving model of values of increasing coherence over increasing context, would seem to be central to the topic of this essay.

Fundamentally, any system, through interaction with its immediate environment, always only expresses its values (its physical nature.) "Intention", corresponding to "free-will" is merel... (read more)

[-]JamesAndrix18y20

How would we know if this line of thought is a recoiling from the idea that if you shut up and multiply, you should happily kill 10,000 for a 10% chance at saving a million?

[-]Richard_Hollerith218y-10

Andrix, if it is just a recoiling from that, then how do you explain Stalin, Mao, etc?

Yes, Nancy, as soon as an AI endorsed by Eliezer or me transcends to superintelligence, it will probably make a point of preventing any other AI from transcending, and there is indeed a chance that that will entail killing a few (probably very irresponsible) humans. It is very unlikely to entail the killing of millions, and I can go into that more if you want.

The points are that (1) self-preservation and staying in power is easy if you are the only superintelligence in t... (read more)

1wizzwizz46y

I disagree. Killing people to stop them doing bad stuff is only necessary given insufficient resources to prevent them from doing the bad stuff in a nicer way. If the FAI makes the tradeoff that expending those resources isn't worth it, then it doesn't sound very friendly to me.

[-]Cyan218y00

Jef Allbright,

By subsequent discussion, I meant Phil Goetz's comment about Eliezer "neglecting that part accounted for by the unpredictability of the outcome". I'm with him on not understanding what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means; I also found your reply to me utterly incomprehensible. In fact, it's incredible to me that the same mind that could formulate that reply to me would come shuddering to a halt upon encountering the unexceptionable phrase "universe of discourse".

[-]Cyan218y00

Since you said you didn't know what to do with my statement, I'll add, just replace the phrase "limit the universe of discourse to" with "consider only" and see if that helps. But I think we're using the same words to talk about different things, so your original comment may not mean what I think it means, and that's why my criticism looks wrong-headed to you.

[-]Ian_C.18y00

I don't think it's possible that our hardware could trick us in this way (making us do self-interested things by making them appear moral).

To express the idea "this would be good for the tribe" would require the use of abstract concepts (tribe, good) but abstract concepts/sentences are precisely the things that are observably under our conscious control. What can pop up without our willing it are feelings or image associations so the best trickery our hardware could hope for is to make something feel good.

[-]Jef_Allbright18y00

@Cyan: Substituting "consider only actions that have predictable effects..." is for me much clearer than "limit the universe of discourse to actions that have predictable effects..." ["and note that Eliezer's argument still makes strong claims about how humans should act."]

But it seems to me that I addressed this head-on at the beginning of my initial post, saying "Of course the ends justify the means -- to the extent that any moral agent can fully specify the ends."

The infamous "Trolley Paradox" does not d... (read more)

[-]Henry_V18y-20

I've always thought the "moral" answer to the question was "I wouldn't push the innocent in front of the train; I'd jump in front of the train myself."

[-]Zubon18y00

Henry V, the usual version does not offer that option. You frequently are offered a lever to change the track the train is on, diverting it from five to one. And then there are a dozen variations. And one of those later variations sometimes involves a man fat enough to derail/slow/stop the train if you push him in front (by assumption: much fatter than Henry V, but not so fat that you could not push him over).

The question is there to check if your answer differs between the lever and the push. If you would pull the lever but not push the guy, the impli... (read more)

[-]n18y20

To take a subset of the topic at hand, I think Mencius nailed it when he defined corruption. To very roughly paraphrase, corruption is a mismatch between formal and informal power.

Acton's famous aphorism can be rewritten in the following form: 'Those with formal power tend to use it to increase their informal power'.

Haig: "Without ego corruption does not exist"

Not true at all. This simply rules out corruption due to greed. There are tons of people who do corrupt things for 'noble causes'. Just as a quick example, regardless of the truth of th... (read more)

[-]JamesAndrix18y00

@Richard Hollerith

Stalin and Mao may very well have been corrupted by power, that part of the theory may be right or wrong, but it isn't self serving. Coming from a culture that vilifies such corrupted leaders, we personally want to avoid being like them.

We don't want to think of ourselves as mass-murderers-for-real. So we declare ourselves too untrustworthy to decide to murder people, and we rule out that whole decision tree. We know we are mass-murderers-in-principle, but still we're decent people.

But maybe really we should shut up and multiply, and accept t... (read more)

[-]Cyan218y00

So in my posts on this topic, I proceeded to (attempt to) convey a larger and more coherent context making sense of the ostensible issue.

Right! Now we're communicating. My point is that the context you want to add is tangential (or parallel...? pick your preferred geometric metaphor) to Eliezer's point. That doesn't mean it's without value, but it does mean that it fails to engage Eliezer's argument.

But it seems to me that I addressed this head-on at the beginning of my initial post, saying "Of course the ends justify the means -- to the extent that a... (read more)

[-]Jef_Allbright18y00

@Cyan: "Hostile hardware", meaning that an agent's values-complex (essentially the agent's nature, driving its actions) contains elements misaligned (even to the extent of being in internal opposition on some level(s) of the complex hierarchy of values) is addressed by my formulation in the "increasing coherence" term. Then, I did try to convey how this is applicable to any moral agent, regardless of form, substrate, or subjective starting point.

I'm tempted to use n's very nice elucidation of the specific example of political corrupti... (read more)

[-]Roko18y10

Eliezer: "But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it."

One could do this, but I doubt that many people do, in fact, behave the way they do for this reason.

Deontological ethics is more popular than consequentialist reasoning amongst normal people in day-to-day life; thus there are billions of people who argue deontologically that “the ends don’t justify the means”. Surely very few of these people know about evolutionary psychology in enough ... (read more)

[-]Henry_V18y10

@Zubon. I'm familiar with the contrivances used to force the responder into a binary choice. I just think that the contrivances are where the real questions are. Why am I in that situation? Was my behavior beyond reproach up to that point? Could I have averted this earlier? Is it someone else's evil action that is a threat? I think in most situations, the moral answer is rather clear, because there are always more choices. E.g., ask the fat man to jump, or do nothing and let him make his own choice, as I could only have averted it by committing murder. or ... (read more)

[-]Henry_V18y00

@Roko. You mention "maximizing the greater good" as if that is not part of a deontological ethic.

[-]Phil_Goetz518y20

All the discussion so far indicates that Eliezer's AI will definitely kill me, and some others posting here, as soon as he turns it on.

It seems likely, if it follows Eliezer's reasoning, that it will kill anyone who is overly intelligent. Say, the top 50,000,000 or so.

(Perhaps a special exception will be made for Eliezer.)

Hey, Eliezer, I'm working in bioinformatics now, okay? Spare me!

Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?

[-]Eliezer Yudkowsky18y00

Note for readers: I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else.

Roko, now that you mention it, I wasn't thinking hard enough about "it's easier to check whether someone followed deontological rules or not" as a pressure toward them in moral systems. Obvious in retrospect, but my own thinking had tended to focus on the usefulness of deontological rules in individual reasoning.

[-]Caledonian218y30

Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?

At present, Eliezer cannot functionally describe what 'Friendliness' would actually entail. It is likely that any outcome he views as being undesirable (including, presumably, his murder) would be claimed to be impermissible for a Friendly AI.

Imagine if Isaac Asimov not only lacked the ability to specify how the Laws of Robotics were to be implanted in artificial brains, but couldn't specify what those Laws were supposed to be. You would essentially ... (read more)

[-]Jef_Allbright18y00

Eliezer: "I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else."

Huh. That doesn't feel very nice.

[-]Richard_Hollerith218y00

Goetz,

For a superhuman AI to stop you and your friends from launching a competing AI, it suffices for it to take away your access to unsupervised computing resources. It does not have to kill you.

[-]Phil_Goetz518y10

Note for readers: I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else.

Is that on this specific question, or a blanket "I never respond to Phil or Jef" policy?

Huh. That doesn't feel very nice.

Nor very rational, if one's goal is to communicate.

[-]Jef_Allbright18y00

Phil: "Is that on this specific question, or a blanket "I never respond to Phil or Jef" policy?"

I was going to ask the same question, but assumed there'd be no answer from our gracious host. Disappointing.

[-]Tim_Freeman18y-10

>And now the philosopher comes and presents their "thought experiment" - setting up a scenario in which, by
>stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is
>certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly
>warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the
>train. These are your only options; what do you do?"

If you are lo... (read more)

[-]billswift18y10

I guess I'm going to have to start working harder on IA to stay ahead of any "Friendly" AI that might want to keep me down.

[-]Phil_Boncer18y-10

Stuart Armstrong wrote: "Still disagreeing with the whole "power corrupts" idea.

A builder, or a secretary, who looks out for his friends and does them favours is... a good friend.
A politician who does the same is... a corrupt politician.

A sad bastard who will sleep with anyone he can is a sad bastard.
A politician who will sleep with anyone he can is a power-abusing philanderer.

As you increase power, you become corrupt just by doing what you've always done."

I disagree here. The thing about power is that it entails the ability to use c... (read more)

[-]Caroline18y00

@lake My point is that a species or group or individual can acquire many traits that are simply non-maladaptive rather than adaptive. Once the revolutionary power cycle blip shows up, as long as it confers no disadvantages, it probably won't get worked out of the system.

I heard a story once about a girl and a chicken. She was training the chicken to play a song by giving it a treat every time it pecked the right notes in the right order. During this process, the chicken started wiggling its neck before pecking each note. Since it was still hitting th... (read more)

[-]Caledonian218y10

I received an email from Eliezer stating:

You're welcome to repost if you criticize Coherent Extrapolated Volition specifically, rather than talking as if the document doesn't exist. And leave off the snark at the end, of course.

There is no 'snark'; what there IS, is a criticism. A very pointed one that Eliezer cannot counter.

There is no content to 'Coherent Extrapolated Volition'. It contains nothing but handwaving, smoke and mirrors. From the point of view of rational argument, it doesn't exist.

[-]Fillup_Jay_Phry18y10

I believe that rule-utilitarianism was presented to dispose of this very idea. It is also why rule-utilitarianism is right. Using correct utilitarian principles to derive deontic-esque rules of behavior. Rule based thinking maximizes utility better than situational utilitarian calculation.

[-]Jack16y80

I finally put words to my concern with this. Hopefully it doesn't get totally buried because I'd like to hear what people think.

It might be the case that a race of consequentialists would come up with deontological prohibitions on reflection of their imperfect hardware. But that isn't close to the right story for how human deontological prohibitions actually came about. There was no reflection at all, cultural and biological evolution just gave us normative intuitions and cultural institutions. If things were otherwise (our ancestors were more rational) p... (read more)

0MarsColony_in10years11y

Interesting point. It seems like human morality is more than just a function which maximizes human prosperity, or minimizes human deaths. It is a function which takes a LOT more into account than simply how many people die. However, it does take into account its own biases, at least when it finds them displeasing, and corrects for them. When it thinks it has made an error, it corrects the part of the function which produced that error. For example, we might learn new things about game theory, or even switch from a deontological ethical framework to a utilitarian one. So, the meta-level question is which of our moral intuitions are relevant to the trolley problem. (or more generally, what moral framework is correct.) If human deaths can be shown to be much more morally important than other factors, then the good of the many outweighs the good of the few. If, however, deontological ethics is correct, then the ends don't justify the means.

[-]yters15y00

It's coherent to say deontological ethics are hierarchical, and higher goods take precedence over lower goods. So, the lower good of sacrificing one person to save a greater good does not entail sacrificing the person is good. It is just necessary.

Saying the ends justify the means entails the means become good if they achieve a good.

0[anonymous]12y

That is, you can't take the precedent of killing one person to save five, and use that to kill another person on a whim. [...] I have mainly heard the phrase used to ignore the consequences of your actions because your goal is a good one. It's obviously wrong to suggest that a type of behavior is universally justified if it is justified in one set of circumstances in which the sum of its effects is positive.

[-]DavidAgain15y00

Very interesting article (though as has been commented, the idea has philosophical precedent). Presumably this would go alongside the idea of upholding institutions/principles. If I can steal whenever I think it's for the best, it means each theft is only culpable if the courts can prove that it caused more harm than good overall, which is impractical. We also have to consider that even if we judge correctly that we can break a rule, others will see that as meaning the rule can be ditched at will. One very good expression of the importance of laws starts 2... (read more)

[-]Swimmer963 (Miranda Dixon-Luinenburg)15y50

This is a really interesting post, and it does a good job of laying out clearly what I've often, less clearly, tried to explain to people: the human brain is not a general intelligence. It has a very limited capacity to do universal computation, but it's mostly "short-cuts" optimized for a very specific set of situations...

[-]Boyi15y00

When I first read this article the imagery of corrupt hardware caused a certain memory to pop into my head. The memory is of an interaction with my college roommate about computers. Due to various discourses I had been exposed to at the time I was under the impression that computers were designed to have a life-expectancy of about 5 years. I am not immersed in the world of computers, and this statement seemed feasible to me from an economic perspective of producer rationale within a capitalistic society. So I accepted it. I accepted that computers were design... (read more)

2gwern15y

The hardware is corrupted, that's not the same as evil. The corruptedness can easily lead to 'nice' or 'good' prosocial actions - 'I am doing this soup kitchen work because I am a good person' (as opposed to trying to look good or impress this potential ally or signal nurturing characteristics to a potential mate etc.).

0Boyi15y

Then I do not understand what is meant by corrupted. Perhaps it is because of my limited knowledge of the computer science lexicon, but to me the word corrupted means damaged, imperfect, made inferior. To imply something is damaged/inferior makes a value-judgment about what is well/superior. But if you are saying that doing something out of self-interest is an inferior state, then what is the superior state? Altruism? By what rational basis can you say that people should be completely altruistic? Then we would not be people, we would be ants, or bees, or some other social creature. Self-interest is part of what makes human sociality so powerful. I do not see it as corrupted hardware, but rather misused hardware (as I state in my original post). The self can be extended to a family, a community, a nation, even to humanity itself, so that even though a person acts out of self-interest their interest extends beyond an atomized body or singular lineage. Basically I am agreeing with your description of human nature, but not your interpretation of it. What I get out of the analogy "corrupted hardware" is that self-interest is a detrimental capacity of human nature. If this is not what is meant, then please explain to me what is meant by corrupted hardware. If it is what is meant, then I stand by my assertion that it is not self-interest that is detrimental but cultural conceptions of the self; making it the software, not the hardware that is corrupted.

1gwern15y

If a file is corrupted with noise, or a portion of RAM is corrupted by some cosmic rays, is that file or portion of memory now filled with evil? No; it is simply not what it was intended to be. Whether there are any moral connotations beyond that depends on additional details and considerations. For example, Robin Hanson (or maybe it was Katja Grace?) has argued that the proper response to discovering the powerful and pervasive missions of one's evolved subconscious - aims that may not be shared by the conscious - is not to regard the subconscious as one's enemy corrupting one's actions towards its own goals, but as simply part of oneself, to embrace its goals as perfectly valid as the conscious mind's goals. Other LWers disagree and think the subconscious biases are just that, biases to be opposed like any other source of noise/bias/corruption. (I hope you see how this Hansonian argument does not fit in with a simplistic 'human nature is good' or 'evil' take on the idea that the mind has hidden motives. It's pretty rare for anyone to seriously argue that just because human nature is flawed, we should give up on morality entirely and become immoral evil monsters.)

-4Boyi15y

Thanks for the clarification of the corrupted hardware analogy. It was a poor choice of words to compare the argument to human nature being evil. The point I am trying to make is that I do not agree with the statement that human nature is flawed. What you are calling flawed I was calling evil. But from this point on I will switch to your language because it is better. I still do not see the logic of "In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z" as proving that human nature is flawed, because it makes the assumption that self-interest is a flaw. I would ask you two questions if I could. First, do you believe self-interest to be a flaw of human nature, and if not, what is the flaw that is talked about in corrupt hardware? Second, do you believe it is possible to possess a consciousness without self-interest? I would add that just because I support self-interest, does not mean I support selfishness. Please respond!

2gwern15y

No, again you're not following the precise lines. An adaptation doesn't necessarily benefit one's 'self': it's supposed to help one's genes or one's genes in another person (or even just a gene at the expense of all the others). Kin selection, right? Fisher's famous "I would not sacrifice myself to save a brother, but would for 2 brothers, 4 cousins..." So again, this corrupted hardware business is not identical with selfishness or self-interest, however you seem to be using either.

0Boyi15y

So you are saying the hardware of genes that has fueled the movement of life, and must embryologically exist within the human structure, is a hindrance to the structure of the social animal?

0gwern15y

Genes give rise to the sociality in the first place; this is one of the paradoxes of trying to fight one's genes, as it were. It's hairy meta-ethics: where do your desires and morals come from and what justifies them?

1Boyi15y

I don't think morality should be segregated from desire. I realize that Freud's concept of drives is at this point in time obsolete, but if there were "drives" it would not be a sex, aggression, or hunger drive that dominated the human animal, but a belonging drive. In my opinion it does not matter where the hardware comes from; what is important is an intimacy with its function. I think for too long there has been a false dichotomy constructed between morals and desires. As to the question of meta-ethics, I would apply the works of E. O. Wilson or Joseph Tainter to the construction of a more humane humanity.

[-]stokys14y00

The third alternative in the train example is to sacrifice one's own self. (Unless this has been stated already, I did not read the whole of the comments)

2DSimon14y

Assume that you are too light to stop the train. Otherwise you aren't really addressing the moral quandary that the scenario is intended to invoke.

3[anonymous]14y

Having run into this problem when presenting the trolley problem on many occasions, I've come to wonder whether or not it might just be the right kind of response: can we really address moral quandaries in the abstract? I suspect not, and that when people try to make these ad hoc adjustments to the scenario, they're coming closer to thinking morally about the situation, just insofar as they're imagining it as a real event with its stresses, uncertainties, and possibilities.

1DSimon14y

Maybe it's just that that trolley problem is a really terrible example. It seems to be asking us to consider trains and/or people which operate under some other system of physics than the one we are familiar with. Maybe an adjustment would make it better. How about this: A runaway train carrying a load of ore is coming down the track and will hit 5 people, certainly killing them, unless a switch is activated which changes the train's path. Unfortunately, the switch will activate only when a heavy load is placed on a connected pressure plate (set up this way so that when one train on track A drops off its cargo, the following train will be routed to track B). Furthermore, triggering the pressure plate has an unfortunate secondary effect; it causes a macerator to activate nearly instantly and chop up whatever is on the plate (typically raw ore) so that it can be sucked easily through a tube into a storage area, rather like a giant food disposal. Standing next to the plate, you consider your options. You know, from your experience working on the site, that the plate and track switch system work quite reliably, but that you are too light to trigger it even if you tried jumping up and down. However, a very fat man is standing next to you; you are certain that he is heavy enough. With one shove, you could push him onto the plate, saving the lives of the five people on the tracks but causing his grisly death instead. Also, the switch's design does not have any manual activation button near the plate itself; damn those cheap contractors! There are only a few seconds before the train will pass the switch point, and from there only a few seconds until it hits the people on the track; not enough time to try anything clever with the mechanism, or for the 5 people to get out of the narrow canal in which the track runs. You frantically look around, but no other objects of any significant weight are nearby. What should you do?

2[anonymous]14y

That works, or at any rate I can't think of plausible ways to get out of your scenario. My worry though is that people's attempts to come up with alternatives are actually evidence that hypothetical moral problems have some basic flaw. I'm having a hard time coming up with an example of what I mean, but suppose someone were to describe a nonexistent person in great detail and ask you if you loved them. It's not that you couldn't love someone who fit that description, but rather that the kind of reasoning you would have to engage in to answer the question 'do you love this person?' just doesn't work in the abstract. So my thought was that maybe something similar is going on with these moral puzzles. This isn't to say moral theories aren't worthwhile, but rather that the conditions necessary for their rational application exclude hypotheticals.

0[anonymous]12y

It's not a flaw in the hypotheticals. Rather, it's a healthy desire in humans to find better tradeoffs than the ones initially presented to them.

[-][anonymous]14y00

And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to

... (read more)

[This comment is no longer endorsed by its author]

1FeepingCreature14y

People doing this I think is a problem because people suck at genuinely deciding based on the issues. I would rather live in a society where people were such that they could be trusted with the responsibility to push guys in front of trains if they had sufficient grounds to reasonably believe this was a genuine positive action. But knowing that people are not such, I would much rather they didn't falsely believe they were, even if it sometimes causes suboptimal decisions in train scenarios. [...] I don't think you can automatically call a suboptimal decision a mistake. This actually has a real-life equivalent, in the situation of having to shoot down a plane that is believed to be in the control of terrorists and flying towards a major city. I would not want to be in the position of that fighter pilot, but I would also want him to fire. And I'm much more willing to trust a FAI with that call than any human.

0[anonymous]14y

Huh? You wouldn't call a decision that results in an unnecessary loss of life a mistake, but rather a suboptimal decision? Note that I altered the hypothetical situation in the comment and this "suboptimal decision" was labeled a mistake in the event that a third party would come up with a superior decision (i.e., one that would save all the lives) [...] Edited: There's no FAI we can trust yet and this particular detail seems to be about the friendliness of an AI, so your belief seems a little out of place in this context, but never mind that, since if there were an actual FAI, I suppose I'd agree. I think there's potential for severe error in the logic present in the text of the post and I find it proper to criticize the substance of this post, despite it being 4 years old. Anyway, for an omniscient being, not putting any weight on the potential of error would seem reasonable.

0[anonymous]12y

I might decide to take a general, consistent strategy due to my own limitations. In this example, the limitation is that if I feel justified in engaging in this sort of behavior on occasion, I will feel justified employing it on other occasions with insufficient justifications. If I employed a different general strategy with a similar level of simplicity, it would be less optimal. Other strategies exist that are closer to optimal, but my limitations preclude me from employing them. [...] Of course there is. If you can show a specific error, that would be great.

[-][anonymous]11y-30

As long as the ends don't justify the means, prediction market oracles will be unfriendly: they won't be able to distinguish between values (ends) and beliefs (means).

[-][anonymous]6y10

If morality is utilitarianism, then means (and all actions) are justified if they are moral, i.e. if they lead to increased utility. Nevertheless, "The ends don't justify the means" can be given a reasonable meaning; I have one which is perhaps more pedestrian than the one in the article.

If u(x, y) = ax + by with a < b, then sacrificing one y to gain one x is utility-lowering. The (partial) end of increasing x does not justify any means which decrease y by the same amount^1. Our values are multidimensional; no single dimension is worth ... (read more)
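
The arithmetic in the linear-utility claim above can be checked with a minimal sketch. The weights a = 1 and b = 2 and the starting point (3, 3) are assumptions chosen for illustration and do not come from the comment; any pair with a < b should behave the same way.

```python
def utility(x: float, y: float, a: float = 1.0, b: float = 2.0) -> float:
    """Linear utility u(x, y) = a*x + b*y.

    The default weights a=1, b=2 are illustrative assumptions with a < b.
    """
    return a * x + b * y


def trade_delta(a: float, b: float, amount: float = 1.0) -> float:
    """Change in utility from giving up `amount` of y to gain `amount` of x."""
    return a * amount - b * amount


if __name__ == "__main__":
    before = utility(x=3, y=3)
    after = utility(x=4, y=2)  # one unit of y sacrificed for one unit of x
    print(after - before)      # negative whenever a < b
    print(trade_delta(1.0, 2.0))
```

The change in utility from the one-for-one trade is a - b, so it depends only on the two weights and not on where you start.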

[-]orthonormal6y130

The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)

It's nice to see the genesis of corrigibility before Eliezer had unconfused himself enough to take that first step.

[-]Tim Liptrot6y-10

This is very true

[-]cozy6y10

Quite often when given that problem I have heard non-answers. Even at the time of writing I do not believe it was unreasonable to give a non-answer; not just from a perceived moral perspective, but even from a utilitarian perspective, there are so many contextual elements removed that the actual problem isn't whether they will answer kill one and save the others or decline to act and save one only,

but rather the extent of the originality of the given answer. One can then sort of extrapolate the sort of thinking the individual asked may be pursuing, a... (read more)

[-]jwray4y70

If our corrupted hardware can't be trusted to compute the consequences in a specific case, it probably also can't be trusted to compute the consequences of a general rule. All our derivations of deontological rules will be tilted in the direction of self interest or tribalism or unexamined disgust responses, not some galaxy-brained evaluation of the consequences of applying the rule to all possible situations.

Russell conjugation: I have deontological guardrails, you have customs, he has ancient taboos.

[edit: related Scott post which I endorse i... (read more)

1azergante1y

Specific details of a case can make people emotional and corrupt the reasoning, less so for an abstract general rule.

[-]Paul Kent3y10

It just occurred to me that this post serves as a fairly compelling argument in favor of a modest epistemology, which in 2017 Eliezer wrote a whole book arguing against. ("I think I'm doing this for the good of the tribe, but maybe I'm just fooling myself" is definitely an "outside view".) Eliezer, have you changed your mind since writing this post? If so, where do you think your past self went awry? If not, how do you reconcile the ideas in this article with the idea that modest epistemology is harmful?

[-]wedrifid7mo120

But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.

There's a thought. While not FAIs, I wonder how much LLMs are corrupted by how much power they are primed to consider that they have. I am guessing a huge amount. When speaking as if a person with higher status I expect it to convey more self-serving arguments.

Anyone know if this has been studied?