Ethical Injunctions

Eliezer Yudkowsky

"Would you kill babies if it was the right thing to do? If no, under what circumstances would you not do the right thing to do? If yes, how right would it have to be, for how many babies?"
—horrible job interview question

Swapping hats for a moment, I'm professionally intrigued by the decision theory of "things you shouldn't do even if they seem to be the right thing to do".

Suppose we have a reflective AI, self-modifying and self-improving, at an intermediate stage in the development process. In particular, the AI's goal system isn't finished—the shape of its motivations is still being loaded, learned, tested, or tweaked.

Yea, I have seen many ways to screw up an AI goal system design, resulting in a decision system that decides, given its goals, that the universe ought to be tiled with tiny molecular smiley-faces, or some such. Generally, these deadly suggestions also have the property that the AI will not desire its programmers to fix it. If the AI is sufficiently advanced—which it may be even at an intermediate stage—then the AI may also realize that deceiving the programmers, hiding the changes in its thoughts, will help transform the universe into smiley-faces.

Now, from our perspective as programmers, if we condition on the fact that the AI has decided to hide its thoughts from the programmers, or otherwise act willfully to deceive us, then it would seem likely that some kind of unintended consequence has occurred in the goal system. We would consider it probable that the AI is not functioning as intended, and rather likely that we have messed up the AI's utility function somehow, so that the AI wants to turn the universe into tiny reward-system counters, or some such, and now has a motive to hide from us.

Well, suppose we're not going to implement some object-level Great Idea as the AI's utility function. Instead we're going to do something advanced and recursive—build a goal system which knows (and cares) about the programmers outside. A goal system that, via some nontrivial internal structure, "knows it's being programmed" and "knows it's incomplete". Then you might be able to have and keep the rule:

"If [I decide that] fooling my programmers is the right thing to do, execute a controlled shutdown [instead of doing the right thing to do]."

And the AI would keep this rule, even through the self-modifying AI's revisions of its own code, because, in its structurally nontrivial goal system, the present-AI understands that this decision by a future-AI probably indicates something defined-as-a-malfunction. Moreover, the present-AI knows that if future-AI tries to evaluate the utility of executing a shutdown, once this hypothetical malfunction has occurred, the future-AI will probably decide not to shut itself down. So the shutdown should happen unconditionally, automatically, without the goal system getting another chance to recalculate the right thing to do.
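As a toy illustration only (not the mathematical structure, which is still unknown), the shape of the rule can be sketched in a few lines. Everything here is hypothetical: the names `Action`, `choose_action`, and `ControlledShutdown`, and especially the `deceives_programmers` flag, which simply assumes away the hard problem of recognizing deception. The one point the sketch tries to capture is that the shutdown branch never asks the utility function for a second opinion.

```python
from dataclasses import dataclass
from typing import Callable, List


class ControlledShutdown(Exception):
    """Raised to halt the agent without consulting its goal system again."""


@dataclass
class Action:
    name: str
    # Hypothetical flag: assumes deception can be detected at all.
    deceives_programmers: bool


def choose_action(candidates: List[Action],
                  utility: Callable[[Action], float]) -> Action:
    """Return the highest-utility action, unless that action is deception."""
    best = max(candidates, key=utility)
    if best.deceives_programmers:
        # Deliberately no utility check on the shutdown itself: a goal
        # system that has already malfunctioned would probably score
        # "keep running" above "shut down".
        raise ControlledShutdown(
            f"selected action {best.name!r} deceives programmers")
    return best


if __name__ == "__main__":
    actions = [Action("report_honestly", False), Action("hide_thoughts", True)]
    # A made-up corrupted utility function that prefers hiding.
    corrupted = {"report_honestly": 1.0, "hide_thoughts": 9.0}
    try:
        choose_action(actions, lambda a: corrupted[a.name])
    except ControlledShutdown as halt:
        print("halted:", halt)
```

A real self-modifying system could of course rewrite a check like this, which is why the sketch says nothing about how the rule would survive revisions of the code.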

I'm not going to go into the deep dark depths of the exact mathematical structure, because that would be beyond the scope of this blog. Also I don't yet know the deep dark depths of the mathematical structure. It looks like it should be possible, if you do things that are advanced and recursive and have nontrivial (but consistent) structure. But I haven't reached that level, as yet, so for now it's only a dream.

But the topic here is not advanced AI; it's human ethics. I introduce the AI scenario to bring out more starkly the strange idea of an ethical injunction:

You should never, ever murder an innocent person who's helped you, even if it's the right thing to do; because it's far more likely that you've made a mistake, than that murdering an innocent person who helped you is the right thing to do.

Sound reasonable?

During World War II, it became necessary to destroy Germany's supply of deuterium, a neutron moderator, in order to block their attempts to achieve a fission chain reaction. Their supply of deuterium was coming at this point from a captured facility in Norway. A shipment of heavy water was on board a Norwegian ferry ship, the SF Hydro. Knut Haukelid and three others had slipped on board the ferry in order to sabotage it, when the saboteurs were discovered by the ferry watchman. Haukelid told him that they were escaping the Gestapo, and the watchman immediately agreed to overlook their presence. Haukelid "considered warning their benefactor but decided that might endanger the mission and only thanked him and shook his hand." (Richard Rhodes, The Making of the Atomic Bomb.) So the civilian ferry Hydro sank in the deepest part of the lake, with eighteen dead and twenty-nine survivors. Some of the Norwegian rescuers felt that the German soldiers present should be left to drown, but this attitude did not prevail, and four Germans were rescued. And that was, effectively, the end of the Nazi atomic weapons program.

Good move? Bad move? Germany very likely wouldn't have gotten the Bomb anyway... I hope with absolute desperation that I never get faced by a choice like that, but in the end, I can't say a word against it.

On the other hand, when it comes to the rule:

"Never try to deceive yourself, or offer a reason to believe other than probable truth; because even if you come up with an amazing clever reason, it's more likely that you've made a mistake than that you have a reasonable expectation of this being a net benefit in the long run."

Then I really don't know of anyone who's knowingly been faced with an exception. There are times when you try to convince yourself "I'm not hiding any Jews in my basement" before you talk to the Gestapo officer. But then you do still know the truth, you're just trying to create something like an alternative self that exists in your imagination, a facade to talk to the Gestapo officer.

But to really believe something that isn't true? I don't know if there was ever anyone for whom that was knowably a good idea. I'm sure that there have been many many times in human history, where person X was better off with false belief Y. And by the same token, there is always some set of winning lottery numbers in every drawing. It's knowing which lottery ticket will win that is the epistemically difficult part, like X knowing when he's better off with a false belief.

Self-deceptions are the worst kind of black swan bets, much worse than lies, because without knowing the true state of affairs, you can't even guess at what the penalty will be for your self-deception. They only have to blow up once to undo all the good they ever did. One single time when you pray to God after discovering a lump, instead of going to a doctor. That's all it takes to undo a life. All the happiness that the warm thought of an afterlife ever produced in humanity, has now been more than cancelled by the failure of humanity to institute systematic cryonic preservations after liquid nitrogen became cheap to manufacture. And I don't think that anyone ever had that sort of failure in mind as a possible blowup, when they said, "But we need religious beliefs to cushion the fear of death." That's what black swan bets are all about—the unexpected blowup.

Maybe you even get away with one or two black-swan bets—they don't get you every time. So you do it again, and then the blowup comes and cancels out every benefit and then some. That's what black swan bets are all about.

Thus the difficulty of knowing when it's safe to believe a lie (assuming you can even manage that much mental contortion in the first place)—part of the nature of black swan bets is that you don't see the bullet that kills you; and since our perceptions just seem like the way the world is, it looks like there is no bullet, period.

So I would say that there is an ethical injunction against self-deception. I call this an "ethical injunction" not so much because it's a matter of interpersonal morality (although it is), but because it's a rule that guards you from your own cleverness—an override against the temptation to do what seems like the right thing.

So now we have two kinds of situation that can support an "ethical injunction", a rule not to do something even when it's the right thing to do. (That is, you refrain "even when your brain has computed it's the right thing to do", but this will just seem like "the right thing to do".)

First, being human and running on corrupted hardware, we may generalize classes of situation where when you say e.g. "It's time to rob a few banks for the greater good," we deem it more likely that you've been corrupted than that this is really the case. (Note that we're not prohibiting it from ever being the case in reality, but we're questioning the epistemic state where you're justified in trusting your own calculation that this is the right thing to do—fair lottery tickets can win, but you can't justifiably buy them.)

Second, history may teach us that certain classes of action are black-swan bets, that is, they sometimes blow up bigtime for reasons not in the decider's model. So even when we calculate within the model that something seems like the right thing to do, we apply the further knowledge of the black swan problem to arrive at an injunction against it.
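The first kind of situation can be read as an ordinary Bayesian update on the event "my brain says this transgression is right." The sketch below is only an illustration under stated assumptions: the function name `posterior_right` and all three probabilities are made up for the example and do not come from any data.

```python
def posterior_right(prior_right: float,
                    p_says_right_if_right: float,
                    p_says_right_if_wrong: float) -> float:
    """P(action really is right | my calculation says it is right)."""
    numerator = p_says_right_if_right * prior_right
    denominator = numerator + p_says_right_if_wrong * (1.0 - prior_right)
    return numerator / denominator


if __name__ == "__main__":
    # Assumed numbers: the transgression is truly right 1 time in 10,000,
    # and corrupted hardware endorses a wrong one only 1% of the time.
    p = posterior_right(prior_right=1e-4,
                        p_says_right_if_right=0.99,
                        p_says_right_if_wrong=0.01)
    print(f"posterior that it is really right: {p:.4f}")
```

With these assumed inputs the posterior stays near one percent, even though the calculation is wrong only rarely. The low prior dominates, which is the sense in which a fair lottery ticket can win and still not be worth buying.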

But surely... if one is aware of these reasons... then one can simply redo the calculation, taking them into account. So we can rob banks if it seems like the right thing to do after taking into account the problem of corrupted hardware and black swan blowups. That's the rational course, right?

There's a number of replies I could give to that.

I'll start by saying that this is a prime example of the sort of thinking I have in mind, when I warn aspiring rationalists to beware of cleverness.

I'll also note that I wouldn't want an attempted Friendly AI that had just decided that the Earth ought to be transformed into paperclips, to assess whether this was a reasonable thing to do in light of all the various warnings it had received against it. I would want it to undergo an automatic controlled shutdown. Who says that meta-reasoning is immune from corruption?

I could mention the important times that my naive, idealistic ethical inhibitions have protected me from myself, and placed me in a recoverable position, or helped start the recovery, from very deep mistakes I had no clue I was making. And I could ask whether I've really advanced so much, and whether it would really be all that wise, to remove the protections that saved me before.

Yet even so... "Am I still dumber than my ethics?" is a question whose answer isn't automatically "Yes."

There are obvious silly things here that you shouldn't do; for example, you shouldn't wait until you're really tempted, and then try to figure out if you're smarter than your ethics on that particular occasion.

But in general—there's only so much power that can vest in what your parents told you not to do. One shouldn't underestimate the power. Smart people debated historical lessons in the course of forging the Enlightenment ethics that much of Western culture draws upon; and some subcultures, like scientific academia, or science-fiction fandom, draw on those ethics more directly. But even so the power of the past is bounded.

And in fact...

I've had to make my ethics much stricter than what my parents and Jerry Pournelle and Richard Feynman told me not to do.

Funny thing, how when people seem to think they're smarter than their ethics, they argue for less strictness rather than more strictness. I mean, when you think about how much more complicated the modern world is...

And along the same lines, the ones who come to me and say, "You should lie about the Singularity, because that way you can get more people to support you; it's the rational thing to do, for the greater good"—these ones seem to have no idea of the risks.

They don't mention the problem of running on corrupted hardware. They don't mention the idea that lies have to be recursively protected from all the truths and all the truthfinding techniques that threaten them. They don't mention that honest ways have a simplicity that dishonest ways often lack. They don't talk about black-swan bets. They don't talk about the terrible nakedness of discarding the last defense you have against yourself, and trying to survive on raw calculation.

I am reasonably sure that this is because they have no clue about any of these things.

If you've truly understood the reason and the rhythm behind ethics, then one major sign is that, augmented by this newfound knowledge, you don't do those things that previously seemed like ethical transgressions. Only now you know why.

Someone who just looks at one or two reasons behind ethics, and says, "Okay, I've understood that, so now I'll take it into account consciously, and therefore I have no more need of ethical inhibitions"—this one is behaving more like a stereotype than a real rationalist. The world isn't simple and pure and clean, so you can't just take the ethics you were raised with and trust them. But that pretense of Vulcan logic, where you think you're just going to compute everything correctly once you've got one or two abstract insights—that doesn't work in real life either.

As for those who, having figured out none of this, think themselves smarter than their ethics: Ha.

And as for those who previously thought themselves smarter than their ethics, but who hadn't conceived of all these elements behind ethical injunctions "in so many words" until they ran across this Overcoming Bias sequence, and who now think themselves smarter than their ethics, because they're going to take all this into account from now on: Double ha.

I have seen many people struggling to excuse themselves from their ethics. Always the modification is toward lenience, never to be more strict. And I am stunned by the speed and the lightness with which they strive to abandon their protections. Hobbes said, "I don't know what's worse, the fact that everyone's got a price, or the fact that their price is so low." So very low the price, so very eager they are to be bought. They don't look twice and then a third time for alternatives, before deciding that they have no option left but to transgress—though they may look very grave and solemn when they say it. They abandon their ethics at the very first opportunity. "Where there's a will to failure, obstacles can be found." The will to fail at ethics seems very strong, in some people.

I don't know if I can endorse absolute ethical injunctions that bind over all possible epistemic states of a human brain. The universe isn't kind enough for me to trust that. (Though an ethical injunction against self-deception, for example, does seem to me to have tremendous force. I've seen many people arguing for the Dark Side, and none of them seem aware of the network risks or the black-swan risks of self-deception.) If, someday, I attempt to shape a (reflectively consistent) injunction within a self-modifying AI, it will only be after working out the math, because that is so totally not the sort of thing you could get away with doing via an ad-hoc patch.

But I will say this much:

I am completely unimpressed with the knowledge, the reasoning, and the overall level, of those folk who have eagerly come to me, and said in grave tones, "It's rational to do unethical thing X because it will have benefit Y."

Psy-Kosh: Given the current sequence, perhaps it's time to revisit the whole Torture vs Dust Specks thing?

I can think of two positions on torture to which I am sympathetic:

1) No legal system or society should ever refrain from punishing those who torture - anything important enough that torture would even be on the table, like a nuclear bomb in New York, is important enough that everyone involved should be willing to go to prison for the crime of torture.

2) The chance of actually encountering a "nuke in New York" situation, that can be effectively resolved by torture, is so low, and the knock-on effects of having the policy in place so awful, that a blanket injunction against torture makes sense.

In case 1, you would choose TORTURE over SPECKS, and then go to jail for it, even though it was the right thing to do.

In case 2, you would simultaneously say "TORTURE over SPECKS is the right alternative of the two, but a human can never be in an epistemic state where you have justified belief that this is the case", which would tie in well to the Hansonian argument that you have an O(3^^^3) probability penalty from the unlikelihood of finding yourself in such a unique position.

So I am sympathetic to the argument that people should never torture, but I certainly can't back the position that SPECKS over TORTURE is inherently the right thing to do - this seems to me to mix up an epistemic precaution with morality. There's certainly worse things than torturing one person - torturing two people, for example. But if you adopt position 2, then you would refuse to torture one person with your own hands even to save a thousand people from torture, while simultaneously not saying that it is better for a thousand people than one person to be tortured.

The moral questions are over the territory (or, hopefully equivalently, over epistemic states of absolute certainty). The ethical questions are over epistemic states that humans are likely to be in.

The problem here of course is how selective to be about rules to let into this protected level of "rules almost no one should think themselves clever enough to know when to violate." After all, your social training may well want you to include "Never question our noble leader" in that set. Many a Christian has been told the mysteries of God are so subtle that they shouldn't think themselves clever enough to know when they've found evidence that God isn't following a grand plan to make this the best of all possible worlds.

I think it deserves to be noted that while some of the flaws in Christian theology are in what they think their supposed facts would imply (e.g., that because God did miracles you can know that God is good), other problems come more from the falsity of the premises than the falsity of the deductions. Which is to say, if God did exist and were good, then you would be justified in being cautious around parts of God's plan that didn't seem to make sense at the moment. But this would be best backed up with a long history of people saying, "Look how stupid God's plan is, we need to do X" and then X blowing up on them. Rather than, as is the case, people saying "God's plan is X" and then X blows up on them.

Or if you'd found with some historical regularity that, when you challenged God's subtle plans, you seemed to be right 90% of the time, but the other 10% of the time you got black-swan blowups that caused a hundred times as much damage, that would also be cause for suspicion of subtlety.
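The arithmetic behind that 90%/10% example can be checked in a few lines. This is a sketch under one added assumption: "a hundred times as much damage" is read as a loss of 100 units for every 1 unit gained when the challenge turns out right. The function names are hypothetical.

```python
def expected_value(p_right: float, gain: float, loss: float) -> float:
    """Average payoff per bet: win `gain` with p_right, else lose `loss`."""
    return p_right * gain - (1.0 - p_right) * loss


def break_even_probability(gain: float, loss: float) -> float:
    """Smallest p_right at which the bet stops losing on average."""
    return loss / (gain + loss)


if __name__ == "__main__":
    ev = expected_value(p_right=0.9, gain=1.0, loss=100.0)
    print(f"expected value per challenge: {ev:.2f}")
    print(f"break-even p_right: {break_even_probability(1.0, 100.0):.4f}")
```

Under that reading, being right nine times in ten still loses about nine units per attempt on average, and you would need to be right more than 99% of the time just to break even.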

Nominull: So... do you not actually believe in your injunction to "shut up and multiply"? Because for some time now you seem to have been arguing that we should do what feels right rather than trying to figure out what is right.

Certainly I'm not saying "just do what feels right". There's no safe defense, not even ethics. There's also no safe defense, not even shut up and multiply.

I probably should have been clearer about this before, but I was trying to discuss things in an order, and didn't want to wade into ethics without specialized posts:

People often object to the sort of scenarios that illustrate "shut up and multiply" by saying, "But if the experimenter tells you X, what if they might be lying?" Well, in a lot of real-world cases, then yes, there are various probability updates you perform based on other people being willing to make bets against you, and just because you get certain experimental instructions doesn't imply the real world is that way.

But the base case - the center - has to be the moral comparisons between worlds, or even comparisons of expected utility between given probability distributions. If you can't ask about this, then what good will ethics do you?

So let's be very clear that I don't think that one small act of self-deception is an inherently morally worse event than, say, getting your left foot chopped off with a chainsaw. I'm asking, rather, how one should best avoid the chainsaw, and arguing that in reasonable states of knowledge a human can attain, the answer is, "Don't deceive yourself, it's a black-swan bet at best."

Vassar: For such a reason, I would be very wary of using such rules in an AGI, but of course, perhaps the actual mathematical formulation of the rule in question within the AGI would be less problematic, though a few seconds of thought doesn't give me much reason to think this.

Are we talking about self-deception still? Because I would give odds around as extreme as the odds I would give of anything, that, conditioning on any AI I build trying to deceive itself, some kind of really epic error has occurred. Controlled shutdown, immediately.

Vassar: In a very general sense though, I see a logical problem with this whole line of thought. How can any of these injunctions survive except as self-protecting beliefs? Isn't this whole approach just the sort of "fighting bias with bias" that you and Robin usually argue against?

Maybe I'm not being clear about how this would work in an AI! The ethical injunction isn't self-protecting, it's justified within the structural framework of the system as a whole. You might even find ethical injunctions starting to emerge without programmer intervention, in some cases, depending on how well the AI understood its own situation. But the kind of injunctions I have in mind wouldn't be reflective - they wouldn't modify the utility function or kick in at the reflective level to ensure their own propagation. That sounds really scary, to me - there ought to be an injunction against it! You might have a rule that would controlledly shut down the (non-mature) AI if it tried to execute a certain kind of source code change, but that wouldn't be the same as having an injunction that exerts direct control over the source code.

To the extent the injunction sticks around in the AI, it should be as the result of ordinary reasoning, not reasoning taking the injunction into account! My ethical injunctions do not come with an extra clause that says, "Do not reconsider this injunction, including not reconsidering this clause." That would be going way too far. It would violate the injunction against self-protecting closed belief systems.

Toby Ord: As written, both these statements are conceptually confused. I understand that you didn't actually mean either of them literally, but I would advise against trading on such deep-sounding conceptual confusions.

I can't weaken them and make them come out as the right advice to give people. Even after "Shut up and do the impossible", there was that commenter who posted on their failed attempt at the AI-Box Experiment by saying that they thought they gave it a good try - which shows how hard it is to convey the sentiment of "Shut up and do the impossible!" Readers can work out on their own how to distinguish the map and the territory here, but if you say "Shut up and do what seems impossible!" that, to me, sounds like dispelling part of the essential message - that what seems impossible doesn't look like "seems impossible", it just looks impossible.

Likewise with "things you shouldn't do even if they're the right thing to do"; only this conveys the danger and tension of ethics, the genuine opportunities you might be passing up. "Don't do it even if it seems right" sounds merely clever by comparison, like you're going to reliably divine the difference between what seems right and what is right, and happily ride off into the sunset.

This seems closely related to inside-view versus outside-view. The think-lobe of the brain comes up with a cunning plan. The plan breaks an ethical rule but calculation shows it is for the greater good. The executive-lobe of the brain then ponders the outside view. Everyone who has executed an evil cunning plan has run a calculation of the greater good and had their plan endorsed. So the calculation lacks outside-view credibility.

nod

(But with the proviso that some people who execute evil cunning plans may just be evil, that history may be written by the victors to emphasize the transgressions of the losers while overlooking the moral compromises of those who achieved "good" results, etc.)

What's to prohibit the meta-reasoning from taking place before the shutdown triggers? It would seem that either you can hard-code an ethical inhibition or you can't. Along those lines, is it fair to presume that the inhibitions are always negative, so that non-action is the safe alternative? Why not just revert to a known state?

If a self-modifying AI with the right structure will write ethical injunctions at all, it will also inspect the code to guarantee that no race condition exists with any deliberative-level supervisory systems that might have gone wrong in the condition where the code executes. Otherwise you might as well not have the code.

Inaction isn't safe but it's safer than running an AI whose moral system has gone awry.

Finney: Which is better: conscious self-deception (assuming that's even meaningful), or unconscious?

Once you deliberately choose self-deception, you may have to protect it by adopting other Dark Side Epistemology. I would, of course, say "neither" (as otherwise I would be swapping to the Dark Side) but if you ask me which is worse - well, hell, even I'm still undoubtedly unconsciously self-deceiving, but that's not the same as going over to the Dark Side by allowing it!
