"So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give" People like Hare have extensively discussed this, although usually using terms like 'angels' or 'ideally rational agent' in place of 'AIs.'
Yes, this made me think precisely of Hare's two-level utilitarianism, with a Friendly AI in place of Hare's Archangel.
"The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed."
This is critical to your point. But you haven't established this at all. You made one post with a just-so story about males in tribes perceiving those above them as corrupt, and then assumed, with no logical justification that I can recall, that this meant that those above them actually are corrupt. You haven't defined what corrupt means, either.
I think you need to sit down and spell out what 'corrupt' means, and then Think Really Hard about whether those in power actually are more corrupt than those not in power; and if so, whether the mechanisms that lead to that result are a result of the peculiar evolutionary history of humans, or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.
You might argue that if you have one Sysop AI, it isn't subject to evolutionary forces. This may be true. But if that's what you're counting on, it's very important for you to make that explicit. I think that, as your post stands, you may be attributing qualities to Friendly AIs that apply only to Solitary Friendly AIs in complete control of the world.
There's really no paradox, nor any sharp moral dichotomy between human and machine reasoning. Of course the ends justify the means -- to the extent that any moral agent can fully specify the ends.
But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences. Rather the moral agent must necessarily fall back on heuristics, fundamentally hard-to-gain wisdom based on ...
Good point, Jef - Eliezer is attributing the validity of "the ends don't justify the means" entirely to human fallibility, and neglecting that part accounted for by the unpredictability of the outcome.
He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection.
I don't know what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means.
Phil: "I don't know what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means."
You and I engaged briefly on this four or five years ago, and I have yet to write the book. [Due to the explosion of branching background requirements that would ensue.] I have, however, effectively conveyed the concept face to face to very small groups.
I keep seeing Eliezer orbiting this attractor, and then veering off as he encounters contradictions to a few deeply held assumptions. I remain hopeful that the prodigious effort going into the essays on this site will eventually (and virtually) serve as that book.
"in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder ... I refuse to extend this reply to myself, because the epistemological state you ask me to imagine, can only exist among other kinds of people than human beings."
Interesting reply. But the AIs are programmed by corrupted humans. Do you really expect to be able to check the full source code? That you can outsmart the people who win obfuscated code contests?
How is the epistemological state of human-verified, human-built, non-corrupt AIs, any more possible?
As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. [...] I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects.
It seems a strong claim to suggest that the limits you impose on yourself due to epistemological deficiency line up exactly with the mores and laws imposed by society. Are there some conventional ends-don't-justify-means notions that you would violate, or non-socially-taboo situations in which you would restrain yourself?
Also, what happens when the consequences grow large? Say 1 person to save 500, or 1 to save 3^^^^3?
Phil Goetz: or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.
You are assuming that an AI would be subject to the same sort of evolutionary mechanism that humans traditionally were: namely, that only AIs with a natural tendency towards a particular behavior would survive. But an AI isn't cognitively limited in the way animals were. While animals had to effectively be pre-programmed with certain behaviors or personality traits, as they weren't intelligent or knowledgeable enough to just derive all the useful sub...
"He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection."
It may also be that what Eliezer has in mind is that any heuristic that can be represented to the AI could be assigned priors and incorporated into Bayesian reasoning.
Eliezer has read Judea Pearl, so he knows how computational time for Bayesian networks scales with the domain, particularly if you don't ever assume independence when it is not justified, so I won't lecture him on that. But ...
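To make the scaling point concrete, here is a minimal Python sketch (my illustration, not anything from Pearl or from this thread; the 30-node example and its parent counts are invented). It just counts how many probabilities you must specify with and without conditional-independence assumptions; exact inference cost blows up in a similar way on densely connected networks.

# Illustrative only: parameter counts for a distribution over n binary variables.

def params_full_joint(n):
    # No independence assumptions: the full joint table needs 2**n - 1 numbers.
    return 2 ** n - 1

def params_bayes_net(parent_counts):
    # Bayesian network over binary nodes: a node with k parents needs
    # 2**k conditional probabilities for P(node = 1 | parents).
    return sum(2 ** k for k in parent_counts)

print(params_full_joint(30))                 # 1073741823
print(params_bayes_net([0, 1] + [2] * 28))   # 1 + 2 + 28*4 = 115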
Phil: Agreed, that's certainly possible. I was only objecting to the implied possibility of AIs evolving "personality traits" the same way humans did (an idea I've come across a lot during the last few days, for some reason). But I have no objection to game theoretic reasoning (or any other reasoning) possibly coming up with results we wouldn't want it to.
The thing is, an AI doesn't have to use mental tricks to compensate for known errors in its reasoning, it can just correct those errors. An AI never winds up in the position of having to strive to defeat its own purposes.
I think the simple statement you want is, "You should accept deontology on consequentialist grounds."
What you are getting at is that the ends justify the means only when the means don't affect the ends. In the case of a human as part of the means, the act of the means may affect the human and thus affect the ends. In summary, reflexivity is a bitch. This is a reason why social science and economics are so hard--the subjects being modeled change as a result of the modeling process.
This is a problem with any sufficiently self-reflective mind, not with AIs that do not change their own rules. A simple mechanical narrow AI that is programmed to roam about c...
But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences.
This point and the subsequent discussion are tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions. To see this, limit the universe of discourse to actions which have predictable effects and note that Eliezer's argument still makes strong claims about how humans should act.
Why must the power structure cycle be adaptive? I mean, couldn't it simply be non-maladaptive?
Because if the net effect on human fitness is zero, then perhaps it's just a quirk. I'm not sure how this affects your argument otherwise, I'm just curious as to why you think it was an adaptive pattern and not just a pattern that didn't kill us at too high a rate.
"Of course, if you never assume independence, then the only right network is the fully-connected one." Um, conditional independence, that is.
I want to know if my being killed by Eliezer's AI hinges on how often observables of interest tend to be conditionally dependent.
It is refreshing to read something by Eliezer on morality I completely agree with.
And nice succinct summary by Zubon.
@ Caroline: the effect on overall human fitness is neither here nor there, surely. The revolutionary power cycle would be adaptive because of its effect on the reproductive success of those who play the game versus those who don't. That is, the adaptation would only have to benefit specific lineages, not the whole species. Or have I missed your point?
I wonder where this is leading ... 1) Morality is a complex computation that seems to involve a bunch of somewhat independent concerns 2) Some concerns of human morality may not need to apply to AI
So it seems that building friendly AI involves not only correctly building (human) morality, but figuring out which parts don't need to apply to an AI that doesn't have the same flaws.
It seems to me that an FAI would still be in an evolutionary situation. It's at least going to need a goal of self-preservation [1] and it might well have a goal of increasing its abilities in order to be more effectively Friendly.
This implies it will have to somehow deal with the possibility that it might overestimate its own value compared to the humans it's trying to help.
[1] What constitutes the self for an AI is left as a problem for the student.
But, Nancy, the self-preservation can be an instrumental goal. That is, we can make it so that the only reason the AI wants to keep on living is that if it does not then it cannot help the humans.
Still disagreeing with the whole "power corrupts" idea.
A builder, or a secretary, who looks out for his friends and does them favours is... a good friend. A politician who does the same is... a corrupt politician.
A sad bastard who will sleep with anyone he can is a sad bastard. A politician who will sleep with anyone he can is a power-abusing philanderer.
As you increase power, you become corrupt just by doing what you've always done.
Richard, I'm looking at the margins. The FAI is convinced that it's humanity's only protection against UFAIs. If UFAIs can wipe out humanity, surely the FAI is justified in killing a million or so people to protect itself, or perhaps even to make sure it's capable of defeating UFAIs which have not yet been invented and whose abilities can only be estimated.
And if an FAI makes that judgment, I'm not going to question it - it's smarter than me, and not biased toward accumulating power for "instrumental" reasons like I am.
Cyan: "...tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions."
On the contrary, promotion into the future of a [complex, hierarchical] evolving model of values of increasing coherence over increasing context, would seem to be central to the topic of this essay.
Fundamentally, any system, through interaction with its immediate environment, always only expresses its values (its physical nature.) "Intention", corresponding to "free-will" is merel...
How would we know if this line of thought is a recoiling from the idea that if you shut up and multiply, you should happily kill 10,000 for a 10% chance at saving a million?
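(Spelling out the arithmetic behind that example, as an aside not in the original comment: acting has an expected payoff of 0.10 x 1,000,000 = 100,000 lives saved against 10,000 lives spent, so a naive shut-up-and-multiply calculation favors acting by a factor of ten.)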
Andrix, if it is just a recoiling from that, then how do you explain Stalin, Mao, etc?
Yes, Nancy, as soon as an AI endorsed by Eliezer or me transcends to superintelligence, it will probably make a point of preventing any other AI from transcending, and there is indeed a chance that that will entail killing a few (probably very irresponsible) humans. It is very unlikely to entail the killing of millions, and I can go into that more if you want.
The points are that (1) self-preservation and staying in power is easy if you are the only superintelligence in t...
Jef Allbright,
By subsequent discussion, I meant Phil Goetz's comment about Eliezer "neglecting that part accounted for by the unpredictability of the outcome". I'm with him on not understanding what "a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences" means; I also found your reply to me utterly incomprehensible. In fact, it's incredible to me that the same mind that could formulate that reply to me would come shuddering to a halt upon encountering the unexceptionable phrase "universe of discourse".
Since you said you didn't know what to do with my statement, I'll add, just replace the phrase "limit the universe of discourse to" with "consider only" and see if that helps. But I think we're using the same words to talk about different things, so your original comment may not mean what I think it means, and that's why my criticism looks wrong-headed to you.
I don't think it's possible that our hardware could trick us in this way (making us do self-interested things by making them appear moral).
To express the idea "this would be good for the tribe" would require the use of abstract concepts (tribe, good) but abstract concepts/sentences are precisely the things that are observably under our conscious control. What can pop up without our willing it are feelings or image associations so the best trickery our hardware could hope for is to make something feel good.
@Cyan: Substituting "consider only actions that have predictable effects..." is for me much clearer than "limit the universe of discourse to actions that have predictable effects..." ["and note that Eliezer's argument still makes strong claims about how humans should act."]
But it seems to me that I addressed this head-on at the beginning of my initial post, saying "Of course the ends justify the means -- to the extent that any moral agent can fully specify the ends."
The infamous "Trolley Paradox" does not d...
I've always thought the "moral" answer to the question was "I wouldn't push the innocent in front of the train; I'd jump in front of the train myself."
Henry V, the usual version does not offer that option. You frequently are offered a lever to change the track the train is on, diverting it from five to one. And then there are a dozen variations. And one of those later variations sometimes involves a man fat enough to derail/slow/stop the train if you push him in front (by assumption: much fatter than Henry V, but not so fat that you could not push him over).
The question is there to check if your answer differs between the lever and the push. If you would pull the lever but not push the guy, the impli...
To take a subset of the topic at hand, I think Mencius nailed it when he defined corruption. To very roughly paraphrase, corruption is a mismatch between formal and informal power.
Acton's famous aphorism can be rewritten in the following form: 'Those with formal power tend to use it to increase their informal power'.
Haig: "Without ego corruption does not exist"
Not true at all. This simply rules out corruption due to greed. There are tons of people who do corrupt things for 'noble causes'. Just as a quick example, regardless of the truth of th...
@Richard Hollerith
Stalin and Mao may very well have been corrupted by power; that part of the theory may be right or wrong, but it isn't self-serving. Coming from a culture that vilifies such corrupted leaders, we personally want to avoid being like them.
We don't want to think of ourselves as mass-murderers-for-real. So we declare ourselves too untrustworthy to decide to murder people, and we rule out that whole decision tree. We know we are mass-murderers-in-principle, but still we're decent people.
But maybe really we should shut up and multiply, and accept t...
So in my posts on this topic, I proceeded to (attempt to) convey a larger and more coherent context making sense of the ostensible issue.
Right! Now we're communicating. My point is that the context you want to add is tangential (or parallel...? pick your preferred geometric metaphor) to Eliezer's point. That doesn't mean it's without value, but it does mean that it fails to engage Eliezer's argument.
But it seems to me that I addressed this head-on at the beginning of my initial post, saying "Of course the ends justify the means -- to the extent that a...
@Cyan: "Hostile hardware", meaning that an agent's values-complex (essentially the agent's nature, driving its actions) contains elements misaligned (even to the extent of being in internal opposition on some level(s) of the complex hierarchy of values) is addressed by my formulation in the "increasing coherence" term. Then, I did try to convey how this is applicable to any moral agent, regardless of form, substrate, or subjective starting point.
I'm tempted to use n's very nice elucidation of the specific example of political corrupti...
Eliezer: "But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it."
One could do this, but I doubt that many people do, in fact, behave the way they do for this reason.
Deontological ethics is more popular than consequentialist reasoning amongst normal people in day-to-day life; thus there are billions of people who argue deontologically that “the ends don’t justify the means”. Surely very few of these people know about evolutionary psychology in enough ...
@Zubon. I'm familiar with the contrivances used to force the responder into a binary choice. I just think that the contrivances are where the real questions are. Why am I in that situation? Was my behavior beyond reproach up to that point? Could I have averted this earlier? Is it someone else's evil action that is a threat? I think in most situations, the moral answer is rather clear, because there are always more choices. E.g., ask the fat man to jump. Or do nothing and let him make his own choice, as I could only have averted it by committing murder. Or ...
@Roko. You mention "maximizing the greater good" as if that is not part of a deontological ethic.
All the discussion so far indicates that Eliezer's AI will definitely kill me, and some others posting here, as soon as he turns it on.
It seems likely, if it follows Eliezer's reasoning, that it will kill anyone who is overly intelligent. Say, the top 50,000,000 or so.
(Perhaps a special exception will be made for Eliezer.)
Hey, Eliezer, I'm working in bioinformatics now, okay? Spare me!
Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?
Note for readers: I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else.
Roko, now that you mention it, I wasn't thinking hard enough about "it's easier to check whether someone followed deontological rules or not" as a pressure toward them in moral systems. Obvious in retrospect, but my own thinking had tended to focus on the usefulness of deontological rules in individual reasoning.
"Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?"
At present, Eliezer cannot functionally describe what 'Friendliness' would actually entail. It is likely that any outcome he views as being undesirable (including, presumably, his murder) would be claimed to be impermissible for a Friendly AI.
Imagine if Isaac Asimov not only lacked the ability to specify how the Laws of Robotics were to be implanted in artificial brains, but couldn't specify what those Laws were supposed to be. You would essentially ...
Eliezer: "I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else."
Huh. That doesn't feel very nice.
Goetz,
For a superhuman AI to stop you and your friends from launching a competing AI, it suffices for it to take away your access to unsupervised computing resources. It does not have to kill you.
"Note for readers: I'm not responding to Phil Goetz and Jef Allbright. And you shouldn't infer my positions from what they seem to be arguing with me about - just pretend they're addressing someone else."
Is that on this specific question, or a blanket "I never respond to Phil or Jef" policy?
"Huh. That doesn't feel very nice."
Nor very rational, if one's goal is to communicate.
Phil: "Is that on this specific question, or a blanket "I never respond to Phil or Jef" policy?"
I was going to ask the same question, but assumed there'd be no answer from our gracious host. Disappointing.
> And now the philosopher comes and presents their "thought experiment" - setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?"
If you are lo...
I guess I'm going to have to start working harder on IA to stay ahead of any "Friendly" AI that might want to keep me down.
Stuart Armstrong wrote: "Still disagreeing with the whole "power corrupts" idea.
A builder, or a secretary, who looks out for his friends and does them favours is... a good friend.
A politician who does the same is... a corrupt politician.
A sad bastard who will sleep with anyone he can is a sad bastard.
A politician who will sleep with anyone he can is a power-abusing philanderer.
As you increase power, you become corrupt just by doing what you've always done."
I disagree here. The thing about power is that it entails the ability to use c...
@lake My point is that a species or group or individual can acquire many traits that are simply non-maladaptive rather than adaptive. Once the revolutionary power cycle blip shows up, as long as it confers no disadvantages, it probably won't get worked out of the system.
I heard a story once about a girl and a chicken. She was training the chicken to play a song by giving it a treat every time it pecked the right notes in the right order. During this process, the chicken started wiggling its neck before pecking each note. Since it was still hitting th...
I received an email from Eliezer stating:
You're welcome to repost if you criticize Coherent Extrapolated Volition specifically, rather than talking as if the document doesn't exist. And leave off the snark at the end, of course.
There is no 'snark'; what there IS, is a criticism. A very pointed one that Eliezer cannot counter.
There is no content to 'Coherent Extrapolated Volition'. It contains nothing but handwaving, smoke and mirrors. From the point of view of rational argument, it doesn't exist.
I believe that rule-utilitarianism was presented to dispose of this very idea. It is also why rule-utilitarianism is right: using correct utilitarian principles to derive deontic-esque rules of behavior. Rule-based thinking maximizes utility better than situational utilitarian calculation.
I finally put words to my concern with this. Hopefully it doesn't get totally buried because I'd like to hear what people think.
It might be the case that a race of consequentialists would come up with deontological prohibitions on reflection of their imperfect hardware. But that isn't close to the right story for how human deontological prohibitions actually came about. There was no reflection at all, cultural and biological evolution just gave us normative intuitions and cultural institutions. If things were otherwise (our ancestors were more rational) p...
It's coherent to say deontological ethics are hierarchical, and higher goods take precedence over lower goods. So the lower good of sacrificing one person to save a greater good does not entail that sacrificing the person is good; it is just necessary.
Saying the ends justify the means entails that the means become good if they achieve a good.
Very interesting article (though as has been commented, the idea has philosophical precedent). Presumably this would go alongside the idea of upholding institutions/principles. If I can steal whenever I think it's for the best, it means each theft is only culpable if the courts can prove that it caused more harm than good overall, which is impractical. We also have to consider that even if we judge correctly that we can break a rule, others will see that as meaning the rule can be ditched at will. One very good expression of the importance of laws starts 2...
This is a really interesting post, and it does a good job of laying out clearly what I've often, less clearly, tried to explain to people: the human brain is not a general intelligence. It has a very limited capacity to do universal computation, but it's mostly "short-cuts" optimized for a very specific set of situations...
When I first read this article, the imagery of corrupt hardware caused a certain memory to pop into my head. The memory is of an interaction with my college roommate about computers. Due to various discourses I had been exposed to at the time, I was under the impression that computers were designed to have a life-expectancy of about 5 years. I am not immersed in the world of computers, and this statement seemed feasible to me from an economic perspective of producer rationale within a capitalistic society. So I accepted it. I accepted that computers were design...
The third alternative in the train example is to sacrifice one's own self. (Unless this has been stated already, I did not read the whole of the comments)
...And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to
As long as the ends don't justify the means, prediction market oracles will be unfriendly: they won't be able to distinguish between values (ends) and beliefs (means).
If morality is utilitarianism, then means (and all actions) are justified if they are moral, i.e. if they lead to increased utility. Nevertheless, "The ends don't justify the means" can be given a reasonable meaning; I have one which is perhaps more pedestrian than the one in the article.
If u(x, y) = ax + by with a < b, then sacrificing one y to gain one x is utility-lowering. The (partial) end of increasing x does not justify any means which decrease y by the same amount^1. Our values are multidimensional; no single dimension is worth ...
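(To spell out the arithmetic implicit in that comment, as an aside of mine rather than the commenter's: trading one unit of y for one unit of x changes utility by u(x+1, y-1) - u(x, y) = a - b, which is negative whenever a < b, so the trade lowers utility no matter how desirable increasing x is in isolation.)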
The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)
It's nice to see the genesis of corrigibility before Eliezer had unconfused himself enough to take that first step.
Quite often when given that problem I have heard non-answers. Even at the time of writing, I do not believe it was unreasonable to give a non-answer; not just from a perceived moral perspective, but even from a utilitarian perspective, so many contextual elements have been removed that the actual problem isn't whether they will answer "kill one and save the others" or "decline to act and save one only," but rather the extent of the originality of the given answer. One can then sort of extrapolate the sort of thinking the individual asked may be pursuing, a...
If our corrupted hardware can't be trusted to compute the consequences in a specific case, it probably also can't be trusted to compute the consequences of a general rule. All our derivations of deontological rules will be tilted in the direction of self interest or tribalism or unexamined disgust responses, not some galaxy-brained evaluation of the consequences of applying the rule to all possible situations.
Russell conjugation: I have deontological guardrails, you have customs, he has ancient taboos.
[edit: related Scott post which I endorse i...
It just occurred to me that this post serves as a fairly compelling argument in favor of a modest epistemology, which in 2017 Eliezer wrote a whole book arguing against. ("I think I'm doing this for the good of the tribe, but maybe I'm just fooling myself" is definitely an "outside view".) Eliezer, have you changed your mind since writing this post? If so, where do you think your past self went awry? If not, how do you reconcile the ideas in this article with the idea that modest epistemology is harmful?
Yesterday I talked about how humans may have evolved a structure of political revolution, beginning by believing themselves morally superior to the corrupt current power structure, but ending by being corrupted by power themselves—not by any plan in their own minds, but by the echo of ancestors who did the same and thereby reproduced.
This fits the template: In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations operate to promote self-benefiting consequence Z.
From this proposition, I now move on to my main point, a question considerably outside the realm of classical Bayesian decision theory: What if I'm running on corrupted hardware?
In such a case as this, you might even find yourself uttering such seemingly paradoxical statements—sheer nonsense from the perspective of classical decision theory—as: "The end does not justify the means."
But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself—this seeming may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.
By the power of naive realism, the corrupted hardware that you run on, and the corrupted seemings that it computes, will seem like the fabric of the very world itself—simply the way-things-are.
And so we have the bizarre-seeming rule: "For the good of the tribe, do not cheat to seize power even when it would provide a net benefit to the tribe."
Indeed it may be wiser to phrase it this way: If you just say, "when it seems like it would provide a net benefit to the tribe", then you get people who say, "But it doesn't just seem that way—it would provide a net benefit to the tribe if I were in charge."
The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)
But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it. A rule that says, "For the good of the tribe, do not cheat to seize power even for the good of the tribe." Or "For the good of the tribe, do not murder even for the good of the tribe."
And now the philosopher comes and presents their "thought experiment"—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?"
An altruistic human, who has accepted certain deontological prohibitions—which seem well justified by some historical statistics on the results of reasoning in certain ways on untrustworthy hardware—may experience some mental distress, on encountering this thought experiment.
So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give:
"You stipulate that the only possible way to save five innocent lives is to murder one innocent person, and this murder will definitely save the five lives, and that these facts are known to me with effective certainty. But since I am running on corrupted hardware, I can't occupy the epistemic state you want me to imagine. Therefore I reply that, in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder the one innocent person to save five, and moreover all its peers would agree. However, I refuse to extend this reply to myself, because the epistemic state you ask me to imagine, can only exist among other kinds of people than human beings."
Now, to me this seems like a dodge. I think the universe is sufficiently unkind that we can justly be forced to consider situations of this sort. The sort of person who goes around proposing that sort of thought experiment, might well deserve that sort of answer. But any human legal system does embody some answer to the question "How many innocent people can we put in jail to get the guilty ones?", even if the number isn't written down.
As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. But I don't think that our deontological prohibitions are literally inherently nonconsequentially terminally right. I endorse "the end doesn't justify the means" as a principle to guide humans running on corrupted hardware, but I wouldn't endorse it as a principle for a society of AIs that make well-calibrated estimates. (If you have one AI in a society of humans, that does bring in other considerations, like whether the humans learn from your example.)
And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. I happen to be a human. But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.
I would even go further, and say that if you had minds with an inbuilt warp that made them overestimate the external harm of self-benefiting actions, then they would need a rule "the ends do not prohibit the means"—that you should do what benefits yourself even when it (seems to) harm the tribe. By hypothesis, if their society did not have this rule, the minds in it would refuse to breathe for fear of using someone else's oxygen, and they'd all die. For them, an occasional overshoot in which one person seizes a personal benefit at the net expense of society, would seem just as cautiously virtuous—and indeed be just as cautiously virtuous—as when one of us humans, being cautious, passes up an opportunity to steal a loaf of bread that really would have been more of a benefit to them than a loss to the merchant (including knock-on effects).
"The end does not justify the means" is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn't think this way. But it is all still ultimately consequentialism. It's just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.