Some quotes from Eliezer's contributions to the Global Catastrophic Risks anthology. First, from Cognitive biases potentially affecting judgement of global risks:

All else being equal, not many people would prefer to destroy the world. Even faceless corporations, meddling governments, reckless scientists, and other agents of doom, require a world in which to achieve their goals of profit, order, tenure, or other villainies. If our extinction proceeds slowly enough to allow a moment of horrified realization, the doers of the deed will likely be quite taken aback on realizing that they have actually destroyed the world. Therefore I suggest that if the Earth is destroyed, it will probably be by mistake.

And from Artificial Intelligence as a Positive and Negative Factor in Global Risk:

We can therefore visualize a possible first-mover effect in superintelligence. The first mover effect is when the outcome for Earth-originating intelligent life depends primarily on the makeup of whichever mind first achieves some key threshold of intelligence - such as criticality of self-improvement. The two necessary assumptions are these:

• The first AI to surpass some key threshold (e.g. criticality of self-improvement), if unFriendly, can wipe out the human species.
• The first AI to surpass the same threshold, if Friendly, can prevent a hostile AI from coming into existence or from harming the human species; or find some other creative way to ensure the survival and prosperity of Earth-originating intelligent life.

More than one scenario qualifies as a first-mover effect. Each of these examples reflects a different key threshold:

• Post-criticality, self-improvement reaches superintelligence on a timescale of weeks or less. AI projects are sufficiently sparse that no other AI achieves criticality before the first mover is powerful enough to overcome all opposition. The key threshold is criticality of recursive self-improvement.
• AI-1 cracks protein folding three days before AI-2. AI-1 achieves nanotechnology six hours before AI-2. With rapid manipulators, AI-1 can (potentially) disable AI-2's R&D before fruition. The runners are close, but whoever crosses the finish line first, wins. The key threshold is rapid infrastructure.
• The first AI to absorb the Internet can (potentially) keep it out of the hands of other AIs. Afterward, by economic domination or covert action or blackmail or supreme ability at social manipulation, the first AI halts or slows other AI projects so that no other AI catches up. The key threshold is absorption of a unique resource.

I think the first quote is exactly right. But it leaves out something important. The effects of someone's actions do not need to destroy the world in order to be very, very harmful. These definitions of Friendly and unFriendly AI are worth quoting (I don't know how consistently they're actually used by people associated with the SIAI, but they're useful for my purposes):

A "Friendly AI" is an AI that takes actions that are, on the whole, beneficial to humans and humanity; benevolent rather than malevolent; nice rather than hostile.  The evil Hollywood AIs of The Matrix or Terminator are, correspondingly, "hostile" or "unFriendly".

Again, an action does not need to destroy the world to be, on the whole, harmful to humans and humanity; malevolent rather than benevolent. An assurance that a human or humans will not do the former is no assurance that they will not do the latter. So if there ends up being a strong first-mover effect in the development of AI, we have to worry about the possibility that whoever gets control of the AI will use it selfishly, at the expense of the rest of humanity.

The title of this post says "halfway on purpose" instead of "on purpose," because in human history even the villains tend to see themselves as heroes of their own story. I've previously written about how we deceive ourselves so as to better deceive others, and how I suspect this is the most harmful kind of human irrationality.

Too many people--at least, too many writers of the kind of fiction where the villain turns out to be an all-right guy in the end--seem to believe that if someone is the hero of their own story and genuinely believes they're doing the right thing, they can't really be evil. But you know who was the hero of his own story and genuinely believed he was doing the right thing? Hitler. He believed he was saving the world from the Jews and promoting the greatness of the German volk.

We have every reason to think that the psychological tendencies that created these hero-villains are nearly universal. Evolution has no way to give us nice impulses for the sake of having nice impulses. Theory predicts, and observation confirms, that we tend to care more about blood relatives than mere allies, and more about allies than strangers. As Hume observed (remarkably, without any knowledge of Hamilton's rule), "A man naturally loves his children better than his nephews, his nephews better than his cousins, his cousins better than strangers, where every thing else is equal." And we care more about ourselves than about any single other individual on the planet (even if we might sacrifice ourselves for two brothers or eight cousins).
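
As a quick aside on the arithmetic behind "two brothers or eight cousins" (Haldane's quip): Hamilton's rule says an altruistic act is favored by selection when the benefit to relatives, discounted by relatedness, exceeds the cost to the actor. A minimal worked version:

```latex
% Hamilton's rule: altruism toward kin is favored when  r * B > C,
% where r = coefficient of relatedness, B = fitness benefit to the
% recipient(s), C = fitness cost to the actor.
%
% Sacrificing your own life (cost C) to save relatives is break-even at:
%   two full brothers:    2 * (1/2) * C = C   (r = 1/2 for full siblings)
%   eight first cousins:  8 * (1/8) * C = C   (r = 1/8 for first cousins)
\[
  rB > C, \qquad r_{\text{sibling}} = \tfrac{1}{2}, \qquad r_{\text{cousin}} = \tfrac{1}{8}.
\]
```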

Most of us are not murderers, but then most of us have never been in a situation where it would be in our interest to commit murder. The really disturbing thing is that there is much evidence that ordinary people can become monsters as soon as the situation changes. Science gives us the Stanford Prison Experiment and Milgram's experiment on obedience to authority; history gives us even more disturbing facts about how many soldiers commit atrocities in wartime. Of the soldiers who came from societies where atrocities are frowned on, most must have seemed perfectly normal before they went off to war. Probably most of them, if they'd thought about it, would have sincerely believed they were incapable of doing such things.

This makes a frightening amount of evolutionary sense. There's reason for evolution to give us, as much as possible, conditional rules for behavior, so that we only do certain things when it's fitness-increasing to do so. Normally, doing the kinds of things done during the Rape of Nanking leads to swift punishment, but the circumstances in which such things actually happen tend to be circumstances where punishment is much less likely, where the other guys are trying to kill you anyway and your superior officer is willing to at minimum look the other way. But if you're in a situation where doing such things is not in your interest, where's the evolutionary benefit of even being aware of what you're capable of?

Taking this all together, the risk is not that someone will deliberately use AI to harm humanity (do it on purpose). The risk is that they'll use AI to harm humanity for selfish reasons, while persuading themselves they're actually benefiting humanity (doing it halfway on purpose). If whoever gets control of a first-mover scenario sincerely believed, prior to gaining unlimited power, that they really wanted to be really, really careful not to do that, that's no assurance of anything, because they'll have been thinking that before the situation changed and gave the conditional rule "Screw over other people for personal gain if you're sure of getting away with it" a chance to trigger.

I don't want to find out what I'd do with unlimited power. Or rather, all else being equal I would like to find out, but I don't think putting myself in a position where I actually could find out would be worth the risk. This is in spite of the fact that my even worrying about these things may be a sign that I'd be less of a risk than other people. That should give you an idea of how little I would trust other people with such power.

The fact that Eliezer has stated his intention to have the Singularity Institute create FOOM-capable AI doesn't worry me much, because I think the SIAI is highly unlikely to succeed at that. I think if we do end up in a first-mover scenario, it will probably be the result of some project backed by a rich organization like IBM, the United States Department of Defense, or Google.

Forgetting about that, though, this looks to me like an absolutely crazy strategy. Eliezer has said creating FAI will be a very meta operation, and I think I heard him once mention putting prospective FAI coders through a lot of rationality training before beginning the process, but I have no idea why he would think those are remotely sufficient safeguards for giving a group of humans unlimited power. Even if you believe there's a significant risk that creating FOOM-capable FAI could be necessary to human survival, shouldn't, in that case, there be a major effort to first answer the question, "Is there any possible way to give a group of humans unlimited power without it ending in disaster?"

More broadly, given even a small chance that the future of AI will end up in some first-mover scenario, it's worth asking, "what can we do to prevent some small group of humans (the SIAI, a secret conspiracy of billionaires, a secret conspiracy of Google employees, whoever) from steering a first-mover scenario in a direction that's beneficial to themselves and perhaps their blood relatives, but harmful to the rest of humanity?"

21 comments

All of this being said, what do you think SIAI should do differently? FAI has to be built. It has to be built by a team of humans. SIAI will do its best to make sure those humans are aware of problems like this, and that's where I'd say it would be nice to know some concrete steps they could take to mitigate the risk. But aside from that, are there any other alternatives?

One important question is how long to delay attempting to build FAI, which requires balancing the risks that you'll screw it up against the risks that someone else will screw it up first. For people in the SIAI who seriously think they have a shot at making FOOM-capable AI first, the primary thing I'd ask them to do is pay more attention to the considerations above when making that calculation.

But given that I think it's extremely likely that some wealthier organization will end up steering any first-mover scenario (if one occurs, which it may not), I think it's also worth putting some work into figuring out what the SIAI could do to get those people (whoever they end up being) to behave altruistically in that scenario.

Too many people--at least, too many writers of the kind of fiction where the villain turns out to be an all-right guy in the end--seem to believe that if someone is the hero of their own story and genuinely believes they're doing the right thing, they can't really be evil. But you know who was the hero of his own story and genuinely believed he was doing the right thing? Hitler. He believed he was saving the world from the Jews and promoting the greatness of the German volk.

Was this bit absolutely necessary? It comes across as almost a cliched invocation of the Nazis. This would probably be better received if one had a different example or no example at all.

This is on the whole a minor criticism, more about presentation than anything else. Your general points seem worthwhile.

No, it wasn't absolutely necessary. I was deliberately being a bit clichéd. Normally I avoid clichés, but it felt right rhetorically here. I don't know if it really was.

Had I been extra-concerned about avoiding confusion, I would have also pointed out the difference between attempting to invoke the Nazis to prove a general rule, and invoking the Nazis as a counter-example, and that I was doing the latter here. I failed to do that. Oh well.

I think the disagreement may disappear if you taboo evil.

The fact that Eliezer has stated his intention to have the Singularity Institute create FOOM-capable AI doesn't worry me much, because I think the SIAI is highly unlikely to succeed at that.

That's largely my attitude. The Fellowship of the AI probably won't lead to a scenario like Gollum's final moments.

"what can we do to prevent some small group of humans (the SIAI, a secret conspiracy of billionaires, a secret conspiracy of Google employees, whoever) from steering a first-mover scenario in a direction that's beneficial to themselves and perhaps their blood relatives, but harmful to the rest of humanity?"

Actually, if they managed to do that, then they would have managed to build an FAI. The large (perhaps largest) risk, according to some (SIAI among them, I think, but I am not an expert), is that they think they are building an FAI (or perhaps an AI too weak to be really dangerous) but are mistaken in that assumption. In reality they have been building a uFAI that takes over the world, and humanity as a whole is doomed, including the small minority of humanity that the AI was supposed to be friendly to.

There seem to be three different problems here: analyzing how dangerous AIs in general are; if they are dangerous, how one can make an FAI, that is, an AGI that is at least beneficial to some; and, if an FAI can be built, to whom it should be friendly. As I interpret your post, you are discussing the third question and the dangers related to it, while hypothetically assuming that the small group building the AGI has managed to solve the second question? If so, you are not really discussing why someone would build a uFAI halfway on purpose, but why someone would build an FAI that is unfriendly to most humans?

Actually, if they managed to do that, then they have managed to build an FAI.

Abandoning the 99% may fail the "beneficial to humans" test.

That's not how I understood the "on the whole, beneficial to humans and humanity." It would benefit some humans, but it wouldn't fulfill the "on the whole" part of the quoted definition of Friendly AI.

That does, though, highlight some of the confusions that seem to surround the term "Friendly AI."


I don't want to find out what I'd do with unlimited power.

Well, I think the first thing you should do is make a substantial effort not to accidentally kill yourself when you use your powers, as the person with the outcome pump does in the linked story.

But the next thing you need to do (while bearing in mind not killing yourself) is to gain at least some ability to predict the behavior of the unlimited power. This appears to run contrary to a literal interpretation of your point. I DO want to find out at least some of what I would do with unlimited power, but I want to know that before ACTUALLY having unlimited power.

For instance, when you hand a person a gun, they need to be able to predict what the gun is going to do to be able to use it without injury. If they know nothing at all, they might do the possibly fatal combination of looking in the hole while pulling the trigger. If they are more familiar with guns, they might still injure themselves if the recoil breaks their bones when they fire it holding it the wrong way. I'm actually not familiar with guns personally, so there are probably plenty of other safety measures that haven't even occurred to me.

And as you point out, a flaw in our current method of elite selection is that our elites aren't as predictable as we'd like, and that makes them dangerous. Maybe they'll be good for humanity... maybe they'll turn us all into slaves. Once again, the unpredictability is dangerous.

And we can't just make ourselves the elites, because we aren't necessarily safe either.

In fact, realistically, saying "I don't want to find out what I'd do with unlimited power" might ITSELF be a sign of corruption. It means you can truthfully say "I don't think I would do anything evil," because you haven't thought about it. It still leaves you eligible for power later. It's easy to delude yourself that you're really a good person if you don't think about the specifics.

One possible next step might be to imagine a publicly known computer algorithm which can predict what humans will do given substantial amounts of new power. The algorithm doesn't need to be generally self-improving, or an oracle, or an AI. It just needs to be able to make that specific prediction relatively reliably, ideally in a way that defeats people trying to fool it, and in a way that is openly verifiable to everyone. (For obvious reasons, having this kind of determination be private doesn't work, or the current elites can just write it to support themselves.)

Note that problems with getting that kind of system wrong have certainly come up.

But if done correctly, we could then use that to evaluate "Would this human be likely to go cacklingly mad with power in a bad way?" If so, we could try to not let them have power. If EVERYONE is predicted to go cacklingly mad in a bad way, then we can update based on that as well.
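
For concreteness, here is a minimal sketch, in Python, of the gating logic imagined above. Everything in it (the CandidateRecord fields, the predictor, the threshold) is a hypothetical stand-in, since a reliable, openly verifiable predictor is exactly the part that doesn't exist:

```python
# Hypothetical sketch of the "predict abuse before granting power" gate
# described above. The predictor itself is the hard, unsolved part; here it
# is only a placeholder so that the decision logic is explicit.

from dataclasses import dataclass

@dataclass
class CandidateRecord:
    """Openly verifiable information about a candidate for a position of power."""
    name: str
    public_track_record: list       # past decisions, audits, disclosures
    adversarial_test_results: list  # scores from attempts to game the test (0 = clean, 1 = abusive)

def predicted_abuse_probability(candidate: CandidateRecord) -> float:
    """Stand-in for the publicly known prediction algorithm.

    In the proposal above this would be a well-validated, openly verifiable
    model; here it just averages test scores, defaulting to maximal distrust
    when there is no evidence at all.
    """
    scores = candidate.adversarial_test_results
    return sum(scores) / len(scores) if scores else 1.0

def may_hold_power(candidate: CandidateRecord, threshold: float = 0.1) -> bool:
    """Grant power only if the predicted probability of abuse is low.

    The threshold is arbitrary here; choosing it is itself a political decision.
    """
    return predicted_abuse_probability(candidate) < threshold

if __name__ == "__main__":
    alice = CandidateRecord("Alice", ["audited charity", "open books"], [0.05, 0.08])
    print(may_hold_power(alice))  # True under these made-up numbers
```

Even with such a predictor in hand, the objections below still apply; the sketch only shows where the prediction would plug in.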

What's weird is that if this is a good idea, it means that realistically, I would have to be supporting some sort of precrimish system, where elites would have to live with things like "The computer predicts that you are too likely to abuse your power at this time, and even though you haven't done that YET... we need to remove you from your position of power for everyone's safety."

But, if we are worried about elites who haven't done anything wrong yet abusing their power later, I don't see any way around solving it in advance that wouldn't restrict elites in some way.

Of course, this brings up a FURTHER problem, which is that in a lot of cases, humans aren't even capable of choosing their own elites. So even if we had a solid algorithm for judging who would be a good leader and who would be a bad leader, and even if people generally accepted this, we couldn't then take the candidates and simply install them without significant conflicts.

Unfortunately, realistically this post is getting substantially above the length at which I feel competent writing/thinking, so I should get some evaluation on it before continuing.

This relates to a theoretical question which is more interesting to me: are "individual extrapolated volitions" compatible or incompatible with a "collective extrapolated volition" (a la CEV)? Are there "natural classes" of individuals who, as a group, form their own coherent extrapolated volition that is compatible among themselves but incompatible with others' extrapolated volitions (e.g. men vs. women, smart people vs. dumb people, individualists vs. collectivists)? Does all of humanity happen to be compatible, or is every person individually incompatible with everyone else? In other words, how much natural coherence is there among individual extrapolated volitions, and how much would CEV require sacrificing individual extrapolated volitions for the collective extrapolated volition?

If it happened to work out that the entire collective of humanity could have a coherent extrapolated volition in which no individual extrapolated volition is sacrificed, that would be perfect and would relieve these uncomfortable questions. If that's not the case - should we proceed with CEV anyway? Or is it possible we may want to favor one group over another?

Of course we want to favor the group we are part of. Otherwise our CEV wouldn't differ.

Right. This raises the point that the programmer of the FAI would rationally not write it with CEV. He would use his own individual extrapolated volition. Whether that happened to be coherent with the extrapolated volitions of the rest of humanity is irrelevant (to him) - because obviously he would rationally favor himself or his group in any instance of incoherence.

what can we do to prevent some small group of humans (the SIAI, a secret conspiracy of billionaires, a secret conspiracy of Google employees, whoever) from steering a first-mover scenario in a direction that's beneficial to themselves and perhaps their blood relatives, but harmful to the rest of humanity?

Consumers can support those organisations with the most convincing promises and strategy - while boycotting those they don't want to see in charge.

Well there's one idea. Though getting it to work would, among other things, require convincing a large group of consumers that such a boycott would be worth it.

"A man naturally loves his children better than his nephews, his nephews better than his cousins, his cousins better than strangers, where every thing else is equal." And we care more about ourselves than about any single other individual on the planet (even if we might sacrifice ourselves for two brothers or eight cousins).

As a nitpick, that isn't necessarily true. People do sacrifice themselves for other individuals. Sometimes it even makes evolutionary sense - if they have a terminal illness, for instance.

One thing that might help is to make the decisions before you have the power--that is, to "lock in" the AI's choices on the basis of what prosocial-current-you (or the EV thereof) wants, before you actually have a superintelligence and your subconscious activates powerful-selfish-you. Then by the time you can be corrupted by power, it's too late to change anything. Coding the AI before launching it might be enough, especially if other people are watching and will punish cackling madness. On the other hand, being alone coding the AI would be a position of great power at least by proxy, and so might still be open to corruption.

You didn't talk about counterexamples: for example, lots of nations like the US have few corrupt politicians. There's an availability effect here. Philanthropists are boring, bad boys are interesting.

I actually agree this is a big danger, but I'm not sure it's unsolvable with the right individuals. Does power corrupt women, or lifelong meditators, or hopelessly geeky folks, or people whose entire career has consisted of researching this sort of thing, or people who fall into all four categories?

WRT the Stanford Prison Experiment: I've hung out with some elite university undergraduates, and mostly they seem full of youthful foolishness and much less reflective than the median Less Wrong user. All your other examples feature even worse candidates where power corruption is concerned. (Also, iirc the portion resisting authority in the Milgram experiment was decently large.)

And you haven't discussed the problem of corruption when an entire group, not self-selected on the basis of power-seeking tendencies, receives committee-style power. It seems like this could significantly curtail power-seeking tendencies.

My own view on politicians is that while their behavior is genuinely better than that of power holders in many other governments today and throughout history, it is also far from saintly, and much of the credit for the difference goes to the fact that they operate in a system where the incentives favor less-bad behavior.

Also, it seems to me that a group trying to build FOOM-capable AI is by definition self-selected on the basis of power-seeking tendencies. But maybe they could hand off final decision-making authority to someone else at the last stage?

But maybe they could hand off final decision-making authority to someone else at the last stage?

Why would or should they believe someone else would make better decisions than they would?

Well, maybe they wouldn't be. But if you think self-selection based on power-seeking is important, then maybe they would. IDK, I was just responding to John.