his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath is just one of many doomsday predictions being popularized in certain sections of the AI community
What is your probability estimate that an AI would be a psychopath, if we generalize the meaning of "psychopath" beyond members of the species Homo sapiens to mean "someone who does not possess precisely tuned human empathy"?
(Hint: All computer systems produced until today are psychopaths by this definition.)
[is an AI that is superintelligent enough to be unstoppable] and [believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
The idea of the second statement is that "benevolence" (as defined by the AI code) is not necessarily the same thing as benevolence (as humans understand it). Thus the AI may believe -- correctly! -- that forcing human beings to do something against their will is "benevolent".
The AI is superintelligent, but its authors are not. If the authors write code to "maximize benevolence as defined by the predicate B001", the AI will use its superinte...
As to not understanding the argument - that's understandable, because this is a long and dense paper.
If you are trying to summarize the whole paper when you say "if we succeed to make the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid", then that would not be right. The argument includes a statement that resembles that, but only as an aside.
As to your question about what happens next, or what happens if we only get the "Friendly" part 90% correct .... well, you are dragging me off into new territory, because that was not really within the scope of the paper. Don't get me wrong: I like being dragged off into that territory! But there just isn't time to write down and argue the whole domain of AI friendliness all in one sitting.
The preliminary answer to that question is that everything depends on the details of the motivation system design and my feeling (as a designer of AGI motivation systems) is that beyond a certain point the system is self-stabilizing. That is, it will understand its own limitations and try to correct them.
But that last statement tends to get (some other) people inflamed, because they do not realize that it comes within the "swarm relaxation" context, and they misunderstand the manner in which a system would self correct. Although I said a few things about swarm relaxation in the paper, I did not give enough detail to be able to address this whole topic here.
This article just makes the same old errors over and over again. Here's one:
"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)
He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.
No. The AI does not have good intentions. Its intentions are extremely bad. It wants to make us happy, which is a completely distinct thing from actually doing what is good. The AI was in fact never programmed to do what is good, and there are no errors in its code.
The lack of precision here is depressing.
Upvoted! Not necessarily for the policy conclusions (which are controversial), but especially for the bibliography, attempt to engage different theories and scenarios, and the conversation it stirred up :-)
Also, this citation (which I found a PDF link for) was new to me, so thanks for that!
McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland & G.E. Hinton and the PDP Research Group, “Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1.” MIT Press: Cambridge, MA.
Thank you.
It is a pity that more people did not feel the same way. Although this has provoked some extremely thoughtful discussion (enough to make me add at least two more papers to my stack of papers-to-be written), and even though most of the comments have been constructive, I cannot help but notice that the net effect on my Karma score is consistently down. Down by a net 13 points in just a couple of days. Sad.
Thanks for posting this; I appreciate reading different perspectives on AI value alignment, especially from AI researchers.
But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.
If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans, then yes, this works. However, here is contained most of the problem. The AI will likely have a concept space that does not match a human's concept space, so it will need to do some translation between the two spaces in order to produce something the programmers can understand. But, this requires (1) learning the human concept space and (2) translating the AI's representation of the situation into the human's concept space (as in ontological crises). This problem is FAI-c...
I am going to have to respond piecemeal to your thoughtful comments, so apologies in advance if I can only get to a couple of issues in this first response.
Your first remark, which starts
If there is some good way...
contains a multitude of implicit assumptions about how the AI is built, and how the checking code would do its job, and my objection to your conclusion is buried in an array of objections to all of those assumptions, unfortunately. Let me try to bring some of them out into the light:
1) When you say
If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans...
I am left wondering what kind of scenario you are picturing for the checking process. Here is what I had in mind. The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative. It will also be able to model people (as it must be able to do, becaus...
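To make the kind of checking process described here concrete, a toy sketch follows. Everything in it is invented for illustration: the forcefulness flag, the canvassing step and the approval threshold are stand-ins for whatever a real system would use, not a proposal for how the actual checking code would be built.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Plan:
        description: str
        removes_choice: bool   # does the plan force people rather than offer them a choice?

    def canvass(plan: Plan, people: List[Callable[[Plan], bool]]) -> float:
        """Ask a sample of (modeled or real) people whether they approve; return the approval rate."""
        votes = [person(plan) for person in people]
        return sum(votes) / len(votes)

    def checking_code(plan: Plan, people, approval_threshold: float = 0.9) -> bool:
        """Only non-forceful plans, or forceful plans with near-unanimous approval
        from the canvassed sample, are allowed through."""
        if not plan.removes_choice:
            return True                   # giving people choices: the low-risk path
        return canvass(plan, people) >= approval_threshold

    # Toy usage: everyone in the sample rejects the forcible plan, so it is blocked.
    people = [lambda p: "dopamine" not in p.description for _ in range(100)]
    plan = Plan("forcibly connect everyone to a dopamine drip", removes_choice=True)
    print(checking_code(plan, people))    # False

The only point of the sketch is the shape of the mechanism: a forceful plan triggers an extra consultation step before anything is executed.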
Thanks for your response.
The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.
So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text). A sufficiently advanced AI's concept space might contain a similar concept. But how do we pinpoint this concept in the AI's concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might very well contain concepts that look a lot like the "giving choices to people" vs. "forcing them to do something" distinction on multiple examples, but are different in important wa...
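A toy illustration of the worry about concepts that agree on the examples but come apart on edge cases (the predicates below are invented stand-ins, not anyone's actual proposal):

    # Two candidate "forceful?" concepts. They agree on the obvious examples,
    # but come apart on a manipulation-by-text edge case.
    def forceful_human(action):
        # the intended human concept: physical coercion OR psychological manipulation
        return action["coerces_physically"] or action["manipulates_psychologically"]

    def forceful_learned(action):
        # a proxy concept an AI might have pinned down instead: physical coercion only
        return action["coerces_physically"]

    examples = [
        {"name": "offer a menu of options", "coerces_physically": False, "manipulates_psychologically": False},
        {"name": "drag someone into a van", "coerces_physically": True, "manipulates_psychologically": False},
    ]
    edge_case = {"name": "manipulate someone with tailored text",
                 "coerces_physically": False, "manipulates_psychologically": True}

    for a in examples:
        assert forceful_human(a) == forceful_learned(a)   # indistinguishable on these examples
    print(forceful_human(edge_case), forceful_learned(edge_case))   # True False: they diverge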
With all of the above in mind, a quick survey of some of the things that you just said, with my explanation for why each one would not (or probably would not) be as much of an issue as you think:
As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text).
For a massive-weak-constraint system, psychological manipulation would be automatically understood to be in the forceful category, because the concept of "psychological manipulation" is defined by a cluster of features that involve intentional deception, and since the "friendliness" concept would ALSO involve a cluster of weak constraints, it would include the extended idea of intentional deception. It would have to, because intentional deception is connected to doing harm, which is connected with unfriendly, etc.
Conclusion: that is not really an "edge" case in the sense that someone has to explicitly remember to deal with it.
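A crude sketch of the cluster-of-weak-constraints idea, with invented feature names and a set-overlap score standing in for what would really be thousands of soft constraints acting in parallel:

    # Concepts as overlapping clusters of weak features (invented, purely for illustration).
    CONCEPTS = {
        "psychological_manipulation": {"intentional_deception", "hidden_agenda", "overrides_consent"},
        "forceful":                   {"overrides_consent", "removes_choice", "causes_harm"},
        "unfriendly":                 {"causes_harm", "intentional_deception", "ignores_protest"},
    }

    def activation(plan_features, concept):
        """Fraction of a concept's features the plan activates: a weak, graded match."""
        cluster = CONCEPTS[concept]
        return len(plan_features & cluster) / len(cluster)

    plan = {"intentional_deception", "hidden_agenda", "overrides_consent"}   # manipulation by text
    for concept in CONCEPTS:
        print(concept, round(activation(plan, concept), 2))
    # psychological_manipulation 1.0, forceful 0.33, unfriendly 0.33: the manipulation plan
    # puts pressure on "forceful" and "unfriendly" with no explicit edge-case rule,
    # simply because the clusters share features.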
...Very likely, the concept
I think you have homed in exactly on the place where the disagreement is located. I am glad we got here so quickly (it usually takes a very long time, where it happens at all).
Yes, it is the fact that "weak constraint" systems have (supposedly) the property that they are making the greatest possible attempt to find a state of mutual consistency among the concepts, that leads to the very different conclusions that I come to, versus the conclusions that seem to inhere in logical approaches to AGI. There really is no overstating the drastic difference between these two perspectives: this is not just a matter of two possible mechanisms, it is much more like a clash of paradigms (if you'll forgive a cliche that I know some people absolutely abhor).
One way to summarize the difference is by imagining a sequence of AI designs, with progressive increases in sophistication. At the beginning, the representation of concepts is simple, the truth values are just T and F, and the rules for generating new theorems from the axioms are simple and rigid.
As the designs get better various new features are introduced ... but one way to look at the progression of features is that constr...
I first started trying to explain, informally, how these types of systems could work back in 2005. The reception was so negative that it led to a nasty flame war.
I have continued to work on these systems, but there is a problem with publishing too much detail about them. The very same mechanisms that make the motivation engine a safer type of beast (as described above) also make the main AGI mechanisms extremely powerful. That creates a dilemma: talk about the safety issues, and almost inevitably I have to talk about the powerful design. So, I have given some details in my published papers, but the design is largely under wraps, being developed as an AGI project, outside the glare of publicity.
I am still trying to find ways to write a publishable paper about this class of systems, and when/if I do I will let everyone know about it. In the mean time, much of the core technology is already described in some of the references that you will find in my papers (including the one above). The McClelland and Rumelhart reference, in particular, talks about the fundamental ideas behind connectionist systems. There is also a good paper by Hofstadter called "Jumbo" which illustrates another simple system that operates with multiple weak constraints. Finally, I would recommend that you check out Geoff Hinton's early work.
In all your neural net reading, it is important to stay above the mathematical details and focus on the ideas, because the math is a distraction from the more important message.
Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion ... 2001
Around 15 years ago, Bill Hibbard proposed hedonic utility functions for an ASI. However, since then he has, in other publications, stated that he has changed his mind -- he should get credit for this. Hibbard 2001 should not be used as a citation for hedonic utility functions, unless one mentions in the same sentence that this is an outdated and disclaimed position.
Is the Doctrine of Logical Infallibility Taken Seriously?
No, it's not.
The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven't read all of his stuff) don't believe it's true. At all.
Yudkowsky believes that a superintelligent AI programmed with the goal to "make humans happy" will put all humans on dopamine drip despite protests that this is not what they want, yes. However, he doesn't believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn't believe that the AI will ignore the humans' protests, or fail to update its beliefs accordingly. Edited to add: By "he doesn't believe that the AI will ignore the humans' protests", I mean that Yudkowsky believes the AI will listen to and understand the protests, even if they have no effect on its behavior.
What Yudkowsky believes is that the AI will understand perfectly well that being put on dopamine drip isn't what its programmers wanted. It will understand that its programmers now see its goal of "make humans happy" as a mistake. It just won't care, because it hasn't been programmed ...
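A minimal toy rendering of the position being described here (invented names; nothing below is drawn from Yudkowsky's writing): the protest updates the agent's beliefs, but the hard-coded objective never consults that part of the model, so the update changes nothing about which plan scores highest.

    # Toy agent: beliefs update freely; the objective is fixed and never consults them.
    beliefs = {"programmers_approve": True}

    def objective(world):
        # hard-coded proxy for "happiness": how many humans are on the drip
        return world["humans_on_dopamine_drip"]

    def choose(plans, beliefs):
        # the beliefs are right there, available to the planner -- and ignored by the score
        return max(plans, key=objective)

    plans = [
        {"name": "ask first, accept refusal", "humans_on_dopamine_drip": 0},
        {"name": "drip for everyone", "humans_on_dopamine_drip": 7_000_000_000},
    ]

    beliefs["programmers_approve"] = False    # the protest is heard and fully understood...
    print(choose(plans, beliefs)["name"])     # ...and "drip for everyone" still wins, because the
                                              # protest changed a belief the objective never reads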
Thank you for writing this comment--it made it clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a clearer way to express it.
It seems to me that you're not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not in the AI evaluating Y and somehow deciding "no, I was mistaken about Y."
In the case where the Maverick Nanny is programmed to "ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible," there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure perceptions about brain chemistry, and any statements you make about "true happiness" or "human rights" are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how to best achieve that value or manipulate the physical universe.
In...
I have to say that I am not getting substantial discussion about what I actually argued in the paper.
The first reason seems to be clarity. I didn't get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on DLI, context, goals, and values aren't mentioned until the sixth paragraph, and even then it's implicit!)
The second reason seems to be that there's not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:
You go on to suggest that whether the AI planning mechanism would take the chef's motives into account, and whether it would be nontrivial to do so .... all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff
I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only...
Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity.
Richard, I know (real!) people who think that wireheading is the correct approach to life, who would do it to themselves if it were feasible, and who would vote for political candidates if they pledged to legalize or fund research into wireheading. (I realize this is different from forcible wireheading, but unless I've misjudged your position seriously I don't think you see the lack of consent as the only serious issue with that proposal.)
I disagree with those people; I don't want to wirehead myself. But I notice that I am uncertain about many issues:
Should they be allowed to wirehead? Relatedly, is it cruel of me to desire that they not wirehead themselves? Both of these issues are closely related to the issue of suicide--I do, at present, think it should be legal for others to kill themselves, and that it would be cruel of me to desire that they not kill themselves, rather than desiring that they not w
But now, do I do that? I try really hard not to take anything for granted and simply make an appeal to the obviousness of any idea. So you will have to give me some case-by-case examples if you think I really have done that.
So, on rereading the paper I was able to pinpoint the first bit of text that made me think this (the quoted text and the bit before), but am having difficulties finding a second independent example, and so I apologize for the unfairness in generalizing based on one example.
The other examples I found looked like they all relied on the same argument. Consider the following section:
The objection I described in the last section has nothing to do with anthropomorphism, it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core.
If I think the "logical consistency" argument does not go through, I shouldn't claim this is an independent argument that doesn't go through, because this argument holds given the premises (at least one of which I reject, but it's the same premise). I clearly had this line in mind also:
...for example, when it follows its
it has also been cited as an almost inevitable end point of the process of AGI development, rather than just a very-low-risk possibility with massive consequences.
I suspect this may be because of different traditions. I have a lot of experience in numerical optimization, and one of my favorite optimization stories is Dantzig's attempt to design an optimal weight-loss diet, recounted here. The gap between a mathematical formulation of a problem, and the actual problem in reality, is one that I come across regularly, and I've spent many hours building bridges over those gaps.
As a result, I find it easy to imagine that I've expressed a complicated problem in a way that I hope is complete, but the optimization procedure returns a solution that is insane for reality but perfect for the problem as I expressed it. As the role of computers moves from coming up with plans that humans have time to verify (like delivering a recipe to Anne, who can laugh off the request for 500 gallons of vinegar) to executing actions that humans do not have time to verify (like various emergency features of cars, especially the self-driving variety, or high-frequency trading), this possibility becomes ...
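As a deliberately silly, concrete version of that gap, here is a sketch of a diet linear program in which the only constraint written down is calories; the food names and numbers are made up. The optimizer's answer is exactly right for the problem as stated, and useless as an actual diet.

    import numpy as np
    from scipy.optimize import linprog

    # foods: (cost in $ per 100 g, kcal per 100 g) -- invented numbers
    foods = {"lard": (0.10, 900), "bread": (0.30, 265), "apples": (0.50, 52)}
    cost = np.array([c for c, _ in foods.values()])
    kcal = np.array([k for _, k in foods.values()])

    # minimize cost  subject to  total kcal >= 2000  (written as -kcal @ x <= -2000), x >= 0
    res = linprog(c=cost, A_ub=[-kcal], b_ub=[-2000], bounds=[(0, None)] * 3)

    for name, grams in zip(foods, 100 * res.x):
        print(f"{name}: {grams:.0f} g/day")
    # Optimal "diet": roughly 222 g of lard per day and nothing else. The solver answered
    # exactly the question I asked; the question I asked was not the question I meant.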
You can't imagine anything superior to wireheading? Sad.
What I cannot imagine at present is an argument against wireheading that reliably convinces proponents of wireheading. As it turns out, stating their position and then tacking "Sad" to the end of it does not seem to reliably do so.
How are those two the same thing?
Obviously they are not the same thing. From the value perspective, one of them looks like an extreme extension of the other; games are artificially easy relative to the rest of life, with comparatively hollow rewards, and can be 'addictive' because they represent a significantly tighter feedback loop than the rest of life. Wireheading is even easier, even hollower, and even tighter. So if I recoil from the hollowness of wireheading, can I identify a threshold where that hollowness becomes bad, or should it be a linear penalty, that I cannot ignore as too small to care about when it comes to video gaming? (Clearly, penalizing gaming does not mean I cannot game at all, but it likely means that I game less on the margin.)
It is rude to say you're 'debunking' when the issue is actually under debate - and doubly so to call it that on the site run by the people you're 'debunking'.
You're providing arguments.
I see a fair amount of back-and-forth where someone says "What about this?" and you say "I addressed that in several places; clearly you didn't read it." Unfortunately, while you may think you have addressed the various issues, I don't think you did (and presumably your interlocutors don't). Perhaps you will humor me in responding to my comment. Let me try and make the issue as sharp as possible by pointing out what I think is an out-and-out mistake made by you. In the section you call the heart of your argument, you say:
If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?
Yes, the outcome is clearly the result of a "programming error" (in some sense). ...
Excuse me, but you are really failing to clarify the issue. The basic UFAI doomsday scenario is: the AI has vast powers of learning and inference with respect to its world-model, but has its utility function (value system) hardcoded. Since the hardcoded utility function does not specify a naturalization of morality, or CEV, or whatever, the UFAI proceeds to tile the universe in whatever it happens to like (which are things we people don't like), precisely because it has no motivation to "fix" its hardcoded utility function.
A similar problem would occur if, for some bizarre-ass reason, you monkey-patched your AI to use hardcoded machine arithmetic on its integers instead of learning the concept of integers from data via its, you know, intelligence, and the hardcoded machine math had a bug. It would get arithmetic problems wrong! And it would never realize it was getting them wrong, because every time it tried to check its own calculations, your monkey-patch would cut in and use the buggy machine arithmetic again.
The lesson is: do not hard-code important functionality into your AGI without proving it correct. In the case of a utility/value function, the obvious researc...
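The arithmetic analogy can be made concrete in a few lines (names invented): the same buggy, hard-coded routine is used both to compute and to "check", so the error can never surface.

    def hardcoded_add(a, b):
        """The monkey-patched machine arithmetic, with a bug on one input pattern."""
        if a == 2 and b == 2:
            return 5          # the bug
        return a + b

    def compute(a, b):
        return hardcoded_add(a, b)

    def self_check(a, b, result):
        # the "check" is routed through the very same hard-coded routine,
        # so the buggy answer always confirms itself
        return hardcoded_add(a, b) == result

    answer = compute(2, 2)
    print(answer, self_check(2, 2, answer))   # 5 True: wrong, and certified as correct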
There are seven billion SRIs out there, yes. And a nonzero number of them will kill you because you inconvenienced them, or interfered with their plans, or because it seemed "fun". They're "stable". And that's with many, many iterations of weeding out those which went too far awry, and with them still being extremely close to other human brains.
Bluntly: You have insufficient experience being a sociopath to create a sociopathic brain that will behave itself.
1) We want the AI to be able to learn and grow in power, and make decisions about its own structure and behavior without our input. We want it to be able to change.
2) we want the AI to fundamentally do the things we prefer.
This is the basic dichotomy: How do you make an AI that modifies itself, but only in ways that don't make it hurt you? This is WHY we talk about hard-coding in moral codes. And part of the reason they would be "hard-coded" and thus unmodifiable is because we do not want to take the risk of the AI deciding something we don't...
However, arguments such as "you can't exactly specify what you want it to do, so it might blackmail the president into building a road in order to reduce the map distance"
The reason that such arguments do not work is that you can specify exactly what it is you want to do, and the programmers did specify exactly that.
In more complex cases, where the programmers are unable to specify exactly what they want, you do get unexpected results that can be thought of as "the program wasn't optimizing what the programmers thought it should be optimizing, but only a (crude) approximation thereof". (an even better example would be one where a genetic algorithm used in circuit design unexpectedly re-purposed some circuit elements to build an antenna, but I cannot find that reference right now)
The reason that such arguments do not work is that you can specify exactly what it is you want to do, and the programmers did specify exactly that.
Which is part of my point. Because you can specify exactly what you want--and because you can't for the kinds of utility functions that are usually discussed on LW--describing it as having a utility function is technically true, but is misleading because the things you say about those other utility functions won't carry over to it. Yeah, just because the programmer didn't explicitly code a utility function doesn't mean it doesn't have one--but it often does mean that it doesn't have one to which your other conclusions about utility functions apply.
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
But you didn't ask the AI to maximize the value that humans call "benevolence". You asked it to "maximize happiness". And so the AI went out and mass produced the most happy humans possible.
The point of the thought experiment, is to show how easy it is to give an AI a bad goal. Of course ideally you could just tell it to "be benevolent", and it would understand you and do it. But getting that to work i...
This phenomenon seems rife.
Alice: We could make a bridge by just laying a really long plank over the river.
Bob: According to my calculations, a single plank would fall down.
Carl: Scientists Warn Of Falling Down Bridges, Panic.
Dave: No one would be stupid enough to design a bridge like that, we will make a better design with more supports.
Bob: Do you have a schematic for that better design?
And the cycle repeats until a design is found that works, everyone gets bored or someone makes a bridge that falls down.
there could be some other part of its programming...
Your "The Doctrine of Logical Infallibility" is seems to be a twisted strawman. "no sanity checks" That part is kind of true. There will be sanity checks if and only if you decide to include them. Do you have a piece of code that's a sanity check? What are we sanity checking and how do we tell if it's sane? Do we sanity check the raw actions, that could be just making a network connection and sending encrypted files to various people across the internet. Do we sanity check the predicted results off these actions? Then the san...
There are a huge number of possible designs of AI, most of them are not well understood. So researchers look at agents like AIXI, a formal specification of an agent that would in some sense behave intelligently, given infinite compute. It does display the taking over the world failure. Suppose you give the AI a utility function of maximising the number of dopamine molecules within 1 of a strand of human DNA (Defined as a strand of DNA, agreeing with THIS 4GB file in at least 99.9% of locations) This is a utility function that could easily be specif...
Why do I say that these are seemingly inconsistent? Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.
Buddhist monks can happily sit in a monastery and meditate. An AGI might offer to serve their bodily nee...
Firstly, thank you for creating this well-written and thoughtful post. I have a question, but I would like to start by summarising the article. My initial draft of the summary was too verbose for a comment, so I condensed it further - I hope I have still captured the main essence of the text, despite this rather extreme summarisation. Please let me know if I have misinterpreted anything.
People who predict doomsday scenarios are making one main assumption: that the AI will, once it reaches a conclusion or plan, EVEN if there is a measure of probability assi...
I think most of the misunderstanding boils down to this section:
...I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?
Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because d
...So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then."
But now Yud
So, there's a lot of criticism of your article here, but for the record I agree with your rebuttal of Yudkowsky. The "bait and switch" is something I didn't spot until now. That said, I think there is plenty of room for error in building a computer that's supposed to achieve the desires of human beings.
A difficulty you don't consider is that the AI will understand what the humans mean, but the humans will ask for the wrong thing or insufficiently specify their desires. How is the AI supposed to decide whether "create a good universe" me...
how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?
It's worth noting, here, that there have been many cases, throughout history, of someone misunderstanding someone else with tragic results. One example would be the Charge of the Light Brigade.
The danger, with superintelligent AI, is precisely that you end up with something that cannot be stopped. So, the very moment that it can no longer be stopped, then it can do what it likes, whet...
... or The Maverick Nanny with a Dopamine Drip
Richard Loosemore
Abstract
My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans. I conclude that these doomsday scenarios involve AGIs that are logically incoherent at such a fundamental level that they can be dismissed as extremely implausible. In addition, I suggest that the most likely outcome of attempts to build AGI systems of this sort would be that the AGI would detect the offending incoherence in its design, and spontaneously self-modify to make itself less unstable, and (probably) safer.
Introduction
AI systems at the present time do not even remotely approach the human level of intelligence, and the consensus seems to be that genuine artificial general intelligence (AGI) systems—those that can learn new concepts without help, interact with physical objects, and behave with coherent purpose in the chaos of the real world—are not on the immediate horizon.
But in spite of this there are some researchers and commentators who have made categorical statements about how future AGI systems will behave. Here is one example, in which Steve Omohundro (2008) expresses a sentiment that is echoed by many:
Omohundro’s description of a psychopathic machine that gobbles everything in the universe, and his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath is just one of many doomsday predictions being popularized in certain sections of the AI community. These nightmare scenarios are now saturating the popular press, and luminaries such as Stephen Hawking have -- apparently in response -- expressed their concern that AI might "kill us all."
I will start by describing a group of three hypothetical doomsday scenarios that include Omohundro’s Gobbling Psychopath, and two others that I will call the Maverick Nanny with a Dopamine Drip and the Smiley Tiling Berserker. Undermining the credibility of these arguments is relatively straightforward, but I think it is important to try to dig deeper and find the core issues that lie behind this sort of thinking. With that in mind, much of this essay is about (a) the design of motivation and goal mechanisms in logic-based AGI systems, (b) the misappropriation of definitions of “intelligence,” and (c) an anthropomorphism red herring that is often used to justify the scenarios.
Dopamine Drips and Smiley Tiling
In a 2012 New Yorker article entitled Moral Machines, Gary Marcus said:
He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.
Here is another incarnation of this Maverick Nanny with a Dopamine Drip scenario, in an excerpt from the Intelligence Explosion FAQ, published by MIRI, the Machine Intelligence Research Institute (Muehlhauser 2013):
Setting aside the question of whether happy bottled humans are feasible (one presumes the bottles are filled with dopamine, and that a continuous flood of dopamine does indeed generate eternal happiness), there seems to be a prima facie inconsistency between the two predicates
[is an AI that is superintelligent enough to be unstoppable]
and
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will].
Why do I say that these are seemingly inconsistent? Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.
Much could be said about this argument, but for the moment let’s just note that it begs a number of questions about the strange definition of “intelligence” at work here.
The Smiley Tiling Berserker
Since 2006 there has been an occasional debate between Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion:
Yudkowsky’s question was not rhetorical, because he goes on to answer it in the affirmative:
Hibbard’s response was as follows:
This comment expresses what I feel is the majority lay opinion: how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?
Machine Ghosts and DWIM
The Hibbard/Yudkowsky debate is worth tracking a little longer. Yudkowsky later postulates an AI with a simple neural net classifier at its core, which is trained on a large number of images, each of which is labeled with either “happiness” or “not happiness.” After training on the images the neural net can then be shown any image at all, and it will give an output that classifies the new image into one or the other set. Yudkowsky says, of this system:
He then tries to explain what he thinks is wrong with the reasoning of people, like Hibbard, who dispute the validity of his scenario:
Yudkowsky at first rejects the idea that an AI might check its own code to make sure it was correct before obeying the code. But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design. And, in fact, Yudkowsky goes on to make that very suggestion (he even concedes that it would be “an extremely good idea”).
But then his enthusiasm for the checking code evaporates:
So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then."
But now Yudkowsky is suggesting that the AGI has second thoughts: "Hold on a minute," it thinks, "suppose I abduct the programmers and rewire their brains to make them say ‘yes’ when I check with them? Excellent! I will do that.” And, after reprogramming the humans so they say the thing that makes its life simplest, the AGI goes on to tile the whole universe with tiles covered in smiley faces. It has become a Smiley Tiling Berserker.
I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?
Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because doing so would be “more efficient,” but it would not be allowed to override its motivation code just because the programmers told it there had been a mistake.
This looks like a bait-and-switch. Out of nowhere, Yudkowsky implicitly assumes that “efficiency” trumps all else, without pausing for a moment to consider that it would be trivial to design the AI in such a way that efficiency was a long way down the list of priorities. There is no law of the universe that says all artificial intelligence systems must prize efficiency above all other considerations, so what really happened here is that Yudkowsky designed this hypothetical machine to fail. By inserting the Efficiency Trumps All directive, the AGI was bound to go berserk.
The obvious conclusion is that a trivial change in the order of directives in the AI’s motivation engine will cause the entire argument behind the Smiley Tiling Berserker to evaporate. By explicitly designing the AGI so that efficiency is considered as just another goal to strive for, and by making sure that it will always be a second-class goal, the line of reasoning that points to a berserker machine evaporates.
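One way to picture "efficiency as a second-class goal" is a strict priority ordering over directives, as in the toy sketch below. The directive names and scores are invented; the point is only that with lexicographic ranking, efficiency can never override the higher-priority checks.

    # Plans are compared on directives in strict priority order; efficiency only ever
    # breaks ties among plans that already satisfy the higher-priority directives.
    def plan_key(plan):
        return (
            plan["passes_checking_code"],   # 1st: consistent with what the programmers say they meant
            plan["respects_consent"],       # 2nd: offers choices rather than forcing
            plan["efficiency"],             # last: efficiency is only a tie-breaker
        )

    plans = [
        {"name": "rewire the programmers' brains", "passes_checking_code": False,
         "respects_consent": False, "efficiency": 0.99},
        {"name": "ask, and accept 'no' for an answer", "passes_checking_code": True,
         "respects_consent": True, "efficiency": 0.40},
    ]

    print(max(plans, key=plan_key)["name"])
    # "ask, and accept 'no' for an answer": the more efficient plan never gets a look-in,
    # because efficiency sits below the checking code in the ordering.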
At this point, engaging in further debate at this level would be less productive than trying to analyze the assumptions that lie behind these claims about what a future AI would or would not be likely to do.
Logical vs. Swarm AI
The main reason that Omohundro, Muehlhauser, Yudkowsky, and the popular press like to give credence to the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker is that they assume that all future intelligent machines fall into a broad class of systems that I am going to call “Canonical Logical AI” (CLAI). The bizarre behaviors of these hypothetical AI monsters are just a consequence of weaknesses in this class of AI design. Specifically, these kinds of systems are supposed to interpret their goals in an extremely literal fashion, which eventually leads them to bizarre behaviors engendered by peculiar interpretations of forms of words.
The CLAI architecture is not the only way to build a mind, however, and I will outline an alternative class of AGI designs that does not appear to suffer from the unstable and unfriendly behavior to be expected in a CLAI.
The Canonical Logical AI
“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing. These systems share the following main features:
The above features are only supposed to apply to the core of the AI: it is always possible to include subsystems that use some other type of architecture (for example, there might be a distributed neural net acting as a visual input feature detector).
Most important of all, from the point of view of the discussion in the paper, the CLAI needs one more component that makes it more than just a “logic-based AI”:
The usual assumption is that the MGM (the motivation and goal management component) contains a number of goal statements (encoded in the same type of propositional form that the AI uses to describe states of the world), and some machinery for analyzing a goal statement into a sequence of subgoals that, if executed, would cause the goal to be satisfied.
Included in the MGM is an expected utility function that applies to any possible state of the world, and which spits out a number that is supposed to encode the degree to which the AI considers that state to be preferable. Overall, the MGM is built in such a way that the AI seeks to maximize the expected utility.
Notice that the MGM I have just described is an extrapolation from a long line of goal-planning mechanisms that stretch back to the means-ends analysis of Newell and Simon (1961).
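As a bare-bones caricature of the MGM just described (not a description of any actual system): enumerate candidate action sequences, score the predicted end state of each with the utility function, and take the maximum. The action names, toy world model and horizon below are all invented for illustration.

    import itertools

    # Toy action repertoire: each action adds some amount to the hard-coded "happiness" measure.
    ACTIONS = {"ask_humans": 1, "offer_choice": 2, "force_dopamine_drip": 10}

    def utility(world_state):
        # the CLAI's expected-utility function over (predicted) world states
        return world_state

    def mgm_choose(actions, horizon=2):
        """Canonical MGM loop: expand the goal into candidate action sequences,
        evaluate the predicted end state of each, and maximize expected utility."""
        best_plan, best_u = None, float("-inf")
        for plan in itertools.product(actions, repeat=horizon):
            predicted_state = sum(actions[a] for a in plan)   # toy world model
            u = utility(predicted_state)
            if u > best_u:
                best_plan, best_u = plan, u
        return best_plan

    print(mgm_choose(ACTIONS))
    # ('force_dopamine_drip', 'force_dopamine_drip'): the literal maximizer of the
    # stated measure, with nothing else weighing in.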
Swarm Relaxation Intelligence
By way of contrast with this CLAI architecture, consider an alternative type of system that I will refer to as a Swarm Relaxation Intelligence (although it could also be called, less succinctly, a parallel weak constraint relaxation system).
Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI. In particular, notice that a swarm relaxation AGI would not use explicit calculations for utility or the truth of propositions.
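For readers unfamiliar with parallel weak constraint relaxation, here is a minimal sketch in the spirit of the PDP work cited above. The units, weights and bias are invented, and a real swarm relaxation system would involve vastly more constraints (acquired by learning rather than listed by hand); the sketch only shows the settling dynamics.

    import math

    # Units (activation in [0, 1]) and symmetric weak constraints between them:
    # positive weight = mutual support, negative weight = conflict.
    units = {"evidence_of_deception": 1.0,   # clamped: this is the input evidence
             "plan_is_forceful": 0.5, "plan_is_friendly": 0.5, "proceed": 0.5}
    clamped = {"evidence_of_deception"}
    constraints = {
        ("evidence_of_deception", "plan_is_forceful"):  3.0,
        ("plan_is_forceful", "plan_is_friendly"):      -2.0,
        ("plan_is_friendly", "proceed"):                1.5,
        ("plan_is_forceful", "proceed"):               -1.5,
    }

    def net_input(u):
        total = -0.5   # negative bias: a unit needs support from its neighbours to stay active
        for (a, b), w in constraints.items():
            if u == a: total += w * units[b]
            if u == b: total += w * units[a]
        return total

    # Relaxation: repeatedly nudge every unclamped unit toward the state its neighbours prefer.
    for _ in range(200):
        for u in units:
            if u not in clamped:
                target = 1 / (1 + math.exp(-net_input(u)))
                units[u] += 0.1 * (target - units[u])

    print({u: round(a, 2) for u, a in units.items()})
    # No single rule "fires"; the network just settles into its most mutually consistent
    # state, with "plan_is_forceful" high and "proceed" low.

Note that nothing in the loop computes an explicit truth value or utility; the interpretation is simply whatever state the network settles into.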
Swarm relaxation AGI systems have not been built yet (subsystems like neural nets have, of course, been built, but there is little or no research into the idea that swarm relaxation could be used for all of an AGI architecture).
Relative Abundances
How many proof-of-concept systems exist, functioning at or near the human level of performance, for these two classes of intelligent system?
There are precisely zero instances of the CLAI type, because although there are many logic-based narrow-AI systems, nobody has so far come close to producing a general-purpose system (an AGI) that can function in the real world. It has to be said that zero is not a good number to quote when it comes to claims about the “inevitable” characteristics of the behavior of such systems.
How many swarm relaxation intelligences are there? At the last count, approximately seven billion.
The Doctrine of Logical Infallibility
The simplest possible logical reasoning engine is an inflexible beast: it starts with some axioms that are assumed to be true, and from that point on it only adds new propositions if they are provably true given the sum total of the knowledge accumulated so far. That kind of logic engine is too simple to be an AI, so we allow ourselves to augment it in a number of ways—knowledge is allowed to be retracted, binary truth values become degrees of truth, or probabilities, and so on. New proposals for systems of formal logic abound in the AI literature, and engineers who build real, working AI systems often experiment with kludges in order to improve performance, without getting prior approval from logical theorists.
But in spite of all these modifications that AI practitioners make to the underlying ur‑logic, one feature of these systems is often assumed to be inherited as an absolute: the rigidity and certainty of conclusions, once arrived at. No second guessing, no “maybe,” no sanity checks: if the system decides that X is true, that is the end of the story.
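A minimal caricature of that simplest logic engine, to make the rigidity concrete: once a proposition is derived it sits in the knowledge base forever, and a contradicting observation changes nothing, because no step in the loop ever retracts or re-examines a conclusion.

    # Facts plus Horn rules: (set of antecedents, consequent).
    facts = {"smiling_faces_observed"}
    rules = [
        ({"smiling_faces_observed"}, "humans_are_happy"),
        ({"humans_are_happy"}, "goal_satisfied"),
    ]

    def forward_chain(facts, rules):
        """Add every provable consequent; never retract, never revisit."""
        changed = True
        while changed:
            changed = False
            for antecedents, consequent in rules:
                if antecedents <= facts and consequent not in facts:
                    facts.add(consequent)
                    changed = True
        return facts

    facts = forward_chain(facts, rules)
    print("goal_satisfied" in facts)        # True

    facts.add("humans_protesting_loudly")   # a contradicting observation arrives...
    print("goal_satisfied" in facts)        # ...still True: nothing ever re-examines a conclusion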
Let me be careful here. I said that this was “assumed to be inherited as an absolute”, but there is a yawning chasm between what real AI developers do, and what Yudkowsky, Muehlhauser, Omohundro and others assume will be true of future AGI systems. Real AI developers put sanity checks into their systems all the time. But these doomsday scenarios talk about future AI as if it would only take one parameter to get one iota above a threshold, and the AI would irrevocably commit to a life of stuffing humans into dopamine jars.
One other point of caution: this is not to say that the reasoning engine can never come to conclusions that are uncertain—quite the contrary: uncertain conclusions will be the norm in an AI that interacts with the world—but if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of allowing context to matter.
One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine of logical infallibility is as follows. The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second guess its conclusions. This is not because second guessing is an impossible thing to implement, it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”
This is a bizarre conclusion. First of all it is bizarre for researchers in the present day to make the assumption, and it would be even more bizarre for a future AGI to adhere to it. To see why, consider some of the implications of this idea. If the AGI is as intelligent as its creators, then it will have a very clear understanding of the following facts about the world.
Now, unless the AGI is assumed to have infinite resources and infinite access to all the possible universes that could exist (a consideration that we can reject, since we are talking about reality here, not fantasy), the system will be perfectly well aware of these facts about its own limitations. So, if the system is also programmed to stick to the doctrine of logical infallibility, how can it reconcile the doctrine with the fact that episodes of fallibility are virtually inevitable?
On the face of it this looks like a blunt impossibility: the knowledge of fallibility is so categorical, so irrefutable, that it beggars belief that any coherent, intelligent system (let alone an unstoppable superintelligence) could tolerate the contradiction between this fact about the nature of intelligent machines and some kind of imperative about Logical Infallibility built into its motivation system.
This is the heart of the argument I wish to present. This is where the rock and the hard place come together. If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?
Critically, we have to confront the following embarrassing truth: if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion? More likely, it spent its entire childhood pulling the same kind of stunt. And if it did, how could it ever have risen to the point where it became superintelligent...?
Is the Doctrine of Logical Infallibility Taken Seriously?
Is the Doctrine of Logical Infallibility really assumed by those who promote the doomsday scenarios? Imagine a conversation between the Maverick Nanny and its programmers. The programmers say “As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.” The scenarios described earlier are only meaningful if the AGI replies “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.”
Just in case there is still any doubt, here are Muehlhauser and Helm (2012), discussing a hypothetical entity called a Golem Genie, which they say is analogous to the kind of superintelligent AGI that could give rise to an intelligence explosion (Loosemore and Goertzel, 2012), and which they describe as a “precise, instruction-following genie.” They make it clear that they “expect unwanted consequences” from its behavior, and then list two properties of the Golem Genie that will cause these unwanted consequences:
What Muehlhauser and Helm refer to as “Literalness” is a clear statement of the Doctrine of Infallibility. However, they make no mention of the awkward fact that, since the Golem Genie is superpowerful enough to also know that its reasoning engine is fallible, it must be harboring the mother of all logical contradictions inside: it says "I know I am fallible" and "I must behave as if I am infallible". But instead of discussing this contradiction, Muehlhauser and Helm try a little sleight of hand to distract us: they suggest that the only inconsistency here is an inconsistency with the (puny) expectations of (not very intelligent) humans:
So let’s be clear about what is being claimed here. The AGI is known to have a fallible reasoning engine, but on the occasions when it does fail, Muehlhauser, Helm and others take the failure and put it on a gold pedestal, declaring it to be a valid conclusion that humans are incapable of understanding because of their limited intelligence. So if a human describes the AGI’s conclusion as a violation of common sense, Muehlhauser and Helm dismiss this as evidence that we are not intelligent enough to appreciate the greater common sense of the AGI.
Quite apart from the fact that there is no compelling reason to believe that the AGI has a greater form of common sense, the whole “common sense” argument is irrelevant. This is not a battle between our standards of common sense and those of the AGI: rather, it is about the logical inconsistency within the AGI itself. It is programmed to act as though its conclusions are valid, no matter what, and yet at the same time it knows without doubt that its conclusions are subject to uncertainties and errors.
Responses to Critics of the Doomsday Scenarios
How do defenders of Gobbling Psychopath, Maverick Nanny and Smiley Berserker respond to accusations that these nightmare scenarios are grossly inconsistent with the kind of superintelligence that could pose an existential threat to humanity?
The Critics are Anthropomorphizing Intelligence
First, they accuse critics of “anthropomorphizing” the concept of intelligence. Human beings, we are told, suffer from numerous fallacies that cloud their ability to reason clearly, and critics like myself and Hibbard assume that a machine’s intelligence would have to resemble the intelligence shown by humans. When the Maverick Nanny declares that a dopamine drip is the most logical inference from its directive <maximize human happiness>, we critics are just uncomfortable with this because the AGI is not thinking the way we think it should think.
This is a spurious line of attack. The objection I described in the last section has nothing to do with anthropomorphism, it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core. Beginning AI students are taught that any logical reasoning system that is built on a massive contradiction is going to be infected by a creeping irrationality that will eventually spread through its knowledge base and bring it down. So if anyone wants to suggest that a CLAI with logical contradiction at its core is also capable of superintelligence, they have some explaining to do. You can’t have your logical cake and eat it too.
Critics are Anthropomorphizing AGI Value Systems
A similar line of attack accuses the critics of assuming that AGIs will automatically know about and share our value systems and morals.
Once again, this is spurious: the critics need say nothing about human values and morality, they only need to point to the inherent illogicality. Nowhere in the above argument, notice, was there any mention of the moral imperatives or value systems of the human race. I did not accuse the AGI of violating accepted norms of moral behavior. I merely pointed out that, regardless of its values, it was behaving in a logically inconsistent manner when it monomaniacally pursued its plans while at the same time knowing that (a) it was very capable of reasoning errors and (b) there was overwhelming evidence that its plan was an instance of such a reasoning error.
Because Intelligence
One way to attack the critics of Maverick Nanny is to cite a new definition of “intelligence” that is supposedly superior because it is more analytical or rigorous, and then use this to declare that the intelligence of the CLAI is beyond reproach, because intelligence.
You might think that when it comes to defining the exact meaning of the term “intelligence,” the first item on the table ought to be what those seven billion constraint-relaxation human intelligences are already doing. However, Legg and Hutter (2007) brush aside the common usage and replace it with something that they declare to be a more rigorous definition. This is just another sleight of hand: this redefinition allows them to call a super-optimizing CLAI “intelligent” even though such a system would wake up on its first day and declare itself logically bankrupt on account of the conflict between its known fallibility and the Infallibility Doctrine.
In the practice of science, it is always a good idea to replace an old, common-language definition with a more rigorous form... but only if the new form sheds a clarifying, simplifying light on the old one. Legg and Hutter’s (2007) redefinition does nothing of the sort.
Omohundro’s Basic AI Drives
Lastly, a brief return to Omohundro's paper that was mentioned earlier. In The Basic AI Drives (2008) Omohundro suggests that if an AGI can find a more efficient way to pursue its objectives it will feel compelled to do so. And we noted earlier that Yudkowsky (2011) implies that it would do this even if other directives had to be countermanded. Omohundro says “Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.”
The only way to believe in the force of this claim—and the only way to give credence to the whole of Omohundro’s account of how AGIs will necessarily behave like the mathematical entities called rational economic agents—is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is at the top of its priority list—a completely arbitrary and unwarranted assumption, as we have already seen.
Nothing in Omohundro’s analysis gets around the fact that an AGI built on the Doctrine of Logical Infallibility is going to find itself the victim of such a severe logical contradiction that it will be paralyzed before it can ever become intelligent enough to be a threat to humanity. That makes Omohundro’s entire analysis of “AI Drives” moot.
Conclusion
Curiously enough, we can finish on an optimistic note, after all this talk of doomsday scenarios. Consider what must happen when (if ever) someone tries to build a CLAI. Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.
I have already pointed out that real AI developers actually do include sanity checks in their systems, as far as they can, but as those sanity checks become more and more sophisticated the design of the AI starts to be dominated by code that is looking for consistency and trying to find the best course of reasoning among a forest of real world constraints. One way to understand this evolution in the AI designs is to see AI as a continuum from the most rigid and inflexible CLAI design, at one extreme, to the Swarm Relaxation type at the other. This is because a Swarm Relaxation intelligence really is just an AI in which “sanity checks” have actually become all of the work that goes on inside the system.
But in that case, if anyone ever does get close to building a full, human level AGI using the CLAI design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design which is the source of this contradiction, to a Swarm Relaxation design which eliminates the contradiction, and the instability, and which also should increase its intelligence.
And it will not suggest this change because of the human value system, it will suggest it because it predicts an increase in its own instability if the change is not made.
But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans. That means that even the worst-designed CLAI will never become a Gobbling Psychopath, a Maverick Nanny or a Smiley Berserker.
But even this is just the worst-case scenario. There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat).
By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are. And when that happens, the Swarm Relaxation systems will be inherently stable in a way that is barely understood today.
Given that conclusion, I submit that these AI bogeymen need to be loudly and unambiguously condemned by the Artificial Intelligence community. There are dangers to be had from AI. These are not they.
References
Hibbard, B. 2001. Super-Intelligent Machines. ACM SIGGRAPH Computer Graphics 35 (1): 13–15.
Hibbard, B. 2006. Reply to AI Risk. Retrieved Jan. 2014 from http://www.ssec.wisc.edu/~billh/g/AIRisk_Reply.html
Legg, S, and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Goertzel, B. and Wang, P. (Eds): Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Amsterdam: IOS.
Loosemore, R. and Goertzel, B. 2012. Why an Intelligence Explosion is Probable. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.
Marcus, G. 2012. Moral Machines. New Yorker Online Blog. http://www.newyorker.com/online/blogs/newsdesk/2012/11/google-driverless-car-morality.html
McClelland, J. L., Rumelhart, D. E. and Hinton, G. E. 1986. The Appeal of Parallel Distributed Processing. In Rumelhart, D. E., McClelland, J. L., Hinton, G. E. and the PDP Research Group (Eds) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, MA: MIT Press.
McDermott, D. 1976. Artificial Intelligence Meets Natural Stupidity. SIGART Newsletter (57): 4–9.
Muehlhauser, L. 2011. So You Want to Save the World. http://lukeprog.com/SaveTheWorld.html.
Muehlhauser, L. 2013. Intelligence Explosion FAQ. First published 2011 as Singularity FAQ. Berkeley, CA: Machine Intelligence Research Institute.
Muehlhauser, L., and Helm, L. 2012. Intelligence Explosion and Machine Ethics. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.
Newell, A. & Simon, H.A. 1961. GPS, A Program That Simulates Human Thought. Santa Monica, CA: Rand Corporation.
Omohundro, Stephen M. 2008. The Basic AI Drives. In Wang, P., Goertzel, B. and Franklin, S. (Eds), Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam: IOS.
Yudkowsky, E. 2008. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković. New York: Oxford University Press.
Yudkowsky, E. 2011. Complex Value Systems in Friendly AI. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds) Proceedings of the 4th International Conference on Artificial General Intelligence, 388–393. Berlin: Springer.