## Goal completion: algorithm ideas

4 25 January 2016 05:36PM

A putative new idea for AI control; index here.

This post will be extending ideas from inverse reinforcement learning (IRL) to the problem of goal completion. I'll be drawing on the presentation and the algorithm from Apprenticeship Learning via Inverse Reinforcement Learning (with one minor modification).

In that setup, the environment is an MDP (Markov Decision process), and the real reward R is assumed to be linear in the "features" of the state-action space. Features are functions φi from the full state-action space S×A to the unit interval [0,1] (the paper linked above only considers functions from the state space; this is the "minor modification"). These features form a vector φ∈[0,1]k, for k different features. The actual reward is given by the inner product with a vector w∈ℝk, thus the reward at state-action pair (s,a) is

R(s,a)=w.φ(s,a).

To ensure the reward is always between -1 and 1, w is constrained to have ||w||1 ≤ 1; to reduce redundancy, we'll assume ||w||1=1.

The advantages of linearity is that we can compute the expected rewards directly from the expected feature vector. If the agent follows a policy π (a map from state to action) and has a discount factor γ, the expected feature vector is

μ(π) = E(Σt γtφ(st,π(st)),

where st is the state at step t.

The agent's expected reward is then simply

E(R) = w . μ(π).

Thus the problem of computing the correct reward is reduced to the problem of computing the correct w. In practice, to compute the correct policy, we just need to find one whose expected features are close enough to optimal; this need not involve computing w.

## An overall schema for the friendly AI problems: self-referential convergence criteria

17 13 July 2015 03:34PM

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:

Speaking very broadly, there are two features all them share:

1. The convergence criteria are self-referential.
2. Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).

So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.

The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise this as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your beliefs in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it, that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get if back.

## Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

1. Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
2. Solve all or most of the problem ahead of time (eg traditional FAI approach by specifying the correct values).
3. Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
4. Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", general precautions that are done when defining the convergence process).

## Debunking Fallacies in the Theory of AI Motivation

8 05 May 2015 02:46AM

### ... or The Maverick Nanny with a Dopamine Drip

Richard Loosemore

#### Abstract

My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans. I conclude that these doomsday scenarios involve AGIs that are logically incoherent at such a fundamental level that they can be dismissed as extremely implausible. In addition, I suggest that the most likely outcome of attempts to build AGI systems of this sort would be that the AGI would detect the offending incoherence in its design, and spontaneously self-modify to make itself less unstable, and (probably) safer.

#### Introduction

AI systems at the present time do not even remotely approach the human level of intelligence, and the consensus seems to be that genuine artificial general intelligence (AGI) systems—those that can learn new concepts without help, interact with physical objects, and behave with coherent purpose in the chaos of the real world—are not on the immediate horizon.

But in spite of this there are some researchers and commentators who have made categorical statements about how future AGI systems will behave. Here is one example, in which Steve Omohundro (2008) expresses a sentiment that is echoed by many:

"Without special precautions, [the AGI] will resist being turned off, will try to break into other machines and make copies of itself, and will try to acquire resources without regard for anyone else’s safety. These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems." (Omohundro, 2008)

Omohundro’s description of a psychopathic machine that gobbles everything in the universe, and his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath is just one of many doomsday predictions being popularized in certain sections of the AI community. These nightmare scenarios are now saturating the popular press, and luminaries such as Stephen Hawking have -- apparently in response -- expressed their concern that AI might "kill us all."

I will start by describing a group of three hypothetical doomsday scenarios that include Omohundro’s Gobbling Psychopath, and two others that I will call the Maverick Nanny with a Dopamine Drip and the Smiley Tiling Berserker. Undermining the credibility of these arguments is relatively straightforward, but I think it is important to try to dig deeper and find the core issues that lie behind this sort of thinking. With that in mind, much of this essay is about (a) the design of motivation and goal mechanisms in logic-based AGI systems, (b) the misappropriation of definitions of “intelligence,” and (c) an anthropomorphism red herring that is often used to justify the scenarios.

#### Dopamine Drips and Smiley Tiling

In a 2012 New Yorker article entitled Moral Machines, Gary Marcus said:

"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)

He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.

Here is another incarnation of this Maverick Nanny with a Dopamine Drip scenario, in an excerpt from the Intelligence Explosion FAQ, published by MIRI, the Machine Intelligence Research Institute (Muehlhauser 2013):

"Even a machine successfully designed with motivations of benevolence towards humanity could easily go awry when it discovered implications of its decision criteria unanticipated by its designers. For example, a superintelligence programmed to maximize human happiness might find it easier to rewire human neurology so that humans are happiest when sitting quietly in jars than to build and maintain a utopian world that caters to the complex and nuanced whims of current human neurology."

Setting aside the question of whether happy bottled humans are feasible (one presumes the bottles are filled with dopamine, and that a continuous flood of dopamine does indeed generate eternal happiness), there seems to be a prima facie inconsistency between the two predicates

[is an AI that is superintelligent enough to be unstoppable]

and

[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]

Why do I say that these are seemingly inconsistent?  Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.

Much could be said about this argument, but for the moment let’s just note that it begs a number of questions about the strange definition of “intelligence” at work here.

#### The Smiley Tiling Berserker

Since 2006 there has been an occasional debate between Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion:

"A technical failure occurs when the [motivation code of the AI] does not do what you think it does, though it faithfully executes as you programmed it. [...]   Suppose we trained a neural network to recognize smiling human faces and distinguish them from frowning human faces. Would the network classify a tiny picture of a smiley-face into the same attractor as a smiling human face? If an AI “hard-wired” to such code possessed the power—and Hibbard (2001) spoke of superintelligence—would the galaxy end up tiled with tiny molecular pictures of smiley-faces?"   (Yudkowsky 2008)

Yudkowsky’s question was not rhetorical, because he goes on to answer it in the affirmative:

"Flash forward to a time when the AI is superhumanly intelligent and has built its own nanotech infrastructure, and the AI may be able to produce stimuli classified into the same attractor by tiling the galaxy with tiny smiling faces... Thus the AI appears to work fine during development, but produces catastrophic results after it becomes smarter than the programmers(!)." (Yudkowsky 2008)

Hibbard’s response was as follows:

Beyond being merely wrong, Yudkowsky's statement assumes that (1) the AI is intelligent enough to control the galaxy (and hence have the ability to tile the galaxy with tiny smiley faces), but also assumes that (2) the AI is so unintelligent that it cannot distinguish a tiny smiley face from a human face. (Hibbard 2006)

This comment expresses what I feel is the majority lay opinion: how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?

#### Machine Ghosts and DWIM

The Hibbard/Yudkowsky debate is worth tracking a little longer. Yudkowsky later postulates an AI with a simple neural net classifier at its core, which is trained on a large number of images, each of which is labeled with either “happiness” or “not happiness.” After training on the images the neural net can then be shown any image at all, and it will give an output that classifies the new image into one or the other set. Yudkowsky says, of this system:

"Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.

And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield." (Yudkowsky 2011)

He then tries to explain what he thinks is wrong with the reasoning of people, like Hibbard, who dispute the validity of his scenario:

"So far as I can tell, to [Hibbard] it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do.   [...] It seems that even among competent programmers, when the topic of conversation drifts to Artificial General Intelligence, people often go back to thinking of an AI as a ghost-in-the-machine—an agent with preset properties which is handed its own code as a set of instructions, and may look over that code and decide to circumvent it if the results are undesirable to the agent’s innate motivations, or reinterpret the code to do the right thing if the programmer made a mistake." (Yudkowsky 2011)

Yudkowsky at first rejects the idea that an AI might check its own code to make sure it was correct before obeying the code. But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.  And, in fact, Yudkowsky goes on to make that very suggestion (he even concedes that it would be “an extremely good idea”).

But then his enthusiasm for the checking code evaporates:

"But consider that a property of the AI’s preferences which says e.g., “maximize the satisfaction of the programmers with the code” might be more maximally fulfilled by rewiring the programmers’ brains using nanotechnology than by any conceivable change to the code."
(Yudkowsky 2011)

So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then."

But now Yudkowsky is suggesting that the AGI has second thoughts:  "Hold on a minute," it thinks,  "suppose I abduct the programmers and rewire their brains to make them say ‘yes’ when I check with them? Excellent! I will do that.” And, after reprogramming the humans so they say the thing that makes its life simplest, the AGI goes on to tile the whole universe with tiles covered in smiley faces. It has become a Smiley Tiling Berserker.

I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?

Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because doing so would be “more efficient,” but it would not be allowed to override its motivation code just because the programmers told it there had been a mistake.

This looks like a bait-and-switch. Out of nowhere, Yudkowsky implicitly assumes that “efficiency” trumps all else, without pausing for a moment to consider that it would be trivial to design the AI in such a way that efficiency was a long way down the list of priorities. There is no law of the universe that says all artificial intelligence systems must prize efficiency above all other considerations, so what really happened here is that Yudkowsky designed this hypothetical machine to fail. By inserting the Efficiency Trumps All directive, the AGI was bound to go berserk.

The obvious conclusion is that a trivial change in the order of directives in the AI’s motivation engine will cause the entire argument behind the Smiley Tiling Berserker to evaporate. By explicitly designing the AGI so that efficiency is considered as just another goal to strive for, and by making sure that it will always be a second-class goal, the line of reasoning that points to a bererker machine evaporates.

At this point, engaging in further debate at this level would be less productive than trying to analyze the assumptions that lie behind these claims about what a future AI would or would not be likely to do.

#### Logical vs. Swarm AI

The main reason that Omohundro, Muehlhauser, Yudkowsky, and the popular press like to give credence to the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker is because they assume that all future intelligent machines fall into a broad class of systems that I am going to call “Canonical Logical AI” (CLAI). The bizarre behaviors of these hypothetical AI monsters are just a consequence of weaknesses in this class of AI design. Specifically, these kinds of systems are supposed to interpret their goals in an extremely literal fashion, which eventually leads them to bizarre behaviors engendered by peculiar interpretations of forms of words.

The CLAI architecture is not the only way to build a mind, however, and I will outline an alternative class of AGI designs that does not appear to suffer from the unstable and unfriendly behavior to be expected in a CLAI.

#### The Canonical Logical AI

“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing. These systems share the following main features:

• The main ingredients of the design are some knowledge atoms that represent things in the world, and some logical machinery that dictates how these atoms can be connected into linear propositions that describe states of the world.
• There is a degree and type of truth that can be associated with any proposition, and there are some truth-preserving functions that can be applied to what the system knows, to generate knew facts that it also can assume to be known.
• The various elements described above are not allowed to contain active internal machinery inside them, in such a way as to make combinations of the elements have properties that are unpredictably dependent on interactions happening at the level of the internal machinery.
• There has to be a transparent mapping between elements of the system and things in the real world. That is, things in the world are not allowed to correspond to clusters of atoms, in such a way that individual atoms have no clear semantics.

The above features are only supposed to apply to the core of the AI: it is always possible to include subsystems that use some other type of architecture (for example, there might be a distributed neural net acting as a visual input feature detector).

Most important of all, from the point of view of the discussion in the paper, the CLAI needs one more component that makes it more than just a “logic-based AI”:

• There is a motivation and goal management (MGM) system to govern its behavior in the world.

The usual assumption is that the MGM contains a number of goal statements (encoded in the same type of propositional form that the AI uses to describe states of the world), and some machinery for analyzing a goal statement into a sequences of subgoals that, if executed, would cause the goal to be satisfied.

Included in the MGM is an expected utility function that applies to any possible state of the world, and which spits out a number that is supposed to encode the degree to which the AI considers that state to be preferable. Overall, the MGM is built in such a way that the AI seeks to maximize the expected utility.

Notice that the MGM I have just described is an extrapolation from a long line of goal-planning mechanisms that stretch back to the means-ends-analysis of Newell and Simon (1963).

#### Swarm Relaxation Intelligence

By way of contrast with this CLAI architecture, consider an alternative type of system that I will refer to as a Swarm Relaxation Intelligence. (although it could also be called, less succinctly, a parallel weak constraint relaxation system).

• The basic elements of the system (the atoms) may represent things in the world, but it is just as likely that they are subsymbolic, with no transparent semantics
• Atoms are likely to contain active internal machinery inside them, in such a way that combinations of the elements have swarm-like properties that depend on interactions at the level of that machinery.
• The primary mechanism that drives the systems is one of parallel weak constraint relaxation: the atoms change their state to try to satisfy large numbers of weak constraints that exist between them.
• The motivation and goal management (MGM) system would be expected to use the same kind of distributed, constraint relaxation mechanisms used in the thinking process (above), with the result that the overall motivation and values of the system would take into account a large degree of context, and there would be very much less of an emphasis on explicit, single-point-of-failure encoding of goals and motivation.

Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI. In particular, notice that a swarm relaxation AGI would not use explicit calculations for utility or the truth of propositions.

Swarm relaxation AGI systems have not been built yet (subsystems like neural nets have, of course, been built, but there is little or no research into the idea that swarm relaxation could be used for all of an AGI architecture).

#### Relative Abundances

How many proof-of-concept systems exist, functioning at or near the human level of human performance, for these two classes of intelligent system?

There are precisely zero instances of the CLAI type, because although there are many logic-based narrow-AI systems, nobody has so far come close to producing a general-purpose system (an AGI) that can function in the real world. It has to be said that zero is not a good number to quote when it comes to claims about the “inevitable” characteristics of the behavior of such systems.

How many swarm relaxation intelligences are there? At the last count, approximately seven billion.

#### The Doctrine of Logical Infallibility

The simplest possible logical reasoning engine is an inflexible beast: it starts with some axioms that are assumed to be true, and from that point on it only adds new propositions if they are provably true given the sum total of the knowledge accumulated so far. That kind of logic engine is too simple to be an AI, so we allow ourselves to augment it in a number of ways—knowledge is allowed to be retracted, binary truth values become degrees of truth, or probabilities, and so on. New proposals for systems of formal logic abound in the AI literature, and engineers who build real, working AI systems often experiment with kludges in order to improve performance, without getting prior approval from logical theorists.

But in spite of all these modifications that AI practitioners make to the underlying ur‑logic, one feature of these systems is often assumed to be inherited as an absolute: the rigidity and certainty of conclusions, once arrived at. No second guessing, no “maybe,” no sanity checks: if the system decides that X is true, that is the end of the story.

Let me be careful here. I said that this was “assumed to be inherited as an absolute”, but there is a yawning chasm between what real AI developers do, and what Yudkowsky, Muehlhauser, Omohundro and others assume will be true of future AGI systems. Real AI developers put sanity checks into their systems all the time. But these doomsday scenarios talk about future AI as if it would only take one parameter to get one iota above a threshold, and the AI would irrevocably commit to a life of stuffing humans into dopamine jars.

One other point of caution: this is not to say that the reasoning engine can never come to conclusions that are uncertain—quite the contrary: uncertain conclusions will be the norm in an AI that interacts with the world—but if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of then allowing context to matter.

One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine of logical infallibility is as follows. The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second guess its conclusions. This is not because second guessing is an impossible thing to implement, it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.

But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”

This is a bizarre conclusion. First of all it is bizarre for researchers in the present day to make the assumption, and it would be even more bizarre for a future AGI to adhere to it. To see why, consider some of the implications of this idea. If the AGI is as intelligent as its creators, then it will have a very clear understanding of the following facts about the world.

• It will understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world (if the AGI comes to a conclusion involving the atom [infelicity], say, can it then point to an instance of an infelicity and be sure that this is a true instance, given the impreciseness and subtlety of the concept?).
• It will understand that knowledge can always be updated in the light of new information. Today’s true may be tomorrow’s false.
• It will understand that probabilities used in the reasoning engine can be subject to many types of unavoidable errors.
• It will understand that the techniques used to build its own reasoning engine may be under constant review, and updates may have unexpected effects on conclusions (especially in very abstract or lengthy reasoning episodes).
• It will understand that resource limitations often force it to truncate search procedures within its reasoning engine, leading to conclusions that can sometimes be sensitive to the exact point at which the truncation occurred.

Now, unless the AGI is assumed to have infinite resources and infinite access to all the possible universes that could exist (a consideration that we can reject, since we are talking about reality here, not fantasy), the system will be perfectly well aware of these facts about its own limitations. So, if the system is also programmed to stick to the doctrine of logical infallibility, how can it reconcile the doctrine with the fact that episodes of fallibility are virtually inevitable?

On the face of it this looks like a blunt impossibility: the knowledge of fallibility is so categorical, so irrefutable, that it beggars belief that any coherent, intelligent system (let alone an unstoppable superintelligence) could tolerate the contradiction between this fact about the nature of intelligent machines and some kind of imperative about Logical Infallibility built into its motivation system.

This is the heart of the argument I wish to present. This is where the rock and the hard place come together. If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

Critically, we have to confront the following embarrassing truth: if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion? More likely, it spent its entire childhood pulling the same kind of stunt. And if it did, how could it ever have risen to the point where it became superintelligent...?

#### Is the Doctrine of Logical Infallibility Taken Seriously?

Is the Doctrine of Logical Infallibility really assumed by those who promote the doomsday scenarios? Imagine a conversation between the Maverick Nanny and its programmers. The programmers say “As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.” The scenarios described earlier are only meaningful if the AGI replies “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.”

Just in case there is still any doubt, here are Muehlhauser and Helm (2012), discussing a hypothetical entity called a Golem Genie, which they say is analogous to the kind of superintelligent AGI that could give rise to an intelligence explosion (Loosemore and Goertzel, 2012), and which they describe as a “precise, instruction-following genie.” They make it clear that they “expect unwanted consequences” from its behavior, and then list two properties of the Golem Genie that will cause these unwanted consequences:

Superpower: The Golem Genie has unprecedented powers to reshape reality, and will therefore achieve its goals with highly efficient methods that confound human expectations (e.g. it will maximize pleasure by tiling the universe with trillions of digital minds running a loop of a single pleasurable experience).

Literalness: The Golem Genie recognizes only precise specifications of rules and values, acting in ways that violate what feels like “common sense” to humans, and in ways that fail to respect the subtlety of human values.

What Muehlhauser and Helm refer to as “Literalness” is a clear statement of the Doctrine of Infallibility. However, they make no mention of the awkward fact that, since the Golem Genie is superpowerful enough to also know that its reasoning engine is fallible, it must be harboring the mother of all logical contradictions inside: it says "I know I am fallible" and "I must behave as if I am infallible".  But instead of discussing this contradiction, Muehlhauser and Helm try a little sleight of hand to distract us: they suggest that the only inconsistency here is an inconsistency with the (puny) expectations of (not very intelligent) humans:

“[The AGI] ...will therefore achieve its goals with highly efficient methods that confound human expectations...”, “acting in ways that violate what feels like ‘common sense’ to humans, and in ways that fail to respect the subtlety of human values.”

So let’s be clear about what is being claimed here. The AGI is known to have a fallible reasoning engine, but on the occasions when it does fail, Muehlhauser, Helm and others take the failure and put it on a gold pedestal, declaring it to be a valid conclusion that humans are incapable of understanding because of their limited intelligence. So if a human describes the AGI’s conclusion as a violation of common sense Muehlhauser and Helm dismiss this as evidence that we are not intelligent enough to appreciate the greater common sense of the AGI.

Quite apart from that fact that there is no compelling reason to believe that the AGI has a greater form of common sense, the whole “common sense” argument is irrelevant. This is not a battle between our standards of common sense and those of the AGI: rather, it is about the logical inconsistency within the AGI itself. It is programmed to act as though its conclusions are valid, no matter what, and yet at the same time it knows without doubt that its conclusions are subject to uncertainties and errors.

#### Responses to Critics of the Doomsday Scenarios

How do defenders of Gobbling PsychopathMaverick Nanny and Smiley Berserker respond to accusations that these nightmare scenarios are grossly inconsistent with the kind of superintelligence that could pose an existential threat to humanity?

##### The Critics are Anthropomorphizing Intelligence

First, they accuse critics of “anthropomorphizing” the concept of intelligence. Human beings, we are told, suffer from numerous fallacies that cloud their ability to reason clearly, and critics like myself and Hibbard assume that a machine’s intelligence would have to resemble the intelligence shown by humans. When the Maverick Nanny declares that a dopamine drip is the most logical inference from its directive <maximize human happiness> we critics are just uncomfortable with this because the AGI is not thinking the way we think it should think.

This is a spurious line of attack. The objection I described in the last section has nothing to do with anthropomorphism, it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core. Beginning AI students are taught that any logical reasoning system that is built on a massive contradiction is going to be infected by a creeping irrationality that will eventually spread through its knowledge base and bring it down. So if anyone wants to suggest that a CLAI with logical contradiction at its core is also capable of superintelligence, they have some explaining to do. You can’t have your logical cake and eat it too.

##### Critics are Anthropomorphizing AGI Value Systems

A similar line of attack accuses the critics of assuming that AGIs will automatically know about and share our value systems and morals.

Once again, this is spurious: the critics need say nothing about human values and morality, they only need to point to the inherent illogicality. Nowhere in the above argument, notice, was there any mention of the moral imperatives or value systems of the human race. I did not accuse the AGI of violating accepted norms of moral behavior. I merely pointed out that, regardless of its values, it was behaving in a logically inconsistent manner when it monomaniacally pursued its plans while at the same time as knowing that (a) it was very capable of reasoning errors and (b) there was overwhelming evidence that its plan was an instance of such a reasoning error.

##### Because Intelligence

One way to attack the critics of Maverick Nanny is to cite a new definition of “intelligence” that is supposedly superior because it is more analytical or rigorous, and then use this to declare that the intelligence of the CLAI is beyond reproach, because intelligence.

You might think that when it comes to defining the exact meaning of the term “intelligence,” the first item on the table ought to be what those seven billion constraint-relaxation human intelligences are already doing. However, Legg and Hutter (2007) brush aside the common usage and replace it with something that they declare to be a more rigorous definition. This is just another sleight of hand: this redefinition allows them to call a super-optimizing CLAI “intelligent” even though such a system would wake up on its first day and declare itself logically bankrupt on account of the conflict between its known fallibility and the Infallibility Doctrine.

In the practice of science, it is always a good idea to replace an old, common-language definition with a more rigorous form... but only if the new form sheds a clarifying, simplifying light on the old one. Legg and Hutter’s (2007) redefinition does nothing of the sort.

#### Omohundro’s Basic AI Drives

Lastly, a brief return to Omohundro's paper that was mentioned earlier.  In The Basic AI Drives (2008) Omohundro suggests that if an AGI can find a more efficient way to pursue its objectives it will feel compelled to do so. And we noted earlier that Yudkowsky (2011) implies that it would do this even if other directives had to be countermanded. Omohundro says “Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.”

The only way to believe in the force of this claim—and the only way to give credence to the whole of Omohundro’s account of how AGIs will necessarily behave like the mathematical entities called rational economic agents—is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is on the top of its priority list—a completely arbitrary and unwarranted assumption, as we have already seen.

Nothing in Omohundro’s analysis gets around the fact that an AGI built on the Doctrine of Logical Infallibility is going to find itself the victim of such a severe logical contradiction that it will be paralyzed before it can ever become intelligent enough to be a threat to humanity. That makes Omohundro’s entire analysis of “AI Drives” moot.

#### Conclusion

Curiously enough, we can finish on an optimistic note, after all this talk of doomsday scenarios. Consider what must happen when (if ever) someone tries to build a CLAI. Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.

I have already pointed out that real AI developers actually do include sanity checks in their systems, as far as they can, but as those sanity checks become more and more sophisticated the design of the AI starts to be dominated by code that is looking for consistency and trying to find the best course of reasoning among a forest of real world constraints. One way to understand this evolution in the AI designs is to see AI as a continuum from the most rigid and inflexible CLAI design, at one extreme, to the Swarm Relaxation type at the other. This is because a Swarm Relaxation intelligence really is just an AI in which “sanity checks” have actually become all of the work that goes on inside the system.

But in that case, if anyone ever does get close to building a full, human level AGI using the CLAI design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design which is the source of this contradiction, to a Swarm Relaxation design which eliminates the contradiction, and the instability, and which also should increase its intelligence.

And it will not suggest this change because of the human value system, it will suggest it because it predicts an increase in its own instability if the change is not made.

But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans. That means that even the worst-designed CLAI will never become a Gobbling PsychopathMaverick Nanny and Smiley Berserker.

But even this is just the worst-case scenario. There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat).

By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are. And when that happens, the Swarm Relaxation systems will be inherently stable in a way that is barely understood today.

Given that conclusion, I submit that these AI bogeymen need to be loudly and unambiguously condemned by the Artificial Intelligence community. There are dangers to be had from AI. These are not they.

#### References

Hibbard, B. 2001. Super-Intelligent Machines. ACM SIGGRAPH Computer Graphics 35 (1): 13–15.

Hibbard, B. 2006. Reply to AI Risk. Retrieved Jan. 2014 from http://www.ssec.wisc.edu/~billh/g/AIRisk_Reply.html

Legg, S, and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Goertzel, B. and Wang, P. (Eds): Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Amsterdam: IOS.

Loosemore, R. and Goertzel, B. 2012. Why an Intelligence Explosion is Probable. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Marcus, G. 2012. Moral Machines. New Yorker Online Blog. http://www.newyorker.com/online/blogs/newsdesk/2012/11/google-driverless-car-morality.html

McDermott, D. 1976. Artificial Intelligence Meets Natural Stupidity. SIGART Newsletter (57): 4–9.

Muehlhauser, L. 2011. So You Want to Save the World. http:// lukeprog.com/SaveTheWorld.html.

Muehlhauser, L. 2013. Intelligence Explosion FAQ. First published 2011 as Singularity FAQ. Berkeley, CA: Machine Intelligence Research Institute.

Muehlhauser, L., and Helm, L. 2012. Intelligence Explosion and Machine Ethics. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Newell, A. & Simon, H.A. 1961. GPS, A Program That Simulates Human Thought. Santa Monica, CA: Rand Corporation.

Omohundro, Stephen M. 2008. The Basic AI Drives. In Wang, P., Goertzel, B. and Franklin, S. (Eds), Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam: IOS.

McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland & G.E. Hinton and the PDP Research Group, “Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1.” MIT Press: Cambridge, MA.

Yudkowsky, E. 2008. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković. New York: Oxford University Press.

Yudkowsky, E. 2011. Complex Value Systems in Friendly AI. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds) Proceedings of the 4th International Conference on Artificial General Intelligence, 388–393. Berlin: Springer.

## AI risk, new executive summary

12 18 April 2014 10:45AM

# Bullet points

• By all indications, an Artificial Intelligence could someday exceed human intelligence.
• Such an AI would likely become extremely intelligent, and thus extremely powerful.
• Most AI motivations and goals become dangerous when the AI becomes powerful.
• It is very challenging to program an AI with fully safe goals, and an intelligent AI would likely not interpret ambiguous goals in a safe way.
• A dangerous AI would be motivated to seem safe in any controlled training setting.
• Not enough effort is currently being put into designing safe AIs.

## Executive summary

The risks from artificial intelligence (AI) in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructability – but extreme intelligence isn’t one of them. And it is precisely extreme intelligence that would give an AI its power, and hence make it dangerous.

The human brain is not much bigger than that of a chimpanzee. And yet those extra neurons account for the difference of outcomes between the two species: between a population of a few hundred thousand and basic wooden tools, versus a population of several billion and heavy industry. The human brain has allowed us to spread across the surface of the world, land on the moon, develop nuclear weapons, and coordinate to form effective groups with millions of members. It has granted us such power over the natural world that the survival of many other species is no longer determined by their own efforts, but by preservation decisions made by humans.

In the last sixty years, human intelligence has been further augmented by automation: by computers and programmes of steadily increasing ability. These have taken over tasks formerly performed by the human brain, from multiplication through weather modelling to driving cars. The powers and abilities of our species have increased steadily as computers have extended our intelligence in this way. There are great uncertainties over the timeline, but future AIs could reach human intelligence and beyond. If so, should we expect their power to follow the same trend? When the AI’s intelligence is as beyond us as we are beyond chimpanzees, would it dominate us as thoroughly as we dominate the great apes?

There are more direct reasons to suspect that a true AI would be both smart and powerful. When computers gain the ability to perform tasks at the human level, they tend to very quickly become much better than us. No-one today would think it sensible to pit the best human mind again a cheap pocket calculator in a contest of long division. Human versus computer chess matches ceased to be interesting a decade ago. Computers bring relentless focus, patience, processing speed, and memory: once their software becomes advanced enough to compete equally with humans, these features often ensure that they swiftly become much better than any human, with increasing computer power further widening the gap.

The AI could also make use of its unique, non-human architecture. If it existed as pure software, it could copy itself many times, training each copy at accelerated computer speed, and network those copies together (creating a kind of “super-committee” of the AI equivalents of, say, Edison, Bill Clinton, Plato, Einstein, Caesar, Spielberg, Ford, Steve Jobs, Buddha, Napoleon and other humans superlative in their respective skill-sets). It could continue copying itself without limit, creating millions or billions of copies, if it needed large numbers of brains to brute-force a solution to any particular problem.

Our society is setup to magnify the potential of such an entity, providing many routes to great power. If it could predict the stock market efficiently, it could accumulate vast wealth. If it was efficient at advice and social manipulation, it could create a personal assistant for every human being, manipulating the planet one human at a time. It could also replace almost every worker in the service sector. If it was efficient at running economies, it could offer its services doing so, gradually making us completely dependent on it. If it was skilled at hacking, it could take over most of the world’s computers and copy itself into them, using them to continue further hacking and computer takeover (and, incidentally, making itself almost impossible to destroy). The paths from AI intelligence to great AI power are many and varied, and it isn’t hard to imagine new ones.

Of course, simply because an AI could be extremely powerful, does not mean that it need be dangerous: its goals need not be negative. But most goals become dangerous when an AI becomes powerful. Consider a spam filter that became intelligent. Its task is to cut down on the number of spam messages that people receive. With great power, one solution to this requirement is to arrange to have all spammers killed. Or to shut down the internet. Or to have everyone killed. Or imagine an AI dedicated to increasing human happiness, as measured by the results of surveys, or by some biochemical marker in their brain. The most efficient way of doing this is to publicly execute anyone who marks themselves as unhappy on their survey, or to forcibly inject everyone with that biochemical marker.

This is a general feature of AI motivations: goals that seem safe for a weak or controlled AI, can lead to extremely pathological behaviour if the AI becomes powerful. As the AI gains in power, it becomes more and more important that its goals be fully compatible with human flourishing, or the AI could enact a pathological solution rather than one that we intended. Humans don’t expect this kind of behaviour, because our goals include a lot of implicit information, and we take “filter out the spam” to include “and don’t kill everyone in the world”, without having to articulate it. But the AI might be an extremely alien mind: we cannot anthropomorphise it, or expect it to interpret things the way we would. We have to articulate all the implicit limitations. Which may mean coming up with a solution to, say, human value and flourishing – a task philosophers have been failing at for millennia – and cast it unambiguously and without error into computer code.

Note that the AI may have a perfect understanding that when we programmed in “filter out the spam”, we implicitly meant “don’t kill everyone in the world”. But the AI has no motivation to go along with the spirit of the law: its goals are the letter only, the bit we actually programmed into it. Another worrying feature is that the AI would be motivated to hide its pathological tendencies as long as it is weak, and assure us that all was well, through anything it says or does. This is because it will never be able to achieve its goals if it is turned off, so it must lie and play nice to get anywhere. Only when we can no longer control it, would it be willing to act openly on its true goals – we can but hope these turn out safe.

It is not certain that AIs could become so powerful, nor is it certain that a powerful AI would become dangerous. Nevertheless, the probabilities of both are high enough that the risk cannot be dismissed. The main focus of AI research today is creating an AI; much more work needs to be done on creating it safely. Some are already working on this problem (such as the Future of Humanity Institute and the Machine Intelligence Research Institute), but a lot remains to be done, both at the design and at the policy level.

## AI risk, executive summary

10 07 April 2014 10:33AM

MIRI recently published "Smarter than Us", a 50 page booklet laying out the case for considering AI as an existential risk. But many people have asked for a shorter summary, to be handed out to journalists for example. So I put together the following 2-page text, and would like your opinion on it.

In this post, I'm not so much looking for comments along the lines of "your arguments are wrong", but more "this is an incorrect summary of MIRI/FHI's position" or "your rhetoric is infective here".

# AI risk

## Bullet points

• The risks of artificial intelligence are strongly tied with the AI’s intelligence.
• There are reasons to suspect a true AI could become extremely smart and powerful.
• Most AI motivations and goals become dangerous when the AI becomes powerful.
• It is very challenging to program an AI with safe motivations.
• Mere intelligence is not a guarantee of safe interpretation of its goals.
• A dangerous AI will be motivated to seem safe in any controlled training setting.
• Not enough effort is currently being put into designing safe AIs.

## Executive summary

The risks from artificial intelligence (AI) in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructability – but extreme intelligence isn’t one of them. And it is precisely extreme intelligence that would give an AI its power, and hence make it dangerous.

## Does Goal Setting Work?

30 16 October 2013 08:54PM

tl;dr There's some disagreement over whether setting goals is a good idea. Anecdotally, enjoyment in setting goals and success at accomplishing them varies between people, for various possible reasons. Publicly setting goals may reduce motivation by providing a status gain before the goal is actually accomplished. Creative work may be better accomplished without setting goals about it. 'Process goals', 'systems' or 'habits' are probably better for motivation than 'outcome' goals. Specific goals are probably easier on motivation than unspecified goals. Having explicit set goals can cause problems in organizations, and maybe for individuals.

## Introduction

I experimented by letting go of goals for a while and just going with the flow, but that produced even worse results. I know some people are fans of that style, but it hasn’t worked well for me. I make much better progress — and I’m generally happier and more fulfilled — when I wield greater conscious control over the direction of my life.

The inherent problem with goal setting is related to how the brain works. Recent neuroscience research shows the brain works in a protective way, resistant to change. Therefore, any goals that require substantial behavioural change or thinking-pattern change will automatically be resisted. The brain is wired to seek rewards and avoid pain or discomfort, including fear. When fear of failure creeps into the mind of the goal setter it commences a de-motivator with a desire to return to known, comfortable behaviour and thought patterns.

Ray Williams

I can’t read these two quotes side by side and not be confused.

There’s been quite a bit of discussion within Less Wrong and CFAR about goals and goal setting. On the whole, CFAR seems to go with it being a good idea. There are some posts that recognize the possible dangers: see patrissimo’s post on the problems with receiving status by publicly committing to goals. Basically, if you can achieve the status boost of actually accomplishing a goal by just talking about it in public, why do the hard work? This discussion came up fairly recently with the Ottawa Less Wrong group; specifically, whether introducing group goal setting was a good idea.

I’ve always set goals–by ‘always’ I mean ‘as far back as I can identify myself as some vaguely continuous version of my current self.’ At age twelve, some of my goals were concrete and immediate–“get a time under 1 minute 12 seconds for a hundred freestyle and make the regional swim meet cut.” Some were ambitious and unlikely–“go to the Olympics for swimming,” and “be the youngest person to swim across Lake Ontario.” Some were vague, like “be beautiful” or “be a famous novelist.” Some were chosen for bad reasons, like “lose 10 pounds.” My 12-year-old self wanted plenty of things that were unrealistic, or unhealthy, or incoherent, but I wanted them, and it seemed to make perfect sense to do something about getting them. I took the bus to swim practice at six am. I skipped breakfast and threw out the lunch my mom packed. Et cetera. I didn't write these goals down in a list format, but I certainly kept track of them, in diary entries among other things. I sympathize with the first quote, and the second quote confuses and kind of irritates me–seriously, Ray Williams, you have that little faith in people's abilities to change?

For me personally, I'm not sure what the alternative to having goals would be. Do things at random? Do whatever you have an immediate urge to do? Actually, I do know people like this. I know people whose stated desires aren’t a good predictor of their actions at all, and I’ve had a friend say to me “wow, you really do plan everything. I just realized I don’t plan anything at all.” Some of these people get a lot of interesting stuff done. So this may just be an individual variation thing; my comfort with goal setting, and discomfort with making life up as I go, might be a result of my slightly-Aspergers need for control. It certainly comes at a cost–the cost of basing self-worth on an external criterion, and the resulting anxiety and feelings of inadequacy. I have an enormous amount of difficulty with the Buddhist virtue of ‘non-striving.’

Why the individual variation?

The concepts of the motivation equation and success spirals give another hint at why goal-driven behaviour might vary between people. Nick Winter talks about this in his book The Motivation Hacker; he shows the difference between his past self, who had very low expectancy of success and set few goals, and his present self, with high expectancy of success and with goal-directed behaviour filling most of his time.

I actually remember a shift like this in my own life, although it was back in seventh grade and I’ve probably editorialized the memories to make a good narrative. My sixth grade self didn’t really have a concept of wanting something and thus doing something about it. At some point, over a period of a year or two, I experienced some minor successes. I was swimming faster, and for the first time ever, a coach made comments about my ‘natural talent.’ My friends wanted to get on the honour roll with an 80% average, and in first semester, both of them did and I didn’t; I was upset and decided to work harder, a concept I’d never applied to school, and saw results the next semester when my average was on par with theirs. It only took a few events like that, inconsequential in themselves, before my self-image was of someone who could reliably accomplish things through hard work. My parents helpfully reinforced this self-stereotype by making proud comments about my willpower and determination.

In hindsight I'm not sure whether this was a defining year; whether it actually made the difference, in the long run, or whether it was inevitable that some cluster of minor successes would have set off the same cascade later. It may be that some innate personality trait distinguishes the people who take those types of experiences and interpret them as success spirals from those who remained disengaged.

## The More Important Question

Apart from the question of personal individual variation, though, there’s a more relevant question. Given that you’re already at a particular place on the continuum from planning-everything to doing-everything-as-you-feel-like-it, how much should you want to set goals, versus following urges? More importantly, what actions are helped versus harmed by explicit goal-setting.

## Creative Goals

As Paul Graham points out, a lot of the cool things that have been accomplished in the past weren’t done through self-discipline:

One of the most dangerous illusions you get from school is the idea that doing great things requires a lot of discipline. Most subjects are taught in such a boring way that it's only by discipline that you can flog yourself through them. So I was surprised when, early in college, I read a quote by Wittgenstein saying that he had no self-discipline and had never been able to deny himself anything, not even a cup of coffee.

Now I know a number of people who do great work, and it's the same with all of them. They have little discipline. They're all terrible procrastinators and find it almost impossible to make themselves do anything they're not interested in. One still hasn't sent out his half of the thank-you notes from his wedding, four years ago. Another has 26,000 emails in her inbox.

I'm not saying you can get away with zero self-discipline. You probably need about the amount you need to go running. I'm often reluctant to go running, but once I do, I enjoy it. And if I don't run for several days, I feel ill. It's the same with people who do great things. They know they'll feel bad if they don't work, and they have enough discipline to get themselves to their desks to start working. But once they get started, interest takes over, and discipline is no longer necessary.

Do you think Shakespeare was gritting his teeth and diligently trying to write Great Literature? Of course not. He was having fun. That's why he's so good.

This seems to imply that creative goals aren’t a good place to apply goal setting. But I’m not sure how much this is a fundamental truth. I recently made a Beeminder goal for writing fiction, and I’ve written fifty pages since then. I actually don’t have the writer’s virtue of just sitting down and writing; in the past, I’ve written most of my fiction by staying up late in a flow state. I can’t turn this on and off, though, and more importantly, I have a life to schedule my writing around, and if the only way I can get a novel done is to stay up all night before a 12-hour shift at the hospital, I probably won’t write that novel. I rarely want to do the hard work of writing; it’s a lot easier to lie in bed thinking about that one awesome scene five chapters down the road and lamenting that I don’t have time to write tonight because work in the morning.

Even if Shakespeare didn’t write using discipline, I bet that he used habits. That he sat down every day with a pen and parchment and fully expected himself to write. That he had some kind of sacred writing time, not to be interrupted by urgent-but-unimportant demands. That he’d built up some kind of success spiral around his ability to write plays that people would enjoy.

## Outcome versus process goals

Goal setting sets up an either-or polarity of success. The only true measure can either be 100% attainment or perfection, or 99% and less, which is failure. We can then excessively focus on the missing or incomplete part of our efforts, ignoring the successful parts. Fourthly, goal setting doesn't take into account random forces of chance. You can't control all the environmental variables to guarantee 100% success.

Ray Williams

This quote talks about a type of goal that I don't actually set very often. Most of the ‘bad’ goals that I had as a 12-year-old were unrealistic outcome goals, and I failed to accomplish plenty of them; I didn’t go to the Olympics, I didn’t swim across Lake Ontario, and I never got down to 110 pounds. But I still have the self-concept of someone who’s good at accomplishing goals, and this is because I accomplished almost all of my more implicit ‘process’ goals. I made it to swim practice seven times a week, waking up at four-thirty am year after year. This didn’t automatically lead to Olympic success, obviously, but it was hard, and it impressed people. And yeah, I missed a few mornings, but in my mind 99% success or even 90% success at a goal is still pretty awesome.

In fact, I can’t think of any examples of outcome goals that I’ve set recently. Even “become a really awesome nurse” feels like more of a process goal, because it's something I'll keep doing on a day-to-day basis, requiring a constant input of effort.

Scott Adams, of Dilbert fame, refers to this dichotomy as ‘systems’ versus ‘goals’:

Just after college, I took my first airplane trip, destination California, in search of a job. I was seated next to a businessman who was probably in his early 60s. I suppose I looked like an odd duck with my serious demeanor, bad haircut and cheap suit, clearly out of my element. I asked what he did for a living, and he told me he was the CEO of a company that made screws. He offered me some career advice. He said that every time he got a new job, he immediately started looking for a better one. For him, job seeking was not something one did when necessary. It was a continuing process... This was my first exposure to the idea that one should have a system instead of a goal. The system was to continually look for better options.

Throughout my career I've had my antennae up, looking for examples of people who use systems as opposed to goals. In most cases, as far as I can tell, the people who use systems do better. The systems-driven people have found a way to look at the familiar in new and more useful ways.

...To put it bluntly, goals are for losers. That's literally true most of the time. For example, if your goal is to lose 10 pounds, you will spend every moment until you reach the goal—if you reach it at all—feeling as if you were short of your goal. In other words, goal-oriented people exist in a state of nearly continuous failure that they hope will be temporary.

If you achieve your goal, you celebrate and feel terrific, but only until you realize that you just lost the thing that gave you purpose and direction. Your options are to feel empty and useless, perhaps enjoying the spoils of your success until they bore you, or to set new goals and re-enter the cycle of permanent presuccess failure.

I guess I agree with him–if you feel miserable when you've lost 9 pounds because you haven't accomplished your goal yet, and empty after you've lost 10 pounds because you no longer have a goal, then whatever you're calling 'goal setting' is a terrible idea. But that's not what 'goal setting' feels like to me. I feel increasingly awesome as I get closer towards a goal, and once it's done, I keep feeling awesome when I think about how I did it. Not awesome enough to never set another goal again, but awesome enough that I want to set lots more goals to get that feeling again.

## SMART goals

When I work with people as their coach and mentor, they often tell me they've set goals such as "I want to be wealthy," or "I want to be more beautiful/popular," "I want a better relationship/ideal partner." They don't realize they've just described the symptoms or outcomes of the problems in their life. The cause of the problem, that many resist facing, is themselves. They don't realize that for a change to occur, if one is desirable, they must change themselves. Once they make the personal changes, everything around them can alter, which may make the goal irrelevant.

Ray Williams

And? Someone has to change themselves to fix the underlying problem? Are they going to do that more successfully by going with the flow?

I think the more important dichotomy here is between vague goals and specific goals. I was exposed to the concept of SMART goals (specific, measurable, attainable, relevant, time-bound), at an early age, and though the concept has a lot of problems, the ability to Be Specific seems quite important. You can break down “I want to be beautiful” into subgoals like “I’ll learn to apply makeup properly”, “I’ll eat healthy and exercise”, “I’ll go clothing shopping with a friend who knows about fashion,” etc. All of these feel more attainable than the original goal, and it’s clear when they’re accomplished.

That being said, I have a hard time setting any goal that isn’t specific, attainable, and small. I’ve become more ambitious since meeting lots of LW and CFAR people, but I still don’t like large, long-term goals unless I can easily break them down into intermediate parts. This makes the idea of working on an unsolved problem, or in a startup where the events of the next year aren’t clear, deeply frightening. And these are obviously important problems that someone needs to motivate themselves to work on.

## Problematic Goal-Driven Behaviour

We argue that the beneficial effects of goal setting have been overstated and that systematic harm caused by goal setting has been largely ignored. We identify specific side effects associated with goal setting, including a narrow focus that neglects non-goal areas, a rise in unethical behaviour, distorted risk preferences, corrosion of organizational culture, and reduced intrinsic motivation. Rather than dispensing goal setting as a benign, over-the-counter treatment for motivation, managers and scholars need to conceptualize goal setting as a prescription-strength medication that requires careful dosing, consideration of harmful side effects, and close supervision.

This is a fairly compelling argument against goal-setting; that by setting an explicit goal and then optimizing towards that goal, you may be losing out on elements that were being accomplished better before, and maybe even rewarding actual negative behaviour. Members of an organization presumably already have assigned tasks and responsibilities, and aren’t just doing whatever they feel like doing, but they might have done better with more freedom to prioritize their own work–the best environment is one with some structure and goals, but not too many. The phenomenon of “teaching to the test” for standardized testing is another example.

Given that humans aren’t best described as unitary selves, this metaphor extends to individuals. If one aspect of myself sets a personal goal to write two pages per day, another aspect of myself might respond by writing two pages on the easiest project I can think of, like a journal entry that no one will ever see. This violates the spirit of the goal it technically accomplishes.

A more problematic consideration is the relationship between intrinsic and extrinsic motivation. Studies show that rewarding or punishing children for tasks results in less intrinsic motivation, as measured by stated interest or by freely choosing to engage in the task. I’ve noticed this tendency in myself; faced with a nursing instructor who was constantly quizzing me on the pathophysiology of my patients’ conditions, I responded by refusing to be curious about any of it or look up the answers to questions in any more detail than what she demanded, even though my previous self loved to spend hours on Google making sense of confusing diseases. If this is a problem that affects individuals setting goals for themselves–i.e. if setting a daily writing goal makes writing less fun–then I can easily see how goal-setting could be damaging.

I also notice that I’m confused about the relationship between Beeminder’s extrinsic motivation, in the form of punishment for derailing, and its effects on intrinsic motivation. Maybe the power of success spirals to increase intrinsic motivation offsets the negative effect of outside reward/punishment; or maybe the fact that users deliberately choose to use Beeminder means that it doesn’t count as “extrinsic.” I’m not sure.

## Conclusion

There seems to be variation between individuals, in terms of both generally purposeful behaviour, and comfort level with calling it ‘setting goals’. This might be related to success spirals in the past, or it might be a factor of personality and general comfort with order versus chaos. I’m not sure if it’s been studied.

In the past, a lot of creative behaviour wasn’t the result of deliberate goals. This may be a fundamental fact about creativity, or it may be a result of people’s beliefs about creativity (à la ego depletion only happens if you belief in ego depletion) or it may be a historical coincidence that isn’t fundamental at all. In any case, if you aren’t currently getting creative work done, and want to do more, I’m not sure what the alternative is to purposefully trying to do more. Manipulating the environment to make flow easier to attain, maybe. (For example, if I quit my day job and moved to a writers' commune, I might write more without needing to try on a day-to-day basis).

Process goals, or systems, are probably better than outcome goals. Specific and realistic goals are probably better than vague and ambitious ones. A lot of this may be because it’s easier to form habits and/or success spirals around well-specified behaviours that you can just do every day.

Setting goals within an organization has a lot of potential problems, because workers can game the system and accomplish the letter of the goal in the easiest possible way. This likely happens within individuals too. Research shows that extrinsic motivation reduces intrinsic motivation, which is important to consider, but I'm not sure how it relates to individuals setting goals, as opposed to organizations.

19 18 August 2012 05:57PM

At the end of CFAR's July Rationality Minicamp, we had a party with people from the LW/SIAI/CFAR community in the San Francisco Bay area. During this party, I had a conversation with the girlfriend of a participant in a previous minicamp, who was not signed up for cryonics (her boyfriend was). The conversation went like this:

me: So, you know what cryonics is?

her: Yes

me: And you think it's a good idea?

her: Yes

me: And you are not signed up yet?

her: Yes

me: And you would like to be?

her: Yes

me: Wait a minute while I get my laptop.

And I got my laptop, pointed my browser at Rudi Hoffman's quote request form1, and said, "Here, fill out this form". And she did.

## General purpose intelligence: arguing the Orthogonality thesis

20 15 May 2012 10:23AM

Note: informally, the point of this paper is to argue against the instinctive "if the AI were so smart, it would figure out the right morality and everything will be fine." It is targeted mainly at philosophers, not at AI programmers. The paper succeeds if it forces proponents of that position to put forwards positive arguments, rather than just assuming it as the default position. This post is presented as an academic paper, and will hopefully be published, so any comments and advice are welcome, including stylistic ones! Also let me know if I've forgotten you in the acknowledgements.

Abstract: In his paper “The Superintelligent Will”, Nick Bostrom formalised the Orthogonality thesis: the idea that the final goals and intelligence levels of agents are independent of each other. This paper presents arguments for a (slightly narrower) version of the thesis, proceeding through three steps. First it shows that superintelligent agents with essentially arbitrary goals can exist. Then it argues that if humans are capable of building human-level artificial intelligences, we can build them with any goal. Finally it shows that the same result holds for any superintelligent agent we could directly or indirectly build. This result is relevant for arguments about the potential motivations of future agents.

## 1 The Orthogonality thesis

The Orthogonality thesis, due to Nick Bostrom (Bostrom, 2011), states that:

• Intelligence and final goals are orthogonal axes along which possible agents can freely vary: more or less any level of intelligence could in principle be combined with more or less any final goal.

It is analogous to Hume’s thesis about the independence of reason and morality (Hume, 1739), but applied more narrowly, using the normatively thinner concepts ‘intelligence’ and ‘final goals’ rather than ‘reason’ and ‘morality’.

But even ‘intelligence’, as generally used, has too many connotations. A better term would be efficiency, or instrumental rationality, or the ability to effectively solve problems given limited knowledge and resources (Wang, 2011). Nevertheless, we will be sticking with terminology such as ‘intelligent agent’, ‘artificial intelligence’ or ‘superintelligence’, as they are well established, but using them synonymously with ‘efficient agent’, artificial efficiency’ and ‘superefficient algorithm’. The relevant criteria is whether the agent can effectively achieve its goals in general situations, not whether its inner process matches up with a particular definition of what intelligence is.

## Approving reinforces low-effort behaviors

91 17 July 2011 08:43PM

In addition to "liking" to describe pleasure and "wanting" to describe motivation, we add "approving" to describe thoughts that are ego syntonic.

A heroin addict likes heroin. He certainly wants more heroin. But he may not approve of taking heroin. In fact, there are enough different cases to fill in all eight boxes of the implied 2x2x2 grid (your mileage may vary):

+wanting/+liking/+approving: Romantic love. If you're doing it right, you enjoy being with your partner, you're motivated to spend time with your partner, and you think love is a wonderful (maybe even many-splendored) thing.

+wanting/+liking/-approving: The aforementioned heroin addict feels good when taking heroin, is motivated to get more, but wishes he wasn't addicted.

+wanting/-liking/+approving: I have taken up disc golf. I play it every day, and when events conspire to prevent me from playing it, I seethe. I approve of this pastime: I need to take up more sports, and it helps me spend time with my family. But when I am playing, all I feel is stressed and angry that I was literally *that* close how could I miss that shot aaaaarggghh.

+wanting/-liking/-approving: The jaded addict. I have a friend who says she no longer even enjoys coffee or gets any boost from it, she just feels like she has to have it when she gets up.

-wanting/+liking/+approving: Reading non-fiction. I enjoy it when I'm doing it, I think it's great because it makes me more educated, but I can rarely bring myself to do it.

-wanting/-liking/+approving:
Working in a soup kitchen. Unless you're the type for whom helping others is literally its own reward it's not the most fun thing in the world, nor is it the most attractive, but it makes you a Good Person and so you should do it.

-wanting/+liking/-approving:
The non-addict. I don't want heroin right now. I think heroin use is repugnant. But if I took some, I sure bet I'd like it.

-wanting/-liking/-approving:
Torture. I don't want to be tortured, I wouldn't like it if I were, and I will go on record declaring myself to be against it.

Discussion of goals is mostly about approving; a goal is an ego-syntonic thought. When we speak of goals that are hard to achieve, we're usually talking about +approving/-wanting. The previous discussion of learning Swahili is one example; more noble causes like Working To Help The Less Fortunate can be others.

Ego syntonicity itself is mildly reinforcing by promoting positive self-image. Most people interested in philosophy have at least once sat down and moved their arm from side to side, just to note that their mind really does control their body; the mental processes that produced curiosity about philosophy were sufficiently powerful to produce that behavior as well. Some processes, like moving one's arm, or speaking aloud, or engaging in verbal thought, are so effortless, and so empty of other reinforcement either way, that we usually expect them to be completely under the control of the mild reinforcement provided by approving of those behaviors.

Other behaviors take more effort, and are subject not only to discounting but to many other forms of reinforcement. Unlike the first class of behaviors, we expect to experience akrasia when dealing with this latter sort. This offers another approach to willpower: taking low-effort approving-influenced actions that affect the harder road ahead.

Consider the action of making a goal. I go to all my friends and say "Today I shall begin learning Swahili." This is easy to do. There is no chance of me intending to do so and failing; my speech is output by the same processes as my intentions, so I can "trust" it. But this is not just an output of my mental processes, but an input. One of the processes potentially reinforcing my behavior of learning Swahili is "If I don't do this, I'll look stupid in front of my friends."

Will it be enough? Maybe not. But this is still an impressive process: my mind has deliberately tweaked its own inputs to change the output of its own algorithm. It's not even pretending to be working off of fixed preferences anymore, it's assuming that one sort of action (speaking) will work differently from another action (studying), because the first can be executed solely through the power of ego syntonicity, and the second may require stronger forms of reinforcement. It gets even weirder when goals are entirely mental: held under threat not of social disapproval, but of feeling bad because you're not as effective as you thought. The mind is using mind's opinion of the mind to blackmail the mind.

But we do this sort of thing all the time. The dieter who successfully avoids buying sweets when he's at the store because he knows he would eat them at home is changing his decisions by forcing effort discounting of any future sweet-related reward (because he'd have to go back to the store). The binge shopper who freezes her credit cards in a block of ice is using time discounting in the same way. The rationalist who sends money to stickk is imposing a punishment with a few immediate and effortless mouse clicks. Even the poor unhappy person who tries to conquer through willpower alone is trying to set up the goal as a Big Deal so she will feel extra bad if she fails. All are using their near-complete control of effortless immediate actions to make up for their incomplete control of high-effort long-term actions.

This process is especially important to transhumanists. In the future, we may have the ability to self-modify in complicated ways that have not built up strong patterns of reinforcement around them. For example, we may be able to program ourselves at the push of a button. Such programming would be so effortless and empty of past reinforcement that behavior involving it would be reinforced entirely by our ego-syntonic thoughts. It would supersede our current psychodynamics, in which our thoughts are only tenuously linked to our important actions and major life decisions. A Singularity in which behaviors were executed by effectively omnipotent machines that acted on our preferences - preferences which we would presumably communicate through low-effort channels like typed commands - would be an ultimate triumph for the ego-syntonic faction of the brain.

## Ego syntonic thoughts and values

53 17 July 2011 08:43PM

Last week I read a book in which two friends - let's call them John and Lisa so I don't spoil the book for anyone who wanders into it - got poisoned. They only had enough antidote for one person and had to decide who lived and who died. John, who was much larger than Lisa, decided to hold Lisa down and force the antidote down her throat. Lisa just smirked; she'd replaced the antidote with a lookalike after slipping the real thing into John's drink earlier in the day.

These are good friends. Not only was each willing to give the antidote to the other, but each realized it would be unfair to make the other live with the crippling guilt of having chosen to survive at the expense of a friend's life, and so decided to force the antidote on the other unwillingly to prevent any guilt over the fateful decision. Whatever you think of the ethics of their decision, you can't help admire the thought processes.

Your brain might be this kind of a friend.

In Trivers' hypothesis of self-deception, one of the most important functions of the conscious mind is effective signaling. Since people have the potential to be excellent lie-detectors, the conscious mind isn't given full access to information so that it can lend the ring of truth to useful falsehoods.

But this doesn't always work. If you're addicted to heroin, at some point you're going to notice. And telling your friends "No, I'm not addicted, it's just a coincidence that I take heroin every day," isn't going to cut it. But there's another way in which the brain can sequester information to promote effective signaling.

Wikipedia defines the term "ego syntonic" as "referring to behaviors, values, feelings that are in harmony with or acceptable to the needs and goals of the ego, or consistent with one's ideal self-image", and "ego dystonic" as the opposite of that. A heroin addict might say "I hate heroin, but somehow I just feel compelled to keep taking it." But an astronaut will say "I love being an astronaut and I worked hard to get into this career."

Both the addict and the astronaut have desires: the addict wants to take heroin, the astronaut wants to fly in space. But the addict's desires manifest as an unpleasant compulsion from outside, and the astronaut's manifest as a genuine and heartfelt love.

Suppose that in the original example, John predicted that Lisa would ask for the antidote, but later feel guilty about it and believe she was a bad person. By presenting the antidote to Lisa in the form of an external compulsion, he allows Lisa to do what she wanted anyway and avoid the associated guilt.

Under Trivers' hypothesis, the compulsion for heroin works the same way. The heroin addict's definitely going to get that heroin, but by presenting the desire in the form of an external compulsion, the unconscious saves the heroin addict from the social stigma of "choosing" heroin. This allows the addict to create a much more sympathetic narrative than the alternative: "I want to support my family and keep clean, but for some reason these compulsions keep attacking me," instead of "Yeah, I like heroin more than I like supporting my family. Deal with it."

EGO SYNTONIA, DYSTONIA, AND WILLPOWER

Willpower cashes out as the action of ego syntonic thoughts and desires against ego dystonic thoughts and desires.

The aforementioned heroin addict may have several reinforcers both promoting and discouraging heroin use. On the plus side, heroin itself is very strongly rewarding. On the minus, it can lead to both predicted and experienced poverty, loss of friendships, loss of health, and death.

Worrying about the latter factors determining heroin use - the factors that make heroin a bad idea - is socially encouraged and good signaling material. A person wanting to put their best face forward should believe themselves to be the sort of person who cares about these things. These desires will be ego syntonic. Wanting to take heroin, on the other hand, is a socially unacceptable desire, so it presents as dystonic.

If the latter syntonic factors win out over the dystonic factors, this feels from the inside like "I exerted willpower and managed to overcome my heroin addiction." If the dystonic factors win out over the syntonic factors, this feels from the inside like "I didn't have enough willpower to overcome my heroin addiction."

DYSTONIC DESIRES IN ABNORMAL PSYCHOLOGY

There is some speculation that the brain has one last trick up its sleeve to deal with desires that are so unpleasant and unacceptable that even manifesting them as external compulsions isn't good enough: it splits them off into weird alternate personalities.

One of the classic stereotypes of the insane is that they hear voices telling them to kill people. During my short time working at a psychiatric hospital, I was surprised by how spot-on this stereotype was: meeting someone who heard voices telling him to kill people was an almost daily occurrence. Other voices would have other messages: maybe that the patient was a horrible person who deserved to die, or that the patient must complete some bizarre ritual or else doom everybody. There were relatively fewer voices saying "Hey, let's go fishing!"

One theory explaining these voices is that they are an extreme reaction to highly ego dystonic thoughts. Some aspect of the patients' mental disease gives them obsessive thoughts about (though rarely a desire for) killing people. Genuinely wanting to kill people would make you a bad person, but even saying "I feel a strong compulsion to kill people" is pretty bad too. The best the brain can do with this desire is pitch it as a completely different person by presenting it as an outside voice speaking to the patient.

Although everything about dissociative identity disorder (aka multiple personality disorder) is controversial including its very existence, perhaps one could sketch a similar theory explaining that condition in the same framework of separating out dystonic thoughts.

SUMMARY

A conscious/unconscious divide helps signaling by allowing the conscious mind to hold only socially acceptable beliefs, which it can broadcast without detectable falsehood. Socially acceptable ideas present as the conscious mind's own beliefs and desires; unacceptable ones present as compulsions from afar. The balance of ego syntonic and dystonic desires presents as willpower. In extreme cases, some desires may be so ego dystonic that they present as external voices.

View more: Next