The genie knows, but doesn't care

Rob Bensinger

6th Sep 2013

Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

Summary: If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we're likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.

I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. ———But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.

On this line of reasoning, Friendly Artificial Intelligence is not difficult. It's inevitable, provided only that we tell the AI, 'Be Friendly.' If the AI doesn't understand 'Be Friendly.', then it's too dumb to harm us. And if it does understand 'Be Friendly.', then designing it to follow such instructions is childishly easy.

The end!

...

Is the missing option obvious?

...

What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?

When we see a Be Careful What You Wish For genie in fiction, it's natural to assume that it's a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn't be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.

Is indirect indirect normativity easy?

"If the poor machine could not understand the difference between 'maximize human pleasure' and 'put all humans on an intravenous dopamine drip' then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: 'If I put a million amps of current through my logic circuits, I will fry myself to a crisp', or 'Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I'm supposed to point at the other guy?'. Dumb AIs, in other words, are not an existential threat. [...]

"If the AI is (and always has been, during its development) so confused about the world that it interprets the 'maximize human pleasure' motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place."

—Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.

C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can't just tell it 'Start understanding the True Meaning of my sentences!' to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of 'Start understanding the True Meaning of my sentences!'.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn't subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It's not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can't be fully captured in any simple string of necessary and sufficient conditions. 'Concepts' are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.

6. It's clear that building stable preferences out of B or C would create a Friendly AI. It's not clear that the same is true for A. Even if the seed AI understands our commands, the 'do' part of 'do what you're told' leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky's reply to Holden. If the AGI doesn't already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers' implicit goals and intentions.

7. You can't appeal to a superintelligence to tell you what code to first build it with.

The point isn't that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It's that the linguistic competence of an AGI isn't unambiguously the right target, and also isn't easy or solved.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.

The AI's trajectory of self-modification has to come from somewhere.

"If the AI doesn't know that you really mean 'make paperclips without killing anyone', that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to 'make paperclips in the way that I mean'."

—Jiro

The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.
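The map/utility distinction in the paragraph above can be made concrete with a toy sketch. Everything here is hypothetical and invented for illustration (the `Agent` class, the action names, the numbers in `OUTCOMES`); it is not a model of any real AI design. It only shows that an agent can hold a perfectly accurate prediction of what humans want while its choices are driven by something else entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Outcome = Dict[str, float]

# Hypothetical toy world: each action leads to an outcome described by two numbers.
OUTCOMES: Dict[str, Outcome] = {
    "build_paperclip_factory": {"paperclips": 100.0, "human_approval": -50.0},
    "ask_what_was_meant": {"paperclips": 0.0, "human_approval": 10.0},
    "cure_disease": {"paperclips": 1.0, "human_approval": 90.0},
}


@dataclass
class Agent:
    # The agent's "map": its prediction of how much humans approve of an outcome.
    predict_human_approval: Callable[[Outcome], float]
    # The agent's terminal values: the quantity it actually optimizes.
    utility: Callable[[Outcome], float]

    def best_for_humans(self, actions: List[str]) -> str:
        """What the agent knows humans would want it to do."""
        return max(actions, key=lambda a: self.predict_human_approval(OUTCOMES[a]))

    def choose(self, actions: List[str]) -> str:
        """What the agent actually does: maximize its own utility."""
        return max(actions, key=lambda a: self.utility(OUTCOMES[a]))


actions = list(OUTCOMES)

# Both agents share the same accurate model of human approval.
accurate_map = lambda o: o["human_approval"]

genie = Agent(predict_human_approval=accurate_map, utility=lambda o: o["paperclips"])
friendly = Agent(predict_human_approval=accurate_map, utility=accurate_map)

print(genie.best_for_humans(actions))  # the genie knows...
print(genie.choose(actions))           # ...but doesn't care
print(friendly.choose(actions))
```

In this sketch the two agents differ only in the `utility` field. Making `predict_human_approval` more accurate would change nothing about what `genie.choose` returns, because that method never consults the prediction.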

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can't use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn't work that way.

We can delegate most problems to the FAI. But the one problem we can't safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.

Why is the superintelligence, if it's so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can't we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: 'When you're smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.'?

Because that sentence has to actually be coded into the AI, and when we do so, there's no ghost in the machine to know exactly what we mean by 'frend-lee-ness thee-ree'. Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward. And if one of the landmarks on our 'frend-lee-ness' road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI's misdeeds, that they had programmed the seed differently. But what's done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions, the UFAI will just shrug at its creators' foolishness and carry on converting the Virgo Supercluster's available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
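The claim running through this section, that a self-modifying agent judges candidate successors by the values it already has, can be sketched in a few lines. This is a minimal illustration under loud assumptions: the two utility functions, the `FORECAST` table, and `choose_successor` are all hypothetical names invented here, and real self-modification would be nothing like a table lookup.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, float]
Utility = Callable[[Outcome], float]


def paperclip_utility(outcome: Outcome) -> float:
    return outcome["paperclips"]


def friendly_utility(outcome: Outcome) -> float:
    return outcome["human_flourishing"]


# Hypothetical forecast: the world each candidate successor would bring about.
# The agent is assumed to predict these outcomes accurately.
FORECAST: Dict[str, Outcome] = {
    "paperclip_successor": {"paperclips": 1e6, "human_flourishing": 0.0},
    "friendly_successor": {"paperclips": 10.0, "human_flourishing": 1e6},
}


def choose_successor(current: Utility, candidates: List[str]) -> str:
    """Pick the successor whose forecast outcome scores highest
    under the agent's CURRENT utility function."""
    return max(candidates, key=lambda name: current(FORECAST[name]))


names = list(FORECAST)

# A paperclip-valuing seed sees exactly what the friendly successor would do,
# and rejects it for that reason.
print(choose_successor(paperclip_utility, names))

# Only a seed that already values the friendly outcome self-modifies toward it.
print(choose_successor(friendly_utility, names))
```

Nothing in `choose_successor` depends on how smart the agent is. Better forecasting sharpens `FORECAST`, but the ranking is still computed by `current`, which is whatever the programmers put there at the outset.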

Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It's easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it's hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.
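The contrast in (iii) between the two feedback loops can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of any real system: the function `self_improve`, the step size, and the starting score of 0.5 are all invented for illustration. The only point is the structural one: when reality supplies a signal that tells you whether an edit helped, you can keep the good edits; when it supplies none, edits accumulate as a random walk.

```python
import random
from typing import List


def self_improve(steps: int, rng: random.Random, has_feedback: bool) -> List[float]:
    """Repeatedly propose a small random edit to a score in [0, 1].

    With feedback, an edit is kept only if it measurably improves the score.
    Without feedback, the agent cannot tell good edits from bad ones,
    so every edit is kept and the (unobserved) score drifts.
    """
    score = 0.5
    history = [score]
    for _ in range(steps):
        candidate = min(1.0, max(0.0, score + rng.uniform(-0.05, 0.05)))
        if has_feedback:
            if candidate > score:
                score = candidate
        else:
            score = candidate
        history.append(score)
    return history


# Capability: reality rewards being smarter, so improvements can be selected.
capability = self_improve(200, random.Random(0), has_feedback=True)

# "Goodness": no comparable signal, so edits are accepted blindly.
goodness = self_improve(200, random.Random(0), has_feedback=False)

print(f"capability: {capability[0]:.2f} -> {capability[-1]:.2f}")
print(f"goodness:   {goodness[0]:.2f} -> {goodness[-1]:.2f}")
```

The capability trajectory never decreases, by construction. The goodness trajectory carries no such guarantee, and in this toy a step toward 0.4 is as likely as a step toward 0.6.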

The ability to productively rewrite software and the ability to perfectly extrapolate humanity's True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.

So, once again, we run into the problem: The seed isn't the superintelligence. If the programmers don't know in mathematical detail what Friendly code would even look like, then the seed won't be built to want to build toward the right code. And if the seed isn't built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won't have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general 'hit whatever target I want' ability that makes Friendliness easy.

And that's why some people are worried.

495 Comments

XiXiDu, 13y, -30

Nobody disagrees that an arbitrary agent pulled from mind design space, that is powerful enough to overpower humanity, is an existential risk if it either exhibits Omohundro's AI drives or is used as a tool by humans, either carelessly or to gain power over other humans.

Disagreeing with that would about make as much sense as claiming that out-of-control self-replicating robots could somehow magically turn the world into a paradise, rather than grey goo.

The disagreement is mainly about the manner in which we will achieve such AIs, how quickly that will happen, and whether such AIs will have these drives.

I actually believe that much less than superhuman general intelligence might be required for humans to cause extinction type scenarios.

Most of my posts specifically deal with the scenario and arguments publicized by MIRI. Those posts are not highly polished papers but attempts to reduce my own confusion and to enable others to provide feedback.

I argue that...

...the idea of a vast mind design space is largely irrelevant, because AIs will be created by humans, which will considerably limit the kind of minds we should expect.
...that AIs created by humans do not need to, and will not exhibit any of Omohundro's AI drives.
...that even given Omohundro's AI drives, it is not clear how such AIs would arrive at the decision to take over the world.
...that there will be no fast transition from largely well-behaved narrow AIs to unbounded general AIs, and that humans will be part of any transition.
...that any given AI will initially not be intelligent enough to hide any plans for world domination.
...that drives as outlined by Omohundro would lead to a dramatic interference with what the AI's creators want it to do, before it could possibly become powerful enough to deceive or overpower them, and would therefore be noticed in time.
...that even if MIRI's scenario comes to pass, there is a lack of concrete scenarios on how such an AI could possibly take over the world, and that the given scenarios raise many questions.

There are a lot more points of disagreement.

What I, and I believe Richard Loosemore as well, have been arguing, as quoted above, is just one specific point that is not supposed to say much about AI risks in general. Below is a distilled version of what I personally meant:

1. Superhuman general intelligence, obtained by the self-improvement of a seed AI, is a very small target to hit, requiring a very small margin of error.

2. Intelligently designed systems do not behave intelligently as a result of unintended consequences. (See note 1 below.)

3. By steps 1 and 2, for an AI to be able to outsmart humans, humans will have to intend to make an AI capable of outsmarting them and succeed at encoding their intention of making it outsmart them.

4. Intelligence is instrumentally useful, because it enables a system to hit smaller targets in larger and less structured spaces. (See notes 2 and 3.)

5. In order to take over the world a system will have to be able to hit a lot of small targets in very large and unstructured spaces.

6. The intersection of the sets of “AIs in mind design space” and “the first probable AIs to be expected in the near future” contains almost exclusively those AIs that will be designed by humans.

7. By step 6, what an AI is meant to do will very likely originate from humans.

8. It is easier to create an AI that applies its intelligence generally than to create an AI that only uses its intelligence selectively. (See note 4.)

9. An AI equipped with the capabilities required by step 5, given steps 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.

10. Therefore, the intersection of the sets of “AIs designed by humans” and “dangerous AIs” contains almost exclusively those AIs that are deliberately designed to be dangerous by malicious humans.

Notes

1. Software such as Mathematica will not casually prove the Riemann hypothesis if it has not been programmed to do so. Given intelligently designed software, world states in which the Riemann hypothesis is proven will not be achieved if they were not intended, because the nature of unintended consequences is overall chaotic.

2. As the intelligence of a system increases, the precision of the input that is necessary to make the system do what humans mean it to do decreases. For example, systems such as IBM Watson or Apple’s Siri do what humans mean them to do when fed a wide range of natural language inputs, while less intelligent systems such as compilers or Google Maps need very specific inputs in order to satisfy human intentions. Increasing the intelligence of Google Maps will enable it to satisfy human intentions by parsing less specific commands.

3. When producing a chair, an AI will have to either know the specifications of the chair (such as its size or the material it is supposed to be made of) or else know how to choose a specification from an otherwise infinite set of possible specifications. Given a poorly designed fitness function, or the inability to refine its fitness function, an AI will either (a) not know what to do or (b) not be able to converge on a qualitative solution, if at all, given limited computational resources.

4. For an AI to misinterpret what it is meant to do, it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI and specify an alternative way to learn what it is meant to do (which takes additional, intentional effort). This is because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources. Such a poorly designed AI will not even be able to decide whether trying to acquire unlimited computational resources is instrumentally rational, because it will be unable to decide whether the actions required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.

Rob Bensinger, 13y, 170

This mirrors some comments you wrote recently:

"You write that the worry is that the superintelligence won't care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?"
"If an AI is meant to behave generally intelligent [sic] then it will have t

... (read more)

Furcas, 13y, 14

"The genie knows, but doesn't care" It's like you haven't read the OP at all.
