The genie knows, but doesn't care

Rob Bensinger

123 The genie knows, but doesn't care

by Rob Bensinger

6th Sep 2013

9 min read

495

123

Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

Summary: If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we're likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.

I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. ———But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.

On this line of reasoning, Friendly Artificial Intelligence is not difficult. It's inevitable, provided only that we tell the AI, 'Be Friendly.' If the AI doesn't understand 'Be Friendly.', then it's too dumb to harm us. And if it does understand 'Be Friendly.', then designing it to follow such instructions is childishly easy.

The end!

...

Is the missing option obvious?

...

What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?

When we see a Be Careful What You Wish For genie in fiction, it's natural to assume that it's a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn't be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.

Is indirect indirect normativity easy?

"If the poor machine could not understand the difference between 'maximize human pleasure' and 'put all humans on an intravenous dopamine drip' then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: 'If I put a million amps of current through my logic circuits, I will fry myself to a crisp', or 'Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I'm supposed to point at the other guy?'. Dumb AIs, in other words, are not an existential threat. [...]

"If the AI is (and always has been, during its development) so confused about the world that it interprets the 'maximize human pleasure' motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place."

—Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.

C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can't just tell it 'Start understanding the True Meaning of my sentences!' to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of 'Start understanding the True Meaning of my sentences!'.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn't subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It's not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can't be fully captured in any simple string of necessary and sufficient conditions. 'Concepts' are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.

6. It's clear that building stable preferences out of B or C would create a Friendly AI. It's not clear that the same is true for A. Even if the seed AI understands our commands, the 'do' part of 'do what you're told' leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky's reply to Holden. If the AGI doesn't already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers' implicit goals and intentions.

7. You can't appeal to a superintelligence to tell you what code to first build it with.

The point isn't that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It's that the linguistic competence of an AGI isn't unambiguously the right target, and also isn't easy or solved.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.

The AI's trajectory of self-modification has to come from somewhere.

"If the AI doesn't know that you really mean 'make paperclips without killing anyone', that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to 'make paperclips in the way that I mean'."

—Jiro

The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can't use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn't work that way.

We can delegate most problems to the FAI. But the one problem we can't safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.

Why is the superintelligence, if it's so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can't we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: 'When you're smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.'?

Because that sentence has to actually be coded in to the AI, and when we do so, there's no ghost in the machine to know exactly what we mean by 'frend-lee-ness thee-ree'. Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward. And if one of the landmarks on our 'frend-lee-ness' road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI's misdeeds, that they had programmed the seed differently. But what's done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions, the UFAI will just shrug at its creators' foolishness and carry on converting the Virgo Supercluster's available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It's easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it's hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.

The ability to productively rewrite software and the ability to perfectly extrapolate humanity's True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.

So, once again, we run into the problem: The seed isn't the superintelligence. If the programmers don't know in mathematical detail what Friendly code would even look like, then the seed won't be built to want to build toward the right code. And if the seed isn't built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won't have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general 'hit whatever target I want' ability that makes Friendliness easy.

And that's why some people are worried.

AI RiskComplexity of valueUtility Functions

Frontpage

123

New Comment

Rendering 0/495 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 11:20 PM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

123 The genie knows, but doesn't care

by Rob Bensinger

6th Sep 2013

9 min read

495

123

Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. ———But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.

The end!

...

Is the missing option obvious?

...

What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?

Is indirect indirect normativity easy?

"If the poor machine could not understand the difference between 'maximize human pleasure' and 'put all humans on an intravenous dopamine drip' then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: 'If I put a million amps of current through my logic circuits, I will fry myself to a crisp', or 'Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I'm supposed to point at the other guy?'. Dumb AIs, in other words, are not an existential threat. [...]

"If the AI is (and always has been, during its development) so confused about the world that it interprets the 'maximize human pleasure' motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place."

—Richard Loosemore

A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.

C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

7. You can't appeal to a superintelligence to tell you what code to first build it with.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.

The AI's trajectory of self-modification has to come from somewhere.

"If the AI doesn't know that you really mean 'make paperclips without killing anyone', that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to 'make paperclips in the way that I mean'."

—Jiro

Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

And that's why some people are worried.

AI RiskComplexity of valueUtility Functions

Frontpage

123

Mentioned in

271Alignment Implications of LLM Successes: a Debate in One Act

248Book Review: Going Infinite

192Evaluating the historical value misspecification argument

166The Most Forbidden Technique

166Ironing Out the Squiggles

Load More (5/14)

New Comment

Rendering 0/495 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 11:20 PM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

More from Rob Bensinger

Curated and popular this week

495Comments

495

Comment Permalink

Broolucks13y40

What counts as 'resources'? Do we think that 'hardware' and 'software' are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?

What is "taking over the world", if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others pointed out, that an AI which "misunderstands" your original instructions will demonstrate this earlier than later. For instance, if you create a resource "honeypot" outside the AI which is trivial to take, an AI would naturally take that first, and then you know there's a problem. It is not going to figure out you don't want it to take it before it takes it.

Hm? That seems to only penalize it for self-deception, not for deceiving others.

When I say "predict", I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party.

You're talking about an Oracle AI. This is one useful avenue to explore, but it's almost certainly not as easy as you suggest:

The first part of what you copy pasted seems to say that "it's nontrivial to implement". No shit, but I didn't say the contrary. Then there is a bunch of "what if" scenarios I think are not particularly likely and kind of contrived:

Example question: "How should I get rid of my disease most cheaply?" Example answer: "You won't. You will die soon, unavoidably. This report is 99.999% reliable". Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.'

Because asking for understandable plans means you can't ask for plans you don't understand? And you're saying that refusing to give a plan counts as success and not failure? Sounds like a strange set up that would be corrected almost immediately.

And if the preference function was just over the human's 'goodness' of the end result, rather than the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome.

If the AI has the right idea about "human understanding", I would think it would have the right idea about what we mean by "good". Also, why would you implement such a function before asking the AI to evaluate examples of "good" and provide their own?

And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.

Is making humans happy so hard that it's actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter?

And if you ask it to tell you whether "taking happy pills" is an outcome most humans would approve of, what is it going to answer? If it's going to do this for happiness, won't it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becoming wireheads and never picking up on the trend is actually less effort than just giving humans what they really want? To me this is like driving a whole extra hour to get to a store that sells an item you want fifty cents cheaper.

I'm not saying these things are not possible. I'm saying that they are contrived: they are constructed to the express purpose of being failure modes, but there's no reason to think they would actually happen, especially given that they seem to be more complicated than the desired behavior.

Now, here's the thing: you want to develop FAI. In order to develop FAI, you will need tools. The best tool is Tool AI. Consider a bootstrapping scheme: in order for commands written in English to be properly followed, you first make AI for the very purpose of modelling human language semantics. You can check that the AI is on the same page as you are by discussing with it and asking questions such as: "is doing X in line with the objective 'Y'?"; it doesn't even need to be self-modifying at all. The resulting AI can then be transformed into a utility function computer: you give the first AI an English statement and build a second AI maximizing the utility which is given to it by the first AI.

And let's be frank here: how else do you figure friendly AI could be made? The human brain is a complex, organically grown, possibly inconsistent mess; you are not going, from human wits alone, to build some kind of formal proof of friendliness, even a probabilistic one. More likely than not, there is no such thing: concepts such as life, consciousness, happiness or sentience are ill-defined and you can't even demonstrate the friendliness of a human being, or even of a group of human beings, let alone of humanity as a whole, which also is a poorly defined thing.

However, massive amounts of information about our internal thought processes are leaked through our languages. You need AI to sift through it and model these processes, their average and their variance. You need AI to extract this information, fill in the holes, produce probability clouds about intent that match whatever borderline incoherent porridge of ideas our brains implement as the end result of billions of years of evolutionary fumbling. In a sense, I guess this would be X in your seed AI: AI which already demonstrated, to our satisfaction, that it understands what we mean, and directly takes charge of a second AI's utility measurement. I don't really see any alternatives: if you want FAI, start by focusing on AI that can extract meaning from sentences. Reliable semantic extraction is virtually a prerequisite for FAI, if you can't do the former, forget about the latter.

See in context