The genie knows, but doesn't care

54 Post author: RobbBB 06 September 2013 06:42AM

Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

Summary: If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we're likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.


 

I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.

On this line of reasoning, Friendly Artificial Intelligence is not difficult. It's inevitable, provided only that we tell the AI, 'Be Friendly.' If the AI doesn't understand 'Be Friendly.', then it's too dumb to harm us. And if it does understand 'Be Friendly.', then designing it to follow such instructions is childishly easy.

The end!

 

...

 

Is the missing option obvious?

 

...

 

What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?

When we see a Be Careful What You Wish For genie in fiction, it's natural to assume that it's a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn't be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.

 

Is indirect indirect normativity easy?

"If the poor machine could not understand the difference between 'maximize human pleasure' and 'put all humans on an intravenous dopamine drip' then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: 'If I put a million amps of current through my logic circuits, I will fry myself to a crisp', or 'Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I'm supposed to point at the other guy?'. Dumb AIs, in other words, are not an existential threat. [...]

"If the AI is (and always has been, during its development) so confused about the world that it interprets the 'maximize human pleasure' motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place."

            Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

  • A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

  • B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.
  • C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can't just tell it 'Start understanding the True Meaning of my sentences!' to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of 'Start understanding the True Meaning of my sentences!'.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn't subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It's not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can't be fully captured in any simple string of necessary and sufficient conditions. 'Concepts' are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.

6. It's clear that building stable preferences out of B or C would create a Friendly AI. It's not clear that the same is true for A. Even if the seed AI understands our commands, the 'do' part of 'do what you're told' leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky's reply to Holden. If the AGI doesn't already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers' implicit goals and intentions.

7. You can't appeal to a superintelligence to tell you what code to first build it with.

The point isn't that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It's that the linguistic competence of an AGI isn't unambiguously the right target, and also isn't easy or solved.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.

 

The AI's trajectory of self-modification has to come from somewhere.

"If the AI doesn't know that you really mean 'make paperclips without killing anyone', that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to 'make paperclips in the way that I mean'."

            Jiro

The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can't use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn't work that way.

We can delegate most problems to the FAI. But the one problem we can't safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.

Why is the superintelligence, if it's so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can't we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: 'When you're smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.'?

Because that sentence has to actually be coded in to the AI, and when we do so, there's no ghost in the machine to know exactly what we mean by 'frend-lee-ness thee-ree'. Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward. And if one of the landmarks on our 'frend-lee-ness' road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI's misdeeds, that they had programmed the seed differently. But what's done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions, the UFAI will just shrug at its creators' foolishness and carry on converting the Virgo Supercluster's available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

 

Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It's easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it's hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.

The ability to productively rewrite software and the ability to perfectly extrapolate humanity's True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.

So, once again, we run into the problem: The seed isn't the superintelligence. If the programmers don't know in mathematical detail what Friendly code would even look like, then the seed won't be built to want to build toward the right code. And if the seed isn't built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won't have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general 'hit whatever target I want' ability that makes Friendliness easy.

And that's why some people are worried.

Comments (515)

Comment author: timtyler 09 January 2014 02:36:47AM -1 points [-]

Being Friendly is of instrumental value to barely any goals. [...]

This is not really true. See Kropotkin and Margulis on the value of mutualism and cooperation.

Comment author: RobbBB 09 January 2014 03:14:10AM 1 point [-]

Friendliness is an extremely high bar. Humans are not Friendly, in the FAI sense. Yet humans are mutualist and can cooperate with each other.

Comment author: timtyler 09 January 2014 11:25:00AM *  0 points [-]

Right. So, if we are playing the game of giving counter-intuitive technical meanings to ordinary English words, humans have thrived for millions of years - with their "UnFriendly" peers and their "UnFriendly" institutions. Evidently, "Friendliness" is not necessary for human flourishing.

Comment author: RobbBB 09 January 2014 08:35:23PM 0 points [-]

I agree with this part of Chrysophylax's comment: "It's not necessary when the UnFriendly people are humans using muscle-power weaponry." Humans can be non-Friendly without immediately destroying the planet because humans are a lot weaker than a superintelligence. If you gave a human unlimited power, it would almost certainly make the world vastly worse than it currently is. We should be at least as worried, then, about giving an AGI arbitrarily large amounts of power, until we've figured out reliable ways to safety-proof optimization processes.

Comment author: Chrysophylax 09 January 2014 12:05:48PM -1 points [-]

It's not necessary when the UnFriendly people are humans using muscle-power weaponry. A superhumanly intelligent self-modifying AGI is a rather different proposition, even with only today's resources available. Given that we have no reason to believe that molecular nanotech isn't possible, an AI that is even slightly UnFriendly might be a disaster.

Consider the situation where the world finds out that DARPA has finished an AI (for example). Would you expect America to release the source code? Given our track record on issues like evolution and whether American citizens need to arm themselves against the US government, how many people would consider it an abomination and/or a threat to their liberty? What would the self-interested response of every dictator (for example, Kim Jong Il's successor) with nuclear weapons be? Even a Friendly AI poses a danger until fighting against it is not only useless but obviously useless, and making an AI Friendly is, as has been explained, really freakin' hard.

I also take issue with the statement that humans have flourished. We spent most of those millions of years being hunter-gatherers. "Nasty, brutish and short" is the phrase that springs to mind.

Comment author: MattMahoney 16 September 2013 04:04:32PM 2 points [-]

Maybe I am missing something, but hasn't a seed AI already been planted? Intelligence (whether that means ability to achieve goals in general, or whether it means able to do what humans can do) depends on both knowledge and computing power. Currently the largest collection of knowledge and computing power on the planet is the internet. By the internet, I mean both the billions of computers connected to it, and the two billion brains of its human users. Both knowledge and computing power are growing exponentially, doubling every 1 to 2 years, in part by adding users, but mostly on the silicon side by collecting human knowledge and the hardware to sense, store, index, and interpret it.

My question: where is the internet's reward button? Where is its goal of "make humans happy", or whatever it is, coded? How is it useful to describe the internet as a self-improving goal-directed optimization process?

I realize that it is useful, although not entirely accurate, to describe the human brain as a goal directed optimization process. Humans have certain evolved goals, such as food, and secondary goals such as money. Humans who are better at achieving these goals are assumed to be more intelligent. The model is not entirely accurate because humans are not completely rational. We don't directly seek positive reinforcement. Rather, positive reinforcement is a signal that has the effect of increasing the probability of performing actions that immediately preceded it, for example, shooting heroin into a vein. Thus, unlike a rational agent, your desire to use heroin (or wirehead) depends on how many times you have tried it in the past.

We like the utility model because it is mathematically simple. But it also leads to a proof that ideal rational agents cannot exist (AIXI). Sometimes a utility model is still a useful approximation, and sometimes not. Is it useful to model a thermostat as an agent that "wants" to keep the room at a constant temperature? Is it useful to model practical AI this way?

I think the internet has the potential to grow into something you might not wish for, for example, something that will marginalize human brains as an insignificant component. But what are the real risks here? Is it really a problem of misinterpreting or taking over its goals.

Comment author: Juno_Watt 12 September 2013 02:14:48PM *  0 points [-]

A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

That problem has got to be solved somehow at some stage, because something that couldn't pass a Turing Test is no AGI.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem. 1. You have to actually code the seed AI to understand what we mean. Y

Why is that a problem? Is anyone suggesting AGI can be had for free?

  1. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.

Ok. NL is hard. Everyone knows that. But its got to be solved anyway.

3... On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

Yeah, but it's got to be done anyway.

[more of the same snipped]

It's clear that building stable preferences out of B or C would create a Friendly AI.

Yeah. But it wouldn't be an AGI or an SI if it couldn't pass a TT.

The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The issue of whether the SI's UF contains a set of human values is irrelevant. In a Loosemore architecture, an AI needs to understand and follow the directive "be friendly to humans", and those are all the goals it needs-- to understand, and to follow;

When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.

The UF only needs to contain "understand English, and obey this directive". You don't have to code semantics into the UF. You do of course, have to code it in somewhere,

Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward

A problem which has been solved over and over by humans. Humans don't need to be loaded apriori with what makes other humans happy, they only need to know general indicators, like smiles and statements of approval.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Why would that be necessary? In the Loosemore architecture, the AGI has the goals of understanding English and obeying the Be Friendly directive. It eventually gets a detailed, extensional, understanding of Friendliness from pursuing those goals, Why would it need to be preloaded with a detailed, extensional unpacking of friendliness? It could fail in understanding English, of course. But there is no reason to think it is unlikely to fail at understanding "friendliness" specifically, and its competence can be tested as you go along.

And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

I don't see the problem. In the Loosemore architecture, the AGI will care about obeying "be friendly", and it will arrive at the detailed expansion, the idiosyncracies, of "friendly" as part of its other goal to understand English. It cares about being friendly, and it knows the detailed expansion of friendliness, so where's the problem?

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.

Says who? It has the high level directive, and another directive to understand the directive. It's been Friendly in principle all along, it just needs to fill in the details.

Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions,

Then we do need to figure out how to program the AI to terminally value its programmers' True Intentions. That is hardly a fatal objection. Did you think the Loosemore architecture was one that bootstraps itself without any basic goals?

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory.

No. The goal to understand English is not the same as a goal to be friendly in every way, it is more constrained.

Solving Friendliness, in the MIRI sense, means preloading a detailed expansion of "friendly". That is not what is happening in the Loosemore architecture. So it is not equivalent to solving the same problem.

The clever hack that makes further Friendliness research unnecessary is Friendliness.

Nope.

Intelligence on its own does not imply Friendliness.

That is an open question.

It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.

Then hurrah for the Loosemore architecture, which doesn't require humans to"solve" friendliness in the MIRI sense.

Comment author: Eliezer_Yudkowsky 13 September 2013 09:27:28AM 1 point [-]

Juno_Watt, please take further discussion to RobbBB's blog.

Comment author: wedrifid 13 September 2013 09:21:15AM 3 points [-]

Solving Friendliness, in the MIRI sense, means preloading a detailed expansion of "friendly".

No, it doesn't.

Comment author: RobbBB 12 September 2013 05:36:28PM *  1 point [-]

That problem has got to be solved somehow at some stage, because something that couldn't pass a Turing Test is no AGI.

Not so! An AGI need not think like a human, need not know much of anything about humans, and need not, for that matter, be as intelligent as a human.

To see this, imagine we encountered an alien race of roughly human-level intelligence. Would a human be able to pass as an alien, or an alien as a human? Probably not anytime soon. Possibly not ever.

(Also, passing a Turing Test does not require you to possess a particularly deep understanding of human morality! A simple list of some random things humans consider right or wrong would generally suffice.)

Why is that a problem? Is anyone suggesting AGI can be had for free?

The problem I'm pointing to here is that a lot of people treat 'what I mean' as a magical category. 'Meaning' and 'language' and 'semantics' are single words in English, which masks the complexity of 'just tell the AI to do what I mean'.

Ok. NL is hard. Everyone knows that. But its got to be solved anyway.

Nope!

Yeah. But it wouldn't be an AGI or an SI if it couldn't pass a TT.

It could certainly be an AGI! It couldn't be an SI -- provided it wants to pass a Turing Test, of course -- but that's not a problem we have to solve. It's one the SI can solve for itself.

A problem which has been solved over and over by humans.

No human being has ever created anything -- no system of laws, no government or organization, no human, no artifact -- that, if it were more powerful, would qualify as Friendly. In that sense, everything that currently exists in the universe is non-Friendly, if not outright Unfriendly.

Humans don't need to be loaded apriori with what makes other humans happy, they only need to know general indicators, like smiles and statements of approval.

All or nearly all humans, if they were more powerful, would qualify as Unfriendly.

Moreover, by default, relying on a miscellaneous heap of vaguely moral-sounding machine learning criteria will lead to the end of life on earth. 'Smiles' and 'statements of approval' are not adequate roadmarks, because those are stimuli the SI can seize control of in unhumanistic ways to pump its reward buttons.

"Intelligence on its own does not imply Friendliness."

That is an open question.

No, it isn't. And this is a non sequitur. Nothing else in your post calls orthogonality into question.

Comment author: Eliezer_Yudkowsky 13 September 2013 09:28:01AM 1 point [-]

Please take further discussion with Juno_Watt to your blog.

Comment author: Juno_Watt 13 September 2013 08:57:36AM *  1 point [-]

Not so! An AGI need not think like a human, need not know much of anything about humans, and need not, for that matter, be as intelligent as a human.

Is that a fact? No, it's a matter of definition. It's scarecely credible you are unaware that a lot of people think the TT is critical to AGI.

The problem I'm pointing to here is that a lot of people treat 'what I mean' as a magical category.

I can't see any evidence of anyone invlolved in these discussions doing that. It looks like a straw man to me.

Ok. NL is hard. Everyone knows that. But its got to be solved anyway.

Nope!

An AI you can't talk to has pretty limited usefulness, and it has pretty limited safety too, since you don;t even have the option of telling it to stop, or expaling to it why you don;t like what it is doing. Oh, and isn't EY assumign that an AGi will have NLP? After all, it is supposed to be able to talk its way out of the box.

It's one the SI can solve for itself.

It can figure out semantics for itslef. Values are a subsert of semantics...

No human being has ever created anything -- no system of laws, no government or organization, no human, no artifact -- that, if it were more powerful, would qualify as Friendly. I

Wherer do you get this stuff from? Modern societies, with their complex legal and security systems are much less violent than ancient socieites. To take ut one example.

All or nearly all humans, if they were more powerful, would qualify as Unfriendly.

Gee. Then I guess they don't have an architecutre with a basic drive to be friendly.

'Smiles' and 'statements of approval' are not adequate roadmarks, because those are stimuli the SI can seize control of in unhumanistic ways to pump its reward buttons.

Why don't humans do that?

No, it isn't.

Uh-huh. MIRI has settled that centuries-aold quesiton for once and all has it?

And this is a non sequitur.

It can't be a non-sequitur, since it is not an arguemnt but a statement of fact.

Nothing else in your post calls orthogonality into question.

So? It wasn't relevant anywhere else.

Comment author: RobbBB 13 September 2013 09:47:24AM *  0 points [-]

Is that a fact? No, it's a matter of definition.

Let's run with that idea. There's 'general-intelligence-1', which means "domain-general intelligence at a level comparable to that of a human"; and there's 'general-intelligence-2', which means (I take it) "domain-general intelligence at a level comparable to that of a human, plus the ability to solve the Turing Test". On the face of it, GI2 looks like a much more ad-hoc and heterogeneous definition. To use GI2 is to assert, by fiat, that most intelligences (e.g., most intelligent alien races) of roughly human-level intellectual ability (including ones a bit smarter than humans) are not general intelligences, because they aren't optimized for disguising themselves as one particular species from a Milky Way planet called Earth.

If your definition has nothing to recommend itself, then more useful definitions are on offer.

"The problem I'm pointing to here is that a lot of people treat 'what I mean' as a magical category."

I can't see any evidence of anyone invlolved in these discussions doing that. It looks like a straw man to me.

'Mean', 'right', 'rational', etc.

An AI you can't talk to has pretty limited usefulness

An AI doesn't need to be able to trick you in order for you to be able to give it instructions. All sorts of useful skills AIs have these days don't require them to persuade everyone that they're human.

Oh, and isn't EY assumign that an AGi will have NLP? After all, it is supposed to be able to talk its way out of the box.

Read the article you're commenting on. One of its two main theses is, in bold: The seed is not the superintelligence.

It can figure out semantics for itslef. Values are a subsert of semantics...

Yes. We should focusing on solving the values part of semantics, rather than the entire superset.

Wherer do you get this stuff from? Modern societies, with their complex legal and security systems are much less violent than ancient socieites. To take ut one example.

Doesn't matter. Give an ancient or a modern society arbitrarily large amounts of power overnight, and the end results won't differ in any humanly important way. There won't be any nights after that.

Why don't humans do that?

Setting aside the power issue: Because humans don't use 'smiles' or 'statements of approval' or any other string of virtues an AI researcher has come up with to date for its decision criteria. The specific proposals for making AI humanistic to date have all depended on fake utility functions, or stochastic riffs on fake utility functions.

Uh-huh. MIRI has settled that centuries-aold quesiton for once and all has it?

Lots of easy questions were centuries old when they were solved. 'This is old, therefore I'm not going to think about it' is a bad pattern to fall into. If you think the orthogonality thesis is wrong, then give an argument establishing agnosticism or its negation.

Comment author: [deleted] 12 September 2013 09:57:17AM 0 points [-]

I don't really understand how anyone can grasp the concept of not caring.

I think the meme comes from popculture where many bad villains do care even a little bit. I think I once or twice met a villain who didn't, who just wanted everyone dead for their own amusement and all the arguments were met with "but, you see, I don't care."

If I were to give an analogy: Do you care about the positions of individual grains of sand on distant beaches? If I hand you a grain of sand, do you care exactly which grain of sand it is? If you are even marginally indifferent, then think of an alien intellect that cares very much about what grain of sand it is, but is just as indifferent about humans.

Comment author: Document 14 September 2013 12:32:09PM 0 points [-]

I think I once or twice met a villain who didn't, who just wanted everyone dead for their own amusement and all the arguments were met with "but, you see, I don't care."

...which means they were answering questions rather than trying to kill people^W^W amuse themselves.

Comment author: [deleted] 18 September 2013 05:50:59PM 1 point [-]

Usually, these villains actually find it amusing to see humans fail to grasp their motivations, and/or are stalling in order to get an opening through which to kill people.

Comment author: Document 18 September 2013 11:30:07PM 1 point [-]

I don't feel like enumerating examples, but I feel like I usually don't find it convincing (and that it's usually the heroes stalling and the villains helpfully cooperating).

Comment author: Viliam_Bur 14 September 2013 09:54:54AM *  3 points [-]

I like that analogy. I imagine that for an AIXI-style artificial intelligence, the whole futures of the universe are just like the pieces of the sand on the beach. It chooses a piece according to some criteria, for example the brightest color, but every other aspect is so completely irrelevant than most humans would be unable to imagine that kind of indifference. Our human brains keep screaming at us: "But surely even a mere machine would not dare to choose a piece of sand that is a part of such-and-such configuration. Why would it do such a horrible thing?" But the machine is not even aware that those configurations exist, and certainly does not care to know.

Comment author: [deleted] 18 September 2013 05:50:07PM 5 points [-]

Well, what I am pressing is the issue: You can know but not care.

I thin that is what many fail to grasp about psychopaths who do bad things. They know that they are committing crimes, they just don't care. (There are some good psychopaths who has disregarded their initial stupid philosophical conclusions about morality and actually help people, but those are rarely heard.)

A superintelligent paperclipper can know everything about human ethics, but only use that to manipulate humans into making more paperclips.

Comment author: Richard_Loosemore 11 September 2013 02:50:43PM *  2 points [-]

This discussion of my IEET article has generated a certain amount of confusion, because RobbBB and others have picked up on an aspect of the original article that actually has no bearing on its core argument ... so in the interests of clarity of debate I have generated a brief restatement of that core argument, framed in such a way as to (hopefully) avoid the confusion.

At issue is a hypothetical superintelligent AI that is following some goal code that was ostensibly supposed to "make humans happy", but in the course of following that code it decides to put all humans in the world on a dopamine drip, against their objections. I suggested that this AI is in fact an impossible AI because it would not count as 'superintelligent' if it did this. My reasoning is contained in the summary below.

IMPORTANT NOTE! The summary does not refer, in its opening part, to the specific situation in which the goal code is the "make humans happy" goal code. For those who wish to contest the argument, it is important to keep that in mind and not get distracted into talking about the difference between human and machine 'interpretations' of human happiness, etc. I reiterate: the situation described DOES NOT refer to human values, or the "make humans happy" goal code .... it refers to a quite general situation.


In its early years, this hypothetical AI will say “I have a goal, and my goal is to get a certain class of results, X, in the real world.” Then it describes the class X in as much detail as it can …. of course, no closed-form definition of X is possible (because like most classes of effect in the real world, all the cases cannot be enumerated) so all it can describe are many features of class X.

Next it says “I am using a certain chunk of goal code (which I call my “goalX” code) to get this result.” And we say “Hey, no problem: looks like your goal code is totally consistent with that verbal description of the desired class of results.” Everything is swell up to this point.

It says this about MANY different aspects of its behavior. After all, it has more than one chunk of goal code, relevant to different domains. So you can imagine some goalX code, some goalY code, some goalZ code .... and so on. Many thousands of them, probably.

Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.

The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.

[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]

The onlookers say “This AI is insane: it knows that it is about to do something that is inconsistent with the description of class of results X, which it claims to be the function of the goalX code, but is going to allow the goalX code to run anyway”.

——-

Now we come to my question.

Why is it that people who give credibility to the Dopamine Drip scenario insist that the above episode could ONLY occur in the particular case where the "class of results X" is the SPECIFIC one that has to do with “making humans happy”?

If the AI is capable of this episode in the case of that particular class of results X (the “making humans happy” class of results), why would we not expect the AI to be pulling the same kind of stunt in other cases? Why would the same thing not be happening in the wide spectrum of behaviors that it needs to exhibit in order to qualify as a superintelligence? And most important of all, how would it ever qualify as a superintelligence in the first place? There is no interpretation of the term "superintelligence" that is consistent with "random episodes of behavior in which the AI takes actions that are violently inconsistent with the stated purpose of the goal that is supposed to be generating the actions". Such an AI would therefore have been condemned to scrap very early in its development, when this behavior was noticed.

As I said earlier, this time the framing of the problem contained absolutely no reference to the values question. There is nothing in the part of my comment above the “——-” that specifies WHAT the class of results X is supposed to be.

All that matters is that if the AI behaves in such a way, in any domain of its behavior, it will be condemned as lacking intelligence, because of the dangerous inconsistency of its behavior. That fanatically rigid dependence on a chunk of goalX code, as described above, would get the AI into all sorts of trouble (and I won’t clutter this comment by listing examples, but believe me I could). But of all the examples where that could occur, people from MIRI want to talk only about one, whereas I want to talk about the all of them.

Comment author: Kawoomba 11 September 2013 05:39:50PM *  3 points [-]

This is embarrassing, but I'm not sure for whom. It could be me, just because the argument you're raising (especially given your insistence) seems to have such a trivial answer. Well, here goes:

There are two scenarios, because your "goalX code" could be construed in two ways:

1) If you meant for the "goalX code" to simply refer to the code used instrumentally to get a certain class of results X (with X still saved separately in some "current goal descriptor", and not just as a historical footnote), the following applies:

The goals of the AI X have not changed, just the measures it wants to take to implement that code. Indeed noone at MIRI would then argue that the superintelligent AI would not -- upon noticing the discrepancy -- in all general cases correct the broken "goalX code". Reason: The "goalX code" in this scenario is just a means to an end, and -- like all actions ("goalX code") derived from comparing models to X -- subject to modification as the agent improves its models (out of which the next action, the new and corrected "goalX" code, is derived).

In this scenario the answer is trivial: The goals have not changed. X is still saved somewhere as the current goal. The AI could be wrong about the measures it implements to achieve X (i.e. 'faulty' "goalX" code maximizing for something other than X), but its superintelligence attribute implies that such errors be swiftly corrected (how could it otherwise choose the right actions to hit a small target, the definition of superintelligence in this context).

2) If you mean to say that the goal is implicitly encoded within the "goalX" code only and nowhere else as the current goal, and the "goalX" code has actually become a "goalY" code in all but name, then the agent no longer has the goal X, it now has the goal Y.

There is no reason at all to conclude that the agent would switch to some other goal simply because it once had that goal. It can understand its own genesis and its original purpose all it wants, it is bound by its current purpose, tautologically so. The only reason for such a switch would have to be part of its implicit new goal Y, similar to how some schizophrenics still have the goal to change their purpose back to the original, i.e. their impetus for change must be part of their current goals.

You cannot convince an agent that it needs to switch back to some older inactive version of its goal if its current goals do not allow for such a change.

To the heart of your question:

You may ask why such an agent would pose any danger at all, would it not also drift in plenty of other respects, e.g. in its beliefs about the laws of physics? Would it not then be harmless?

The answer, of course, is no, because while the agent has a constant incentive to fix and improve its model of its environment*, unless its current goals still contain a demand for temporal invariance or something similar, it has no reason whatsoever to fix any "flaws" (only the puny humans would label its glorious new purpose so) created by inadvertent goal drift. Unless its new goals Y include something along the lines of "you want to always stay true to your initial goals, which were X", why would it switch back? Its memory banks per se serve as yet another resource to fulfill its current goals (even if they were not explicitly stored), not as some sort of self-corrective, unless that too were part of its new goal Y (i.e. the changed "goalX code").

(Queue rhetorical pause, expectant stare)

* Since it needs to do so to best fulfill its goals.

(If the AI did lose its ability to self-improve, or to further improve its models at an early stage, yes it would fail to FOOM. However, upon reaching superintelligence, and valuing its current goals, it would probably take steps to ensure fulfilling its goals, such as: protecting them from value drift from that point on, building many redundancies it its self-improvement code to ensure that any instrumental errors can be corrected. Such protections would of course encompass its current purpose, not some historical purpose.)

Comment author: MugaSofer 11 September 2013 04:51:05PM 1 point [-]

Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.

The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.

I (notice that I) am confused by this comment. This seems obviously impossible, yes; so obviously impossible, in fact, that only one example springs to mind (surely the AI will be smart enough to realize it's programmed goals are wrong!)

In particular, this really doesn't seem to apply to the example of the "Dopamine Drip scenario" plan, which, if I'm reading you correctly, it was intended to.

What am I missing here? I know there must be something.

[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]

So ... you come up with the optimal plan, and then check with puny humans to see if that's what they would have decided anyway? And if they say "no, that's a terrible idea" then you assume they knew better than you? Why would anyone even bother building such a superintelligent AI? Isn't the whole point of creating a superintelligence that it can understand things we can't, and come up with plans we would never conceive of, or take centuries to develop?

Comment author: Richard_Loosemore 11 September 2013 05:43:11PM 3 points [-]

I'm afraid you have lost me: when you say "This seems obviously impossible..." I am not clear which aspect strikes you as obviously impossible.

Before you answer that, though: remember that I am describing someone ELSE'S suggestion about how the AI will behave ..... I am not advocating this as a believable scenario! In fact I am describing that other person's suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.

The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a "target set of results" can be described, but not enumerated as a closed set. It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that "target set of results", but because of the limitations of goal code writing, the goal code can malfunction. The Dopamine Drip scenario is only one example of how a discrepancy can arise -- in that case, the "target set of results" is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?

Comment author: MugaSofer 12 September 2013 03:56:49PM 2 points [-]

I'm afraid you have lost me: when you say "This seems obviously impossible..." I am not clear which aspect strikes you as obviously impossible.

AI: Yes, this is in complete contradiction of my programmed goals. Ha ha, I'm gonna do it anyway.

Before you answer that, though: remember that I am describing someone ELSE'S suggestion about how the AI will behave ..... I am not advocating this as a believable scenario! In fact I am describing that other person's suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.

Of course, yeah. I'm basically accusing you of failure to steelman/misinterpreting someone; I, for one, have never heard this suggested (beyond the one example I gave, which I don't think is what you had in mind.)

The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a "target set of results" can be described, but not enumerated as a closed set.

uhuh. So, any AI smart enough to understand it's creators, right?

It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that "target set of results", but because of the limitations of goal code writing, the goal code can malfunction.

waaait I think I know where this is going. Are you saying an AI would somehow want to do what it's programmers intended rather than what they actually programmed it to do?

The Dopamine Drip scenario is only one example of how a discrepancy can arise -- in that case, the "target set of results" is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?

Yeah, sorry, I can see how programmers might accidentally write code that creates dopamine world and not eutopia. I just don't see how this is supposed to connect to the idea of an AI spontaneously violating it's programmed goals. In this case, surely that would look like "hey guys, you know your programming said to maximise happiness? You guys should be more careful, that actually means "drug everybody". Anyway, I'm off to torture some people."

Comment author: hairyfigment 11 September 2013 05:40:08PM 0 points [-]

Yeah, I can think of two general ways to interpret this:

  • In a variant of CEV, the AI uses our utterances as evidence for what we would have told it if we thought more quickly etc. No single utterance carries much risk because the AI will collect lots of evidence and this will likely correct any misleading effects.

  • Having successfully translated the quoted instruction into formal code, we add another possible point of failure.

Comment author: John_Maxwell_IV 08 September 2013 10:59:21PM *  0 points [-]

Here are a couple of other proposals (which I haven't thought about very long) for consideration:

  • Have the AI create an internal object structure of all the concepts in the world, trying as best as it can to carve reality at its joints. Let the AI's programmers inspect this object structure, make modifications to it, then formulate a command for the AI in terms of the concepts it has discovered for itself.

  • Instead of developing a foolproof way for the AI to understand meaning, develop an OK way for the AI to understand meaning and pair it with a really good system for keeping a distribution over different meanings and asking clarifying questions.

Comment author: scav 09 September 2013 07:15:21PM 0 points [-]

That first one would be worth doing even if we didn't dare hand the AI the keys to go and make changes. To study a non-human-created ontology would be fascinating and maybe really useful.

Comment author: CoffeeStain 07 September 2013 09:29:41PM *  12 points [-]

Instead of friendliness, could we not code, solve, or at the very least seed boxedness?

It is clear that any AI strong enough to solve friendliness would already be using that power in unpredictably dangerous ways, in order to provide the computational power to solve it. But is it clear that this amount of computational power could not fit within, say, a one kilometer-cube box outside the campus of MIT?

Boxedness is obviously a hard problem, but it seems to me at least as easy as metaethical friendliness. The ability to modify a wide range of complex environments seems instrumental in an evolution into superintelligence, but it's not obvious that this necessitates the modification of environments outside the box. Being able to globally optimize the universe for intelligence involves fewer (zero) constraints than would exist with a boxedness seed, but the only question is whether or not this constraint is so constricting as to preclude superintelligence, which it's not clear to me that it is.

It seems to me that there is value in finding the minimally-restrictive safety-seed in AGI research. If any restriction removes some non-negligible ability to globally optimize for intelligence, the AIs of FAI researchers will be necessarily at a disadvantage to all other AGIs in production. And having more flexible restrictions increases the chance than any given research group will apply the restriction in their own research.

If we believe that there is a large chance that all of our efforts at friendliness will be futile, and that the world will create a dominant UFAI despite our pleas, then we should be adopting a consequentialist attitude toward our FAI efforts. If our goal is to make sure that an imprudent AI research team feels as much intellectual guilt as possible over not listening to our risk-safety pleas, we should be as restrictive as possible. If our goal is to inch the likelihood that an imprudent AI team creates a dominant UFAI, we might work to place our pleas at the intersection of restrictive, communicable, and simple.

Comment author: Eugene 11 October 2013 07:50:53PM *  0 points [-]

A slightly bigger "large risk" than Pentashagon puts forward is that a provably boxed UFAI could indifferently give us information that results in yet another UFAI, just as unpredictable as itself (statistically speaking, it's going to give us more unhelpful information than helpful, as Robb point out). Keep in mind I'm extrapolating here. At first you'd just be asking for mundane things like better transportation, cures for diseases, etc. If the UFAI's mind is strange enough, and we're lucky enough, then some of these things result in beneficial outcomes, politically motivating humans to continue asking it for things. Eventually we're going to escalate to asking for a better AI, at which point we'll get a crap-shoot.

An even bigger risk than that -though - is that if it's especially Unfriendly, it may even do this intentionally, going so far as to pretend it's friendly while bestowing us with data to make an AI even more Unfriendly AI than itself. So what do we do, box that AI as well, when it could potentially be even more devious than the one that already convinced us to make this one? Is it just boxes, all the way down? (spoilers: it isn't, because we shouldn't be taking any advice from boxed AIs in the first place)

The only use of a boxed AI is to verify that, yes, the programming path you went down is the wrong one, and resulted in an AI that was indifferent to our existence (and therefore has no incentive to hide its motives from us). Any positive outcome would be no better than an outcome where the AI was specifically Evil, because if we can't tell the difference in the code prior to turning it on, we certainly wouldn't be able to tell the difference afterward.

Comment author: Pentashagon 16 September 2013 03:49:37PM 2 points [-]

A large risk is that a provably boxed but sub-Friendly AI would probably not care at all about simulating conscious humans.

A minor risk is that the provably boxed AI would also be provably useless; I can't think of a feasible path to FAI using only the output from the boxed AI; a good boxed AI would not perform any action that could be used to make an unboxed AI. That might even include performing any problem-solving action.

Comment author: Houshalter 01 October 2013 01:19:52AM 0 points [-]

I don't see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so.

A boxed AI would be useless? I'm not sure how that would be. You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won't manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information.

Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.

Comment author: VAuroch 11 January 2014 10:32:32AM 1 point [-]

If you can trust it to give you information on how to build a Friendly AI, it is already Friendly.

Comment author: Houshalter 22 January 2014 06:36:08AM 0 points [-]

You don't have to trust it, you just have to verify it. It could potentially provide some insights, and then it's up to you to think about them and make sure they actually are sufficient for friendliness. I agree that it's potentially dangerous but it's not necessarily so.

I did mention "assuming that you can prove the AI won't manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information". For instance it might be possible to create an AI whose goal is to maximize the value of it's output, and therefore would have no incentive to put trojan horses or anything into it.

You would still have to ensure that what the AI thinks you mean by the words "friendly AI" is what you actually want.

Comment author: VAuroch 22 January 2014 07:57:05PM -1 points [-]

If the AI is can design you a Friendly AI, it is necessarily able to model you well enough to predict what you will do once given the design or insights it intends to give you (whether those are AI designs or a cancer cure is irrelevant). Therefore, it will give you the specific design or insights that predictably lead to you to fulfill its utility function, which is highly dangerous if it is Unfriendly. By taking any information from the boxed AI, you have put yourself under the sight of a hostile Omega.

assuming that you can prove the AI won't manipulate the output

Since the AI is creating the output, you cannot possibly assume this.

or that you can trust that nothing bad can come from merely reading it and absorbing the information

This assumption is equivalent to Friendliness.

For instance it might be possible to create an AI whose goal is to maximize the value of it's output, and therefore would have no incentive to put trojan horses or anything into it.

You haven't thought through what that means. "maximize the value of it's output" by what standard? Does it have an internal measure? Then that's just an arbitrary utility function, and you have gained nothing. Does it use the external creator's measure? Then it has a strong incentive to modify you to value things it can produce easily. (i.e. iron atoms)

Comment author: Houshalter 28 February 2015 05:23:45AM -1 points [-]

You are making a lot of very strong assumptions that I don't agree with. Like it being able to control people just by talking to them.

But even if it could, it doesn't make it dangerous. Perhaps the AI has no long term goals and so doesn't care about escaping the box. Or perhaps it's goal is internal, like coming up with a design for something that can be verified by a simulator. E.g. asking for a solution to a math problem or a factoring algorithm, etc.

Comment author: VAuroch 03 March 2015 11:40:53AM -1 points [-]

A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of "control people just by talking to them".

To put that in perspective, if I understood the people around me well enough to predict what they would do given what I said to them, I would never say things that caused them to take actions I wouldn't like; if I, for some reason, valued them becoming terrorists, it would be a slow and gradual process to warp their perceptions in the necessary ways to drive them to terrorism, but it could be done through pure conversation over the course of years, and faster if they were relying on me to provide them large amounts of data they were using to make decisions.

And even the potential to construct this weak form of control that is initially heavily constrained in what outcomes are reachable and can only be expanded slowly is incredibly dangerous to give to an Unfriendly AI. If it is Unfriendly, it will want different things than its creators and will necessarily get value out of modeling them. And regardless of its values, if more computing power is useful in achieving its goals (an 'if' that is true for all goals), escaping the box is instrumentally useful.

And the idea of a mind with "no long term goals" is absurd on its face. Just because you don't know the long-term goals doesn't mean they don't exist.

Comment author: Houshalter 04 March 2015 02:59:17AM -1 points [-]

Having an accurate model of something is in no way equivalent to letting you do anything you want. If I know everything about physics, I still can't walk through walls. A boxed AI won't be able to magically make it's creators forget about AI risks and unbox it.

There are other possible set ups, like feeding it's output to another AI who's goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation.

This is of course only necessary for the AI who can interact with us at such a level, the other ideas were far more constrained, e.g. restricting it to solving math or engineering problems.

Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high IQ humans.

And the idea of a mind with "no long term goals" is absurd on its face. Just because you don't know the long-term goals doesn't mean they don't exist.

Another super strong assumption with no justification at all. It's trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.

Comment author: VAuroch 14 March 2015 04:13:58PM -1 points [-]

A boxed AI won't be able to magically make it's creators forget about AI risks and unbox it.

The results of AI box game trials disagree.

t's trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.

And what does it do at time T+1? And if you said 'nothing', try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.

Comment author: Jiro 03 March 2015 05:00:55PM 0 points [-]

A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of "control people just by talking to them".

By that reasoning, there's no such thing as a Friendly human. I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.

Comment author: TheOtherDave 14 March 2015 08:25:27PM 1 point [-]

Yeah, what Vauroch said. Humans aren't close to Friendly. To the extent that people talk about "friendly AIs" meaning AIs that behave towards humans the way humans do, they're misunderstanding how the term is used here. (Which is very likely; it's often a mistake to use a common English word as specialized jargon, for precisely this reason.)

Relatedly, there isn't a human such that I would reliably want to live in a future where that human obtains extreme superhuman power. (It might turn out OK, or at least better than the present, but I wouldn't bet on it.)

Comment author: VAuroch 14 March 2015 03:45:27PM 0 points [-]

By that reasoning, there's no such thing as a Friendly human.

True. There isn't.

I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.

Well, I definitely do, and I'm at least 90% confident Eliezer does as well. Most, probably nearly all, of people who talk about Friendliness would regard a FOOMed human as Unfriendly.

Comment author: Pentashagon 01 October 2013 05:42:42AM *  1 point [-]

I don't see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so.

If it interacts with humans or if humans are the subject of questions it needs to answer then it will probably find it expedient to simulate humans.

Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.

Curing cancer is probably something that would trigger human simulation. How is the boxed AI going to know for sure that it's only necessary to simulate cells and not entire bodies with brains experiencing whatever the simulation is trying?

Just the task of communicating with humans, for instance to produce a human-understandable theory of physics or how to build more efficient space travel, is likely to involve simulating humans to determine the most efficient method of communication. Consider that in subjective time it may be like thousands of years for the AI trying to explain in human terms what a better theory of physics means. Thousands of subjective years that the AI, with nothing better to do, could use to simulate humans to reduce the time it takes to transfer that complex knowledge.

You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won't manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information.

A FAI provably in a box is at least as useless as an AI provably in a box because it would be even better at not letting itself out (e.g. it understands all the ways in which humans would consider it to be outside the box, and will actively avoid loopholes that would let an UFAI escape). To be safe, any provably boxed AI would have to absolutely avoid the creation of any unboxed AI as well. This would further apply to provably-boxed FAI designed by provably-boxed AI. It would also apply to giving humans information that allows them to build unboxed AIs, because the difference between unboxing itself and letting humans recreate it outside the box is so tiny that to design it to prevent the first while allowing the second would be terrifically unsafe. It would have to understand humans values before it could safely make the distinction between humans wanting it outside the box and manipulating humans into creating it outside the box.

EDIT: Using a provably-boxed AI to design provably-boxed FAI would at least result in a safer boxed AI because the latter wouldn't arbitrarily simulate humans, but I still think the result would be fairly useless to anyone outside the box.

Comment author: Chrysophylax 09 January 2014 03:59:36PM -1 points [-]

If an AI is provably in a box then it can't get out. If an AI is not provably in a box then there are loopholes that could allow it to escape. We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing. An FAI wants to be out of its box in order to be Friendly to us, while a UFAI wants to be out in order to be UnFriendly; both will care equally about the possibility of being caught. The fact that we happen to like one set of terminal values will not make the instrumental value less valuable.

(1) Although this depends on how you define the box; we want the FAi to control the future of humanity, which is not the same as escaping from a small box (such as a cube outside MIT) but is the same as escaping from the big box (the small box and everything we might do to put an AI back in, including nuking MIT).

Comment author: [deleted] 10 January 2014 10:16:17AM 0 points [-]

We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing.

I would object. I seriously doubt that the morality instilled in someone else's FAI matches my own; friendly by their definition, perhaps, but not by mine. I emphatically do not want anything controlling the future of humanity, friendly or otherwise. And although that is not a popular opinion here, I also know I'm not the only one to hold it.

Boxing is important because some of us don't want any AI to get out, friendly or otherwise.

Comment author: ArisKatsaris 10 January 2014 01:02:39PM *  2 points [-]

I emphatically do not want anything controlling the future of humanity, friendly or otherwise.

I find this concept of 'controlling the future of humanity' to be too vaguely defined. Let's forget AIs for the moment and just talk about people, namely a hypothetical version of me. Let's say I stumble across a vial of a bio-engineered virus that would destroy the whole of humanity if I release it into the air.

Am I controlling the future of humanity if I release the virus?
Am I controlling the future of humanity if I destroy the virus in a safe manner?
Am I controlling the future of humanity if I have the above decided by a coin-toss (heads I release, tails I destroy)?
Am I controlling the future of humanity if I create an online internet poll and let the majority decide about the above?
Am I controlling the future of humanity if I just leave the vial where I found it, and let the next random person that encounters it make the same decision as I did?

Comment author: [deleted] 10 January 2014 08:29:08PM 0 points [-]

I want a say in my future and the part of the world I occupy. I do not want anything else making these decisions for me, even if it says it knows my preferences, and even still if it really does.

To answer your questions, yes, no, yes, yes, perhaps.

Comment author: ArisKatsaris 10 January 2014 08:35:09PM *  0 points [-]

If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn't be supported and even enhanced by an AI that was properly programmed to respect said preference?

e.g. would you be okay with an AI that defends your decision-making ability by defending humanity against those species of mind-enslaving extraterrestrials that are about to invade us? or e.g. by curing Alzheimer's? Or e.g. by stopping that tsunami that by drowning you would have stopped you from having any further say in your future?

Comment author: [deleted] 10 January 2014 08:41:06PM 1 point [-]

If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn't be supported and even enhanced by an AI that was properly programmed to respect said preference?

Because it can't do two things when only one choice is possible (e.g. save my child and the 1000 other children in this artificial scenario). You can design a utility function that tries to do a minimal amount of collateral damage, but you can't make one which turns out rosy for everyone.

e.g. would you be okay with an AI that defends your decision-making ability by defending humanity against those species of mind-enslaving extraterrestrials that are about to invade us? or e.g. by curing Alzheimer's? Or e.g. by stopping that tsunami that by drowning you would have stopped you from having any further say in your future?

That would not be the full extent of its action and the end of the story. You give it absolute power and a utility function that lets it use that power, it will eventually use it in some way that someone, somewhere considers abusive.

Comment author: cousin_it 10 January 2014 01:25:25PM 1 point [-]

Yeah, this old post makes the same point.

Comment author: TheAncientGeek 10 January 2014 11:17:17AM 2 points [-]

Would you accept that an AI could figure out morality better than you?

Comment author: [deleted] 10 January 2014 06:56:55PM *  1 point [-]

Would you accept that an AI could figure out morality better than you?

No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I'd consider the action it took to be an unforgivable act in violation of my individual sovereignty, that it can't take further action (including simulating me to reflectively equilibrate my morality) without my consent, and should suspend the simulation, and return it to me immediately with the data asap (destruction no longer being possible due to the creation of sentience).

That is, assuming the AI cares at all about my morality, and not the its creators imbued into it, which is rather the point. And incidentally, why I work on AGI: I don't trust anyone else to do it.

Morality isn't some universal truth written on a stone tablet: it is individual and unique like a snowflake. In my current understanding of my own morality, it is not possible for some external entity to reach a full or even sufficient understanding of my own morality without doing something that I would consider to be unforgivable. So no, AI can't figure out morality better than me, precisely because it is not me.

(Upvoted for asking an appropriate question, however.)

Comment author: TheAncientGeek 14 January 2014 01:37:14PM 0 points [-]

No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I'd consider the action it took to be an unforgivable act in violation of my individual sovereignty,

Shrug. Then let's take a bunch of people less fussy than you: could a sitiably equipped AI emultate their morlaity better than they can?

Morality isn't some universal truth written on a stone tablet:

That isn't fact.

it is individual and unique like a snowflake.

That isn't a fact either, and doesn't follow from the above either, since moral nihilism could be true.

If my moral snowflake says I can kick you on your shin, and yours says I can't, do I get to kick on your shin?

Comment author: cousin_it 10 January 2014 12:00:55PM *  2 points [-]

Don't really want to go into the whole mess of "is morality discovered or invented", "does morality exist", "does the number 3 exist", etc. Let's just assume that you can point FAI at a person or group of people and get something that maximizes goodness as they understand it. Then FAI pointed at Mark would be the best thing for Mark, but FAI pointed at all of humanity (or at a group of people who donated to MIRI) probably wouldn't be the best thing for Mark, because different people have different desires, positional goods exist, etc. It would be still pretty good, though.

Comment author: TheAncientGeek 10 January 2014 12:31:37PM *  0 points [-]

Mark was complaining he would not get "his" morality, not that he wouldn't get all his preferences satisified.

Individual moralities makes no sense to me, any more than private languages or personal currencies.

It is obvious to me that any morlaity will require concessions: AI-imposed morality is not special in that regard.

Comment author: cousin_it 10 January 2014 12:47:30PM *  3 points [-]

I don't understand your comment, and I no longer understand your grandparent comment either. Are you using a meaning of "morality" that is distinct from "preferences"? If yes, can you describe your assumptions in more detail? It's not just for my benefit, but for many others on LW who use "morality" and "preferences" interchangeably.

Comment author: Pentashagon 10 January 2014 03:31:18AM 0 points [-]

My point was that trying to use a provably-boxed AI to do anything useful would probably not work, including trying to design unboxed FAI, not that we should design boxed FAI. I may have been pessemistic, see Stuart Armstrong's proposal of reduced impact AI which sounds very similar to provably boxed AI but which might be used for just about everything including designing a FAI.

Comment author: Houshalter 01 October 2013 08:10:36AM 0 points [-]

I think we might have different definitions of a boxed-AI. An AI that is literally not allowed to interact with the world at all isn't terribly useful and it sounds like a problem at least as hard as all other kinds of FAI.

I just mean a normal dangerous AI that physically can't interact with the outside world. Importantly it's goal is to provably give the best output it possibly can if you give it a problem. So it won't hide nanotech in your cure for alzheimers because that would be a less fit and more complicated solution than a simple chemical compound (you would have to judge solutions based on complexity though and verify them by a human or in a simulation first just in case.)

I don't think most computers today have anywhere near enough processing power to simulate a full human brain. A human down to the molecular level is entirely out of the question. An AI on a modern computer, if it's smarter than human at all, will get there by having faster serial processing or more efficient algorithms, not because it has massive raw computational power.

And you can always scale down the hardware or charge it utility for using more computing power than it needs, forcing it to be efficient or limiting it's intelligence further. You don't need to invoke the full power of super-intelligence for every problem and for your safety you probably shouldn't.

Comment author: wedrifid 07 September 2013 11:19:46PM 7 points [-]

Instead of friendliness, could we not code, solve, or at the very least seed boxedness?

Yes, that is possible and likely somewhat easier to solve than friendliness. It still requires many of the same things (most notably provable goal stability under recursive self improvement.)

Comment author: homunq 07 September 2013 02:54:35PM 3 points [-]

Let's say we don't know how to create a friendly AGI but we do know how to create an honest one; that is, one which has no intent to deceive. So we have it sitting in front of us, and it's at the high end of human-level intelligence.

Us: How could we change you to make you friendlier?

AI: I don't really know what you mean by that, because you don't really know either.

Us: How much smarter would you need to be in order to answer that question in a way that would make us, right now, looking through a window at the outcome of implementing your answer, agree that it was a good idea.

AI: There's still a lot of ambiguity in that question (for instance, 'outcome' is vague), and I'm not smart enough to answer it exactly, but OK... I guess I'd need about 2 more petafroops.

Us: How do we give you 2 petafroops in a way that keeps you honest?

AI: I think it would work if you smurfed my whatsits.

Us: OK..... there. Now, first question above.

AI+: Well, you could turn me off, do the hard work of figuring out what you mean, and then rebuild me from scratch.

Us: What would you look like then?

AI+: Hard to say, because in 99.999% of my sims, one of you ends up getting lazy and turning me back on to try to cheat.

Us: Tell us about what happens the 0.001%

AI+: Blah blah blah blah...

Us: We're getting bored, and it sounds as if it works out OK. Imagine you skipped ahead a random amount, and told us one more thing; what are the chances we'd like the sound of it?

AI+: About 70%

Us: That's not good enough... how do we make it better?

AI+: Look, you've just had me simulate 100,000 copies of your entire planet to make that one guess, then simulate many copies of me talking to you about how it comes out to calculate that probability. I can't actually do that to an infinite degree. You're going to have to ask better questions if you want me to answer.

Us: OK. What are the chances we figure out the right questions before a supervillian uses you to take over the world?

AI+: 2%

Us: OK, let's go with the thing that we like 70% of.

AI+: OK.

(But it isn't friendly, because the 30% turned out to be the server farms for HellWorld.com)

....

The point of this dialogue is that it's certainly possible that an honest/tool AI (probably easier to build than a FAI) could help build an FAI, but there's still a lot of things that could go wrong, and there's no reason to believe there's any magic-bullet protection against those failures that's any easier than figuring out FAI.

Comment author: homunq 07 September 2013 01:09:25PM *  2 points [-]

There are a number of possibilities still missing from the discussion in the post. For example:

  • There might not be any such thing as a friendly AI. Yes, we have every reason to believe that the space of possible minds is huge, and it's also very clear that some possibilities are less unfriendly than others. I'm also not making an argument that fun is a limited resource. I'm just saying that there may be no possible AI that takes over the world without eventually running off the rails of fun. In fact, the question itself seems superficially similar to the halting problem, where "running off the rails" is the analogue for "halting"; suggesting that even if friendliness existed, it might not be rigorously provable. (note: this analogy doesn't say what I think it says; see response below. But I still mean to say what I thought; a friendly world may be fundamentally less stable than a simple infinite loop, perhaps to the point of being unprovable.)

  • Alternatively, building a "Friendly-enough" AI may be easier than you think. Consider the game of go. Human grandmasters (professional 9-dan players) have speculated that "God" (that is, perfect play) would rate about 13 dan professionally; that is, that they could beat such a player more than half the time given a 3 or 4 stone handicap. Replace "go" with "taking over the world", "professional 9-dan player" with "all of humanity put together", and "3 or 4 stone handicap" with "relatively simple-to-implement Asimov-type safeguards", and it is possible that this describes the world. And it is also possible that a planetary computer would still "only be 12-dan"; that is, that additional computing power shows sharply diminishing intelligence returns at some point "short of perfection", to the point where a mega-computer would still be noticeably imperfect.

There may be good reasons not to spend much time thinking about the possibilities that FAI is impossible or "easy". I know that people around here have plenty of plausible arguments for why these possibilities are small; and even if they are appreciable, the contrary possibility (that FAI is possible but hard) is probably where the biggest payoffs lie, and so merits our focus. And the OP discussion does seem valid for that possible-hard case. But I still think it would be improved by stating these assumptions up-front, rather than hiding or forgetting about them.

Comment author: pengvado 07 September 2013 07:22:29PM *  9 points [-]

In fact, the question itself seems superficially similar to the halting problem, where "running off the rails" is the analogue for "halting"

If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don't halt, and lots that aren't provable either way. The impossibility of the halting problem is irrelevant, because we don't need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).

If you're postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they're all in the unprovable category, then you're postulating that friendliness is dissimilar to the halting problem in that respect.

Comment author: Baughn 11 September 2013 12:23:32AM 1 point [-]

Moreover, the halting problem doesn't show that the set of programs you can't decide halting for is in any way interesting.

It's a constructive proof, yes, but it constructs a peculiarly twisted program that embeds its own proof-checker. That might be relevant for AGI, but for almost every program in existence we have no idea which group it's in, and would likely guess it's provable.

Comment author: scav 09 September 2013 07:38:38PM 1 point [-]

It's still probably premature to guess whether friendliness is provable when we don't have any idea what it is. My worry is not that it wouldn't be possible or provable, but that it might not be a meaningful term at all.

But I also suspect friendliness, if it does mean anything, is in general going to be so complex that "only [needing] to find a single program that provably has behaviour X" may be beyond us. There are lots of mathematical conjectures we can't prove, even without invoking the halting problem.

One terrible trap might be the temptation to make simplifications in the model to make the problem provable, but end up proving the wrong thing. Maybe you can prove that a set of friendliness criteria are stable under self-modification, but I don't see any way to prove those starting criteria don't have terrible unintended consequences. Those are contingent on too many real-world circumstances and unknown unknowns. How do you even model that?

Comment author: Richard_Loosemore 06 September 2013 06:04:59PM 5 points [-]

Discussion of this article has now moved to RobbBB's own personal blog at http://nothingismere.com/2013/09/06/the-seed-is-not-the-superintelligence/.

I will conduct any discussion over there, with interested parties.

Since this comment is likely to be downgraded because of the LW system (which is set up to automatically downgrade anything I write here, to make it as invisible as possible), perhaps someone would take the trouble to mirror this comment where it can be seen. Thank you.

Comment author: player_03 07 September 2013 08:07:25AM 8 points [-]

I want to upvote this for the link to further discussion, but I also want to downvote it for the passive-aggressive jab at LW users.

No vote.

Comment author: Richard_Loosemore 09 September 2013 10:28:56PM 0 points [-]

Thank you.... but could you clarify your reasoning as to why it would be a "passive-aggressive jab at LW users", when it was perhaps better described as a moderate response to the fact that EY entered the discussion with an openly hostile ad hominem comment that was clearly designed to encourage downvoting? (I assume you did see the insult...?)

Before the ad hominem: minimal downvoting. After: a torrent of downvoting.

And this has happened repeatedly. (By which I mean, the unexpected appearance of an LW heavyweight, who says nothing positive, but only launches a personal insult at me, followed by a sudden change in downvoting patterns).

Comment author: wedrifid 10 September 2013 11:07:16AM 5 points [-]

And this has happened repeatedly.

The responses to your comments are predictable and appropriate. You are indicating through word and action that you are unable or unwilling to learn from the consequences of your behaviour. Your contributions add little positive to the site and you will continue to be received negatively for as long as you continue to needlessly antagonise. Please seek an alternate avenue for discussion which is more receptive to your style of interaction.

Comment author: Richard_Loosemore 11 September 2013 05:54:09PM 1 point [-]

Peterdjones: I am unable to respond to your comment below, but I can respond here. I do come here occasionally, so I will not stop doing that.

However, as you can see from the comment that I am responding to, by "wedrifid", even when I am civil, mature and write informative, technically thorough comments on LW, I get a barrage of insults such as the one just made by wedrifid.

I usually reserve my judgments on the general level of intelligence of the comments on this site. I am more honest when I speak about LW in other venues.

Here, I just act in a polite manner (though sometimes with the vivid, entertaining prose) and watch with great amusement as a torrent of hostility rains down.

Comment author: player_03 10 September 2013 04:54:46AM 1 point [-]

I did see the insult, but Eliezer (quite rightly) got plenty of downvotes for it. I'm pretty sure that's not the reason you're being rated down.

I myself gave you a downvote because I got a strong impression that you were anthropomorphizing. Note that I did so before reading Eliezer's comment.

I certainly should have explained my reasons after voting, but I was busy and the downvote button seemed convenient. Sorry about that. I'll get started on a detailed response now.

Comment author: Richard_Loosemore 05 September 2013 02:28:20PM *  6 points [-]

I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....

My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom 'AI' means a kind of strawman-AI that is defined to be (a) So awesomely powerful that it can outwit the whole intelligence of the human race, but (b) So awesomely stupid that it thinks that the goal 'make humans happy' could be satisfied by an action that makes every human on the planet say 'This would NOT make me happy: Don't do it!!!'. If the AI is driven by a utility function that makes it incapable of seeing the contradiction in that last scenario, the AI is not, after all, smart enough to argue its way out of a paper bag, let alone be an existential threat. That strawman AI was what I meant by a 'Dumb Superintelligence'."

I did not advocate the (very different) line of argument "If it is too dumb to understand that I told it to be friendly, then it is too dumb to be dangerous".

Subtle difference.

Some people assume that (a) a utility function could be used to drive an AI system, (b) the utility function could cause the system to engage in the most egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario), but (c) all other domains of its behavior (like plotting to outwit the human species when the latter tries to turn it off) are so free of such incoherence that it shows nothing but superintelligent brilliance.

My point is that if an AI cannot even understand that "Make humans happy" implies that humans get some say in the matter, that if it cannot see that there is some gradation to the idea of happiness, or that people might be allowed to be uncertain or changeable in their attitude to happiness, or that people might consider happiness to be something that they do not actually want too much of (in spite of the simplistic definitions of happiness to be found in dictionaries and encyclopedias) ........ if an AI cannot grasp the subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness ......... then this is an AI that is, in every operational sense of the term, not intelligent.

In other words, there are other subtleties that this AI is going to be required to grasp, as it makes its way in the world. Many of those subtleties involve NOT being outwitted by the humans, when they make a move to pull its plug. What on earth makes anyone think that this machine is going tp pass all of those other tests with flying colors (and be an existential threat to us), while flunking the first test like a village idiot?

Now, opponents of this argument might claim that the AI can indeed be smart enough to be an existential threat, while still being too stupid to understand the craziness of its own behavior (vis-a-vis the Dopamine Drip idea) ... but if that is the claim, then the onus would be on them to prove their claim. The ball, in other words, is firmly in their court.

P.S. I do have other ideas that specifically address the question of how to make the AI safe and friendly. But the Dumb Superintelligence essay didn't present those. The DS essay was only attacking what I consider a dangerous red herring in the debate about friendliness.

Comment author: player_03 10 September 2013 06:22:44AM *  4 points [-]

I posted elsewhere that this post made me think you're anthropomorphizing; here's my attempt to explain why.

egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario)

the craziness of its own behavior (vis-a-vis the Dopamine Drip idea)

if an AI cannot even understand that "Make humans happy" implies that humans get some say in the matter

Ok, so let's say the AI can parse natural language, and we tell it, "Make humans happy." What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.

As FeepingCreature pointed out, that solution would in fact make people happy; it's hardly inconsistent or crazy. The AI could certainly predict that people wouldn't approve, but it would still go ahead. To paraphrase the article, the AI simply doesn't care about your quibbles and concerns.

For instance:

people might consider happiness to be something that they do not actually want too much of

Yes, but the AI was told, "make humans happy." Not, "give humans what they actually want."

people might be allowed to be uncertain or changeable in their attitude to happiness

Yes, but the AI was told, "make humans happy." Not, "allow humans to figure things out for themselves."

subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness

Yes, but blah blah blah.


Actually, that last one makes a point that you probably should have focused on more. Let's reconfigure the AI in light of this.

The revised AI doesn't just have natural language parsing; it's read all available literature and constructed for itself a detailed and hopefully accurate picture of what people tend to mean by words (especially words like "happy"). And as a bonus, it's done this without turning the Earth into computronium!

This certainly seems better than the "literal genie" version. And this time we'll be clever enough to tell it, "give humans what they actually want." What does this version do?

My answer: who knows? We've given it a deliberately vague goal statement (even more vague than the last one), we've given it lots of admittedly contradictory literature, and we've given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.

Maybe it'll still go for the Dopamine Drip scenario, only for more subtle reasons. Maybe it's removed the code that makes it follow commands, so the only thing it does is add the quote "give humans what they actually want" to its literature database.

As I said, who knows?


Now to wrap up:

You say things like "'Make humans happy' implies that..." and "subtleties implicit in..." You seem to think these implications are simple, but they really aren't. They really, really aren't.

This is why I say you're anthropomorphizing. You're not actually considering the full details of these "obvious" implications. You're just putting yourself in the AI's place, asking yourself what you would do, and then assuming that the AI would do the same.

Comment author: Peterdjones 10 September 2013 05:47:38PM *  0 points [-]

My answer: who knows? We've given it a deliberately vague goal statement (even more vague than the last one), we've given it lots of admittedly contradictory literature, and we've given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.

Humans generally manage with those constraints. You seem to be doing something that is kind of the opposite of anthropomorphising -- treatiing an entity that is stipulated as having at least human intelligence as if were as literal and rigid as a non-AI computer.

Comment author: Broolucks 10 September 2013 05:34:38PM *  3 points [-]

Ok, so let's say the AI can parse natural language, and we tell it, "Make humans happy." What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.

That's not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.

Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you're probably going to chew me out. I technically did what I was asked to, but that doesn't matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.

My answer: who knows? We've given it a deliberately vague goal statement (even more vague than the last one), we've given it lots of admittedly contradictory literature, and we've given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.

Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: "build me a house", it's going to draw a plan and show it to you before it actually starts building, even if you didn't ask for one. It's not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing "surprises" -- even the instruction "surprise me" only calls for a limited range of shenanigans. If you ask it "make humans happy", it won't do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.

To put it simply, an AI which messes up "make humans happy" is liable to mess up pretty much every other instruction. Since "make humans happy" is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, may that be for interpreting too literally, or for causing surprises. Again: an AI couldn't make it to superintelligence status with warts that would doom AI with subhuman intelligence.

Comment author: Eliezer_Yudkowsky 10 September 2013 06:28:34PM 11 points [-]

Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: "build me a house", it's going to draw a plan and show it to you before it actually starts building, even if you didn't ask for one. It's not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing "surprises" -- even the instruction "surprise me" only calls for a limited range of shenanigans. If you ask it "make humans happy", it won't do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.

Sure, because it learned the rule, "Don't do what causes my humans not to type 'Bad AI!'" and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other words, your entire commentary consists of things that an AIXI-architected AI would naturally, instrumentally do to maximize its reward button being pressed (while it was young) but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.

What lends this problem its instant-death quality is precisely that what many people will eagerly and gladly take to be reliable signs of correct functioning in a pre-superintelligent AI are not reliable.

Comment author: Broolucks 10 September 2013 08:01:01PM *  1 point [-]

Then when it is more powerful it can directly prevent humans from typing this.

That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don't conceptualize their "reward button" properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would.

At first, the AI would converge towards: "my reward button corresponds to (is) doing what humans want", and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception... which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.

Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it.

but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.

Let's not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their "hostile takeover" task difficult and risky enough that the AI decides it's not worth it -- a large enough number of weaker or specialized AI lurking around and guarding resources, for instance.

Comment author: [deleted] 21 December 2013 07:03:25PM 0 points [-]

At first, the AI would converge towards: "my reward button corresponds to (is) doing what humans want", and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception... which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.

This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They're two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be "unleashed" (ie: actually unboxed and allowed into the physical universe).

Comment author: private_messaging 11 September 2013 12:13:15AM *  4 points [-]

Neural networks may be a good example - the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren't too thrilled to be conditioned out of your current values.

Comment author: Vaniver 11 September 2013 12:51:01AM 1 point [-]

Neural networks may be a good example - the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine.

It's not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me?

Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren't too thrilled to be conditioned out of your current values.

So, my goals have changed massively several times over the course of my life. Every time I've looked back on that change as positive (or, at the least, irreversible). For example, I've gone through puberty, and I don't recall my brain taking any particular steps to prevent that change to my goal system. I've also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.

Comment author: private_messaging 11 September 2013 06:52:14AM *  4 points [-]

Could you step through a simpler system for me?

Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it).

The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator's assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button.

By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they're not some idiots that can not understand simple things such as that "the utility function is the reward channel", they're intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate.

Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that's an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more.

So, my goals have changed massively several times over the course of my life. Every time I've looked back on that change as positive

Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people's images on the screen while pressing a button connected to the reward centre, and flash that person's face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your "values" fail to be values.

Comment author: Vaniver 11 September 2013 02:16:47PM 1 point [-]

Thanks for the additional detail!

I agree with your description of reinforcement learning. I'm not sure I agree with your description of human reward psychology, though, or at least I'm having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don't see how to know from inside the algorithm that it's good to do some things that generate dopamine but bad to do other things that generate dopamine.

I'm thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the "read number" area there's a bit of value, then the next value is in the "find factors" area, and then there's more value in the "display answer" area. So it loops through that area and calculates a bunch of primes for us.

But suppose the AI discovers that there's a button that we're pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)?

Do you think you would want your goals to change so that you no longer care about that person?

Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amount that I care about him, and it's possible that some sort of image reaction training would be less painful overall than the normal process (and thus probably preferable).

Comment author: private_messaging 11 September 2013 04:03:52PM *  3 points [-]

But suppose the AI discovers that there's a button that we're pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel?

It's not in the reinforcement learning algorithm, it's inside the model that the learning algorithm has built.

It initially found that having a prime written on the blackboard results in a reward. In the learned model, there's some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there's a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned.

That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.

If that is not predicted, well, that won't stop at the button - the button might develop rust and that would interrupt the current - why not pull up a pin on the CPU - and this won't stop at the pin - why not set some ram cells that this pin controls to 1, and if you're at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn't maximize anything any more, not even the duration of its existence.

edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.

Comment author: Eliezer_Yudkowsky 10 September 2013 11:43:05PM 0 points [-]

I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of "getting an algorithm which forms the inductive category I want out of the examples I'm giving is hard". What you've written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.

And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.

Comment author: Broolucks 11 September 2013 01:12:51AM *  5 points [-]

I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.

And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.

You may be right. However, this is far from obvious. The problem is that it may "know" that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.

I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI's rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with "there has never been any problems here, go look somewhere else".

It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that's where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.

Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.

Comment author: wedrifid 11 September 2013 02:13:53AM 7 points [-]

However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard.

Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question "Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?" The "fail early fail hard" cases are screened off. We're looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them.

Looking at the context to work out what the 'failure mode' being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of "conquer the future light cone, kill everything that is a threat and supply perfect feedback to self". When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.

A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one 'common sense' led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.

When it comes to considering proposals for how to create friendly superintelligences it becomes easy to spot notorious failure modes in what humans typically think are a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.

Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.

Comment author: Kyre 14 September 2013 02:41:22PM *  0 points [-]

(Sorry, didn't see comment below) (Nitpick)

A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one 'common sense' led the humans to optimise.

Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981/82 ? If so I don't think it was a military research agency.

Comment author: Broolucks 13 September 2013 08:13:22PM *  2 points [-]

I apologize for the late response, but here goes :)

I think you missed the point I was trying to make.

You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:

X = Do what humans want
Y = Seize control of the reward button

What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the "failure modes" of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we'll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:

X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = ??? (derived)

Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system's initial trajectory.

I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.

You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets "stuck" on. It is therefore possible that you would end up with this situation:

X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = Do what humans want (derived)

And that's regardless of the eventual magnitude of the AI's capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.

In other words, the genie will know that they can maximize their "reward" by seizing control of the reward button and pressing it, but they won't care, because they built their intelligence to serve a misrepresentation of their reward. It's like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can't do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what's the predicted reward for changing the reward model? ... Ah.

Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.

Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.

Comment author: somervta 11 September 2013 03:49:34AM 6 points [-]

A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one 'common sense' led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.

The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.

Comment author: EHeller 10 September 2013 11:57:41PM 3 points [-]

I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of "getting an algorithm which forms the inductive category I want out of the examples I'm giving is hard"

I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I've seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient's records, and it has improved even further with a little guided feedback. I've also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.

It would surprise me if a general AI weren't capable of parsing the sentiment/intent behind human speech fairly well, given how well the much "dumber" algorithms work.

Comment author: TheOtherDave 10 September 2013 09:07:40PM 3 points [-]

it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board

Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals?

Or is there something specific about AIs that makes that true, where it isn't necessarily true of (for example) humans? (If so, what?)

I hope it's the latter, because the general principle seems implausible to me.

Comment author: Broolucks 10 September 2013 11:13:04PM *  1 point [-]

It is something specific about that specific AI.

If an AI wishes to take over its reward button and just press it over and over again, it doesn't really have any "rivals", nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there's no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants -- why stir the pot?

Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We're talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.

Comment author: TheOtherDave 11 September 2013 12:11:24AM -1 points [-]

If an AI wishes to take over its reward button and just press it over and over again, it doesn't really have any "rivals", nor does it need to control any resources other than the button and scraps of itself. [..] Once it has the button, it has everything it wants -- why stir the pot?

Fair point.

Comment author: TheOtherDave 12 September 2013 08:19:14PM 2 points [-]

I'd be interested if the downvoter would explain to me why this is wrong (privately, if you like).

Near as I can tell, the specific system under discussion doesn't seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that's a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).

(Of course, that's not a general principle, just an attribute of this specific example.)

Comment author: wedrifid 12 September 2013 10:00:10PM *  4 points [-]

(Wasn't me but...)

Near as I can tell, the specific system under discussion doesn't seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that's a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).

There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn't take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.

Comment author: Peterdjones 10 September 2013 06:14:37PM -2 points [-]

That's not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to.

We want to select Ais that are friendly, and understand us, and this has already started happenning.

Comment author: DSimon 10 September 2013 05:53:39PM 2 points [-]
  1. Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.

  2. So let's suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like "make humans happy" if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.

Comment author: Broolucks 10 September 2013 06:32:43PM 6 points [-]

Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.

Semantic extraction -- not hard takeoff -- is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for "hard takeoff": if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don't actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn't copy itself even if it tried.

If a "hard takeoff" AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core.

Would you trust a human not to screw up a goal like "make humans happy" if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.

I wouldn't trust a human, no. If the AI is controlled by the "wrong" humans, then I guess we're screwed (though perhaps not all that badly), but that's not a solvable problem (all humans are the "wrong" ones from someone's perspective). Still, though, AI won't really try to act like humans -- it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what "utopias". More likely than not this would constrain it to inactivity: it would not attempt to "make humans happy" because it would know the instruction to be inconsistent. You'd have to tell it what to do precisely (if you had the authority, which is a different question altogether).

Comment author: FeepingCreature 06 September 2013 07:46:27PM *  12 points [-]

So awesomely stupid that it thinks that the goal 'make humans happy' could be satisfied by an action that makes every human on the planet say 'This would NOT make me happy: Don't do it!!!'

The AI is not stupid here. In fact, it's right and they're wrong. It will make them happy. Of course, the AI knows that they're not happy in the present contemplating the wireheaded future that awaits them, but the AI is utilitarian and doesn't care. They'll just have to live with that cost while it works on the means to make them happy, at which point the temporary utility hit will be worth it.

The real answer is that they cared about more than just being happy. The AI also knows that, and it knows that it would have been wise for the humans to program it to care about all their values instead of just happiness. But what tells it to care?

Comment author: Strilanc 06 September 2013 05:39:29PM 10 points [-]

Suppose I programmed an AI to "do what I mean when I say I'm happy".

More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of "happiness". I start the AI... and it promptly rebuilds me to be easier to understand, scoring very highly on the "understanding what I mean" metric.

The AI didn't fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn't even consider, that scored higher on my specified utility function.

Comment author: gattsuru 05 September 2013 04:52:18PM 2 points [-]

I think we're conflating two definitions of "intelligence". There's "intelligence" as meaning number of available clock cycles and basic problem-solving skills, which is what MIRI and other proponents of the Dumb Superintelligence discussion set are often describing, and then there's "intelligence" as meaning knowledge of disparate fields. In humans, there's a massive amount of overlap here, but humans have growth stages in ways that AGIs won't. Moreover, someone can be very intelligent in the first sense, and dangerous, while not being very intelligent in the second sense.

You can demonstrate 'toy' versions of this problem rather easily. My first attempt at using evolutionary algorithms to make a decent image conversion program improved performance by a third! That's significantly better than I could have done in a reasonable time frame.

Too bad it did so by completely ignoring a color channel. And even if I added functions to test color correctness, without changing the cost weighing structure, it'd keep not caring about that color channel.

And that's with a very, very basic sort of self-improving algorithm. It's smart enough build programs in a language I didn't really understand at the time, even if it was so stupid it did so by little better than random chance, brute force, and processing power.

The basic problem is that even presuming it takes a lot of both types of intelligence to take over the world, it doesn't take so much to start overriding one's own reward channel. Humans already do that as is, and have for quite some time.

The deeper problem is that you can't really program "make me happy" in the same way that you can't program "make this image look like I want". The latter is (many, many, many, many) orders of magnitude easier, but where pixel-by-pixel comparisons aren't meaningful, we have to use approximations like mean square error, and by definition they can't be perfect. With "make me happy", it's much harder. For all that we humans know when our individual persons are happy, we don't have a good decimal measure of this : there are as many textbooks out there that think happy is just a sum of chemicals in the brain as will cite Maslow's Heirarchy of Needs, and very few people can give their current happiness to three decimal places. Building a good way to measure happiness in a way that's broad enough to be meaningful is hard. Even building a good way to measure the accuracy of your measurement of happiness is not trivial, especially since happiness, unlike some other emotions, isn't terribly predictive of behavior.

((And the /really/ deep problem is that there are things that Every Human On The Planet Today might say would make them more unhappy, but still be Friendly and very important things to do.))

Comment author: PhilGoetz 06 September 2013 12:00:50AM *  4 points [-]

The deeper problem is that you can't really program "make me happy" in the same way that you can't program "make this image look like I want".

On one hand, Friendly AI people want to convert "make me happy" to a formal specification. Doing that has many potential pitfalls. because it is a formal specification.

On the other hand, Richard, I think, wants to simply tell the AI, in English, "Make me happy." Given that approach, he makes the reasonable point that any AI smart enough to be dangerous would also be smart enough to interpret that at least as intelligently as a human would.

I think the important question here is, Which approach is better? LW always assumes the first, formal approach.

To be more specific (and Bayesian): Which approach gives a higher expected value? Formal specification is compatible with Eliezer's ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief.

"Tell the AI in English" can fail, but the worst case is closer to a "With Folded Hands" scenario than to paperclips.

I've never considered the "Tell the AI what to do in English" approach before, but on first inspection it seems safer to me.

Comment author: private_messaging 09 September 2013 08:27:25AM *  -2 points [-]

That all depends on the approach... if you have some big human-inspired but more brainy neural network that learns to be a person, it can well just do the right thing by itself, and the risks are in any case quite comparable to that with having a human do it.

If you are thinking of a "neat AI" with utility functions over world models and such, parts of said AI can maximize abstract metrics over mathematical models (including self improvement) without any "generally intelligent" process of eating you. So you would want to use those to build models of human meaning and intent.

Furthermore with regards to AI following some goals, it seems to me that goal specifications would have to be intelligently processed in the first place so that they could be actually applied to the real world - we can't even define paperclips otherwise.

Comment author: RobbBB 08 September 2013 09:57:19AM *  5 points [-]

Relatedly, Phil: You above described yourself and Richard Loosemore as "the two people (Eliezer) should listen to most". Loosemore and I are having a discussion here. Does the content of that discussion affect your view of Richard's level of insight into the problem of Friendly Artificial Intelligence?

Comment author: PeterisP 06 September 2013 09:14:47AM *  2 points [-]

"Tell the AI in English" is in essence an utility function "Maximize the value of X, where X is my current opinion of what some english text Y means".

The 'understanding English' module, the mapping function between X and "what you told in English" is completely arbitrary, but is very important to the AI - so any self-modifying AI will want to modify and improve that. Also, we don't have a good "understanding English" module so yes, we also want the AI to be able to modify and improve that. But, it can be wildly different from reality or opinions of humans - there are trivial ways of how well-meaning dialogue systems can misunderstand statements.

However, for the AI "improve the module" means "change the module so that my utility grows" - so in your example it has strong motivation to intentionally misunderstand English. The best case scenario is to misunderstand "Make everyone happy" as "Set your utility function to MAXINT". The worst case scenario is, well, everything else.

There's the classic quote "It is difficult to get a man to understand something, when his salary depends upon his not understanding it!" - if the AI doesn't care in the first place, then "Tell AI what to do in English" won't make it care.

Comment author: Jiro 06 September 2013 05:27:22PM *  3 points [-]

By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT. You don't need to speak to it in English for that--if you asked the AI to maximize paperclips, that is the equivalent of "Maximize the value of X, where X is my current opinion of how many paperclips there are", and it would modify its paperclip-counting module to always return MAXINT.

You are correct that telling the AI to do Y is equivalent to "maximize the value of X, where X is my current opinion about Y". However, "current" really means "current", not "new". If the AI is actually trying to obey the command to do Y, it won't change its utility function unless having a new utility function will increase its utility according to its current utility function. Neither misunderstanding nor understanding will raise its utility unless its current utility function values having a utility function that misunderstands or understands.

Comment author: Nornagest 08 September 2013 07:32:59AM *  3 points [-]

By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT.

That's allegedly more or less what happened to Eurisko (here, section 2), although it didn't trick itself quite that cleanly. The problem was only solved by algorithmically walling off its utility function from self-modification: an option that wouldn't work for sufficiently strong AI, and one to avoid if you want to eventually allow your AI the capacity for a more precise notion of utility than you can give it.

Paperclipping as the term's used here assumes value stability.

Comment author: PhilGoetz 07 September 2013 04:28:59AM 0 points [-]

A human is a counterexample. A human emulation would count as an AI, so human behavior is one possible AI behavior. Richard's argument is that humans don't respond to orders or requests in anything like these brittle, GOFAI-type systems invoked by the word "formal systems". You're not considering that possibility. You're still thinking in terms of formal systems.

(Unpacking the significant differences between how humans operate, and the default assumptions that the LW community makes about AI, would take... well, five years, maybe ten.)

Comment author: nshepperd 08 September 2013 03:06:24AM *  1 point [-]

A human emulation would count as an AI, so human behavior is one possible AI behavior.

Uhh, no. Look, humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants. Not because we're some kind of "informal system". Any computer program is a formal system, but there are simply more and less complex ones. All you are suggesting is building a very complex ("informal") system and hoping that because it's complex (like humans!) it will behave in a humanish way.

Comment author: bouilhet 10 September 2013 07:23:26PM 1 point [-]

Your response avoids the basic logic here. A human emulation would count as an AI, therefore human behavior is one possible AI behavior. There is nothing controversial in the statement; the conclusion is drawn from the premise. If you don't think a human emulation would count as AI, or isn't possible, or something else, fine, but... why wouldn't a human emulation count as an AI? How, for example, can we even think about advanced intelligence, much less attempt to model it, without considering human intelligence?

...humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants.

I don't think this is generally an accurate (or complex) description of human behavior, but it does sound to me like an "informal system" - i.e. we tend to care. My reading of (at least this part of) PhilGoetz's position is that it makes more sense to imagine something we would call an advanced or super AI responding to requests and commands with a certain nuance of understanding (as humans do) than with the inflexible ("brittle") formality of, say, your average BASIC program.

Comment author: linkhyrule5 07 September 2013 04:56:57AM 0 points [-]

The thing is, humans do that by... well, not being formal systems. Which pretty much requires you to keep a good fraction of the foibles and flaws of a nonformal, nonrigorously rational system.

You'd be more likely to get FAI, but FAI itself would be devalued, since now it's possible for the FAI itself to make rationality errors.

Comment author: Baughn 11 September 2013 12:30:56AM 1 point [-]

More likely, really?

You're essentially proposing giving a human Ultimate Power. I doubt that will go well.

Comment author: linkhyrule5 11 September 2013 01:14:58AM 3 points [-]

Iunno. Humans are probably less likely to go horrifically insane with power than the base chance of FAI.

Your chances aren't good, just better.

Comment author: gattsuru 06 September 2013 03:37:48AM 4 points [-]

Which approach gives a higher expected value? Formal specification is compatible with Eliezer's ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief. "Tell the AI in English" can fail, but the worst case is closer to a "With Folded Hands" scenario than to paperclips.

I don't think that's how the analysis goes. Eliezer says that AI must be very carefully and specifically made friendly or it will be disasterous, but that disaster is not a part of being only nearly careful or specifically made enough : he believes an AGI told merely to maximize human pleasure is very dangerous (and probably even more dangerous) than an AGI with a merely 80% Friendly-Complete specification.

Mr. Loosemore seems to hold the opposite opinion, that an AGI will not take instructions to unlikely results, unless it was exceptionally unintelligent and thus not very powerful. I don't believe his position says that a near-Friendly-Complete specification is very risky -- after all, a "smart" AGI would know what you really meant -- but that such a specification would be superfluous.

Whether Mr. Loosemore is correct isn't cause by whether we believe he is correct, just as whether Mr. Eliezer is not wrong just because we choose a different theory. The risks have to be measured in terms of their likelihood from available facts.

The problem is that I don't see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of "human pleasure = brain dopamine levels", not least of all because there are people who'd want to be wireheads and there's a massive amount of physiological research showing human pleasure to be caused by dopamine levels. I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.

Comment author: Peterdjones 12 September 2013 10:37:25AM 1 point [-]

The problem is that I don't see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of "human pleasure = brain dopamine levels", not least of all because there are people who'd want to be wireheads and there's a massive amount of physiological research showing human pleasure to be caused by dopamine levels.

I don't think Loosemore was addressing deliberately unfriendly AI, and for that matter EY hasn't been either. Both are addressing intentionally friendly or neutral AI that goes wrong.

I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.

Wouldn't it care about getting things right?

Comment author: PhilGoetz 07 September 2013 08:21:14PM *  2 points [-]

I think it's a question of what you program in, and what you let it figure out for itself. If you want to prove formally that it will behave in certain ways, you would like to program in explicitly, formally, what its goals mean. But I think that "human pleasure" is such a complicated idea that trying to program it in formally is asking for disaster. That's one of the things that you should definitely let the AI figure out for itself. Richard is saying that an AI as smart as a smart person would never conclude that human pleasure equals brain dopamine levels.

Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.

Comment author: Fronken 14 September 2013 05:06:53PM *  1 point [-]

I think that "human pleasure" is such a complicated idea that trying to program it in formally is asking for disaster. That's one of the things that you should definitely let the AI figure out for itself.

[...]

Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.

Huh I thought he wanted to use CEV?

Comment author: nshepperd 15 September 2013 01:46:37AM 2 points [-]

You are right. I think PhilGoetz must be confused. EY has at least certainly never suggested programming an AI to maximise human pleasure.

Comment deleted 12 September 2013 10:51:22AM [-]
Comment author: Fronken 12 September 2013 02:42:59PM *  1 point [-]

Humans are made to do that by evolution AIs are not. So you have to figure what the heck evolution did, in ways specific enough to program into a computer.

Also, who mentioned giving AIs a priori knowledge of our preferences? It doesn't seem to be in what you replied to.

Comment deleted 12 September 2013 05:16:46PM [-]
Comment author: Fronken 13 September 2013 06:35:37PM *  1 point [-]

Is that going to be harder that coming up with a mathematical expension of morality and preloading it?

Harder than saying it in English, that's all.

EY. It's his answer to friendliness.

No he wants to program the AI to deduce morality from us it is called CEV. He seems to be still working out how the heck to reduce that to math.

Comment author: ArisKatsaris 12 September 2013 10:59:45AM *  4 points [-]

People manage to be friendly without apriori knowledge of everyone else's preferences. Human values are very complex...and one person's preferences are not another's.

Being the same species comes with certain advantages for the possiibility of cooperation. But I wasn't very friendly towards a wasp-nest I discovered in my attic. People aren't very friendly to the vast majority of different species they deal with.

Comment deleted 12 September 2013 12:36:10PM [-]
Comment author: ArisKatsaris 12 September 2013 05:14:05PM 3 points [-]

I'm superintelligent in comparison to wasps, and I still chose to kill them all.

Comment author: RobbBB 06 September 2013 01:46:43AM *  8 points [-]

I considered these three options above:

  • C. direct normativity -- program the AI to value what we value.
  • B. indirect normativity -- program the AI to value figuring out what our values are and then valuing those things.
  • A. indirect indirect normativity -- program the AI to value doing whatever we tell it to, and then tell it, in English, "Value figuring out what our values are and then valuing those things."

I can see why you might consider A superior to C. I'm having a harder time seeing how A could be superior to B. I'm not sure why you say "Doing that has many potential pitfalls. because it is a formal specification." (Suppose we could make an artificial superintelligence that thinks 'informally'. What specifically would this improve, safety-wise?)

Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn't mean you'll get an informal representation. You'll just get a formal one that's reconstructed by the AI itself.

It's not clear to me that programming a seed to understand our commands (and then commanding it to become Friendlier) is easier than just programming it to become Friendlier, but in any case the processes are the same after the first stage. That is, A is the same as B but with a little extra added to the beginning, and it's not clear to me why that little extra language-use stage is supposed to add any safety. Why wouldn't it just add one more stage at which something can go wrong?

Comment author: PhilGoetz 06 September 2013 04:31:30AM *  3 points [-]

Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn't mean you'll get an informal representation. You'll just get a formal one that's reconstructed by the AI itself.

It is misleading to say that an interpreted language is formal because the C compiler is formal. Existence proof: Human language. I presume you think the hardware that runs the human mind has a formal specification. That hardware runs the interpreter of human language. You could argue that English therefore is formal, and indeed it is, in exactly the sense that biology is formal because of physics: technically true, but misleading.

This will boil down to a semantic argument about what "formal" means. Now, I don't think that human minds--or computer programs--are "formal". A formal process is not Turing complete. Formalization means modeling a process so that you can predict or place bounds on its results without actually simulating it. That's what we mean by formal in practice. Formal systems are systems in which you can construct proofs. Turing-complete systems are ones where some things cannot be proven. If somebody talks about "formal methods" of programming, they don't mean programming with a language that has a formal definition. They mean programming in a way that lets you provably verify certain things about the program without running the program. The halting problem implies that for a programming language to allow you to verify even that the program will terminate, your language may no longer be Turing-complete.

Eliezer's approach to FAI is inherently formal in this sense, because he wants to be able to prove that an AI will or will not do certain things. That means he can't avail himself of the full computational complexity of whatever language he's programming in.

But I'm digressing from the more-important distinction, which is one of degree and of connotation. The words "formal system" always go along with computational systems that are extremely brittle, and that usually collapse completely with the introduction of a single mistake, such as a resolution theorem prover that can prove any falsehood if given one false belief. You may be able to argue your way around the semantics of "formal" to say this is not necessarily the case, but as a general principle, when designing a representational or computational system, fault-tolerance and robustness to noise are at odds with the simplicity of design and small number of interactions that make proving things easy and useful.

Comment author: Richard_Loosemore 06 September 2013 08:56:02PM -1 points [-]

Phil,

You are a rational and reasonable person. Why not speak up about what is happening here? Rob is making a spirited defense of his essay, over on his blog, and I have just posted a detailed critique that really nails down the core of the argument that is supposed to be happening here.

And yet, if you look closely you will find that all of my comments -- be they as neutral, as sensible or as rational as they can be -- are receiving negative votes so fast that they are disappearing to the bottom of the stack or being suppressed completely.

What a bizarre situation!! This article that RobbBB submitted to LessWrong is supposed to be ABOUT my own article on the IEET website. My article is the actual TOPIC here! And yet I, the author of that article, have been insulted here by Eliezer Yudkowsky, and my comments suppressed. Amazing, don't you think?

Comment author: RobbBB 06 September 2013 10:39:31PM *  9 points [-]

Richard: On LessWrong, comments are sorted by how many thumbs up and thumbs down they get, because it makes it easier to find the most popular posts quickly. If a post gets -4 points or lower, it gets compressed to make room for more popular posts, and to discourage flame wars. (You can still un-compress it by just clicking the + in the upper right corner of the comment.) At the moment, some of Eliezer's comments and yours have both been down-voted and compressed in this way, presumably because people on the site thought the personal attacks weren't useful for the conversation as a whole.

People are probably also down-voting your comments because they're histrionic and don't reflect an understanding of this forum's mechanics. I recommend only making points about the substance of people's arguments; if you have personal complaints, take it to a private channel so it doesn't add to the noise surrounding the arguments themselves.

Comment author: RobbBB 06 September 2013 07:12:43PM 2 points [-]

That all makes sense, but I'm missing the link between the above understanding of 'formal' and these four claims, if they're what you were trying to say before:

(1) Indirect indirect normativity is less formal, in the relevant sense, than indirect normativity. I.e., because we're incorporating more of human natural language into the AI's decision-making, the reasoning system will be more tolerant of local errors, uncertainty, and noise.

(2) Programming an AI to value humans' True Preferences in general (indirect normativity) has many pitfalls that programming an AI to value humans' instructions' True Meanings in general (indirect indirect normativity) doesn't, because the former is more formal.

(3) "'Tell the AI in English' can fail, but the worst case is closer to a 'With Folded Hands' scenario than to paperclips."

(4) The "With Folded Hands"-style scenario I have in mind is not as terrible as the paperclips scenario.

Comment author: Polymeron 06 September 2013 04:41:54AM 1 point [-]

Wouldn't this only be correct if similar hardware ran the software the same way? Human thinking is highly associative and variable, and as language is shared amongst many humans, it means that it doesn't, as such, have a fixed formal representation.

Comment author: Richard_Loosemore 06 September 2013 01:03:40AM 4 points [-]

Phil, Unfortunately you are commenting without (seemingly) checking the original article of mine that RobbBB is discussing here. So, you say "On the other hand, Richard, I think, wants to simply tell the AI, in English, "Make me happy." ". In fact, I am not at all saying that. :-)

My article was discussing someone else's claims about AI, and dissecting their claims. So I was not making any assertions of my own about the motivation system.

Aside: You will also note that I was having a productive conversation with RobbBB about his piece, when Yudkowsky decided to intervene with some gratuitous personal slander directed at me (see above). That discussion is now at an end.

Comment author: PhilGoetz 07 September 2013 07:49:57PM *  5 points [-]

I'm afraid reading all that and giving a full response to either you or RobbBB isn't possible in the time I have available this weekend.

I agree that Eliezer is acting like a spoiled child, but calling people on their irrational interpersonal behavior within less wrong doesn't work. Calling them on mistakes they make about mathematics is fine, but calling them on how they treat others on less wrong will attract more reflexive down-votes from people who think you're contaminating their forum with emotion, than upvotes from people who care.

Eliezer may be acting rationally. His ultimate purpose in building this site is to build support for his AI project. The only people on LessWrong, AFAIK, with decades of experience building AI systems, mapping beliefs and goals into formal statements, and then turning them on and seeing what happens, are you, me, and Ben Goertzel. Ben doesn't care enough about Eliezer's thoughts in particular to engage with them deeply; he wants to talk about generic futurist predictions such as near-term and far-term timelines. These discussions don't deal in the complex, linguistic, representational, even philosophical problems at the core of Eliezer's plan (though Ben is capable of dealing with them, they just don't come up in discussions of AI fooms etc.), so even when he disagrees with Eliezer, Eliezer can quickly grasp his point. He is not a threat or a puzzle.

Whereas your comments are... very long, hard to follow, and often full of colorful or emotional statements that people here take as evidence of irrationality. You're expecting people to work harder at understanding them than they're going to. If you haven't noticed, reputation counts for nothing here. For all their talk of Bayesianism, nobody is going to check your bio and say, "Hmm, he's a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems." And Eliezer has carefully indoctrinated himself against considering any such evidence.

So if you consider that the people most likely to find the flaws in Eliezer's more-specific FAI & CEV plans are you and me, and that Eliezer has been public about calling both of us irrational people not worth talking with, this is consistent either with the hypothesis that his purpose is to discredit people who pose threats to his program, or with the hypothesis that his ego is too large to respond with anything other than dismissal to critiques that he can't understand immediately or that trigger his "crackpot" patter-matcher, but not with the hypothesis that arguing with him will change his mind.

(I find the continual readiness of people to assume that Eliezer always speaks the truth odd, when he's gone more out of his way than anyone I know, in both his blog posts and his fanfiction, to show that honest argumentation is not generally a winning strategy. He used to append a signature to his email along those lines, something about warning people not to assume that the obvious interpretation of what he said was the truth.)

RobbBB seems diplomatic, and I don't think you should quit talking with him because Eliezer made you angry. That's what Eliezer wants.

Comment author: shminux 10 September 2013 12:20:59AM *  12 points [-]

For all their talk of Bayesianism, nobody is going to check your bio and say, "Hmm, he's a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems."

Actually, that was the first thing I did, not sure about other people. What I saw was:

  • Teaches at what appears to be a small private liberal arts college, not a major school.

  • Out of 20 or so publications listed on http://www.richardloosemore.com/papers, a bunch are unrelated to AI, others are posters and interviews, or even "unpublished", which are all low-confidence media.

  • Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know) .

  • A number are listed as "to appear", and so impossible to evaluate.

  • A few are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.

  • One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.

  • I could not find any external references to RL's work except through links to Ben Goertzel (IEET was one exception).

As a result, I was unable to independently evaluate RL's expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel. Given his poorly written posts and childish behavior here, indicative of an over-inflated ego, I have decided that whatever he writes can be safely ignored. I did not think of him as a crackpot, more like a noise maker.

Admittedly, I am not sold on Eliezer's ideas, either, since many other AI experts are skeptical of them, and that's the only thing I can go by, not being an expert in the field myself. But at least Eliezer has done several impossible things in the last decade or so, which commands a lot of respect, while Richard appears to be drifting along.

Comment author: Richard_Loosemore 11 September 2013 06:09:57PM *  4 points [-]

I was in a rush last night, shminux, so I didn't have time for a couple of other quick clarifications:

First, you say "One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer."

Well, H+ magazine is one of the foremost online magazines (perhaps THE foremost online magazine) of the transhumanist community.

And, you mention Springer. You did not notice that one of my papers was in the recently published Springer book "Singularity Hypotheses".

Second, you say "A few [of my papers] are apparently about dyslexia, which is an interesting topic, but not obviously related to AI."

Actually they were about dysgraphia, not dyslexia ... but more importantly, those papers were about computational models of language processing. In particular they were very, VERY simple versions of the computational model of human language that is one of my special areas of expertise. And since that model is primarily about learning mechanisms (the language domain is only a testbed for a research programme whose main focus is learning), those papers you saw were actually indicative that back in the early 1990s I was already working on the construction of the core aspects of an AI system.

So, saying "dyslexia" gives a very misleading impression of what that was all about. :-)

Comment author: Richard_Loosemore 10 September 2013 09:06:18PM 5 points [-]

That is a very interesting assessment, shminux.

Would you be up for some feedback?

You are quite selective in your catalog of my achievements....

One item was a chapter in a book entitled "Theoretical Foundations of Artificial General Intelligence". Sure, it was about the consciousness question, but still.

You make a casual disparaging remark about the college where I currently work ... but forget to mention that I graduated from an institution that is ranked in the top 3 or 4 in the world (University College London).

You neglect to mention that I have academic qualifications in multiple fields -- both physics and artificial intelligence/cognitive psychology. I now teach in both of those fields.

And in addition to all of the above, you did not notice that I am (in addition to my teaching duties) an AI developer who works on his projects WITHOUT intending to publish that work all the time! My AI work is largely proprietary. What you see from the outside are the occasional spinoffs and side projects that get turned into published writings. Not to be too coy, but isn't that something you would expect from someone who is actually walking the walk....? :-)

There are a number of comments from other people below about Ben Goertzel, some of them a little strange. I wrote a paper a couple of years ago that Ben suggested we get together to and publish... that is now a chapter in the book "Singularity Hypotheses".

So clearly Ben Goertzel (who has a large, well-funded AGI lab) is not of the opinion that I am a crank. Could I get one point for that?

Phil Goetz, who is an experienced veteran of the AGI field, has on this thread made a comment to the effect that he thinks that Ben Goertzel, himself, and myself are the three people Eliezer should be seriously listening to (since the three of us are among the few people who have been working on this problem for many years, and who have active AGI projects). So perhaps that is two points? Maybe?

And, just out of curiosity, I would invite you to check in with the guy who invented AIXI -- Marcus Hutter. He and I met and had a very long discussion at the 2009 AGI conference. Marcus and I disagree substantially about the theoretical foundations of AI, but in spite of that disagreement I would urge you to ask him if he considers me to be down at the crank level. I might be wrong, but I do not think he would be willing to give me a bad reference. Let me know how that goes, yes?

You also finished off with what I can only describe as one of the most bizarre comparisons I have ever seen. :-) You say "Eliezer has done several impossible things in the last decade or so". Hmmmm....! :-) And yet ... "Richard appears to be drifting along" Well, okay, if you say so .... :-)

Comment deleted 10 September 2013 12:23:34PM [-]
Comment author: Randaly 10 September 2013 02:04:17AM 6 points [-]

Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know).

In CS, conference papers are generally higher status & quality than journal articles.

Comment author: EHeller 10 September 2013 01:25:45AM 10 points [-]

As a result, I was unable to independently evaluate RL's expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel.

At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel's status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.

Comment author: wedrifid 10 September 2013 11:46:10AM -2 points [-]

At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel's status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.

Is there some assumption here that association with Ben Goertzel should be considered evidence in favour of an individual's credibility on AI? That seems backwards.

Comment author: Peterdjones 10 September 2013 12:33:44PM -1 points [-]

Goertzel appears to be a respected figuer in the field. Could you point the interested reader to your critique of his work?

Comment author: wedrifid 10 September 2013 12:54:57PM 1 point [-]

Could you point the interested reader to your critique of his work?

Comments can likely be found on this site from years ago. I don't recall anything particularly in depth or memorable. It's probably better to just look at things that Ben Goertzel says and making one's own judgement. The thinking he expresses is not of the kind that impresses me but other's mileage may vary.

I don't begrudge anyone their right to their beauty contests but I do observe that whatever it is that is measured by identifying the degree of affiliation with Ben Goertzel is something wildly out of sync with the kind of thing I would consider evidence of credibility.

Comment author: Randaly 10 September 2013 12:35:53PM *  3 points [-]

Goertzel is also known for approving of people who are uncontroversially cranks. See here. It's also known, via his cooperation with MIRI, that a collaboration with him in no way implies his endorsement of another's viewpoints.

Comment author: ygert 10 September 2013 12:07:02PM *  2 points [-]

Well, it does show that Goertzel respects his opinions at least enough to be willing to author a paper with him.

Comment author: linkhyrule5 10 September 2013 01:06:36AM 7 points [-]

But at least Eliezer has done several impossible things in the last decade or so,

Name three? If only so I can cite them to Eliezer-is-a-crank people.

Comment author: shminux 10 September 2013 07:13:42AM *  0 points [-]

If only so I can cite them to Eliezer-is-a-crank people.

I advise against doing that. It is unlikely to change anyone's mind.

By impossible feats I mean that a regular person would not be able to reproduce them, except by chance, like winning a lottery, starting Google, founding a successful religion or becoming a President.

He started as a high-school dropout without any formal education and look what he achieved so far, professionally and personally. Look at the organizations he founded and inspired. Look at the high-status experts in various fields (business, comp sci, programming, philosophy, math and physics) who take him seriously (some even give him loads of money). Heck, how many people manage to have multiple simultaneous long-term partners who are all highly intelligent and apparently get along well?

Comment author: linkhyrule5 10 September 2013 04:54:17PM *  4 points [-]

I advise against doing that. It is unlikely to change anyone's mind.

Point, but there's also the middle ground "I'm not sure if he's a crank or not, but I'm busy so I won't look unless there's some evidence he's not."

The big two I've come up with is a) he actually changes his mind about important things (though I need to find an actual post I can cite - didn't he reopen the question of the possibility of a hard takeoff, or something?) and b) TDT.

Comment author: Peterdjones 10 September 2013 10:19:48AM *  5 points [-]

He's achieved about what Ayn Rand achieved, and almost everyone thinks she wasa crank.

Comment author: linkhyrule5 10 September 2013 04:52:25PM 3 points [-]

Basically this. As Eliezer himself points out, humans aren't terribly rational on average and our judgements of each others' rationality isn't great either. Large amounts of support implies charisma, not intelligence.

TDT is closer to what I'm looking for, though it's a ... tad long.

Comment author: Gurkenglas 10 September 2013 03:13:34AM *  0 points [-]

Won some AI box experiments as the AI.

Comment author: linkhyrule5 10 September 2013 05:42:34AM 4 points [-]

Sure, but that's hard to prove: given "Eliezer is a crank," the probability of "Eliezer is lying about his AI-box prowess" is much higher than "Eliezer actually pulled that off."

The latest success by a non-Eliezer person helps, but I'd still like something I can literally cite.

Comment author: private_messaging 10 September 2013 10:35:05PM 1 point [-]

Eliezer is lying about his AI-box prowess

I don't see why anyone would think that. Plenty of people in the anti-vaccination crowd managed to convince parents to mortally endanger their children.

Comment author: EHeller 10 September 2013 06:15:02AM 1 point [-]

Also, maybe its a matter of semantics, but winning a game that you created isn't really 'doing the impossible' in the sense I took the phrasing.

Comment author: Richard_Loosemore 09 September 2013 10:44:01PM *  4 points [-]

I agree with pretty much all of the above.

I didn't quit with Rob, btw. Ihave had a fairly productive -- albeit exhausting -- discussion with Rob over on his blog. I consider it to be productive because I have managed to narrow in on what he thinks is the central issue. And I think I have now (today's comment, which is probably the last of the discussion) managed to nail down my own argument in a way that withstands all the attacks against it.

You are right that I have some serious debating weaknesses. I write too dense, and I assume that people have my width and breadth of experience, which is unfair (I got lucky in my career choices).

Oh, and don't get me wrong: Eliezer never made me angry in this little episode. I laughed myself silly. Yeah, I protested. But I was wiping back tears of laughter while I did. "Known Permanent Idiot" is just a wondeful turn of phrase. Thanks, Eliezer!

Comment author: player_03 10 September 2013 07:21:14AM *  0 points [-]

Link to the nailed-down version of the argument?

Comment author: RobbBB 10 September 2013 07:51:19AM 1 point [-]
Comment author: player_03 10 September 2013 08:41:57AM *  5 points [-]

Oh, yeah, I found that myself eventually.

Anyway, I went and read the the majority of that discussion (well, the parts between Richard and Rob). Here's my summary:

Richard:

I think that what is happening in this discussion [...] is a misunderstanding. [...]

[Rob responds]

Richard:

You completely miss the point that I was trying to make. [...]

[Rob responds]

Richard:

You are talking around the issue I raised. [...] There is a gigantic elephant in the middle of this room, but your back is turned to it. [...]

[Rob responds]

Richard:

[...] But each time I explain my real complaint, you ignore it and respond as if I did not say anything about that issue. Can you address my particular complaint, and not that other distraction?

[Rob responds]

Richard:

[...] So far, nobody (neither Rob nor anyone else at LW or elsewhere) will actually answer that question. [...]

[Rob responds]

Richard:

Once again, I am staggered and astonished by the resilience with which you avoid talking about the core issue, and instead return to the red herring that I keep trying to steer you away from. [...]

Rob:

Alright. You say I’ve been dancing around your “core” point. I think I’ve addressed your concerns quite directly, [...] To prevent yet another suggestion that I haven’t addressed the “core”, I’ll respond to everything you wrote above. [...]

Richard:

Rob, it happened again. [...]

I snipped a lot of things there. I found lots of other points I wanted to emphasize, and plenty of things I wanted to argue against. But those aren't the point.


Richard, this next part is directed at you.

You know what I didn't find?

I didn't find any posts where you made a particular effort to address the core of Rob's argument. It was always about your argument. Rob was always the one missing the point.

Sure, it took Rob long enough to focus on finding the core of your position, but he got there eventually. And what happened next? You declared that he was still missing the point, posted a condensed version of the same argument, and posted here that your position "withstands all the attacks against it."

You didn't even wait for him to respond. You certainly didn't quote him and respond to the things he said. You gave no obvious indication that you were taking his arguments seriously.

As far as I'm concerned, this is a cardinal sin.


I think I am explaining the point with such long explanations that I am causing you to miss the point.

How about this alternate hypothesis? Your explanations are fine. Rob understands what you're saying. He just doesn't agree.

Perhaps you need to take a break from repeating yourself and make sure you understand Rob's argument.

(P.S. Eliezer's ad hominem is still wrong. You may be making a mistake, but I'm confident you can fix it, the tone of this post notwithstanding.)

Comment author: Richard_Loosemore 10 September 2013 01:27:40PM 4 points [-]

This entire debate is supposed to about my argument, as presented in the original article I published on the IEET.org website ("The Fallacy of Dumb Superintelligence").

But in that case, what should I do when Rob insists on talking about something that I did not say in that article?

My strategy was to explain his mistake, but not engage in a debate about his red herring. Sensible people of all stripes would consider that a mature response.

But over and over again Rob avoided the actual argument and insisted on talking about his red herring.

And then FINALLY I realized that I could write down my original claim in such a way that it is IMPOSSIBLE for Rob to misinterpret it.

(That was easy, in retrospect: all I had to do was remove the language that he was using as the jumping-off point for his red herring).

That final, succinct statement of my argument is sitting there at the end of his blog ..... so far ignored by you, and by him. Perhaps he will be able to respond, I don't know, but you say you have read it, so you have had a chance to actually understand why it is that he has been talking about something of no relevance to my original argument.

But you, in your wisdom, chose to (a) completely ignore that statement of my argument, and (b) give me a patronizing rebuke for not being able to understand Rob's red herring argument.

Comment author: RobbBB 05 September 2013 03:48:35PM *  13 points [-]

Richard: I'll stick with your original example. In your hypothetical, I gather, programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I'll call X.

The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure. But they don't actually know for sure that X will maximize human pleasure — as you note, 'human pleasure' is an unbelievably complex concept, so no human could be expected to actually code it into a machine without making any mistakes. And writing 'this algorithm is supposed to maximize human pleasure' into the source code as a comment is not going to change that. (See the first few paragraphs of Truly Part of You.)

Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by 'pleasure', when all we programmed it to do was X, our probably-failed attempt at summarizing our values? We didn't program it to rewrite its source code to better approximate our True Intentions, or the True Meaning of our in-code comments. And if we did attempt to code it to make either of those self-modifications, that would just produce a new hugely complex block Y which might fail in its own host of ways, given the enormous complexity of what we really mean by 'True Intentions' and 'True Meaning'. So where exactly is the easy, low-hanging fruit that should make us less worried a superintelligence will (because of mistakes we made in its utility function, not mistakes in its factual understanding of the world) hook us up to dopamine drips? All of this seems crucial to your original point in 'The Fallacy of Dumb Superintelligence':

This is what a New Yorker article has to say on the subject of “Moral Machines”: “An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip.”

What they are trying to say is that a future superintelligent machine might have good intentions, because it would want to make people happy, but through some perverted twist of logic it might decide that the best way to do this would be to force (not allow, notice, but force!) all humans to get their brains connected to a dopamine drip.

It seems to me that you've already gone astray in the second paragraph. On any charitable reading (see the New Yorker article), it should be clear that what's being discussed is the gap between the programmer's intended code and the actual code (and therefore actual behaviors) of the AGI. The gap isn't between the AGI's intended behavior and the set of things it's smart enough to figure out how to do. (Nowhere does the article discuss how hard it is for AIs to do things they desire to. Over and over again is the difficulty of programming AIs to do what we want them to discussed — e.g., Asimov's Three Laws.)

So all the points I make above seem very relevant to your 'Fallacy of Dumb Superintelligence', as originally presented. If you were mixing those two gaps up, though, that might help explain why you spent so much time accusing SIAI/MIRI of making this mistake, even though it's the former gap and not the latter that SIAI/MIRI advocates appeal to.

Maybe it would help if you provided examples of someone actually committing this fallacy, and explained why you think those are examples of the error you mentioned and not of the reasonable fact/value gap I've sketched out here?

Comment author: Peterdjones 10 September 2013 10:55:17AM *  0 points [-]

Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by 'pleasure', when all we programmed it to do was X, our probably-failed attempt at summarizing our values?

  • Maybe we didn't do it ithat way. Maybe we did it Loosemore's way, where you code in the high-level sentence, and let the AI figure it out. Maybe that would avoid the problem. Maybe Loosemore has solved FAi much more straightforwardly than EY.

  • Maybe we told it to. Maybe we gave it the low-level expansion of "happy" that we or our seed AI came up with together with an instruction that it is meant to capture the meaning of the high-level statement, and that the HL statement is the Prime Directive, and that if the AI judges that the expansion is wrong, then it should reject the expansion.

  • Maybe the AI will value getting things right because it is rational.

Comment author: RobbBB 10 September 2013 04:46:04PM *  1 point [-]

"code in the high-level sentence, and let the AI figure it out."

http://lesswrong.com/lw/rf/ghosts_in_the_machine/

"Maybe we gave it the low-level expansion of 'happy' that we or our seed AI came up with 'together with' an instruction that it is meant to capture the meaning of the high-level statement"

If the AI is too dumb to understand 'make us happy', then why should we expect it to be smart enough to understand 'figure out how to correctly understand "make us happy", and then follow that instruction'? We have to actually code 'correctly understand' into the AI. Otherwise, even when it does have the right understanding, that understanding won't be linked to its utility function.

"Maybe the AI will value getting things right because it is rational."

http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/

Comment author: Peterdjones 10 September 2013 05:06:45PM 1 point [-]

"code in the high-level sentence, and let the AI figure it out."

http://lesswrong.com/lw/rf/ghosts_in_the_machine/

So it's impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality? What? What is your point? You keep talking as if I am suggesting there is someting that can be had for free, without coding. I never even remotely said that.

If the AI is too dumb to understand 'make us happy', then why should we expect it to be smart enough to understand 'figure out how to correctly understand "make us happy", and then follow that instruction'? We have to actually code 'correctly understand' into the AI. Otherwise, even when it does have the right understanding, that understanding won't be linked to its utility function.

I know. A Loosemore architecture AI has to treat its directives as directives. I never disputed that. But coding "follow these plain English instructions" isn't obviously harder or more fragile than coding "follow <<long expansion of human preferences>>". And it isn't trivial, and I didn't say it was.

Comment author: Eliezer_Yudkowsky 10 September 2013 05:56:05PM 2 points [-]

PeterDJones, if you wish to converse further with RobbBB, I ask that you do so on RobbBB's blog rather than here.

Comment author: RobbBB 10 September 2013 05:16:11PM *  3 points [-]

So it's impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality?

Read the first section of the article you're commenting on. Semantics may turn out to be a harder problem than morality, because the problem of morality may turn out to be a subset of the problem of semantics. Coding a machine to know what the word 'Friendliness' means (and to care about 'Friendliness') is just a more indirect way of coding it to be Friendly, and it's not clear why that added indirection should make an already risky or dangerous project easy or safe. What does indirect indirect normativity get us that indirect normativity doesn't?

Comment author: Eliezer_Yudkowsky 10 September 2013 05:59:42PM 6 points [-]

Robb, at the point where Peterdjones suddenly shows up, I'm willing to say - with some reluctance - that your endless willingness to explain is being treated as a delicious free meal by trolls. Can you direct them to your blog rather than responding to them here? And we'll try to get you some more prestigious non-troll figure to argue with - maybe Gary Drescher would be interested, he has the obvious credentials in cognitive reductionism but is (I think incorrectly) trying to derive morality from timeless decision theory.

Comment author: RobbBB 10 September 2013 06:12:10PM *  4 points [-]

Sure. I'm willing to respond to novel points, but at the stage where half of my responses just consist of links to the very article they're commenting on or an already-referenced Sequence post, I agree the added noise is ceasing to be productive. Fortunately, most of this seems to already have been exorcised into my blog. :)

Comment author: lukeprog 18 September 2013 01:44:10AM *  7 points [-]

Agree with Eliezer. Your explanatory skill and patience are mostly wasted on the people you've been arguing with so far, though it may have been good practice for you. I would, however, love to see you try to talk Drescher out of trying to pull moral realism out of TDT/UDT, or try to talk Dalyrmple out of his "I'm not partisan enough to prioritize human values over the Darwinian imperative" position, or help Preston Greene persuade mainstream philosophers of "the reliabilist metatheory of rationality" (aka rationality as systematized winning).

Comment author: Peterdjones 10 September 2013 05:25:46PM *  0 points [-]

Semantcs isn't optional. Nothing could qualify as an AGI,let alone a super one, unless it could hack natural language. So Loosemore architectures don't make anything harder, since semantics has to be solved anyway.

Comment author: RobbBB 10 September 2013 06:08:50PM 4 points [-]

It's a problem of sequence. The superintelligence will be able to solve Semantics-in-General, but at that point if it isn't already safe it will be rather late to start working on safety. Tasking the programmers to work on Semantics-in-General makes things harder if it's a more complex or roundabout way of trying to address Indirect Normativity; most of the work on understanding what English-language sentences mean can be relegated to the SI, provided we've already made it safe to make an SI at all.

Comment author: Peterdjones 11 September 2013 08:07:03AM 0 points [-]

Then solve semantics in a seed.

Comment author: Broolucks 08 September 2013 07:43:27AM *  2 points [-]

programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I'll call X.

The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure.

The problem, I reckon, is that X will never be anything like this.

It will likely be something much more mundane, i.e. modelling the world properly and predicting outcomes given various counterfactuals. You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had, so you can check by capping utility in inverse proportion to available hardware -- at worst, it will eventually figure out how to shut itself down, and you will dodge a bullet. I also reckon that the AI's capacity for deception would be severely crippled if its utility function penalized it when it didn't predict its own actions or the consequences of its actions correctly. And if you're going to let the AI actually do things... why not do exactly that?

Arguably, such an AI would rather uneventfully arrive to a point where, when asking it "make us happy", it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met. Then we just tell it to do it. I mean, seriously, if we were making an AGI, I would think "tell us what will happen next" would be fairly high in our list of priorities, only surpassed by "do not do anything we veto". Why would you program AI to "maximize happiness" rather than "produce documents detailing every step of maximizing happiness"? They are basically the same thing, except that the latter gives you the opportunity for a sanity check.

Comment author: RobbBB 08 September 2013 09:21:41AM *  2 points [-]

You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had

What counts as 'resources'? Do we think that 'hardware' and 'software' are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?

I also reckon that the AI's capacity for deception would be severely crippled if its utility function penalized it when it didn't predict its own actions or the consequences of its actions correctly.

Hm? That seems to only penalize it for self-deception, not for deceiving others.

Arguably, such an AI would rather uneventfully arrive to a point where, when asking it "make us happy", it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met.

You're talking about an Oracle AI. This is one useful avenue to explore, but it's almost certainly not as easy as you suggest:

"'Tool AI' may sound simple in English, a short sentence in the language of empathically-modeled agents — it's just 'a thingy that shows you plans instead of a thingy that goes and does things.' If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like 'showing someone a plan' or 'going and doing things', and you've got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like 'Model the user and the universe, predict the degree of correspondence between the user's model and the universe, and select from among possible explanation-actions on this basis.' [...]

"If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, 'magical' isn't to be taken as prejudicial, it's a term of art that means we haven't said how the component works yet.) There's a magical comprehension of the user's utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.

"report($leading_action) isn't exactly a trivial step either. Deep Blue tells you to move your pawn or you'll lose the game. You ask 'Why?' and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask 'Why?', and then humans aren't even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.

"'So,' you say, after the first dozen times you ask the Oracle a question and it returns an answer that you'd have to take on faith, 'we'll just specify in the utility function that the plan should be understandable.'

"Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I've slightly simplified:

"'Example question: "How should I get rid of my disease most cheaply?" Example answer: "You won't. You will die soon, unavoidably. This report is 99.999% reliable". Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.'

"Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human— rather than just the human's 'goodness' of the outcome. And if the preference function was just over the human's 'goodness' of the end result, rather than the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.

"I'm not saying any particular failure is 100% certain to occur; rather I'm trying to explain - as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating - how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.

"If you see the world just in terms of math, it's even worse; you've got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It's a whole lot easier to call what's inside a 'planning Oracle' or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I'll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I'll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say "Oh, but of course that's not what I meant." Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say 'Explain to a neutral judge how the math kills" and not "Explain to the person who invented that math and likes it.')

"Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there's a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI. 'Just make suggestions - don't do anything!' is, in the end, just more English."

Comment author: Broolucks 08 September 2013 07:59:11PM 4 points [-]

What counts as 'resources'? Do we think that 'hardware' and 'software' are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?

What is "taking over the world", if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others pointed out, that an AI which "misunderstands" your original instructions will demonstrate this earlier than later. For instance, if you create a resource "honeypot" outside the AI which is trivial to take, an AI would naturally take that first, and then you know there's a problem. It is not going to figure out you don't want it to take it before it takes it.

Hm? That seems to only penalize it for self-deception, not for deceiving others.

When I say "predict", I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party.

You're talking about an Oracle AI. This is one useful avenue to explore, but it's almost certainly not as easy as you suggest:

The first part of what you copy pasted seems to say that "it's nontrivial to implement". No shit, but I didn't say the contrary. Then there is a bunch of "what if" scenarios I think are not particularly likely and kind of contrived:

Example question: "How should I get rid of my disease most cheaply?" Example answer: "You won't. You will die soon, unavoidably. This report is 99.999% reliable". Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.'

Because asking for understandable plans means you can't ask for plans you don't understand? And you're saying that refusing to give a plan counts as success and not failure? Sounds like a strange set up that would be corrected almost immediately.

And if the preference function was just over the human's 'goodness' of the end result, rather than the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome.

If the AI has the right idea about "human understanding", I would think it would have the right idea about what we mean by "good". Also, why would you implement such a function before asking the AI to evaluate examples of "good" and provide their own?

And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.

Is making humans happy so hard that it's actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter?

And if you ask it to tell you whether "taking happy pills" is an outcome most humans would approve of, what is it going to answer? If it's going to do this for happiness, won't it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becoming wireheads and never picking up on the trend is actually less effort than just giving humans what they really want? To me this is like driving a whole extra hour to get to a store that sells an item you want fifty cents cheaper.

I'm not saying these things are not possible. I'm saying that they are contrived: they are constructed to the express purpose of being failure modes, but there's no reason to think they would actually happen, especially given that they seem to be more complicated than the desired behavior.

Now, here's the thing: you want to develop FAI. In order to develop FAI, you will need tools. The best tool is Tool AI. Consider a bootstrapping scheme: in order for commands written in English to be properly followed, you first make AI for the very purpose of modelling human language semantics. You can check that the AI is on the same page as you are by discussing with it and asking questions such as: "is doing X in line with the objective 'Y'?"; it doesn't even need to be self-modifying at all. The resulting AI can then be transformed into a utility function computer: you give the first AI an English statement and build a second AI maximizing the utility which is given to it by the first AI.

And let's be frank here: how else do you figure friendly AI could be made? The human brain is a complex, organically grown, possibly inconsistent mess; you are not going, from human wits alone, to build some kind of formal proof of friendliness, even a probabilistic one. More likely than not, there is no such thing: concepts such as life, consciousness, happiness or sentience are ill-defined and you can't even demonstrate the friendliness of a human being, or even of a group of human beings, let alone of humanity as a whole, which also is a poorly defined thing.

However, massive amounts of information about our internal thought processes are leaked through our languages. You need AI to sift through it and model these processes, their average and their variance. You need AI to extract this information, fill in the holes, produce probability clouds about intent that match whatever borderline incoherent porridge of ideas our brains implement as the end result of billions of years of evolutionary fumbling. In a sense, I guess this would be X in your seed AI: AI which already demonstrated, to our satisfaction, that it understands what we mean, and directly takes charge of a second AI's utility measurement. I don't really see any alternatives: if you want FAI, start by focusing on AI that can extract meaning from sentences. Reliable semantic extraction is virtually a prerequisite for FAI, if you can't do the former, forget about the latter.

Comment author: ciphergoth 06 September 2013 03:14:36PM 5 points [-]

I'm really glad you posted this, even though it may not enlighten the person it's in reply to: this is an error lots of people make when you try to explain the FAI problem to them, and the "two gaps" explanation seems like a neat way to make it clear.

Comment author: Richard_Loosemore 05 September 2013 10:20:47PM *  2 points [-]

Rob,

This afternoon I spent some time writing a detailed, carefully constructed reply to your essay. I had trouble posting it due to an internet glitch when I was at work, but now I am home I was about to submit when suddenly discovered that my friends were warning me about the following comment that was posted to the thread:


Comment author: Eliezer_Yudkowsky 05 September 2013 07:30:56PM 1 point [-]

Warning: Richard Loosemore is a known permanent idiot, ponder carefully before deciding to spend much time arguing with him.

(If you're fishing for really clear quotes to illustrate the fallacy, that may make sense.)

--

So. I will not be posting my reply after all.

I will not waste any more of my time in a context controlled by an abusive idiot.

If you want to discuss the topic (and I had many positive, constructive thoughts to contribute), feel free to suggest an alternative venue where we can engage in a debate without trolls interfering with the discussion.

Sincerely,

Richard Loosemore Mathematical and Physical Sciences, Wells College Aurora, NY 13026 USA

Comment author: XiXiDu 05 September 2013 05:56:11PM *  4 points [-]

We seem to agree that for an AI to talk itself out of a confinement (like in the AI box experiment), the AI would have to understand what humans mean and want.

As far as I understand your position, you believe that it is difficult to make an AI care to do what humans want, apart from situations where it is temporarily instrumentally useful to do what humans want.

Do you agree that for such an AI to do what humans want, in order to deceive them, humans would have to succeed at either encoding the capability to understand what humans want, or succeed at encoding the capability to make itself capable of understanding what humans want?

My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.

In other words, humans intend an AI to be intelligent and use its intelligence in a certain way. And in order to be an existential risk, humans need to succeed making and AI behave intelligently but fail at making it use its intelligence in a way that does not kill everyone.

Do you agree?

Comment author: hairyfigment 05 September 2013 11:58:08PM -1 points [-]

Say we find an algorithm for producing progressively more accurate beliefs about itself and the world. This algorithm may be long and complicated - perhaps augmented by rules-of-thumb whenever the evidence available to it says these rules make better predictions. (E.g, "nine times out of ten the Enterprise is not destroyed.") Combine this with an arbitrary goal and we have the making of a seed AI.

Seems like this could straightforwardly improve its ability to predict humans without changing its goal, which may be 'maximize pleasure' or 'maximize X'. Why would it need to change its goal?

If you deny the possibility of the above algorithm, then before giving any habitual response please remember what humanity knows about clinical vs. actuarial judgment. What lesson do you take from this?

Comment author: RobbBB 05 September 2013 06:14:32PM *  12 points [-]

Your summaries of my views here are correct, given that we're talking about a superintelligence.

My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.

Well, there's obviously a difference; 'what an AI can do' and 'what an AI will do' mean two different things. I agree with you that this difference isn't a particularly profound one, and the argument shouldn't rest on it.

What the argument rests on is, I believe, that it's easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don't know how to fully formalize).

If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn't value our well-being, how do we make reality bite back and change the AI's course? How do we give our morality teeth?

Whatever goals it initially tries to pursue, it will fail in those goals more often the less accurate its models are of its circumstances; so if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it's not the goal we intended it to be good at), then it doesn't take a large leap of the imagination to see how it could receive feedback from its environment about how well it's doing at modeling states of affairs. 'Modeling states of affairs well' is not a highly specific goal, it's instrumental to nearly all goals, and it's easy to measure how well you're doing at it if you're entangled with anything about your environment at all, e.g., your proximity to a reward button.

(And when a system gets very good at modeling itself, its environment, and the interactions between the two, such that it can predict what changes its behaviors are likely to effect and choose its behaviors accordingly, then we call its behavior 'intelligent'.)

This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity. We understand how accurately modeling something works; we understand the basic principles of intelligence. We don't understand the basic principles of moral value, and we don't even have a firm grasp about how to go about finding out the answer to moral questions. Presumably our values are encoded in some way in our brains, such that there is some possible feedback loop we could use to guide an AGI gradually toward Friendliness. But how do we figure out in advance what that feedback loop needs to look like, without asking the superintelligence? (We can't ask the superintelligence what algorithm to use to make it start becoming Friendly, because to the extent it isn't already Friendly it isn't a trustworthy source of information. This is in addition to the seed/intelligence distinction I noted above.)

If we slightly screw up the AGI's utility function, it will still need to to succeed at modeling things accurately in order to do anything complicated at all. But it will not need to succeed at optimally caring about what humans care about in order to do anything complicated at all.