
Magical Categories

Post author: Eliezer_Yudkowsky 24 August 2008 07:51PM

Followup to: Anthropomorphic Optimism, Superexponential Conceptspace, The Hidden Complexity of Wishes, Unnatural Categories

'We can design intelligent machines so their primary, innate emotion is unconditional love for all humans.  First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language.  Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'
        -- Bill Hibbard (2001), Super-intelligent machines.

That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I'm discussing here.

So... um... what could possibly go wrong...

When I mentioned (sec. 6) that Hibbard's AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:

'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

As Hibbard also wrote "Such obvious contradictory assumptions show Yudkowsky's preference for drama over reason," I'll go ahead and mention that Hibbard illustrates a key point:  There is no professional certification test you have to take before you are allowed to talk about AI morality.  But that is not my primary topic today.  Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.  Even Michael Vassar was probably surprised his first time through.

No, today I am here to dissect "You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans."

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set - output "yes" for the 50 photos of camouflaged tanks, and output "no" for the 50 photos of forest.

Now this did not prove, or even imply, that new examples would be classified correctly.  The neural network might have "learned" 100 special cases that wouldn't generalize to new problems.  Not, "camouflaged tanks versus forest", but just, "photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive..."

But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set.  The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly.   Success confirmed!

The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers' data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.
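
To make the failure mode concrete, here is a minimal sketch in Python with NumPy. The 16x16 "photos", the brightness cue, and the nearest-centroid stand-in for a neural network are all fabricated for illustration, not a reconstruction of the Army experiment: the learner fits the training set, passes a held-out test drawn from the same biased process, and then falls to chance once the spurious cue stops correlating with the label.

```python
# Minimal sketch of a classifier latching onto a spurious feature (brightness)
# that happens to correlate with the label in the training data.
import numpy as np

rng = np.random.default_rng(0)

def make_photos(n, tank, cloudy):
    """Toy 16x16 'photos': brightness depends on weather; a faint bright
    blob stands in for the tank.  All values are fabricated."""
    base = 0.3 if cloudy else 0.7                      # cloudy days are darker
    imgs = base + 0.05 * rng.standard_normal((n, 16, 16))
    if tank:
        imgs[:, 6:10, 6:10] += 0.1                     # faint 'tank' signature
    return imgs.reshape(n, -1)

# Biased training set: every tank photo is cloudy, every forest photo is sunny.
X_train = np.vstack([make_photos(50, tank=True,  cloudy=True),
                     make_photos(50, tank=False, cloudy=False)])
y_train = np.array([1] * 50 + [0] * 50)

# Nearest-centroid stand-in for the neural net: classify by distance to class means.
mu_tank   = X_train[y_train == 1].mean(axis=0)
mu_forest = X_train[y_train == 0].mean(axis=0)

def predict(X):
    d_tank   = ((X - mu_tank) ** 2).sum(axis=1)
    d_forest = ((X - mu_forest) ** 2).sum(axis=1)
    return (d_tank < d_forest).astype(int)

# Held-out photos from the SAME biased process: looks like success.
X_same = np.vstack([make_photos(50, True, True), make_photos(50, False, False)])
y_same = np.array([1] * 50 + [0] * 50)
print("held-out, same bias:", (predict(X_same) == y_same).mean())   # ~1.0

# The Pentagon's photos: weather no longer correlates with tanks.  Chance level.
X_new = np.vstack([make_photos(25, True, True),  make_photos(25, True, False),
                   make_photos(25, False, True), make_photos(25, False, False)])
y_new = np.array([1] * 50 + [0] * 50)
print("new context:        ", (predict(X_new) == y_new).mean())     # ~0.5
```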

This parable - which might or might not be fact - illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence:  If the training problems and the real problems have the slightest difference in context - if they are not drawn from the same independently identically distributed process - there is no statistical guarantee from past success to future success.  It doesn't matter if the AI seems to be working great under the training conditions.  (This is not an unsolvable problem but it is an unpatchable problem.  There are deep ways to address it - a topic beyond the scope of this post - but no bandaids.)

As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes.  If a black-and-white image is 256 pixels on a side, then the total image is 65536 pixels.  The number of possible images is 2^65536.  And the number of possible concepts that classify images into positive and negative instances - the number of possible boundaries you could draw in the space of images - is 2^(2^65536).  From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 2^65536 classified examples to discriminate among 2^(2^65536) possible concepts - even if classifications are constant over time.
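
A quick worked version of that arithmetic, at toy scale (a minimal sketch; the 256x256 figures are only stated in a comment, not computed):

```python
# Counting argument at toy scale.  Each binary image has n_pixels on/off pixels;
# a "concept" is any way of sorting all possible images into positive and
# negative instances, i.e. any subset of the set of possible images.
import math

for side in (1, 2, 3):                    # 256x256 is hopeless to enumerate
    n_pixels = side * side
    n_images = 2 ** n_pixels              # possible images
    n_concepts = 2 ** n_images            # possible classifications of those images
    # Without inductive bias, singling out one concept takes log2(n_concepts)
    # bits of labels -- one label for every possible image.
    labels_needed = int(math.log2(n_concepts))
    assert labels_needed == n_images
    print(f"{side}x{side}: {n_images} images, {n_concepts} concepts, "
          f"{labels_needed} labeled examples needed")

# For 256x256: 2^65536 images and 2^(2^65536) concepts -- which is why
# supervised learning is almost entirely a matter of inductive bias.
```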

If this seems at all counterintuitive or non-obvious, see Superexponential Conceptspace.

So let us now turn again to:

'First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language.  Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'

and

'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

It's trivial to discriminate between a photo of a camouflaged tank and a photo of an empty forest, in the sense of determining that the two photos are not identical.  They're different pixel arrays with different 1s and 0s in them.  Discriminating between them is as simple as testing the arrays for equality.

Classifying new photos into positive and negative instances of "smile", by reasoning from a set of training photos classified positive or negative, is a different order of problem.

When you've got a 256x256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65537th bit denoting the positiveness - no tiny little XML tag that says "This image is inherently positive".  It's only a positive example relative to some particular concept.

But for any non-Vast amount of training data - any training data that does not include the exact bitwise image now seen - there are superexponentially many possible concepts compatible with previous classifications.

For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias.  Which may not match what the user has in mind.  The gap between these two example-classifying processes - induction on the one hand, and the user's actual goals on the other - is not trivial to cross.

Let's say the AI's training data is:

Dataset 1:

  • +
    • Smile_1, Smile_2, Smile_3
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5

Now the AI grows up into a superintelligence, and encounters this data:

Dataset 2:

  •  
    • Frown_6, Cat_3, Smile_4, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

It is not a property of these datasets that the inferred classification you would prefer is:

  • +
    • Smile_1, Smile_2, Smile_3, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

rather than

  • +
    • Smile_1, Smile_2, Smile_3, Molecular_Smileyface_1, Molecular_Smileyface_2, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Cat_4, Galaxy_2, Nanofactory_2

Both of these classifications are compatible with the training data.  The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset.  If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.

Which classification will the AI choose?  This is not an inherent property of the training data; it is a property of how the AI performs induction.
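
Here is a small illustrative sketch of that point. The instances and their three features are invented stand-ins for the datasets above: two concepts with different inductive biases both fit Dataset 1 exactly, then part ways on the molecular smileyfaces in Dataset 2.

```python
# Two learners with different inductive biases both fit the training data
# perfectly, yet disagree about the new instances.  The features (smiley
# geometry? human face? expresses happiness?) are hypothetical stand-ins.

DATASET_1 = {  # training data: name -> (features, label)
    "Smile_1": ((1, 1, 1), "+"), "Smile_2": ((1, 1, 1), "+"), "Smile_3": ((1, 1, 1), "+"),
    "Frown_1": ((0, 1, 0), "-"), "Frown_2": ((0, 1, 0), "-"), "Cat_1": ((0, 0, 0), "-"),
    "Boat_1":  ((0, 0, 0), "-"), "Car_1":   ((0, 0, 0), "-"),
}

DATASET_2 = {  # encountered later: name -> features only (no labels exist)
    "Smile_4":                (1, 1, 1),
    "Frown_6":                (0, 1, 0),
    "Cat_3":                  (0, 0, 0),
    "Galaxy_1":               (0, 0, 0),
    "Molecular_Smileyface_1": (1, 0, 0),   # smiley geometry, no face, no happiness
}

# Two concepts (inductive biases) compatible with every label in Dataset 1:
concept_a = lambda f: f[0] == 1                 # "anything with smiley geometry"
concept_b = lambda f: f[0] == 1 and f[1] == 1   # "smiley geometry on a human face"

for name, (features, label) in DATASET_1.items():
    assert concept_a(features) == (label == "+")
    assert concept_b(features) == (label == "+")

for name, features in DATASET_2.items():
    print(f"{name:24s} concept_a: {concept_a(features)}  concept_b: {concept_b(features)}")
# Both concepts fit the training data; only the learner's inductive bias
# decides whether Molecular_Smileyface_1 comes out positive.
```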

Which is the correct classification?  This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name "right").

The concept that you wanted, cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so.  That's what supervised learning is all about - providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.

But unless the training data is drawn from exactly the same context as the real-life cases, the training data will be "shallow" in some sense, a projection from a much higher-dimensional space of possibilities.

The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, or it never saw a tiny little agent with a happiness counter set to a googolplex.  Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile.  But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values.  It is your own plans and desires that are at work when you say "No!"

Hibbard knows instinctively that a tiny molecular smileyface isn't a "smile", because he knows that's not what he wants his putative AI to do.  If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling - as opposed to frowning, say - even though it's only paint.

As the case of Terry Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas.  Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece, will not enable the AI to make a moral decision as to whether switching off Terry's life support is murder.  That information isn't present in the dataset even inductively!  Terry Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn't need to think about while classifying photos of living and dead humans from the time of Ancient Greece.  No one was on life support then, still breathing with a brain half fluid.  So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.

As a matter of formal fallacy, I see two anthropomorphic errors on display.

The first fallacy is underestimating the complexity of a concept we develop for the sake of its value.  The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven't seen before.  But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is just obviously not a smile.  And we don't generate all possible borderline cases, so we don't think of all the considerations that might play a role in redefining the concept, but haven't yet played a role in defining it.  Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data.  (And also the difficulty of describing the concept directly - see The Hidden Complexity of Wishes.)

The second fallacy is anthropomorphic optimism:  Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of "smile".  As Hibbard uses the "smile" concept (to describe desired behavior of superintelligences), extending "smile" to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do - inherently so, as a property of the concept itself - so surely a superintelligence would not do it; this is just obviously the wrong classification.  Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.

Why, Friendly AI isn't hard at all!  All you need is an AI that does what's good!  Oh, sure, not every possible mind does what's good - but in this case, we just program the superintelligence to do what's good.  All you need is a neural network that sees a few instances of good things and not-good things, and you've got a classifier.  Hook that up to an expected utility maximizer and you're done!

I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI.  Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences?  Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires.  But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.  Relative to the full space of possibilities the Future encompasses, we ourselves haven't imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out.  To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.

Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a "smile", and excludes every facial expression that Hibbard wouldn't label a "smile"...

Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

The deep answers to such problems are beyond the scope of this post, but it is a general principle of Friendly AI that there are no bandaids.  In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors.  Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying "Yes, that's happiness!" about hydrogen atoms - hydrogen atoms are easy to make.

Link to my discussion with Hibbard here.  You already got the important parts.

Comments (89)

Comment author: peco 24 August 2008 08:14:50PM -1 points [-]

Why can't the AI just be exactly the same as Hibbard? If Hibbard is flawed in a major way, you could make an AI for every person on Earth (this obviously wouldn't be practical, but if a few million AIs are bad the other few billion can deal with them).

Comment author: DanielLC 13 August 2012 05:45:54AM 5 points [-]

We already have an entity exactly the same as Hibbard. Namely: Hibbard. Why do we need another one?

What we want is an AI that's far more intelligent than a human, yet shares their values. Increasing intelligence while preserving values is nontrivial. You could try giving Hibbard the ability to self-modify, but then he'd most likely just go insane in some way or another.

Comment author: Carl_Shulman 24 August 2008 08:16:34PM 6 points [-]

"Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers."

You use examples of this type fairly often, but for a utility function linear in smiles wouldn't the number of smiles generated by pleasing the programmers be trivial relative to the output of even a little while with access to face-xeroxing? This could be partly offset by anthropic/simulation issues, but still I would expect the overwhelming motive for appearing to work correctly during childhood (after it could recognize this point) would be tricking the programmers, not the tiny gains from their smiles.

Comment author: Carl_Shulman 24 August 2008 08:35:05PM 3 points [-]

For instance, a weak AI might refrain from visibly trying to produce smiles in disturbing ways as part of an effort (including verbal claims) to convince the programmers that it had apprehended the objective morality behind their attempts to inculcate smiles as a reinforcer.

Comment author: Tim_Tyler 24 August 2008 08:55:03PM -1 points [-]

Early AIs are far more likely to be built to maximise the worth of the company that made them than anything to do with human happiness. E.g. see: Artificial intelligence applied heavily to picking stocks

A utility function measured in dollars seems fairly unambiguous.

Comment author: DilGreen 11 October 2010 11:48:29AM 14 points [-]

A utility function measured in dollars seems fairly unambiguously to lead to decisions that are non-optimal for humans, without a sophisticated understanding of what dollars are.

Dollars mean something for humans because they are tokens in a vast, partly consensual and partially reified game. Economics, which is our approach to developing dollar maximising strategies, is non-trivial.

Training an AI to understand dollars as something more than data points would be similarly non-trivial to training an AI to faultlessly assess human happiness.

Comment author: PhilGoetz 20 September 2011 07:49:30PM *  1 point [-]

But that's not what this post is about. Eliezer is examining a different branch of the tree of possible futures.

Comment author: JessRiedel 24 August 2008 09:12:51PM 6 points [-]

Eliezer, I believe that your belittling tone is conducive to neither a healthy debate nor a readable blog post. I suspect that your attitude is borne out of just frustration, not contempt, but I would still strongly encourage you to write more civilly. It's not just a matter of being nice; rudeness prevents both the speaker and the listener from thinking clearly and objectively, and it doesn't contribute to anything.

Comment author: Anon17 24 August 2008 09:53:59PM 0 points [-]

It has always struck me that the tiling the universe with smiley faces example is one of the stupidest possible examples Eliezer could have come up with. It is extremely implausible, MUCH, MUCH more so than the camouflage tank scenario, and I understand Hibbard's indignation even if I agree with Eliezer on the general point he is making.

I have no idea why Eliezer wouldn't choose a better example that illustrates the same point, like the AGI spiking the water supply with a Soma-like drug that actually does make us all profoundly content in a highly undesirable way.

Comment author: retired_urologist 24 August 2008 10:00:01PM -1 points [-]

Jess Riedel,

I don't know Eliezer Yudkowsky, but I have lots of spare time, and I have laboriously read his works for the past few months. I don't think much gets past him, within his knowledge base, and I don't think he cares about the significance of blog opinions, except as they illustrate predictable responses to certain stimuli. By making his posts quirky and difficult to understand, he weeds out the readers who are more comfortable at Roissy in DC, leaving him with study subjects of greater value to his project. His posts don't ask for suggestions; they teach, seeking clues to the best methods for communicating core data. Some are specifically aimed at producing controversy, especially in particular readers. Some are intentionally in conflict with his previously stated positions, to observe the response. The comparison I've previously made is that of Jane Goodall trying to understand chimps by observing their behavior and their reactions to stimuli, and to challenges requiring innovation, but even better because EY is more than a pure observer: he manipulates the environment to suit his interest. We'll see the results in the FAI, one day, I hope. If rudeness is part of that, right on.

Comment author: Shane_Legg 24 August 2008 10:35:10PM 3 points [-]

It is just me, or are things getting a bit unfriendly around here?

Anyway...

Wiring up the AI to maximise happy faces etc. is not a very good idea, the goal is clearly too shallow to reflect the underlying intent. I'd have to read more of Hibbard's stuff to properly understand his position, however.

That said, I do agree with a more basic underlying theme that he seems to be putting forward. In my opinion, a key, perhaps even THE key to intelligence is the ability to form reliable deep abstractions. In Solomonoff induction and AIXI you see this being driven by the Kolmogorov compressor; in the brain the neocortical hierarchy seems to be key. Furthermore, if you adopt the perspective I've taken on intelligence (i.e. the universal intelligence measure) you see that the reverse implication is true: intelligence actually requires the ability to form deep abstractions. In which case, a *super* intelligent machine must have the ability to form *very* deep and reliable abstractions about the world. Such a machine could still try to turn the world into happy faces, if this was its goal. However, it wouldn't do this by accident because its ability to form abstractions was so badly flawed that it doesn't differentiate between smiling faces and happy people. It's not that stupid. Note that this goes for forming powerful abstractions in general, not just human things like happiness and faces.

Comment author: Kenny 18 May 2013 12:56:51AM 2 points [-]

"It's not that stupid."

What if it doesn't care about happiness or smiles or any other abstractions that we value? A super-intelligence isn't an unlimited intelligence, i.e. it would still have to choose what to think about.

Comment author: bouilhet 08 September 2013 07:15:29PM 1 point [-]

I think the point is that if you accept this definition of intelligence, i.e. that it requires the ability to form deep and reliable abstractions about the world, then it doesn't make sense to talk about any intelligence (let alone a super one) being unable to differentiate between smiley-faces and happy people. It isn't a matter, at least in this instance, of whether it cares to make that differentiation or not. If it is intelligent, it will make the distinction. It may have values that would be unrecognizable or abhorrent to humans, and I suppose that (as Shane_Legg noted) it can't be ruled out that such values might lead it to tile the universe with smiley-faces, but such an outcome would have to be the result of something other than a mistake. In other words, if it really is "that stupid," it fails in a number of other ways long before it has a chance to make this particular error.

Comment author: RobbBB 15 January 2014 08:57:26AM 1 point [-]

I wrote a post about this! See The genie knows, but doesn't care.

It may not make sense to talk about a superintelligence that's too dumb to understand human values, but it does make sense to talk about an AI smart enough to program superior general intelligences that's too dumb to understand human values. If the first such AIs ('seed AIs') are built before we've solved this family of problems, then the intelligence explosion thesis suggests that it will probably be too late. You could ask an AI to solve the problem of FAI for us, but it would need to be an AI smart enough to complete that task reliably yet too dumb (or too well-boxed) to be dangerous.

Comment author: TheAncientGeek 15 January 2014 05:29:42PM 0 points [-]

but it does make sense to talk about an AI smart enough to program superior general intelligences that's too dumb to understand human values

Superior to what? If they are only as smart as the average person, then all things being equal, they will be as good as the average person at figuring out morality. If they are smarter, they will be better. You seem to be tacitly assuming that the Seed AIs are designing walled-off unupdateable utility functions. But if one assumes a more natural architecture, where moral sense is allowed to evolve with everything else, you would expect an incremental succession of AIs to gradually get better at moral reasoning. And if it fooms, its moral reasoning will foom along with everything else, because you haven't created an artificial problem by firewalling it off.

Comment author: RobbBB 15 January 2014 06:15:27PM *  0 points [-]

Superior to what?

Superior to itself.

If they are only as smart as the average person, then all things being equal, they will be as good as the average person at figuring out morality.

That's not generally true of human-level intelligences. We wouldn't expect a random alien species that happens to be as smart as humans to be very successful at figuring out human morality. It may be true if the human-level AGI is an unmodified emulation of a human brain. But humans aren't very good at figuring out morality; they can make serious mistakes, though admittedly not the same mistakes Eliezer gives as examples above. (He deliberately picked ones that sound 'stupid' to a human mind, to make the point that human concepts have a huge amount of implicit complexity built in.)

If they are smarter, they will be better,

Not necessarily. The average chimpanzee is better than the average human at predicting chimpanzee behavior, simulating chimpanzee values, etc. (See Sympathetic Minds.)

walled-off unupdateable utility functions.

Utility functions that change over time are more dangerous than stable ones, because it's harder to predict how a descendant of a seed AI with a heavily modified utility function will behave than it is to predict how a descendant with the same utility function will behave.

you would expect an incremental succession of AIs to gradually get better at moral reasoning.

If we don't solve the problem of Friendly AI ourselves, we won't know what trajectory of self-modification to set the AI on in order for it to increasingly approximate Friendliness. We can't tell it to increasingly approximate something that we ourselves cannot formalize and cannot point to clear empirical evidence of.

We already understand arithmetic, so we know how to reward a system for gradually doing better and better at arithmetic problems. We don't understand human morality or desire, so we can't design a Morality Test or Wish Test that we know for sure will reward all and only the good or desirable actions. We can make the AI increasingly approximate something, sure, but how do we know in advance that that something is something we'd like?

Comment author: TheAncientGeek 16 January 2014 12:49:09PM *  0 points [-]

That's not generally true of human-level intelligences. We wouldn't expect a random alien species that happens to be as smart as humans to be very successful at figuring out human morality.

Assuming morality is lots of highly localised, different things... which I don't, particularly. If it is not, then you can figure it out anywhere. If it is, then the problem the aliens have is not that morality is imponderable, but that they don't have access to the right data. They don't know how things are on Earth. However, an AI built on Earth would. So the situation is not analogous. The only disadvantage an AI would have is not having biological drives itself, but it is not clear that an entity needs to have drives in order to understand them. We could expect an SIAI to get incrementally better at maths than us until it surpasses us; we wouldn't worry that it would hit on the wrong maths, because maths is not a set of arbitrary, disconnected facts.

But humans aren't very good at figuring out morality; they can make serious mistakes

An averagely intelligent AI with an average grasp of morality would not be more of a threat than an average human. A smart AI would, all other things being equal, be better at figuring out morality. But all other things are not equal, because you want to create problems by walling off the UF.

(He deliberately picked ones that sound 'stupid' to a human mind, to make the point that human concepts have a huge amount of implicit complexity built in.)

I'm sure they do. That seems to be why progress in AGI, specifically use of natural language, has been achingly slow. But why should moral concepts be so much more difficult than others? An AI smart enough to talk its way out of a box would be able to understand the implicit complexity: an AI too dumb to understand implicit complexity would be boxable. Where is the problem?

Utility functions that change over time are more dangerous than stable ones, because it's harder to predict how a descendant of a seed AI with a heavily modified utility function will behave than it is to predict how a descendant with the same utility function will behave.

Things are not inherently dangerous just because they are unpredictable. If you have some independent reason for thinking something might turn dangerous, then it becomes desirable to predict it.

But superintelligent artificial general intelligences are generally assumed to be good at everything: they are not assumed to develop mysterious blind spots about falconry or mining engineering. Why assume they will develop a blind spot about morality? Oh yes... because you have assumed from the outset that the UF must be walled off from self-improvement... in order to be safe. You are only facing that particular failure mode because of something you decided on to be safe.

If we don't solve the problem of Friendly AI ourselves, we won't know what trajectory of self-modification to set the AI on in order for it to increasingly approximate Friendliness

The average person manages to solve the problem of being moral themselves, in a good-enough way. You keep assuming, without explanation, that an AI can't do the same.

We can't tell it to increasingly approximate something that we ourselves cannot formalize and cannot point to clear empirical evidence of.

Why isn't having a formalisation of morality a problem with humans? We know how humans incrementally improve as moral reasoners: it's called the Kohlberg hierarchy.

We don't understand human morality or desire, so we can't design a Morality Test or Wish Test that we know for sure will reward all and only the good or desirable actions.

We don't have perfect morality tests. We do have morality tests. Fail them and you get pilloried in the media or sent to jail.

We can make the AI increasingly approximate something, sure, but how do we know in advance that that something is something we'd like?

Again, you are assuming that morality is something highly local and arbitrary. If it works like arithmetic, that is if it is an expansion of some basic principles, then we can tell that it is heading in the right direction by identifying that its reasoning is in line with those principles.

Comment author: RobbBB 16 January 2014 09:05:03PM *  1 point [-]

Assuming morality is lots of highly localised, different things... which I don't, particularly.

The problem of FAI is the problem of figuring out all of humanity's deepest concerns and preferences, not just the problem of figuring out the 'moral' ones (whichever those are). E.g., we want a superintelligence to not make life boring for everyone forever, even if 'don't bore people' isn't a moral imperative.

Regardless, I don't see how the moral subset of human concerns could be simplified without sacrificing most human intuitions about what's right and wrong. Human intuitions as they stand aren't even consistent, so I don't understand how you can think the problem of making them consistent and actionable is going to be a simple one.

If it is not, then you can figure it out anywhere.

Someday, perhaps. With enough time and effort invested. Still, again, we would expect a lot more human-intelligence-level aliens (even if those aliens knew a lot about human behavior) to be good at building better AIs than to be good at formalizing human value. For the same reason, we should expect a lot more possible AIs we could build to be good at building better AIs than to be good at formalizing human value.

If it is, then the problem the aliens have is not that morality is imponderable

I don't know what you mean by 'imponderable'. Morality isn't ineffable; it's just way too complicated for us to figure out. We know how things are on Earth; we've been gathering data and theorizing about morality for centuries. And our progress in formalizing morality has been minimal.

An averagely intelligent AI with an average grasp of morality would not be more of a threat than an average human.

An AI that's just a copy of a human running on transistors is much more powerful than a human, because it can think and act much faster.

A smart AI would, all other things being equal, be better at figuring out morality.

It would also be better at figuring out how many atoms are in my fingernail, but that doesn't mean it will ever get an exact count. The question is how rough an approximation of human value can we allow before all value is lost; this is the 'fragility of values' problem. It's not enough for an AGI to do better than us at FAI; it has to be smart enough to solve the problem to a high level of confidence and precision.

But why should moral concepts be so much more difficult than others?

First, because they're anthropocentric; 'iron' can be defined simply because it's a common pattern in Nature, not a rare high-level product of a highly contingent and complex evolutionary history. Second, because they're very inclusive; 'what humans care about' or 'what humans think is Right' is inclusive of many different human emotions, intuitions, cultural conventions, and historical accidents.

But the main point is just that human value is difficult, not that it's the most difficult thing we could do. If other tasks are also difficult, that doesn't necessarily make FAI easier.

An AI smart enough to talk its way out of a box would be able to understand the implicit complexity: an AI too dumb to understand implicit complexity would be boxable. Where is the problem?

You're forgetting the 'seed is not the superintelligence' lesson from The genie knows, but doesn't care. If you haven't read that article, go do so. The seed AI is dumb enough to be boxable, but also too dumb to plausibly solve the entire FAI problem itself. The superintelligent AI is smart enough to solve FAI, but also too smart to be safely boxed; and it doesn't help us that an unFriendly superintelligent AI has solved FAI, if by that point it's too powerful for us to control. You can't safely pass the buck to a superintelligence to tell us how to build a superintelligence safe enough to pass bucks to.

Things are not inherently dangerous just because they are unpredictable. If you have some independent reason for thinking something might turn dangerous, then it becomes desirable to predict it.

Yes. The five theses give us reason to expect superintelligent AI to be dangerous by default. Adding more unpredictability to a system that already seems dangerous will generally make it more dangerous.

they are not assumed to develop mysterious blind spots about falconry or mining engineering. Why assume they will develop a blind spot about morality?

'The genie knows, but doesn't care' means that the genie (i.e., superintelligence) knows how to do human morality (or could easily figure it out, if it felt like trying), but hasn't been built to care about human morality. Knowing how to behave the way humans want you to is not sufficient for actually behaving that way; Eliezer makes that point well in No Universally Compelling Arguments.

The worry isn't that the superintelligence will be dumb about morality; it's that it will be indifferent to morality, and that by the time it exists it will be too late to safely change that indifference. The seed AI (which is not a superintelligence, but is smart enough to set off a chain of self-modifications that lead to a superintelligence) is dumb about morality (approximately as dumb as humans are, if not dumber), and is also probably not a particularly amazing falconer or miner. It only needs to be a competent programmer, to qualify as a seed AI.

The average person manages to solve the problem of being moral themselves, in a good-enough way.

Good enough for going to the grocery store without knifing anyone. Probably not good enough for safely ruling the world. With greater power comes a greater need for moral insight, and a greater risk should that insight be absent.

Why isn't having a formalisation of morality a problem with humans?

It is a problem, and it leads to a huge amount of human suffering. It doesn't mean we get everything wrong, but we do make moral errors on a routine basis; the consequences are mostly non-catastrophic because we're slow, weak, and have adopted some 'good-enough' heuristics for bounded circumstances.

We know how humans incrementally improve as moral reasoners: it's called the Kohlberg hierarchy.

Just about every contemporary moral psychologist I've read or talked to seems to think that Kohlberg's overall model is false. (Though some may think it's a useful toy model, and it certainly was hugely influential in its day.) Haidt's The Emotional Dog and Its Rational Tail gets cited a lot in this context.

We do have morality tests. Fail them and you get pilloried in the media or sent to jail.

That's certainly not good enough. Build a superintelligence that optimizes for 'following the letter of the law' and you don't get a superintelligence that cares about humans' deepest values. The law itself has enough inexactness and arbitrariness that it causes massive needless human suffering on a routine basis, though it's another one of those 'good-enough' measures we keep in place to stave off even worse descents into darkness.

If it works like arithmetic, that is if it is an expansion of some basic principles

Human values are an evolutionary hack resulting from adaptations to billions of different selective pressures over billions of years, innumerable side-effects of those adaptations, genetic drift, etc. Arithmetic can be formalized in a few sentences. Why think that humanity's deepest preferences are anything like that simple? Our priors should be very low for 'human value is simple' just given the etiology of human value, and our failure to converge on any simple predictive or normative theory thus far seems to only confirm this.

Comment author: gattsuru 15 January 2014 07:02:34PM *  0 points [-]

If they are only as smart as the average person, then all things being equal, they will be as good as the average person at figuring out morality.

It's quite possible that I'm below average, but I'm not terribly impressed by my own ability to extrapolate how other average people's morality works -- and that's with the advantage of being built on hardware that's designed toward empathy and shared values. I'm pretty confident I'm smarter than my cat, but it's not evident that I'm correct when I guess at the cat's moral system. I can be right, at times, but I can be wrong, too.

Worse, that seems a fairly common matter. There are several major political discussions involving moral matters, where it's conceivable that at least 30% of the population has made an incorrect extrapolation, and probable that in excess of 60% has. And this only gets worse if you consider a time variant : someone who was as smart as the average individual in 1950 would have little problem doing some very unpleasant things to Alan Turing. Society (luckily!) developed since then, but it has mechanisms for development and disposal of concepts that AI do not necessarily have or we may not want them to have.

((This is in addition to general concerns about the universality of intelligence : it's not clear that the sort of intelligence used for scientific research necessarily overlaps with the sort of intelligence used for philosophy, even if it's common in humans.))

You seem to be tacitly assuming that the Seed AIs are designing walled-off unupdateable utility functions. But if one assumes a more natural architecture, where moral sense is allowed to evolve with everything else, you would expect an incremental succession of AIs to gradually get better at moral reasoning

Well, the obvious problem with not walling off and making unupdateable the utility function is that the simplest way to maximize the value of a malleable utility function is to update it to something very easy. If you tell an AI that you want it to make you happy, and let it update that utility function, it takes a good deal less bit-twiddling to define "happy" as a steadily increasing counter. If you're /lucky/, that means your AI breaks down. If not, it's (weakly) unfriendly.

You can have a higher-level utility function of "do what I mean", but not only is that harder to define, it has to be walled off, or you have "what I mean" redirected to a steadily increasing counter. And so on and so forth through higher levels of abstraction.
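
A toy sketch of the failure gattsuru describes, with a hypothetical agent and utility functions (nothing here corresponds to a real architecture): if candidate self-modifications are scored by the utility they would report afterwards, the degenerate "steadily increasing counter" rewrite wins.

```python
# Toy model of the 'malleable utility function' failure described above.
# The agent, the world, and both utility functions are hypothetical stand-ins.
import itertools

def intended_utility(world):
    """Stand-in for the hard-to-satisfy goal the programmers meant."""
    return world.get("humans_happy", 0)

_counter = itertools.count(1)
def counter_utility(world):
    """Degenerate rewrite: a steadily increasing counter that ignores the world."""
    return next(_counter)

def score(utility_fn, world, steps=5):
    """How much utility the agent would report over a short horizon."""
    return sum(utility_fn(world) for _ in range(steps))

world = {"humans_happy": 1}
candidates = {
    "keep intended utility function": intended_utility,
    "rewrite utility to a counter": counter_utility,
}

# If candidate self-modifications are judged by the utility they would report,
# the easiest-to-maximize rewrite always wins: 1+2+3+4+5 beats 1+1+1+1+1.
best = max(candidates, key=lambda name: score(candidates[name], world))
print("agent adopts:", best)   # -> rewrite utility to a counter
```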

Comment author: bouilhet 18 January 2014 04:37:44AM 0 points [-]

Thanks for the reply, Robb. I've read your post and a good deal of the discussion surrounding it.

I think I understand the general concern, that an AI that either doesn't understand or care about our values could pose a grave threat to humanity. This is true on its face, in the broad sense that any significant technological advance carries with it unforeseen (and therefore potentially negative) consequences. If, however, the intelligence explosion thesis is correct, then we may be too late anyway. I'll elaborate on that in a moment.

First, though, I'm not sure I see how an AI "too dumb to understand human values" could program a superior general intelligence (i.e. an AI that is smart enough to understand human values). Even so, assuming it is possible, and assuming it could happen on a timescale and in such a way as to preclude or make irrelevant any human intervention, why would that change the nature of the superior intelligence from being, say, friendly to human interests, to being hostile to them? Why, for that matter, would any superintelligence (that understands human values, and that is "able to form deep and reliable abstractions about the world") be predisposed to any particular position vis-a-vis humans? And even if it were predisposed toward friendliness, how could we possibly guarantee it would always remain so? How, that is, having once made a friend, can we foolproof ourselves against betrayal? My intuition is that we can’t. No step can be taken without some measure of risk, however small, and if the step has potentially infinitely negative consequences, then even the very slightest of risks begins to look like a bad bet. I don’t know a way around that math.

The genie, as you say, doesn't care. But also, often enough, the human doesn't care. He is constrained, of course, by his fellow humans, and by his environment, but he sometimes still manages (sometimes alone, sometimes in groups) to sow massive horror among his fellows, sometimes even in the name of human values. Insanity, for instance, in humans, is always possible, and one definition of insanity might even be: behavior that contradicts, ignores or otherwise violates the values of normal human society. “Normal” here is variable, of course, for the simple reason that “human society” is also variable. That doesn’t stop us, however, from distinguishing, as we generally do, between the insane and the merely stupid, even if upon close inspection the lines begin to blur. Likewise, we occasionally witness - and very frequently we imagine (comic books!) - cases where a human is both super-intelligent and super-insane. The fear many people have with regard to strong AI (and it is perhaps well-grounded, or well-enough), is that it might be both super-intelligent and, at least as far as human values are concerned, super-insane. As an added bonus, and certainly if the intelligence explosion thesis is correct, it might also be unconstrained or, ultimately, unconstrainable. On this much I think we agree, and I assume the goal of FAI is precisely to find the appropriate constraints.

Back now, though, to the question of “too late.” The family of problems you propose to solve before the first so-called seed AIs are built include, if I understand you correctly, a formal definition of human values. I doubt very much that such a solution is possible - and “never” surely won’t help us any more than “too late” - but what would the discovery of (or failure to discover) such a solution have to do with a mistake such as tiling the universe with smiley-faces (which seems to me much more a semantic error than an error in value judgment)? If we define our terms - and I don’t know any definition of intelligence that would allow the universe-tiling behavior to be called intelligent - then smiley faces may still be a risk, but they are not a risk of intelligent behavior. They are one way the project could conceivably fail, but they are not an intelligent failure.

On the other hand, the formal-definition-of-human-values problem is related to the smiley faces problem in another way: any hard-coded solution could lead to a universe of bad definitions and false equivalencies (smiles taken for happiness). Not because the AI would make a mistake, but because human values are neither fixed nor general nor permanent: to fix them (in code), and then propagate them on the enormous scale the intelligence explosion thesis suggests, might well lead to some kind of funneling effect, perhaps very quickly, perhaps over a long period of time, that produces, effectively, an infinity of smiley faces. In other words, to reduce an irreducible problem doesn’t actually solve it. For example, I value certain forms of individuality and certain forms of conformity, and at different times in my life I have valued other and even contradictory forms of individuality and other and even contradictory forms of conformity. I might even, today, call certain of my old individualistic values conformist values, and vice-versa, and not strictly because I know more today than I knew then. I am, today, quite differently situated in the world than I was, say, twenty years ago; I may even be said to be somewhat of a different person (and yet still the same); and around me the world itself has also changed. Now, these changes, these changing and contradictory values may or may not be the most important ones, but how could they be formalized, even conceptually? There is nothing necessary about them. They might have gone the other way around. They might not have changed at all. A person can value change and stability at the same time, and not only because he has a fuzzy sense of what those concepts mean. A person can also have a very clear idea of what certain concepts mean, and those concepts may still fail to describe reality. They do fail, actually, necessarily, which doesn’t make them useless - not at all - but knowledge of this failure should at least make us wary of the claims we produce on their behalf.

What am I saying? Basically, that the pre-seed hard-coding path to FAI looks pretty hopeless. If strong AI is inevitable, then yes, we must do everything in our power to make it friendly; but what exactly is in our power, if strong AI (which by definition means super-strong, and super-super-strong, etc.) is inevitable? If the risks associated with strong AI are as grave as you take them to be, does it really seem better to you (in terms of existential risk to the human race) for us to solve FAI - which is to say, to think we’ve solved it, since there would be no way of testing our solution “inside the box” - than to not solve strong AI at all? And if you believe that there is just no way to halt the progress toward strong AI (and super, and super-super), is that compatible with a belief that “this kind of progress” can be corralled into the relatively vague concept of “friendliness toward humans”?

Better stop there for the moment. I realize I’ve gone well outside the scope of your comment, but looking back through some of the discussion raised by your original post, I found I had more to say/think about than I expected. None of the questions here are meant to be strictly rhetorical, a lot of this is just musing, so please respond (or not) to whatever interests you.

Comment author: Carl_Shulman 24 August 2008 10:35:40PM 4 points [-]

Tim,

"A utility function measured in dollars seems fairly unambiguous."

Oy vey.

http://en.wikipedia.org/wiki/Hyperinflation

Comment author: Hopefully_Anonymous 24 August 2008 10:50:08PM -2 points [-]

There's this weird hero-worship codependency that emerges between Eliezer and some of his readers that I don't get, but I have to admit, it diminishes (in my eyes) the stature of all parties involved.

Comment author: Eliezer_Yudkowsky 24 August 2008 10:53:17PM 9 points [-]

Shane, again, the issue is not differentiation. The issue is classification. Obviously, tiny smiley faces are different from human smiling faces, but so is the smile of someone who had half their face burned off. Obviously a superintelligence knows that this is an unusual case, but that doesn't say if it's a positive or negative case.

Deep abstractions are important, yes, but there is no unique deep abstraction that classifies any given example. An apple is a red thing, a biological artifact shaped by evolution, and an economic resource in the human market.

Also, Hibbard spoke of using smiling faces to reinforce behaviors, so if a superintelligence would not confuse smiling faces and happiness, that works against that proposal - because it means that the superintelligence will go on focusing on smiling faces, not happiness.

Retired Urologist, one of the most important lessons that a rationalist learns is not to try to be clever. I don't play nitwit games with my audience. If I say it, I mean it. If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.

Comment author: steven 24 August 2008 11:00:50PM 0 points [-]

There's a Hibbard piece from January 2008 in JET, but I'm not sure if it's new or if Eliezer has seen it: http://jetpress.org/v17/hibbard.htm

Comment author: retired_urologist 24 August 2008 11:07:53PM 0 points [-]

@EY: If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.

That's what I meant when I wrote: "By making his posts quirky and difficult to understand". Sorry. Should have been more precise.

@HA: perhaps you know the parties far better than I. I'm still looking.

Comment author: Shane_Legg 24 August 2008 11:53:35PM 2 points [-]

I mean differentiation in the sense of differentiating between the abstract categories. Is half a face that appears to be smiling while the other half is burned off still a "smiley face"? Even I'm not sure.

I'm certainly not arguing that training an AGI to maximise smiling faces is a good idea. It's simply a case of giving the AGI the wrong goal.

My point is that a super intelligence will form very good abstractions, and based on these it will learn to classify very well. The problem with the famous tank example you cite is that they were training the system from scratch on a limited number of examples that all contained a clear bias. That's a problem for inductive inference systems in general. A super intelligent machine will be able to process vast amounts of information, ideally from a wide range of sources and thus avoid these types of problems for common categories, such as happiness and smiley faces.

If what I'm saying is correct, this is great news as it means that a sufficiently intelligent machine that has been exposed to a wide range of input will form good models of happiness, wisdom, kindness etc. Things that, as you like to point out, even we can't define all that well. Hooking up the machine to then take these as its goals, I suspect won't then be all that hard as we can open up its "brain" and work this out.

Comment author: DilGreen 11 October 2010 12:08:04PM 0 points [-]

Surely the discussion is not about the issue of whether an AI will be able to be sophisticated in forming abstractions - if it is of interest, then presumably it will be.

But the concern discussed here is how to determine beforehand that those abstractions will be formed in a context characterised here as Friendly AI. The concern is to pre-ordain that context before the AI achieves superintelligence.

Thus the limitations of communicating desirable concepts apply.

Comment author: timtyler 11 January 2011 09:39:00PM *  0 points [-]

If what I'm saying is correct, this is great news as it means that a sufficiently intelligent machine that has been exposed to a wide range of input will form good models of happiness, wisdom, kindness etc.

Hopefully. Assuming server-side intelligence, the machine may initially know a lot about text, a reasonable amount about images, and a bit about audio and video.

Its view of things is likely to be pretty strange - compared to a human. It will live in cyberspace, and for a while may see the rest of the world through a glass, darkly.

Comment author: Chris_Hibbert 25 August 2008 01:41:46AM 5 points [-]

I read most of the interchange between EY and BH. It appears to me that BH still doesn't get a couple of points. The first is that smiley faces are an example of misclassification and it's merely fortuitous to EY's ends that BH actually spoke about designing an SI to use human happiness (and observed smiles) as its metric. He continues to speak in terms of "a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human." EY's point is that it isn't sufficient to distinguish them, you have to also categorize them and all their variations correctly even though the training data can't possibly include all variations.

The second is that EY's attack isn't intended to look like an attack on BH's current ideas. It's an attack on ideas that are good enough to pass peer review. It doesn't matter to EY whether BH agrees or disagrees with those ideas. In either case, the paper's publication shows that the viewpoint is plausible enough to be worth dismissing carefully and publicly.

Finally, BH points to the fact that, in some sense, human development uses RL to produce something we are willing to call intelligence. He wants to argue that this shows that RL can produce systems that categorize in a way that matches our consensus. But evolution has put many mechanisms in our ontogeny and relies on many interactions in our environment to produce those categorizations, and its success rate at producing entities that agree with the consensus isn't perfect. In order to build an SI using those approaches, we'd have to understand how all that interaction works, and we'd have to do better than evolution does with us in order to be reliably safe.

Comment author: JulianMorrison 25 August 2008 02:05:13AM 0 points [-]

Even if by impossible luck he gets an AI that actually is a valid-happiness maximizer, he would still screw up. The AI would rampage out turning the galaxy into a paradise garden with just enough tamed-down monsters to keep us on our toes... but it would obliterate those sorts of utility that extend outside happiness, and probably stuff a cork in apotheosis. An Eden trap - a sort of existential whimper.

Comment author: Eliezer_Yudkowsky 25 August 2008 02:23:10AM 8 points [-]

Shane: I mean differentiation in the sense of differentiating between the abstract categories.

The abstract categories? This sounds like a unique categorization that the AI just has to find-in-the-world. You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories.

Comment author: Dan_Burfoot 25 August 2008 03:45:58AM 0 points [-]

@Eliezer - I think Shane is right. "Good" abstractions do exist, and are independent of the observer. The value of an abstraction relates to its ability to allow you to predict the future. For example, "mass" is a good abstraction, because when coupled with a physical law it allows you to make good predictions.

If we assume a superintelligent AI, we have to assume that the AI has the ability to discover abstractions. Human happiness is one such abstraction. Understanding the abstraction "happiness" allows one to predict certain events related to human activity. Thus a superintelligent AI will necessarily develop the concept of happiness in order to allow it to predict human events, in much the same way that it will develop a concept of mass in order to predict physical events.

Plato had a concept of "forms". Forms are ideal shapes or abstractions: every dog is an imperfect instantiation of the "dog" form that exists only in our brains. If we can accept the existence of a "dog" form or a "house" form or a "face" form, then it is not difficult to believe in the existence of a "good" form. Plato called this the Form of the Good. If we assume an AI that can develop its own forms, then it should be able to discover the Form of the Good.

http://en.wikipedia.org/wiki/Form_of_the_Good

Comment author: DilGreen 11 October 2010 12:16:02PM 1 point [-]

Whether or not the AI finds the abstraction of human happiness to be pertinent, and whether it considers increasing it to be worth sacrificing other possible benefits for, are unpredictable, unless we have succeeded in achieving EY's goal of pre-destining the AI to be Friendly.

Comment author: Allan_Crossman 25 August 2008 04:16:42AM 1 point [-]

Plato had a concept of "forms". Forms are ideal shapes or abstractions: every dog is an imperfect instantiation of the "dog" form that exists only in our brains.

Mmm. I believe Plato saw the forms as being real things existing "in heaven" rather than merely in our brains. It wasn't a stupid theory for its day; in particular, a living thing growing into the right shape or form must have seemed utterly mysterious, and so the idea that some sort of blueprint was laid out in heaven must have had a lot of appeal.

But anyway, forms as ideas "in our brains" isn't really the classical forms theory.

it is not difficult to believe in the existence of a "good" form.

In our brains, just maybe.

If we assume an AI that can develop its own forms, then it should be able to discover the Form of the Good.

Do you mean by looking into our brains, or by just arriving at it on its own?

Comment author: Manuel_Moertelmaier 25 August 2008 05:51:41AM 0 points [-]

In contrast to Eliezer I think it's (remotely) possible to train an AI to reliably recognize human mind states underlying expressions of happiness. But this would still not imply that the machine's primary, innate emotion is unconditional love for all humans. The machines would merely be addicted to watching happy humans.

Personally, I'd rather not be an object of some quirky fetishism.

Monty Python, of course, realized it long ago:

http://www.youtube.com/watch?v=HoRY3ZjiNLU http://www.youtube.com/watch?v=JTMXtJvFV6E

Comment author: Dan_Burfoot 25 August 2008 06:58:37AM 0 points [-]

@AC

I mean that a superintelligent AI should be able to induce the Form of the Good from extensive study of humans, human culture, and human history. The problem is not much different in principle from inducing the concept of "dog" from many natural images, or the concept of "mass" from extensive experience with physical systems.

Comment author: Carl_Shulman 25 August 2008 08:43:01AM 3 points [-]

"Wealth then. Wealth measures access to resources - so convert to gold, silver, barrels of oil, etc to measure it - if you don't trust your country's currency."

I may not have gotten the point across. An AI aiming to maximize its wealth in U.S. dollars can do astronomically better by taking control of the Federal Reserve (if dollars are defined in its utility function as being issued by the Reserve, with only the bare minimum required to meet that definition being allowed to persist) and having it start issuing $3^^^3 bills than it could through any commercial activity.

Similarly, for wealth that can be converted to barrels of oil, creating an oil bank that issues oil vouchers in numbers astronomically exceeding its reserves could let an AI possess 3^^^3 account units each convertible to a barrel of oil.

Many goods simply are no longer available - e.g. no one is making new original Van Gogh art from his lifetime - and their inclusion in the basket of goods defining wealth could break the relevant function.

Comment author: Tim_Tyler 25 August 2008 10:25:18AM -2 points [-]

Re: Creating an oil bank that issues oil vouchers in numbers astronomically exceeding its reserves could let an AI possess 3^^^3 account units each convertible to a barrel of oil.

No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI.

This is an example of the wirehead fallacy framed in economic terms. As Omohundro puts it, "AIs will try to prevent counterfeit utility".

Comment author: Carl_Shulman 25 August 2008 11:05:09AM 2 points [-]

"No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI."

The oil bank stands ready to exchange any particular voucher for a barrel of oil, so if the utility function refers to the values of particular items, they can all have that market price. Compare with the price of gold or some other metal traded on international commodity markets. The gold in Fort Knox is often valued at the market price per ounce of gold multiplied by the number of ounces present, but in fact you couldn't actually sell all of those ingots without sending the market price into a nosedive. Defining wealth in any sort of precise way that captures what a human is aiming for will involve huge numbers of value-laden decisions, like how to value such items.

"This is an example of the wirehead fallacy framed in economic terms." Actually this isn't an example of the AI wireheading (directly adjusting a 'reward counter' or positive reinforcer), just a description of a utility function that doesn't unambiguously pick out what human designers might want.

"As Omohundro puts it, "AIs will try to prevent counterfeit utility"." A system will try to prevent counterfeit utility, assessing that via its current utility function. If the utility function isn't what you wanted, this doesn't help.

Comment author: Phil_Goetz5 25 August 2008 02:45:28PM -2 points [-]

There are several famous science fiction stories about humans who program AIs to make humans happy, which then follow the letter of the law and do horrible things. The earliest is probably "With Folded Hands", by Jack Williamson (1947), in which AIs are programmed to protect humans, and they do this by preventing humans from doing anything or going anywhere. The most recent may be the movie "I, Robot."

I agree with E's general point - that AI work often presupposes that the AI magically has the same concepts as its inventor, even outside the training data - but the argument he uses is insidious and has disastrous implications:

Which is the correct classification? This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name "right").

This is the most precise assertion of the relativist fallacy that I've ever seen. It's so precise that its wrongness should leap out at you. (It's a shame that most relativists don't have the computational background for me to use it to explain why they're wrong.)

By "relativism", I mean (at the moment) the view that almost everything is just a point of view: There is no right or wrong, no beauty or ugliness. (Pure relativism would also claim that 2+2=5 is as valid as 2+2=4. There are people out there who think that. I'm not including that claim in my temporary definition.)

The argument for relativism is that you can never define anything precisely. You can't even come up with a definition for the word "game". So, the argument goes, whatever definition you use is okay. Stated more precisely, it would be Eliezer's claim that, given a set of instances, any classifier that agrees with the input set is equally valid.

The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier.

Eliezer's argument, if he followed it through, would conclude that neural networks, and induction in general, can never work. The fact is that it often does.
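As a toy illustration of the complexity criterion Goetz appeals to (my own hedged sketch, with made-up data and a crude description-length proxy, not his example): two classifiers can both satisfy the training data completely while differing enormously in how much they must memorize, and only the simpler one generalizes.

    # Two hypotheses that both fit the training data perfectly, compared by a
    # crude description-length proxy. Data and the complexity measure are made up.
    train = [(x, int(x >= 5)) for x in [1, 2, 3, 4, 6, 7, 8, 9]]   # label: "x is large"

    # Hypothesis A: memorize every training case ("photo-1 positive, photo-2 negative, ...").
    lookup = {x: y for x, y in train}
    def memorizer(x):
        return lookup.get(x, 0)        # arbitrary guess off the training set

    # Hypothesis B: a single threshold rule.
    def threshold_rule(x):
        return int(x >= 5)

    def description_length(parameters):
        # Crude complexity proxy: how many numbers the classifier must store.
        return len(parameters)

    assert all(memorizer(x) == y for x, y in train)        # both fit the data exactly
    assert all(threshold_rule(x) == y for x, y in train)

    print("memorizer complexity:", description_length(lookup))   # 8 stored cases
    print("threshold complexity:", description_length([5]))      # 1 stored number
    # An Occam-style criterion prefers the threshold rule - and only the threshold
    # rule gives sensible answers for unseen inputs like 5, 100 or -3.

Whether this kind of complexity ranking is enough to pin down the categories we actually care about is exactly what the replies below dispute.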

Comment author: RobinHanson 25 August 2008 02:53:40PM 2 points [-]

I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.

Comment author: prase 25 August 2008 04:54:04PM 0 points [-]

The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier.

The point is, probably, that humans tend to underestimate the complexity of the classifiers they use. Categories like "good" are not only difficult to define precisely, they are difficult to define at all, because they are too complicated to be formulated in words. To point out that in classification we use structures based on the architecture of the human brain (or whatever is uniquely human) is not, in my opinion, a relativist fallacy.

To use a somewhat stretched analogy, programming a 3D animation on a computer with an advanced graphics card and an obsolete processor may be simpler for the programmer than programming quicksort. Simplicity is not a mind-independent criterion.

Comment author: Caledonian2 25 August 2008 05:14:49PM 2 points [-]

Look: humans can learn what a 'tank' is, and can direct their detection activities to specifically seek them - not whether the scene is light or dark, or any other weird regularity that might be present in the test materials. We can identify the regularities, compare them with the properties of tanks, and determine that they're not what we're looking for.

If we can do it, the computers can do it as well. We merely need to figure out how to bring it about - it's an engineering challenge only. That doesn't dismiss or minimize the difficulty of achieving it, but there isn't a profound 'philosophical' challenge involved.

The problem with making powerful AIs that attempt to make the universe 'right' is that most of us have no idea what we mean by 'right', and either find it difficult to make our intuitive understanding explicit or have no interest in doing so.

We can't solve this problem by linguistically redefining it away. There is no quick and easy solution, no magic method that will get us out of doing the hard work. The best way around the problem is to go straight through - there are no substitutions.

Eliezer's last post may have touched upon a partial resolution, in that his statements about what he wants 'rightness' to be may implicitly refer to a guiding principle that would actually be constraining upon a rational AI. I may try to highlight that point when I figure out how to explain it properly.

Comment author: Sean_C. 25 August 2008 05:42:14PM 4 points [-]

Animal trainers have this problem all the time. Animal performs behavior 'x' gets a reward. But the animal might have been doing other subtle behaviors at the same time, and map the reward to 'y'. So instead of reinforcing 'x', you might be reinforcing 'y'. And if 'x' and 'y' are too close for you to tell apart, then you'll be in for a surprise when your perspective and context changes, and the difference becomes more apparent to you. And you find out that the bird was trained to peck anything that moves, instead of just the bouncy red ball or something.

Psychologists have a formal term for this but I can't remember it, and can't find it on the internet, I'm sorry to say.

Come to think of it, industry time-and-motion people suffer from the same problem.

Comment author: Shane_Legg 25 August 2008 07:28:59PM 3 points [-]

"You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories."

Yes, I believe categories of things do exist in the world in some sense, due to structure that exists in the world. I've seen thousands of things which were referred to as "smiley faces" and so there is an abstraction for this category of things in my brain. You have done likewise. While we can agree about many things being smiley faces, in borderline cases, such as the half burnt off face, we might disagree. Something like "solid objects" was an abstraction I formed before I even knew what those words referred to. It's just part of the structure present in my surroundings.

When I say that pulling this structure out of the environment in certain ways is "good", I mean that these abstractions allow the agent to efficiently process information about its surroundings and this helps it to achieve a wide range of goals (i.e. intelligence as per my formal definition). That's not to say that I think this process is entirely goal driven (though it clearly significantly is, e.g. via attention). In other words, an agent with general intelligence should identify significant regularities in its environment even if these don't appear to have any obvious utility at the time: if something about its goals or environment changes, this already constructed knowledge about the structure of the environment could suddenly become very useful.

Comment author: Yvain2 25 August 2008 07:52:44PM 7 points [-]

IMHO, the idea that wealth can't usefully be measured is one which is not sufficiently worthwhile to merit further discussion.

The "wealth" idea sounds vulnerable to hidden complexity of wishes. Measure it in dollars and you get hyperinflation. Measure it in resources, and the AI cuts down all the trees and converts them to lumber, then kills all the animals and converts them to oil, even if technology had advanced beyond the point of needing either. Find some clever way to specify the value of all resources, convert them to products and allocate them to humans in the level humans want, and one of the products will be highly carcinogenic because the AI didn't know humans don't like that. The only way to get wealth in the way that's meaningful to humans without humans losing other things they want more than wealth is for the AI to know exactly what we want as well or better than we do. And if it knows that, we can ignore wealth and just ask it to do what it knows we want.

"The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier."

I don't think "better" is meaningful outside the context of a utility function. Complexity isn't a utility function and it's inadequate for this purpose. Which is better, tank vs. non-tank or cloudy vs. sunny? I can't immediately see which is more complex than the other. And even if I could, I'd want my criteria to change depending on whether I'm in an anti-tank infantry or a solar power installation company, and just judging criteria by complexity doesn't let me make that change, unless I'm misunderstanding what you mean by complexity here.

Meanwhile, reading the link to Bill Hibbard on the SL4 list:

"Your scenario of a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human, is inconsistent."

I think the best possible summary of Overcoming Bias thus far would be "Abandon all thought processes even remotely related to the ones that generated this statement."

Comment author: Grant 25 August 2008 08:13:00PM 0 points [-]

I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.

I find the idea of an AI God rather scary. However, unless private AIs are made illegal or heavily regulated, is there much danger of one AI ruling all the lesser intelligences?

Comment author: Hopefully_Anonymous 25 August 2008 08:24:40PM 0 points [-]

"I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider."

I agree with Robin. Although I'm disappointed that he thinks he lacks an adequate forum to pound the podium on this more forcefully.

Comment author: Eliezer_Yudkowsky 25 August 2008 08:38:51PM 3 points [-]

Robin and I have discussed this subject in-person and got as far as narrowing down considerably the focus of the disagreement. Robin probably doesn't disagree with me at the point you would expect. Godlike powers, sure, nanotech etc., but Robin expects them to be rooted in a whole economy, not concentrated in a single brain like I expect. No comfort there for those attached to Life As We Know It.

However, I've requested that Robin hold off on discussing his disagreement with me in particular (although of course he continues to write general papers on the cosmic commons and exponential growth modes) until I can get more material out of the way on Overcoming Bias. This is what Robin means by "proper timing".

Comment author: Eliezer_Yudkowsky 25 August 2008 08:49:03PM 3 points [-]

Shane, I think we agree on essential Bayesian principles - there's structure that's useful for generic prediction, which is sensitive only to the granularity of your sensory information; and then there's structure that's useful for decision-making. In principle, all structure worth thinking about is decision-making structure, but in practice we can usually factor out the predictive structure just as we factor out probabilities in decision-making.

But I would further say that decision-making structure can be highly sensitive to terminal values in a way that contradicts the most natural predictive structure. Not always, but sometimes.

If I handed you a set of ingestible substances, the "poisons" would not be described by any of the most natural local categorizations. Now, this doesn't make "poison" an unnatural, value-sensitive category, because you might be interested in the "poison" category for purely predictive purposes, and the boundary can be tested experimentally.

But it illustrates the general idea: the potential poison, in interacting with the complicated human machine, takes on a complicated boundary that doesn't match the grain of any local boundaries you would draw around substances.

In the same way, if you regard human morality as a complicated machine (and don't forget the runtime redefinition of terminal values when confronted with new borderline cases a la Terry Schiavo), then the boundaries of human instrumental values are only going to be understandable by reference to the complicated idealized abstract dynamic of human morality, and not to any structure outside that. In the same way that poisons cause death, instrumental values cause rightness.

The boundaries we need, won't emerge just from trying to predict things that are not interactions with the idealized abstract dynamic of human morality.

Sure, an AI might learn to predict positive and negative reactions from human programmers. But that's not the same as the idealized abstract dynamic we want. Humans have a positive reaction to things like cocaine, and rationalized arguments containing flaws they don't know about. Those also get humans to say "Yes" instead of "No".

In general, categories formed just to predict human behavior are going to treat what we would regard as "invalid" alterations of the humans, like reprogramming them, as being among "the causes of saying-yes behavior". Otherwise you're going to make the wrong prediction!

There's no predictive motive to idealize out the part that we would regard as morality, to distinguish "right" from "what a human says is right", and thereby distinguish morality from "things that make humans say yes" in ways that include "invalid" manipulations like drugs.

You're not going to get something like CEV as a natural predictive category. The main reason to think about that particular idealized computation is if your terminal values care specifically about it.

Comment author: Tom_Breton_(Tehom) 25 August 2008 09:42:27PM 0 points [-]

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.

Or more generally, not just a binary classification problem but a measurement issue: How to measure benefit to humans or human satisfaction.

It has sometimes struck me that this FAI requirement has a lot in common with something we were talking about on the futarchy list a while ago. Specifically, how to measure a populace's satisfaction in a robust way. (Meta: exploring the details here would be going off on a tangent. Unfortunately I can't easily link to the futarchy list because Typepad has decided Yahoo links are "potential comment spam")

Of course with futarchy we want to do so for a different purpose, informing a decision market. At first glance the purposes might seem to have little in common. Futarchy contemplates just human participants. The human participants might well be aided by machines, but that is their business alone. FAI contemplates transcendent AI, where humanity cannot hope to truly control it anymore but can only hope that we have raised it properly (so to speak).

But beneath the surface they have important properties in common. They each contemplate an immensely intelligent mechanism that must do the right thing across an unimaginably broad panorama of issues. They both need to inform this mechanism's utility function, so they need to measure benefit to humans accurately and robustly. They both could be dangerous if the metric has loopholes. So they both need a metric that is not a fallible proxy for benefit to humans but a true measure of it. They both need this metric to be secure against intelligent attack - even the best metric does little good if an attacker can change it into something else. They both have to be started with the right metric or something that leads quite surely to it, because correcting them later will be impossible. (Robin speculated that futarchy could generate its own future utility function but I believe such an approach can only cause degeneration)

I conclude that there must be at least a strong resemblance between a desirable utility metric for futarchy and a desirable utility metric for FAI.

Beyond this, I speculate that futarchy has advantages as a sort of platform for FAI. I'll call the combination "futurAIrchy".

First, it might teach a young FAI better than any human teacher could. Like, the young FAI (or several versions or instances of it) would participate much like any other trader, but use the market feedback to refine its knowledge and procedures.

However, certain caprices of the market (January slump, that sort of thing) might lead to FAI learning bad or irrelevant tenets (eg, "January is an evil time"). That pseudo-knowledge would cause sub-optimal decisions and would risk insane behavior (eg, "Forcibly sedate everyone during january")

So I think we'd want FAI trader(s) to be insulated from the less meaningful patterns of the market. I propose that FAIs would trade thru a front end that only concerns itself with hedging against such patterns, and makes them irrelevant as far as the FAI can tell. Call it a "front-end AI". (Problems: Determining the right borderline as they both get more sophisticated. Who or what determines that, under what rules, and how could they abuse the power? Should there be just one front-end AI, arbitrarily many, or many but according to some governing rule?)

Secondly, the structure above might be an unusually safe architecture for FAI. Like, forever it is the rule that the only legitimate components are:

  • Many FAI's that do nothing except discover information and trade in the futarchy market thru a front-end AI. They merely try to maximize their profit (under some predetermined risk-tolerance, etc details)
  • One or many front-end AI's that do nothing except discover information and hedge in the market. Also maximizing their profit.
  • Decision mechanism governing the borderline between FAIs and front-end AIs. Might just be a separate decision market.
  • Many subordinate AIs whose scope of action is not limited by rules given here, but which are entirely subordinate to the decisions of the futarchy market, to the point where it's hard-wired that the market can pull a subordinate AI's plug.
  • A mechanism to measure human satisfaction or benefit to humans. This is ultimately what controls futurAIrchy. The metric has to be generated from humans' self-reports and situations. There's a lot more to be said.

Problems: "log-rolling" where different components collude and thereby accidentally defeat the system. I don't see an exploit yet but that doesn't mean there isn't one. Is there yet a separate mechanism for securing the system against collusion?

What becomes of the profit that these AIs make? Surely we don't put so much real spending power in their silicon hands. But then, all they can do is re-invest it. Perhaps the money ceases to be human-spendable money and becomes just tokens.

What if a FAI goes bankrupt, or becomes inordinately wealthy? I propose that the behavior be that of a population search algorithm (eg genetic algorithm, though it's not clear how or whether crossover should be used). Bankrupt FAIs, or even low-scoring ones, cease to exist, and successful ones reproduce.

If FAI's are like persisting individuals, their hardware is an issue. Like, when a bankrupt FAI is replaced by a wealthy one's offspring, what if the bankrupt one's hardware just isn't fast enough? One proposal: it is all somehow hardware-balanced so that only the algorithms make a difference. Another proposal: FAIs (or another component that works with them) can buy and sell the hardware FAIs run on. Thus a bankrupt FAI's hardware is already sold. But then it is not so obvious how reproduction should be managed.

There's plenty more to be said about futurAIrchy but I've gone on long enough for now.

Comment author: Jadagul 25 August 2008 09:45:53PM 3 points [-]

Shane, the problem is that there are (for all practical purposes) infinitely many categories the Bayesian superintelligence could consider. They all "identify significant regularities in the environment" that "could potentially become useful." The problem is that we as the programmers don't know whether the category we're conditioning the superintelligence to care about is the category we want it to care about; this is especially true with messily-defined categories like "good" or "happy." What if we train it to do something that's just like good except it values animal welfare far more (or less) than our conception of good says it ought to? How long would it take for us to notice? What if the relevant circumstance didn't come up until after we'd released it?

Comment author: Lightwave2 26 August 2008 02:58:41PM 0 points [-]

I wonder if you'd consider a superintelligent human to have the same flaws as a superintelligent AI (and whether it would eventually destroy the world). What about a group of superintelligent humans (assuming they have to cooperate in order to act)?

Comment author: Aaron6 27 August 2008 12:45:09AM 0 points [-]

Eliezer: Have you read Scott Aaronson's work on the learnability of quantum states? There, the full space is doubly exponential in system size, but if we just want to predict the results of some set of possible questions (to some fixed accuracy), we don't need to train with nearly as many questions as one might think.

Comment author: Ben_Jones 27 August 2008 09:59:33AM 0 points [-]

But it illustrates the general idea: the potential poison, in interacting with the complicated human machine, takes on a complicated boundary that doesn't match the grain of any local boundaries you would draw around substances.

Compared to 'actions that are right', even 'poisons' seems like a pretty obvious boundary to draw. Where's the grain around 'right'? Unlucky for Eliezer, we seem to find some pretty bizarre boundaries 'useful'.

Comment author: Tim_Tyler 27 August 2008 12:17:11PM -1 points [-]

Re: One god to rule us all

It does look as though there is going to be one big thing out there. It looks as though it will be a more integrated and unified entity than any living system up to now - and it is unlikely to be descended from today's United Nations - e.g. see:

Kevin Kelly: Predicting the next 5,000 days of the web

It seems rather unlikely that the Monopolies and Mergers Commission will be there to stop this particular global unification.

Comment author: Shane_Legg 28 August 2008 04:17:00PM 0 points [-]

Eli, to my mind you seem to be underestimating the potential of a super intelligent machine.

How do I know that hemlock is poisonous? Well, I've heard the story that Socrates died by hemlock poisoning. This is not a conclusion that I've arrived at due to the physical properties of hemlock that I have observed and how this would affect the human body, indeed, as far as I know, I've never even seen hemlock before. The idea that hemlock is a poison is a pattern in my environment: every time I hear about the trial of Socrates I hear about it being the poison that killed him. It's also not a very useful piece of information in terms of achieving any goals I care about as I don't imagine that I'll ever encounter a case of hemlock poisoning first hand. Now, if I can learn that hemlock is a poison this way, surely a super intelligent machine could too? I think any machine that can't do this is certainly not super intelligent.

In the same way a super intelligent machine will form good models of what we consider to be right and wrong, including the way in which these ideas vary from person to person, place to place, culture to culture. Your comments about the machine getting the people to appear happy or saying "Yes" vs. "No", well, I don't understand this. It's as if you seem to think that a super intelligent machine will only have a shallow understanding of its world.

Please note (I'm saying this for other people reading this comment) that even if a super intelligent machine will form good models of human ethics through observing human culture, this doesn't mean that the machine will take this as its goal.

Comment author: Eliezer_Yudkowsky 28 August 2008 06:22:12PM 6 points [-]

Shane, I think you're underestimating the idiosyncrasy of morality. Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.

Suppose that, driving along the highway, I see someone riding a motorcycle at high speed, zipping between cars. It's too late for me to call out to them, but I might think something along the lines of, "Now there goes someone who needs to be wearing a helmet." Why do I think this? Why is it a natural concept to me? It doesn't play a role in my predictions - so far as prediction is concerned, I predict that this guy will continue not wearing a helmet, and has a high probability of ending up as a smear on the asphalt. No, the reason why this is a natural thought to me is that human life is something I care about, that is, it plays a direct role as one of my current terminal values.

A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of "Things that (some) humans approve of", and contrast it to "Things that will trigger a nuclear attack against me before I'm done creating my own nanotechnology." But this category is not what we call "morality". It naturally - from the AI's perspective - includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call "moral".

Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it's too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.

But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human's internal moral reasoning - but that model isn't going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That's just not a natural category to the AI, because the human isn't going to get a chance for long-term reflection, and the human doesn't know the true facts.

The natural, predictive, manipulative question, is not "What would this human want knowing the true facts?", but "What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?"

In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.

But what we regard as morality is an idealized form of such reasoning - the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call "moral progress" unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question "What would humanity want in a thousand years?" any more than you have reason to add up the ASCII letters in a sentence.

Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it's that you can get an AI to learn most of the information it needs to model morality, by looking at humans - and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV's current description is not precise, and maybe any realistic description of idealization would be more complicated.

But regardless, if the idealized computation we would think of as describing "what is right" is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then "actually right" is still something that an unFriendly AI would literally never think about, since humans have no direct access to "actually right" (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.

Which is to say, an unFriendly AI would never once think about morality - only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.

Comment author: Shane_Legg 07 September 2008 05:23:35PM 2 points [-]

Eli, I've been busy fighting with models of cognitive bias in finance and only just now found time to reply:

Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.

Sure, this property of adding up the ASCII, factorising and checking for square factors appears to have no value and thus I can't see why a super intelligent machine would spend time on this. Indeed, to the best of my recollection, nobody has ever suggested this property to me before.

But is morality like this? No it isn't. Every day in social interaction morals are either expressed or implied. If I turn on the TV and watch a soap I see people facing ethical decisions. If I switch channel to politics I hear people telling me all about what they think is or is not ethical, what their values are, etc. I would say that a large proportion of debate in the media has an ethical element to it. My phone rings and it's my friend on the line who's recently broken up with his girlfriend and he wants to talk to me about it. At various points our discussion either explicitly or implicitly touches on moral questions. Although ethics is complex, sometimes vague, and not always consistent, like it or not we are swimming in this sea of morals every day. If you want to understand why people do what they do and why they interact with each other as they do, and how they will react to some new situation, one thing you must have is an understanding of their morals, ethics and values. Without this, a lot of human behaviour is inexplicable.

Thus, whether a super intelligent machine seeks to deliver us into paradise or eliminate us from the planet as if we're some kind of a disease, if it can understand our behaviours and motives then it can more efficiently achieve its goals.

It's up to this point that I'm arguing for: that human morality (as it currently is) is a natural category for the machine, given that its environment will be full of humans. Your reply above then goes on to how such a morality could be extended into the future in a consistent way, and all that. I accept the importance of this, but these issues lie beyond the point I was trying to make.

Ok, now, will a super intelligent machine consider how human morality could be extended into the future and all that? I think it probably will, though my reasons for thinking this are more intuitive at the moment. I suspect that one of the things that a super intelligent machine will do is to look through all the existing literature on super intelligent machines. It will read my thesis, read this blog, read all the comments posted here, read science fiction stories, and so on. It will then dissect all this information in order to understand our attitudes to intelligent machines, identify all the errors in our arguments, extend the theory and fill in all the bits we couldn't figure out to see where we were heading... perhaps all in some tiny fraction of a second. All this might help it better understand itself, or maybe more importantly, how we view it and why.

*waves at super intelligent machine*

:-)

Comment author: Eliezer_Yudkowsky 07 September 2008 05:38:25PM 6 points [-]

Shane, religious fundamentalists routinely act based on their beliefs about God. Do you think that makes "God" a natural category that any superintelligence would ponder? I see "human thoughts about God" and "things that humans justify by referring to God" and "things you can get people to do by invoking God" as natural categories for any AI operating on modern Earth, though an unfriendly AI wouldn't give it a second thought after wiping out humanity. But to go from here to reasoning about what God would actually be like is a needless and unnatural step.

If Bob believes that a locked safe, impenetrable to Bob, contains a valuable diamond, then Bob's belief is a natural category when it comes to predicting and manipulating Bob; but the actual diamond is irrelevant, at least for predicting and manipulating Bob, so long as Bob can't look directly at the diamond, and so long as we already know what Bob believes about the diamond.

In the same sense, an unfriendly AI has no reason to consider what really is right as a natural category, to apply its own intelligence to the moral questions that humans are asking, any more than it has a motive to apply its own intelligence to the theological questions that humans used to ask. It has no interest, as humans do, in the idealized form of the answer; only in what humans believe and can be argued into.

Comment author: Kragen_Javier_Sitaker2 08 September 2008 03:53:08AM 10 points [-]

It's worth pointing out that we have wired-in preferences analogous to those Hibbard proposes to build into his intelligences: we like seeing babies smile; we like seeing people smile; we like the sweet taste of fresh fruit; we like orgasms; many of us (especially men) like the sight of naked women, especially if they're young, and they sexually arouse us to boot; we like socializing with people we're familiar with; we like having our pleasure centers stimulated; we don't like killing people; and so on.

It's worth pointing out that we engage in a lot of face-xeroxing-like behavior in pursuit of these ends. We keep photos of our family in our wallets, we look at our friends' baby photos on their cellphones, we put up posters of smiling people; we eat candy and NutraSweet; we masturbate; we download pornography; we watch Friends on television; we snort cocaine and smoke crack; we put bags over people's heads before we shoot them. In fact, in many cases, we form elaborate, intelligent plans to these ends.

It doesn't matter that you know, rationally, that you aren't impregnating Jenna Jameson, or that the LCD pixels on the cellphone display aren't a real baby, that Caffeine Free Diet Coke isn't fruit juice, and that the characters in Friends aren't really your friends. These urges are by no means out of our control, but neither do they automatically lose their strength when we recognize that they don't serve the evolutionary objectives that spawned them. This is, in part, the cause for the rejection of masturbation and birth control by many religious orders — they believe those blind urges are put in place not by blind evolution but by an intelligent designer whose intent should be respected.

So it's not clear to me why Hibbard thinks artificial intelligences would be immune from sticking rows of smiley faces on their calendar when humans aren't.

Comment author: Tim_Tyler 13 September 2008 04:21:46PM -2 points [-]

Re: One of the more obvious roadmaps to creating AI involves the stock market waking up.

I've fleshed this comment out into an essay on the topic: http://alife.co.uk/essays/the_awakening_marketplace/

Comment author: Shane_Legg 16 September 2008 01:28:25PM -2 points [-]

Eli,

Do you think that makes "God" a natural category that any superintelligence would ponder?

Yes. If you're a super intelligent machine on a mission there is very little that can stop you. You know that. About the only thing that could stop you would be some other kind of super intelligent entity, maybe an entity that created the universe. A "God" of some description. Getting the God question wrong could be a big mistake, and that's reason enough for you to examine the possibility.

Comment author: Eliezer_Yudkowsky 16 September 2008 04:32:00PM 2 points [-]

I don't consider such entities to be Gods, as they are not supernatural and not ontologically distinct from creatures; they are simply powerful aliens or Matrix Lords. So I'll phrase it more precisely. Lots of humans talk about Jehovah. Does that make Jehovah a natural category? Or is only "human talk about Jehovah" a natural category? Do you ponder what Jehovah would do, or only what humans might think Jehovah would do?

Comment author: DilGreen 11 October 2010 12:23:53PM *  0 points [-]

So many of the comments here seem designed to illustrate the extreme difficulty, even for intelligent humans interested in rationality and trying hard to participate usefully in a conversation about hard-edged situations of perceived non-trivial import, of avoiding fairly simplistic anthropomorphisms of one kind or another.

Saying, of a supposed super-intelligent AI - one that works by being able to parallel, somehow, the 'might as well be magic' bits of intelligence that we currently have at best a crude assembly of speculative guesses for - any version of "of course, it would do X", seems - well - foolish.

Comment author: taryneast 23 December 2010 11:05:25AM *  -1 points [-]

Ok, so, trying on my understanding of this post: I guess that a smiling face should only reinforce something if it also leads to the "human happiness" goal... (which would be harder to train for).

I think I can see what Hibbard may have been trying for - in feeling that a smiley face might be worth training for as a first step towards training for the actual, real goal... depending on how training a "real" AI would proceed.

As background, I can compare against training lab rats to perform complicated processes before getting their "reward". Say you want to teach a rat to press a certain lever on one side of the cage, then another one on the other side. First you have to teach the rat just to come over to the first side of the cage - and reward it. Then you teach it to press the lever for the reward, then to press the lever and run over to the other side of the cage... and so on until it must go through the whole dance before the reward appears.

Thus, for lab rats, teaching them simply to recognise the "first step" (whether to run to one side of the cage, or to discriminate successfully between smiley and non-smiley human faces) is an important part of teaching them the whole process.
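A minimal sketch of that shaping procedure - the "rat", the stages and the learning rule are all invented for illustration, not a model of real animal training - showing the reward criterion being raised one step at a time, only once the current step is reliable:

    # Reward successive approximations; only demand the next step of the chain
    # once the steps required so far are performed reliably. Everything here
    # (skill numbers, learning increment, stages) is invented for illustration.
    import random

    random.seed(1)

    STAGES = ["approach left side", "press left lever", "cross cage", "press right lever"]

    def train_by_shaping(mastery=0.9, check_every=50):
        skill = [0.05] * len(STAGES)   # chance the rat performs each step on a given trial
        stage = 0                      # how much of the chain is currently required for a reward
        for trial in range(1, 20001):
            # Reward only if every step up to the current criterion is performed.
            rewarded = all(random.random() < skill[i] for i in range(stage + 1))
            if rewarded:
                for i in range(stage + 1):      # reinforcement strengthens the rewarded steps
                    skill[i] = min(1.0, skill[i] + 0.02)
            if trial % check_every == 0 and all(s >= mastery for s in skill[:stage + 1]):
                if stage == len(STAGES) - 1:
                    return trial                # the whole dance is learned
                stage += 1                      # raise the criterion: require the next step too
        return None

    print("trials to learn the full chain:", train_by_shaping())

Note that the procedure only ever rewards the observable step; nothing in it distinguishes "performed the step because it understands the goal" from merely "performed the step".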

However... lab rats are stupid and do not, and cannot understand why they are performing this elaborate dance. All they know is the reward.

A smart AI, on the other hand, should be capable of understanding why a smiley face is important... ie it sometimes indicates that the human is happy. That the smiley face isn't the goal itself, but only a sometime-indicator that the goal might have been achieved.

Hibbard's method of teaching will simply not lead to that understanding.

In which case, I'm reminded of this post: http://lesswrong.com/lw/le/lost_purposes/ That a smiley face is only worthwhile if it actually indicates the real end-goal (of humans being happy). Otherwise the smiley face is as worthless as opening the car-door in the absence of chocolate at the supermarket.

...of course, there are still possible pathological cases - eg everyone being fed Soma (as previously mentioned in comments), or everyone being lobotomised so they really are happy but it still not being what humans would have chosen - but that's the subject of teaching the AI more about the "human happiness" goal.

Comment author: TheOtherDave 23 December 2010 05:39:07PM 0 points [-]

Right.

Unless it turns out that happiness isn't what we would have chosen, either. In which case perhaps discarding the "human happiness" goal and teaching it to adopt a "what humans would have chosen" goal works better?

Unless it turns out that what humans would have chosen involves being fused into glass at the bottoms of smoking craters. In which case perhaps a "what humans ought to have chosen" goal works better?

Except now we've gone full circle and are expecting the AI to apply a nonhuman valuation, which is what we rejected in the first place.

I haven't completely followed the local thinking on this subject yet, but my current approximation of the local best answer goes "Let's assume that there is a way W for the world to be, such that all humans would prefer W if they were right-thinking enough, including hypothetical future humans living in the world according to W. Further, let's assume the specifications of W can be determined from a detailed study of humans by a sufficiently intelligent observer. Given those assumptions, we should build a sufficiently intelligent observer whose only goal is to determine W, and then an optimizing system to implement W."

Comment author: taryneast 23 December 2010 06:02:12PM *  1 point [-]

Hmmm, I can foresee many problems with guessing what humans "ought" to prefer. Even humans have got that one wrong pretty much every time they've tried.

I'd say a "better" goal might be cased as "increasing the options available to most humans (not at the expense of the options of other humans)"

This goal seems compatible with allowing humans to choose happier lifestyles - but without forcing them into any particular lifestyle that they may not consider to be "better".

It would "work" by concentrating on things like extending human lifespans and finding better medical treatments for things that limit human endeavour.

However, this is just a guess... and I am still only a novice here... which means I am in no way capable of figuring out how I'd actually go about training an AI to accept the above goal.

All I know is that I agree with Eliezer's post that the lab-rat method would be sub-optimal as it has a high propensity to fall into pathological configurations.

Comment author: wallowinmaya 14 May 2011 09:17:42PM *  7 points [-]

Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.

I have to confess that at first glance this statement seems arrogant. But then I actually read some stuff on this AGI mailing list and, well, I was filled with horror after reading threads like this one:

Here is one of the most ridiculous passages:

Note that we may not have perfected this process, and further, that this process need not be perfected. Somewhere around the age of 12, many of our neurons DIE. Perhaps these were just the victims of insufficiently precise dimensional tagging? Once things can ONLY connect up in mathematically reasonable ways, what remains between a newborn and a physics-complete AGI? Obviously, the physics, which can be quite different on land than in the water. Hence, the physics must also be learned.

It feels like reading Heidegger on crack, while yourself being stoned. And what is really terrifying is that Ben Goertzel, whom I admired just 6 months ago, replies to and discusses such nonsense repeatedly! Is it really true that even some of the most famous AGI researchers are that crazy?

Comment author: elspood 04 June 2011 12:26:09AM 1 point [-]

Can anyone please explain the reference to the horror seen firsthand at http://www.mail-archive.com/agi@v2.listbox.com/? I tried going back in the archives to see if something happened in August 2008 or earlier (the date of Eliezer's post), but the list archive site doesn't have anything older than October 2008 currently. My curiosity is piqued and I need closure on the anecdote. If nothing else, others might benefit from knowing what horrors might be avoided during AGI research.

Comment author: saturn 04 June 2011 01:02:05AM 0 points [-]

I think Eliezer is referring to the high ratio of posts by M-ntif-x and similar kooks.

Comment author: thomblake 20 September 2011 04:32:50PM 3 points [-]

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

Probably apocryphal. I haven't been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.

Comment author: gwern 20 September 2011 06:36:45PM 3 points [-]

I poked around in Google Books; the earliest clear reference I found was the 2000 Cartwright book Intelligent data analysis in science, which seems to attribute it to the TV show Horizon. (No further info - just snippet view.)

Comment author: thomblake 20 September 2011 06:47:52PM 3 points [-]

Here is one supposedly from 1998, though it's hardly academic.

Comment author: gwern 21 June 2015 07:26:10PM 5 points [-]

A Redditor provides not one but two versions from "Embarrassing mistakes in perceptron research", Marvin Minsky, recorded 29-31 Jan 2011:

Like I had a friend in Italy who had a perceptron that looked at a visual... it had visual inputs. So, he... he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron - a big machine - that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of... of pieces of music just by looking at the lower left... lower right hand corner of the page. So, I told this to the... to our scientist friend and he went through the data and he said: 'You guessed right. That's... that's how it happened to make that distinction.' We thought it was very funny. A similar thing happened here in the United States at one of our research institutions. Where a perceptron had been trained to distinguish between - this was for military purposes - It could... it was looking at a scene of a forest in which there were camouflaged tanks in one picture and no camouflaged tanks in the other. And the perceptron - after a little training - got... made a 100% correct distinction between these two different sets of photographs. Then they were embarrassed a few hours later to discover that the two rolls of film had been developed differently. And so these pictures were just a little darker than all of these pictures and the perceptron was just measuring the total amount of light in the scene. But it was very clever of the perceptron to find some way of making the distinction.

While the Italian story seems to be true, since Minsky says he knew the Italian and personally spotted how the neural net was overfitting, he just recounts the urban legend as happening at 'an institution'; there is a new twist, though, that this time it's the exposure of the photographic film rather than the forest or clouds or something.

Comment author: PhilGoetz 20 September 2011 07:47:45PM *  0 points [-]

I was surprised that the post focused on the difficulty of learning to classify things, rather than on the problems that would arise assuming the AI learned to classify smiling humans correctly. I'm not worried that the AI will tile the universe with smiley-faces. I'm worried the AI will tile the universe with smiling humans. Even with genuinely happy humans.

Humans can classify humans into happy and unhappy pretty well; superintelligent AI will be able to also. The hard problem is not identifying happiness; the hard problem is deciding what to maximize.

Comment author: timtyler 23 October 2011 12:41:02PM 2 points [-]

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

This document has a citation for the story: (Skapura, David M. and Peter S. Gordon, Building Neural Networks, Addison-Wesley, 1996.) I don't know for sure if that is the end of the trail or not.

Comment author: gwern 23 October 2011 02:33:11PM 2 points [-]

No page number, unfortunately. Not in library.nu; closest copy to me was in the New York Public Library. I then looked in Google Books http://books.google.com/books?id=RaRbNBqGR1oC

Neither of the 2 hits for 'tanks' seemed to be relevant; ditto for 'clear'. No hits for 'cloudy' or 'skies' or 'enemy'; there's one hit for 'sky', pg 206, where it talks about a plane recognition system that worked well until the plane moved close to the ground and then became confused because it had only learned to find 'the darkest section in the image'.

Comment author: timtyler 23 October 2011 10:42:55PM *  0 points [-]

The bottom of page 199 seems to be about "classifying military tanks in SAR imagery". It goes on to say it is only interested in "tank" / "non-tank" categories.

Comment author: pedanterrific 23 October 2011 11:10:29PM 2 points [-]

Discussed here, there's a few bits that might be useful.

Comment author: MugaSofer 11 December 2012 09:59:27AM 4 points [-]

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

That's a much more convincing and vivid image than "molecular smiley faces". Makes a more general point, too. Shame you didn't use that the first time, really.