Magical Categories

Eliezer Yudkowsky

24th Aug 2008

'We can design intelligent machines so their primary, innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'
-- Bill Hibbard (2001), Super-intelligent machines.

That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I'm discussing here.

So... um... what could possibly go wrong...

When I mentioned (sec. 6) that Hibbard's AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:

'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

As Hibbard also wrote "Such obvious contradictory assumptions show Yudkowsky's preference for drama over reason," I'll go ahead and mention that Hibbard illustrates a key point: There is no professional certification test you have to take before you are allowed to talk about AI morality. But that is not my primary topic today. Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand. Even Michael Vassar was probably surprised his first time through.

No, today I am here to dissect "You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans."

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly classified the training set - output "yes" for the 50 photos of camouflaged tanks, and output "no" for the 50 photos of forest.

Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have "learned" 100 special cases that wouldn't generalize to new problems. Not, "camouflaged tanks versus forest", but just, "photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive..."

But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed!

The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers' data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

This parable - which might or might not be fact - illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence: If the training problems and the real problems have the slightest difference in context - if they are not drawn from the same independent and identically distributed process - there is no statistical guarantee from past success to future success. It doesn't matter if the AI seems to be working great under the training conditions. (This is not an unsolvable problem but it is an unpatchable problem. There are deep ways to address it - a topic beyond the scope of this post - but no bandaids.)
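To make the failure mode concrete, here is a minimal sketch of the parable. Everything in it is an assumption made for illustration: the two features (`sunny`, and a noisy `edges` cue standing in for "there might be a tank here"), the one-feature learner `fit_stump`, and the way the photo sets are generated. It is a toy, not a reconstruction of whatever the researchers in the story actually did.

```python
def noisy_cue(tank, i):
    # Hypothetical weak visual cue for "tank": wrong on every fifth photo.
    return tank if i % 5 else not tank

def photo(tank, sunny, i):
    return {"tank": tank, "sunny": sunny, "edges": noisy_cue(tank, i)}

def shoot(start, n):
    """Researchers' shoot: tanks only on cloudy days, forest only on sunny days."""
    idx = range(start, start + n)
    return [photo(True, False, i) for i in idx] + [photo(False, True, i) for i in idx]

def field(n):
    """Field conditions: weather is unrelated to whether a tank is present."""
    return [photo(i % 2 == 0, (i // 2) % 2 == 0, i) for i in range(n)]

def accuracy(rule, photos):
    feature, tank_when = rule  # predict "tank" iff photo[feature] == tank_when
    hits = sum((p[feature] == tank_when) == p["tank"] for p in photos)
    return hits / len(photos)

def fit_stump(photos):
    """Pick the single-feature rule with the highest training accuracy."""
    rules = [(f, v) for f in ("sunny", "edges") for v in (True, False)]
    return max(rules, key=lambda r: accuracy(r, photos))

train, held_out = shoot(0, 50), shoot(50, 50)
rule = fit_stump(train)
print(rule)                         # ('sunny', False): "tank iff cloudy"
print(accuracy(rule, held_out))     # 1.0 on held-out photos from the same shoot
print(accuracy(rule, field(100)))   # 0.5 once weather and tanks are decoupled
```

The held-out set shares the confound with the training set, so a perfect held-out score tells you nothing about which of the two cues the learner latched onto.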

As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes. If a black-and-white image is 256 pixels on a side, then the total image is 65536 pixels. The number of possible images is 2⁶⁵⁵³⁶. And the number of possible concepts that classify images into positive and negative instances - the number of possible boundaries you could draw in the space of images - is 2^(2⁶⁵⁵³⁶). From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 2⁶⁵⁵³⁶ classified examples to discriminate among 2^(2⁶⁵⁵³⁶) possible concepts - even if classifications are constant over time.

If this seems at all counterintuitive or non-obvious, see Superexponential Conceptspace.
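For a scale small enough to enumerate, the counting argument can be checked directly. The sketch below is only an illustration: the function name `count_consistent_concepts` and the 2-pixel "images" are my own choices, and brute-force enumeration is obviously hopeless beyond a handful of pixels.

```python
from itertools import product

def count_consistent_concepts(n_pixels, labeled):
    """Count Boolean concepts over n_pixels-bit images that agree with `labeled`.

    `labeled` maps an image (a tuple of 0/1 pixels) to its True/False label.
    """
    images = list(product((0, 1), repeat=n_pixels))   # 2**n_pixels possible images
    count = 0
    # A concept assigns True/False to every image: 2**(2**n_pixels) concepts.
    for labels in product((False, True), repeat=len(images)):
        concept = dict(zip(images, labels))
        if all(concept[img] == lab for img, lab in labeled.items()):
            count += 1
    return count

print(count_consistent_concepts(2, {}))                            # 16 = 2**(2**2)
print(count_consistent_concepts(2, {(0, 0): True}))                # 8
print(count_consistent_concepts(2, {(0, 0): True, (0, 1): False,
                                    (1, 0): False, (1, 1): True})) # 1
```

Each labeled example only halves the set of surviving concepts, so without an inductive bias you need a label for every one of the 2^n images before a single concept is left.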

So let us now turn again to:

'First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'

and

'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

It's trivial to discriminate a photo of a camouflaged tank from a photo of an empty forest, in the sense of determining that the two photos are not identical. They're different pixel arrays with different 1s and 0s in them. Discriminating between them is as simple as testing the arrays for equality.

Classifying new photos into positive and negative instances of "smile", by reasoning from a set of training photos classified positive or negative, is a different order of problem.

When you've got a 256x256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65537th bit denoting the positiveness - no tiny little XML tag that says "This image is inherently positive". It's only a positive example relative to some particular concept.

But for any non-Vast amount of training data - any training data that does not include the exact bitwise image now seen - there are superexponentially many possible concepts compatible with previous classifications.

For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias. Which may not match what the user has in mind. The gap between these two example-classifying processes - induction on the one hand, and the user's actual goals on the other - is not trivial to cross.

Let's say the AI's training data is:

Dataset 1:

+: Smile_1, Smile_2, Smile_3
-: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5

Now the AI grows up into a superintelligence, and encounters this data:

Dataset 2:

Frown_6, Cat_3, Smile_4, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

It is not a property of these datasets that the inferred classification you would prefer is:

+: Smile_1, Smile_2, Smile_3, Smile_4
-: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

rather than

+: Smile_1, Smile_2, Smile_3, Molecular_Smileyface_1, Molecular_Smileyface_2, Smile_4
-: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Cat_4, Galaxy_2, Nanofactory_2

Both of these classifications are compatible with the training data. The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset. If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.
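As a toy illustration of two concepts casting the same shadow on the training data, the sketch below uses the instance names themselves as stand-ins for images. The two predicates, `human_smile` and `smile_shaped`, are hypothetical concepts I made up for the example; a real learner would be working on pixels, not strings, and would have vastly more candidates to choose from.

```python
TRAIN_POS = ["Smile_1", "Smile_2", "Smile_3"]
TRAIN_NEG = ["Frown_1", "Cat_1", "Frown_2", "Frown_3",
             "Cat_2", "Boat_1", "Car_1", "Frown_5"]
DATASET_2 = ["Frown_6", "Cat_3", "Smile_4", "Galaxy_1", "Frown_7",
             "Nanofactory_1", "Molecular_Smileyface_1", "Cat_4",
             "Molecular_Smileyface_2", "Galaxy_2", "Nanofactory_2"]

def human_smile(name):
    # Hypothetical concept A: only things of the kind "Smile_n".
    return name.startswith("Smile_")

def smile_shaped(name):
    # Hypothetical concept B: anything with a smile pattern in it.
    return "smile" in name.lower()

def fits_training(concept):
    """True if the concept reproduces every training label exactly."""
    return (all(concept(x) for x in TRAIN_POS)
            and not any(concept(x) for x in TRAIN_NEG))

print(fits_training(human_smile), fits_training(smile_shaped))  # True True

# The concepts only come apart on instances the training data never covered.
disagreements = [x for x in DATASET_2 if human_smile(x) != smile_shaped(x)]
print(disagreements)  # ['Molecular_Smileyface_1', 'Molecular_Smileyface_2']
```

Nothing in Dataset 1 prefers one predicate over the other; whichever one the learner ends up with is a fact about the learner.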

Which classification will the AI choose? This is not an inherent property of the training data; it is a property of how the AI performs induction.

Which is the correct classification? This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name "right").

The concept that you wanted, cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so. That's what supervised learning is all about - providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.

But unless the training data is drawn from exactly the same context as real life, the training data will be "shallow" in some sense, a projection from a much higher-dimensional space of possibilities.

The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, or it never saw a tiny little agent with a happiness counter set to a googolplex. Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values. It is your own plans and desires that are at work when you say "No!"

Hibbard knows instinctively that a tiny molecular smileyface isn't a "smile", because he knows that's not what he wants his putative AI to do. If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling - as opposed to frowning, say - even though it's only paint.

As the case of Terri Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas. Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece, will not enable the AI to make a moral decision as to whether switching off Terri's life support is murder. That information isn't present in the dataset even inductively! Terri Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn't need to think about while classifying photos of living and dead humans from the time of Ancient Greece. No one was on life support then, still breathing with a brain half fluid. So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.

As a matter of formal fallacy, I see two anthropomorphic errors on display.

The first fallacy is underestimating the complexity of a concept we develop for the sake of its value. The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven't seen before. But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is obviously not a smile. And we don't generate all possible borderline cases, so we don't think of all the considerations that might play a role in redefining the concept, but haven't yet played a role in defining it. Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data. (And also the difficulty of describing the concept directly - see The Hidden Complexity of Wishes.)

The second fallacy is anthropomorphic optimism: Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of "smile". As Hibbard uses the "smile" concept (to describe desired behavior of superintelligences), extending "smile" to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do - inherently so, as a property of the concept itself - so surely a superintelligence would not do it; this is just obviously the wrong classification. Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.

Why, Friendly AI isn't hard at all! All you need is an AI that does what's good! Oh, sure, not every possible mind does what's good - but in this case, we just program the superintelligence to do what's good. All you need is a neural network that sees a few instances of good things and not-good things, and you've got a classifier. Hook that up to an expected utility maximizer and you're done!

I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood. Relative to the full space of possibilities the Future encompasses, we ourselves haven't imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out. To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.

Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a "smile", and excludes every facial expression that Hibbard wouldn't label a "smile"...

Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

The deep answers to such problems are beyond the scope of this post, but it is a general principle of Friendly AI that there are no bandaids. In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors. Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying "Yes, that's happiness!" about hydrogen atoms - hydrogen atoms are easy to make.

Link to my discussion with Hibbard here. You already got the important parts.

143 comments, sorted by oldest

[-]peco17y-10

Why can't the AI just be exactly the same as Hibbard? If Hibbard is flawed in a major way, you could make an AI for every person on Earth (this obviously wouldn't be practical, but if a few million AIs are bad the other few billion can deal with them).

[-]DanielLC14y180

We already have an entity exactly the same as Hibbard. Namely: Hibbard. Why do we need another one?

What we want is an AI that's far more intelligent than a human, yet shares their values. Increasing intelligence while preserving values is nontrivial. You could try giving Hibbard the ability to self-modify, but then he'd most likely just go insane in some way or another.

1BeanSprugget6y

I don't really doubt that increasing intelligence while preserving values is nontrivial, but I wonder just how nontrivial it is: are the regions of the brain for intelligence and values separate? Actually, writing that out, I realize that (at least for me) values are a "subset" of intelligence: the "facts" we believe about science/math/logic/religion are generated in basically the same way as our moral values; the difference to us humans seems obvious, but it really is, well, nontrivial. The paper clip maximizing AI is a good example: even if it wasn't about "moral values"--even if you wanted to maximize something like paper clips--you'd still run into trouble

[-]Carl_Shulman17y111

"Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers."

You use examples of this type fairly often, but for a utility function linear in smiles wouldn't the number of smiles generated by pleasing the programmers be trivial relative to the output of even a little while with access to face-xeroxing? This could be partly offset by anthropic/simulation issues, but still I would expect the overwhelming motive for appearing to work correctly during childhood (after it could recognize this point) would be tricking the programmers, not the tiny gains from their smiles.

[-]Carl_Shulman17y71

For instance, a weak AI might refrain from visibly trying to produce smiles in disturbing ways as part of an effort (including verbal claims) to convince the programmers that it had apprehended the objective morality behind their attempts to inculcate smiles as a reinforcer.

[-]Tim_Tyler17y-30

Early AIs are far more likely to be built to maximise the worth of the company that made them than anything to do with human happiness. E.g. see: Artificial intelligence applied heavily to picking stocks

A utility function measured in dollars seems fairly unambiguous.

[-]DilGreen15y210

A utility function measured in dollars seems fairly unambiguously to lead to decisions that are non-optimal for humans, without a sophisticated understanding of what dollars are.

Dollars mean something for humans because they are tokens in a vast, partly consensual and partially reified game. Economics, which is our approach to developing dollar maximising strategies, is non-trivial.

Training an AI to understand dollars as something more than data points would be similarly non-trivial to training an AI to faultlessly assess human happiness.

2PhilGoetz14y

But that's not what this post is about. Eliezer is examining a different branch of the tree of possible futures.

[-]JessRiedel17y286

Eliezer, I believe that your belittling tone is conducive to neither a healthy debate nor a readable blog post. I suspect that your attitude is born out of just frustration, not contempt, but I would still strongly encourage you to write more civilly. It's not just a matter of being nice; rudeness prevents both the speaker and the listener from thinking clearly and objectively, and it doesn't contribute to anything.

-1CynicalOptimist9y

Can't agree with this enough.

[-]Anon1717y1-1

It has always struck me that the tiling the universe with smiley faces example is one of the stupidest possible examples Eliezer could have come up with. It is extremely implausible, MUCH, MUCH more so than the camouflage tank scenario, and I understand Hibbard's indignation even if I agree with Eliezer on the general point he is making.

I have no idea why Eliezer wouldn't choose a better example that illustrates the same point, like the AGI spiking the water supply with a Soma-like drug that actually does make us all profoundly content in a highly undesirable way.

[+]retired_urologist17y-60

[-]Shane_Legg17y60

Is it just me, or are things getting a bit unfriendly around here?

Anyway...

Wiring up the AI to maximise happy faces etc. is not a very good idea, the goal is clearly too shallow to reflect the underlying intent. I'd have to read more of Hibbard's stuff to properly understand his position, however.

That said, I do agree with a more basic underlying theme that he seems to be putting forward. In my opinion, a key, perhaps even THE key to intelligence is the ability to form reliable deep abstractions. In Solomonoff induction and AIXI you see this being drivi...

3Kenny13y

"It's not that stupid." What if it doesn't care about happiness or smiles or any other abstractions that we value? A super-intelligence isn't an unlimited intelligence, i.e. it would still have to choose what to think about.

2bouilhet12y

I think the point is that if you accept this definition of intelligence, i.e. that it requires the ability to form deep and reliable abstractions about the world, then it doesn't make sense to talk about any intelligence (let alone a super one) being unable to differentiate between smiley-faces and happy people. It isn't a matter, at least in this instance, of whether it cares to make that differentiation or not. If it is intelligent, it will make the distinction. It may have values that would be unrecognizable or abhorrent to humans, and I suppose that (as Shane_Legg noted) it can't be ruled out that such values might lead it to tile the universe with smiley-faces, but such an outcome would have to be the result of something other than a mistake. In other words, if it really is "that stupid," it fails in a number of other ways long before it has a chance to make this particular error.

2Rob Bensinger12y

I wrote a post about this! See The genie knows, but doesn't care. It may not make sense to talk about a superintelligence that's too dumb to understand human values, but it does make sense to talk about an AI smart enough to program superior general intelligences that's too dumb to understand human values. If the first such AIs ('seed AIs') are built before we've solved this family of problems, then the intelligence explosion thesis suggests that it will probably be too late. You could ask an AI to solve the problem of FAI for us, but it would need to be an AI smart enough to complete that task reliably yet too dumb (or too well-boxed) to be dangerous.

0TheAncientGeek12y

Superior to what? If they are only as smart as the average person, then all things being equal, they will be as good as the average person at figuring out morality. If they are smarter, they will be better. You seem to be tacitly assuming that the Seed AIs are designing walled-off unupdateable utility functions. But if one assumes a more natural architecture, where moral sense is allowed to evolve with everything else, you would expect an incremental succession of AIs to gradually get better at moral reasoning. And if it fooms, its moral reasoning will foom along with everything else, because you haven't created an artificial problem by firewalling it off.

0Rob Bensinger12y

Superior to itself. That's not generally true of human-level intelligences. We wouldn't expect a random alien species that happens to be as smart as humans to be very successful at figuring out human morality. It may be true if the human-level AGI is an unmodified emulation of a human brain. But humans aren't very good at figuring out morality; they can make serious mistakes, though admittedly not the same mistakes Eliezer gives as examples above. (He deliberately picked ones that sound 'stupid' to a human mind, to make the point that human concepts have a huge amount of implicit complexity built in.) Not necessarily. The average chimpanzee is better than the average human at predicting chimpanzee behavior, simulating chimpanzee values, etc. (See Sympathetic Minds.) Utility functions that change over time are more dangerous than stable ones, because it's harder to predict how a descendant of a seed AI with a heavily modified utility function will behave than it is to predict how a descendant with the same utility function will behave. If we don't solve the problem of Friendly AI ourselves, we won't know what trajectory of self-modification to set the AI on in order for it to increasingly approximate Friendliness. We can't tell it to increasingly approximate something that we ourselves cannot formalize and cannot point to clear empirical evidence of. We already understand arithmetic, so we know how to reward a system for gradually doing better and better at arithmetic problems. We don't understand human morality or desire, so we can't design a Morality Test or Wish Test that we know for sure will reward all and only the good or desirable actions. We can make the AI increasingly approximate something, sure, but how do we know in advance that that something is something we'd like?

0TheAncientGeek12y

Assuming morality is lots of highly localised, different things...which I don't, particularly. If it is not, then you can figure it out anywhere. If it is, then the problem the aliens have is not that morality is imponderable, but that they don't have access to the right data. They don't know how things are on Earth. However, an AI built on Earth would. So the situation is not analogous. The only disadvantage an AI would have is not having biological drives itself, but it is not clear that an entity needs to have drives in order to understand them. We could expect a SIAI to get incrementally better at maths than us until it surpasses us; we wouldn't worry that it would hit on the wrong maths, because maths is not a set of arbitrary, disconnected facts. An averagely intelligent AI with an average grasp of morality would not be more of a threat than an average human. A smart AI would, all other things being equal, be better at figuring out morality. But all other things are not equal, because you want to create problems by walling off the UF. I'm sure they do. That seems to be why progress in AGI, specifically use of natural language, has been achingly slow. But why should moral concepts be so much more difficult than others? An AI smart enough to talk its way out of a box would be able to understand the implicit complexity: an AI too dumb to understand implicit complexity would be boxable. Where is the problem? Things are not inherently dangerous just because they are unpredictable. If you have some independent reason for thinking something might turn dangerous, then it becomes desirable to predict it. But superintelligent artificial general intelligences are generally assumed to be good at everything: they are not assumed to develop mysterious blind spots about falconry or mining engineering. Why assume they will develop a blind spot about morality? Oh yes...because you have assumed from the outset that the UF must be walled off from self improvement...in order

1Rob Bensinger12y

The problem of FAI is the problem of figuring out all of humanity's deepest concerns and preferences, not just the problem of figuring out the 'moral' ones (whichever those are). E.g., we want a superintelligence to not make life boring for everyone forever, even if 'don't bore people' isn't a moral imperative. Regardless, I don't see how the moral subset of human concerns could be simplified without sacrificing most human intuitions about what's right and wrong. Human intuitions as they stand aren't even consistent, so I don't understand how you can think the problem of making them consistent and actionable is going to be a simple one. Someday, perhaps. With enough time and effort invested. Still, again, we would expect a lot more human-intelligence-level aliens (even if those aliens knew a lot about human behavior) to be good at building better AIs than to be good at formalizing human value. For the same reason, we should expect a lot more possible AIs we could build to be good at building better AIs than to be good at formalizing human value. I don't know what you mean by 'imponderable'. Morality isn't ineffable; it's just way too complicated for us to figure out. We know how things are on Earth; we've been gathering data and theorizing about morality for centuries. And our progress in formalizing morality has been minimal. An AI that's just a copy of a human running on transistors is much more powerful than a human, because it can think and act much faster. It would also be better at figuring out how many atoms are in my fingernail, but that doesn't mean it will ever get an exact count. The question is how rough an approximation of human value can we allow before all value is lost; this is the 'fragility of values' problem. It's not enough for an AGI to do better than us at FAI; it has to be smart enough to solve the problem to a high level of confidence and precision. First, because they're anthropocentric; 'iron' can be defined simply because it's a comm

-8TheAncientGeek12y

-6TheAncientGeek12y

0[anonymous]9y

Those two things turn out to be identical (deepest concerns and preferences=the 'moral' ones). Because nothing else can be of greater importance to a decision maker.

0CynicalOptimist9y

I think that RobbBB has already done a great job of responding to this, but I'd like to have a try at it too. I'd like to explore the math/morality analogy a bit more. I think I can make a better comparison. Math is an enormous field of study. Even if we limited our concept of "math" to drawing graphs of mathematical functions, we would still have an enormous range of different kinds of functions: Hyperbolic, exponential, polynomial, all the trigonometric functions, etc. etc. Instead of comparing math to morality, I think it's more illustrative to compare math to the wider topic of "value-driven-behaviour". An intelligent creature could have all sorts of different values. Even within the realm of modern, western, democratic morality we still disagree about whether it is just and proper to execute murderers. We disagree about the extent to which a state is obligated to protect its citizens and provide a safety net. We disagree about the importance of honesty, of freedom vs. safety, freedom of speech vs. protection from hate speech. If you look at the wider world, and at cultures through history, you'll find a much wider range of moralities. People who thought it was not just permitted, but morally required that they enslave people, restrict the freedoms of their own families, and execute people for religious transgressions. You might think that these are all better or worse approximations of the "one true morality", and that a superintelligence could work out what that true morality is. But we don't think so. We believe that these are different moralities. Fundamentally, these people have different values. Then we can step further out, and look at the "insane" value systems that a person could hold. Perhaps we could believe that all people are so flawed that they must be killed. Or we could believe that no one should ever be allowed to die, and so we extend life indefinitely, even for people in agony. Or we might believe everyone should be lobotomised for our

0TheAncientGeek9y

The range of possible values is only a problem if you hold to the theory that morality "is" values, without any further qualifications; then an AI is going to have trouble figuring out morality a priori. If you take the view that morality is a fairly uniform way of handling values, or a subset of values, then the AI can figure it out by taking prevailing values as input, as data. (We will be arguing that:- * Ethics fulfils a role in society, and originated as a mutually beneficial way of regulating individual actions to minimise conflict, and solve coordination problems. ("Social Realism"). * No spooky or supernatural entities or properties are required to explain ethics (naturalism is true) * There is no universally correct system of ethics. (Strong moral realism is false) * Multiple ethical constructions are possible... Our version of ethical objectivism needs to be distinguished from universalism as well as realism. Ethical universalism is unlikely...it is unlikely that different societies would have identical ethics under different circumstances. Reproductive technology must affect sexual ethics. The availability of different food sources in the environment must affect vegetarianism versus meat eating. However, a compromise position can allow object-level ethics to vary non-arbitrarily. In other words, there is not an objective answer to questions of the form "should I do X", but there is an answer to the question "As a member of a society with such-and-such prevailing conditions, should I do X". In other words still, there is no universal (object level) ethics, but there is an objective-enough ethics, which is relativised to societies and situations, by objective features of societies and situations...our meta ethics is a function from situations to object level ethics, and since both the function and its parameters are objective, the output is objective. By objectivism-without-realism, we mean that mutually isolated groups of

2CCC9y

So... what you're suggesting, in short, is that a sufficiently intelligent AI can work out the set of morals which are most optimal in a given human society. (There's the question of whether it would converge on the most optimal set of morals for the long-term benefit of the society as a whole, or the most optimal set of morals for the long-term benefit of the individual). But let's say the AI works out an optimal set of morals for its current society. What's to stop the AI from metaphorically shrugging and ignoring those morals in order to rather build more paperclips? Especially given that it does not share those values.

0TheAncientGeek9y

Which individual? There might be some decision theory which promotes the interests of Joe Soap, against the interests of society, but there is no way I would call it morality. Its motivational system. We're already assuming it's motivated to make the deduction, we need to assume it's motivated to implement. I am not bypassing the need for a goal driven AI to have appropriate goals, I am bypassing the need for a detailed and accurate account of human ethics to be preprogrammed. I am not saying it necessarily does not. I am saying it does not necessarily.

1CCC9y

Ah, I may have been unclear there. To go into more detail, then; you appear to be suggesting that optimal morality can be approached as a society-wide optimisation problem; in the current situations, these moral strictures produce a more optimal society than those, and this optimisation problem can be solved with sufficient computational resources and information. But now, let us consider an individual example. Let us say that I find a wallet full of money on the ground. There is no owner in sight. The optimal choice for the society as a whole is that I return the money to the original owner; the optimal choice for the individual making the decision is to keep the money and use it towards my aims, whatever those are. (I can be pretty sure that the man to whom I return the money will be putting it towards his aims, not mine, and if I'm sufficiently convinced that my aims are better for society than his then I can even rationalise this action). By my current moral structures, I would have to return the money to its original owner. But I can easily see a superintelligent AI giving serious consideration to the possibility that it can do more good for the original owner with the money than the original owner could. This, right here, is the hard problem of Friendly AI. How do we make it motivated to implement? And, more importantly, how do we know that it is motivated to implement what we think it's motivated to implement? You're suggesting that it can figure out the complicated day-to-day minutiae and the difficult edge cases on its own, given a suitable algorithm for optimising morality. My experience in software design suggests that that algorithm needs to be really, really good. And extremely thoroughly checked, from every possible angle, by a lot of people. I'm not denying that such an algorithm potentially exists. I can just think of far, far too many ways for it to go very badly wrong. ...point taken. It may or may not share those values. But then we must a

0rkyeun9y

I believe that iff naturalism is true then strong moral realism is as well. If naturalism is true then there are no additional facts needed to determine what is moral than the positions of particles and the outcomes of arranging those particles differently. Any meaningful question that can be asked of how to arrange those particles or rank certain arrangements compared to others must have an objective answer because under naturalism there are no other kinds and no incomplete information. For the question to remain unanswerable at that point would require supernatural intervention and divine command theory to be true. If there can't be an objective answer to morality, then FAI is literally impossible. Do remember that your thoughts and preference on ethics are themselves an arrangement of particles to be solved. Instead I posit that the real morality is orders of magnitude more complicated, and finding it more difficult, than for real physics, real neurology, real social science, real economics, and can only be solved once these other fields are unified. If we were uncertain about the morality of stabbing someone, we could hypothetically stab someone to see what happens. When the particles of the knife rearrange the particles of their heart into a form that harms them, we'll know it isn't moral. When a particular subset of people with extensive training use their knife to very carefully and precisely rearrange the particles of the heart to help people, we call those people doctors and pay them lots of money because they're doing good. But without a shitload of facts about how to exactly stab someone in the heart to save their life, that moral option would be lost to you. And the real morality is a superset that includes that action along with all others.

1TheAncientGeek9y

You need to refute non-cognitivism, as well as asserting naturalism. Naturalism says that all questions that have answers have naturalistic answers, which means that if there are answers to ethical questions, they are naturalistic answers. But there is no guarantee that ethical questions mean anything, that they have answers. No, only non-cognitivism, the idea that ethical questions just don't make sense, like "how many beans make yellow?". Not unless the "F" is standing for something weird. Absent objective morality, you can possibly solve the control problem, ie achieving safety by just making the AI do what you want; and absent objective morality, you can possibly achieve AI safety by instilling a suitable set of arbitrary values. Neither is easy, but you said "impossible". That's not an argument for cognitivism. When I entertain the thought "how many beans make yellow?", that's an arrangement of particles. Do you have an argument for that proposal? Because I am arguing for something much simpler, that morality only needs to be grounded at the human level, so reductionism is neither denied nor employed. It's hard to see what point you are making there. The social and evaluative aspects do make a difference to the raw physics, and so much that the raw physics counts for very little. Yet previously you were insisting that a reduction to fundamental particles was what underpinned the objectivity of morality.

1g_pepper9y

Even if it were true that under naturalism we could determine the outcome of various arrangements of particles, wouldn't we still be left with the question of which final outcome was the most morally preferable? But, you and I might have different moral preferences. How (under naturalism) do we objectively decide between your preferences and mine? And isn't it also possible that neither your preferences nor my preferences are objectively moral?

-1BrianPansky9y

Yup. But that's sort-of contained within "the positions of particles" (so long as all their other properties are included, such as temperature and chemical connections and so on...might need to include rays of light and non-particle stuff too!). The two are just different ways of describing the same thing. Just like every object around you could be described either with their usual names ("keyboard", "desk", etc) or with an elaborate molecule by molecule description. Plenty of other descriptions are possible too (like "rectangular black colored thing with a bunch of buttons with letters on it" describes my keyboard kinda). You don't. True preferences (as opposed to mistaken preferences) aren't something you get to decide. They are facts.

1TheAncientGeek9y

That's an expression of ethical naturalism, not a defence of ethical naturalism. Missing the point. Ethics needs to sort good actors from bad--decisions about punishments and rewards depend on it. PS are you the same person as rkyeun? If not, to what extent are you on the same page?

0BrianPansky9y

(I'd say need to sort good choices from bad. Which includes the choice to punish or reward.) Discovering which choices are good and which are bad is a fact finding mission. Because: * 1) it's a fact whether a certain choice will successfully fulfill a certain desire or not * And 2) that's what "good" literally means: desirable. So that's what any question of goodness will be about: what will satisfy desires. No I'm not rkyeun. As for being on the same page...well I'm definitely a moral realist. I don't know about their first iff-then statement though. Seems to me that strong moral realism could still exist if supernaturalism were true. Also, talking in terms of molecules is ridiculously impractical and unnecessary. I only talked in those terms because I was replying to a reply to those terms :P

0g_pepper9y

But, what if two different people have two conflicting desires? How do we objectively find the ethical resolution to the conflict?

0BrianPansky9y

Basically: game theory. In reality, I'm not sure there ever are precise conflicts of true foundational desires. Maybe it would help if you had some real example or something. But the best choice for each party will always be the one that maximizes their chances of satisfying their true desire.

0g_pepper9y

I was surprised to hear that you doubt that there are ever conflicts in desires. But, since you asked, here is an example: A is a sadist. A enjoys inflicting pain on others. A really wants to hurt B. B wishes not to be hurt by A. (For the sake of argument, let's suppose that no simulation technology is available that would allow A to hurt a virtual B, and that A can be reasonably confident that A will not be arrested and brought to trial for hurting B.) In this scenario, since A and B have conflicting desires, how does a system that defines objective goodness as that which will satisfy desires resolve the conflict?

0BrianPansky9y

Re-read what I said. That's not what I said. First get straight: good literally objectively does mean desirable. You can't avoid that. Your question about conflict can't change that (thus it's a red herring). As for your question: I already generally answered it in my previous post. Use game theory. Find the actions that will actually be best for each agent. The best choice for each party will always be the one that maximizes their chances of satisfying their true desires. I might finish a longer response to your specific example, but that takes time. For now, Richard Carrier's Goal Theory Update probably covers a lot of that ground. http://richardcarrier.blogspot.ca/2011/10/goal-theory-update.html

1CCC9y

It does not. Wiktionary states that it means "Acting in the interest of good; ethical." (There are a few other definitions, but I'm pretty sure this is the right one here). Looking through the definitions of 'ethical', I find "Morally approvable, when referring to an action that affects others; good. " 'Morally' is defined as "In keeping of requirements of morality.", and 'morality' is "Recognition of the distinction between good and evil or between right and wrong; respect for and obedience to the rules of right conduct; the mental disposition or characteristic of behaving in a manner intended to produce morally good results. " Nowhere in there do I see anything about "desirable" - it seems to simplify down to "following a moral code". I therefore suspect that you're implicitly assuming a moral code which equates "desirable" with "good" - I don't think that this is the best choice of a moral code, but it is a moral code that I've seen arguments in favour of before. But, importantly, it's not the only moral code. Someone who follows a different moral code can easily find something that is good but not desirable; or desirable but not good.

0g_pepper9y

Right. You said: Do you have an objective set of criteria for differentiating between true foundational desires and other types of desires? If not, I wonder if it is really useful to respond to an objection arising from the rather obvious fact that people often have conflicting desires by stating that you doubt that true foundational desires are ever in precise conflict. As CCC has already pointed out, no, it is not apparent that (morally) good and desirable are the same thing. I won’t spend more time on this point since CCC addressed it well. The issue that we are discussing is objective morals. Your equating goodness and desirability leads (in my example of the sadist) A to believe that hurting B is good, and B to believe that hurting B is not good. But moral realism holds that moral valuations are statements that are objectively true or false. So, conflicting desires is not a red herring, since conflicting desires leads (using your criterion) to subjective moral evaluations regarding the goodness of hurting B. Game theory on the other hand does appear to be a red herring – no application of game theory can change the fact that A and B differ regarding the desirability of hurting B. One additional problem with equating moral goodness with desirability is that it leads to moral outcomes that are in conflict with most people’s moral intuitions. For example, in my example of the sadist A desires to hurt B, but most people’s moral intuition would say that A hurting B just because A wants to hurt B would be immoral. Similarly, rape, murder, theft, etc., could be considered morally good by your criterion if any of those things satisfied a desire. While conflicting with moral intuition does not prove that your definition is wrong, it seems to me that it should at a minimum raise a red flag. And, I think that the burden is on you to explain why anyone should reject his/her moral intuition in favor of a moral criterion that would adjudge theft, rape and murder to be mo

0TheAncientGeek9y

It's not at all clear that morally good means desirable. The idea that the good is the desirable gets what force it has from the fact that "good" has a lot of nonmoral meanings. Good ice cream is desirable ice cream, but what's that got to do with ethics?

0entirelyuseless9y

Morally good means what it is good to do. So there is something added to "good" to get morally good -- namely it is what is good all things considered, and good to do, as opposed to good in other ways that have nothing to do with doing. If it would be good to eat ice cream at the moment, eating ice cream is morally good. And if it would be bad to eat ice cream at the moment, eating ice cream is morally bad. But when you say "good ice cream," you aren't talking about what it is good to do, so you aren't talking about morality. Sometimes it is good to eat bad ice cream (e.g. you have been offered it in a situation where it would be rude to refuse), and then it is morally good to eat the bad ice cream, and sometimes it is bad to eat good ice cream (e.g. you have already eaten too much), and then it is morally bad to eat the good ice cream.

0TheAncientGeek9y

That's a theory of what "morally" is adding to "good". You need to defend it against alternatives, rather than stating it as if it were obvious. Are you sure? How many people agree with that? Do you have independent evidence, or are you just following through the consequences of your assumptions (ie arguing in circles)?

0entirelyuseless9y

I think most people would say that it doesn't matter if you eat ice cream or not, and in that sense they might say it is morally indifferent. However, while I agree that it mainly doesn't matter, I think they are either identifying "non-morally obligatory" with indifferent here, or else taking something that doesn't matter much, and speaking as though it doesn't matter at all. But I think that most people would agree that gluttony is a vice, and that implies that there is an opposite virtue, which would mean eating the right amount and at the right time and so on. And eating ice cream when it is good to eat ice cream would be an act of that virtue. Would you agree that discussion about "morally good" is discussion about what we ought to do? It seems to me this is obviously what we are talking about. And we should do things that are good to do, and avoid doing things that are bad to do. So if "morally good" is about what we should do, then "morally good" means something it is good to do.

1TheAncientGeek9y

What is wrong with saying it doesn't matter at all? That's pretty much changing the subject. I think it is about what we morally ought to do. If you are playing chess, you ought to move the bishop diagonally, but that is again non-moral. We morally-should do what is morally good, and hedonistically-should do what is hedonistically-good, and so on. These can conflict, so they are not the same.

0entirelyuseless9y

Talking about gluttony and temperance was not changing the subject. Most people think that morally good behavior is virtuous behavior, and morally bad behavior vicious behavior. So that implies that gluttony is morally bad, and temperance morally good. And if eating too much ice cream can be gluttony, then eating the right amount can be temperance, and so morally good. There is a lot wrong with saying "it doesn't matter at all", but basically you would not bother with eating ice cream unless you had some reason for it, and any reason would contribute to making it a good thing to do. I disagree completely with your statements about should, which do not correspond with any normal usage. No one talks about "hedonistically should." To reduce this to its fundamentals: "I should do something" means the same thing as "I ought to do something", which means the same thing as "I need to do something, in order to accomplish something else." Now if we can put whatever we want for "something else" at the end there, then you can have your "hedonistically should" or "chess playing should" or whatever. But when we are talking about morality, that "something else" is "doing what is good to do." So "what should I do?" has the answer "whatever you need to do, in order to be doing something good to do, rather than something bad to do."

0TheAncientGeek9y

It's changing the subject because you are switching from an isolated act to a pattern of behaviour. Such as? You are using good to mean morally good again. You can't infer the non-existence of a distinction from the fact that it is not regularly marked in ordinary language. "Jade is an ornamental rock. The term jade is applied to two different metamorphic rocks that are composed of different silicate minerals: Nephrite consists of a microcrystalline interlocking fibrous matrix of the calcium, magnesium-iron rich amphibole mineral series tremolite (calcium-magnesium)-ferroactinolite (calcium-magnesium-iron). The middle member of this series with an intermediate composition is called actinolite (the silky fibrous mineral form is one form of asbestos). The higher the iron content, the greener the colour. Jadeite is a sodium- and aluminium-rich pyroxene. The precious form of jadeite jade is a microcrystalline interlocking growth of jadeite crystals." So you say. Actually, the idea that ethical claims can be cashed out as hypotheticals is quite contentious. Back to the usual problem. "What you morally-should do is whatever you need to do, in order to be doing something morally good" is true but vacuous. "What you morally-should do is whatever you need to do, in order to be doing something good" is debatable.

0entirelyuseless9y

The point about the words is that it is easy to see from their origins that they are about hypothetical necessity. You NEED to do something. You MUST do it. You OUGHT to do it, that is you OWE it and you MUST pay your debt. All of that says that something has to happen, that is, that it is somehow necessary. Now suppose you tell a murderer, "It is necessary for you to stop killing people." He can simply say, "Necessary, is it?" and then kill you. Obviously it is not necessary, since he can do otherwise. So what did you mean by calling it necessary? You meant it was necessary for some hypothesis. I agree that some people disagree with this. They are not listening to themselves talk. The reason that moral good means doing something good, is that the hypothesis that we always care about, is whether it would be good to do something. That gives you a reason to say "it is necessary" without saying for what, because everyone wants to do something that would be good to do. Suppose you define moral goodness to be something else. Then it might turn out that it would be morally bad to do something that would be good to do, and morally good to do something that would be bad to do. But in that case, who would say that we ought to do the thing which is morally good, instead of the thing that would be good to do? They would say we should do the thing that would be good to do, again precisely because it is necessary, and therefore we MUST do the supposedly morally bad thing, in order to be doing something good to do.

1TheAncientGeek9y

You are assuming that the only thing that counts as necessity per se is physical necessity, ie there is no physical possibility of doing otherwise. But moral necessity is more naturally cashed out as the claim that there is no permissible state of affairs in which the murderer can murder. http://www.hsu.edu/academicforum/2000-2001/2000-1AFThe%20Logic%20of%20Morality.pdf In less abstract terms, what we are saying is that morality does not work like a common-or-garden in-order-to-achieve-X-do-Y, because you cannot excuse yourself, or obtain permissibility, simply by stating that you have some end in mind other than being moral. Even without logical necessity, morality has social obligatoriness, and that needs to be explained, and a vanilla account in terms of hypothetical necessities in order to achieve arbitrary ends cannot do that. If the moral good were just a rubber-stamp of approval for whatever we have in our utility functions, there would be no need for morality as a behaviour-shaping factor in human society. Morality is not "do what thou wilt". In some sense of "good", but, as usual, an unqualified "good" does not give you plausible morality. It's tautologous that we morally-should do what is morally-good.

0entirelyuseless9y

The "no permissible state of affairs" idea is also hypothetical necessity: "you must do this, if we want a situation which we call permissible." As I think I have stated previously, the root of this disagreement is that you believe, like Eliezer, that reality is indifferent in itself. I do not believe that. In particular, I said that good things tend to make us desire them. You said I had causality reversed there. But I did not: I had it exactly right. Consider survival, which is an obvious case of something good. Does the fact that we desire something, e.g. eating food instead of rocks, make it into something that makes us survive? Or rather, is the fact that it makes us survive the cause of the fact that we desire it? It is obvious from how evolution works that the latter is the case and not the former. So the fact that eating food is good is the cause of the fact that we desire it. I said the basic moral question is whether it would be good to do something. You say that this is putting a "rubber-stamp of approval for whatever we have in our utility functions." This is only the case, according to your misunderstanding of the relationship between desire and good. Good things tend to make us desire them. But just because there is a tendency, does not mean it always works out. Things tend to fall, but they don't fall if someone catches them. And similarly good things tend to make us desire them, but once in a while that fails to work out and someone desires something bad instead. So saying "do whatever is good to do," is indeed morality, but it definitely does not mean "do whatever thou wilt." I don't care about "morally-should" as opposed to what I should do. I think I should do whatever would be good to do; and if that's different from what you call moral, that's too bad for you.

0TheAncientGeek9y

I still don't think you have made a good case for morality being hypothetical, since you haven't made a case against the case against. And I still think you need to explain obligatoriness. Survival is good, you say. If I am in a position to ensure my survival by sacrificing Smith, is it morally good to do so? After all Smith's survival is just as Good as mine. Doesn't-care is made to care. If you don't behave as though you care about morality, society will punish you. However, it won't punish you for failing to fulfil other shoulds.

0entirelyuseless9y

I didn't see any good case against morality being hypothetical, not even in that article. I did explain obligatoriness. It is obligatory to do something morally good because we don't have a choice about wanting to do something good. Everyone wants to do that, and the only way you can do that is by doing something morally good. I did say I do not care about morally-should "as opposed" to what I should do. It could sometimes happen that I should not do something because people will punish me if I do it. In other words, I do care about what I should do, and that is determined by what would be good to do.

0TheAncientGeek9y

From which it follows that nobody ever fails to do what is morally good, and that their inevitable moral goodness is the result of inner psychological compulsion, not outer systems of reward and punishment, and that no systems of reward and punishment were ever necessary. All of that is clearly false. Unless there are non-moral goods, which there clearly are, since there are immoral and amoral acts committed to obtain them.

0entirelyuseless9y

"From which it follows that nobody ever fails to do what is morally good" No, it does not, unless you assume that people are never mistaken about what would be good to do. I already said that people are sometimes mistaken about this, and think that it would be good to do something, when it would be bad to do it. In those cases they fail to do what is morally good. I agree there are non-moral goods, e.g. things like pleasure and money and so on. That is because a moral good is "doing something good", and pleasure and money are not doing anything. But people who commit immoral acts in order to obtain those goods, also believe that they are doing something good, but they are mistaken.

0rkyeun8y

I would be very surprised to find that a universe whose particles are arranged to maximize objective good would also contain unpaired sadists and masochists. You seem to be asking a question of the form, "But if we take all the evil out of the universe, what about evil?" And the answer is "Good riddance." Pun intentional.

0g_pepper8y

The problem is that neither you nor BrianPansky has proposed a viable objective standard for goodness. BrianPansky said that good is that which satisfies desires, but proposed no objective method for mediating conflicting desires. And here you said “Do remember that your thoughts and preference on ethics are themselves an arrangement of particles to be solved” but proposed no way for resolving conflicts between different people’s ethical preferences. Even if satisfying desires were an otherwise reasonable standard for goodness, it is not an objective standard, since different people may have different desires. Similarly, different people may have different ethical preferences, so an individual’s ethical preference would not be an objective standard either, even if it were otherwise a reasonable standard. No, I am not asking that. I am pointing out that neither your standard nor BrianPansky’s standard is objective. Therefore neither can be used to determine what would constitute an objectively maximally good universe nor could either be used to take all evil out of the universe, nor even to objectively identify evil.

2TheAncientGeek9y

Whose desires? The murderer wants to murder the victim, the victim doesn't want to be murdered. You have realism without objectivism. There is a realistic fact about people's preferences, but since the same act can increase one person's utility and reduce another's, there is no unambiguous way to label an arbitrary outcome.

0BrianPansky9y

Murder isn't a foundational desire. It's only a means to some other end. And usually isn't even a good way to accomplish its ultimate end! It's risky, for one thing. So usually it's a false desire: if they knew the consequences of this murder compared to all other choices available, and they were correctly thinking about how to most certainly get what they really ultimately want, they'd almost always see a better choice. (But even if it were foundational, not a means to some other end, you could imagine some simulation of murder satisfying both the "murderer"'s need to do such a thing and everyone else's need for safety. Even the "murderer" would have a better chance of satisfaction, because they would be far less likely to be killed or imprisoned prior to satisfaction.) Well first, in the most trivial way, you can unambiguously label an outcome as "good for X". If it really is (it might not be, after all, the consequences of achieving or attempting murder might be more terrible for the would-be murderer than choosing not to attempt murder). It works the same with (some? all?) other adjectives too. For example: soluble. Is sugar objectively soluble? Depends what you try to dissolve it in, and under what circumstances. It is objectively soluble in pure water at room temperature. It won't dissolve in gasoline. Second, in game theory you'll find sometimes there are options that are best for everyone. But even when there isn't, you can still determine which choices for the individuals maximize their chance of satisfaction and such. Objectively speaking, those will be the best choices they can make (again, that's what it means for something to be a good choice). And morality is about making the best choices.
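A minimal sketch of the game-theory claim above, not anything from the original comment: the payoff numbers, the two example games, and the helper names `best_for_everyone` and `best_response` are all made up for illustration. It shows a game where one outcome really is best for everyone, a game where no such outcome exists, and how a best choice for one individual can still be computed in either case.

```python
# Hypothetical two-player games. Each key is (choice of A, choice of B),
# each value is (payoff to A, payoff to B). All numbers are invented.
coordination = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 2),
    ("defect", "cooperate"): (2, 0),
    ("defect", "defect"): (1, 1),
}
conflict = {
    ("take", "take"): (-1, -1),
    ("take", "yield"): (2, 0),
    ("yield", "take"): (0, 2),
    ("yield", "yield"): (1, 1),
}


def best_for_everyone(payoffs):
    """Return the outcomes in which every player gets their own maximum payoff."""
    players = range(len(next(iter(payoffs.values()))))
    maxima = [max(p[i] for p in payoffs.values()) for i in players]
    return [
        outcome
        for outcome, p in payoffs.items()
        if all(p[i] == maxima[i] for i in players)
    ]


def best_response(payoffs, player, other_choice):
    """Return the choice maximizing `player`'s payoff, given the other player's choice."""
    other = 1 - player
    candidates = {o: p for o, p in payoffs.items() if o[other] == other_choice}
    best_outcome = max(candidates, key=lambda o: candidates[o][player])
    return best_outcome[player]


print(best_for_everyone(coordination))  # one outcome is best for both players
print(best_for_everyone(conflict))      # empty list: no outcome is best for both
print(best_response(conflict, 0, "yield"))
```

Note what the sketch does and does not settle: a best response is only "best for that player", so in the second game it says nothing by itself about which outcome should be labelled good overall.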

1TheAncientGeek9y

It can be instrumental or terminal, as can most other criminal impulses. You can't solve all ethical problems by keeping everyone in permanent simulation. That's no good. You can't arrive at workable ethics by putting different weightings on the same actions from different perspectives. X stealing money from Y is good for X and bad for Y, so why disregard Y's view? An act is either permitted or forbidden, punished or praised. You can't say it is permissible-for-X but forbidden-for-Y if it involves both of them. No, there's no uniform treatment of all predicates. Some are one-place, some are two-place. For instance, aesthetic choices can usually be fulfilled on a person-by-person basis. To be precise, you sometimes find solutions that leave everyone better off, and more often find solutions that leave the average person better off. Too vague. For someone who likes killing, to kill a lot of people is the best choice for them, but not the best ethical choice.

0gattsuru12y

It's quite possible that I'm below average, but I'm not terribly impressed by my own ability to extrapolate how other average people's morality works -- and that's with the advantage of being built on hardware that's designed toward empathy and shared values. I'm pretty confident I'm smarter than my cat, but it's not evident that I'm correct when I guess at the cat's moral system. I can be right, at times, but I can be wrong, too. Worse, that seems a fairly common matter. There are several major political discussions involving moral matters, where it's conceivable that at least 30% of the population has made an incorrect extrapolation, and probable that in excess of 60% has. And this only gets worse if you consider a time variant: someone who was as smart as the average individual in 1950 would have little problem doing some very unpleasant things to Alan Turing. Society (luckily!) has developed since then, but it has mechanisms for development and disposal of concepts that AIs do not necessarily have or we may not want them to have. ((This is in addition to general concerns about the universality of intelligence: it's not clear that the sort of intelligence used for scientific research necessarily overlaps with the sort of intelligence used for philosophy, even if it's common in humans.)) Well, the obvious problem with not walling off and making unupdateable the utility function is that the simplest way to maximize the value of a malleable utility function is to update it to something very easy. If you tell an AI that you want it to make you happy, and let it update that utility function, it takes a good deal less bit-twiddling to define "happy" as a steadily increasing counter. If you're /lucky/, that means your AI breaks down. If not, it's (weakly) unfriendly. You can have a higher-level utility function of "do what I mean", but not only is that harder to define, it has to be walled off, or you have "what I mean" redirected to a steadily increasing counter. And

-8TheAncientGeek12y

0bouilhet12y

Thanks for the reply, Robb. I've read your post and a good deal of the discussion surrounding it. I think I understand the general concern, that an AI that either doesn't understand or care about our values could pose a grave threat to humanity. This is true on its face, in the broad sense that any significant technological advance carries with it unforeseen (and therefore potentially negative) consequences. If, however, the intelligence explosion thesis is correct, then we may be too late anyway. I'll elaborate on that in a moment. First, though, I'm not sure I see how an AI "too dumb to understand human values" could program a superior general intelligence (i.e. an AI that is smart enough to understand human values). Even so, assuming it is possible, and assuming it could happen on a timescale and in such a way as to preclude or make irrelevant any human intervention, why would that change the nature of the superior intelligence from being, say, friendly to human interests, to being hostile to them? Why, for that matter, would any superintelligence (that understands human values, and that is "able to form deep and reliable abstractions about the world") be predisposed to any particular position vis-a-vis humans? And even if it were predisposed toward friendliness, how could we possibly guarantee it would always remain so? How, that is, having once made a friend, can we foolproof ourselves against betrayal? My intuition is that we can’t. No step can be taken without some measure of risk, however small, and if the step has potentially infinitely negative consequences, then even the very slightest of risks begins to look like a bad bet. I don’t know a way around that math. The genie, as you say, doesn't care. But also, often enough, the human doesn't care. He is constrained, of course, by his fellow humans, and by his environment, but he sometimes still manages (sometimes alone, sometimes in groups) to sow massive horror among his fellows, sometimes even in the na

1Kenny5y

Yes; good points! Do note that my original comment was made eight years ago! (At least – it was probably migrated from Overcoming Bias if this post is as early as it seems to be.) So I have had some time to think along these lines a little more :) But I don't think intelligence itself can lead one to conclude as you have: It's not obvious to me now that any particular distinction will be made by any particular intelligence. There's maybe not literally infinite, but still a VAST number of possible ontologies with which to make distinctions. The general class of 'intelligent systems' is almost certainly WAY more alien than we can reasonably imagine. I don't assume that even a 'super-intelligence' would definitely ever "differentiate between smiley-faces and happy people". But I don't remember this post that well, and I was going to re-read before I remembered that I didn't even know what I was originally replying to (as it didn't seem to be the post itself), and re-constructing the entire context to write a better reply which my temporal margin "is too narrow to contain" at the moment. But I think I still disagree with whatever Shane wrote!

[-]Carl_Shulman17y60

Tim,

"A utility function measured in dollars seems fairly unambiguous."

Oy vey.

http://en.wikipedia.org/wiki/Hyperinflation

[+]Hopefully_Anonymous17y-52

[-]Eliezer Yudkowsky17y140

Shane, again, the issue is not differentiation. The issue is classification. Obviously, tiny smiley faces are different from human smiling faces, but so is the smile of someone who had half their face burned off. Obviously a superintelligence knows that this is an unusual case, but that doesn't say if it's a positive or negative case.

Deep abstractions are important, yes, but there is no unique deep abstraction that classifies any given example. An apple is a red thing, a biological artifact shaped by evolution, and an economic resource in the human market.

Also, Hibbard spoke of using smiling faces to reinforce behaviors, so if a superintelligence would not confuse smiling faces and happiness, that works against that proposal - because it means that the superintelligence will go on focusing on smiling faces, not happiness.

Retired Urologist, one of the most important lessons that a rationalist learns is not to try to be clever. I don't play nitwit games with my audience. If I say it, I mean it. If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.

1azergante9mo

Since the ASI knows this is an unusual case, can it do some exception handling (like asking a human) instead of executing the normal path? Why only positive or negative? Some classifiers have an "out-of-distribution" category, for example One-Class SVM; using several of them should handle multiple classes. Perhaps this is also doable with any other latent feature spaces (transformers?) using a threshold distance to limit categories and labeling the remaining space as the "out-of-distribution" category. ---------------------------------------- The main AI-categorization issue I see is that humans might care about a dimension of the data that is entirely missing from the latent space of the AI. In that case the AI is literally unable to tell the difference between two inputs that are not the same in a way that matters to us (like a color-blind AI). If that issue occurs with a sample from the "human smiling faces" category and a sample from the "tiny smiley faces" category, it means the AI classifies both as "human smiling faces" because in its latent space it lacks the feature dimensions necessary to tell the difference. So the AI keeps optimizing for what it thinks are "human smiling faces", but from our point of view it optimizes for both "human smiling faces" and "tiny smiley faces". Crucially, I do not think the AI starts optimizing only for "tiny smiley faces". Remember, it cannot tell the difference between the two categories! So it has no way to optimize for only one. It also does not yet know whether one category is easier to optimize for than the other, because as soon as it knows, that is an additional dimension in feature space that separates the two into distinct categories. Diagram to Clarify my Mental Model in a Hypothetical Scenario During training the AI only encounters small black points, so it learns to classify based on 2 dimensions (x and y coordinates) into 3 categories (positive, negative, and out-of-distribution based on distance). Then
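The threshold-distance idea and the "missing dimension" worry in the comment above can be made concrete with a small sketch. This is not the commenter's model and not a One-Class SVM: it is a hypothetical nearest-centroid classifier in plain Python, where the training points, the `max_distance` value, and the third "real person" feature are all invented for illustration.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(x, centroids, max_distance):
    """Nearest-centroid label, or 'out-of-distribution' if nothing is close."""
    best_label, best_dist = None, math.inf
    for label, c in centroids.items():
        d = math.dist(x, c)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_distance else "out-of-distribution"

def perceive(sample):
    """Drop the third feature: this model's latent space simply lacks it."""
    return sample[:2]

# Hypothetical 2-D training data (x and y coordinates only).
training = {
    "positive": [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2)],
    "negative": [(-1.0, -1.0), (-1.2, -0.8), (-0.8, -1.2)],
}
centroids = {label: centroid(pts) for label, pts in training.items()}

# A far-away point falls outside every category.
print(classify((8.0, -7.0), centroids, max_distance=1.0))

# Two samples that differ only in a feature the model never sees
# (an assumed third coordinate standing for "is a real person").
human_smile = (1.0, 1.0, 1.0)
tiny_smiley = (1.0, 1.0, 0.0)
print(classify(perceive(human_smile), centroids, max_distance=1.0))
print(classify(perceive(tiny_smiley), centroids, max_distance=1.0))
```

In this toy, the distance threshold catches inputs that are unusual along dimensions the model represents, but it cannot flag the two samples that differ only along a dimension the model lacks, which is the distinction the comment draws.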

[-]steven17y00

There's a Hibbard piece from January 2008 in JET, but I'm not sure if it's new or if Eliezer has seen it: http://jetpress.org/v17/hibbard.htm

[-]retired_urologist17y10

@EY: If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.

That's what I meant when I wrote: "By making his posts quirky and difficult to understand". Sorry. Should have been more precise.

@HA: perhaps you know the parties far better than I. I'm still looking.

[-]Shane_Legg17y61

I mean differentiation in the sense of differentiating between the abstract categories. Is half a face that appears to be smiling while the other half is burned off still a "smiley face"? Even I'm not sure.

I'm certainly not arguing that training an AGI to maximise smiling faces is a good idea. It's simply a case of giving the AGI the wrong goal.

My point is that a super intelligence will form very good abstractions, and based on these it will learn to classify very well. The problem with the famous tank example you cite is that they were train... (read more)

0DilGreen15y

Surely the discussion is not about the issue of whether an AI will be able to be sophisticated in forming abstractions - if it is of interest, then presumably it will be. But the concern discussed here is how to determine beforehand that those abstractions will be formed in a context characterised here as Friendly AI. The concern is to pre-ordain that context before the AI achieves superintelligence. Thus the limitations of communicating desirable concepts apply.

0timtyler15y

Hopefully. Assuming server-side intelligence, the machine may initially know a lot about text, a reasonable amount about images, and a bit about audio and video. Its view of things is likely to be pretty strange - compared to a human. It will live in cyberspace, and for a while may see the rest of the world through a glass, darkly.

3Martin Randall1y

There are some excellent predictions in this thread. We have here some "natural abstraction hypothesis" and some "mechanistic interpretability".

[-]Chris_Hibbert17y70

I read most of the interchange between EY and BH. It appears to me that BH still doesn't get a couple of points. The first is that smiley faces are an example of misclassification and it's merely fortuitous to EY's ends that BH actually spoke about designing an SI to use human happiness (and observed smiles) as its metric. He continues to speak in terms of "a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human." EY's poin... (read more)

[-]JulianMorrison17y10

Even if by impossible luck he gets an AI that actually is a valid-happiness maximizer, he would still screw up. The AI would rampage out turning the galaxy into a paradise garden with just enough tamed-down monsters to keep us on our toes... but it would obliterate those sorts of utility that extend outside happiness, and probably stuff a cork in apotheosis. An Eden trap - a sort of existential whimper.

[-]Eliezer Yudkowsky17y90

Shane: I mean differentiation in the sense of differentiating between the abstract categories.

The abstract categories? This sounds like a unique categorization that the AI just has to find-in-the-world. You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories.

[-]DanB17y11

@Eliezer - I think Shane is right. "Good" abstractions do exist, and are independent of the observer. The value of an abstraction relates to its ability to allow you to predict the future. For example, "mass" is a good abstraction, because when coupled with a physical law it allows you to make good predictions.

If we assume a superintelligent AI, we have to assume that the AI has the ability to discover abstractions. Human happiness is one such abstraction. Understanding the abstraction "happiness" allows one to predict certain... (read more)

1DilGreen15y

Whether or not the AI finds the abstraction of human happiness to be pertinent, and whether it considers increasing it to be worthwhile sacrificing other possible benefits for, are unpredictable, unless we have succeeded in achieving EY's goal of pre-destining the AI to be Friendly.

[-]Allan_Crossman17y10

Plato had a concept of "forms". Forms are ideal shapes or abstractions: every dog is an imperfect instantiation of the "dog" form that exists only in our brains.

Mmm. I believe Plato saw the forms as being real things existing "in heaven" rather than merely in our brains. It wasn't a stupid theory for its day; in particular, a living thing growing into the right shape or form must have seemed utterly mysterious, and so the idea that some sort of blueprint was laid out in heaven must have had a lot of appeal.

But anyway, forms as... (read more)

[-]Manuel_Moertelmaier17y00

In contrast to Eliezer I think it's (remotely) possible to train an AI to reliably recognize human mind states underlying expressions of happiness. But this would still not imply that the machine's primary, innate emotion is unconditional love for all humans. The machines would merely be addicted to watching happy humans.

Personally, I'd rather not be an object of some quirky fetishism.

Monty Python has, of course, realized it long ago:

http://www.youtube.com/watch?v=HoRY3ZjiNLU http://www.youtube.com/watch?v=JTMXtJvFV6E

[-]DanB17y00

@AC

I mean that a superintelligent AI should be able to induce the Form of the Good from extensive study of humans, human culture, and human history. The problem is not much different in principle from inducing the concept of "dog" from many natural images, or the concept of "mass" from extensive experience with physical systems.

[+]Tim_Tyler17y-50

[-]Carl_Shulman17y40

"Wealth then. Wealth measures access to resources - so convert to gold, silver, barrels of oil, etc to measure it - if you don't trust your country's currency."

I may not have gotten the point across. An AI aiming to maximize its wealth in U.S. dollars can do astronomically better by taking control of the Federal Reserve (if dollars are defined in its utility function as being issued by the Reserve, with only the bare minimum required to meet that definition being allowed to persist) and having it start issuing $3^^^3 bills than any commercial act... (read more)

[-]Tim_Tyler17y-40

Re: Creating an oil bank that issues oil vouchers in numbers astronomically exceeding its reserves could let an AI possess 3^^^3 account units each convertible to a barrel of oil.

No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI.

This is an example of the wirehead fallacy framed in economic terms. As Omohundro puts it, "AIs will try to prevent counterfeit utility".

[-]Carl_Shulman17y20

"No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI."

The oil bank stands ready to exchange any particular voucher for a barrel of oil, so if the utility function refers to the values of particular items, they can all have that market price. Compare with the price of gold or some other metal traded on international commodity markets. The gold in Fort Knox is often valued at the market price per ounce of gold multiplied by the number of ounces present, but in fact yo... (read more)

[+]Tim_Tyler17y-50

[-]Phil_Goetz517y-10

There are several famous science fiction stories about humans who program AIs to make humans happy, which then follow the letter of the law and do horrible things. The earliest is probably "With Folded Hands", by Jack Williamson (1947), in which AIs are programmed to protect humans, and they do this by preventing humans from doing anything or going anywhere. The most recent may be the movie "I, Robot."

I agree with E's general point - that AI work often presupposes that the AI magically has the same concepts as its inventor, even outsi... (read more)

[-]RobinHanson17y50

I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.

[-]prase17y10

The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier.

The point is, probably, that humans tend to underestimate the complexity of classifiers they use. The categories like "good" are not only difficult to precisely define, they are difficult to define at all, because they are too complicated to be formulated in words. To point out that in classification we use structures based on the architectu... (read more)
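A toy version of the tank example mentioned elsewhere in this thread may help show what "all of them satisfy the training data completely" means in practice. The photos, feature names, and the 0.5 brightness cutoff below are invented for illustration; this is a sketch of the general situation, not a reconstruction of any real experiment.

```python
# Hypothetical training set: every tank photo happens to be dark,
# every tank-free photo happens to be bright.
training = [
    ({"tank_shape": True, "brightness": 0.2}, True),
    ({"tank_shape": True, "brightness": 0.3}, True),
    ({"tank_shape": False, "brightness": 0.8}, False),
    ({"tank_shape": False, "brightness": 0.9}, False),
]

def by_shape(photo):
    """The rule the trainers intended: is there a tank in the picture?"""
    return photo["tank_shape"]

def by_darkness(photo):
    """An unintended rule that the data supports equally well."""
    return photo["brightness"] < 0.5

def fits(classifier, data):
    """True if the classifier reproduces every training label."""
    return all(classifier(x) == label for x, label in data)

print(fits(by_shape, training), fits(by_darkness, training))

# A case outside the training distribution: a tank on a sunny day.
sunny_tank = {"tank_shape": True, "brightness": 0.9}
print(by_shape(sunny_tank), by_darkness(sunny_tank))
```

Both rules fit the training set perfectly and disagree on the new photo. In this particular toy the two rules are about equally short, so a complexity criterion would need some further assumption about representation before it could prefer one over the other.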

[-]Caledonian217y10

Look: humans can learn what a 'tank' is, and can direct their detection activities to specifically seek them - not whether the scene is light or dark, or any other weird regularity that might be present in the test materials. We can identify the regularities, compare them with the properties of tanks, and determine that they're not what we're looking for.

If we can do it, the computers can do it as well. We merely need to figure out how to bring it about - it's an engineering challenge only. That doesn't dismiss or minimize the difficulty of achieving i... (read more)

[-]Sean_C.17y90

Animal trainers have this problem all the time. The animal performs behavior 'x' and gets a reward. But the animal might have been doing other subtle behaviors at the same time, and map the reward to 'y'. So instead of reinforcing 'x', you might be reinforcing 'y'. And if 'x' and 'y' are too close for you to tell apart, then you'll be in for a surprise when your perspective and context change, and the difference becomes more apparent to you. And you find out that the bird was trained to peck anything that moves, instead of just the bouncy red ball or somethi... (read more)

[-]Shane_Legg17y31

"You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories."

Yes, I believe categories of things do exist in the world in some sense, due to structure that exists in the world. I've seen thousands of things that were referred to as "smiley faces" and so there is an abstraction for this category of things in my brain. You have done likewise. While we can agree about many th... (read more)

[-]Yvain217y100

IMHO, the idea that wealth can't usefully be measured is one which is not sufficiently worthwhile to merit further discussion.

The "wealth" idea sounds vulnerable to hidden complexity of wishes. Measure it in dollars and you get hyperinflation. Measure it in resources, and the AI cuts down all the trees and converts them to lumber, then kills all the animals and converts them to oil, even if technology had advanced beyond the point of needing either. Find some clever way to specify the value of all resources, convert them to products and allocate them to humans in the level humans want, and one of the products will be highly carcinogenic because the AI didn't know humans don't like that. The only way to get wealth in the way that's meaningful to humans without humans losing other things they want more than wealth is for the AI to know exactly what we want as well or better than we do. And if it knows that, we can ignore wealth and just ask it to do what it knows we want.

"The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the cl... (read more)

[-]Grant17y00

I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.

I find the idea of an AI God rather scary. However, unless private AIs are made illegal or heavily regulated, is there much danger of one AI ruling all the lesser intelligences?

[-]Hopefully_Anonymous17y-10

"I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider."

I agree with Robin. Although I'm disappointed that he thinks he lacks an adequate forum to pound the podium on this more forcefully.

[-]Eliezer Yudkowsky17y30

Robin and I have discussed this subject in person and got as far as narrowing down considerably the focus of the disagreement. Robin probably doesn't disagree with me at the point you would expect. Godlike powers, sure, nanotech etc., but Robin expects them to be rooted in a whole economy, not concentrated in a single brain like I expect. No comfort there for those attached to Life As We Know It.

However, I've requested that Robin hold off on discussing his disagreement with me in particular (although of course he continues to write general papers on the cosmic commons and exponential growth modes) until I can get more material out of the way on Overcoming Bias. This is what Robin means by "proper timing".

[-]Eliezer Yudkowsky17y40

Shane, I think we agree on essential Bayesian principles - there's structure that's useful for generic prediction, which is sensitive only to the granularity of your sensory information; and then there's structure that's useful for decision-making. In principle, all structure worth thinking about is decision-making structure, but in practice we can usually factor out the predictive structure just as we factor out probabilities in decision-making.

But I would further say that decision-making structure can be highly sensitive to terminal values in a way that... (read more)

[-]Tom_Breton_(Tehom)17y00

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.

Or more generally, not just a binary classification problem but a measurement issue: How to measure benefit to humans or human satisfaction.

It has sometimes struck me that this FAI requirement has a lot i... (read more)

[-]Jadagul17y30

Shane, the problem is that there are (for all practical purposes) infinitely many categories the Bayesian superintelligence could consider. They all "identify significant regularities in the environment" that "could potentially become useful." The problem is that we as the programmers don't know whether the category we're conditioning the superintelligence to care about is the category we want it to care about; this is especially true with messily-defined categories like "good" or "happy." What if we train it to d... (read more)

[-]Lightwave217y00

I wonder if you'd consider a superintelligent human to have the same flaws as a superintelligent AI (and to eventually destroy the world). What about a group of superintelligent humans (assuming they have to cooperate in order to act)?

[-]Aaron617y10

Eliezer: Have you read Scott Aaronson's work on the learnability of quantum states? There, the full space is doubly exponential in system size, but if we just want to predict the results of some set of possible questions (to some fixed accuracy), we don't need to train with nearly as many questions as one might think.

[-]Ben_Jones17y00

But it illustrates the general idea: the potential poison, in interacting with the complicated human machine, takes on a complicated boundary that doesn't match the grain of any local boundaries you would draw around substances.

Compared to 'actions that are right', even 'poisons' seems like a pretty obvious boundary to draw. Where's the grain around 'right'? Unlucky for Eliezer, we seem to find some pretty bizarre boundaries 'useful'.

[-]Tim_Tyler17y-10

Re: One god to rule us all

It does look as though there is going to be one big thing out there. It looks as though it will be a more integrated and unified entity than any living system up to now - and it is unlikely to be descended from today's United Nations - e.g. see:

Kevin Kelly: Predicting the next 5,000 days of the web

It seems rather unlikely that the Monopolies and Mergers Commission will be there to stop this particular global unification.

[-]Shane_Legg17y10

Eli, to my mind you seem to be underestimating the potential of a super intelligent machine.

How do I know that hemlock is poisonous? Well, I've heard the story that Socrates died by hemlock poisoning. This is not a conclusion that I've arrived at due to the physical properties of hemlock that I have observed and how this would affect the human body, indeed, as far as I know, I've never even seen hemlock before. The idea that hemlock is a poison is a pattern in my environment: every time I hear about the trial of Socrates I hear about it being the poison... (read more)

[-]Eliezer Yudkowsky17y70

Shane, I think you're underestimating the idiosyncrasy of morality. Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.
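For concreteness, the computation described above is indeed only a few lines, which is the point: being easy to compute does not make a category interesting. The sketch below assumes one reading of "two square factors", namely that at least two distinct primes p have p² dividing the sum; the comment does not pin this down, and the function names are invented here.

```python
def ascii_sum(sentence):
    """Sum of the ASCII codes of the sentence's characters."""
    return sum(sentence.encode("ascii"))

def square_prime_factors(n):
    """Primes p such that p*p divides n, found by trial division."""
    found = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            count = 0
            while n % p == 0:
                n //= p
                count += 1
            if count >= 2:
                found.append(p)
        p += 1
    return found

def has_two_square_factors(sentence):
    """Assumed reading: at least two distinct primes divide the sum squared."""
    return len(square_prime_factors(ascii_sum(sentence))) >= 2

print(has_two_square_factors("This sentence is false."))
```

Nothing about the sentence's meaning enters this predicate, yet it partitions all sentences just as cleanly as any category a person would actually care about.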

Suppose that, driving along the highway, I see someone rid... (read more)

[-]Shane_Legg17y20

Eli, I've been busy fighting with models of cognitive bias in finance and only just now found time to reply:

Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.

Sure, this property of... (read more)

[-]Eliezer Yudkowsky17y70

Shane, religious fundamentalists routinely act based on their beliefs about God. Do you think that makes "God" a natural category that any superintelligence would ponder? I see "human thoughts about God" and "things that humans justify by referring to God" and "things you can get people to do by invoking God" as natural categories for any AI operating on modern Earth, though an unfriendly AI wouldn't give it a second thought after wiping out humanity. But to go from here to reasoning about what God would actually ... (read more)

[-]Kragen_Javier_Sitaker217y220

It's worth pointing out that we have wired-in preferences analogous to those Hibbard proposes to build into his intelligences: we like seeing babies smile; we like seeing people smile; we like the sweet taste of fresh fruit; we like orgasms; many of us (especially men) like the sight of naked women, especially if they're young, and they sexually arouse us to boot; we like socializing with people we're familiar with; we like having our pleasure centers stimulated; we don't like killing people; and so on.

It's worth pointing out that we engage in a lot of face-xeroxing-like behavior in pursuit of these ends. We keep photos of our family in our wallets, we look at our friends' baby photos on their cellphones, we put up posters of smiling people; we eat candy and NutraSweet; we masturbate; we download pornography; we watch Friends on television; we snort cocaine and smoke crack; we put bags over people's heads before we shoot them. In fact, in many cases, we form elaborate, intelligent plans to these ends.

It doesn't matter that you know, rationally, that you aren't impregnating Jenna Jameson, or that the LCD pixels on the cellphone display aren't a real baby, that Caffeine Free Diet C... (read more)

-1EniScien4y

An expression of absolute horror. Saved this comment to my favourites. It's just so mundane that we don't even think about it. But if such "simple ways" worked, then there would be no problem of obesity from eating delicious sweets.

[-]Tim_Tyler17y-30

Re: One of the more obvious roadmaps to creating AI involves the stock market waking up.

I've fleshed this comment out into an essay on the topic: http://alife.co.uk/essays/the_awakening_marketplace/

[-]Shane_Legg17y-20

Eli,

Do you think that makes "God" a natural category that any superintelligence would ponder?

Yes. If you're a super intelligent machine on a mission there is very little that can stop you. You know that. About the only thing that could stop you would be some other kind of super intelligent entity, maybe an entity that created the universe. A "God" of some description. Getting the God question wrong could be a big mistake, and that's reason enough for you to examine the possibility.

[-]Eliezer Yudkowsky17y30

I don't consider such as Gods, as they are not supernatural and not ontologically distinct from creatures; they are simply powerful aliens or Matrix Lords. So I'll phrase it more precisely. Lots of humans talk about Jehovah. Does that make Jehovah a natural category? Or is only "human talk about Jehovah" a natural category? Do you ponder what Jehovah would do, or only what humans might think Jehovah would do?

[-]DilGreen15y00

So many of the comments here seem designed to illustrate the extreme difficulty, even for intelligent humans interested in rationality, and trying hard to participate usefully in a conversation about hard-edged situations of perceived non-trivial import, to avoid fairly simplistic anthropomorphisms of one kind or another.

Saying, of a supposed super-intelligent AI - one that works by being able to parallel, somehow, the 'might as well be magic' bits of intelligence that we currently have at best a crude assembly of speculative guesses for - any version of "of course, it would do X", seems - well - foolish.

[-]taryneast15y-10

Ok, so, trying on my understanding of this post: I guess that a smiling face should only reinforce something if it also leads to the "human happiness" goal... (which would be harder to train for).

I think I can see what Hibbard may have been trying for - in feeling that a smiley face might be worth training for as a first-step towards training for the actual, real goal... depending on how training a "real" AI would proceed.

As background, I can compare against training lab rats to perform complicated processes before getting their "r... (read more)

0TheOtherDave15y

Right. Unless it turns out that happiness isn't what we would have chosen, either. In which case perhaps discarding the "human happiness" goal and teaching it to adopt a "what humans would have chosen" goal works better? Unless it turns out that what humans would have chosen involves being fused into glass at the bottoms of smoking craters. In which case perhaps a "what humans ought to have chosen" goal works better? Except now we've gone full circle and are expecting the AI to apply a nonhuman valuation, which is what we rejected in the first place. I haven't completely followed the local thinking on this subject yet, but my current approximation of the local best answer goes "Let's assume that there is a way W for the world to be, such that all humans would prefer W if they were right-thinking enough, including hypothetical future humans living in the world according to W. Further, let's assume the specifications of W can be determined from a detailed study of humans by a sufficiently intelligent observer. Given those assumptions, we should build a sufficiently intelligent observer whose only goal is to determine W, and then an optimizing system to implement W."

1taryneast15y

Hmmm, I can foresee many problems with guessing what humans "ought" to prefer. Even humans have got that one wrong pretty much every time they've tried. I'd say a "better" goal might be cast as "increasing the options available to most humans (not at the expense of the options of other humans)". This goal seems compatible with allowing humans to choose happier lifestyles, but without forcing them into any particular lifestyle that they may not consider to be "better". It would "work" by concentrating on things like extending human lifespans and finding better medical treatments for things that limit human endeavour. However, this is just a guess... and I am still only a novice here... which means I am in no way capable of figuring out how I'd actually go about training an AI to accept the above goal. All I know is that I agree with Eliezer's post that the lab-rat method would be sub-optimal, as it has a high propensity to fall into pathological configurations.

[-]David Althaus15y90

Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.

I have to confess that at first glance this statement seems arrogant. But then I actually read some stuff on this AGI mailing list and, well, I was filled with horror after reading threads like this one:

Here is one of the most ridiculous passages:

Note that we may not have perfected this process, and further, that this process need not be perf

... (read more)

[-]elspood15y20

Can anyone please explain the reference to the horror seen firsthand at http://www.mail-archive.com/agi@v2.listbox.com/? I tried going back in the archives to see if something happened in August 2008 or earlier (the date of Eliezer's post), but the list archive site doesn't have anything older than October 2008 currently. My curiosity is piqued and I need closure on the anecdote. If nothing else, others might benefit from knowing what horrors might be avoided during AGI research.

0saturn15y

I think Eliezer is referring to the high ratio of posts by M-ntif-x and similar kooks.

[-]thomblake14y40

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

Probably apocryphal. I haven't been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.

3gwern14y

I poked around in Google Books; the earliest clear reference I found was the 2000 Cartwright book Intelligent data analysis in science, which seems to attribute it to the TV show Horizon. (No further info - just snippet view.)

5thomblake14y

Here is one supposedly from 1998, though it's hardly academic.

[-]gwern11y100

A Redditor provides not one but two versions from "Embarrassing mistakes in perceptron research", Marvin Minsky, recorded 29-31 Jan 2011:

Like I had a friend in Italy who had a perceptron that looked at a visual... it had visual inputs. So, he... he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron - a big machine - that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of... of pieces of music just by looking at the lower left... lower right hand corner of the page. So, I told this to the... to our scientist friend and he went through the data and he said: 'You guessed right. That's... that's how

... (read more)

2gwern8y

Another version is provided by Ed Fredkin via Eliezer Yudkowsky in http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/. This is still not a source, because it's a recollection 50 years later and so highly unreliable. Even at face value, all Fredkin did was suggest that the NN might have picked up on a lighting difference; this is not proof that it did, much less proof of all the extraneous details of how they had 50 photos in this set and 50 in that and then the Pentagon deployed it and it failed in the field (and what happened to it being set in the 1980s?). Classic urban legend/myth behavior: accreting plausible entertaining details in the retelling.

6gwern8y

I've compiled and expanded all the examples at https://www.gwern.net/Tanks

[-]PhilGoetz14y-10

I was surprised that the post focused on the difficulty of learning to classify things, rather than on the problems that would arise assuming the AI learned to classify smiling humans correctly. I'm not worried that the AI will tile the universe with smiley-faces. I'm worried the AI will tile the universe with smiling humans. Even with genuinely happy humans.

Humans can classify humans into happy and unhappy pretty well; superintelligent AI will be able to also. The hard problem is not identifying happiness; the hard problem is deciding what to maximize.

[-]timtyler14y30

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

This document has a citation for the story: (Skapura, David M. and Peter S. Gordon, Building Neural Networks, Addison-Wesley, 1996.) I don't know for sure if that is the end of the trail or not.

3gwern14y

No page number, unfortunately. Not in library.nu; closest copy to me was in the New York Public Library. I then looked in Google Books (http://books.google.com/books?id=RaRbNBqGR1oC). Neither of the 2 hits for 'tanks' seemed to be relevant; ditto for 'clear'. No hits for 'cloudy' or 'skies' or 'enemy'; there's one hit for 'sky', pg 206, where it talks about a plane recognition system that worked well until the plane moved close to the ground and then became confused because it had only learned to find 'the darkest section in the image'. EDIT: https://www.gwern.net/Tanks

0timtyler14y

The bottom of page 199 seems to be about "classifying military tanks in SAR imagery". It goes on to say it is only interested in "tank" / "non-tank" categories.

4pedanterrific14y

Discussed here; there are a few bits that might be useful.

[-]MugaSofer13y7-1

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

That's a much more convincing and vivid image than "molecular smiley faces". Makes a more general point, too. Shame you didn't use that the first time, really.

[-]Martin Randall1y*112

I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

And then in the 2020s it turned out to be the case again! E.g., ChessGPT. Today I learned that Stockfish is now a neural network (trained on board positions, not move sequences).

~~This in no way cuts against the point of this post, but it stood out when I read this 16 years after it was posted.~~

6TurnTrout1y

It does cut against the point of the post. He was wrong in a way that pertains to the key point. He makes fun of "magical categories" as "simple little words that turn out to carry all the desired functionality of the AI", but turns out those "simple little words" actually work. Lol. In this post, you can also see the implicit reliance on counting arguments against good generalization (e.g. "superexponential conceptspace"). Those arguments are, evidently, wrong - or at least irrelevant. He fell into the standard statistical learning theoretic trap of caring about e.g. VC dimension since he was too pessimistic about inductive biases. I'll wager that an LLM won't get this one wrong. goes to check - yup, it didn't:

[-]Paul Crowley1y117

In this instance the problem the AI is optimizing for isn't "maximize smiley faces", it's "produce outputs that human raters give high scores to". And it's done well on that metric, given that the LLM isn't powerful enough to subvert the reward channel.

4Zack_M_Davis1y

This isn't a productive response to TurnTrout in particular, who has written extensively about his reasons for being skeptical that contemporary AI training setups produce reward optimizers (which doesn't mean he's necessarily right, but the parent comment isn't moving the debate forward).

[-]Paul Crowley1y114

I'm not quite seeing how this negates my point, help me out?

Eliezer sometimes spoke of AIs as if they had a "reward channel"
But they don't; instead they are something a bit like "adaptation executors, not fitness maximizers"
This is potentially an interesting misprediction!
Eliezer also said that if you give the AI the goal of maximizing smiley faces, it will make tiny molecular ones
TurnTrout points out that if you ask an LLM if that would be a good thing to do, it says no
My point is that this is exactly what Eliezer would have predicted for an LLM whose reward channel was "maximize reader scores"
Our LLMs tend to produce high reader scores for a reason that's not exactly "they're trying to maximize their reward channel"
I don't at all see how this difference makes a difference! Eliezer would always have predicted that an AI aimed at maximizing reader scores would have produced a response to TurnTrout's question that maximized reader scores, so it's silly to present them doing so as a gotcha!

7Martin Randall1y

This article does not predict that LLM behavior. Here's another quote from it:

Here, the category boundary you are describing is "outputs that human raters give high scores to". That is a complex category of human values. This is squarely in both "formal fallacies" described by the article, the fallacy of "underestimating the complexity of a concept we develop for the sake of its value" and the fallacy of "anthropomorphic optimism". My reading is that, if this article is correct, then an AI trained to "produce outputs that human raters give high scores to" will instead produce out-of-distribution text that fits the category the AI learned, and not the category we wanted the AI to learn, especially when placed in novel situations. Less like Claude, more like Sydney and Bing.

You apparently have the opposite reading to me. I don't see it, at all.

----------------------------------------

I think TurnTrout's point is that in order for the AI to succeed at the "magical category" pointed at by the words "outputs that human raters give high scores to", it has to also have learned the strictly easier "unnatural category" pointed at by the words "making people smile". And the results show that it has learned that.

7Paul Crowley1y

Not being able to figure out what sort of thing humans would rate highly isn't an alignment failure, it's a capabilities failure, and Eliezer_2008 would never have assumed a capabilities failure in the way you're saying he would. He is right to say that attempting to directly encode the category boundaries won't work. It isn't covered in this blog post, but his main proposal for alignment was always that as far as possible, you want the AI to do the work of using its capabilities to figure out what it means to optimize for human values rather than trying to directly encode those values, precisely so that capabilities can help with alignment. The trouble is that even pointing at this category is difficult - more difficult than pointing at "gets high ratings".

6Zack_M_Davis1y

I think we probably don't disagree much; I regret any miscommunication. If the intent of the great-grandparent was just to make the narrow point that an AI that wanted the user to reward it could choose to say things that would lead to it being rewarded, which is compatible with (indeed, predicts) answering the molecular smiley-face question correctly, then I agree. Treating the screenshot as evidence in the way that TurnTrout is doing requires more assumptions about the properties of LLMs in particular. I read your claims regarding "the problem the AI is optimizing for [...] given that the LLM isn't powerful enough to subvert the reward channel" as taking as given different assumptions about the properties of LLMs in particular (viz., that they're reward-optimizers) without taking into account that the person you were responding to is known to disagree.

2Noosphere891y

I'll also say to the extent they are optimizing in a utility-maximizing sense, it's about predicting correctly about the whole world, not a reward function in the traditional sense (though they probably do have more learned utility functions/values as a part of that), so Paul Crowley is still wrong here.
