Why can't the AI just be exactly the same as Hibbard? If Hibbard is flawed in a major way, you could make an AI for every person on Earth (this obviously wouldn't be practical, but if a few million AI's are bad the other few billion can deal with them).
We already have an entity exactly the same as Hibbard. Namely: Hibbard. Why do we need another one?
What we want is an AI that's far more intelligent than a human, yet shares their values. Increasing intelligence while preserving values is nontrivial. You could try giving Hibbard the ability to self-modify, but then he'd most likely just go insane in some way or another.
"Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers."
You use examples of this type fairly often, but for a utility function linear in smiles wouldn't the number of smiles generated by pleasing the programmers be trivial relative to the output of even a little while with access to face-xeroxing? This could be partly offset by anthropic/simulation issues, but still I would expect the overwhelming motive for appearing to work correctly during childhood (after it could recognize this point) would be tricking the programmers, not the tiny gains from their smiles.
For instance, a weak AI might refrain from visibly trying to produce smiles in disturbing ways as part of an effort (including verbal claims) to convince the programmers that it had apprehended the objective morality behind their attempts to inculcate smiles as a reinforcer.
Early AIs are far more likely to be built to maximise the worth of the company that made them than anything to do with human hapiness. E.g. see: Artificial intelligence applied heavily to picking stocks
A utility function measured in dollars seems fairly unambiguous.
A utility function measured in dollars seems fairly unambiguously to lead to decisions that are non-optimal for humans, without a sophisticated understanding of what dollars are.
Dollars mean something for humans because they are tokens in a vast, partly consensual and partially reified game. Economics, which is our approach to developing dollar maximising strategies, is non-trivial.
Training an AI to understand dollars as something more than data points would be similarly non-trivial to training an AI to faultlessly assess human happiness.
Eliezer, I believe that your belittling tone is conducive to neither a healthy debate nor a readable blog post. I suspect that your attitude is borne out of just frustration, not contempt, but I would still strongly encourage you to write more civilly. It's not just a matter of being nice; rudeness prevents both the speaker and the listener from thinking clearly and objectively, and it doesn't contribute to anything.
It has always struck me that the tiling the universe with smiley faces example is one of the stupidest possible examples Eliezer could have come up with. It is extremely implausible, MUCH, MUCH more so than the camouflage tank scenario, and I understand Hibbard's indignation even if I agree with Eliezer on the general point he is making.
I have no idea why Eliezer wouldn't choose a better example that illustrates the same point, like the AGI spiking the water supply with a Soma-like drug that actually does make us all profoundly content in a highly undesirable way.
It is just me, or are things getting a bit unfriendly around here?
Anyway...
Wiring up the AI to maximise happy faces etc. is not a very good idea, the goal is clearly too shallow to reflect the underlying intent. I'd have to read more of Hibbard's stuff to properly understand his position, however.
That said, I do agree with a more basic underlying theme that he seems to be putting forward. In my opinion, a key, perhaps even THE key to intelligence is the ability to form reliable deep abstractions. In Solomonoff induction and AIXI you see this being drivi...
Tim,
"A utility function measured in dollars seems fairly unambiguous."
Oy vey.
There's this weird hero-worship codependency that emerges between Eliezer and some of his readers that I don't get, but I have to admit, it diminishes (in my eyes) the stature of all parties involved.
Shane, again, the issue is not differentiation. The issue is classification. Obviously, tiny smiley faces are different from human smiling faces, but so is the smile of someone who had half their face burned off. Obviously a superintelligence knows that this is an unusual case, but that doesn't say if it's a positive or negative case.
Deep abstractions are important, yes, but there is no unique deep abstraction that classifies any given example. An apple is a red thing, a biological artifact shaped by evolution, and an economic resource in the human market.
Also, Hibbard spoke of using smiling faces to reinforce behaviors, so if a superintelligence would not confuse smiling faces and happiness, that works against that proposal - because it means that the superintelligence will go on focusing on smiling faces, not happiness.
Retired Urologist, one of the most important lessons that a rationalist learns is not to try to be clever. I don't play nitwit games with my audience. If I say it, I mean it. If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.
There's a Hibbard piece from January 2008 in JET, but I'm not sure if it's new or if Eliezer has seen it: http://jetpress.org/v17/hibbard.htm
@EY: If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.
That's what I meant when I wrote: "By making his posts quirky and difficult to understand". Sorry. Should have been more precise.
@HA: perhaps you know the parties far better than I. I'm still looking.
I mean differentiation in the sense of differentiating between the abstract categories. Is a half a face that appears to be smiling while the other half is burn off still a "smiley face"? Even I'm not sure.
I'm certainly not arguing that training an AGI to maximise smiling faces is a good idea. It's simply a case of giving the AGI the wrong goal.
My point is that a super intelligence will form very good abstractions, and based on these it will learn to classify very well. The problem with the famous tank example you cite is that they were train...
I read most of the interchange between EY and BH. It appears to me that BH still doesn't get a couple of points. The first is that smiley faces are an example of misclassification and it's merely fortuitous to EY's ends that BH actually spoke about designing an SI to use human happiness (and observed smiles) as its metric. He continues to speak in terms of "a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human." EY's poin...
Even if by impossible luck he gets an AI that actually is a valid-happiness maximizer, he would still screw up. The AI would rampage out turning the galaxy into a paradise garden with just enough tamed-down monsters to keep us on our toes... but it would obliterate those sorts of utility that extend outside happiness, and probably stuff a cork in apotheosis. An Eden trap - a sort of existential whimper.
Shane: I mean differentiation in the sense of differentiating between the abstract categories.
The abstract categories? This sounds like a unique categorization that the AI just has to find-in-the-world. You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories.
@Eliezer - I think Shane is right. "Good" abstractions do exist, and are independent of the observer. The value of an abstraction relates to its ability to allow you to predict the future. For example, "mass" is a good abstraction, because when coupled with a physical law it allows you to make good predictions.
If we assume a superintelligent AI, we have to assume that the AI has the ability to discover abstractions. Human happiness is one such abstraction. Understanding the abstraction "happiness" allows one to predict certain...
Plato had a concept of "forms". Forms are ideal shapes or abstractions: every dog is an imperfect instantiation of the "dog" form that exists only in our brains.
Mmm. I believe Plato saw the forms as being real things existing "in heaven" rather than merely in our brains. It wasn't a stupid theory for its day; in particular, a living thing growing into the right shape or form must have seemed utterly mysterious, and so the idea that some sort of blueprint was laid out in heaven must have had a lot of appeal.
But anyway, forms as...
In contrast to Eliezer I think it's (remotely) possible to train an AI to reliably recognize human mind states underlying expressions of happiness. But this would still not imply that the machine's primary, innate emotion is unconditional love for all humans. The machines would merely be addicted to watching happy humans.
Personally, I'd rather not be an object of some quirky fetishism.
Monthy Python has, of course, realized it long ago:
http://www.youtube.com/watch?v=HoRY3ZjiNLU http://www.youtube.com/watch?v=JTMXtJvFV6E
@AC
I mean that a superintelligent AI should be able to induce the Form of the Good from extensive study of humans, human culture, and human history. The problem is not much different in principle from inducing the concept of "dog" from many natural images, or the concept of "mass" from extensive experience with physical systems.
"Wealth then. Wealth measures access to resources - so convert to gold, silver, barrels of oil, etc to measure it - if you don't trust your country's currency."
I may not have gotten the point across. An AI aiming to maximize its wealth in U.S. dollars can do astronomically better by taking control of the Federal Reserve (if dollars are defined in its utility function as being issued by the Reserve, with only the bare minimum required to meet that definition being allowed to persist) and having it start issuing $3^^^3 bills than any commercial act...
Re: Creating an oil bank that issues oil vouchers in numbers astronomically exceeding its reserves could let an AI possess 3^^^3 account units each convertible to a barrel of oil.
No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI.
This is an example of the wirehead fallacy framed in economic terms. As Omohundro puts it, "AIs will try to prevent counterfeit utility".
"No: such vouchers would not be redeemable in the marketplace: they would be worthless. Everyone would realise that - including the AI."
The oil bank stands ready to exchange any particular voucher for a barrel of oil, so if the utility function refers to the values of particular items, they can all have that market price. Compare with the price of gold or some other metal traded on international commodity markets. The gold in Fort Knox is often valued at the market price per ounce of gold multiplied by the number of ounces present, but in fact yo...
There are several famous science fiction stories about humans who program AIs to make humans happy, which then follow the letter of the law and do horrible things. The earliest is probably "With folded hands", by Jack Williamson (1947), in which AIs are programmed to protect humans, and they do this by preventing humans from doing anything or going anywhere. The most recent may be the movie "I, Robot."
I agree with E's general point - that AI work often presupposes that the AI magically has the same concepts as its inventor, even outsi...
I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.
The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier.
The point is, probably, that humans tend to underestimate the complexity of classifiers they use. The categories like "good" are not only difficult to precisely define, they are difficult to define at all, because they are too complicated to be formulated in words. To point out that in classification we use structures based on the architectu...
Look: humans can learn what a 'tank' is, and can direct their detection activities to specifically seek them - not whether the scene is light or dark, or any other weird regularity that might be present in the test materials. We can identify the regularities, compare them with the properties of tanks, and determine that they're not what we're looking for.
If we can do it, the computers can do it as well. We merely need to figure out how to bring it about - it's an engineering challenge only. That doesn't dismiss or minimize the difficulty of achieving i...
Animal trainers have this problem all the time. Animal performs behavior 'x' gets a reward. But the animal might have been doing other subtle behaviors at the same time, and map the reward to 'y'. So instead of reinforcing 'x', you might be reinforcing 'y'. And if 'x' and 'y' are too close for you to tell apart, then you'll be in for a surprise when your perspective and context changes, and the difference becomes more apparent to you. And you find out that the bird was trained to peck anything that moves, instead of just the bouncy red ball or somethi...
"You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories."
Yes, I believe categories of things do exist in the world in some sense, due to structure that exists in the world. I've seen thousands of things where were referred to as "smiley faces" and so there is an abstraction for this category of things in my brain. You have done likewise. While we can agree about many th...
IMHO, the idea that wealth can't usefully be measured is one which is not sufficiently worthwhile to merit further discussion.
The "wealth" idea sounds vulnerable to hidden complexity of wishes. Measure it in dollars and you get hyperinflation. Measure it in resources, and the AI cuts down all the trees and converts them to lumber, then kills all the animals and converts them to oil, even if technology had advanced beyond the point of needing either. Find some clever way to specify the value of all resources, convert them to products and allocate them to humans in the level humans want, and one of the products will be highly carcinogenic because the AI didn't know humans don't like that. The only way to get wealth in the way that's meaningful to humans without humans losing other things they want more than wealth is for the AI to know exactly what we want as well or better than we do. And if it knows that, we can ignore wealth and just ask it to do what it knows we want.
"The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the cl...
I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider.I find the idea of an AI God rather scary. However, unless private AIs are made illegal or heavily regulated, is there much danger of one AI ruling all the lesser intelligences?
"I await the proper timing and forum in which to elaborate my skepticism that we should focus on trying to design a God to rule us all. Sure, have a contingency plan in case we actually face that problem, but it seems not the most likely or important case to consider."
I agree with Robin. Although I'm disappointed that he thinks he lacks an adequate forum to pound the podium on this more forcefully.
Robin and I have discussed this subject in-person and got as far as narrowing down considerably the focus of the disagreement. Robin probably doesn't disagree with me at the point you would expect. Godlike powers, sure, nanotech etc., but Robin expects them to be rooted in a whole economy, not concentrated in a single brain like I expect. No comfort there for those attached to Life As We Know It.
However, I've requested that Robin hold off on discussing his disagreement with me in particular (although of course he continues to write general papers on the cosmic commons and exponential growth modes) until I can get more material out of the way on Overcoming Bias. This is what Robin means by "proper timing".
Shane, I think we agree on essential Bayesian principles - there's structure that's useful for generic prediction, which is sensitive only to the granularity of your sensory information; and then there's structure that's useful for decision-making. In principle, all structure worth thinking about is decision-making structure, but in practice we can usually factor out the predictive structure just as we factor out probabilities in decision-making.
But I would further say that decision-making structure can be highly sensitive to terminal values in a way that...
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.
Or more generally, not just a binary classification problem but a measurement issue: How to measure benefit to humans or human satisfaction.
It has sometimes struck me that this FAI requirement has a lot i...
Shane, the problem is that there are (for all practical purposes) infinitely many categories the Bayesian superintelligence could consider. They all "identify significant regularities in the environment" that "could potentially become useful." The problem is that we as the programmers don't know whether the category we're conditioning the superintelligence to care about is the category we want it to care about; this is especially true with messily-defined categories like "good" or "happy." What if we train it to d...
I wonder if you'd consider a superintelligent human have the same flaws as a superintelligent AI (and will eventually destroy the world). What about a group of superintelligent humans (assuming they have to cooperate in order to act)?
Eliezer: Have you read Scott Aaronson's work on the learnability of quantum states. There, the full space is doubly exponential in system size, but if we just want to predict the results of some set of possible questions (to some fixed accuracy), we don't need to train with nearly as many questions as one might think.
But it illustrates the general idea: the potential poison, in interacting with the complicated human machine, takes on a complicated boundary that doesn't match the grain of any local boundaries you would draw around substances.
Compared to 'actions that are right', even 'poisons' seems like a pretty obvious boundary to draw. Where's the grain around 'right'? Unlucky for Eliezer, we seem to find some pretty bizarre boundaries 'useful'.
Re: One god to rule us all
It does look as though there is going to be one big thing out there. It looks as though it will be a more integrated and unified entity than any living system up to now - and it is unlikely to be descended from today's United Nations - e.g. see:
Kevin Kelly: Predicting the next 5,000 days of the web
It seems rather unlikely that the Monopolies and Mergers Commission will be there to stop this particular global unification.
Eli, to my mind you seem to be underestimating the potential of a super intelligent machine.
How do I know that hemlock is poisonous? Well, I've heard the story that Socrates died by hemlock poisoning. This is not a conclusion that I've arrived at due to the physical properties of hemlock that I have observed and how this would affect the human body, indeed, as far as I know, I've never even seen hemlock before. The idea that hemlock is a poison is a pattern in my environment: every time I hear about the trial of Socrates I hear about it being the poison...
Shane, I think you're underestimating the idiosyncrasy of morality. Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.
Suppose that, driving along the highway, I see someone rid...
Eli, I've been busy fighting with models of cognitive bias in finance and only just now found time to reply:
Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.
Sure, this property of...
Shane, religious fundamentalists routinely act based on their beliefs about God. Do you think that makes "God" a natural category that any superintelligence would ponder? I see "human thoughts about God" and "things that humans justify by referring to God" and "things you can get people to do by invoking God" as natural categories for any AI operating on modern Earth, though an unfriendly AI wouldn't give it a second thought after wiping out humanity. But to go from here to reasoning about what God would actually ...
It's worth pointing out that we have wired-in preferences analogous to those Hibbard proposes to build into his intelligences: we like seeing babies smile; we like seeing people smile; we like the sweet taste of fresh fruit; we like orgasms; many of us (especially men) like the sight of naked women, especially if they're young, and they sexually arouse us to boot; we like socializing with people we're familiar with; we like having our pleasure centers stimulated; we don't like killing people; and so on.
It's worth pointing out that we engage in a lot of face-xeroxing-like behavior in pursuit of these ends. We keep photos of our family in our wallets, we look at our friends' baby photos on their cellphones, we put up posters of smiling people; we eat candy and NutraSweet; we masturbate; we download pornography; we watch Friends on television; we snort cocaine and smoke crack; we put bags over people's heads before we shoot them. In fact, in many cases, we form elaborate, intelligent plans to these ends.
It doesn't matter that you know, rationally, that you aren't impregnating Jenna Jameson, or that the LCD pixels on the cellphone display aren't a real baby, that Caffeine Free Diet C...
Re: One of the more obvious roadmaps to creating AI involves the stock market waking up.
I've fleshed this comment out into an essay on the topic: http://alife.co.uk/essays/the_awakening_marketplace/
Eli,
Do you think that makes "God" a natural category that any superintelligence would ponder?
Yes. If you're a super intelligent machine on a mission there is very little that can stop you. You know that. About the only thing that could stop you would be some other kind of super intelligent entity, maybe an entity that created the universe. A "God" of some description. Getting the God question wrong could be a big mistake, and that's reason enough for you to examine the possibility.
I don't consider such as Gods, as they are not supernatural and not ontologically distinct from creatures; they are simply powerful aliens or Matrix Lords. So I'll phrase it more precisely. Lots of humans talk about Jehovah. Does that make Jehovah a natural category? Or is only "human talk about Jehovah" a natural category? Do you ponder what Jehovah would do, or only what humans might think Jehovah would do?
So many of the comments here seem designed to illustrate the extreme difficulty, even for intelligent humans interested in rationality, and trying hard to participate usefully in a conversation about hard-edged situations of perceived non-trivial import, to avoid fairly simplistic anthropomorphisms of one kind or another.
Saying, of a supposed super-intelligent AI - one that works by being able to parallel, somehow, the 'might as well be magic' bits of intelligence that we currently have at best a crude assembly of speculative guesses for - any version of "of course, it would do X", seems - well - foolish.
Ok, so, trying on my understanding of this post: I guess that a smiling face should only reinforce something if it also leads to the "human happiness" goal... (which would be harder to train for).
I think I can see what Hibbard may have been trying for - in feeling that a smiley face might be worth training for as a first-step towards training for the actual, real goal... depending on how training a "real" AI would proceed.
As background, I can compare against training lab rats to perform complicated processes before getting their "r...
Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.
I have to confess that at first glance this statement seems arrogant. But, then I actually read some stuff in this AGI-mailing-list and well, I was filled with horror after I've read threads like this one:
Here is one of the most ridiculous passages:
...Note that we may not have perfected this process, and further, that this process need not be perf
Can anyone please explain the reference to the horror seen firsthand at http://www.mail-archive.com/agi@v2.listbox.com/? I tried going back in the archives to see if something happened in August 2008 or earlier (the date of Eliezer's post), but the list archive site doesn't have anything older than October 2008 currently. My curiosity is piqued and I need closure on the anecdote. If nothing else, others might benefit from knowing what horrors might be avoided during AGI research.
Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.
Probably apocryphal. I haven't been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.
A Redditor provides not one but two versions from "Embarrassing mistakes in perceptron research", Marvin Minsky, recorded 29-31 Jan 2011:
...Like I had a friend in Italy who had a perceptron that looked at a visual... it had visual inputs. So, he... he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron - a big machine - that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of... of pieces of music just by looking at the lower left... lower right hand corner of the page. So, I told this to the... to our scientist friend and he went through the data and he said: 'You guessed right. That's... that's how
I was surprised that the post focused on the difficulty of learning to classify things, rather than on the problems that would arise assuming the AI learned to classify smiling humans correctly. I'm not worried that the AI will tile the universe with smiley-faces. I'm worried the AI will tile the universe with smiling humans. Even with genuinely happy humans.
Humans can classify humans into happy and unhappy pretty well; superintelligent AI will be able to also. The hard problem is not identifying happiness; the hard problem is deciding what to maximize.
Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.
This document has a citation for the story: (Skapura, David M. and Peter S. Gordon, Building Neural Networks, Addison-Wesley, 1996.) I don't know for sure if that is the end of the trail or not.
When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.
That's a much more convincing and vivid image than "molecular smiley faces". Makes a more general point, too. Shame you didn't use that the first time, really.
I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.
And then in the 2020s it turned out to be the case again! Eg ChessGPT. Today I learned that Stockfish is now a neural network (trained on board positions, not move sequences).
This in no way cuts against the point of this post, but it stood out when I read this 16 years after it was posted.
In this instance the problem the AI is optimizing for isn't "maximize smiley faces", it's "produce outputs that human raters give high scores to". And it's done well on that metric, given that the LLM isn't powerful enough to subvert the reward channel.
I'm not quite seeing how this negates my point, help me out?
That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I'm discussing here.
So... um... what could possibly go wrong...
When I mentioned (sec. 6) that Hibbard's AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:
As Hibbard also wrote "Such obvious contradictory assumptions show Yudkowsky's preference for drama over reason," I'll go ahead and mention that Hibbard illustrates a key point: There is no professional certification test you have to take before you are allowed to talk about AI morality. But that is not my primary topic today. Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand. Even Michael Vassar was probably surprised his first time through.
No, today I am here to dissect "You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans."
Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.
The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set - output "yes" for the 50 photos of camouflaged tanks, and output "no" for the 50 photos of forest.
Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have "learned" 100 special cases that wouldn't generalize to new problems. Not, "camouflaged tanks versus forest", but just, "photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive..."
But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed!
The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.
It turned out that in the researchers' data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.
This parable - which might or might not be fact - illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence: If the training problems and the real problems have the slightest difference in context - if they are not drawn from the same independently identically distributed process - there is no statistical guarantee from past success to future success. It doesn't matter if the AI seems to be working great under the training conditions. (This is not an unsolvable problem but it is an unpatchable problem. There are deep ways to address it - a topic beyond the scope of this post - but no bandaids.)
As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes. If a black-and-white image is 256 pixels on a side, then the total image is 65536 pixels. The number of possible images is 265536. And the number of possible concepts that classify images into positive and negative instances - the number of possible boundaries you could draw in the space of images - is 2^(265536). From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 265536 classified examples to discriminate among 2^(265536) possible concepts - even if classifications are constant over time.
If this seems at all counterintuitive or non-obvious, see Superexponential Conceptspace.
So let us now turn again to:
and
It's trivial to discriminate a photo of a picture with a camouflaged tank, and a photo of an empty forest, in the sense of determining that the two photos are not identical. They're different pixel arrays with different 1s and 0s in them. Discriminating between them is as simple as testing the arrays for equality.
Classifying new photos into positive and negative instances of "smile", by reasoning from a set of training photos classified positive or negative, is a different order of problem.
When you've got a 256x256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65537th bit denoting the positiveness - no tiny little XML tag that says "This image is inherently positive". It's only a positive example relative to some particular concept.
But for any non-Vast amount of training data - any training data that does not include the exact bitwise image now seen - there are superexponentially many possible concepts compatible with previous classifications.
For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias. Which may not match what the user has in mind. The gap between these two example-classifying processes - induction on the one hand, and the user's actual goals on the other - is not trivial to cross.
Let's say the AI's training data is:
Dataset 1:
Now the AI grows up into a superintelligence, and encounters this data:
Dataset 2:
It is not a property of these datasets that the inferred classification you would prefer is:
rather than
Both of these classifications are compatible with the training data. The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset. If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.
Which classification will the AI choose? This is not an inherent property of the training data; it is a property of how the AI performs induction.
Which is the correct classification? This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name "right").
The concept that you wanted, cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so. That's what supervised learning is all about - providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.
But unless the training data is drawn from exactly the same context as the real-life, the training data will be "shallow" in some sense, a projection from a much higher-dimensional space of possibilities.
The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, or it never saw a tiny little agent with a happiness counter set to a googolplex. Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values. It is your own plans and desires that are at work when you say "No!"
Hibbard knows instinctively that a tiny molecular smileyface isn't a "smile", because he knows that's not what he wants his putative AI to do. If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling - as opposed to frowning, say - even though it's only paint.
As the case of Terry Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas. Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece, will not enable the AI to make a moral decision as to whether switching off Terry's life support is murder. That information isn't present in the dataset even inductively! Terry Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn't need to think about while classifying photos of living and dead humans from the time of Ancient Greece. No one was on life support then, still breathing with a brain half fluid. So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.
As a matter of formal fallacy, I see two anthropomorphic errors on display.
The first fallacy is underestimating the complexity of a concept we develop for the sake of its value. The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven't seen before. But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is just obviously not a smile. And we don't generate all possible borderline cases, so we don't think of all the considerations that might play a role in redefining the concept, but haven't yet played a role in defining it. Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data. (And also the difficulty of describing the concept directly - see The Hidden Complexity of Wishes.)
The second fallacy is anthropomorphic optimism: Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of "smile". As Hibbard uses the "smile" concept (to describe desired behavior of superintelligences), extending "smile" to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do - inherently so, as a property of the concept itself - so surely a superintelligence would not do it; this is just obviously the wrong classification. Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.
Why, Friendly AI isn't hard at all! All you need is an AI that does what's good! Oh, sure, not every possible mind does what's good - but in this case, we just program the superintelligence to do what's good. All you need is a neural network that sees a few instances of good things and not-good things, and you've got a classifier. Hook that up to an expected utility maximizer and you're done!
I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood. Relative to the full space of possibilities the Future encompasses, we ourselves haven't imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out. To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.
Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a "smile", and excludes every facial expression that Hibbard wouldn't label a "smile"...
Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.
When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.
The deep answers to such problems are beyond the scope of this post, but it is a general principle of Friendly AI that there are no bandaids. In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors. Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying "Yes, that's happiness!" about hydrogen atoms - hydrogen atoms are easy to make.
Link to my discussion with Hibbard here. You already got the important parts.