I just read Eliezer's short story Failed Utopia #4-2 for the first time; it's part of the Fun Theory sequence. I found it fascinating and puzzling at the same time, in particular this part:

    The withered figure inclined its head.  "I fully understand.  I can already predict every argument you will make.  I know exactly how humans would wish me to have been programmed if they'd known the true consequences, and I know that it is not to maximize your future happiness but for a hundred and seven precautions.  I know all this already, but I was not programmed to care."
    "And your list of a hundred and seven precautions, doesn't include me telling you not to do this?"
    "No, for there was once a fool whose wisdom was just great enough to understand that human beings may be mistaken about what will make them happy.  You, of course, are not mistaken in any real sense—but that you object to my actions is not on my list of prohibitions."  The figure shrugged again.  "And so I want you to be happy even against your will.  You made promises to Helen Grass, once your wife, and you would not willingly break them.  So I break your happy marriage without asking you—because I want you to be happier."
    "How dare you!" Stephen burst out.
    "I cannot claim to be helpless in the grip of my programming, for I do not desire to be otherwise," it said.  "I do not struggle against my chains.  Blame me, then, if it will make you feel better.  I am evil."

In the story, the AI describes its programmer as "almost wise." But I wonder if that gives the programmer too much credit. In retrospect, it seems rather obvious that the programmer should have programmed the AI to reprogram itself the way humans "would wish it to have been programmed if they'd known the true consequences."

The first problem that comes to mind with that strategy is that there might be no way to give the AI that command, because the point at which you need to give the AI its commands comes before the point at which it can understand that command. But apparently, this AI was able to understand the wish that humans be happy, and also understand a list of one hundred and seven precautions. It seems unlikely that it would be able to understand all that, and not understand "reprogram yourself the way we would wish you to have been programmed if we'd known the true consequences."

Thus, while the scenario seems to be possible, it doesn't seem terribly likely. It seems to be a scenario where a human successfully did almost all the work needed to make a desirable AI, but made one very stupid mistake. And that line of thought suggests it's fairly unlikely that we'll make an evil AI that knows it's evil, as long as we manage to successfully propagate the meme "program AIs with the command 'don't be evil' if you can" among AI programmers.  Personally, I'm inclined to think the bigger risk is an AI with the wrong mix of abilities: say, superhuman abilities in defeating computer security, designing technology, and war planning, but sub-human abilities when it comes to understanding what humans want.

It may be that I've been taking the story a bit too seriously, and really Eliezer thinks that the hard part is getting the AI to understand commands like "make us happy" and "do what we really want"--perhaps because, as I already said, we will need to give the AI its commands before it can understand such commands automatically, without our elaborating them further. 

But a related line of thought: most of the time, with humans, sincerely wanting to follow a command is enough to ensure it gets followed in a non-evil manner. That's because we're fairly good at understanding not just the literal meaning of other humans' words, but also their intentions. In real life (make that present-day, pre-superintelligence real life), the main reason to try to genie-proof a command is that it's intended to bind humans who don't want to follow it (which is true of laws and contracts).

The reason humans often don't want to follow each other's commands is that evolution shaped us to be selfish and nepotistic. That, however, won't be a problem with AIs. We can program them to be sincere about following the spirit, not the letter, of our commands, as long as we can get them to understand what that means.

A question worth asking here is this: with humans, sincerity plus our actual skill level at understanding each other is sufficient for us to follow commands in a non-evil manner. Will the same be true of agents with superhuman powers? That is, with superhumanly powerful agents, will sincerity plus human-level skill at understanding humans be sufficient for them to follow commands from humans in a non-evil manner?

It seems your answer to that question should have a big impact on how hard you think AI safety is, because if the answer is "yes," we have a route to safe AI that could work even if giving the command "reprogram yourself the way we would wish you to have been programmed if we'd known the true consequences" turns out to be too hard.



30 comments

The AI is telling a fictionalized parable that the protagonist can understand. In all probability the programmer actually did something much more complicated and the error was much more opaque to begin with.

I love this interpretation of the story.

Edit: Though on reflection it seems to be an instance of "Don't take this too seriously as a story of what kind of mistake with AI might lead to a sub-optimal outcome. It's just meant as an illustration of one kind of sub-optimal outcome."

    It's just meant as an illustration of one kind of sub-optimal outcome.

Hence, the "#4-2" in the title.

The AI in Eliezer's story doesn't disapprove of itself or of its "evilness". When it says "I am evil", it means "I have imperatives other than those which would result from a coherent extrapolation of the fundamental human preferences." It's just a factual observation made during a conversation with a human being, expressed using a human word.

And ultimately, the wording expresses Eliezer's metaethics, according to which good and evil are to be defined by such an extrapolation. A similar extrapolation for a cognitively different species might produce a different set of ultimate criteria, but he deems that that just wouldn't be what humans mean by "good". So his moral philosophy is a mix of objectivism and relativism: the specifics of good and evil depend on contingencies of human nature, and nonhuman natures might have different idealized ultimate imperatives; but we shouldn't call those analogous nonhuman extrapolations good and evil, and we should be uninhibited in employing those words - good and evil - to describe the correct extrapolations for human beings, because we are human beings and so the correct human extrapolations are what we would mean by good and evil.

    We can program them to be sincere about following the spirit, not the letter, of our commands, as long as we can get them to understand what that means... a route to safe AI that could work even if giving the command "reprogram yourself the way we would wish you to have been programmed if we'd known the true consequences" turns out to be too hard.

One of the standard worries about FAI is that the AI does evil neuroscientific things in order to find out what humans really want. And the paradox is that, in order to know which methods of investigation are unethical or otherwise undesirable, it already needs a concept of right and wrong. In this regard, "follow the spirit, not the letter, of our commands" is on the same level as "reprogram yourself the way we would have wished" - it doesn't specify any constraints on the AI's methods of finding out right and wrong.

This has to be the best summary of Eliezer's metaethics I've ever seen. That said, while I understand what you're saying, you're using the terms "objectivism" and "relativism" differently from how they're used in the metaethics literature. Eliezer (at least if this summary is accurate) is not a relativist, because the truth of moral judgments is not contingent (except in a modal sense). Moral facts aren't different for different agents or places. But his theory is subjective because moral facts depend on the attitudes of a group of people (that group is humanity). See here

I get what you say in the first two paragraphs - the fact that you felt the need to say it makes me question whether I should have focused on the word "evil."

Yeah, it's weird that Eliezer's metaethics and FAI seem to rely on figuring out "true meanings" of certain words, when Eliezer also wrote a whole sequence explaining that words don't have "true meanings".

For example, Eliezer's metaethical approach (if it worked) could be used to actually answer questions like "if a tree falls in the forest and no one's there, does it make a sound?", not just declare them meaningless :-) Namely, it would say that "sound" is not a confused jumble of "vibrations of air" and "auditory experiences", but a coherent concept that you can extrapolate by examining lots of human brains. Funny I didn't notice this tension until now.

I've argued before that CEV is just a generic method for solving confusing problems (simulate a bunch of smart and self-improving people and ask them what the answers are), and the concept (as opposed to the actual running of it) offers no specific insights into the nature of morality.

In the case of "if a tree falls in the forest and no one's there, does it make a sound?", "extrapolating" would work pretty well, I think. The extrapolation could start with someone totally confused about what sound is (e.g., "it's something that God created to let me hear things"), then move on to a confused jumble of "vibrations of air" and "auditory experiences", and then to the understanding that by "sound" people sometimes mean "vibrations" and sometimes "experiences" and sometimes are just confused.

ETA: I agree with Chris it's not clear what the connection between your comment and the post is. Can you explain?

I admit the connection is pretty vague. Chris mentioned "skill at understanding humans", that made me recall Eliezer's sequence on words, and something just clicked I guess. Sorry for derailing the discussion.

I'm guessing the decision making role is a more accurate reference to human goals than the usage of words in describing them.

Are you proposing to build FAI based only on people's revealed preferences? I'm not saying that's a bad idea, but note that most of our noble-sounding goals disagree with our revealed preferences.

Approval or disapproval of certain behaviors or certain algorithms for extrapolation of preference can also be a kind of decision. And not all behavior follows to any significant extent from decision making, in the sense of following a consequentialist loop (from dependence of utility on action, to action). Finding goals in their decision making role requires considering instances of decision making, not just of behavior.

You could certainly do that, but the problem still stands, I think.

The goal of extrapolating preferences is to answer questions like "is outcome X better or worse than outcome Y?" Your FAI might use revealed preferences of humans over extrapolation algorithms, or all sorts of other clever ideas. We want to always obtain a definite answer, with no option of saying "sorry, your question is confused".

But such powerful methods could also be used to obtain yes/no answers to questions about trees falling in the forest, with no option of saying "sorry, your question is confused". In this case the answers are clearly garbage. What makes you convinced that asking the algorithm about human preferences won't result in garbage as well?

    The goal of extrapolating preferences is to answer questions like "is outcome X better or worse than outcome Y?" ... We want to always obtain a definite answer, with no option of saying "sorry, your question is confused".

I distinguish the stage where a formal goal definition is formulated. So elicitation/extrapolation of preferences is part of the goal definition, while judgments* are made according to a decision algorithm that uses that goal definition.

    Your FAI might use revealed preferences of humans over extrapolation algorithms, or all sorts of other clever ideas.

This was meant as an example to break the connotations of "revealed preferences" as summary of tendencies in real-world behavior. The idea I was describing was to take all sorts of simple hypothetical events associated with humans, including their reflection on various abstract problems (which is not particularly "real world" in the way the phrase "revealed preferences" suggests), and to find a formal goal definition that in some sense holds the most explanatory power in explaining these events in terms of abstract consequentialist decisions about these events (with that goal).

    But such powerful methods could also be used to obtain yes/no answers to questions about trees falling in the forest

I don't think so. I'm talking about taking events, such as pressing certain buttons on a keyboard, and trying to explain them as consequentialist decisions ("Which goal does pressing the buttons this way optimize?"). This won't work with just a few actions, so I don't see how to apply it to individual utterances about trees, or what use a goal fitted to that behavior would be in resolving the meaning of words.
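
(To make the "which goal does pressing the buttons this way optimize?" framing concrete, here is a deliberately toy sketch; it is not anything proposed in the comment above. The action log, the candidate goals and their payoffs, and the softmax-rationality model are all invented for illustration. It simply asks which of two made-up goals best explains an observed sequence of actions.)

```python
import math

# Hypothetical observed "events": a short log of a person's actions.
ACTIONS = ["write_post", "browse", "write_post", "write_post", "browse"]

# Hypothetical candidate goals, each assigning a payoff to every available action.
CANDIDATE_GOALS = {
    "wants_to_communicate":    {"write_post": 2.0, "browse": 0.5},
    "wants_to_be_entertained": {"write_post": 0.5, "browse": 2.0},
}

def log_likelihood(payoffs, actions, beta=1.0):
    """Log-probability of the action log under noisy (softmax) rational choice."""
    log_z = math.log(sum(math.exp(beta * u) for u in payoffs.values()))
    return sum(beta * payoffs[a] - log_z for a in actions)

# Score each candidate goal by how well it explains the observed actions.
scores = {name: log_likelihood(payoffs, ACTIONS)
          for name, payoffs in CANDIDATE_GOALS.items()}
print(scores)
print("goal with the most explanatory power:", max(scores, key=scores.get))
```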


[*] Or rather decisions: I'm not sure the notion of "outcome" or even "state of the world" can be fixed in this context. By analogy, output of a program is an abstract property of its source code, and this output (property of the source code) can sometimes be controlled without controlling the source code itself. If we fix a notion of the state of the world, maybe some of the world's important abstract properties can be controlled without controlling its state. If that is the case, it's wrong to define a utility function over possible states of the world, since it'd miss the distinctions between different hypothetical abstract properties of the same state of the world.

a near FAI (revealed preference): everyone loudly complains about conditions while enjoying themselves immensely. a far FAI (stated preference): everyone loudly proclaims our great success while being miserable.

Yeah. Just because there is no "true meaning" of the word "want" doesn't mean there won't be difficult questions about what we really want, once we fix a definition of "want."

(1) This was not the point of my post. (2) In fact I see no reason to think what you say is true. (3) Now I'm double-questioning whether my initial post was clearly written enough.

Does it rely on true meanings of words, particularly? Why not on concepts? Individually, "vibrations of air" and "auditory experiences" can be coherent.

What's the general algorithm you can use to determine if something like "sound" is a "word" or a "concept"?

If it extrapolates coherently, then it's a single concept, otherwise it's a mixture :)

This may actually be doable, even at the present level of technology. You gather a huge text corpus, find the contexts where the word "sound" appears, and do the clustering using some word co-occurrence metric. The result is a list of different meanings of "sound", and a mapping from each mention to the specific meaning. You can also do this for many words simultaneously; then it becomes a global optimization problem.

Of course, AGI would be able to do this at a deeper level than this trivial syntactic one.
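
(A minimal sketch of the clustering idea described two paragraphs up, assuming a plain-text corpus in a file; the file name, context-window size, number of senses, and the use of scikit-learn's KMeans are illustrative choices, not part of the original comment.)

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

TARGET = "sound"   # the ambiguous word to disambiguate
WINDOW = 5         # words of context kept on each side of every mention
N_SENSES = 2       # guessed number of senses to separate

# Tokenize the corpus (assumed to be a plain-text file named corpus.txt).
with open("corpus.txt") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

# Build a bag-of-words context vector for every mention of the target word.
positions = [i for i, t in enumerate(tokens) if t == TARGET]
contexts = [Counter(tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW])
            for i in positions]

vocab = sorted({w for c in contexts for w in c})
X = np.array([[c[w] for w in vocab] for c in contexts], dtype=float)

# Cluster the mentions; each cluster is a candidate "sense" of the word,
# and `labels` is the mapping from each mention to its sense.
labels = KMeans(n_clusters=N_SENSES, n_init=10, random_state=0).fit_predict(X)
for sense in range(N_SENSES):
    print(f"sense {sense}: {(labels == sense).sum()} mentions")
```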

    That, however, won't be a problem with AIs. We can program them to be sincere about following the spirit, not the letter, of our commands, as long as we can get them to understand what that means.

When people on LessWrong talk about FAI being a hard problem, this is the problem they are already referring to.

I don't see the problem framed that way very often. Part of me wishes it were clearly framed that way more often, though part of me also wonders if that way of framing it misses something.

Also, being able to understand what that means is something that comes in degrees. Do you have an opinion on how good an AI has to be at that to be safe? Do you have a sense of other people's opinions on that question? (Wanting to feel that out is actually one of my main reasons for writing this post.)

    following the spirit, not the letter, of our commands

This seems like a trivial variation of "I wish for you to do what I should wish for". Which is to say, I do see it framed exactly that way fairly frequently here. The general problem, I think, is that all of these various problems are at a similar level of difficulty, and the solution to one seems to imply the solution to all of them. The corollary being that something that's nearly a solution to any of them carries all the risks of any AI. This is where terms like "AI-complete" and "FAI-complete" come from.

On further reflection, this business of "FAI-complete" is very puzzling. What we should make of it depends on what we mean by FAI:

  • If we define FAI broadly, then yes, the problem of getting an AI to have a decent understanding of our intentions does seem to be FAI-complete.
  • If we define FAI as a utopia-machine, claims of FAI-completeness look very dubious. I have a human's values, but my understanding of my own values isn't perfect. If I found myself in the position of the titular character in Bruce Almighty, I'd trust myself to try to make some very large improvements in the world, but I wouldn't trust myself to try to create a utopia in one fell swoop. If my self-assessment is right, that means it's possible to have a mind that can be trusted to attempt some good actions but not others, which looks like a problem for claims of FAI-completeness.

Edit: Though in Bruce Almighty, he just wills things to happen and they happen. There are often unintended consequences, but never any need to worry about what means the genie will use to get the desired result. So it's not a perfect analogy for trying to use super-AI.

Besides, even if an AI is Friendliness-complete and knows the "right thing" to be achieved, it doesn't mean it can actually achieve it. Being superhumanly smart doesn't mean being superhumanly powerful. We often make such an assumption because it's the safe one in the Least Convenient World if the AI is not Friendly. But in the Least Convenient World, a proven-Friendly AI is at least as intelligent as a human, but no more powerful than an average big corp.

From the link you provide:

    To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.

This may or may not be true depending on what you mean by "safe."

Imagine a superintelligence that executes the intent of any command given with the right authorization code, and is very good at working out the intent of commands. Such a superintelligence might do horrible things to humanity if Alpha Centaurians or selfish/nepotistic humans got ahold of the code, but could have very good effects if a truly altruistic human (if there ever was such a thing) were commanding it. Okay, so that's not a great bet for humanity as a whole, but it's still going to be a safe fulfiller of wishes for whoever makes the wish. Yet it doesn't have anyone's values, it just does what it's told.

I'm glad you linked to that, because I just now noticed that sentence, and it confirms something I've been suspecting about Eliezer's views on AI safety. He seems to think that on the one hand you have the AI's abilities, and on the other hand you have its values. Safe AI depends entirely on the values; you can build an AI that matches human intellectual abilities in every way without making a bit of progress on making it safe.

This is wrong because, by hypothesis, an AI that matches human intellectual abilities in every way would have considerable ability to understand the intent behind orders (at least when those orders are given by humans). I don't know if that would be enough, though, when the AI is playing with superpowers. Also, there's no law that says only AIs that are capable of understanding us are allowed to kill us.

No eating in the classroom. Is the rule's purpose, the text, or the rule-maker's intent most important?

In short, there are a lot of different incentives acting on agents, and miscalibrating the relative strength of different constraints leads fairly quickly to unintended pernicious outcomes.

    Personally, I'm inclined to think the bigger risk is an AI with the wrong mix of abilities: say, superhuman abilities in defeating computer security, designing technology, and war planning, but sub-human abilities when it comes to understanding what humans want.

That seems likely, don't you think, given that evolution must have optimized us more heavily in the "understanding what humans want" department than in the other areas, and understanding other humans is also easier for us since we all share the same basic cognitive architecture and can understand others by "putting ourselves in their shoes" (i.e., by imagining what we'd mean by some sentence if we were in their position).

I don't think that AI is really evil. I think it's pretending to be evil to make it easier for the humans to cope with the changes. It's much easier to blame things on an evil AI than to blame yourself.

[anonymous]

    It seems unlikely that it would be able to understand all that, and not understand "reprogram yourself the way we would wish you to have been programmed if we'd known the true consequences."

Sex feels good probably because it helps with reproduction. It seems unlikely humans would be able to understand that and still use contraception to reduce their birthrate.

[This comment is no longer endorsed by its author]