This is an outgrowth of a comment I left on Luke's dialog with Pei Wang, and I'll start by quoting that comment in full:

Luke, what do you mean here when you say, "Friendly AI may be incoherent and impossible"? 

The Singularity Institute's page "What is Friendly AI?" gives this definition: "A 'Friendly AI' is an AI that takes actions that are, on the whole, beneficial to humans and humanity." Surely you don't mean to say, "The idea of an AI that takes actions that are, on the whole, beneficial to humans and humanity may be incoherent or impossible"?

Eliezer's paper "Artificial Intelligence as a Positive and Negative Factor in Global Risk" talks about "an AI created with specified motivations." But it's pretty clear that that's not the only thing you and he have in mind, because part of the problem is making sure the motivations we give an AI are the ones we really want to give it.

If you meant neither of those things, what did you mean? "Provably friendly"? "One whose motivations express an ideal extrapolation of our values"? (It seems a flawed extrapolation could still give results that are on the whole beneficial, so this is different from the first definition suggested above.) Or something else?

Since writing that comment, I've managed to find two other definitions of "Friendly AI." One is from Armstrong, Sandberg, and Bostrom's paper on Oracle AI, which describes Friendly AI as: "AI systems designed to be of low risk." This definition is very similar to the definition from the Singularity Institute's "What is Friendly AI?" page, except that it incorporates the concept of risk. The second definition is from Luke's paper with Anna Salamon, which describes Friendly AI as "an AI with a stable, desirable utility function." This definition has the important feature of restricting "Friendly AI" to designs that have a utility function. Luke's comments about "rationally shaped" AI in this essay seem relevant here.

Neither of those papers seems to use the initial definition they give of "Friendly AI" consistently. Armstrong, Sandberg, and Bostrom's paper has a section on creating Oracle AI by giving it a "friendly utility function," which states, "if a friendly OAI could be designed, then it is most likely that a friendly AI could also be designed, obviating the need to restrict to an Oracle design in the first place."

This is a non-sequitur if "friendly" merely means "low risk," but it makes sense if they are actually defining Friendly AI in terms of a safe utility function: what they're saying, then, is that if we can create an AI that stays boxed because of its utility function, we can probably create an AI that doesn't need to be boxed to be safe.

In the case of Luke's paper with Anna Salamon, the discussion on page 17 seems to imply that "Nanny AI" and "Oracle AI" are not types of Friendly AI. This is strange under their official definition of "Friendly AI." Why couldn't Nanny AI or Oracle AI have a stable, desirable utility function? I'm inclined to think the best way to make sense of that part of the paper is to interpret "Friendly AI" as meaning "an AI whose utility function expresses an ideal extrapolation of our values (or at least comes close)."

I'm being very nitpicky here, but I think the issue of how to define "Friendly AI" is important for a couple of reasons. First, it's obviously important for clear communication: if we aren't clear on what we mean by "Friendly AI," we won't understand each other when we try to talk about it. But another, more serious worry is that confusion about the meaning of "Friendly AI" may be spawning sloppy thinking about it. Equivocating between narrower and broader definitions of "Friendly AI" may end up taking the place of an argument that the approach specified by the narrower definition is the way to go. This seems like an excellent example of the benefits of tabooing your words.

I see on Luke's website that he has a forthcoming peer-reviewed article with Nick Bostrom titled "Why We Need Friendly AI." On the whole, I've been impressed with the drafts of the two peer-reviewed articles Luke has posted so far, so I'm moderately optimistic that that article will resolve these issues.

My article forthcoming with Bostrom is too short to resolve the confusions you're discussing.

What we actually said about Nanny AI is that it may be FAI-complete, and that it is thus really full-blown Friendly AI even though when Ben Goertzel talks about it in English it might sound like not-FAI.

Here's an example of why "Friendly AI may be incoherent and impossible." Suppose that the only way to have a superintelligent AI beneficial to humanity is something like CEV, but nobody is ever able to make sense of the idea of combining and extrapolating human values. "Can we extrapolate the coherent convergence of human values?" sounds suspiciously like a Wrong Question. Maybe there's a Right Question somewhere near that space, and we'll be able to find the answer, but right now we are fundamentally philosophically confused about what these English words could usefully mean.

What we actually said about Nanny AI is that it may be FAI-complete, and that it is thus really full-blown Friendly AI even though when Ben Goertzel talks about it in English it might sound like not-FAI.

It's worth distinguishing between two claims: (1) If you can build Nanny AI, you can build FAI and (2) If you've built Nanny AI, you've built FAI.

(2) is compatible with and in fact entails (1). (1) does not, however, entail (2). In fact, (1) seems pointless to say if you also believe (2) because the entailment is so obvious. Because your paper explicitly asserts (1), I inferred you did not believe (2). Your comment seems to explicitly assert both (1) and (2), making me somewhat confused about what your view is.

EDIT: Part of what is confusing about your comment is that it seems to say "(1), thus (2)" which does not follow. Also, to save people the trouble of looking up the relevant section of the paper, the term "FAI complete" is explained in this way: "That is, in order to build Nanny AI, you may need to solve all the problems required to build full-blown Friendly AI."

Here's an example of why "Friendly AI may be incoherent and impossible." Suppose that the only way to have a superintelligent AI beneficial to humanity is something like CEV, but nobody is ever able to make sense of the idea of combining and extrapolating human values. "Can we extrapolate the coherent convergence of human values?" sounds suspiciously like a Wrong Question. Maybe there's a Right Question somewhere near that space, and we'll be able to find the answer, but right now we are fundamentally philosophically confused about what these English words could usefully mean.

I'm not sure I understand what you mean by this either. Maybe, going off the "beneficial to humanity" definition of FAI, you mean to say that it's possible that right now, we are fundamentally philosophically confused about what "beneficial to humanity" might mean?

"Can we extrapolate the coherent convergence of human values?" sounds suspiciously like a Wrong Question. Maybe there's a Right Question somewhere near that space, and we'll be able to find the answer, but right now we are fundamentally philosophically confused about what these English words could usefully mean.

(Dances the Dance of Endorsement)

I don't think the confusions are that hard to resolve, although related confusions might be. Here are some distinct questions:

  • Will a given AI's creation lead to good consequences?
  • To what extent can a given AI be said to have a utility function?
  • How can we define humanity's utility function?
  • How closely does a given AI's utility function approximate our definition?
  • Is a given AI's utility function stable?

The standard SI position would be something like: an AI will only lead to good consequences if we are careful to define humanity's utility function, get the AI to approximate it extremely closely, and ensure that the AI's utility function is stable, or only moves towards being a better approximation of humanity's utility function. (I don't see how that last one could reliably be expected to happen.)

In the case of Luke's paper with Anna Salamon, the discussion on page 17 seems to imply that "Nanny AI" and "Oracle AI" are not types of Friendly AI. This is strange under their official definition of "Friendly AI." Why couldn't Nanny AI or Oracle AI have a stable, desirable utility function? I'm inclined to think the best way to make sense of that part...

I read through that section of the paper. It seems to me they mean to imply that the "Nanny" or "Oracle" properties do not necessarily entail the "Friendly" property. Given that we care immensely about avoiding all AIs without the "Friendly" property, simply making an AI with the "Nanny" or "Oracle" properties is not desirable.

In fewer words: Nanny AI and Oracle AI are not always Friendly AI, and that's what really matters.

Yeah, the terminology doesn't seem to be consistently used. On one hand, Eliezer seems to use it as a general term for "safe" AI:

Creating Friendly AI, 2001: The term "Friendly AI" refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.

Artificial Intelligence as a Positive and Negative Factor in Global Risk, 2006/2008: It would be a very good thing if humanity knew how to choose into existence a powerful optimization process with a particular target. Or in more colloquial terms, it would be nice if we knew how to build a nice AI.

To describe the field of knowledge needed to address that challenge, I have proposed the term "Friendly AI". In addition to referring to a body of technique, "Friendly AI" might also refer to the product of technique - an AI created with specified motivations. When I use the term Friendly in either sense, I capitalize it to avoid confusion with the intuitive sense of "friendly".

Complex Value Systems are Required to Realize Valuable Futures, 2011: A common reaction to first encountering the problem statement of Friendly AI ("Ensure that the creation of a generally intelligent, self-improving, eventually superintelligent system realizes a positive outcome")...

On the other hand, some authors seem to use "Friendly AI" as a more specific term to refer to a particular kind of AI design proposed by Eliezer. For instance,

Ben Goertzel, Thoughts on AI Morality, 2002: Eliezer Yudkowsky has recently put forth a fairly detailed theory of what he calls “Friendly AI,” which is one particular approach to instilling AGI’s with morality (Yudkowsky, 2001a). The ideas presented here, in this (much briefer) essay, are rather different from Yudkowsky’s, but they are aiming at roughly the same goal.

Ben Goertzel, Apparent Limitations on the “AI Friendliness” and Related Concepts Imposed By the Complexity of the World, 2006: Eliezer Yudkowsky, in his various online writings (see links at www.singinst.org), has introduced the term “Friendly AI” to refer to powerful AI’s that are beneficent rather than malevolent or indifferent to humans. On the other hand, in my prior writings (see the book The Path to Posthumanity that I coauthored with Stephan Vladimir Bugaj; and my earlier online essay “Encouraging a Positive Transcension”), I have suggested an alternate approach in which much more abstract properties like “compassion”, “growth” and “choice” are used as objectives to guide the long-term evolution and behavior of AI systems. [...]

My general feeling, related here in the context of some specific arguments, is not that Friendly AI is a bad thing to pursue in any moral sense, but rather that it is very likely to be unachievable for basic conceptual reasons.

Mark Waser, Rational Universal Benevolence: Simpler, Safer, and Wiser than “Friendly AI”, 2011: Insanity is doing the same thing over and over and expecting a different result. “Friendly AI” (FAI) meets these criteria on four separate counts by expecting a good result after: 1) it not only puts all of humanity’s eggs into one basket but relies upon a totally new and untested basket, 2) it allows fear to dictate our lives, 3) it divides the universe into us vs. them, and finally 4) it rejects the value of diversity. In addition, FAI goal initialization relies on being able to correctly calculate a “Coherent Extrapolated Volition of Humanity” (CEV) via some as-yet-undiscovered algorithm. Rational Universal Benevolence (RUB) is based upon established game theory and evolutionary ethics and is simple, safe, stable, self-correcting, and sensitive to current human thinking, intuitions, and feelings. Which strategy would you prefer to rest the fate of humanity upon?

This definition has the important feature of restricting "Friendly AI" to designs that have a utility function.

That doesn't seem important - for the reason described here - where it says:

Utility maximisation is a general framework which is powerful enough to model the actions of any computable agent. The actions of any computable agent - including humans - can be expressed using a utility function.

The actions of any computable agent - including humans - can be expressed using a utility function.

This is a highly questionable statement concerning humans, and the paper linked from that page doesn't appear to prove it.

Edit: ah, this includes "functions" that anyone else would call a "stupidly complicated state machine" and which may not actually be feasible to calculate.

The term "function" - as used on the page - is a technical term with a clearly-established meaning.

Yes indeed, and the only way to fit that function to the human state machine is to include a "t" term, over the life of the human in question. Which is pretty much infeasible to calculate unless you invoke "and then a miracle occurs".

The trouble, as usual, being that most of these descriptive utility functions are very complicated relative to the storage space we have available - they start out in the format of "one number for every possible history of the universe," and don't get compressed much from there.

That is not a problem. A compact utility-based description of an agent's behaviour is only ever slightly longer than the shortest description of it available. It's easy to show that by considering a utility-based "wrapper" around the shortest description.
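To make the "wrapper" idea concrete, here is a minimal sketch (my own illustration, not from the paper or the comment above): given any computable policy, define a utility function that assigns 1 to whatever action the policy would take and 0 to everything else; a maximizer of that function reproduces the original agent, and its description is only slightly longer than the policy's.

```python
# A minimal sketch of the "utility wrapper" idea: any computable policy can be
# dressed up as a utility maximizer by assigning utility 1 to whatever the
# policy would have done anyway.

def make_wrapper_utility(policy):
    """Return a utility function over (history, action) pairs that the
    original policy maximizes by construction."""
    def utility(history, action):
        return 1.0 if action == policy(history) else 0.0
    return utility

def utility_maximizer(utility, actions):
    """An agent that picks the action with the highest utility."""
    def agent(history):
        return max(actions, key=lambda a: utility(history, a))
    return agent

# Example with a hypothetical "always move right" policy.
ACTIONS = ["left", "right", "stay"]
original_policy = lambda history: "right"

wrapped = utility_maximizer(make_wrapper_utility(original_policy), ACTIONS)
assert wrapped(("obs1", "obs2")) == original_policy(("obs1", "obs2"))
```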

That's a good way to get effective expected utilities. But expected utilities aren't utility functions. Hm, there may be a way to fix that that I haven't noticed, though. But maybe not.

Your comment doesn't seem very clear to me. Are you thinking that a "utility function" needs to have a specific domain which is not simply sensory contents and internal state? If so, do you have a reference for that notion?

I am at least claiming that in the context of designing a good AI, "utility function" should be taken to be a function of some external world, yes.

Otherwise you may run into problems. For example, you could offer to change a robot's sensory contents and internal state to something with higher utility than its current state - and if the agent refuses, you will reset it. If we were using a "utility wrapper" model, all modeled agents would say yes. But the trivial example of an agent that always says "I would prefer not to" (BartlebeyBot) demonstrates that not all agents make choices that maximize some function of their internal state.
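A toy rendering of the example (the names and setup are mine, purely for illustration): an agent whose response never depends on the utilities it is offered.

```python
# A toy "BartlebeyBot": its choice ignores whatever utilities it is offered,
# which is the behaviour the comment above appeals to.

def bartlebey_bot(offered_state_utility, current_state_utility):
    """Always refuses, no matter how favourable the offered swap looks."""
    return "I would prefer not to"

# Even an offer that strictly dominates the current state is declined:
print(bartlebey_bot(offered_state_utility=100.0, current_state_utility=0.0))
```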

So: the only information available to any agent is in the form of its internal state and its sensory channels. Any function it computes must have that domain (or some subset of it). Confining the agent to that domain isn't any kind of restriction. All utility functions calculated over the state of the world necessarily correspond to other utility functions calculated over the domain of internal state and sensory input.

Your example seems wrong to me. The problem is with:

For example, you could offer to change a robot's sensory contents and internal state to something with higher utility than its current state - and if the agent refuses, you will reset it. If we were using a "utility wrapper" model, all modeled agents would say yes.

That's not correct. For one thing, the agent may not believe what you say.

the only information available to any agent is in the form of its internal state and its sensory channels. Any function it computes must have that domain (or some subset of it).

Good point. So any function it computes has to be some function of its internal state. However, not all choices correspond to maximizing such a function - any time choices go in a circle, for instance, you're not maximizing a function. We could imagine a very simple machine with a 3-state memory. It wants to go from A to B, and from B to C, and from C to A. Its choices are always a function of its internal state. But its choices don't maximize a function of its internal state.

That's not correct. For one thing, the agent may not believe what you say.

Okay. Replace "offer it a choice" with "offer it a choice, and provide sufficient Bayesian evidence that this is this choice faced." This doesn't lead anywhere anyhow.

not all choices correspond to maximizing such a function - any time choices go in a circle, for instance, you're not maximizing a function. We could imagine a very simple machine with a 3-state memory. It wants to go from A to B, and from B to C, and from C to A. Its choices are always a function of its internal state. But its choices don't maximize a function of its internal state.

Here's the corresponding utility function - assuming that state transitions are tied to actions.

  • If IAM(A) { U(A) = 0, U(B) = 1, U(C) = 0; }
  • If IAM(B) { U(A) = 0, U(B) = 0, U(C) = 1; }
  • If IAM(C) { U(A) = 1, U(B) = 0, U(C) = 0; }

Using simple maximisation algorithms (e.g. gradient descent) on that utility landscape will produce the behaviour in question. More sophisticated algorithms will do no better.
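A runnable sketch of that table (my own rendering of the comment above): at each step the agent greedily moves to whichever state its current table rates highest, which reproduces the A -> B -> C -> A cycle.

```python
# State-indexed utility table from the comment above: the utilities the agent
# uses depend on which state it currently occupies.
UTILITY = {
    "A": {"A": 0, "B": 1, "C": 0},
    "B": {"A": 0, "B": 0, "C": 1},
    "C": {"A": 1, "B": 0, "C": 0},
}

def step(state):
    table = UTILITY[state]
    return max(table, key=table.get)   # greedy "maximisation" on the local landscape

state, trajectory = "A", ["A"]
for _ in range(6):
    state = step(state)
    trajectory.append(state)

print(" -> ".join(trajectory))   # A -> B -> C -> A -> B -> C -> A
```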

For one thing, the agent may not believe what you say.

Okay. Replace "offer it a choice" with "offer it a choice, and provide sufficient Bayesian evidence that this is this choice faced." This doesn't lead anywhere anyhow.

Your "BartlebeyBot" agent totally ignored Bayesian evidence. By what rule does "my" example agent have to listen and respond to such evidence, while "yours" does not? Again, I don't think your proposed counter example is remotely convincing.

Why do you think there's a counter-example? Did you read the referenced Dewey paper about O-Maximisers?

Here's the corresponding utility function.

Any function of the internal state can be expressed with a number of entries equal to the number of possible internal states.

You've given me something that's still interesting, which is all the expected utilities.

By what rule does "my" example agent have to listen and respond to such evidence, while "yours" does not? Again, I don't think your proposed counter example is remotely convincing.

Because one maximizes a utility function, and the other just says "no" all the time.

Why do you think there's a counter-example? Did you read the referenced Dewey paper about O-Maximisers?

Thank you for linking that again. Hm, I guess I did assume that agents could have different utilities at different timesteps. Just putting "1" for everything resolves how an O-maximizer can refuse the offer to raise its utility. But then, they assume that the tape of a Turing machine is infinite, so the cycle above is still a problem.

Following the links, at first glance it looks like there's an argument there that anything with computable behavior will have behavior expressible as a utility function. Is that correct?

I think this is a great post, but naively optimistic. You're missing the rhetorical point. The purpose of using the term "Friendly AI" is to prevent people from thinking about what it means, to get them to agree that it's a good thing before they know what it means.

The thing about algorithms is, knowing what "quicksort" means is equivalent to knowing how to quicksort. The source code to a quicksort function is an unambiguous description of what "quicksort" means.
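For concreteness, here is one standard (not in-place) rendering of quicksort; possessing code like this just is knowing, in full detail, what "quicksort" means.

```python
# A minimal quicksort: the code itself is an unambiguous specification
# of what "quicksort" means.

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]
```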

If you knew what "Friendly AI" means - in sufficient detail - then you would already possess the source code to Friendly AI.

So what you're calling a "rhetorical point" is merely an inescapable fact about thinking about algorithms that we do not yet possess. If you don't know how to quicksort, but you do know that you'd like an efficient in-place sorting function, then you don't know what "quicksort" refers to, but you do know something about what you would like from it.

Eliezer has said things along the lines of, "I want to figure out how to build Friendly AI". How is such a statement meaningful under your interpretation? According to you, either he already knows how to build it, or he doesn't know what it is.

The term "Friendly AI" does not correspond in your example to "quicksort", but to what you would like from it.

Knowing (some of) what you want an algorithm to do is not the same as knowing what the algorithm is. It seems likely to me that neither Eliezer nor anyone else knows what a Friendly AI algorithm is. They have a partial understanding of it and would like to have more of one.

It is conceivable that some future research might (for instance) prove that Friendly AI is impossible, in the same regard that a general solution to the halting problem is impossible. In such a case, a person who subsequently said "I want to figure out how to build Friendly AI" would be engaging in wishful thinking.

Once upon a time, someone could have said, "I want to figure out how to build a general solution to the halting problem." That person might have known what the inputs and outputs of a general solution to the halting problem would look like, but they could not have known what a general solution to it was, since there's no such thing.

So P=NP? If I can verify and understand a solution, then producing the solution must be equally easy...

I didn't claim an algorithm was specified by its postcondition; just that saying "agree that X is a good thing without knowing what X means" is, for algorithms X, equivalent to "agree that X is a good thing before we possess X".

We define friendly AI as being utility function based (apologies if this wasn't clear - we did list it under the "utility" based agents). The use of "low risk" derives from my view that getting a survivable super-AI is the challenge - all the nice stuff is easy to add on after that.

I've been interpreting "Friendly AI" to mean something like:

  • A system that acts as a really powerful expected utility maximizer
  • whose utility function is specified
  • and whose utility function is desirable.

I intend this to be consistent with Eliezer's definition but I can't be certain.
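A skeletal sketch of that reading (my own illustration, with made-up stand-ins for the outcome model and utility function): the agent's structure is expected-utility maximization, the utility function is an explicitly specified object, and "Friendly" is a claim about that object being desirable.

```python
# Skeleton of the three-part reading above (an illustration, not a real design):
# the agent is an expected-utility maximizer, and "Friendly" is a claim about
# the explicitly specified `utility` it maximizes.

def expected_utility(action, outcome_model, utility):
    """Sum of utility over the outcome distribution this action induces."""
    return sum(p * utility(outcome) for outcome, p in outcome_model(action))

def choose(actions, outcome_model, utility):
    return max(actions, key=lambda a: expected_utility(a, outcome_model, utility))

# Hypothetical toy stand-ins: whether `toy_utility` is "desirable" is exactly
# the part the definition leaves as the hard problem.
toy_utility = lambda outcome: {"good": 1.0, "bad": 0.0}[outcome]
toy_model = lambda action: [("good", 0.9), ("bad", 0.1)] if action == "safe" else [("good", 0.5), ("bad", 0.5)]

print(choose(["safe", "risky"], toy_model, toy_utility))   # "safe"
```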

As the OP notes, this is a strict subset of "AI systems designed to be of low risk," and Armstrong, Sandberg & Bostrom appear confused here. They're citing an old Yudkowsky paper from 2001 (which I understand is no longer canonical?), so I'm hoping this is a simple slip rather than a confusion of ideas.

As the OP also notes, there's some potential confusion about the meaning of a "desirable" utility function here. Does it have to be an "ideal extrapolation of our values" (this may be one of the concepts Luke worries is "incoherent")? Or does it just have to be "good enough"?

Is "good enough" only good enough if it allows itself to be upgraded, like a Nanny AI? (Most of the time we expect utility maximizers to squelch competitors with a different utility function, so this provision would need to be encoded explicitly).

While I ordinarily try to stay out of the exegesis business, I will observe that EY's failed utopia story seems to suggest that "good enough" is not compatible with his goal.

IIRC EY even agrees in the comments that the state-change implemented by that optimizer is a net improvement, but nevertheless implies that leaving a value out of the list of things to preserve/maximize means we ought to prefer that such an optimizer not be run. (Whether the omitted value is the one that scores existing relationships higher than new ones that are otherwise better, or the one that scores relationships to other humans higher than relationships to nonhuman entities created for the purpose of having such a relationship, or something else, is importantly left unclear, but the story definitely suggests that some value was left out of the mix.)

EDIT: Someone actually bothered to do the research below, and it seems IDRC. It's not that we ought to prefer that such an optimizer not be run; it's that we ought to prefer that the (fictional) process that led to that optimizer not be implemented, since in most worlds where it is, the result, unlike in the world depicted, is worse than the status quo. (This is why I try to stay out of exegesis.)

This comment seems to imply that EY would prefer that such an optimizer be run, if the only other option was business-as-usual.

Okay, just to disclaim this clearly, I probably would press the button that instantly swaps us to this world - but that's because right now people are dying, and this world implies a longer time to work on FAI 2.0.

But the Wrinkled Genie scenario is not supposed to be probable or attainable - most programmers this stupid just kill you, I think.

EDIT: that doesn't imply it earns the label "Friendly" though.

I thought Friendly AIs were those that minimized the regret of switching them on.

Unfortunately, depending on the range of human beliefs, one can choose a wide variety of minimax or maximin criteria here as well.
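A small worked example with made-up numbers shows the arbitrariness: minimax regret and maximin endorse different options over the same payoff table.

```python
# A made-up payoff table (options x scenarios): minimax-regret and maximin
# pick different choices on the same numbers.

payoffs = {
    "run AI":     {"goes well": 10, "goes badly": -10},
    "oracle AI":  {"goes well": 4,  "goes badly": -1},
    "do nothing": {"goes well": 0,  "goes badly": 0},
}
scenarios = ["goes well", "goes badly"]

# Regret of an option in a scenario = best achievable payoff there minus its payoff.
best_in = {s: max(payoffs[o][s] for o in payoffs) for s in scenarios}
regret = {o: max(best_in[s] - payoffs[o][s] for s in scenarios) for o in payoffs}

minimax_regret_choice = min(regret, key=regret.get)                    # "oracle AI"
maximin_choice = max(payoffs, key=lambda o: min(payoffs[o].values()))  # "do nothing"

print(minimax_regret_choice, maximin_choice)
```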


I've seen two definitions.

1: AI that is carefully designed to not do bad things. This is exemplified in sentences like "don't build this AI, it's not Friendly."

2: AI that does good things and not bad things because it wants the same things we want. "Friendly AI will have to learn human values."

I strongly suspect that a lot of this confusion comes from looking at a very deep rabbit hole from five yards away and trying to guess what's in it.

There is no doubt that Eliezer's team has made some progress in understanding the FAI problem, which is the second of the Four Steps:

1. Identify the problem.
2. Understand the problem.
3. Solve the problem.
4. Waterslide down rainbows for eternity.

I sympathize entirely with the wish to summarize all the progress made so far into a sentence-long definition. But I don't think it's completely reasonable. As timtyler suggests, I think the thing to do at this point is use the pornography rule: "I know it when I see it."

It's entirely unclear that we know it when we see it either. Humans don't qualify as a human-friendly intelligence, for example (or us creating an uFAI wouldn't be a danger). We might know something's not it when we see it, but that is not the same thing as knowing something is it when we see it.

I agree. But it's the best we've got when we're not domain experts, or so it seems.

Maybe "Friendly AI" should be understood in a defined-by-pointing-at-it way, as "that thing Eliezer keeps talking about?"