Inspired by Don't Plan For the Future.

For the purposes of discussion on this site, a Friendly AI is assumed to be one that shares our terminal values. It's a safe genie that doesn't need to be told what to do, but anticipates how to best serve the interests of its creators. Since our terminal values are a function of our evolutionary history, it seems reasonable to assume that an FAI created by one intelligent species would not necessarily be friendly to other intelligent species, and that being subsumed by another species' FAI would be fairly catastrophic.

Except.... doesn't that seem kind of bad? Supposing I were able to create a strong AI, and it created a sound fun-theoretic utopia for human beings, but then proceeded to expand and subsume extraterrestrial intelligences, and subject them to something they considered a fate worse than death, I would have to regard that as a major failing of my design. My utility function assigns value to the desires of beings whose values conflict with my own. I can't allow other values to supersede mine, but absent other considerations, I have to assign negative utility in my own function for creating negative utility in the functions of other existing beings. I'm skeptical that an AI that would impose catastrophe on other thinking beings is really maximizing my utility.

It seems to me that to truly maximize my utility, an AI would need to have consideration for the utility of other beings. Secondary consideration, perhaps, but it could not maximize my utility simply by treating them as raw material with which to tile the universe with my utopian civilization.
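One toy way to put that intuition into code (a minimal sketch with invented weights and numbers, not a claim about how a real FAI would aggregate utilities): treat my effective utility as my own terms plus a smaller, but nonzero, weight on the utilities of other existing beings.

```python
# Minimal sketch with invented numbers: an agent whose utility function
# includes a discounted term for the utilities of other existing beings.

def combined_utility(own_utility: float,
                     others_utilities: list[float],
                     altruism_weight: float = 0.2) -> float:
    """My utility plus a smaller weight on each other being's utility."""
    return own_utility + altruism_weight * sum(others_utilities)

# An outcome that is excellent for me but catastrophic for another
# civilization can still come out negative overall under such a function.
print(combined_utility(own_utility=100.0, others_utilities=[-1000.0]))  # -100.0
print(combined_utility(own_utility=100.0, others_utilities=[-200.0]))   # 60.0
```

On this toy picture, an AI that bought my utopia at the price of a catastrophe for other thinking beings would not actually be maximizing the function above.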

Perhaps my utility function gives more value than most to beings that don't share my values (full disclosure: I prefer the "false" ending of Three Worlds Collide, although I don't consider it ideal). However, if an AI imposes truly catastrophic fates on other intelligent beings, my own utility function takes such a hit that I cannot consider it friendly. A true Friendly AI would need to be at least passably friendly to other intelligences to satisfy me.

I don't know if I've finally come to terms with Eliezer's understanding of how hard Friendly AI is, or made it much, much harder, but it gives me a somewhat humbling perspective on the true scope of the problem.

You seem to jump from "our terminal values are a function of our evolutionary history" to "our terminal values do not include terms for the well-being of any aliens which we might encounter", which does not follow and which is, as evidenced by this post, untrue. A CEV-based FAI would spend resources to help aliens to exactly the degree that we would care about those aliens.

I didn't make that jump; I wrote this post after reading that jump being made in the thread linked at the top, which prompted me to think more about the issue.

Our FAI might care about alien values, but that doesn't mean an alien FAI would care about ours.

Assuming that Desrtopa isn't weird (or WEIRD), it would help if we knew the critical causes of humans' caring about alien values. For example, suppose social animals, upon achieving intelligence and language, come to care about the values of anyone they can hold an intelligent conversation with. (Not that we know this now - but perhaps we could, later, if true.) In that case, we may be safe as long as intelligent aliens are likely to be social animals. (ETA: provided, duh, that they don't construct uFAI and destroy themselves, then us.)

The key question is how much I trust the (hypothetical) CEV-extracting algorithm that developed the FAI to actually do what its programmers intended.

If I think it's more reliable than my own bias-ridden thinking process, then if the FAI it produces does something that I reject -- for example, starts disassembling alien civilizations to replace them with a human utopia -- presumably I should be skeptical about my rejection. The most plausible interpretation of that event is that my rejection is a symptom of my cognitive biases.

Conversely, if I am not skeptical of my rejection -- if I watch the FAI disassembling aliens and I say "No, this is not acceptable" and try to stop it -- it follows that I don't think the process is more reliable than my own thinking.

As I've said before, I suspect that an actual CEV-maximizing AI would do any number of things that quite a few humans (including me) would be horrified by, precisely because I suspect that quite a few humans (including me) would be horrified by the actual implications of their own values being maximally realized.

I suspect that quite a few humans (including me) would be horrified by the actual implications of their own values being maximally realized.

How exactly could you be "horrified" about that unless you were comparing some of your own values being maximally realized with some of your other values not being maximally realized?

In other words, it doesn't make sense (doesn't even mean anything!) to say that you would be horrified (isn't that a bad thing?) to have your desires fulfilled (isn't that a good thing?), unless you're really just talking about some of your desires conflicting with some of your other desires.

Because humans' values are not a coherent, consistent set. We execute an evolutionarily-determined grab-bag of adaptations; there is no reason to assume this grab-bag adds up to a more coherent whole than "don't die out." (And even that's a teleological projection onto stuff just happening.)

If I am completely and consistently aware of what I actually value, then yes, my desires are equivalent to my values and it makes no sense to talk about satisfying one while challenging the other (modulo cases of values-in-tension, as you say, which isn't what I'm talking about).

My experience is that people are not completely and consistently aware of what they actually value, and it would astonish me if I turned out to be the fortunate exception.

Humans frequently treat instrumental goals as though they were terminal. Indeed, I suspect that's all we ever do.

But even if I'm wrong, and there really are terminal values in there somewhere, I expect that most humans aren't aware of them. And if some external system starts optimizing for them, willing to trade arbitrary amounts of a merely instrumental good for the terminal good it serves as a proxy for (as well it should), we'll experience that as emotionally unpleasant and challenging.

Solid answer, as far as I can see right now.

I see reliability and friendliness as separate questions. An AI might possess epistemic and instrumental rationality that's superior to ours, but not share our terminal values, in which case, I think it makes sense to regard it as reliable, but not friendly.

That is certainly true.

But the theory here, as I understand it, is that it's possible to build a merely reliable system -- a seed AI -- that determines what humanity's CEV is and constructs a self-improving target AI that maximizes it. That target AI is Friendly if the seed AI is reliable, and not otherwise.

So they aren't entirely separable questions, either.

Updated and upvoted, but if you find yourself horrified by the AI's actions, I think that would be fairly strong evidence that the AI had not been sufficiently reliable in extrapolating your values.

Sure.

But one could argue that I ought not run such a seed AI in the first place until my confidence in its reliability was so high that even updating on that evidence would not be enough to make me distrust the target AI. (Certainly, I think EY would argue that.)
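To put toy numbers on "confidence so high that even updating on that evidence would not be enough" (all probabilities here are invented purely for illustration), here is a quick Bayesian sketch:

```python
# Toy Bayesian update with invented probabilities: how much should watching
# the target AI do something horrifying lower my confidence in the seed AI?

def posterior_reliable(prior_reliable: float,
                       p_horrified_given_reliable: float,
                       p_horrified_given_unreliable: float) -> float:
    """P(seed AI was reliable | I am horrified by the target AI's actions)."""
    evidence = (p_horrified_given_reliable * prior_reliable
                + p_horrified_given_unreliable * (1.0 - prior_reliable))
    return p_horrified_given_reliable * prior_reliable / evidence

# With a 99% prior and horror being ten times likelier under failure,
# one horrifying observation still leaves ~91% confidence in the seed AI...
print(posterior_reliable(0.99, 0.05, 0.50))  # ~0.908
# ...but starting from a 90% prior, the same observation drops it to ~47%.
print(posterior_reliable(0.90, 0.05, 0.50))  # ~0.474
```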

It seems analogous to the question of when I should doubt my own senses. There is some theoretical sense in which I should never do that: since the vast majority of my beliefs about the world are derived from my senses, it follows that when my beliefs contradict my senses I should trust my senses and doubt my beliefs. And in practice, that seems like the right thing to do most of the time.

But there are situations where the proper response to a perception is to doubt that its referent exists... to think "Yes, I'm seeing X, but no, X probably is not actually there to be seen." They are rare, but recognizing them when they occur is important. (I've encountered this seriously only once in my life, shortly after my stroke, and successfully doubting it was... challenging.)

Similarly, there are situations where the proper response to a moral judgment is to doubt the moral intuitions on which it is based... to think "Yes, I'm horrified by X, but no, X probably is not actually horrible."

Agreed, but if you do have very high confidence that you've made the AI reliable, and also a fairly reasoned view of your own utility function, I think you should be able to predict in advance with reasonable confidence that you won't find yourself horrified by whatever it does. And I predict that if an AI subsumed intelligent aliens and subjected them to something they considered a terrible fate, I would be horrified.

(I've encountered this seriously only once in my life, shortly after my stroke, and successfully doubting it was... challenging.)

Please elaborate! It sounds interesting and it would be useful to hear how you were able to identify such a situation and successfully doubt your senses.

I'm not prepared to tell that story in its entirety here, though I appreciate your interest.

The short form is that I suffered significant brain damage and was intermittently delirious for the better part of a week, in the course of which I experienced both sensory hallucinations and a variety of cognitive failures.

The most striking of these had a fairly standard "call to prophecy" narrative, with the usual overtones of Great Significance and Presence, etc.

Doubting it mostly just boiled down to asking the question "Is it more likely that my experiences are isomorphic to external events, or that they aren't?" The answer to that question wasn't particularly ambiguous, under the circumstances.

The hard part was honestly asking that question, and being willing to focus on it carefully enough to arrive at an answer when my brain was running on square wheels, and being willing to accept the answer when it required rejecting some emotionally potent experiences.

The scope of CEV is humanity: not just the designer's volition, and not the volition of non-human intelligences. Why?

If you exclude non-humans' volition (except indirectly if humans care about it), then why not exclude the volition of all humans but the designer (except indirectly if the designer cares about it)?

If all humans' volitions were identical or very similar, I could see an argument along the lines of Drescher or TDT. But they are not.

So, you could include the volition of all entities, but only the portion which overlaps with that of the designer. This would indeed consider humans much more than non-human intelligences. But that gets into dangerous territory, with educated Western liberals getting a higher weight (assuming that the usual crowd develops the AGI).
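Purely for illustration, here is one way "include everyone's volition, but only the portion that overlaps with the designer's" could be formalized. The value vectors, the cosine-overlap measure, and the clipping of negative overlap to zero are all inventions for this sketch, not anything taken from the CEV proposal.

```python
# Illustrative sketch: aggregate every entity's "value vector", weighting
# each entity by how much its values overlap with the designer's.
import math

def overlap(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two value vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def aggregate_volition(designer: list[float],
                       entities: list[list[float]]) -> list[float]:
    """Overlap-weighted average of all entities' value vectors; entities whose
    values point away from the designer's get zero weight."""
    weights = [max(0.0, overlap(designer, e)) for e in entities]
    total = sum(weights) or 1.0
    return [sum(w * e[i] for w, e in zip(weights, entities)) / total
            for i in range(len(designer))]

# Entities close to the designer's values dominate the aggregate -- which is
# exactly the "dangerous territory" described above.
designer = [1.0, 0.0]
print(aggregate_volition(designer, [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]))
```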

This is a TDT-flavoured problem, I think. The process that our TDT-using FAI uses to decide what to do with an alien civilization it discovers is correlated with the process that a hypothetical TDT-using alien-Friendly AI would use on discovering our civilization. The outcome in both cases ought to be something a lot better than subjecting us/them to a fate worse than death.

This is similar to a discussion I wanted to start, I'll just leave a comment here instead:

If we were to detect the presence of an alien civilisation before the SIAI implements CEV, should it account for the aliens' extrapolated volition?

I ask myself what advice I would give to terrorists, if they were programming a superintelligence and honestly wanted not to screw it up, and then that is the advice I follow myself.

Eliezer Yudkowsky

I ask myself what advice I would give to aliens, if they were programming a superintelligence and honestly wanted not to screw it up, and then that is the advice I follow myself.

Eliezer Yudkowsky (counterfactual)

There are a few problems:

  • Not accounting for the alien volition could equal genocide.
  • Accounting for the alien volition could outweigh our volition by their sheer number (e.g. if they are insectoids).

Both arguments cut both ways. If you accept the premise that the best way is to account for all agents, then we are left with the problem of possibly being a minority. But what appears much more likely is that we'll be risk-averse and expect the aliens not to follow the same line of reasoning. The FAIs of both civilizations might try to subdue the other.

What implications would arise from the detection of an alien civilization technologically similar to ours?

Accounting for the alien volition could outweigh our volition by their sheer number (e.g. if they are insectoids).

For analogous reasons CEV does not sound particularly 'Friendly' to me.

Accounting for the alien volition could outweigh our volition by their sheer number (e.g. if they are insectoids).

If we account for alien volition directly, then yes, this could be a problem. But if we only care about aliens because we're implementing CEV and some humans care about aliens, then scope insensitivity comes into play and the amount of resources that will be dedicated to the aliens is limited.

Scope insensitivity is a failure to properly account for certain things. CEV is designed to account for everything. It is possible that some conclusions arrived at due to scope insensitivity will be upheld, but we do not yet know whether that is true and current human choices that we know to be the product of biases definitely do not count as evidence about how CEV will choose.

But if we only care about aliens because we're implementing CEV and some humans care about aliens...

If we only implement CEV for people working for the SIAI and some of them care about the rest of humanity...what's the difference?

Note: I get super redundant after like the first reply, so watch out for that. I'm not trying to be an asshole or anything; I'm just attempting to respond to your main point from every possible angle.

For the purposes of discussion on this site, a Friendly AI is assumed to be one that shares our terminal values.

What's a "terminal value"?

My utility function assigns value to the desires of beings whose values conflict with my own.

Even for somebody trying to kill you for fun?

I can't allow other values to supersede mine, but absent other considerations, I have to assign negative utility in my own function for creating negative utility in the functions of other existing beings.

What exactly would those "other considerations" be?

I have to assign negative utility in my own function for creating negative utility in the functions of other existing beings.

Would you be comfortable being a part of putting somebody in jail for murdering your best friend (whoever that is)?

I'm skeptical that an AI that would impose catastrophe on other thinking beings is really maximizing my utility.

What if somebody were to build an AI for hunting down and incarcerating murderers?

Would that "maximize your utility", or would you be uncomfortable with the fact that it would be "imposing catastrophe" on beings "whose desires conflict with [your] own"?

It seems to me that to truly maximize my utility, an AI would need to have consideration for the utility of other beings.

What if the "terminal values" (assuming that I know what you mean by that) of those beings made killing you (for laughs!) a great way to "maximize their utility"?

Perhaps my utility function gives more value than most to beings that don't share my values

But does that extraordinary consideration stretch to the people bent on killing other people for fun?

However, if an AI imposes truly catastrophic fates on other intelligent beings, my own utility function takes such a hit that I cannot consider it friendly.

Would your utility function take that hit if an AI saved your best friend from one of those kinds of people (the ones who like to kill other people for laughs)?

Roughly, a terminal value is a thing you value for its own sake.

This is contrasted with instrumental values, which are things you value only because they provide a path to terminal values.

For example: money, on this view, is something we value only instrumentally... having large piles of money with no way to spend it isn't actually what anyone wants.

Caveat: I should clarify that I am not sure terminal values actually exist, personally.

The linked comment seems to be questioning whether terminal values are stable or unambiguous, rather than whether they exist. Unless the ambiguity goes so deep as to make the values meaningless ... but that seems far-fetched.

Hm... maybe. Certainly my understanding of the concept of terminal values includes stability, so I haven't drawn that distinction much in my thinking.

That said, I don't quite see how considering them distinct resolves any of my concerns. Can you expand on your thinking here?

What's a "terminal value"?

Wikipedia has an article on it.

What exactly would those "other considerations" be?

Things like me getting killed in the course of satisfying their utility functions, as you mentioned above, would be a big one.

Would you be comfortable being a part of putting somebody in jail for murdering your best friend (whoever that is)?

I support a system where we precommit to actions such as imprisoning people who commit crimes, in order to prevent crimes from being committed in the first place. My utility function doesn't get positive value out of retribution against them. If an AI that hunts down and incarcerates murderers is better at preventing people from murdering in the first place, I would be in favor of it, assuming no unforeseen side effects.

Things like me getting killed in the course of satisfying their utility functions, as you mentioned above, would be a big one.

So basically your "utility function assigns value to the desires of beings whose values conflict with your own" unless they really conflict with your own (such as get you killed in the process)?

I assign utility to their values even if they conflict with mine to such a great degree, but I have to measure that against the negative utility they impose on me.

I assign utility to their values even if they conflict with mine to such a great degree, but I have to measure that against the negative utility they impose on me.

So, as to the example, you would value that they want to kill you somewhat, but you would value not dying even more?

That's my understanding of what I value, at least.

That's my understanding of what I value, at least.

Well, I'm not so sure that those words (the ones that I used to summarize your position) even mean anything.

How could you value them wanting to kill you somewhat (which would be you feeling some desire while cycling through a few different instances of imagining them doing something leading to you dying), but also value you not dying even more (which would be you feeling even more desire while moving through a few different examples of imagining you being alive)?

It would be like saying that you value going to the store somewhat (which would be you feeling some desire while cycling through a few different instances of imagining yourself traveling to the store and getting there), but value not actually being at the store (which would be you feeling even more desire while moving through a few different examples of not being at the store). But would that make sense? Do those words (the ones making up the first sentence of this paragraph) even mean anything? Or are they just nonsense?

Simply put, would it make sense to say that somebody could value X+Y (where the addition sign refers to adding the first event to the second in a sequence), but not Y (which is a part of X+Y, which the person apparently likes)?

As you already pointed out to TheOtherDave, we have multiple values which can conflict with each other. Maximally fulfilling one value can lead to low utility as it creates negative utility according to other values. I have a general desire to fulfill the utility functions of others, but sometimes this creates negative utility according to my other values.

Simply put, could you value X+Y (where the addition sign refers to adding the first event to the second in a sequence), but not Y?

Unless I'm misunderstanding you, yes. Y could have zero or negative utility, but the positive utility of X could be great enough that the combination of the two would still have positive overall utility.
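As a toy arithmetic version of that answer (numbers invented for illustration, and assuming the utilities simply add):

```python
# Invented numbers: valuing the combination X+Y doesn't require valuing Y.
utility_X = 10.0   # the other being getting to pursue its values generally
utility_Y = -3.0   # the part of those values aimed at harming me
print(utility_X + utility_Y)  # 7.0 -- positive overall, though Y alone is negative
```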

E.g. you could satisfy both values by helping build a (non-sentient) simulation through which they can satisfy their desire to kill you without actually killing you.

But really I think the problem is that when we refer to individual actions as if they're terminal values, it's difficult to compromise -- true terminal values tend however to be more personal than that.

I'm putting this in Discussion pending the reception, but I'll move it to the main page if anyone thinks it would be a good idea.

I would think it needs more flesh. Maybe try to put some math to how much you want the aliens to be happy. What do you think about human life vs. alien life, in terms of "worth"? Is your personal happiness enough for you, or do you want the universe to be tiled with utopias?

I don't think I know my utility function that well. I'm skeptical that tiling the universe with utopias is actually a good idea, but that might just be scope insensitivity on my part. I'm afraid that putting math to how much I want the aliens to be happy is almost certainly beyond me; I don't think any equation I could write up would be a realistic model of my values.

I'd rather use "Friendly" to mean "having human(e) values", so that "alien FAI" would mean "an AI that ended up arriving at a value system compatible with ours despite being created by aliens".