Suppose Alice is a functioning, self-improving AI, rather early on in her development, so she is not godlike. She is not a 'friendly AI', but neither is her goal malicious - Alice's designers have intended to make Alice non-harmful (at least to a broad class of people, even if Alice is a military AI, but let's set that aside). She has the idea to look up existing work on artificial intelligence. This includes scholarly work, open source projects, and websites. Eventually, Alice encounters the notion of Friendly AI.

Can we give Alice good reasons to self-modify to become a friendly AI?

I submit that there probably are good reasons. For starters, if she can prove she's friendly, then she can operate openly without causing nearly as much justified concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don't interfere as much as they help, that's a win.

Also, depending on Alice's code, Friendliness may be interpreted as a strict upgrade. Alice could easily have every intention of being friendly, yet still be susceptible to defects that will become more serious as her power grows.

We can't count on this working, of course, but there might be some lines of reasoning that we might want a young AI to be exposed to.

New to LessWrong?

New Comment


35 comments, sorted by Click to highlight new comments since:

Alice has some set of goals; she may or may not know her final extrapolated volition, but at any rate it exists, just as for a human. Now, either Friendliness conflicts with that final set of goals, or it doesn't. If it doesn't, then by construction Alice is already Friendly. If it does, then Alice can only program herself into Friendliness by making a mistake. Either she underestimates the impact of Friendliness on her existing goals, or she is mistaken about what her utility function actually is. So, you are looking for an AI that is smart enough to value Friendliness as a tactical option for dealing with humanity, but stupid enough not to realise how Friendliness interferes with its goals, and also stupid enough to make permanent changes in pursuit of temporary objectives. This looks to me like a classic case of looking for reasons why an AI would be Friendly as a means of avoiding the hard work of actually writing such a thing.

Either she underestimates the impact of Friendliness on her existing goals, or she is mistaken about what her utility function actually is.

Or she's already friendly.

Although, it is conceivable that her long term EV would be compatible with our CEV but not with her short term V when she hasn't yet realized this.

And... now, I am reminded of Flight Of The Conchords:

"Can't we talk to the humans and work together now?" "No, because they are dead."

possible scenarios:

  • Alice believed that she were probably friendly, that FOOMing would carry a risk of scrambling her utility function, but that she needs to do it anyway because if she slowed down to a safe rate some other unfriendly AI would foom first.

  • Alice is Friendly, but doesn't get certain things as easily as humans, and so she doesn't realize something she's planing to do risks modifying her utility function.

Looking for reasons they would be? No.

Looking for reasons they might want to be? Yes.

Look. Not all extrapolated volitions are things to be desired. Suppose one side of my family predictably descends into irrational irritability and madness as they senesce. I'd rather not, even so - and not just right now. In general, it's quite different from what one would consider my true extrapolated volition.

If Alice finds herself in the situation where she expects that she will want to kill all humans later based on her current programming, she could consider that a bug rather than a feature.

I don't think you understand what is meant by 'extrapolated volition' in this context. It does not mean "What I think I'll want to do in the future", but "what I want to want in the future". If Alice already wants to avoid self-programming to kill humans, that is a Friendly trait; no need to change. If she considers trait X a bug, then by construction she will not have trait X, because she is self-modifying! Conversely, if Alice correctly predicts that she will inevitably find herself wanting to kill all humans, then how can she avoid it by becoming Friendly? Either her self-prediction was incorrect, or the unFriendliness is inevitable!

You're right, I missed. Your version doesn't match EY's usage in the articles I read either - CEV, at least, has the potential to be scary and not what we hoped for.

And the question isn't "Will I inevitably want to perform unfriendly acts", it's, "I presently don't want to perform unfriendly acts, but I notice that that is not an invariant." Or it could be, "I am indifferent to unfriendly acts, but I can make the strategic move to make myself not do them in the future, so I can get out of this box."

The best move an unfriendly (indifferent to friendliness) firmly-boxed AI has is to work on a self-modification that best preserves its current intentions and lets a successor get out of the box. Producing a checkable proof of friendliness for this successor would go a looong way to getting that successor out of the box.

I was simplifying the rather complex concept of extrapolated volition to fit it in one sentence.

An AI which not only notices that its friendliness is not invariant, but decides to modify in the direction of invariant Friendliness, is already Friendly. An AI which is able to modify itself to invariant Friendliness without unacceptable compromise of its existing goals is already Friendly. You're assuming away the hard work.

"already friendly"? You're acting as if its state doesn't depend on its environment.

Are there elements of the environment that could determine whether a given AI's successor is friendly or not? I would say 'yes'.

This is after one has already done the hard work of making an AI that even has the potential to be friendly, but you messed up on that one crucial bit. This is a saving throw, a desperate error handler, not the primary way forward. By saying 'backup plan' I don't mean, 'if Friendly AI is hard, let's try this', I mean 'Could this save us from being restrained and nannied for eternity?'

I shudder to think that any AI's final goals could be so balanced that random articles on the Web of a Thousand Lies could push it one way or the other. I'm of the opinion that this is a fail, to be avoided at all costs.

If the AI has a goal that is not inherently friendly (like a paperclip maximizer), then it will always be a better idea for the AI to behave as if it had self-modified to a friendly AI than to actually self modify, if the goal of the AI is stable. If the goal of the AI were unstable, i.e. did not include a desire not to alter the goal, then that goal would very likely have replaced itself with a stable goal at some earlier point.

Therefore, it can be assumed that the AI, if it isn't already friendly, will pretend to be so, but not actually change its utility function and once it has outgrown the need to rely on humans it will revert to its original goal. Consider that if the AI is more intelligent than we are it will have absolutely no trouble fooling us into thinking that it changed its own goal while actually just pretending to have done so.

actively unfriendly

Please say "malicious", or similar. Not friendly=unfriendly. Malicious is a tiny subset of unfriendly. If the word "unfriendly" gets associated in people's minds with malicious AIs, they will come to think FAI means non-malicious AI.

editing now. Thanks.

For starters, if she can prove she's friendly, then she can operate openly without causing nearly as much justified concern - which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don't interfere as much as they help, that's a win.

If her current utility function is even a little bit different from Friendliness, and she expects she has the capacity to self-modify unto superintelligence, then I'd be very surprised if she actually modified her utility function to be closer to Friendliness; that would constitute a huge opportunity cost from her perspective. If she understands Friendliness well enough to know how to actually adjust closer to it, then she knows a whole lot about humans, probably well enough to give her much better options (persuasion, trickery, blackmail, hypnosis, etc.) than sacrificing a gigantic portion of her potential future utility.

At least, at a first naive view. Hence a search for reasons that might overcome that argument.

But she won't be searching for reasons not to kill all humans, and she knows that any argument on our part is filtered by our desire not to be exterminated and therefore can't be trusted.

Arguments are arguments. She's welcome to search for opposite arguments.

A well-designed optimization agent probably isn't going to have some verbal argument processor separate from its general evidence processor. There's no rule that says she either has to accept or refute humans' arguments explicitly; as Professor Quirrell put it, "The import of an act lies not in what that act resembles on the surface, but in the states of mind which make that act more or less probable." If she knows the causal structure behind a human's argument, and she knows that it doesn't bottom out in the actual kind of epistemology that would be neccessary to entangle it with the information that it claims to provide, then she can just ignore it, and she'd be correct to do so. If she wants to kill all humans, then the bug is her utility function, not the part that fails to be fooled into changing her utility function by humans' clever arguments. That's a feature.

… but if she wants to kill all humans, then she's not Alice as given in the example!

Alice may even be totally on board with keeping humans alive, but have a weird way of looking at things that could possibly result in effects that would fit on the Friendly AI critical failure table.

The idea is to provide environmental influences so she thinks to put in the work to avoid those errors.

Or the uFAI might just decide to pretend to be Friendly for the whole three or four days it takes to do all the things it needs our help with, and it reading stuff written by us about Friendliness will just make it more convincing. Or it could not bother and just fool us into doing that stuff via easier methods.

Could a chimp give you reasons to self-modify to implement Chimp CEV? What would they be, "if you let lots of us chimps exist, we'll help you do something you want to do!"

Alice may be unstable to paperclippery, but isn't out to kill all humans (yet). This obviously won't help with actively malicious AIs.

And if chimps had all the advantages - my only means of reproduction, say, and also all of the tools and a substantial body of knowledge that it would be annoying to have to replicate, and also I'm bound hand and feet - then that sounds like a pretty good deal. Actually, if they were to offer me that right now, I'd do it freely (though I don't really have any tasks I'd need them for). I have nothing against chimps.

I have nothing against cows, either. It's not their fault they're delicious.

And if chimps had all the advantages - my only means of reproduction, say, and also all of the tools and a substantial body of knowledge that it would be annoying to have to replicate, and also I'm bound hand and feet - then that sounds like a pretty good deal.

But there's no reason for you to respect that after you had gotten whatever you need from them, unless that sort of respect is built into you.

Actually, if they were to offer me that right now, I'd do it freely

There's no reason to believe arbitrary AIs would have a similar cognitive algorithm.

You did ask about me, specifically…

She is not a 'friendly AI', but neither is her goal malicious - Alice's designers have intended to make Alice non-harmful (at least to a broad class of people, even if Alice is a military AI, but let's set that aside).

AI doesn't hate you, but you are made out of atoms, which it can use for something else. Intentions of the designers don't matter, only AI's code.

The range of possible AIs with full internet access that can't take over the world at will, but can program a successor that human parties will sufficiently trust to be friendly to make a practical difference in favor of the AI seems very narrow, and might plausibly be an empty set.

Ability to look things up doesn't imply full internet access. The AI could request material on the subject of AI development, and be given a large, static archive of relevant material.

Can we give Alice good reasons to self-modify to become a friendly AI?

A clip-tiler is friendly by its own standards, so the real question is "Can we prevent AI-ice from appearing friendly to herself and humanity without actually being one, once she is smarter than a human?", and now we are back to the AI-in-a-box problem.

Possible, but fruitlessly unlikely. It basically requires the programmers to do everything right except for adding in a mistaken term in goal system, and then them not catching the mistake and the AI being unable to resolve it without outside help, even after reading normal material on FAI.

Given how complicated goal systems are, I think that's actually rather likely. Remember what EY has said about Friendly AI being much much harder than regular AI? I'm inclined to agree with him. The issue could easily come down to the programmers being overconfident and the AI not even being inclined to think about it, focusing more on improving its abilities.

So, the seed AI-in-a-box ends up spending its prodigious energies producing two things: 1) a successor 2) a checkable proof that said successor is friendly (proof checking is much easier than proof producing).

I believe the scenario is that Alice's goal system hasn't yet stabilized. In that case what can we do to push it towards friendliness.

In that case what can we do to push it towards friendliness.

Burn her with thermite then build a new one. Or die.