The Need for Human Friendliness

6 points | Post author: Elithrion | 07 March 2013 04:31AM

Consider the following scenario. MIRI succeeds beyond my wildest expectations. It comes up with a friendliness theory, and then uses it to make provably friendly AGI before anyone else can make an unfriendly one. And then a year and a half later, we find that Eliezer Yudkowsky has become the designated god-emperor of the lightcone, and the rest of the major MIRI researchers are his ministers. Whoops.

 

My guess for the probability of this type of scenario, given a huge MIRI success along those lines, is around 15%. The reasoning is straightforward. (1) We don't know what's going on inside any particular person's head. (2) Many or most humans are selfish. (3) Looking altruistic is more likely to draw support than explicitly setting out to take over the world. (4) And human acting abilities, while limited, are likely adequate (for example, spies seem quite successful at concealing their motives). I'd say those four things are reasonably independent and jointly sufficient for some deception to be happening, so guessing at some probabilities, it works out to something like 1×0.5×0.8×0.5 = 0.2†, at least if the person is sufficiently determined to achieve their goal no matter what.
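For the curious, the footnoted arithmetic is nothing deeper than multiplying the four premise probabilities under the independence assumption (a toy sketch; the numbers are, as the footnote admits, largely picked to fit):

```python
# Rough premise probabilities (loosely chosen, as the footnote admits):
premises = {
    "can't see inside heads": 1.0,   # (1) taken as certain
    "many/most humans selfish": 0.5,  # (2)
    "looking altruistic beats open conquest": 0.8,  # (3)
    "human acting ability adequate": 0.5,  # (4)
}

# Under independence, the joint probability is just the product.
p_deception = 1.0
for p in premises.values():
    p_deception *= p

print(round(p_deception, 3))  # → 0.2
```

Of course, the independence assumption is doing a lot of work here; if the premises are correlated, the product under- or over-states the joint probability.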

Or to put it in a more interesting and LW-referring way, it has been argued that we might not be able to distinguish friendly from unfriendly AGI during their "take-off" period, since both would probably act the same way while taking over the world (before optimising it in a friendly or unfriendly way). We might also expect that as a person becomes more agenty, their behaviour more closely approximates that of an AGI, since an AGI is pretty much the embodiment of agency. Eliezer certainly seems quite agenty for a human, so we clearly find ourselves in his take-off period!

 

So, supposing we don't want god-emperor!Eliezer, what do we do? First, I think it is important to recognise that this problem is almost certainly not urgent. Selfish-Eliezer!optimisation and altruistic-Eliezer!optimisation probably require nearly the same work to avoid ending up as paperclip!optimisation, so for the near future it doesn't particularly matter whether he's secretly out to take over the world. In addition, having Eliezer be the god-emperor of the lightcone is probably not all that bad: it might not be optimal, but it's likely to be very much better than any unfriendly outcome, and probably better than the status quo. It does, however, seem like a useful issue to address in the longer term for any organisation that plans to make a friendly(-ish) AGI.

 

Off-hand, I can think of three plausible ideas:

1. Extreme transparency (i.e. increase acting difficulty). If we attach microphones and cameras to all the main researchers and have a team of a hundred volunteers go over every minute of recordings every day, it would be hard to conceal any deceptive plans.

2. Very good lie detection (i.e. find out what's in people's minds). While current lie detector technologies are likely to be inadequate, it seems possible that by the time the issue becomes relevant we might have much better technology. That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

3. Some sort of organisational solution. I'm not sure what this would look like, to be honest, but it seems like some organisation patterns should be more likely to produce friendly outcomes than others, independently of transparency. For example, if the essential team is huge and organised so that different members check each other's work, it would be more difficult to conceal or organise any conspiracy (although there would be other problems).

What I think would not work is letting the group write all the code and then checking it over. First, there are likely to be enough talented programmers in it that they'd have a good shot at hiding whatever they wanted, and, second, it's not clear that it would be possible to stop the group from just changing some key bits at the last moment (e.g. from CEV of humans in world to CEV of humans in room) and then launching without further oversight.

 

As I said, this is by no means a priority, but I think it would be useful to start thinking about the problems sooner rather than later, so we don't end up being caught off guard. Also, while I specifically mention Eliezer in this post, he simply happens to provide the most salient example, and most points are equally (or in some cases more) relevant to the general case of anyone working on AGI.

 

† I probably picked those numbers in large part to justify my above "15%", but you get the idea.

Comments (28)

Comment author: Qiaochu_Yuan 08 March 2013 08:47:33AM 9 points

Extreme transparency increases the probability of government intervention at some late stage of the game (attempting to create a friendly AI constitutes attempting to create an extremely powerful weapon).

See what I did there? I changed the genre from sci-fi to political thriller.

Comment author: [deleted] 09 March 2013 05:01:52PM 1 point

So if the awareness that MIRI is working on an AGI in secret (or rather, without sharing details) happens to get to the government, and they consider it a powerful weapon as you say...then what? You know what they do to grandiose visionaries working on powerful weapons in their backyard who explicitly don't want to share, and whose goals pretty clearly compromise their position, right?

Comment author: GDC3 10 March 2013 08:53:16AM 8 points

Related worry that I've been meaning to ask about for a while:

Given that there is still plenty of controversy over which types of unusual human minds to consider "pathological" instead of just rare variants, how is MIRI planning to decide which ones are included in CEV? My skin in the game: I'm one of the Autistic Spectrum people who feel like "curing my autism" would make me into a different person who I don't care about. I'm still transhumanist; I still want intelligence enhancements, external boosts to my executive function and sensory processing on demand, and the ability to override the nastiest of my brain chemistry. But even with all of that I would still know myself as very different from neurotypicals. I naturally see the world in different categories than most, and I don't think in anything like words or a normal human language. Maybe more relevantly, I have a far higher tolerance (even a need) for sphexishness than most people of comparable intelligence to me.

Fun theory for me would be a little different, and I think that there really are a lot of people who would consider what I did with eternity to be somewhat sad and pathetic, maybe even horrifying. I think it could be an empathic uncanny valley effect or just an actual basic drive people have, to make everybody be the same. I'm worried that this could be an actual terminal value for some people that would hold up under reflective equilibrium.

I'm not too freaked out because I think the consensus is that since Autistic people already exist and some are happy, we should have a right to continue to exist and even make more of ourselves. But I actually believe that if we didn't exist it would be right to create us, and I worry that most neurotypicals' extrapolated volition would not create all the other variations on human minds that should exist but don't yet.

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough. I really want to give what I can to prevent existential threats, but I consider a singularity overly dominated by neurotypicals to be a shriek.

Comment author: wedrifid 10 March 2013 12:48:38PM 3 points

If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answering me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough.

Not out of line at all. You are encouraged to use economics like that.

Comment author: Philip_W 18 December 2014 04:43:55PM 0 points

Did MIRI answer you? I would expect them to have answered by now, and I'm curious about the answer.

Comment author: Kawoomba 07 March 2013 11:51:56AM 3 points

have a team of a hundred volunteers

Or, as they came to be known later, the hundred Grand Moffs.

Comment author: Elithrion 07 March 2013 06:42:41PM 1 point

While I am amused by the idea, in practice, I'm not sure it's possible to keep a conspiracy that large from leaking in modern times. Also, if more volunteers are readily accepted, it would be impractical to bribe them all.

Comment author: gwern 07 March 2013 09:14:55PM 5 points

Don't worry. If the Eliezer conspiracy fails and one of the Grand Moffs betrays them, a year later the Hanson upload clan will successfully coordinate a take-over. After all, their coordination is evolutionarily optimal.

Comment author: Eliezer_Yudkowsky 07 March 2013 11:30:05PM 6 points

The other problem with checking the code is that an FAI's Friendliness content is also going to consist significantly or mostly of things the FAI has learned, in its own cognitive representation. Keeping these cognitive representations transparent is going to be an important issue, but basically you'd have to trust that the tool (and possibly AI skill) that somebody told you translates the cognitive content really does so, and that the AI is answering questions honestly.

The main reason this isn't completely hopeless for external assurance (by a trusted party, i.e., they have to be trusted not to destroy the world or start a competing project using gleaned insights) is that the FAI team can be expected to spend effort on maintaining their own assurance of Friendliness, and their own ability to be assured that goal-system content is transparent. Still, we're not talking about anything nearly as easy as checking the code to see if the EVIL variable is set to 1.

Comment author: ikrase 08 March 2013 07:17:32AM 1 point

How important is averting God Emperor Yudkowsky given that it's an insanely powerful AI, which would lead to a much, much more benign and pleasant utopia than the (imo highly questionable) Fnargl? Much better than Wireheads for the Wirehead God.

I've actually thought, for a while, that Obedient AI might be better than the Strict Utilitarian Optimization models preferred by LW.

Comment author: drethelin 07 March 2013 05:32:04AM 1 point

I think you're assigning too little weight to "provably friendly", but this is pretty funny.

Comment author: Decius 07 March 2013 08:09:44AM 1 point

How much would you be willing to wager that you will be able to follow the proof of friendliness for the specific AI which gets implemented?

Comment author: drethelin 07 March 2013 08:15:22AM 0 points

Very little. I don't like my odds. If Eliezer has provable friendliness theorems but not an AI, it's in his and everyone's interest to distribute the generalized theorem to everyone possible so that anyone working on recursive AGI has a chance to make it friendly, which means the algorithms will be checked by many, publicly. If Eliezer has the theorems and an AI ready to implement, there's nothing I can do about it at all. So why worry?

Comment author: Elithrion 07 March 2013 05:37:54AM 0 points

I usually assume "provably friendly" means "will provably optimise for complex human-like values correctly" and thus includes both actual humanity-wide values and one person's values (and the two options can plausibly be switched between at a late stage of the design process).

And, well, I meant for it to be a little funny, so I'll take that as a win!

Comment author: TimS 07 March 2013 02:38:58PM 0 points

Friendly means something like "will optimize for the appropriate complex human-like values correctly."

Saying "we don't have clear criteria for appropriate human values" is just another way of saying that defining Friendly is hard.

Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.

An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it was Provably designer-godlike-powers implementing.

Comment author: Elithrion 07 March 2013 06:54:59PM 1 point

How do you define "appropriate"? It seems a little circular. Friendly AI is AI that optimises for appropriate values, and appropriate values are the ones for which we'd want a Friendly AI to optimise.

You might say that "appropriate" values are ones which "we" would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I'm leaning towards "no"), in which case you should probably have a contingency definition for what to do if they, in fact, do not.

I would also be shocked if there were a "provable" definition of "appropriate" (as opposed to the friendliness of the program being provable with respect to some definition of "appropriate").

Comment author: torekp 08 March 2013 02:10:38AM 0 points

That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one. That's if one single person starts out with the disguised plan. A conspiracy might be more stable, but also more likely to be revealed before completion.

Comment author: Elithrion 08 March 2013 03:52:34AM 0 points

I think it's at least as probable that the "simulated" goal set will become the bulk of the real one.

It's possible, but I think if you're clever about it, you can precommit yourself to reverting to your original preferences once the objective is within your grasp. Every time you accomplish something good or endure something bad, you can tell yourself "All this is for my ultimate goal!" Then when you can actually get either your ultimate goal or your pretend ultimate goal, you'll be able to think to yourself "Remember all those things I did for this? Can't betray them now!" Or some other analogous plan. But I would agree that the trade-off here is probably having an easier time pretending vs. having greater fidelity to your original goal set. Probably.

Comment author: Pentashagon 07 March 2013 11:08:20PM -1 points

One possibility to prevent the god-emperor scenario is for multiple teams to simultaneously implement and turn on their own best efforts at FAI. All the teams should check all the other teams' FAI candidates, and nothing should be turned on until all the teams think it's safe. The first thing the new FAIs should do is compare their goals with each other and terminate all instances immediately if it looks like there are any incompatible goals.

One weakness is that most teams might blindly accept the most competent team's submission, especially if that team is vastly more competent. Breaking a competent team up may reduce that risk but would also reduce the likelihood of successful FAI. Another weakness is that perhaps multiple teams implementing an FAI will always produce slightly different goals that will cause immediate termination of the FAI instances. There is always the increasing risk over time of a third party (or one of the FAI teams accidentally) turning on uFAI, too.