drethelin comments on The Need for Human Friendliness - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (28)
I think you're assigning too little weight to "provably friendly", but this is pretty funny.
How much would you be willing to wager that you will be able to follow the proof of friendliness for the specific AI that gets implemented?
Very little. I don't like my odds. If Eliezer has provable friendliness theorems but no AI, it's in his and everyone's interest to distribute the generalized theorem as widely as possible, so that anyone working on recursive AGI has a chance to make it friendly — which means the algorithms will be checked by many people, publicly. If Eliezer has both the theorems and an AI ready to implement, there's nothing I can do about it at all. So why worry?
I usually assume "provably friendly" means "will provably optimise for complex human-like values correctly" and thus includes both actual humanity-wide values and one person's values (and the two options can plausibly be switched between at a late stage of the design process).
And, well, I meant for it to be a little funny, so I'll take that as a win!
Friendly means something like "will optimize for the appropriate complex human-like values correctly."
Saying "we don't have clear criteria for appropriate human values" is just another way of saying that defining Friendly is hard.
Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.
An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it were Provably designer-godlike-powers-implementing.
How do you define "appropriate"? It seems a little circular: a Friendly AI is one that optimises for appropriate values, and appropriate values are the ones for which we'd want a Friendly AI to optimise.
You might say that "appropriate" values are ones which "we" would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I'm leaning towards "no"), in which case you should probably have a contingency definition for what to do if they, in fact, do not.
I would also be shocked if there were a "provable" definition of "appropriate" (as opposed to the friendliness of the program being provable with respect to some given definition of "appropriate").