Related worry that I've been meaning to ask about for a while:
Given that there is still plenty of controversy over which types of unusual human minds to consider "pathological" instead of just rare variants, how is MIRI planning to decide which ones are included in CEV? My skin in the game: I'm one of the Autistic Spectrum people who feel like "curing my autism" would make me into a different person who I don't care about. I'm still transhumanist; I still want intelligence enhancements, external boosts to my executive function and sensory processing on demand, and the ability to override the nastiest of my brain chemistry. But even with all of that I would still know myself as very different from neurotypicals. I naturally see the world in different categories than most, and I don't think in anything like words or a normal human language. Maybe more relevantly, I have a far higher tolerance, even a need, for sphexishness than most people of comparable intelligence to me.
Fun theory for me would be a little different, and I think that there really are a lot of people who would consider what I did with eternity to be somewhat sad and pathetic, maybe even horrifying. I think it could be an empathic uncanny valley effect or just an actual basic drive people have, to make everybody be the same. I'm worried that this could be an actual terminal value for some people that would hold up under reflective equilibrium.
I'm not too freaked out because I think the consensus is that since Autistic people already exist and some are happy, we should have a right to continue to exist and even make more of ourselves. But I actually believe that if we didn't exist it would be right to create us, and I worry that most neurotypicals' extrapolated volition would not create all the other variations on human minds that should exist but don't yet.
If it matters, up to $1000 for MIRI this year could be at stake in answering this concern. I say this in a blatant and open effort to incentivize Eliezer etc. to answer me. I hope that I'm not out of line for waving money around like this, because this really is a big part of my choice about whether FAI is good enough. I really want to give what I can to prevent existential threats, but I consider a singularity overly dominated by neurotypicals to be a shriek.
Did MIRI answer you? I would expect them to have answered by now, and I'm curious about the answer.
Consider the following scenario. MIRI succeeds beyond my wildest expectations. It comes up with a friendliness theory, and then uses it to make provably friendly AGI before anyone else can make an unfriendly one. And then a year and a half later, we find that Eliezer Yudkowsky has become the designated god-emperor of the lightcone, and the rest of the major MIRI researchers are his ministers. Whoops.
My guess for the probability of this type of scenario given a huge MIRI success along those lines is around 15%. The reasoning is straightforward. (1) We don't know what's going on inside any particular person's head. (2) Many or most humans are selfish. (3) Looking altruistic is more likely to draw support than explicitly setting out to take over the world. (4) And human acting abilities, while limited, are likely adequate (for example, spies seem quite successful at concealing their motives). I'd say those four things are reasonably independent and sufficient for some deception to be happening, so guessing at some probabilities, it works out to something like 1×0.5×0.8×0.5 = 0.2.† At least if the person is sufficiently determined to achieve their goal no matter what.
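For anyone who wants to plug in their own guesses, here is a minimal sketch of the arithmetic above in Python. The dictionary labels and the pairing of each number with a premise (taken in the order the premises are listed) are my assumptions; the post only gives the four factors and their product.

```python
from math import prod

# Rough guesses from the post, paired with the four listed premises in order.
# The pairing and the labels are assumptions made for illustration.
estimates = {
    "we cannot see inside anyone's head": 1.0,
    "the person is selfish": 0.5,
    "looking altruistic draws more support": 0.8,
    "acting ability is adequate": 0.5,
}

# Multiplying is only justified if the factors are treated as independent,
# which the post assumes ("reasonably independent").
p_deception = prod(estimates.values())
print(f"{p_deception:.2f}")
```

Changing any one guess shows how sensitive the final figure is to the individual estimates, which is the main reason to write it out.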
Or to put it in a more interesting and LW-referring way, it has been argued that we might not be able to distinguish friendly from unfriendly AGI during their "take off" period, since both would probably act the same way to take over the world (before optimising it in a friendly or unfriendly way). We might also expect that as a person becomes more agenty, their behaviour more closely approximates that of an AGI, since an AGI is pretty much the embodiment of agency. It certainly seems like Eliezer is quite agenty for a human, so we clearly presently find ourselves in his take off period!
So, supposing we don't want god-emperor!Eliezer, what do we do? First, I think it is important to recognise that this problem is almost certainly not urgent. Selfish-Eliezer!optimisations and altruistic-Eliezer!optimisations probably require nearly the same work to not end up being paperclips!optimisations, so for the near future it doesn't particularly matter if he's secretly out to take over the world or not. In addition, having Eliezer be the god-emperor of the light cone is probably not all that bad—it might not be optimal, but it's likely to be very much better than any unfriendly outcome, and probably better than the status quo. It does, however, seem like a useful issue to address in the longer term for any organisation that plans to make a friendly(-ish) AGI.
Off-hand, I can think of three plausible ideas:
• Extreme transparency (i.e. increase acting difficulty). If we attach microphones and cameras to all the main researchers and have a team of a hundred volunteers go over every minute of recordings every day, it would be hard to conceal any deceptive plans.
• Very good lie detection (i.e. find out what's in people's minds). While current lie detector technologies are likely to be inadequate, it seems possible that by the time the issue becomes relevant we might have much better technology. That said, it might also be the case that if someone keeps simulating having a particular goal set for decades, she may be able to easily generate responses based on that goal set without the responses registering as lies in any conventional sense.
• Some sort of organisational solution. I'm not sure what this would look like, to be honest, but it seems like some organisation patterns should be more likely to produce friendly outcomes than others independently of transparency. For example, if the essential team is huge and organised so different members check each other's work, it's likely to be more difficult to conceal or organise any conspiracy (although there would be other problems).
What I think would not work is letting the group write all the code and then checking it over. First, there are likely to be enough talented programmers in it that they'd have a good shot at hiding whatever they wanted, and, second, it's not clear that it would be possible to stop the group from just changing some key bits at the last moment (e.g. from CEV of humans in world to CEV of humans in room) and then launching without further oversight.
As I said, this is by no means a priority, but I think it would be useful to start thinking about the problems sooner rather than later, so we don't end up being caught off guard. Also, while I specifically mention Eliezer in this post, he simply happens to provide the most salient example, and most points are equally (or in some cases more) relevant to the general case of anyone working on AGI.
† I probably picked those numbers in large part to justify my above "15%", but you get the idea.