gRR comments on Holden's Objection 1: Friendliness is dangerous - Less Wrong
Humans don't know which of their values are terminal and which are instrumental, and whether this question even makes sense in general. Their values were created by two separate evolutionary processes. In the boxes example, humans may not know about the diamond. Maybe they value blue boxes because their ancestors could always bring a blue box to a jeweler and exchange it for food, or something.
This is precisely the point of extrapolation - to untangle the values from each other and build a coherent system, if possible.
You're right about this point (and so is TheOtherDave) and I was wrong.
With that, I find myself unsure as to what we agree and disagree on. Back here you said "Well, perhaps yes." I understand that to mean you agree with my point that it's wrong / bad for the AI to promote extrapolated values while the actual values are different and conflicting. (If this is wrong please say so.)
Talking further about "extrapolated" values may be confusing in this context. I think we can taboo that and reach all the same conclusions while only mentioning actual values.
The AI starts out by implementing humans' actual present values. If some values (want blue box) lead to actually-undesired outcomes (blue box really contains death), that is a case of conflicting actual values (want blue box vs. want to not die). The AI obviously needs to be able to manage conflicting actual values, because humans always have them, but that is true regardless of CEV.
Additionally, the AI may foresee that humans are going to change and in the future have some other actual values; call these the future-values. This change may be described as "gaining intelligence etc." (as in CEV) or it may be a different sort of change - it doesn't matter for our purposes. Suppose the AI anticipates this change, and has no imperative to prevent it (such as helping humans avoid murderer-Gandhi pills due to present human values), or maybe even has an imperative to assist this change (again, according to current human values). Then the AI will want to avoid doing things today which will make its task harder tomorrow, or which will cause future people to regret their past actions: it may find itself striking a balance between present and future (predicted) human values.
This is, at the very least, dangerous - because it involves satisfying current human values not as fully as possible, while the AI may be wrong about future values. Also, the AI's actions unavoidably influence humans and so probably influence which future values they eventually have. My position is that the AI must be guided by the humans' actual present values in choosing to steer human (social) evolution towards or away from possible future values. This has lots of downsides, but what better option is there?
In contrast, CEV claims there is some unique "extrapolated" set of future values which is special, stable once reached, universal for all humans, and that it's Good to steer humanity towards it even if it conflicts with many people's present values. But I haven't seen any arguments I find convincing that such "extrapolated" values exist and have any of those qualities (uniqueness, stability, universal compatibility, Goodness).
Do you agree with this summary? Which points do you disagree with me on?
I meant that "it's wrong/bad for the AI to promote extrapolated values while the actual values are different and conflicting" will probably be a part of the extrapolated values, and the AI would act accordingly, if it can.
The problem with the actual present values (besides the fact that we cannot define them yet, any more than we can define their CEV) is that they are certain not to be universal. We can be pretty sure that someone can be found to disagree with any particular proposition. Whereas, for CEV, we can at least hope that a unique reflectively-consistent set of values exists. If it does and we succeed in defining it, then we're home and dry. Meanwhile, we can think of contingency plans about what to do if it does not or we don't, but the uncertainty about whether the goal is achievable does not mean that the goal itself is wrong.
It's not merely uncertainty. My estimation is that it's almost certainly not achievable.
Actual goals conflict; why should we expect goals to converge? The burden of proof is on you: why do you assign this possibility sufficient likelihood to even raise it to the level of conscious notice and debate?
It may be true that "a unique reflectively-consistent set of values exists". What I find implausible and unsupported is that (all) humans will evolve towards having that set of values, in a way that can be forecast by "extrapolating" their current values. Even if you showed that humans might evolve towards it (which you haven't), the future isn't set in stone - who says they will evolve towards it, with sufficient certitude that you're willing to optimize for those future values before we actually have them?
Well, my own proposed plan is also a contingent modification. The strongest possible claim of CEV can be said to be:
There is a unique X, such that for all living people P, CEV<P> = X.
Assuming there is no such X, there could still be a plausible claim:
Y is not empty, where Y = Intersection{over all living people P} of CEV<P>.
And then the AI would do well to optimize for Y while interfering as little as possible with other things (whatever this means). This way, whatever "evolving" happens due to the AI's influence is at least agreed upon by everyone('s CEV).
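As a rough illustration of the two claims above, here is a minimal sketch that treats each person's CEV as a plain set of value labels. That representation, the toy data, and the function names are all assumptions made for the example; nothing in the CEV proposal says volitions reduce to sets like this.

```python
from functools import reduce

# Hypothetical toy data: each person's extrapolated volition as a set of labels.
cev = {
    "alice": {"avoid_extinction", "more_art", "no_coercion"},
    "bob": {"avoid_extinction", "no_coercion", "more_sport"},
    "carol": {"avoid_extinction", "more_art"},
}


def strong_claim_holds(cev_by_person):
    """Strong claim: a single X exists with CEV<P> == X for every person P."""
    volitions = list(cev_by_person.values())
    if not volitions:
        return True  # vacuously true when there are no people
    return all(v == volitions[0] for v in volitions)


def shared_core(cev_by_person):
    """Weak claim: Y = intersection over all people P of CEV<P>."""
    volitions = list(cev_by_person.values())
    if not volitions:
        return set()
    return reduce(set.intersection, volitions)


print(strong_claim_holds(cev))  # the three sets differ, so the strong claim fails
print(shared_core(cev))         # only the label present in every set survives
```

In this toy case the strong claim fails while Y still contains one element. Every additional person can only shrink or preserve Y, never enlarge it.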
I can buy, tentatively, that most people might one day agree on a very few things. If that's what you mean by Y, fine, but it restricts the FAI to doing almost nothing. I'd much rather build a FAI that implemented more values shared by fewer people (as long as those people include myself). I expect so would most people, including the ones hypothetically building the FAI - otherwise they'd expect not to benefit much from building it, since it would find very little consensus to implement! So the first team to successfully build FAI+CEV will choose to launch it as a CEV<themselves> rather than CEV<humanity>.
This is fine, because CEV of any subset of the population is very likely to include terms for CEV of humanity as a whole.
Why do you believe this?
For instance, I think CEV<humanity>, if it even exists, will include nothing of real interest because people just wouldn't agree on common goals. In such a situation, my personal CEV - or that of a few people who do agree on at least some things - would not want to include CEV<humanity>. So your belief implies that CEV<humanity> exists and is nontrivial. As I've asked before in this thread, why do you think so?
Oh, I had some evidence, but I Minimum Viable Commented. I thought it was obvious once pointed out. Illusion of transparency.
We care about what happens to humanity. We want things to go well for us. If CEV works at all, it will capture that in some way.
Even if CEV(rest of humanity) turns out to be mostly derived from radical islam, I think there would be terms in CEV(Lesswrong) for respecting that. There would also be terms for people not stoning each other to death and such. I think those (respect for CEV and good life by our standards) would only come into conflict when CEV has basically failed.
You seem to be claiming that CEV will in fact fail, which I think is a different issue. My claim is that if CEV is a useful thing, you don't have to run it on everyone (or even a representative sample) to make it work.
It depends on what you call CEV "working" or "failing".
One strategy (which seems to me to be implied by the original CEV doc) is to extrapolate everyone's personal volition, then compare and merge them to create the group's overall CEV. Where enough people agree, choose what they agree on (factoring in how sure they are, and how important this is to them). Where too many people disagree, do nothing, or be indifferent on the outcome of this question, or ask the programmers. Is this what you have in mind?
The big issue here is how much consensus is enough. Let's run with concrete examples:
It all depends on how you define required consensus - and that definition can't itself come from CEV, because it's required for the first iteration of CEV to run. It could be allowed to evolve via CEV, but you still need to start somewhere and such evolution strikes me as dangerous - if you precommit to CEV and then it evolves into "too little" or "too much" consensus and ends up doing nothing or prohibiting nothing, the whole CEV project fails. Which may well be a worse outcome from our perspective than starting with (or hardcoding) a different, less "correct" consensus requirement.
So the matter is not just what each person or group's CEV is, but how you combine them via consensus. If, as you suggest, we use the CEV of a small homogeneous group instead of all of humanity, it seems clear to me that the consensus would be greater (all else being equal), and so the requirements for consensus are more likely to be satisfied, and so CEV will have a higher chance of working.
Contrariwise, if we use the CEV of all humanity, it will have a term derived from me and you for not stoning people. And it will also have a term derived from some radical Islamists for stoning people. And it will have to resolve the contradiction, and if there's not enough consensus among humanity's individual CEVs to do so, the CEV algorithm will "fail".
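The merge-by-consensus strategy described above, and its sensitivity to the required level of consensus, can be sketched roughly as follows. This is only a toy under stated assumptions: each person's volition is a hypothetical mapping from issue to position, the threshold is a fixed fraction chosen by hand, and the weighting by certainty and importance mentioned above is left out.

```python
from collections import Counter, defaultdict


def merge_volitions(volitions, threshold):
    """Return {issue: position} for issues where one position reaches consensus.

    `volitions` is a list of dicts mapping issue -> position (a hypothetical
    encoding). `threshold` is the required fraction of all people; values
    above 0.5 guarantee that at most one position per issue can qualify.
    Issues without enough agreement are omitted, i.e. the AI does nothing.
    """
    tallies = defaultdict(Counter)
    for person in volitions:
        for issue, position in person.items():
            tallies[issue][position] += 1

    merged = {}
    for issue, counter in tallies.items():
        position, votes = counter.most_common(1)[0]
        if votes >= threshold * len(volitions):
            merged[issue] = position
    return merged


# Hypothetical populations: a small like-minded group, and a larger mixed one.
small_group = [{"stoning": "forbid", "x_risk": "prevent"} for _ in range(3)]
humanity = small_group + [{"stoning": "permit", "x_risk": "prevent"} for _ in range(2)]

print(merge_volitions(small_group, 0.9))  # both issues reach consensus
print(merge_volitions(humanity, 0.9))     # the contested issue drops out
```

The threshold has to be supplied from outside the function, which mirrors the point that the consensus requirement cannot itself come from the first run of CEV.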
I would be fine with FAI removing existential risks and not doing any other thing until everybody('s CEV) agrees on it. (I assume here that removing existential risks is one such thing.) And an FAI team that credibly precommitted to implementing CEV<humanity> instead of CEV<themselves> would probably get more resources and would finish first.
So what makes you think everybody's CEV would eventually agree on anything more?
A FAI that never does anything except prevent existential risk - which, in a narrow interpretation, means it doesn't stop half of humanity from murdering the other half - isn't a future worth fighting for IMO. We can do so much better. (At least, we can if we're speculating about building a FAI to execute any well-defined plan we can come up with.)
I'm not even sure of that. There are people who believe religiously that End Times must come when everyone must die, and some of them want to hurry that along by actually killing people. And the meaning of "existential risk" is up for grabs anyway - does it preclude evolution into non-humans, leaving no members of original human species in existence? Does it preclude the death of everyone alive today, if some humans are always alive?
Sure, it's unlikely or it might look like a contrived example to you. But are you really willing to precommit the future light cone, the single shot at creating an FAI (singleton), to whatever CEV might happen to be, without actually knowing what CEV produces and having an abort switch? That's one of the defining points of CEV: that you can't know it correctly in advance, or you would just program it directly as a set of goals instead of building a CEV-calculating machine.
This seems wrong. A FAI team that precommitted to implementing CEV<its funders> would definitely get the most funds. Even a team that precommitted to CEV<the team itself> might get more funds than CEV<humanity>, because people like myself would reason that the team's values are closer to my own than humanity's average, plus they have a better chance of actually agreeing on more things.
No one said you have to stop with that first FAI. You can try building another. The first FAI won't oppose it (non-interference). Or, better yet, you can try talking to the other half of the humans.
Yes, but we assume they are factually wrong, and so their CEV would fix this.
Not bloody likely. I'm going to oppose your team, discourage your funders, and bomb your headquarters - because we have different moral opinions, right here, and if the differences turn out to be fundamental, and you build your FAI, then parts of my value will be forever unfulfilled.
You, on the other hand, may safely support my team, because you can be sure to like whatever my FAI will do, and regarding the rest, it won't interfere.
No. Any FAI (ETA: or other AGI) has to be a singleton to last for long. Otherwise I can build a uFAI that might replace it.
Suppose your AI only does a few things that everyone agrees on, but otherwise "doesn't interfere". Then I can build another AI, which implements values people don't agree on. Your AI must either interfere, or be resigned to not being very relevant in determining the future.
Will it only interfere if a consensus of humanity allows it to do so? Will it not stop a majority from murdering a minority? Then it's at best a nice-to-have, but most likely useless. After people successfully build one AGI, they will quickly reuse the knowledge to build more. The first AGI that does not favor inaction will become a singleton, destroying the other AIs and preventing future new AIs, to safeguard its utility function. This is unavoidable. With truly powerful AGI, preventing new AIs from gaining power is the only stable solution.
Yeah, that's worked really well for all of human history so far.
First, they may not be factually wrong about the events they predict in the real world - like everyone dying - just wrong about the supernatural parts. (Especially if they're themselves working to bring these events to pass.) IOW, this may not be a factual belief to be corrected, but a desired-by-them future that others like me and you would wish to prevent.
Second, you agreed the CEV of groups of people may contain very few things that they really agree on, so you can't even assume they'll have a nontrivial CEV at all, let alone that it will "fix" values you happen to disagree with.
I have no idea what your FAI will do, because even if you make no mistakes in building it, you yourself don't know ahead of time what the CEV will work out to. If you did, you'd just plug those values into the AI directly instead of calculating the CEV. So I'll want to bomb you anyway, if that increases my chances of being the first to build a FAI. Our morals are indeed different, and since there are no objectively distinguished morals, the difference goes both ways.
Of course I will dedicate my resources to first bombing people who are building even more inimical AIs. But if I somehow knew you and I were the only ones in the race, I'd politely ask you to join me or desist or be stopped by force.
As long as we're discussing bombing, consider that the SIAI isn't and won't be in a position to bomb anyone. OTOH, if and when nation-states and militaries realize AGI is a real-world threat, they will go to war, each trying to prevent anyone else from building an AGI first. It's the ultimate winner-take-all arms race.
This is going to happen, it might already be happening if enough politicians and generals shared Eliezer's beliefs about AGI, and it will happen (or not) regardless of anyone's attempts to build any kind of Friendliness theory. Furthermore, a state military planning to build an AGI singleton won't stop to think for long before wiping your civilian, unprotected FAI theory research center off the map. Either you go underground or you cooperate with a powerful player (the state on whose territory you live, presumably). Or maybe states and militaries won't wise up in time, and some private concern really will build the first AGI. Which may be better or worse depending on what they build.
Eventually, unless the whole world is bombed back into pre-computer-age tech, someone very probably will build an AGI of some kind. The SIAI idea is (possibly) to invent Friendliness theory and publish it widely, so that whoever builds that AGI, if they want it to be Friendly (at least to themselves!), they will have a relatively cheap and safe implementation to use. But for someone actually trying to build an AGI, two obvious rules are:
I want to point out that all of my objections are acknowledged (not dismissed, and not fully resolved) in the actual CEV document - which is very likely hopelessly outdated by now to Eliezer and the SIAI, but they deliberately don't publish anything newer (and I can guess at some of the reasons).
Which is why when I see people advocating CEV without understanding the dangers, I try to correct them.