DanArmak comments on Holden's Objection 1: Friendliness is dangerous - Less Wrong
What you are saying indeed applies only "in cases where this is impossible". I further suggest that these are extremely rare cases when a superhumanly-powerful AI is in charge. If the blue box contains horrible violent death, the AI would build a new (third) box, put a diamond inside, paint it blue, and give it to the person.
If the AI could do this, then this is exactly what the extrapolated values would tell it to do. [Assuming some natural constraints on the original values.]
The actual values would also tell it to do so. This is a case where the two coincide. In most cases they don't.
No, the "actual" values would tell it to give the humans the blue boxes they want, already.
The humans don't value the blue box directly. It's an instrumental value because of what they think is inside. The humans really value (in actual, not extrapolated values) the diamond they think is inside.
That's a problem with your example (of the boxes): the values are instrumental, the boxes are not supposed to be valued in themselves.
ETA: wrong and retracted. See below.
Well, they don't value the diamond, either, on this account.
Perhaps they value the wealth they think they can have if they obtain the diamond, or perhaps they value the things they can buy with that diamond, or perhaps they value something else. It's hard to say, once we give up treating the things we actually observe people trading other things for as the things they value.
You're right and I was wrong on this point. Please see my reply to gRR's sister comment.
Humans don't know which of their values are terminal and which are instrumental, and whether this question even makes sense in general. Their values were created by two separate evolutionary processes. In the boxes example, humans may not know about the diamond. Maybe they value blue boxes because their ancestors could always bring a blue box to a jeweler and exchange it for food, or something.
This is precisely the point of extrapolation - to untangle the values from each other and build a coherent system, if possible.
You're right about this point (and so is TheOtherDave) and I was wrong.
With that, I find myself unsure as to what we agree and disagree on. Back here you said "Well, perhaps yes." I understand that to mean you agree with my point that it's wrong / bad for the AI to promote extrapolated values while the actual values are different and conflicting. (If this is wrong please say so.)
Talking further about "extrapolated" values may be confusing in this context. I think we can taboo that and reach all the same conclusions while only mentioning actual values.
The AI starts out by implementing humans' actual present values. If some values (want blue box) lead to actually-undesired outcomes (blue box really contains death), that is a case of conflicting actual values (want blue box vs. want to not die). The AI obviously needs to be able to manage conflicting actual values, because humans always have them, but that is true regardless of CEV.
Additionally, the AI may foresee that humans are going to change and will in the future hold some other actual values; call these the future-values. This change may be described as "gaining intelligence etc." (as in CEV), or it may be a different sort of change; it doesn't matter for our purposes. Suppose the AI anticipates this change, and has no imperative to prevent it (such as helping humans avoid murderer-Gandhi pills, per present human values), or maybe even has an imperative to assist it (again, according to current human values). Then the AI will want to avoid doing things today that will make its task harder tomorrow, or that will cause future people to regret their past actions: it may find itself striking a balance between present and predicted future human values.
This is, at the very least, dangerous: it means satisfying current human values less than fully, while the AI may be wrong about the future values. Also, the AI's actions unavoidably influence humans, and so probably influence which future values they eventually come to have. My position is that the AI must be guided by the humans' actual present values when choosing to steer human (social) evolution towards or away from possible future values. This has lots of downsides, but what better option is there?
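To make that balancing act concrete, here is a toy sketch in Python. Everything in it (the function names, the numbers, the linear weighting) is my own illustrative assumption, not anything specified by CEV or by this thread:

    # Toy model, illustration only: score an action by the humans' actual
    # present values, tempered by a predicted future value function that
    # the agent may well be wrong about.

    def action_score(action, present_value, predicted_future_value,
                     prediction_confidence):
        """Blend present and predicted-future evaluations of an action.

        prediction_confidence in [0, 1]: how much weight the agent puts
        on its forecast of future values. At 0 it optimizes purely for
        present values; at 1, purely for values nobody holds yet.
        """
        return ((1 - prediction_confidence) * present_value(action)
                + prediction_confidence * predicted_future_value(action))

    # The boxes example: humans now want the blue box; the forecast says
    # they will regret it. With low confidence, present values dominate,
    # yet the safe third box still wins because it satisfies both.
    present_value = {"give_blue_box": 1.0, "build_third_box": 0.9}.get
    predicted_future = {"give_blue_box": -1.0, "build_third_box": 1.0}.get

    for a in ("give_blue_box", "build_third_box"):
        print(a, action_score(a, present_value, predicted_future, 0.3))

The only point of the sketch is that prediction_confidence is exactly the parameter the AI can be wrong about, and nothing in the setup tells us how to choose it.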
In contrast, CEV claims there is some unique "extrapolated" set of future values which is special: stable once reached, universal for all humans, and such that it's Good to steer humanity towards it even when it conflicts with many people's present values. But I haven't seen any arguments that convince me that such "extrapolated" values exist or have any of those qualities (uniqueness, stability, universal compatibility, Goodness).
Do you agree with this summary? Which points do you disagree with me on?
I meant that "it's wrong/bad for the AI to promote extrapolated values while the actual values are different and conflicting" will probably be a part of the extrapolated values, and the AI would act accordingly, if it can.
The problem with the actual present values (besides the fact that we cannot define them yet, any more than we can define their CEV) is that they are certain not to be universal: we can be pretty sure that someone can be found to disagree with any particular proposition. Whereas for CEV, we can at least hope that a unique reflectively-consistent set of values exists. If it does, and we succeed in defining it, then we're home and dry. Meanwhile, we can think of contingency plans for what to do if it doesn't exist or we can't define it, but uncertainty about whether the goal is achievable does not mean that the goal itself is wrong.
It's not merely uncertainty. My estimation is that it's almost certainly not achievable.
Actual goals conflict; why should we expect extrapolated goals to converge? The burden of proof is on you: why do you assign this possibility enough likelihood even to raise it to the level of conscious notice and debate?
It may be true that "a unique reflectively-consistent set of values exists". What I find implausible and unsupported is that (all) humans would evolve towards having that set of values, in a way that can be forecast by "extrapolating" their current values. Even if you showed that humans might evolve towards it (which you haven't), the future isn't set in stone: who says they will evolve towards it, with enough certainty that you're willing to optimize for those future values before we actually hold them?
Well, my own proposed plan is also a contingent modification. The strongest possible claim of CEV can be stated as:
There is a unique X, such that for all living people P, CEV<P> = X.
Assuming there is no such X, there could still be a plausible claim:
Y is not empty, where Y = Intersection{over all living people P} of CEV<P>.
And then the AI would do well to optimize for Y while interfering the least with everything else (whatever exactly that means). This way, whatever "evolving" happens due to the AI's influence is at least agreed upon by everyone('s CEV).
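For illustration, the weak claim can be stated operationally. Here is a toy sketch, under the drastic assumption that each person's CEV can be modeled as a finite set of named values; the sets below are invented for the example:

    # Toy sketch, hypothetical: the weak claim says this intersection is
    # non-empty. Modeling a CEV as a finite set of named values is a
    # huge simplification, made only to state the claim operationally.
    from functools import reduce

    def shared_cev(cevs):
        """Intersect every living person's extrapolated value set."""
        return reduce(set.intersection, cevs)

    cevs = [
        {"no_involuntary_death", "autonomy", "blue_boxes"},
        {"no_involuntary_death", "autonomy", "red_boxes"},
        {"no_involuntary_death", "wealth"},
    ]
    Y = shared_cev(cevs)  # {"no_involuntary_death"}: non-empty here
    # If Y were empty, even this weakened version of CEV would give the
    # AI nothing it could optimize for uncontroversially.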