In the context of whether the definition of human values can be disentangled from the process of approximating/implementing that definition, David asks me:
- But I think it's reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like "manipulation". So do you disagree?
I think it's a really good question, and its answer is related to a lot of relevant issues, so I put this here as a top-level post. My current feeling is, contrary to my previous intuitions, that things like "manipulation" might not be possible to specify in a way that leads to useful disentanglement.
Why manipulate?
First of all, we should ask why an AI would be tempted to manipulate us in the first place. It may be that it needs us to do something for it to accomplish its goal; in that case it is trying to manipulate our actions. Or maybe its goal includes something that cashes out as our mental states; in that case, it is trying to manipulate our mental states directly.
The problem is that any reasonable friendly AI would have our mental states as part of its goal - it would at least want us to be happy rather than miserable. And (almost) any AI that wasn't perfectly indifferent to our actions would be trying to manipulate us just to get its goals accomplished.
So manipulation is to be expected from most AI designs, friendly or not.
Manipulation versus explanation
Well, since the urge to manipulate is expected to be present, could we just rule it out? The problem is that we need to define the difference between manipulation and explanation.
Suppose I am fully aligned/corrigible/nice or whatever other properties you might desire, and I want to inform you of something important and relevant. In doing so, especially if I am more intelligent than you, I will simplify, I will omit irrelevant details, I will omit arguably relevant details, I will emphasise things that help you get a better understanding of my position, and de-emphasise things that will just confuse you.
And these are exactly the same sorts of behaviours that a smart manipulator would engage in. Nor can we define the difference as whether the AI is truthful or not. We want human understanding of the problem, not truth. It's perfectly possible to manipulate people while telling them nothing but the truth. And if the AI structures the order in which it presents the true facts, it can manipulate people while presenting the whole truth as well as nothing but the truth.
It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle. And even if we do it right, note that we have now motivated the AI to... aim for a particular set of mental states. We are rewarding it for manipulating us. This is contrary to the standard understanding of manipulation, which focuses on the means, not the end result.
Bad behaviour and good values
Does this mean that the situation is completely hopeless? No. There are certain manipulative practices that we might choose to ban. Especially if the AI is limited in capability at some level, this would force it to follow behaviours that are less likely to be manipulative.
Essentially, there is no boundary between manipulation and explanation, but there is a difference between extreme manipulation and explanation, so ruling out the first can help (or maybe not).
The other thing that can be done is to ensure that the AI has values close to ours. The closer the values of the AI are to ours, the less manipulation it will need to use, and the less egregious the manipulation will be. It might be that, between partial value convergence and ruling out specific practices (and maybe some physical constraints), we may be able to get an AI that is very unlikely to manipulate us much.
Incidentally, I feel the same about low-impact approaches. I think the problem in full generality, an AI that is low impact but value-agnostic, is impossible. But if the values of the AI are better aligned with ours, and the AI is more physically constrained, then low impact becomes easier to define.
Hm, I understood the traditional Less Wrong view to be something along the lines of: there is truth about the world, and that truth is independent of your values. Wanting something to be true won't make it so. Whereas I'd expect a postmodernist to say something like: the Christians have their truth, the Buddhists have their truth, and the Atheists have theirs. Whose truth is the "real" truth comes down to the preferences of the individual. Your statement sounds more in line with the postmodernist view than the Less Wrong one.
This matters because if the Less Wrong view of the world is correct, it's more likely that there are clean mathematical algorithms for thinking about and sharing truth that are value-neutral (or at least value-orthogonal, e.g. "aim to share facts that the student will think are maximally interesting or surprising"). Note that this doesn't need to be implemented in such a way that a "fact" which triggers an epileptic fit, and so causes the student to hit the "maximally interesting" button, gets selected for sharing. If I have a rough model of the user's current beliefs and preferences, I could use that to estimate the value of information (VoI) of various bits of information to the user, and use that as my selection criterion. The point is that our objective doesn't need to be defined in terms of "aiming for a particular set of mental states".
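To make the VoI idea a bit more concrete, here is a minimal sketch of what such a selection criterion might look like. Everything in it is an assumption made for illustration: the user model is reduced to a discrete belief over world states plus a utility table over the user's own actions, each candidate fact is modelled as a signal with known likelihoods, and the names `value_of_information` and `select_fact` are hypothetical. It uses the standard decision-theoretic definition of expected value of information, which may or may not be the right formalisation here.

```python
from typing import Dict

Belief = Dict[str, float]                # state -> P(state), the user's current beliefs
Utility = Dict[str, Dict[str, float]]    # action -> state -> the user's own payoff
Signal = Dict[str, Dict[str, float]]     # outcome -> state -> P(outcome | state)


def best_expected_utility(belief: Belief, utility: Utility) -> float:
    """Expected utility of the action the user would pick under this belief."""
    return max(
        sum(belief[s] * payoffs[s] for s in belief)
        for payoffs in utility.values()
    )


def value_of_information(belief: Belief, utility: Utility, signal: Signal) -> float:
    """Expected gain in the user's decision quality from learning the signal."""
    baseline = best_expected_utility(belief, utility)
    expected_after = 0.0
    for likelihood in signal.values():
        # Probability that the user sees this outcome, under their current belief.
        p_outcome = sum(belief[s] * likelihood[s] for s in belief)
        if p_outcome == 0:
            continue  # impossible outcome, contributes nothing
        # Bayesian update of the user's belief on this outcome.
        posterior = {s: belief[s] * likelihood[s] / p_outcome for s in belief}
        expected_after += p_outcome * best_expected_utility(posterior, utility)
    return expected_after - baseline


def select_fact(belief: Belief, utility: Utility, candidates: Dict[str, Signal]) -> str:
    """Pick the candidate fact with the highest estimated VoI for this user."""
    return max(
        candidates,
        key=lambda name: value_of_information(belief, utility, candidates[name]),
    )


# Toy user model (entirely made up): unsure about rain, deciding on an umbrella.
belief = {"rain": 0.5, "dry": 0.5}
utility = {
    "take_umbrella": {"rain": 1.0, "dry": 0.0},
    "leave_umbrella": {"rain": 0.0, "dry": 1.0},
}
candidates = {
    # A perfectly reliable forecast: fully informative about the state.
    "forecast": {
        "says_rain": {"rain": 1.0, "dry": 0.0},
        "says_dry": {"rain": 0.0, "dry": 1.0},
    },
    # A coin flip: true, but irrelevant to the user's decision.
    "coin_flip": {
        "heads": {"rain": 0.5, "dry": 0.5},
        "tails": {"rain": 0.5, "dry": 0.5},
    },
}

chosen = select_fact(belief, utility, candidates)
```

In this toy case the forecast is selected over the coin flip, because only the forecast changes what the user would decide. Note that the score is computed from the user's own beliefs and preferences, not from a target mental state chosen by the sharer, though the sketch says nothing about where a trustworthy user model would come from.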