I’ve shown that we cannot deduce the preferences of a potentially irrational agent from its behaviour; even simplicity priors don’t help. We need to make extra ‘normative’ assumptions in order to say anything about these preferences.
I then presented a more intuitive example in which Alice was playing poker against Bob. She had two possible beliefs about his hand and two possible preferences: wanting money, or wanting Bob (which, in that situation, translated into wanting to lose to him).
That example illustrated the impossibility result, within the narrow confines of that situation – if Alice calls, she could be a money-maximiser expecting to win, or a love-maximiser expecting to lose.
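To make that concrete, here is a minimal sketch (my own illustration, with made-up payoffs and probabilities, not numbers from the original example) in which both preference–belief pairs recommend calling:

```python
def value_of_calling(preference, p_bob_ahead):
    """Toy expected value of calling: +1 if the preferred outcome happens,
    -1 otherwise (illustrative numbers only)."""
    if preference == "money":   # wants to win the pot
        return (1 - p_bob_ahead) - p_bob_ahead
    if preference == "love":    # wants to lose to Bob
        return p_bob_ahead - (1 - p_bob_ahead)

# A money-maximising Alice who believes she is ahead calls...
assert value_of_calling("money", 0.2) > 0
# ...and so does a love-maximising Alice who believes she is behind.
assert value_of_calling("love", 0.8) > 0
```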
As has been pointed out, this uncertainty doesn’t really persist if we move beyond the initial situation. If Alice were motivated by love or by money, we would expect to be able to tell which by seeing what she does in other situations: how she responds to Bob’s flirtations, what she confesses to her closest friends, how she acts if she catches a glimpse of Bob’s cards, and so on.
So if we look at her more general behaviour, it seems that we have two possible versions of Alice: Alice-money, who clearly wants money, and Alice-love, who clearly wants Bob. The actions of these two agents match up in the specific case I described, but not in general. Doesn’t this undermine my claim that we can’t tell the preferences of an agent from their actions?
What’s actually happening here is that we’re already making a lot of extra assumptions when we interpret Alice-money’s or Alice-love’s actions. We model other humans in very specific and narrow ways; other humans do the same, and their models are very similar to ours (consider how often humans agree that another human is angry, or that being drunk impairs rationality). The agreement isn’t perfect, but it is much better than random.
If we set those assumptions aside, then we can see what the theorem implies. There is a possible agent, call it non-human Alice-love, whose preference is for love but which nevertheless acts identically to Alice-money (and, in reverse, a non-human Alice-money that prefers money while acting identically to Alice-love). These non-human agents are perfectly plausible; they just aren’t ‘human’ according to our models of what being human means.
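Here is a toy sketch of the underlying decomposition (my own illustration, with hypothetical rewards, not taken from the earlier posts): the same behaviour arises from a rational planner with reward R and from an anti-rational planner with reward -R, and only our assumptions about what humans are like rule the second decomposition out.

```python
rewards = {"win_pot": 1.0, "lose_to_bob": -1.0}      # a hypothetical reward R
neg_rewards = {k: -v for k, v in rewards.items()}    # the reward -R

def rational_planner(reward):
    """Pick the action with the highest reward."""
    return max(reward, key=reward.get)

def anti_rational_planner(reward):
    """Pick the action with the lowest reward."""
    return min(reward, key=reward.get)

# Both decompositions produce exactly the same behaviour:
assert rational_planner(rewards) == anti_rational_planner(neg_rewards) == "win_pot"
```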
It’s because of this that I’m somewhat optimistic we can solve the value learning problem, and why I often say the problem is “impossible in theory, but doable in practice”. Humans make a whole host of assumptions that allow them to interpret the preferences of other humans (and of themselves). And these assumptions are quite similar from human to human. So we don’t need to solve the value learning problem in some principled way, nor figure out the necessary assumptions abstractly. Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process (and then resolve all the contradictions within human values, but that seems doable if messy).
I agree that the fact that humans are quite good at inferring preferences should give us optimism about value learning. In the framework of rationality with a mistake model, I interpret this post as trying to infer the mistake model from the way that humans infer preferences about other humans. I'm not sure whether this sidesteps the impossibility result, but it seems plausible that it does.
What would be the source of data for learning a mistake model? It seems like we have to make some assumption about how the data source leads to a mistake model, since the data source is probably going to be a subset of the full human policy, and the impossibility result already applies even when you have access to the full human policy.
In the example at https://www.lesswrong.com/posts/rcXaY3FgoobMkH2jc/figuring-out-what-alice-wants-part-ii , I give two algorithms with the same outputs to which we would nevertheless attribute different preferences. This sidesteps the impossibility result, since it lets us consider extra information, namely the internal structure of the algorithm, in a way relevant to value-computing.
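Roughly, the idea is something like the following sketch (a toy illustration of mine, not the actual algorithms from that post): two functions with identical input–output behaviour, whose internal structure nonetheless represents different goals.

```python
def alice_money(p_bob_ahead):
    """Represents its goal as 'win money' and maximises it directly."""
    value_of_goal = (1 - p_bob_ahead) - p_bob_ahead   # expected money
    return "call" if value_of_goal > 0 else "fold"

def alice_love(p_bob_ahead):
    """Represents its goal as 'lose to Bob', but the flipped comparison
    makes it act exactly like alice_money."""
    value_of_goal = p_bob_ahead - (1 - p_bob_ahead)   # chance of losing to Bob
    return "call" if value_of_goal < 0 else "fold"

# Identical outputs on every input...
for p in (0.1, 0.5, 0.9):
    assert alice_money(p) == alice_love(p)
# ...but different internal structure, which is the extra information
# the linked post appeals to.
```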