Matt_Simpson comments on AI indifference through utility manipulation - Less Wrong
I still don't understand.
So I am choosing whether I want to play the lottery and commit in advance to killing myself if I lose. If I don't play the lottery, my utility is 0. If I play, I lose with probability 99%, receiving utility -1, and win with probability 1%, receiving utility 99.
The policy of not playing the lottery has expected utility 0, and so does playing the lottery without the commitment to suicide. But the policy of playing the lottery and committing to suicide if I lose has utility 99 in the worlds where I win the lottery and survive - that is, the expected utility over histories consistent with my observations so far in which I survive is 99. After applying indifference, I have expected utility 99 regardless of the lottery outcome, so this policy wins over not playing the lottery.
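To make the arithmetic explicit, here is a minimal sketch in Python. It is my own illustration, not something from the post, and it models "indifference" simply as evaluating utility conditional on survival; that modeling choice is an assumption.

```python
# Expected-utility arithmetic for the three policies described above.
# Numbers follow the comment: abstaining is worth 0, losing is worth -1,
# winning is worth 99, and the win probability is 1%.

P_WIN = 0.01
U_WIN, U_LOSE, U_ABSTAIN = 99.0, -1.0, 0.0

# Policy 1: don't play the lottery.
eu_abstain = U_ABSTAIN                           # 0.0

# Policy 2: play, with no suicide commitment.
eu_play = P_WIN * U_WIN + (1 - P_WIN) * U_LOSE   # 0.99 - 0.99 = 0.0

# Policy 3: play and commit to suicide on a loss.  Modeling indifference as
# conditioning on survival (which here coincides with winning), the policy
# is scored at 99 regardless of the actual lottery outcome.
eu_commit = U_WIN                                # 99.0

print(eu_abstain, eu_play, eu_commit)
```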
Having lost the lottery, I no longer have any incentive to kill myself. But I've also already either self-modified to change this or committed to dying (say by making some strategic play which causes me to get killed by humans unless I acquire a bunch of resources), so it doesn't matter what indifference would tell me to do after I've lost the lottery.
Can you explain what is wrong with this analysis, or give some more compelling argument that committing to killing myself isn't a good strategy in general?
Are you indifferent between (dying) and (not dying) or are you indifferent between (dying) and (winning the lottery and not dying)?
It should be clear that I can engineer a situation where survival <==> winning the lottery, in which case I am indifferent between (winning the lottery and not dying) and (not dying) because they occur in approximately the same set of possible worlds. So I'm simultaneously indifferent between (dying) and (not dying) and between (dying) and (winning the lottery and not dying).
That doesn't follow unless, in the beginning, you were already indifferent between (winning the lottery and not dying) and (dying). Remember, a utility function is a map from all possible states of the world to the real line (ignore what economists do with utility functions for the moment). (being alive) is one possible state of the world. (being alive and having won the lottery) is not the same state of the world. In more detail, assign arbitrary utility numbers (arbitrary aside from their ranking) - suppose U(being alive) = U(being dead) = 0, and suppose U(being alive and having won the lottery) = 1.
Now you engineer the situation such that survival <==> having won the lottery. It is still the case that U(survival) = 0. Your utility function doesn't change just because some aspect of reality changes - if you evaluate the utility of a given situation at time t, you should get the same answer at time t+1. It's still the same map from states of the world to the real line. A terse way of saying this: when you ask the question doesn't change the answer. But if we update on the fact that survival => having won the lottery, then the quantity we should really be asking about is U(being alive and having won the lottery), which we know to be 1, and that is not the same as U(dying).
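A small sketch of this fixed-map view, using the numbers above. The state encoding is my own illustration; only the utility values (0, 0, 1) come from the comment.

```python
# The utility function is a fixed map from (fully specified) world states to
# reals.  Engineering "survival <=> winning" changes which states are
# reachable, not the map itself.

U = {
    ("alive", "no_win"): 0.0,   # U(being alive) = 0
    ("alive", "won"):    1.0,   # U(being alive and having won the lottery) = 1
    ("dead",  "no_win"): 0.0,   # U(being dead) = 0
    ("dead",  "won"):    0.0,
}

# The map U never changes when reality changes: evaluating the same state at
# time t and at time t+1 gives the same number.

# After engineering survival <=> having won the lottery, the only live state
# still on the table is ("alive", "won"), so the comparison that matters is:
print(U[("alive", "won")])    # 1.0
print(U[("dead", "no_win")])  # 0.0  -> not actually indifferent between them
```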