SUDT: A toy decision theory for updateless anthropics

[-]cousin_it12y60

Can you explain in more detail what you mean by "possible worlds"? I assume that the agent's counterfactual actions don't lead to new possible worlds in your model, e.g. "what would happen if I didn't pay up" isn't a possible world. So you're kinda assuming that all coinflips happen before all actions. But what if Omega decides to flip a coin based on the agent's action, or something like that?

ETA: Would a single-player extensive-form game (with incomplete information and imperfect information/recall) be a good model of SUDT?

[-]cousin_it12y00

After chatting with Benja about my comment and thinking some more, I wrote a reply to this post.

[-]Chris_Leong7y20

Re: Eliezer's Anthropic Mugging - I disagree that the updateful answer is H. We need to take into account that you are only 1/1000 responsible for the decision, instead of pretending that you somehow make the decision by yourself.

[-][anonymous]12y00

After chatting with Benja about my earlier comment and thinking some more, I wrote a reply to this post.

[This comment is no longer endorsed by its author]

[-]drnickbone12y00

This is a good introduction; however, by representing the outcomes as just "+" and "-" you greatly simplify the range of possible utility functions, and so force SUDT to make some controversial decisions (basically to accept the counterfactual mugging). The key issue is that your decider can give no special preference to good or bad outcomes in his own world (a world the decider knows he occupies) versus other worlds (ones which the decider knows he doesn't occupy).

Suppose instead that the decider has an outcome space with four outcomes "+Me", "-Me", "+NotMe", "-NotMe". Here, "+Me" represents a good singularity which the decider himself will get to enjoy, as opposed to "-Me" which is a bad singularity (such as an unfriendly AI which tortures the decider for the next billion years). The outcomes "+NotMe" and "-NotMe" also represent positive and negative singularities, but in worlds which the decider himself doesn't inhabit. Assume that u(+Me) > u(+NotMe) > u(-Me), and also that u(+NotMe) = u(-NotMe), because the decider doesn't care about worlds that he doesn't belong to (from the point of view of his decisions, it's exactly like they don't exist).

Then, in the counterfactual mugging, when approached by Omega, the decider knows he is in a world where the coin has fallen Heads, so he picks the policy which maximizes utility for such worlds: in short he chooses "H" rather than "T". This increases the probability of -NotMe as opposed to +NotMe, but as we've seen, the decider doesn't care about that.
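To make the dependence on the utility function concrete, here is a minimal sketch of the prior-weighted policy comparison. It assumes the decider is only approached in the heads world (so heads outcomes are tagged "Me" and tails outcomes "NotMe") and uses the 1/1000 coin from the post. The names `OUTCOMES`, `expected_utility`, `best_policy` and the numeric utilities are hypothetical; only the ordering of the utilities matters.

```python
from fractions import Fraction

P_HEADS = Fraction(1, 1000)  # coin bias taken from the post's scenario

# Outcome in each world under each policy. Assumption: the decider is
# approached only in the heads world, so heads is "Me" and tails is "NotMe".
OUTCOMES = {
    "H": {"heads": "+Me", "tails": "-NotMe"},
    "T": {"heads": "-Me", "tails": "+NotMe"},
}


def expected_utility(policy, utility):
    """Prior-weighted (updateless) expected utility of a policy."""
    worlds = OUTCOMES[policy]
    return (P_HEADS * utility[worlds["heads"]]
            + (1 - P_HEADS) * utility[worlds["tails"]])


def best_policy(utility):
    """Return the policy with the highest prior-weighted expected utility."""
    return max(OUTCOMES, key=lambda policy: expected_utility(policy, utility))


# Hypothetical numbers. `selfish` satisfies u(+Me) > u(+NotMe) > u(-Me)
# and u(+NotMe) = u(-NotMe); `selfless` ignores the Me/NotMe distinction.
selfish = {"+Me": 1, "-Me": -1, "+NotMe": 0, "-NotMe": 0}
selfless = {"+Me": 1, "-Me": 0, "+NotMe": 1, "-NotMe": 0}

print(best_policy(selfish), best_policy(selfless))
```

With the selfish utilities the tails-world term is the same constant for both policies, so only the heads world affects the comparison; with the selfless utilities the 999/1000 tails world dominates.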

Here's a possible objection: By selecting "H", the decider is condemning lots of other versions or analogues of himself (in other possible worlds where Omega didn't approach him), and his utility function might care about this. On the other hand, he might also reason like this "Analogues of me still aren't me: I still care much more about whether I get tortured than whether all those analogues do. I still pick H".

In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. Relative to one utility function it looks quite reasonable to accept the mugging, whereas relative to another utility function it is reasonable to reject it. Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).

[-]Benya12y10

In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. [...] Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).

I'll first give an answer to a relatively literal reading of your comment, and then one to what IMO you are "really" getting at.

Answer to a literal reading: I believe that what you value is part of the problem definition, it's not the decision theory's job to constrain that. For example, if you prefer DOOM to FOOM, (S)UDT doesn't say that your utilities are wrong, it just says you should choose (H). And if we postulate that someone doesn't care whether there's a positive intelligence explosion if they don't get to take part in it (not counting near-copies), then they should choose (H) as well.

But I disagree that this means that (S)UDT doesn't solve the counterfactual mugging. It's not like the copy-selfless utility function I discuss in the post automatically makes clear whether we should choose (H) or (T): If we went with the usual intuition that you should update on your evidence and then use the resulting probabilities in your expected utility calculation, then even if you are completely selfless, you will choose (H) in order to do the best for the world. But (S)UDT says that if you have these utilities, you should choose (T). So it would seem that the version of the counterfactual mugging discussed in the post exhibits the problem, and (S)UDT comes down squarely on the side of one of the potential solutions.
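The contrast between the two procedures can be sketched in a few lines. This is only an illustration under stated assumptions: a completely selfless agent (utility 1 for a positive intelligence explosion, 0 for humanity being wiped out), the 1/1000 coin, and the reading that being asked by Omega is evidence for heads. The helper names `choose` and `update_on` are hypothetical.

```python
from fractions import Fraction

PRIOR = {"heads": Fraction(1, 1000), "tails": Fraction(999, 1000)}

# Selfless utilities per (policy, world):
# 1 = positive intelligence explosion, 0 = humanity wiped out.
UTILITY = {
    ("H", "heads"): 1, ("H", "tails"): 0,
    ("T", "heads"): 0, ("T", "tails"): 1,
}


def choose(probabilities):
    """Pick the policy maximizing expected utility under the given weights."""
    def expected_utility(policy):
        return sum(p * UTILITY[(policy, world)]
                   for world, p in probabilities.items())
    return max(("H", "T"), key=expected_utility)


def update_on(prior, observed_world):
    """Condition on an observation that is only consistent with one world."""
    return {world: Fraction(int(world == observed_world)) for world in prior}


# Updateful: condition on heads first, then maximize expected utility.
updateful_choice = choose(update_on(PRIOR, "heads"))
# Updateless: maximize expected utility under the prior.
updateless_choice = choose(PRIOR)

print(updateful_choice, updateless_choice)
```

The utilities are identical in both calls; the only thing that differs is whether the probabilities fed into the expected utility calculation are the prior or the posterior.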

Answer to the "real" point: But of course, what I read you as "really" saying is that we could re-interpret our intuition that we should use updated probabilities as meaning that our actual utility function is not the one we would write down naively, but a version where the utilities of all outcomes in which the observer-moment making the decision isn't consciously experienced are replaced by a constant. In the case of the counterfactual mugging, this transformation gives exactly the same result as if we had updated our probabilities. So in a sense, when I say that SUDT comes down on the side of one of the solutions, I am implicitly using a rule for how to go from "naive" utilities to utilities-to-use-in-SUDT: namely, the rule "just use the naive utilities". And when I use my arguments about l-zombies to argue that choosing (T) is the right solution to the counterfactual mugging, I need to argue why this rule is correct.

In terms of clarity of meaning, I have to say that I don't feel too bad about not spelling out that the utility function is just what you would normally call your utility function, but in terms of the strength of my arguments, I agree that the possibility of re-interpreting updating in terms of utility functions is something that needs to be addressed for my argument from l-zombies to be compelling. It just happens to be one of the many things I haven't managed to address in my updateless anthropics posts so far.

In brief, my reasons are twofold: First, I've asked myself, suppose that it actually were the case that I were an l-zombie, but could influence what happens in the real world; what would my actual values be then? And the answer is, I definitely don't completely stop caring. And second, there's the part where this transformation doesn't just give back exactly what you would have gotten if you updated in all anthropic problems, which makes the case for it suspect. The situations I have in mind are when your decision determines whether you are a conscious observer: In this case, how you decide depends on the utility you assign to outcomes in which you don't exist, something that doesn't have any interpretation in terms of updating. If the only reason I adopt these utilities is to somehow implement my intuitions about updating, it seems very odd to suddenly have this new number influencing my decisions.

[-]Wei Dai12y50

I brought up some related points in http://lesswrong.com/lw/8gk/where_do_selfish_values_come_from/. At this point, I'm not totally sure that UDT solves counterfactual mugging correctly. The problem I see is that UDT is incompatible with selfishness. For example if you make a copy of a UDT agent, then both copy 1 and copy 2 will care equally about copy 1 relative to copy 2, but if you make a copy of a typical selfish human, each copy will care more about itself than the other copy. This kind of selfishness seems strongly related to intuitions for picking (H) over (T). Until we fully understand whether selfishness is right or wrong, and how it ought to be implemented or fixed (e.g., do we encode our current degrees of caring into a UDT utility function, or rewind our values to some past state, or use some other decision theory that has a concept of "self"?), it's hard to argue that UDT must be correct, especially in its handling of counterfactual mugging.

[-]cousin_it12y20

If selfishness is reflectively inconsistent, and an AI can self-modify, then I don't see how an AI can stay selfish. Do you have any ideas?

[-]Wei Dai12y40

Why would an AI want to self-modify away from selfishness? Because future copies of itself can't cooperate fully if it remains selfish? That may not be the case if we solve the problem of cooperation between agents with conflicting preferences. Alternatively, an AI may not want to self-modify for "acausal" reasons (for example, it's worried about itself not existing if it decided to prevent future selfish versions of itself from existing), or for ethical reasons (it values being selfish, or values the existence of selfish agents in the world).

[-]Eliezer Yudkowsky12y40

How is it coherent for an agent at time T1 to 'want' copy A at T2 to care only about A and copy B at T2 to care only about B? There's no non-meta way to express this - you would have to care more strongly about agents having a certain exact decision function than about all object-level entities at stake. When it comes to object-level things, whatever the agent at T1 coherently cares about, it will want A and B to care about.

[-]drnickbone12y20

It strikes me that a persistently selfish agent may be somewhat altruistic towards its future selves. The agent might want its future versions to be free to follow their own selfish preferences, rather than binding them to its current selfish preferences.

Another alternative is that the agent is not only selfish but lazy... it could self-modify to bind its future selves, but that takes effort, and it can't be bothered.

Either way, it's going to take a weird sort of utility function to reproduce human selfishness in an AI.

[-]cousin_it12y20

Now that I think of it, caring about making more copies of yourself might be more fundamental than caring about object-level things in the world... I wonder what kind of math could be used to model this.

[-]drnickbone12y00

Thank you for a very comprehensive reply.

In terms of clarity of meaning, I have to say that I don't feel too bad about not spelling out that the utility function is just what you would normally call your utility function

That's fine. However, normal utility functions do have self-interested components, as well as parochial components (caring about people and things that are "close" to us in various ways, above those which are more "distant"). It's also true that utilities are not totally determined by such components, and include some general pro bono terms; further that we think in some sense that utilities ought to be disinterested rather than selfish or parochial. Hence my thought that SUDT could be strengthened by barring selfish or parochial terms, or imposing some sort of veil of ignorance so that only terms like u(+NotMe) and u(-NotMe) affect decisions.

Allowing for self-interest, in the counterfactual mugging scenario we most likely have u(+Me) >> u(+NotMe) > u(-NotMe) >> u(-Me), rather than u(+NotMe) = u(-NotMe). The decider will still be inclined to pick "H" (matching our initial intuition), but with some hesitation, particularly if Omega's coin was very heavily weighted to tails in the first place. The internal dialogue in that case will go something like this: "Hmm, it was so very unlikely that the coin fell heads - I can't believe that happened! Hmm, perhaps it didn't, and I'm in some sort of Omega-simulation. For the good of the world outside my simulation, I'd better pick T after all". That's roughly where I am with my own reaction to Counterfactual Mugging right now.
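One way to see where the hesitation kicks in is to read the choice as a prior-weighted expected utility comparison (an assumption on my part, since the dialogue above appeals to simulation rather than to this calculation). With p the chance of heads, EU(H) = p·u(+Me) + (1-p)·u(-NotMe) and EU(T) = p·u(-Me) + (1-p)·u(+NotMe), so "H" wins exactly when p exceeds a threshold set by the two stakes. The function names and the numeric utilities below are hypothetical.

```python
from fractions import Fraction


def heads_threshold(u):
    """Chance of heads above which (H) beats (T) in prior-weighted EU.

    Derived from p * (u[+Me] - u[-Me]) > (1 - p) * (u[+NotMe] - u[-NotMe]).
    """
    stake_me = u["+Me"] - u["-Me"]
    stake_others = u["+NotMe"] - u["-NotMe"]
    return Fraction(stake_others, stake_me + stake_others)


def prefers_heads(p_heads, u):
    """True if (H) has strictly higher prior-weighted expected utility."""
    return p_heads > heads_threshold(u)


# Hypothetical utilities with u(+Me) >> u(+NotMe) > u(-NotMe) >> u(-Me).
u = {"+Me": 1000, "+NotMe": 1, "-NotMe": 0, "-Me": -1000}

print(heads_threshold(u))
print(prefers_heads(Fraction(1, 1000), u))
print(prefers_heads(Fraction(1, 10000), u))
```

With these made-up numbers, a 1/1000 coin still favors "H", while a coin ten times more heavily weighted to tails tips the comparison to "T".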

Against a background of modal realism or a many-worlds-interpretation (which in my opinion is where UDT makes most sense), caring only about the good of "our" world looks like a sort of parochialism, which is why Counterfactual Mugging is interesting. Suddenly it seems to matter whether these other worlds exist or not, rather than just being a philosophical curiosity.

[-]Squark12y00

updating definitely is the right thing to do epistemically, but plugging these probabilities into the expected utility calculation gives the wrong result

Why do you think epistemic questions are meaningful? This way of thinking gets you in trouble with the Anthropic Trilemma, and I don't think there is a good way out of there.

	Coin falls heads (chance = 1/1000)	Coin falls tails (chance = 999/1000)
You choose (H) if coin falls heads	Positive intelligence explosion	Humanity wiped out
You choose (T) if coin falls heads	Humanity wiped out	Positive intelligence explosion

	Coin falls heads (chance = 1/1000)	Coin falls tails (chance = 999/1000)
Everyone chooses (H) if asked	Positive intelligence explosion	Humanity wiped out
Everyone chooses (T) if asked	Humanity wiped out	Positive intelligence explosion
Different people choose differently	Humanity wiped out	(Depends on who is asked)

	Coin falls heads (chance = 1/1000)	Coin falls tails (chance = 999/1000)
Green roomers choose (H)	Positive intelligence explosion	Humanity wiped out
Green roomers choose (T)	Humanity wiped out	Positive intelligence explosion
