One way (the usual way?) to think of an agent running Updateless Decision Theory is to imagine that the agent always cares about all possible worlds according to how probable those worlds seemed to the agent's builders when they wrote the agent's source code[1].  In particular, the agent never develops any additional concern for whatever turns out to be the actual world[2].  This is what puts the "U" in "UDT".

I suggest an alternative conception of a UDT agent, without changing the UDT formalism. According to this view, the agent cares about only the actual world.  In fact, at any time, the agent cares about only one small facet of the actual world — namely, whether the agent's act at that time maximizes a certain fixed act-evaluating function.  In effect, a UDT agent is the ultimate deontologist:  It doesn't care at all about the actual consequences that result from its action.  One implication of this conception is that a UDT agent cannot be truly counterfactually mugged.

[ETA: For completeness, I give a description of UDT here (pdf).]

Vladimir Nesov's Counterfactual Mugging presents us with the following scenario:

Imagine that one day, Omega comes to you and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don't want to give up your $100. But see, the Omega tells you that if the coin came up heads instead of tails, it'd give you $10000, but only if you'd agree to give it $100 if the coin came up tails.

Omega can predict your decision in case it asked you to give it $100, even if that hasn't actually happened, it can compute the counterfactual truth. The Omega is also known to be absolutely honest and trustworthy, no word-twisting, so the facts are really as it says, it really tossed a coin and really would've given you $10000.

An agent following UDT will give the $100.  Imagine that we were building an agent, and that we will receive whatever utility follows from the agent's actions.  Then it's easy to see why we should build our agent to give Omega the money in this scenario.  After all, at the time we build our agent, we know that Omega might one day flip a fair coin with the intentions Nesov describes.  Whatever probability this has of happening, our expected earnings are greater if we program our agent to give Omega the $100 on tails.
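To make the builder's arithmetic concrete, here is a minimal sketch in Python of the two policies the builder could install, scored with the probabilities known at build time. The fair coin and the $100/$10000 stakes come from Nesov's scenario; the function and variable names are my own.

```python
# Builder's expected earnings for each policy it could install,
# evaluated with the credences available at build time (fair coin).
P_HEADS = 0.5
P_TAILS = 0.5

def expected_earnings(gives_on_tails: bool) -> float:
    """Expected dollars from Omega's game, from the builder's standpoint."""
    # Omega pays on heads only if the agent *would* give on tails.
    heads_payoff = 10_000 if gives_on_tails else 0
    tails_payoff = -100 if gives_on_tails else 0
    return P_HEADS * heads_payoff + P_TAILS * tails_payoff

print(expected_earnings(gives_on_tails=True))   # 0.5*10000 + 0.5*(-100) = 4950.0
print(expected_earnings(gives_on_tails=False))  # 0.0
```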

More generally, if we suppose that we get whatever utility will follow from our agent's actions, then we can do no better than to program the agent to follow UDT.  But since we have to program the UDT agent now, the act-evaluating function that determines how the agent will act needs to be fixed with the probabilities that we know now.  This will suffice to maximize our expected utility given our best knowledge at the time when we build the agent.

So, it makes sense for a builder to program an agent to follow UDT on expected-utility grounds.  We can understand the builder's motivations.  We can get inside the builder's head, so to speak.

But what about the agent's head?  The brilliance of Nesov's scenario is that it is so hard, on first hearing it, to imagine why a reasonable agent would give Omega the money knowing that the only result will be that they gave up $100.  It's easy enough to follow the UDT formalism.  But what on earth could the UDT agent itself be thinking?  Yes, trying to figure this out is an exercise in anthropomorphization.  Nonetheless, I think that it is worthwhile if we are going to use UDT to try to understand what we ought to do.

Here are three ways to conceive of the agent's thinking when it gives Omega the $100.  They form a sort of spectrum.

  1. One extreme view:  The agent considers all the possible worlds to be on equal ontological footing.  There is no sense in which any one of them is distinguished as "actual" by the agent.  It conceives of itself as acting simultaneously in all the possible worlds so as to maximize utility over all of them.  Sometimes this entails acting in one world so as to make things worse in that world.  But, no matter which world this is, there is nothing special about it.  The only property of the world that has any ontological significance is the probability weight given to that world at the time that the agent was built. (I believe that this is roughly the view that Wei Dai himself takes, but I may be wrong.)
  2. An intermediate view:  The agent thinks that there is only one actual world.  That is, there is an ontological fact of the matter about which world is actual.  However, the other possible worlds continue to exist in some sense, although they are merely possible, not actual.  Nonetheless, the agent continues to care about all of the possible worlds, and this amount of care never changes.  After being counterfactually mugged, the agent is happy to know that, in some merely-possible world, Omega gave the agent $10000.
  3. The other extreme:  As in (2), the agent thinks that there is only one actual world.  Contrary to (2), the agent cares about only this world.  However, the agent is a deontologist.  When deciding how to act, all that it cares about is whether its act in this world is "right", where "right" means "maximizes the fixed act-evaluating function that was built into me."

View (3) is the one that I wanted to develop in this post.  On this view, the "probability distribution" in the act-evaluating function no longer has any epistemic meaning for the agent.  The act-evaluating function is just a particular computation which, for the agent, constitutes the essence of rightness.  Yes, the computation involves considering some counterfactuals, but to consider those counterfactuals does not entail any ontological commitment.
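As a minimal sketch of what such a fixed computation might look like, here is a toy act-evaluating function. The two-world setup, the PRIOR table, and the payoffs are my own illustrative assumptions, not part of the UDT formalism; the point is only that the weights are frozen in when the agent is written, so evaluating an act requires no commitment about which world is actual.

```python
# A fixed act-evaluating function, frozen into the agent when it is written.
# The once-possible worlds and their weights are baked in; nothing the agent
# later learns about which world is actual changes this computation.
PRIOR = {"heads_world": 0.5, "tails_world": 0.5}   # builder's credences at write time

def payoff(world: str, gives_on_tails: bool) -> float:
    """Toy payoffs for the counterfactual mugging (illustrative only)."""
    if world == "heads_world":
        return 10_000 if gives_on_tails else 0
    return -100 if gives_on_tails else 0

def act_value(gives_on_tails: bool) -> float:
    """The agent's 'essence of rightness': a weighted sum over counterfactuals,
    carrying no commitment about which world is actual."""
    return sum(p * payoff(w, gives_on_tails) for w, p in PRIOR.items())

# The deontological reading: the agent asks only which act scores higher on
# this fixed function, not which act helps in the world it now knows it is in.
best_act = max([True, False], key=act_value)   # -> True: give the $100
```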

Thus, when the agent has been counterfactually mugged, it's not (as in (1)) happy because it cares about expected utility over all possible worlds.  It's not (as in (2)) happy because, in some merely-possible world, Omega gave it $10000.  On this view, the agent considers all those "possible worlds" to have been rendered impossible by what it has learned since it was built.  The reason the agent is happy is that it did the right thing.  Merely doing the right thing has given the agent all the utility it could hope for.  More to the point, the agent got that utility in the actual world.  The agent knows that it did the right thing, so it genuinely does not care about what actual consequences will follow from its action.

In other words, although the agent lost $100, it really gained from the interaction with Omega.  This suggests that we try to consider a "true" analog of the Counterfactual Mugging.  In The True Prisoner's Dilemma, Eliezer Yudkowsky presents a version of the Prisoner's Dilemma in which it's viscerally clear that the payoffs at stake capture everything that we care about, not just our selfish values.  The point is to make the problem about utilons, and not about some stand-in, such as years in prison or dollars.

In a True Counterfactual Mugging, Omega would ask the agent to give up utility.  Here we see that the UDT agent cannot possibly do as Omega asks.  Whatever it chooses to do will turn out to have in fact maximized its utility.  Not just expected utility, but actual utility. In the original Counterfactual Mugging, the agent looks like something of a chump who gave up $100 for nothing.  But in the True Counterfactual Mugging, our deontological agent lives with the satisfaction that, no matter what it does, it lives in the best of all possible worlds.

 


[1] ETA: Under UDT, the agent assigns a utility to having all of the possible worlds P_1, P_2, . . . undergo respective execution histories E_1, E_2, . . .  (The way that a world evolves may depend in part on the agent's action.)  That is, for each vector <E_1, E_2, . . .> of ways that these worlds could respectively evolve, the agent assigns a utility U(<E_1, E_2, . . .>).  Due to criticisms by Vladimir Nesov (beginning here), I have realized that this post only applies to instances of UDT in which the utility function U takes the form that it has in standard decision theories.  In this case, each world P_i has its own probability pr(P_i) and its own utility function u_i that takes an execution history of P_i alone as input, and the function U takes the form

U(<E_1, E_2, . . .>) = Σ_i pr(P_i) u_i(E_i).

The probabilities pr(P_i) are what I'm talking about when I mention probabilities in this post.  Wei Dai is interested in instances of UDT with more general utility functions U.  However, to my knowledge, this special kind of utility function is the only one in terms of which he's talked about the meanings of probabilities of possible worlds in UDT.  See in particular this quote from the original UDT post:

If your preferences for what happens in one such program is independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program.

(A "program" is what Wei Dai calls a possible world in that post.)  The utility function U is "baked in" to the UDT agent at the time it's created.  Therefore, so too are the probabilities pr(P_i).
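For concreteness, the factored special case above can be written as a short function. This is a minimal sketch, with pr, the u_i, and the E_i supplied by the caller; the names are mine.

```python
from typing import Callable, Sequence

def factored_U(pr: Sequence[float],
               u: Sequence[Callable[[object], float]],
               E: Sequence[object]) -> float:
    """U(<E_1, E_2, ...>) = sum_i pr(P_i) * u_i(E_i), the special factored form.

    pr[i] -- the builder's probability for world P_i, fixed at build time
    u[i]  -- a utility function over execution histories of P_i alone
    E[i]  -- a candidate execution history of P_i
    """
    assert len(pr) == len(u) == len(E)
    return sum(p_i * u_i(E_i) for p_i, u_i, E_i in zip(pr, u, E))
```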

[2] By "the actual world", I do not mean one of the worlds in the many-worlds interpretation (MWI) of quantum mechanics.  I mean something more like the entire path traversed by the quantum state vector of the universe through its corresponding Hilbert space.  Distinct possible worlds are distinct paths that the state of the universe might (for all we know) be traversing in this Hilbert space.  All the "many worlds" of the MWI together constitute a single world in the sense used here.

 


 

ETA: This post was originally titled "UDT agents are deontologists".  I changed the title to "UDT agents as deontologists" to emphasize that I am describing a way to view UDT agents.  That is, I am describing an interpretive framework for understanding the agent's thinking.  My proposal is analogous to Dennett's "intentional stance".  To take the intentional stance is not to make a claim about what a conscious organism is doing.  Rather, it is to make use of a framework for organizing our understanding of the organism's behavior.  Similarly, I am not suggesting that UDT somehow gets things wrong.  I am saying that it might be more natural for us if we think of the UDT agent as a deontologist, instead of as an agent that never changes its belief about which possible worlds will actually happen.  I say a little bit more about this in this comment.

117 comments

I don't understand the motivation for developing view (3). It seems like any possible agent could be interpreted that way:

When deciding how to act, all that it cares about is whether its act in this world is "right", where "right" means "maximizes the fixed act-evaluating function that was built into me."

How does it help us to understand UDT specifically?

1Tyrrell_McAllister14y
I don't claim that it helps us to understand UDT as a decision theory. It is a way to anthropomorphize UDT agents to get an intuitively-graspable sense of what it might "feel like to be" a UDT agent with a particular utility function U over worlds.
0Tyrrell_McAllister14y
I think that I probably missed your point in my first reply. I see now that you were probably asking why it's any more useful to view UDT agents this way than it would be to view any arbitrary agent as a deontologist.

The reason is that the UDT agent appears, from the outside, to be taking into account what happens in possible worlds that it should know will never happen, at least according to conventional epistemology. Unlike conventional consequentialists, you cannot interpret its behavior as a function of what it thinks might happen in the actual world (with what probabilities and with what payoffs). You can interpret its behavior as a function of what its builders thought might happen in the actual world, but you can't do this for the agent itself.

One response to this is to treat the UDT agent as a consequentialist who cares about the consequences of its actions even in possible worlds that it knows aren't actual. This is perfectly fine, except that it makes it hard to conceive of the agent as learning anything. The agent continues to take into account the evolution-histories of world-programs that would call it as a subroutine if they were run, even after it learns that they won't be run. (Obviously this is not a problem if you think that the notion of an un-run program is incoherent.)

The alternative approach that I offer allows us to think of the agent as learning which once-possible worlds are actual. This is a more natural way to conceive of epistemic agents in my opinion. The cost is that the UDT agent is now a deontologist, for whom the rightness of an action doesn't depend on just the effects that it will have in the actual world. "Rightness" doesn't depend on actual consequences, at least not exclusively. However, the additional factors that figure into the "rightness" of an act require no further justification as far as the agent is concerned. This is not to turn those additional factors into a "black box". They were designed by the agent's builde

The reason the agent is happy is that it did the right thing. Merely doing the right thing has given the agent all the utility it could hope for.

This seems to be tacking a lot of anthropomorphic emotional reactions onto the agent's decision theory.

Imagine an agent that follows the decision theory of "Always take the first option presented." but has humanlike reactions to the outcome.

It will one box or two box depending on how the situation is described to it, but it will be happy if it gets the million dollars.

The process used to make choices need not be connected to the process used to evaluate preference.

1Tyrrell_McAllister14y
It may in some cases be inappropriate to anthropomorphize an agent. But anthropomorphization can be useful in other cases. My suggestion in the OP is to be used in the case where anthropomorphization seems useful. This is a great example. Maybe I should have started with something like that to motivate the post. Suppose that someone you cared about were acting like this. Let's suppose that, according to your decision theory, you should try to change the person to follow a different decision algorithm. One option is to consider them to be a baffling alien, whose actions you can predict, but whose thinking you cannot at all sympathize with. However, if you care about them, you might want to view them in a way that encourages sympathy. You also probably want to interpret their psychology in a way that seems as human as possible, so that you can bring to bear the tools of psychology. Psychology, at this time, depends heavily on using our own human brains as almost-opaque boxes to model other neurologically similar humans. So your only hope of helping this person is to conceive of them in a way that seems more like a normal human. You need to anthropomorphize them. In this case, I would probably first try to think of the person as a normal person who is being parasitized by an alien agent with this weird decision theory. I would focus on trying to remove the parasitic agent. The hope would be that the human has normal human decision-making mechanisms that were being overridden by the parasite.

I feel again as if I do not understand what Timeless Decision Theory or Updateless Decision Theory is (or what it's for; what it adds to ordinary decision theory). Can anyone help me? For example, by providing the simplest possible example of one of these "decision theories" in action?

Suppose we have an agent that cares about something extremely simple, like number of paperclips in the world. More paperclips is a better world. Can someone provide an example of how TDT or UDT would matter, or would make a difference, or would be applied, by an entity which made its decisions using that criterion?

5saturn14y
This is my vague understanding.

Naive decision theory: "Choose the action that will cause the highest expected utility, given what I know now."

Timeless decision theory: "Choose the action that I wish I had precommitted to, given what I know now."

Updateless decision theory: "Choose a set of rules that will cause the highest expected utility given my priors, then stick to it no matter what happens."
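If this gloss is roughly right, the difference between the first and third rules shows up on the counterfactual mugging itself. The sketch below is my own toy framing, not anyone's canonical formalization: the "updateless" rule scores the policy with the prior coin probabilities, while the naive rule scores the act after conditioning on the observed tails. (TDT's precommitment gloss is left aside here.)

```python
# Toy contrast on the counterfactual mugging (illustrative framing only).

def updateless_score(gives_on_tails: bool) -> float:
    # Policy scored with the prior: the coin could still land either way.
    return 0.5 * (10_000 if gives_on_tails else 0) + 0.5 * (-100 if gives_on_tails else 0)

def naive_score(gives_on_tails: bool) -> float:
    # Act scored after updating on the observation that the coin landed tails.
    return 1.0 * (-100 if gives_on_tails else 0)

print(max([True, False], key=updateless_score))  # True: the updateless rule pays
print(max([True, False], key=naive_score))       # False: the naive updater refuses
```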
0NancyLebovitz14y
If this is accurate, then I don't see how UDT can generally be better than TDT. UDT would be better in circumstances where you suspect that your ability to update accurately is compromised. I'm assuming that the priors for UDT were set at some past time.
2saturn14y
UDT gives the money in the counterfactual mugging thought experiment, TDT doesn't. There's nothing that prevents a UDT agent from behaving as if it were updating; that's what I surmise would happen in more normal situations where Omega isn't involved. But if ignoring information is the winning move, TDT can't do that.
1Douglas_Knight14y
TDT and UDT are intended to solve Newcomb's problem and the prisoner's dilemma and those are surely the simplest examples of their strengths. It is fairly widely believed that, say, causal decision theory two-boxes and defects, but I would rather say that CDT simply doesn't understand the statements of the problems. Either way, one-boxing and arranging mutual cooperation are improvements.
1Mass_Driver14y
Here, so far as I can understand it, is UDT vs. ordinary DT for paper clips:

Ordinary DT ("ODT") says: at all times t, act so as to maximize the number of paper clips that will be observed at time (t + 1), where "1" is a long time and we don't have to worry about discount rates.

UDT says: in each situation s, take the action that returns the highest value on an internal lookup table that has been incorporated into me as part of my programming, which, incidentally, was programmed by people who loved paper clips.

Suppose ODT and UDT are fairly dumb, say, as smart as a cocker spaniel. Suppose we put both agents on the set of the movie Office Space.

ODT will scan the area, evaluate the situation, and simulate several different courses of action, one of which is bending staples into paper clips. Other models might include hiding, talking to accountants, and attempting to program a paper clip screensaver using Microsoft Office. The model that involves bending staples shows the highest number of paper clips in the future compared to other models, so the ODT will start bending staples. If the ODT is later surprised to discover that the boss has walked in and confiscated the staples, it will be "sad" because it did not get as much paper-clip utility as it expected to, and it will mentally adjust the utility of the "bend staples" model downward, especially when it detects boss-like objects. In the future, this may lead ODT to adopt different courses of behavior, such as "bend staples until you see boss, then hide." The reason for changing course and adopting these other behaviors is that they would have relatively higher utility in its modeling scheme.

UDT will scan the area, evaluate the situation, and categorize the situation as situation #7, which roughly corresponds to "metal available, no obvious threats, no other obvious resources," and look up the correct action for situation #7, which its programmers have specified is "bend staples into paper clips." Accordingl
1Tyrrell_McAllister14y
For god's sake, don't call it my UDT :D. My post already seems to be giving some people the impression that I was suggesting some amendment or improvement to Wei Dai's UDT.
2Mass_Driver14y
Edited. [grin]
1Vladimir_Nesov14y
If it's any consolation, the last bit of understanding of the original Wei Dai's post (the role of execution histories, prerequisite to being able to make this correction) dawned on me only last week, as a result of a long effort for developing a decision theory of my own that only in retrospect turned out to be along roughly the same lines as UDT.
0khafra14y
A convergence like that makes both UDT and your decision theory more interesting to me. Is the process of your decision theory's genesis detailed on your personal blog? In retrospect, was your starting place and development process influenced heavily enough by LW/OB/Wei Dai to screen out the coincidence?
3Vladimir_Nesov14y
I call it "ambient control". This can work as an abstract: Longer description here. I'll likely post on some aspects of it in the future, as the idea gets further developed. There is a lot of trouble with logical strength of theories of consequences, for example. There is also some hope to unify logical and observational uncertainty here, at the same time making the decision algorithm computationally feasible (it's not part of the description linked above).

The act-evaluating function is just a particular computation which, for the agent, constitutes the essence of rightness.

This sounds almost like saying that the agent is running its own algorithm because running this particular algorithm constitutes the essence of rightness. This perspective doesn't improve understanding of the process of decision-making, it just rounds up the whole agent in an opaque box and labels it an officially approved way to compute. The "rightness" and "actual world" properties you ascribe to this opaque box don't seem to be actually present.

0Tyrrell_McAllister14y
They aren't present as part of what we must know to predict the agent's actions. They are part of a "stance" (like Dennett's intentional stance) that we can use to give a narrative framework within which to understand the agent's motivation. What you are calling a black box isn't supposed to be part of the "view" at all. Instead of a black box, there is a socket where a particular program vector <P_1, P_2, . . .> and "preference vector" <E_1, E_2, . . .>, together with the UDT formalism, can be plugged in. ETA: The reference to a "'preference vector' <E_1, E_2, . . .>" was a misreading of Wei Dai's post on my part. What I (should have) meant was the utility function U over world-evolution vectors <E_1, E_2, . . .>.
0Vladimir_Nesov14y
I don't understand this.
0Mass_Driver14y
Edited Previously, I attempted to disagree with this comment. My disagreement was tersely dismissed, and, when I protested, my protests were strongly downvoted. This suggests two possibilities: (1) I fail to understand this topic in ways that I fail to understand or (2) I lack the status in this community for my disagreement with Vladmir_Nesov on this topic to be welcomed or taken seriously. If I were certain that the problem were (2), then I would continue to press my point, and the karma loss be damned. However, I am still uncertain about what the problem is, and so I am deleting all my posts on the thread underneath this comment. One commenter suggested that I was being combative myself; he may be right. If so, I apologize for my tone.
0Vladimir_Nesov14y
Saying that this decision is "right" has no explanatory power, gives no guidelines on the design of decision-making algorithms.
-1Tyrrell_McAllister14y
I am nowhere purporting to be giving guidelines for the design of a decision-making algorithm. As I said, I am not suggesting any alteration of the UDT formalism. I was also explicit in the OP that there is no problem understanding at an intuitive level what the agent's builders were thinking when they decided to use UDT. If all you care about is designing an agent that you can set loose to harvest utility for you, then my post is not meant to be interesting to you.
3Vladimir_Nesov14y
Beliefs should pay rent, not fly in the ether, unattached to what they are supposed to be about.
0Tyrrell_McAllister14y
The whole Eliezer quote is that beliefs should "pay rent in future anticipations". Beliefs about which once-possible world is actual do this.
0Vladimir_Nesov14y
The beliefs in question are yours, and anticipation is about agent's design or behavior.
0[anonymous]14y
The quote applies to humans, I use it as appropriately ported to more formal decision-making, where "anticipated experience" doesn't generally make sense.
-7[anonymous]14y

Let me see if I understand your argument correctly: UDT works by converting all beliefs about facts into their equivalent value expressions (due to fact/value equivalence), and chooses the optimal program for maximizing expected utility according to those values.

So, if you were to program a robot such that it adheres to the decisions output by UDT, then this robot, when acting, can be viewed as simply adhering to a programmer-fed ruleset. That ruleset does not explicitly use desirability of any consequence as a desideratum when deciding what action to out...

0Tyrrell_McAllister14y
I think that's about right. Your next question might be, "How does this make a UDT agent different from any other?" I address that question in this reply to Wei Dai.
0SilasBarta14y
Thanks! Turns out I correctly guessed your answer to that question too! (I noticed the distinction between the programmer's goals and [what the agent regards as] the agent's goals, but hadn't mentioned that explicitly in my summary.) Doesn't sound too unreasonable to me... I'll think about it some more. Edit: Do you think it would be a good idea to put (a modified version of) my summary at the top of your article?

Voted up for, among other things, actually explaining UDT in a way I could understand. Thanks! :-)

In a True Counterfactual Mugging, Omega would ask the agent to give up utility.

Doesn't this, like, trivially define what should be the correct decision? What's the point?

0Tyrrell_McAllister14y
The point is, "the UDT agent cannot possibly satisfy this request." So I think we agree here (?).
0Vladimir_Nesov14y
You'd need to represent your problem statement in terms UDT understands, with the world program and strategy-controlled probabilities for its possible execution histories, and fixed utilities for each possible execution history. If you do that properly, you'll find that UDT acts correctly (otherwise, you haven't managed to correctly represent your problem statement...).
3Tyrrell_McAllister14y
Are you under the impression that I am saying that UDT acts incorrectly? I was explicit that I was suggesting no change to the UDT formalism. I was explicit that I was suggesting a way to anthropomorphize what the agent is thinking. Are you familiar with Dennett's notion of an intentional stance? This is like that. To suggest that we view the agent from a different stance is not to suggest that the agent acts differently. ETA: I'm gathering that I should have been clearer that the so-called "true counterfactual mugging" is trivial or senseless when posed to a UDT agent. I'm a little surprised that I failed to make this clear, because it was the original thought that motivated the post. It's not immediately obvious to me how to make this clearer, so I will give it some thought.
0Vladimir_Nesov14y
You've got this in the post: I'm not sure what you intended to say by that, but it sounds like "UDT agent will make the wrong decision", together with an opaque proposition that Omega offers "actual utility and not even expected utility", which it's not at all clear how to represent formally.
1Tyrrell_McAllister14y
No, that is not at all what I meant. That interpretation never occurred to me. I meant that the UDT agent cannot possibly give up the utility that Omega asks for in the previous sentence. Now that I understood how you misunderstood that part, I will edit it.
0Vladimir_Nesov14y
Well, isn't it a good thing that UDT won't give up utility to Omega? You can't take away utility on one side of the coin, and return it on the other, utility is global.
1Tyrrell_McAllister14y
Yes, of course it is. I'm afraid that I don't yet understand why you thought that I suggest otherwise. Yes, that is why I said that the agent couldn't possibly satisfy Omega's request to give it utility. You are attacking a position that I don't hold. But I'm not sure what position you're thinking of, so I don't know how to address the misunderstanding. You haven't made any claim that I disagree with in response to that paragraph.
0[anonymous]14y

It seems to me that you're looking for a way to model a deontologist.

And a necessary condition is that you follow a function that does not depend on states of the world. If you don't have any fixed principles, we can't call you a deontologist. You can call that UDT (I think I've seen the same thing called rule-utilitarianism.)

Is there a more complicated insight than that here?

0Tyrrell_McAllister14y
I don't think so. I'm supposing that I'm reasonably comfortable with human deontologists, and I'm trying to use that familiarity to make intuitive sense of the behavior of a UDT agent.
1[anonymous]14y
Well, that's the way the post was phrased ("a UDT agent is a deontologist.") But you could construct a UDT agent that doesn't behave anything like a human deontologist, who acts based upon a function that has nothing to do with rights or virtues or moral laws. That's why I think it's better understood as "All deontologists are UDT" instead of vice versa.
1Tyrrell_McAllister14y
It's easier for me to understand an agent who acts on weird principles (such as those having nothing to do with rights or virtues or moral laws) than an agent who either

* thinks that all possible worlds are equally actual, or
* doesn't care more for what happens in the actual world than what happens in possible worlds.

So, if I were to think of deontologists as UDT agents, I would be moving them further away from comprehensibility.
0[anonymous]14y

With respect to (1), UDT maximizes over worlds where the zillionth digit of pi is 1, 2, 3...8, 9, 0. It does this even after it knows the value of the digit in question. Most of those worlds aren't part of the Tegmark level IV multiverse. It seems this post could benefit from distinguishing between possible and impossible possible worlds.

3Vladimir_Nesov14y
These are not different worlds for UDT, but a single world that can have different possible execution histories that state zillionth digit of pi to be 0,1,...,9. Mathematical intuition establishes a probability distribution over these execution histories for the fixed world program that defines the subject matter.
0Tyrrell_McAllister14y
That may be so, but I need to think about how to do it. I said that the possible worlds are whatever the agent's builders thought was possible. That is, "possibility" refers to the builders' ignorance, including their ignorance about the zillionth digit of pi.

What is the difference between (1) and (2)? Just an XML tag that the agent doesn't care about, but sticks onto one of the worlds it considers possible? (Why would it continue spending cycles to compute which world is actual, if it doesn't care?)

0Tyrrell_McAllister14y
Basically, yes. (2) is not a view that I advocate.

According to this view, the agent cares about only the actual world.

A decision-making algorithm can only care about things accessible in its mind. The "actual world" is not one of them.

Although how does it connect with a phrase later in the paragraph?

It doesn't care at all about the actual consequences that result from its action.

0Tyrrell_McAllister14y
The purpose of this post is not to defend realism, and I think that it would take me far afield to do so now. For example, on my view, the agent is not identical to its decision-making algorithm, if that is to be construed as saying that the agent is purely an abstract mathematical entity. Rather, the agent is the actual implementation of that algorithm. The universe is not purely an algorithm. It is an implementation of that algorithm. Not all algorithms are in fact run. I haven't given any reasons for the position that I just stated. But I hope that you can recognize it as a familiar position, however incoherent it seems to you. Do you need any more explanation to understand the viewpoint that I'm coming from in the post?
0Vladimir_Nesov14y
The actual world is not epistemically accessible to the agent. It's a useless concept for its decision-making algorithm. An ontology (logic of actions and observations) that describes possible worlds and in which you can interpret observations, is useful, but not the actual world.
0Tyrrell_McAllister14y
An ontology is not a "logic of actions and observations" as I am using the term. I am using it in the sense described in the Stanford Encyclopedia of Philosophy. At any rate, what I'm calling the ontology is not part of the decision theory. I consider different ontologies that the agent might think in terms of, but I am explicit that I am not trying to change how the UDT itself works when I write, "I suggest an alternative conception of a UDT agent, without changing the UDT formalism."
0Vladimir_Nesov14y
I give up.

One way (the usual way?) to think of an agent running Updateless Decision Theory is to imagine that the agent always cares about all possible worlds according to how probable those worlds seemed when the agent's source code was originally written.

Seemed to who? And what about the part where the probabilities are controlled by agent's decisions (as estimated by mathematical intuition)?

0Tyrrell_McAllister14y
To the agent's builders. ETA: I make that clear later in the post, but I'll add it to the intro paragraph.

I'm not sure what you mean. What I'm describing as coded into the agent "from birth" is Wei Dai's function P, which takes an output string Y as its argument (using subscript notation in his post). ETA: Sorry, that is not right. To be more careful, I mean the "mathematical intuition" that takes in an input X and returns such a function P. But P isn't controlled by the agent's decisions.

ETA2: Gah. I misremembered how Wei Dai used his notation. And when I went back to the post to answer your question, I skimmed too quickly and misread. So, final answer, when I say that "the agent always cares about all possible worlds according to how probable those worlds seemed to the agent's builders when they wrote the agent's source code", I'm talking about the "preference vector" that Wei Dai denotes by <E_1, E_2, . . .> and which he says "defines its preferences on how those programs should run." I took him to be thinking of these entries E_i as corresponding to probabilities because of his post What Are Probabilities, Anyway?, where he suggests that "probabilities represent how much I care about each world".

ETA3: Nope, this was another misreading on my part. Wei Dai does not say that <E_1, E_2, . . .> is a vector of preferences, or anything like that. He says that it is an input to a utility function U, and that utility function is what "defines [the agent's] preferences on how those programs should run". So, what I gather very tentatively at this point is that the probability of each possible world is baked into the utility function U.
0Vladimir_Nesov14y
Do you see that these E's are not intended to be interpreted as probabilities here, and so "probabilities of possible worlds are fixed at the start" remark at the beginning of your post is wrong?
0Tyrrell_McAllister14y
Yes. I realize that my post applies only to the kind of UDT agent that Wei Dai talks about when he discusses what probabilities of possible worlds are. See the added footnote.
0Vladimir_Nesov14y
It's still misinterpretation of Wei Dai's discussion of probability. What you described is not UDT, and not even a decision theory: say, what U() is for? It's not utility of agent's decision. When Wei Dai discusses probability in the post you linked, he still means it in the same sense as is used in decision theories, but makes informal remarks about what those values, say, P_Y(...), seem to denote. From the beginning of the post: Weights assigned to world-histories, not worlds. Totally different. (Although Wei Dai doesn't seem to consistently follow the distinction in terminology himself, it begins to matter when you try to express things formally.) Edit: this comment is wrong, see correction here.
2Tyrrell_McAllister14y
I have added a link (pdf) to a complete description of what a UDT algorithm is. I am confident that there are no "misinterpretations" there, but I would be grateful if you pointed out any that you perceive.
2Vladimir_Nesov14y
I believe it is an accurate description of UDT as presented in the original post, although incomplete knowledge about P_i can be accommodated without changing the formalism, by including all alternatives (completely described this time) enabled by available knowledge about the corresponding world programs, in the list {P_i} (which is the usual reading of "possible world"). Also note that in this post Wei Dai corrected the format of the decisions from individual input/output instances to global strategy-selection.
0Tyrrell_McAllister14y
How important is it that the list {P_i} be finite? If P_i is one of the programs in our initial list that we're uncertain about, couldn't there be infinitely many alternative programs P_i1, P_i2, . . . behind whatever we know about P_i? I was thinking that incomplete knowledge about the P_i could be captured (within the formalism) with the mathematical intuition function. (Though it would then make less sense to call it a specifically mathematical intuition.) I've added a description of UDT1.1 to my pdf.
2Vladimir_Nesov14y
In principle, it doesn't matter, because you can represent a countable list of programs as a single program that takes an extra parameter (but then you'll need to be more careful about the notion of "execution histories"), and more generally you can just include all possible programs in the list and express the level to which you care about the specific programs in the way mathematical intuition ranks their probability and the way utility function ranks their possible semantics. On execution histories: note that a program is a nice finite inductive definition of how that program behaves, while it's unclear what an "execution history" is, since it's an infinite object and so it needs to be somehow finitely described. Also, if, as in the example above you have the world program taking parameters (e.g. a universal machine that takes a Goedel number of a world program as parameter), you'll have different executions depending on parameter. But if you see a program as a set of axioms for a logical theory defining the program's behavior, then execution histories can just be different sets of axioms defining program's behavior in a different way. These different sets of axioms could describe the same theories, or different theories, and can include specific facts about what happens during program execution on so and so parameters. Equivalence of such theories will depend on what you assume about the agent (i.e. if you add different assumptions about the agent to the theories, you get different theories, and so different equivalences), which is what mathematical intuition is trying to estimate.
0Vladimir_Nesov14y
It's not accurate to describe strategies as mappings f: X->Y. A strategy can be interactive: it takes input, produces an output, and then environment can prepare another input depending on this output, and so on. Think normalization in lambda calculus. So, the agent's strategy is specified by a program, but generally speaking this program is untyped. Let's assume that there is a single world program, as described here. Then, if A is the agent's program known to the agent, B is one possible strategy for that program, given in form of a program, X is the world program known to the agent, and Y is one of the possible world execution histories of X given that A behaves like B, again given in form of a program, then mathematical intuition M(B,Y) returns the probability that the statement (A~B => X~Y) is true, where A~B stands for "A behaves like B", and similarly for X and Y. (This taps into the ambient control analysis of decision theory.)
1Tyrrell_McAllister14y
I'm following this paragraph from Wei Dai's post on UDT1.1: So, "input/output mappings" is Wei Dai's language. Does he not mean mappings between the set of possible inputs and the set of possible outputs? It seems to me that this could be captured by the right function f: X -> Y. The set I of input-output mappings could be a big collections of GLUTs. Why wouldn't that suffice for Wei Dai's purposes? ETA: And it feels weird typing out "Wei Dai" in full all the time. But the name looks like it might be Asian to me, so I don't know which part is the surname and which is the given name.
4Wei Dai14y
I've been wondering why people keep using my full name around here. Yes, the name is Chinese, but since I live in the US I follow the given-name-first convention. Feel free to call me "Wei".
0Vladimir_Nesov14y
No, you can't represent an interactive strategy by a single input to output mapping. That post made a step in the right direction, but stopped short of victory :-). But I must admit, I forgot about that detail in the second post, so you've correctly rendered Wei's algorithm, although using untyped strategies would further improve on that.
2Wei Dai14y
Why not? BTW, in UDT1.1 (as well as UDT1), "input" consists of the agent's entire memory of the past as well as its current perceptions. Thought I'd mention that in case there's a misunderstanding there.
0Vladimir_Nesov14y
... okay, this question allowed me to make a bit of progress. Taking as a starting point the setting of this comment (that we are estimating the probability of (A~B => X~Y) being true, where A and X are respectively agent's and environment's programs, B and Y programs representing agent's strategy and outcome for environment), and the observations made here and here, we get a scheme for local decision-making. Instead of trying to decide the whole strategy, we can just decide the local action. Then, the agent program, and "input" consisting of observations and memories, together make up the description of where the agent is in the environment, and thus where its control will be applied. The action that the agent considers can then be local, just something the agent does at this very moment, and the alternatives for this action are alternative statements about the agent: thus, instead of considering a statement A~B for agent's program A and various whole strategies B, we consider just predicates like action1(A) and action2(A) which assert A to choose action 1 or action 2 in this particular situation, and which don't assert anything else about its behavior in other situations or on other counterfactuals. Taking into account other actions that the agent might have to make in the past or in the future happens automatically, because the agent works with complete description of environment, even if under severe logical uncertainty. Thus, decision-making happens "one bit at a time", and the agent's strategy mostly exists in the environment, not under in any way direct control by the agent, but still controlled in the same sense everything in the environment is. Thus, in the simplest case of a binary local decision, mathematical intuition would only take as explicit argument a single bit, which indicates what assertion is being made about [agent's program together with memory and observations], and that is all. No maps, no untyped strategies. This solution was unavailable
0Wei Dai14y
Well, that was my approach in UDT1, but then I found a problem that UDT1 apparently can't solve, so I switched to optimizing over the global strategy (and named that UDT1.1). Can you re-read explicit optimization of global strategy and let me know what you think about it now? What I called "logical correlation" (using Eliezer's terminology) seems to be what you call "ambient control". The point of that post was that it seems an insufficiently powerful tool for even two agents with the same preferences to solve the general coordination problem amongst themselves, if they only explicitly optimize the local decision and depend on "logical correlation"/"ambient control" to implicitly optimize the global strategy. If you think there is some way to get around that problem, I'm eager to hear it.
0Vladimir_Nesov14y
So far as I can see, your mistake was assuming "symmetry", and dropping probabilities. There is no symmetry, only one of the possibilities is what will actually happen, and the other (which I'm back to believing since the last post on DT list) is inconsistent, though you are unlikely to be able to actually prove any such inconsistency. You can't say that since (S(1)=A => S(2)=B) therefore (S(1)=B => S(2)=A). One of the counterfactuals is inconsistent, so if S(1) is in fact A, then S(1)=B implies anything. But what you are dealing with are probabilities of these statements (which possibly means proof search schemes trying to prove these statements and making a certain number of elementary assumptions, the number that works as the length of programs in universal probability distribution). These probabilities will paint a picture of what you expect the other copy to do, depending on what you do, and this doesn't at all have to be symmetric.
0Wei Dai14y
If there is to be no symmetry between "S(1)=A => S(2)=B" and "S(1)=B => S(2)=A", then something in the algorithm has to treat the two cases differently. In UDT1 there is no such thing to break the symmetry, as far as I can tell, so it would treat them symmetrically and fail on the problem one way or another. Probabilities don't seem to help since I don't see why UDT1 would assign them different probabilities. If you have an idea how the symmetry might be broken, can you explain it in more detail?
0Tyrrell_McAllister14y
I think that Vladimir is right if he is saying that UDT1 can handle the problem in your Explicit Optimization of Global Strategy post. With your forbearance, I'll set up the problem in the notation of my write-up of UDT1.

There is only one world-program P in this problem. The world-program runs the UDT1 algorithm twice, feeding it input "1" on one run, and feeding it input "2" on the other run. I'll call these respective runs "Run1" and "Run2". The set of inputs for the UDT1 algorithm is X = {1, 2}. The set of outputs for the UDT1 algorithm is Y = {A, B}. There are four possible execution histories for P:

* E, in which Run1 outputs A, Run2 outputs A, and each gets $0.
* F, in which Run1 outputs A, Run2 outputs B, and each gets $10.
* G, in which Run1 outputs B, Run2 outputs A, and each gets $10.
* H, in which Run1 outputs B, Run2 outputs B, and each gets $0.

The utility function U for the UDT1 algorithm is defined as follows:

* U(E) = 0.
* U(F) = 20.
* U(G) = 20.
* U(H) = 0.

Now we want to choose a mathematical intuition function M so that Run1 and Run2 don't give the same output. This mathematical intuition function does have to satisfy a couple of constraints:

* For each choice of input X and output Y, the function M(X, Y, –) must be a normalized probability distribution on {E, F, G, H}.
* The mathematical intuition needs to meet certain minimal standards to deserve its name. For example, we need to have M(1, B, E) = 0. The algorithm should know that P isn't going to execute according to E if the algorithm returns B on input 1.

But these constraints still leave us with enough freedom in how we set up the mathematical intuition. In particular, we can set

* M(1, A, F) = 1, and all other values of M(1, A, –) equal to zero;
* M(1, B, H) = 1, and all other values of M(1, B, –) equal to zero;
* M(2, A, E) = 1, and all other values of M(2, A, –) equal to zero;
* M(2, B, F) = 1, and all other values of M(2, B, –) equal to zero.

Thus, i
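For readers who want to check the arithmetic set up above, here is a minimal sketch that plugs the stated M into the UDT1 rule of choosing the output Y that maximizes the sum of M(X, Y, E)·U(E) over histories E. The variable names are mine.

```python
# Check that the mathematical intuition M defined above makes Run1 and Run2
# choose different outputs, as the coordination game requires.
HISTORIES = ["E", "F", "G", "H"]
U = {"E": 0, "F": 20, "G": 20, "H": 0}

# M[(input, output)] is a probability distribution over execution histories.
M = {
    (1, "A"): {"F": 1.0}, (1, "B"): {"H": 1.0},
    (2, "A"): {"E": 1.0}, (2, "B"): {"F": 1.0},
}

def expected_utility(x: int, y: str) -> float:
    return sum(M[(x, y)].get(h, 0.0) * U[h] for h in HISTORIES)

def udt1_output(x: int) -> str:
    return max(["A", "B"], key=lambda y: expected_utility(x, y))

print(udt1_output(1), udt1_output(2))  # "A B": history F obtains, $10 each
```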
0Vladimir_Nesov14y
That's cheating, you haven't explained anything, you've just chosen the strategies and baptized them with mathematical intuition magically knowing them from the start.
0Tyrrell_McAllister14y
I'm not sure what you mean by "cheating". Wei Dai doesn't claim to have explained where the mathematical intuition comes from, and I don't either. The point is, I could build a UDT1 agent with that mathematical intuition, and the agent would behave correctly if it were to encounter the scenario that Wei describes. How I came up with that mathematical intuition is an open problem. But the agent that I build with it falls under the scope of UDT1. It is not necessary to pass to UDT1.1 to find such an agent. I'm giving an existence proof: There exist UDT1 agents that perform correctly in Wei's scenario. Furthermore, the mathematical intuition used by the agent that I exhibit evaluates counterfactuals in a reasonable way (see my edit to the comment).
0Vladimir_Nesov14y
There is a difference between not specifying the structure of an unknown phenomenon for which we still have no explanation, and assigning the phenomenon an arbitrary structure without giving an explanation. Even though you haven't violated the formalism, mathematical intuition is not supposed to magically rationalize your (or mine) conclusions. No it's not, you've chosen it so that it "proves" what we believe to be a correct conclusion. Since you can force the agent to pick any of the available actions by appropriately manipulating its mathematical intuition, you can "prove" that there is an agent that performs correctly in any given situation, so long as you can forge its mathematical intuition for every such situation. You can also "prove" that there is an agent that makes the worst possible choice, in exactly the same way.
1Tyrrell_McAllister14y
This is kind of interesting. In Wei's problem, I believe that I can force a winning mathematical intuition with just a few additional conditions, none of which assume that we know the correct conclusion. They seem like reasonable conditions to me, though maybe further reflection will reveal counterexamples.

Using my notation from this comment, we have to find right-hand values for the following 16 equations.

M(1, A, E) = .   M(1, A, F) = .   M(1, A, G) = .   M(1, A, H) = .
M(1, B, E) = .   M(1, B, F) = .   M(1, B, G) = .   M(1, B, H) = .
M(2, A, E) = .   M(2, A, F) = .   M(2, A, G) = .   M(2, A, H) = .
M(2, B, E) = .   M(2, B, F) = .   M(2, B, G) = .   M(2, B, H) = .

In addition to the conditions that I mentioned in that comment, I add the following,

* Binary: Each probability distribution M(X, Y, –) is binary. That is, the mathematical intuition is certain about which execution history would follow from a given output on a given input.
* Accuracy: The mathematical intuition, being certain, should be accurate. That is, if the agent expects a certain amount of utility when it produces its output, then it should really get that utility.

(Those both seem sorta plausible in such a simple problem.)

* Counterfactual Accuracy: The mathematical intuition should behave well under counterfactual surgery, in the sense that I used in the edit to the comment linked above. More precisely, suppose that the algorithm outputs Y_i on input X_i for all i. Suppose that, for a single fixed value of j, we surgically interfered with the algorithm's execution to make it output Y'_j instead of Y_j on input X_j. Let E' be the execution history that would result from this. Then we ought to have that M(X_j, Y'_j, E') = 1.

I suspect that the counterfactual accuracy condition needs to be replaced with something far more subtle to deal with other problems, even in the binary case. Nonetheless, it seems interesting that, in this case, we don't need to use any prior knowledge about which m
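The narrowing-down claim can be checked by brute force. The sketch below is my own encoding of the conditions, under assumptions worth flagging: Binary is encoded by letting each M(X, Y, –) put all its mass on a single history consistent with Run_X outputting Y, and Accuracy plus Counterfactual Accuracy are collapsed into the single requirement that M(X, Y) be the history that would actually result if Run_X output Y while the other run behaved as the agent in fact behaves.

```python
from itertools import product

# Histories are named by (output of Run1, output of Run2); utilities as above.
HIST = {("A", "A"): "E", ("A", "B"): "F", ("B", "A"): "G", ("B", "B"): "H"}
U = {"E": 0, "F": 20, "G": 20, "H": 0}

# Binary condition: each M(x, y, -) is a point mass on one history, and that
# history must at least have Run_x outputting y.
CANDIDATES = {
    (1, "A"): ["E", "F"], (1, "B"): ["G", "H"],
    (2, "A"): ["E", "G"], (2, "B"): ["F", "H"],
}
KEYS = list(CANDIDATES)

def outputs(M):
    """What a UDT1 agent with point-mass intuition M outputs on each input."""
    return {x: max(["A", "B"], key=lambda y: U[M[(x, y)]]) for x in (1, 2)}

def accurate(M):
    """Accuracy + counterfactual accuracy, as encoded here: M(x, y) is the
    history that results if Run_x outputs y and the other run acts as the
    agent actually acts."""
    out = outputs(M)
    return all(
        M[(x, y)] == HIST[(y, out[2]) if x == 1 else (out[1], y)]
        for x in (1, 2) for y in ("A", "B")
    )

survivors = [M for M in (dict(zip(KEYS, c)) for c in product(*CANDIDATES.values()))
             if accurate(M)]
for M in survivors:
    print(M, outputs(M))
# Exactly two intuitions survive, and each makes Run1 and Run2 output
# different letters (histories G and F respectively).
```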
0Wei Dai14y
Do you think your idea is applicable to multi-player games, which is ultimately what we're after? (I don't see how to do it myself.) Take a look at this post, which I originally wrote for another mailing list:

In http://lesswrong.com/lw/1s5/explicit_optimization_of_global_strategy_fixing_a/ I gave an example of a coordination game for two identical agents with the same (non-indexical) preferences and different inputs. The two agents had to choose different outputs in order to maximize their preferences, and I tried to explain why it seemed to me that they couldn't do this by a logical correlation type reasoning alone.

A harder version of this problem involves two agents with different preferences, but who are otherwise identical. For simplicity let's assume they both care only about what happens in one particular world program (and therefore have no uncertainty about each other's source code). This may not be the right way to frame the question, which is part of my confusion. But anyway, let the choices be C and D, and consider this payoff matrix (and suppose randomized strategies are not possible):

0,0  4,5
5,4  0,0

Here's the standard PD matrix for comparison:

3,3  0,5
5,0  1,1

Nesov's intuitions at http://lesswrong.com/lw/1vv/the_blackmail_equation/1qk9 make sense to me in this context. It seems that if these two agents are to achieve the 4,5 or 5,4 outcome, it has to be through some sort of "jumbles of wires" consideration, since there is no "principled" way to decide between the two, as far as I can tell. But what is that reasoning exactly? Does anyone understand acausal game theory (is this a good name?) well enough to walk me through how these two agents might arrive at one of the intuitively correct answers (and also show that the same type of reasoning gives an intuitively correct answer for PD)? If my way of framing the question is not a good one, I'd like to see any kind of worked-out example in this vein.
2Vladimir_Nesov14y
It's tempting to take a step back and consider the coordination game from the point of view of the agent before-observation, as it gives a nice equivalence between the copies, control over the consequences for both copies from a common source. This comes with a simple algorithm, an actual explanation. But as I suspect you intended to communicate in this comment, this is not very interesting, because it's not a general case: in two-player games the other player is not your copy, and wasn't one any time previous. But if we try to consider the actions of agent after-observation, of the two copies diverged, there seems to be no nice solution anymore. It's clear how the agent before-observations controls the copies after, and so how its decisions about the strategy of reacting to future observations control both copies, coordinate them. It's far from clear how a copy that received one observation can control a copy that received the other observation. Parts control the whole, but not conversely. Yet the coordination problem could be posed about two agents that have nothing in common, and we'd expect there to be a solution to that as well. Thus I expect the coordination problem with two copies to have a local solution, apart from the solution of deciding in advance, as you describe in the post. My comment to which you linked is clearly flawed in at least one respect: it assumes that to control a structure B with agent A, B has to be defined in terms of A. This is still an explicit control mindset, what I call acausal control, but it's wrong, not as general as ambient control, where you are allowed to discover new dependencies, or UDT, where the discovery of new dependencies is implicit in mathematical intuition. It'll take much better understanding of theories of consequences, the process of their exploration, preference defined over them, to give specific examples, and I don't expect these examples to be transparent (but maybe there is a simple proof that the decision
0Tyrrell_McAllister14y
I think that there may have been a communication failure here. The comment that you're replying to is specifically about that exact game, the one in your post Explicit Optimization of Global Strategy (Fixing a Bug in UDT1). The communication failure is my fault, because I had assumed that you had been following along with the conversation. Here is the relevant context: In this comment, I re-posed your game from the "explicit optimization" post in the notation of my write-up of UDT. In that comment, I gave an example of a mathematical intuition such that a UDT1 agent with that mathematical intuition would win the game. In reply, Vladimir pointed out that the real problem is not to show that there exists a winning mathematical intuition. Rather, the problem is to give a general formal decision procedure that picks out a winning mathematical intuition. Cooking up a mathematical intuition that "proves" what I already believe to be the correct conclusion is "cheating". The purpose of the comment that you're replying to was to answer Vladimir's criticism. I show that, for this particular game (the one in your "explicit optimization" post), the winning mathematical intuitions are the only ones that meet certain reasonable criteria. The point is that these "reasonable criteria" do not involve any assumption about what the agent should do in the game.
Wei Dai (0 points, 14y)
Actually, I had been following your discussion with Nesov, but I'm not sure if your comment adequately answered his objection. So rather than commenting on that, I wanted to ask whether your approach of using "reasonable criteria" to narrow down mathematical intuitions can be generalized to deal with the harder problem of multi-player games. (If it can't, then perhaps the discussion is moot.)
Tyrrell_McAllister (0 points, 14y)
I see. I misunderstood the grandparent to be saying that your "explicit optimization" LW post had originally appeared on another mailing list, and I thought that you were directing me to it to see what I had to say about the game there. I was confused because this whole conversation already centered around that very game :).
Vladimir_Nesov (0 points, 14y)
(1) Which one of them will actually be given?

(2) If there is no sense in which some of these "reasonable" conclusions are better than each other, why do you single them out, rather than singling out mathematical intuitions that express uncertainty about the outcomes, which would capture the lack of priority of some of these outcomes over others?

I don't find the certainty of conclusions a reasonable assumption, in particular because, as you can see, you can't unambiguously decide which of the conclusions is the right one, and neither can the agent.
Tyrrell_McAllister (0 points, 14y)
I claim to be giving, at best, a subset of "reasonable criteria" for mathematical intuition functions. Any UDT1-builder who uses a superset of these criteria, and who has enough decision criteria to decide which UDT1 agent to write, will write an agent that wins Wei's game. In this case, it would suffice to have the criteria I mentioned plus a lexicographic tie-breaker (as in UDT1.1).

I'm not optimistic that that will hold in general. (I also wouldn't be surprised to see an example showing that my "counterfactual accuracy" condition, as stated, rules out all winning UDT1 algorithms in some other game. I find it pretty unlikely that it suffices to deal with mathematical counterfactuals in such a simple way, even given the binary certainty and accuracy conditions.) My point was only that the criteria above already suffice to narrow the field of options for the builder down to winning options. Hence, whatever superset of these criteria the builder uses, this superset doesn't need to include any knowledge about which possible UDT1 agent would win.

I don't follow. Are you suggesting that I could just as reasonably have made it a condition of any acceptable mathematical intuition function that M(1, A, E) = 0.5? If I (the builder/writer) really couldn't decide which mathematical intuition function to use, then the agent won't come to exist in the first place. If I can't choose among the two options that remain after I apply the described criteria, then I will be frozen in indecision, and no agent will get built or written. I take it that this is your point. But if I do have enough additional criteria to decide (which in this case could be just a lexicographic tie-breaker), then I don't see what is unreasonable about the "certainty of conclusions" assumption for this game.
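To illustrate the role of the tie-breaker, here is a minimal sketch (not from the original exchange; `passes_criteria` is a placeholder for whatever criteria the builder adopts, and the names are hypothetical). The point it encodes is that the arbitrary ordering used to break ties carries no knowledge of which candidate would win the game.

```python
# Sketch: builder-side selection once the "reasonable criteria" have
# narrowed the field of candidate mathematical intuition functions.

def choose(candidates, passes_criteria):
    survivors = [c for c in candidates if passes_criteria(c)]
    if not survivors:
        raise ValueError("no candidate satisfies the criteria")
    # Lexicographic tie-breaker (as in UDT1.1): pick the first survivor
    # under a fixed, arbitrary ordering that encodes nothing about the game.
    return min(survivors, key=repr)

print(choose(["M_F", "M_G"], lambda name: True))  # -> "M_F"
```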
Vladimir_Nesov (0 points, 14y)
You don't pick the output of mathematical intuition in a particular case; mathematical intuition is a general algorithm that works based on world programs, outcomes, and your proposed decisions. It's computationally intensive, and its results are not specified in advance based on intuition; on the contrary, the algorithm is what stands for intuition. With more resources, this algorithm will produce different probabilities, as it comes to understand the problem better. And you just pick the algorithm. What you can say about its outcome is a matter of understanding the requirements for such a general algorithm, and predicting what it must therefore compute. Absolute certainty of the algorithm, for example, would imply that the algorithm managed to logically infer that the outcome would be so and so, and I don't see how it's possible to do that, given the problem statement. If it's unclear how to infer what will happen, then mathematical intuition should be uncertain (but it can know something that tilts the balance one way a little bit, perhaps enough to decide the coordination problem!)
Tyrrell_McAllister (0 points, 14y)
Okay, I understand you to be saying this: There is a single ideal mathematical intuition, which, given a particular amount of resources and a particular game, determines a unique function M: {inputs} × {outputs} × {execution histories} → [0, 1] for a UDT1 agent in that game. This ideal mathematical intuition (IMI) is defined by the very nature of logical or mathematical inference under computational limitation. So, in particular, it's not something that you can talk about choosing using some arbitrary tie-breaker like lexicographic order.

Now, maybe the IMI requires that the function M be binary in some particular game with some particular amount of resources. Or maybe the IMI requires a non-binary function M for all amounts of computational resources in that game. Unless you can explain exactly why the IMI requires a binary function M for this particular game, you haven't really made progress on the kinds of questions that we're interested in.

Is that right?
Vladimir_Nesov (0 points, 14y)
More or less. Of course there is no point in insisting on a "single" mathematical intuition, but the criteria for choosing one shouldn't be specific to a particular game. Mathematical intuition primarily works with the world program, trying to estimate how plausible it is that this world program will be equivalent to a given history definition, under the condition that the agent produces a given output.
Tyrrell_McAllister (0 points, 14y)
Let me see if I understand your point. Are you saying the following? Some UDT1 agents perform correctly in the scenario, but some don't. To not be "cheating", you need to provide a formal decision theory (or at least make some substantial progress towards providing one) that explains why the agent's builder would choose to build one of the UDT1 agents that do perform correctly.
Vladimir_Nesov (0 points, 14y)
Not quite. UDT is not an engineering problem, it's a science problem. There is a mystery in what mathematical intuition is supposed to be, not just a question of tacking it on. The current understanding allows one to instantiate incorrect UDT agents, but that's a failure of understanding, not a problem with UDT agents. By studying the setting more, we'll learn more about what mathematical intuition is, which will show some of the old designs to be incorrect.
Tyrrell_McAllister (0 points, 14y)
You say "Not quite", but this is still looking like what I tried to capture with my paraphrase. I was asking if you were saying the following: A full solution that was a pure extension (not revision) of UDT1 [since I was trying to work within UDT1] would have to take the form of a formal DT such that a builder with that DT would have to choose to build a correct UDT1 agent.
Vladimir_Nesov (0 points, 14y)
Yeah, that works; though of course the revised decision theories will most certainly not be formal extensions of UDT1, they might give guidelines for designing good UDT1-compliant agents.
[anonymous] (0 points, 14y)
...and this leads to another bit of progress, on the structure of mathematical intuition. The key exercise is to try to explain explicit optimization of strategies, as described in your post, in terms of local ambient control that determines only the action. The strategies then become the assumptions of the proof search that tries to guess what the global outcome will be. The solution comes from the two agents being equivalent apart from the observation; thus the "input" is automatically extracted from the agents' code (if we assume the input to be just part of the code by the time the decision problem gets stated). I'll describe it in more detail later, if this pans out.
[anonymous] (0 points, 14y)
I think that Vladimir is right if he is saying that UDT1 can handle the problem in your Explicit Optimization of Global Strategy post. With your forbearance, I'll set up the problem in the notation of my write-up of UDT1.

There is only one world-program P in this problem. The world-program runs the UDT1 algorithm twice, feeding it input "1" on one run and input "2" on the other run. I'll call these respective runs "Run1" and "Run2". The set of inputs for the UDT1 algorithm is X = {1, 2}. The set of outputs for the UDT1 algorithm is Y = {A, B}.

There are four possible execution histories for P:

* E, in which Run1 outputs A, Run2 outputs A, and each gets $0.
* F, in which Run1 outputs A, Run2 outputs B, and each gets $10.
* G, in which Run1 outputs B, Run2 outputs A, and each gets $10.
* H, in which Run1 outputs B, Run2 outputs B, and each gets $0.

The utility function U for the UDT1 algorithm is defined as follows:

* U(E) = 0.
* U(F) = 20.
* U(G) = 20.
* U(H) = 0.

Now we want to choose a mathematical intuition function M so that Run1 and Run2 don't give the same output. This mathematical intuition function does have to satisfy a couple of constraints:

* For each input x in X and output y in Y, the function M(x, y, –) must be a normalized probability distribution on {E, F, G, H}.
* The mathematical intuition needs to meet certain minimal standards to deserve its name. For example, we need to have M(1, B, E) = 0. The algorithm should know that P isn't going to execute according to E if the algorithm returns B on input 1.

But these constraints still leave us with enough freedom in how we set up the mathematical intuition. In particular, we can set

* M(1, A, F) = 1, and all other values of M(1, A, –) equal to zero;
* M(1, B, H) = 1, and all other values of M(1, B, –) equal to zero;
* M(2, A, H) = 1, and all other values of M(2, A, –) equal to zero;
* M(2, B, F) = 1, and all other values of M(2, B, –) equal to zero.

Thus, in particular, Run1's expected utility is maximized by output A and Run2's by output B, so the two runs give different outputs, as desired.
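As a sanity check, here is a minimal sketch (illustrative names, assuming the M and U given above) of the brute-force UDT1 expected-utility computation for this game. It confirms that an agent with this mathematical intuition outputs A on input 1 and B on input 2, landing in history F and winning the game.

```python
# Brute-force UDT1 expected-utility calculation for the coordination game
# above.  M[(x, y)] is the probability distribution over execution histories
# given that the agent outputs y on input x; U assigns utility to histories.

U = {"E": 0, "F": 20, "G": 20, "H": 0}

M = {
    (1, "A"): {"F": 1.0}, (1, "B"): {"H": 1.0},   # omitted histories have probability 0
    (2, "A"): {"H": 1.0}, (2, "B"): {"F": 1.0},
}

def expected_utility(x, y):
    return sum(p * U[e] for e, p in M[(x, y)].items())

def decide(x):
    # UDT1: on input x, return the output with the highest expected utility.
    return max(["A", "B"], key=lambda y: expected_utility(x, y))

print(decide(1), decide(2))  # -> A B : the two runs coordinate on different outputs
```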
Vladimir_Nesov (0 points, 14y)
The symmetry is broken by "1" being different from "2". The probabilities express logical uncertainty, and so essentially depend on what happens to be provable given finite resources and the epistemic state of the agent, for which implementation details matter. The asymmetry is thus hidden in mathematical intuition, and is not visible in the parts of UDT explicitly described.
Vladimir_Nesov (0 points, 14y)
...but on the other hand, you don't need the "input" at all, if decision-making is about figuring out the strategy. You can just have a strategy that produces the output, with no explicit input. The history of input can remain implicit in the agent's program, which is available anyway.
Tyrrell_McAllister (0 points, 14y)
Good; that was my understanding.
Vladimir_Nesov (0 points, 14y)
Yes, that works too. On second thought, extracting the output in this exact manner, while pushing everything else into the "input", makes it possible to pose a problem specifically about the output in this particular situation, so as to optimize the activity of figuring out this output rather than the whole strategy, of which right now you need only this aspect and no more. Edit: Though you don't need the "input" to hold the rest of the strategy.
Tyrrell_McAllister (0 points, 14y)
I was having trouble understanding what strategy couldn't be captured by a function X → Y. After all, what could possibly determine the output of an algorithm other than its source code and whatever input it remembers getting on that particular run? Just to be clear, do you now agree that every strategy is captured by some function f: X → Y mapping inputs to outputs?

One potential problem is that there are infinitely many input-output mappings. The agent can't assume a bound on the memory it will have, so it can't assume a bound on the lengths of the inputs that it will someday need to plug into an input-output mapping f. Unlike the case where there are potentially infinitely many programs P1, P2, ..., it's not clear to me that it's enough to wrap up an infinite set I of input-output mappings into some finite program that generates them. This is because the UDT1.1 agent needs to compute a sum for every element of I. So, if the set I is infinite, the number of sums to be computed will be infinite. Having a finite description of I won't help here, at least not with a brute-force UDT1.1 algorithm.
Vladimir_Nesov (2 points, 14y)
Any infinite thing in any given problem statement is already presented to you with a finite description. All you have to do is transform that finite description of an infinite object so as to get a finite description of a solution of your problem posed about the infinite object.
Tyrrell_McAllister (0 points, 14y)
Right. I agree. But, to make Wei's formal description of UDT1.1 work, there is a difference between

* dealing with a finite description of an infinite execution history Ei, and
* dealing with a finite description of an infinite set I of input-output maps.

The difference is this: The execution histories only get fed into the utility function U and the mathematical intuition function (which I denote by M). These two functions are taken to be black boxes in Wei's description of UDT1.1. His purpose is not to explain how these functions work, so he isn't responsible for explaining how they deal with finite descriptions of infinite things. Therefore, the potential infinitude of the execution histories is not a problem for what he was trying to do.

In contrast, the part of the algorithm that he describes explicitly does require computing an expected utility for every input-output map and then selecting the input-output map that yielded the largest expected utility. Thus, if I is infinite, the brute-force version of UDT1.1 requires the agent to find a maximum from among infinitely many expected utilities. That means that the brute-force version just doesn't work in this case. Merely saying that you have a finite description of I is not enough to say in general how you are finding the maximum from among infinitely many expected utilities. In fact, it seems possible that there may be no maximum.

Actually, in both UDT1 and UDT1.1, there is a similar issue with the possibility of having infinitely many possible execution-history sequences. In both versions of UDT, you have to perform a sum over all such sequences. Even if you have a finite description of the set E of such sequences, a complete description of UDT still needs to explain how you are performing the sum over the infinitely many elements of the set. In particular, it's not obvious that this sum is always well-defined.
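For concreteness, the brute-force UDT1.1 step under discussion can be written roughly as follows (the symbols follow my write-up only loosely and should be read as an illustrative sketch, not a quotation of Wei's formalism):

```latex
% Sketch of the brute-force UDT1.1 step (notation illustrative).
% The agent ranges over input-output maps f : X \to Y and sums over
% execution-history sequences <E_1, E_2, \ldots> of the world programs.
f^{*} \;=\; \arg\max_{f : X \to Y}\;
    \sum_{\langle E_1, E_2, \ldots \rangle}
    M\bigl(f, \langle E_1, E_2, \ldots \rangle\bigr)\,
    U\bigl(\langle E_1, E_2, \ldots \rangle\bigr)
```

Both the arg max (over a possibly infinite set of maps) and the inner sum (over possibly infinitely many history sequences) need further specification before the brute-force description is complete, which is the gap pointed at above.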
Vladimir_Nesov (2 points, 14y)
...but the action could be a natural number, no? It's entirely OK if there is no maximum: the available computational resources then limit how good a strategy the agent manages to implement ("Define as big a natural number as you can!"). The "algorithm" is descriptive: it's really a definition of optimality of a decision, not a specification of how this decision is to be computed. You can sometimes optimize infinities away, and can almost always find a finite approximation that gets better with more resources and ingenuity.
Tyrrell_McAllister (0 points, 14y)
Okay. I didn't know that the specification of how to compute was explicitly understood to be incomplete in this way. Of course, the description could only be improved by being more specific about just when you can "sometimes optimize infinities away, and can almost always find a finite approximation that gets better with more resources and ingenuity."
[anonymous] (0 points, 14y)
Yes, that works too. If you can make "inputs and outputs" into anything, sure, you can represent any strategy with them. I'm not sure it makes sense to separate them in that case, though: just make the whole strategy the sole product of decision-making; the interpretation of this product is a separate question, and doesn't necessarily need to be addressed at all.
Tyrrell_McAllister (0 points, 14y)
I gave an accurate definition of Wei Dai's utility function U. As you note, I did not say what U is for, because I was not giving a complete recapitulation of UDT. In particular, I did not imply that U() is the utility of the agent's decision. (I understand that U() is the utility that the agent assigns to having program Pi undergo execution history Ei for all i. I understand that, here, Ei is a complete history of what the program Pi does. However, note that this does include the agent's chosen action if Pi calls the agent as a subroutine. But none of this was relevant to the point that I was making, which was to point out that my post only applies to UDT agents that use a particular kind of function U.)

It's looking to me like I'm following one of Wei Dai's uses of the word "probability", and you're following another. You think that Wei Dai should abandon the use of his that I'm following. I am not seeing that this dispute is more than semantics at this point. That wasn't the case earlier, by the way, where I really did misunderstand where the probabilities of possible worlds show up in Wei Dai's formalism. I now maintain that these probabilities are the values I denoted by pr(Pi) when U has the form I describe in the footnote. Wei Dai is welcome to correct me if I'm wrong.
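For concreteness, the kind of factorized utility function under discussion would look something like the following (this is my reading of the form being referred to, in illustrative notation, not a quotation of the footnote):

```latex
% A weighted-sum form of U (illustrative reconstruction).  Each world program
% P_i gets a fixed weight pr(P_i), and a per-world utility U_i is applied to
% that program's execution history E_i.
U(E_1, E_2, \ldots) \;=\; \sum_i \mathrm{pr}(P_i)\, U_i(E_i)
```

When U has this form, the fixed weights pr(Pi) play the role of the builder's probabilities of possible worlds, which is the identification being made above.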
Vladimir_Nesov (2 points, 14y)
I agree with this description now. I apologize for this instance and a couple of others; I stayed up too late last night, and a negative impression of your post from the other mistakes primed me to see mistakes where everything is correct. It was a little confusing, because the probabilities here have nothing to do with the probabilities supplied by mathematical intuition, while the probabilities of mathematical intuition are still in play. In UDT, different world-programs correspond to observational and indexical uncertainty, while different execution strategies correspond to logical uncertainty about a specific world program. Only where there is essentially no indexical uncertainty does it make sense to introduce probabilities of possible worlds, which, together with those describing logical uncertainty, factorize the probabilities otherwise supplied by mathematical intuition.
Tyrrell_McAllister (0 points, 14y)
Thanks for the apology. I accept responsibility for priming you with my other mistakes. I hadn't thought about the connection to indexical uncertainty. That is food for thought.
Vladimir_Nesov (0 points, 14y)
Very very wrong. The world program P (or what it does, anyway) is the only thing that's actually controlled in this control problem statement (more generally, a list of programs, which could equivalently be represented by one program parametrized by an integer). Edit: I misinterpreted the way Tyrrell used "P", correction here.
Tyrrell_McAllister (0 points, 14y)
Here is the relevant portion of Wei Dai's post: If I am reading him correctly, he uses the letter "P" in two different ways. In one use, he writes Pi, where i is an integer, to denote a program. In the other use, he writes P_Y, where Y is an output vector, to denote a probability distribution. I was referring to the second use.
Vladimir_Nesov (0 points, 14y)
Okay, the characterization of P_Y seems right. For my reaction I blame the prior. Returning to the original argument, P_Y is not a description of the probabilities of possible worlds conceived by the agent's builder; it's something produced by the "mathematical intuition module" for a given output Y (or strategy Y, if you incorporate the later patch to UDT).
Tyrrell_McAllister (0 points, 14y)
You are right here. Like you, I misremembered Wei Dai's notation. See my last (I hope) edit to that comment. I would appreciate it if you edited your comment where you say that I was "very very wrong" to say that P isn't controlled by the agent's decisions.
Vladimir_Nesov (5 points, 14y)
It's easier to have a linear discussion, rather than trying to patch everything by re-editing it from the start (just saying, you are doing this for the third time to that poor top-level comment). You got something wrong, then I got something wrong; the errors were corrected as the discussion developed; moving on. The history doesn't need to be corrected. (I insert corrections to comments this way, without breaking the sequence.)
Tyrrell_McAllister (0 points, 14y)
Thank you for the edit.
Vladimir_Nesov (0 points, 14y)
The second question (edited in later) is more pressing: you can't postulate fixed probabilities of possible worlds; how the agent controls these probabilities is essential.
Tyrrell_McAllister (0 points, 14y)
See my edit to my reply.