One way (the usual way?) to think of an agent running Updateless Decision Theory is to imagine that the agent always cares about all possible worlds according to how probable those worlds seemed to the agent's builders when they wrote the agent's source code[1]. In particular, the agent never develops any additional concern for whatever turns out to be the actual world[2]. This is what puts the "U" in "UDT".
I suggest an alternative conception of a UDT agent, without changing the UDT formalism. According to this view, the agent cares about only the actual world. In fact, at any time, the agent cares about only one small facet of the actual world — namely, whether the agent's act at that time maximizes a certain fixed act-evaluating function. In effect, a UDT agent is the ultimate deontologist: It doesn't care at all about the actual consequences that result from its action. One implication of this conception is that a UDT agent cannot be truly counterfactually mugged.
[ETA: For completeness, I give a description of UDT here (pdf).]
Vladimir Nesov's Counterfactual Mugging presents us with the following scenario:
> Imagine that one day, Omega comes to you and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don't want to give up your $100. But see, the Omega tells you that if the coin came up heads instead of tails, it'd give you $10000, but only if you'd agree to give it $100 if the coin came up tails.
>
> Omega can predict your decision in case it asked you to give it $100, even if that hasn't actually happened, it can compute the counterfactual truth. The Omega is also known to be absolutely honest and trustworthy, no word-twisting, so the facts are really as it says, it really tossed a coin and really would've given you $10000.
An agent following UDT will give Omega the $100. Imagine that we are building an agent and that we will receive whatever utility follows from the agent's actions. Then it's easy to see why we should build our agent to give Omega the money in this scenario. After all, at the time we build our agent, we know that Omega might one day flip a fair coin with the intentions Nesov describes. Whatever probability this has of happening, our expected earnings are greater if we program our agent to give Omega the $100 on tails.
More generally, if we suppose that we get whatever utility will follow from our agent's actions, then we can do no better than to program the agent to follow UDT. But since we have to program the UDT agent now, the act-evaluating function that determines how the agent will act needs to be fixed with the probabilities that we know now. This will suffice to maximize our expected utility given our best knowledge at the time when we build the agent.
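To make the builder's calculation concrete, here is a minimal sketch in Python. The payoffs are the ones from Nesov's scenario, but the prior probability q that Omega ever stages the mugging is a made-up parameter and the function name is mine; the point is only that the paying policy comes out ahead for any q > 0.

```python
# Toy sketch of the builder's ex-ante comparison (illustrative only).
# q is a hypothetical prior probability that Omega ever stages the scenario.

def expected_earnings(pay_on_tails: bool, q: float = 0.01) -> float:
    """Builder's expected earnings, evaluated before the agent is deployed."""
    if pay_on_tails:
        # Heads: Omega pays $10000; tails: the agent hands over $100.
        value_if_staged = 0.5 * 10_000 + 0.5 * (-100)
    else:
        # Omega predicts the refusal, so no money changes hands either way.
        value_if_staged = 0.0
    return q * value_if_staged

print(expected_earnings(True))   # 49.5 -- ahead for any q > 0
print(expected_earnings(False))  # 0.0
```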
So, it makes sense for a builder to program an agent to follow UDT on expected-utility grounds. We can understand the builder's motivations. We can get inside the builder's head, so to speak.
But what about the agent's head? The brilliance of Nesov's scenario is that it is so hard, on first hearing it, to imagine why a reasonable agent would give Omega the money, knowing that the only result will be that it has given up $100. It's easy enough to follow the UDT formalism. But what on earth could the UDT agent itself be thinking? Yes, trying to figure this out is an exercise in anthropomorphization. Nonetheless, I think that it is worthwhile if we are going to use UDT to try to understand what we ought to do.
Here are three ways to conceive of the agent's thinking when it gives Omega the $100. They form a sort of spectrum.
1. One extreme view: The agent considers all the possible worlds to be on equal ontological footing. There is no sense in which any one of them is distinguished as "actual" by the agent. It conceives of itself as acting simultaneously in all the possible worlds so as to maximize utility over all of them. Sometimes this entails acting in one world so as to make things worse in that world. But, no matter which world this is, there is nothing special about it. The only property of the world that has any ontological significance is the probability weight given to that world at the time that the agent was built. (I believe that this is roughly the view that Wei Dai himself takes, but I may be wrong.)
2. An intermediate view: The agent thinks that there is only one actual world. That is, there is an ontological fact of the matter about which world is actual. However, the other possible worlds continue to exist in some sense, although they are merely possible, not actual. Nonetheless, the agent continues to care about all of the possible worlds, and this amount of care never changes. After being counterfactually mugged, the agent is happy to know that, in some merely-possible world, Omega gave the agent $10000.
3. The other extreme: As in (2), the agent thinks that there is only one actual world. Contrary to (2), the agent cares about only this world. However, the agent is a deontologist. When deciding how to act, all that it cares about is whether its act in this world is "right", where "right" means "maximizes the fixed act-evaluating function that was built into me."
View (3) is the one that I wanted to develop in this post. On this view, the "probability distribution" in the act-evaluating function no longer has any epistemic meaning for the agent. The act-evaluating function is just a particular computation which, for the agent, constitutes the essence of rightness. Yes, the computation involves considering some counterfactuals, but to consider those counterfactuals does not entail any ontological commitment.
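To illustrate view (3), here is a rough sketch in Python, with hypothetical names and payoffs of my own (this is not Wei Dai's formalism). The weights are frozen in when the agent is written; after learning that the coin came up tails, the agent does not re-weight the worlds, it only asks which act the fixed computation scores highest:

```python
# Sketch of a frozen act-evaluating function (hypothetical names and payoffs).
# The weights are fixed when the agent is written and never updated afterward.

BUILD_TIME_WEIGHTS = {"heads": 0.5, "tails": 0.5}

def payoff(world: str, act: str) -> float:
    """What the agent's choice yields in each world, given Omega's prediction of it."""
    if world == "heads":
        return 10_000 if act == "pay" else 0   # Omega rewards only predicted payers
    else:  # tails
        return -100 if act == "pay" else 0

def rightness(act: str) -> float:
    """The fixed computation that, for this agent, constitutes the essence of rightness."""
    return sum(w * payoff(world, act) for world, w in BUILD_TIME_WEIGHTS.items())

# Even after the agent learns that the coin in fact came up tails, it consults
# only the frozen score, not its updated beliefs about which world is actual:
print(max(["pay", "refuse"], key=rightness))   # 'pay'
print(rightness("pay"), rightness("refuse"))   # 4950.0 0.0
```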
Thus, when the agent has been counterfactually mugged, it's not (as in (1)) happy because it cares about expected utility over all possible worlds. It's not (as in (2)) happy because, in some merely-possible world, Omega gave it $10000. On this view, the agent considers all those "possible worlds" to have been rendered impossible by what it has learned since it was built. The reason the agent is happy is that it did the right thing. Merely doing the right thing has given the agent all the utility it could hope for. More to the point, the agent got that utility in the actual world. The agent knows that it did the right thing, so it genuinely does not care about what actual consequences will follow from its action.
In other words, although the agent lost $100, it really gained from the interaction with Omega. This suggests that we try to consider a "true" analog of the Counterfactual Mugging. In The True Prisoner's Dilemma, Eliezer Yudkowsky presents a version of the Prisoner's Dilemma in which it's viscerally clear that the payoffs at stake capture everything that we care about, not just our selfish values. The point is to make the problem about utilons, and not about some stand-in, such as years in prison or dollars.
In a True Counterfactual Mugging, Omega would ask the agent to give up utility. Here we see that the UDT agent cannot possibly do as Omega asks. Whatever it chooses to do will turn out to have in fact maximized its utility. Not just expected utility, but actual utility. In the original Counterfactual Mugging, the agent looks like something of a chump who gave up $100 for nothing. But in the True Counterfactual Mugging, our deontological agent lives with the satisfaction that, no matter what it does, it lives in the best of all possible worlds.
[1] ETA: Under UDT, the agent assigns a utility to having all of the possible worlds P1, P2, . . . undergo respective execution histories E1, E2, . . .. (The way that a world evolves may depend in part on the agent's action). That is, for each vector <E1, E2, . . .> of ways that these worlds could respectively evolve, the agent assigns a utility U(<E1, E2, . . .>). Due to criticisms by Vladimir Nesov (beginning here), I have realized that this post only applies to instances of UDT in which the utility function U takes the form that it has in standard decision theories. In this case, each world Pi has its own probability pr(Pi) and its own utility function ui that takes an execution history of Pi alone as input, and the function U takes the form
U(<E1, E2, . . .>) = Σi pr(Pi) ui(Ei).
The probabilities pr(Pi) are what I'm talking about when I mention probabilities in this post. Wei Dai is interested in instances of UDT with more general utility functions U. However, to my knowledge, this special kind of utility function is the only one in terms of which he's talked about the meanings of probabilities of possible worlds in UDT. See in particular this quote from the original UDT post:
> If your preferences for what happens in one such program is independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program.
(A "program" is what Wei Dai calls a possible world in that post.) The utility function U is "baked in" to the UDT agent at the time it's created. Therefore, so too are the probabilities pr(Pi).
[2] By "the actual world", I do not mean one of the worlds in the many-worlds interpretation (MWI) of quantum mechanics. I mean something more like the entire path traversed by the quantum state vector of the universe through its corresponding Hilbert space. Distinct possible worlds are distinct paths that the state of the universe might (for all we know) be traversing in this Hilbert space. All the "many worlds" of the MWI together constitute a single world in the sense used here.
ETA: This post was originally titled "UDT agents are deontologists". I changed the title to "UDT agents as deontologists" to emphasize that I am describing a way to view UDT agents. That is, I am describing an interpretive framework for understanding the agent's thinking. My proposal is analogous to Dennett's "intentional stance". To take the intentional stance is not to make a claim about what a conscious organism is doing. Rather, it is to make use of a framework for organizing our understanding of the organism's behavior. Similarly, I am not suggesting that UDT somehow gets things wrong. I am saying that it might be more natural for us if we think of the UDT agent as a deontologist, instead of as an agent that never changes its belief about which possible worlds will actually happen. I say a little bit more about this in this comment.
To the agent's builders.
ETA: I make that clear later in the post, but I'll add it to the intro paragraph.
I'm not sure what you mean. What I'm describing as coded into the agent "from birth" is Wei Dai's function P, which takes an output string Y as its argument (using subscript notation in his post).
ETA: Sorry, that is not right. To be more careful, I mean the "mathematical intuition" that takes in an input X and returns such a function P. But P isn't controlled by the agent's decisions.
ETA2: Gah. I misremembered how Wei Dai used his notation. And when I went back to the post to answer your question, I skimmed too quickly and misread.
So, final answer, when I say that "the agent always cares about all possible worlds according to how probable those worlds seemed to the agent's builders when they wrote the agent's source code", I'm talking about the "preference vector" that Wei Dai denotes by "<E1, E2, . . .>" and which he says "defines its preferences on how those programs should run."
I took him to be thinking of these entries Ei as corresponding to probabilities because of his post What Are Probabilities, Anyway?, where he suggests that "probabilities represent how much I care about each world".
ETA3: Nope, this was another misreading on my part. Wei Dai does not say that <E1, E2, . . .> is a vector of preferences, or anything like that. He says that it is an input to a utility function U, and that utility function is what "defines [the agent's] preferences on how those programs should run". So, what I gather very tentatively at this point is that the probability of each possible world is baked into the utility function U.
Do you see that these E's are not intended to be interpreted as probabilities here, and so the "probabilities of possible worlds are fixed at the start" remark at the beginning of your post is wrong?