A Defense of Open-Minded Updatelessness.
This work owes a great debt to many conversations with Sahil, Martín Soto, and Scott Garrabrant.
You can support my work on Patreon.
Iterated Counterfactual Mugging On a Single Coinflip
Iterated counterfactual mugging on a single coinflip begins like a classic counterfactual mugging, with Omega approaching you, explaining the situation, and asking for your money. Let's say you buy the classic UDT idea, so you happily give Omega your money.
Next week, Omega appears again, with the same question. However, Omega clarifies that it has used the same coin-flip as last week.
This throws you off a little bit, but you see that the math is the same either way; your prior still assigns a 50-50 chance to both outcomes. If you thought it was a good deal last week, you should also think it is a good deal this week. You pay up again.
On the third week, Omega makes the same offer again, and once again has used the same coinflip. You ask Omega how many times it's going to do this. Omega replies, "forever". You ask Omega whether it would have continued coming if the coin had landed heads; it says "Of course! How else could I make you this offer now? Since the coin landed tails, I will come and ask you for $100 every single week going forward. If the coin had landed heads, I would have simulated what would happen if it had landed tails, and I would come and give you $10,000 on every week that simulated-you gives up $100!"
Let's say for the sake of the thought experiment that you can afford to give Omega $100 once a week. It hurts, but it doesn't hurt as much as getting $10,000 from Omega every week would have benefited you, if that had happened.[1]
Nonetheless, I suspect many readers will feel some doubt creep in as they imagine giving Omega $100 week after week after week. The first few weeks, the possibility of the coin landing heads might feel "very real". Heck yeah I want to be the sort of person who gets a 50% chance of 10K from Omega for a (50% chance) cost of $100!
By the hundredth week, though, you may feel yourself the fool for giving up so much money for the imaginary benefit of the "heads" world that never was.
If you think you'd still happily give up the $100 for as long as Omega kept asking, then I would ask you to consider a counterlogical mugging instead. Rather than flipping a coin, Omega uses a digit of the binary expansion of π; as before, Omega uses the same digit week after week, for infinitely many counterlogical muggings.
Feeling uneasy yet? Does the possibility of the digit of π going one way or the other continue to feel "just as real" as time passes? Or do you become more sympathetic to the idea that, at some point, you're wasting money on helping a non-real world?
UDT vs Learning
Updateless Decision Theory (UDT) clearly keeps giving Omega the $100 forever in this situation, at least, under the usual assumptions. A single Counterfactual Mugging is not any different from an infinitely iterated one, especially in the version above where only a single coinflip is used. The ordinary decision between "give up $100" and "refuse" is isomorphic to the choice of general policy "give up $100 forever" and "refuse forever".[2]
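To make the comparison concrete, here is a minimal sketch of how the two stationary policies score under the prior versus after updating on "tails". It assumes the linear utility and temporal discounting from footnote 1; the discount factor of 0.99 and the function names are illustrative assumptions, not part of the original problem statement.

```python
def prior_value(pay: bool, gamma: float = 0.99, p_heads: float = 0.5,
                reward: float = 10_000.0, cost: float = 100.0) -> float:
    """Discounted expected utility of a stationary policy, judged by the prior."""
    if not pay:
        return 0.0
    # Each week: gain `reward` in the heads branch, lose `cost` in the tails branch.
    per_week = p_heads * reward - (1.0 - p_heads) * cost
    # Sum of the geometric series per_week * (1 + gamma + gamma^2 + ...).
    return per_week / (1.0 - gamma)


def updated_value(pay: bool, gamma: float = 0.99, cost: float = 100.0) -> float:
    """The same policy, judged after updating on the coin having landed tails."""
    if not pay:
        return 0.0
    return -cost / (1.0 - gamma)


if __name__ == "__main__":
    for pay in (True, False):
        label = "give up $100 forever" if pay else "refuse forever"
        print(label, prior_value(pay), updated_value(pay))
```

Under these assumptions the prior ranks "give up $100 forever" above "refuse forever" for exactly the same reason it ranks a single payment above a single refusal, while the updated beliefs rank them the other way around.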
However, the idea of applying a decision theory to a specific decision problem is actually quite subtle, especially for UDT. We generally assume an agent's prior equals the probabilities described in the decision problem.[3] A simple interpretation of this could be that the agent is born with this prior (and immediately placed into the decision problem). This isn't very realistic, though. How did the agent get the correct prior?[4]
A more realistic idea is that:
- (a) the agent was born some time ago and has learned a fair amount about the world;
- (b) the agent has accumulated sufficient evidence to think, with high confidence, that it is now facing the decision problem being described; and importantly for UDT,
- (c) there are no further considerations in UDT's prior which would sway UDT in the situation being described.
However, (c) is not very plausible with UDT!
Lizard World
To elaborate on (c): We say that "UDT accepts counterfactual muggings". But, imagine that your prior had a 50% initial probability that you would end up in Lizard World rather than Earth. Lizards have a strange value system, which values updateful behavior intrinsically. They'll reward you greatly if your chosen policy agrees with updateful decision theory (except with respect to this specific incentive, which is of course a counterfactual mugging).
Given such a prior, UDT will refuse counterfactual muggings on Earth (if the incentive for accepting the mugging is less than the rewards offered by the Lizards). The assumption (c) is saying that there are no interfering considerations like this.
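The threshold can be illustrated with a small sketch. It assumes a single counterfactual mugging on Earth with the $100 / $10,000 stakes used earlier, a 50% prior on Lizard World, and a hypothetical `lizard_reward` parameter for what the Lizards pay an agent whose policy refuses the mugging; all names here are illustrative.

```python
def udt_choice(lizard_reward: float, p_lizard: float = 0.5, p_heads: float = 0.5,
               reward: float = 10_000.0, cost: float = 100.0) -> str:
    """Pick the policy with the higher expected utility under the broad prior."""
    p_earth = 1.0 - p_lizard
    # Paying on Earth: win `reward` on heads, lose `cost` on tails; Lizards pay nothing.
    eu_pay = p_earth * (p_heads * reward - (1.0 - p_heads) * cost)
    # Refusing on Earth: nothing happens on Earth; Lizards reward the updateful policy.
    eu_refuse = p_lizard * lizard_reward
    return "pay" if eu_pay >= eu_refuse else "refuse"


if __name__ == "__main__":
    for r in (0.0, 4_000.0, 6_000.0):
        print(r, udt_choice(r))
```

With these numbers the switch happens once the Lizards' reward exceeds the $4,950 expected gain from accepting the mugging on Earth.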
Notice that this is quite a large assumption! Usually we think of agents as starting out with quite a broad prior, such as the Solomonoff distribution, which gives lots of crazy worlds non-negligible probability.
Agents need to start with broad priors in order to learn. But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Lizards.
Learning Desiderata
Let's sharpen that idea into a maybe-desirable property:
Learning-UDT: Suppose an agent starts with prior $P_0$. If the prior were updated on the actual information the agent receives over time, it would become $P_1$, then $P_2$, and so on. An agent obeys "learning UDT" if it eventually behaves as if it were applying UDT with prior $P_n$, for any $n$.[5] That is: for each $n$, there is a time $t_n$ after which all decisions maximize expected utility according to $P_n$.
Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each $P_n$. Therefore, in particular, it eventually behaves like UDT with prior $P_0$. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior $P_0$ is required to eventually agree with the recommendations of $P_1, P_2, \ldots$ (which also implies that these eventually agree with each other).
Here's a different learning principle which doesn't imply UDT so strongly:
Eventual Learning: At each timestep $t$, the agent obeys UDT with prior $P_{f(t)}$, where $P_t$ would be the beliefs fully updated on the observations at time $t$. $f(t)$ monotonically increases without bound.
The idea here is to allow agents to have fairly rich priors (so that they can learn what world they are in!), and also allow them to be somewhat updateless (so that they can correctly solve problems like counterfactual mugging), but require them to eventually "face facts" rather than make mistakes forever based on weird stuff in the prior. Eventual Learning + UDT implies Learning-UDT.
So the overall argument here is as follows:
- UDT was invented to solve some decision problems.
- However, realistic agents have to somehow learn what decision problem they face.
- UDT doesn't learn like this by default. So the idea that it "solves decision problems" like counterfactual mugging is somewhat imaginary.
- Eventual Learning solves this problem by eventually updating on any particular fact. Thus, we eventually understand any decision problem that we face, if we have faced similar problems enough times in the past.
Notice that Eventual Learning implies that we eventually stop giving Omega $100 in Iterated Counterfactual Mugging on a Single Coinflip, since we eventually behave as if we've updated on the coin.
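A minimal simulation of this claim, under simplifying assumptions: the agent's week-t decision is made with beliefs that have absorbed only the observations up to some lagging index f(t), the coin's outcome is revealed in week 1, and the agent pays exactly when paying has positive expected value under those lagged beliefs. The schedule `slow_schedule` (f(t) = t // 10) and all function names are hypothetical choices for illustration.

```python
from typing import Callable, List


def pays_at(t: int, f: Callable[[int], int], revealed_at: int = 1,
            reward: float = 10_000.0, cost: float = 100.0) -> bool:
    """Week-t decision of an agent acting on beliefs updated only through f(t)."""
    knows_tails = f(t) >= revealed_at
    # Before the lagged beliefs include the reveal, the coin still looks 50-50.
    p_heads = 0.0 if knows_tails else 0.5
    return p_heads * reward - (1.0 - p_heads) * cost > 0.0


def slow_schedule(t: int) -> int:
    """Hypothetical f: monotone and unbounded, but lagging far behind t."""
    return t // 10


def fully_updateful(t: int) -> int:
    """f(t) = t: the agent always acts on its current beliefs."""
    return t


def classic_udt(t: int) -> int:
    """f(t) = 0 forever: bounded, so it does not satisfy Eventual Learning."""
    return 0


def paid_weeks(f: Callable[[int], int], horizon: int = 30) -> List[int]:
    return [t for t in range(1, horizon + 1) if pays_at(t, f)]


if __name__ == "__main__":
    print("slow schedule pays in weeks:", paid_weeks(slow_schedule))
    print("fully updateful pays in weeks:", paid_weeks(fully_updateful))
    print("classic UDT pays in weeks:", paid_weeks(classic_udt))
```

The slow schedule pays for a while and then stops, the fully updateful agent never pays, and the fixed-prior agent never stops; how long the middle case keeps paying depends entirely on the assumed f.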
I should mention that I'm not necessarily endorsing Eventual Learning as a true normative constraint. It's more a way of pointing out that UDT can easily refuse to learn, in a way that seems bad. The main point here is to illustrate that UDT is sometimes incompatible with learning, but learning is a hidden necessary assumption behind applications of UDT to solve decision problems.
Open problem:[6] Under what conditions is classic UDT compatible with Eventual Learning? Is it possible to specify a rich prior while systematically avoiding traps like the Lizard World I mention, such that following the standard UDT decision procedure will also satisfy the Learning-UDT criterion? If so, can it learn to behave optimally in some nice rich class of situations which includes counterfactual muggings?
Bargaining
Another way to think about this, which both Scott Garrabrant and Diffractor suggested, is that we're somehow bargaining between different branches of possibility, instead of simply maximizing expected utility according to a fixed prior.
Imagine the prior is split up into hypotheses, and the hypotheses bargain to the Pareto frontier by some method such as Nash bargaining. The resulting policy can be justified in terms of a fixed prior over hypotheses, but importantly, the weights on different hypotheses will depend on the bargaining. This means a "fair coin" might not be given 50-50 weight.
It's hard to apply this directly to Counterfactual Mugging, because the mugged branch simply loses out (if it pays up). The branch that benefits from the mugging has nothing to offer the branch that loses out, so there's no motivation for bargaining. However, let's vaguely imagine that there's an agreement to get counterfactually mugged sometimes as part of a larger bargain between hypotheses.
The UDT reasoning I offered for Iterated Counterfactual Mugging On a Single Coinflip at the beginning assumed that you're insensitive to the amount of money; if it makes sense to accept Omega's bargain once, it makes sense to accept it twice, and three times, etc. The utilities add linearly.
However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.
In fact, I can give a toy analysis based on bargaining which justifies giving Omega the money only sometimes.[9]
Obviously this is hand-wavy, and needs to be worked out properly. While I'm gesticulating wildly, I'll also throw in that geometric rationality seems relevant. Is the distinction between types of uncertainty we should be updateless about, vs types we should be updateful about, related to cases where we should use linear vs geometric expectation?
Can we justify a policy of eventually ignoring branches that claim too many cross-branch correlations, on the grounds that those branches are "greedy" and a "fair" prior shouldn't treat them like utility monsters?
Open Problems:
- Spell out a sensible version of bargaining-based UDT.
- Can we justify a policy of mostly ignoring Lizard World based on intuitions from bargaining, i.e., that Lizard Worlds are "too greedy" and a "fair" prior shouldn't cede very much to them?
- Is there a connection to Geometric Rationality which clarifies what we should be updateful vs updateless about?
Possible Critiques
At this point, I imagine a man in an all-black suit has snuck up behind me while I'm writing. I'm initially unnerved, but I relax when I spot the Bayesian Conspiracy ring on his finger.
Interlocutor: ...Unorthodox.
Me: Perhaps.
Interlocutor: What you've observed is that UDT doesn't react to a decision problem as you'd expect unless there are no relevant cross-branch entanglements of sufficient importance to change that. A different way to put this would be: UDT only deviates from optimal play on a subtree when it has a good reason. Your learning principle requires the agent to eventually ignore this good reason.
Me: Not necessarily. There are two ways to get learning: we can change the decision theory so that cross-branch entanglements are eventually ignored if they're "too old" -- the simplest way to achieve this is to eventually update on each individual piece of information. Or, we can stick to UDT and bound the cross-branch entanglements allowed by the prior (this is the idea behind the learning-UDT condition).
Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.
Me: I'll concede for now[7] that reality does not forbid Iterated Counterfactual Mugging On a Single Coinflip or Lizard Worlds.
Interlocutor: And my argument against updating?
Me: You argue that any deviation from updatelessness creates an incentive to self-modify to be more updateless. However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT avoids self-modifying; it's only a conjecture.
Interlocutor: You claim the conjecture is false?
Me: A broad prior is dumb enough that it'll probably self-modify out of ignorance. If I can see that UDT has a learning problem, so can rational agents. If we start life with a broad prior, then there'll be lots of nonsense possibilities which we shouldn't be updateless about. It seems clear to me that we need some compromise between updatelessness and updatefulness, in practice.
Interlocutor: I reiterate: UDT only deviates from optimal play on a subtree when it has a good reason. Your concept of learning asks us to ignore those good reasons.
Me: I would agree so far as this: if you trust the prior, then the cases where updatelessness is optimal form a strict superset of the cases where updatefulness is optimal. When we set up a decision problem, we artificially assume that the prior is good -- that it describes reality somehow. "Prior reason" is not the same as "good reason".
Interlocutor: The prior is subjective. An agent has no choice but to trust its own prior. From its own perspective, its prior is the most accurate description of reality it can articulate.
Me: See, sometimes I feel like Bayesian subjectivism hides a subtle assumption of objectivity. Like when someone goes from endorsing Solomonoff Induction to endorsing the existence of a multiverse of all computations. Bayesian subjectivism gets substituted for multiverse frequentism.
Interlocutor: I'm doing no such thing. A Pragmatist's Guide to Epistemic Utility by Ben Levinstein shows that when an agent analyzes the value of possible beliefs in terms of their expected utility for decisions which the agent expects to make later, the result tends to be a proper scoring rule. Therefore, the agent's own beliefs will be the most valuable beliefs to have, by its calculations.
Me: But you've made a significant hidden assumption! What you say is only true if the agent knows how to write down the alternative beliefs in complete detail. Alice can think that Bob's beliefs are more accurate than hers, perhaps because Bob has access to more of the relevant information. Or, Alice can think her own later beliefs will be more accurate than her current beliefs. These things are allowed because Alice doesn't know what Bob's beliefs are in detail, and doesn't already know her future self's beliefs in detail.
Interlocutor: I would hardly call it a hidden assumption. But you're correct. What you're referring to is called the reflection principle, and is formalized as $P_0(x \mid P_1(x) = y) = y$, where $P_1$ is the probability distribution which $P_0$ trusts over itself -- usually, the future beliefs of the same agent. But this being the case, $P_0$ will naturally utilize the information in $P_1$ when it learns its contents in detail. There is no need for an extra change-of-prior process.
Me: Well, for one thing, there's a size problem: in order for $P_0$ to think about all the possible ways $P_1$ might turn out, $P_1$ has to 'fit inside' $P_0$; but if $P_1$ is similarly sized to $P_0$, or even larger, then there could be problems.
Interlocutor: So you're saying we might need to be updateful because we have finite computational power, and updating to a new set of beliefs can be less costly than treating those new beliefs as observations and considering how all possible policies could act on that observation?
Me: That's one way to think about it, but I worry this version will leave you thinking of UDT as the perfect ideal which we're trying to approximate in a bounded fashion.
Interlocutor: Indeed it would.
Me: Here's a thought-experiment. Suppose we've discovered the objectively correct human utility function, and we also know how to make a working UDT computation with whatever utility function we like. All that remains is to decide on the prior. Would you give it a broad prior, like Solomonoff's prior?
Interlocutor: We should use exactly our own prior. We don't want it to be too broad, because this will entail making trade-offs which sacrifice utility to worlds we don't believe in (such as your example with a 50% chance of ending up in Lizard World, where the Lizards reward agents who don't act in a UDT-optimal way on Earth). We also don't want it to know too much, because then it might not make trade-offs which we would happily endorse based on our UDT.
Me: Unfortunately, although I'm granting you perfect knowledge of the human utility function for the purpose of the thought experiment, I am not granting you perfect knowledge of the human prior.
Interlocutor: In that case, we should design the AI to learn about humans and estimate the human prior, and make decisions based on that estimate.
Me: And should the AI be updateless about this? Should it simply make UDT decisions according to this best-guess prior we've input?
Interlocutor: Hm... no, it seems like we want it to learn. Imagine if we had narrowed down the human prior to two possibilities, $P_1$ and $P_2$. Humans can't figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that $P_2$ is bad enough that it will lead to a catastrophe from the human perspective (that is, from the $P_1$ perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.
Me: You've invented my new decision theory for me. The reality is that we aren't certain what "our prior" is. Like the AI you describe, we are in the process of learning which prior is "correct". So, like that AI, we should be updating our prior as we learn more about what "the correct prior" is.
Interlocutor: So you're saying that divergence from UDT can be justified when we don't perfectly know our own beliefs?
Me: I'm not sure if that's exactly the condition, but at least it motivates the idea that there's some condition differentiating when we should be updateful vs updateless. I think uncertainty about "our own beliefs" is subtly wrong; it seems more like uncertainty about which beliefs we normatively endorse.
Interlocutor: But in this thought experiment, it seemed like uncertainty about our own prior is what drove the desire for updatefulness.
Me: When you said that in principle we should program the AI with our own prior, you were assuming that we were using a tiling UDT which endorses its own prior as the "correct" one. But since we are in fact uncertain about what our prior is, the same reasoning can apply to us. We shouldn't just be UDT, either, because we're still in the process of figuring out what prior we want to use with UDT. So it seems more accurate to me to say we're trying to figure out "the correct" prior, rather than "our" prior.
Interlocutor: Your language seems to suggest realism about probabilities.
Me: Yes, but in a similar sense to the moral realism of Eliezer's Meta-ethics sequence. You could call it subjectively objective. I claim that, when we put UDT into a decision problem, we make a significant assumption that the prior "correctly" describes the situation; for example, that the coin-flip in Counterfactual Mugging is indeed fair. But a totally subjectivist position about probability says: no coin is "really fair" -- 50-50 odds are a feature of maps, not territory! Coins in fact land one way or the other; all else is mere uncertainty! Yet, by asking the question "which prior should I use with UDT?" we create a need for a normative standard on probability distributions beyond mere accuracy.[8]
[1] I am assuming that the value of money is linear (one utilon per dollar), but with temporal discounting to make overall expectations well-defined.
[2] The iterated problem adds other possible policies, where we sometimes give up $100 and sometimes do not; from UDT's perspective, these options are simply intermediate between "always give" and "never give".
[3] Actually, we do some kind of extrapolation to create a sensible prior. For problems like Newcomb's Problem, we can just take the situation at face value; UDT's prior is just a probabilistic description of the situation which has already been described in English. However, for problems like Counterfactual Mugging, we often describe the problem as "Omega tells you the coin has landed tails" -- but then to feed the problem to UDT, we would construct a prior which gave the coin a 50-50 chance of heads or tails.
We can eliminate this ambiguity by spelling out the prior from the beginning, as part of the decision problem.
However, the work of "extrapolating" a problem into a sensible prior for use with UDT is extremely important and worth studying. In fact, that is one way to state the main point of this post.
[4] For example, in Newcomb's Problem, we might say things like "Society on Earth has seen Omega do things like this for a long time, and instances are very well-documented; not once has Omega been known to make an incorrect prediction."
Stuff like this is often added to sort of "get people in the mood" if they appear to be intuitively rejecting the hypothetical. From one perspective, this is irrelevant fluff that doesn't change the decision problem, instead merely helping the listener to concretely imagine it. The perspective I'm arguing here ascribes somewhat more significance to backstories like this.
[5] Learning UDT doesn't specify anything about the learning rate; if an agent uses $P_t$ at time $t$, then it's completely updateful and not UDT at all, but still counts as "Learning UDT" by this definition.
What I really have in mind is a slower learning rate. For example, I want to guarantee that the agent, faced with iterated counterfactual muggings using independent coins, eventually learns to give Omega $100 when asked.
Presumably, if UDT has a prior such that
(1) the prior is rich enough to learn about arbitrary decision problems like counterfactual mugging,
(2) the Learning UDT criterion is satisfied by the classic UDT decision procedure when using this prior,
we can also prove that UDT won't use $P_t$ at time $t$, since this would violate the assumption that UDT could correctly learn that it's in a counterfactual mugging. However, it is unclear whether a prior like this can be constructed.
[6] Actually, I think Diffractor has made at least some progress on this.
[7] I still think it might be interesting/important to explore the compatibility of learning with UDT.
For one thing, the assumption that the universe contains only bounded "cross-branch entanglements" might be thought of as a learnability assumption, similar to a no-traps assumption. We don't believe that the universe contains zero deadly traps, but if the universe might contain some deadly traps, then it is not rational to explore the environment and learn. Therefore, we may need to make a no-traps assumption to study learning behavior. Similarly, if one branch is allowed to entangle with another forever, this stops UDT from learning. So we may wish to study the learning behavior of UDT by making a bounded-entanglement assumption.
Secondarily, I have some suspicion that a bounded-entanglement assumption can be justified by other means. Entanglements are always subjective. In the case of counterfactual mugging, for example, the agent subjectively thinks of Omega's simulation as accurate -- that is, correlated with what that agent "would have actually done" in the other branch. However, a bounded agent can only concretely believe in finitely many such correlations at a given time. So, perhaps some versions of boundedly-rational UDT would come with bounded-correlation assumptions "for free"?
[8] Indeed, it seems somewhat frequentist: I regard the coinflip as fair if I can naturally interpret it as part of a sequence of coinflips which looks like it converges to something close to 50-50 frequency, and contains no other patterns that I can exploit with my bounded intelligence.
[9] Imagine that Omega will approach us 5 times, and also, that we have $500 to start, so if we say yes every time we will be reduced to $0. Further, imagine that we are maximizing the product of our ending amount of money in the two branches (an assumption we can justify with Nash bargaining, assuming the BATNA is $0).
In this case, it is optimal to say yes to Omega twice:

| Number of 'yes' | End wealth of benefitting branch | End wealth of mugged branch | Product of end wealths |
|---|---|---|---|
| 0 | 500 | 500 | 250000 |
| 1 | 10500 | 400 | 4200000 |
| 2 | 20500 | 300 | 6150000 |
| 3 | 30500 | 200 | 6100000 |
| 4 | 40500 | 100 | 4050000 |
| 5 | 50500 | 0 | 0 |
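The figures in this toy analysis can be reproduced with a short sketch. It uses only the numbers stated in the footnote ($500 starting wealth, $100 cost, $10,000 reward, 5 visits, objective equal to the product of the two branches' end wealths); the function and variable names are illustrative.

```python
from typing import List, Tuple


def end_wealths(n_yes: int, start: int = 500, cost: int = 100,
                reward: int = 10_000) -> Tuple[int, int]:
    """End wealth of (benefitting branch, mugged branch) after n_yes acceptances."""
    benefitting = start + n_yes * reward
    mugged = start - n_yes * cost
    return benefitting, mugged


def bargaining_table(visits: int = 5) -> List[Tuple[int, int, int, int]]:
    """Rows of (number of 'yes', benefitting wealth, mugged wealth, product)."""
    rows = []
    for n in range(visits + 1):
        benefitting, mugged = end_wealths(n)
        rows.append((n, benefitting, mugged, benefitting * mugged))
    return rows


# The Nash-bargaining choice is the row with the largest product.
rows = bargaining_table()
best_n = max(rows, key=lambda row: row[3])[0]

if __name__ == "__main__":
    for row in rows:
        print(row)
    print("best number of 'yes':", best_n)
```

Changing `start`, `cost`, or `reward` shifts where the product peaks, which is the sense in which the bargained answer is sensitive to the size of the stakes.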
Excellent explanation, congratulations! Sad I'll have to miss the discussion.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don't, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully-general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things)? This is kind of studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?
This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).
It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.
Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism". For example: