Comment author: 31 January 2015 02:01:11AM 24 points [-]

Donated $300.

Comment author: 30 July 2014 08:54:34PM 7 points [-]

I don't follow. Could you make this example more formal, giving a set of outcomes, a set of lotteries over these outcomes, and a preference relation on these that corresponds to "I will act so that, at some point, there will have been a chance of me becoming a heavyweight champion of the world", and which fails Continuity but satisfies all the other VNM axioms? (Intuitively this sounds more like a violation of Independence, but I may well be misunderstanding what you're trying to do, since I don't know how to do the above formalization of your argument.)

Comment author: 26 July 2014 06:27:24PM *  3 points [-]

> Keep in mind that there is a nice theory about not being able to lie in Parseltongue.

What's that theory?

I suppose open borders and unrestricted immigration are in keeping with Harry's character as a utilitarian who tries to assign equal value to each and every human life. And yet, Hogwarts won't take in Muggleborns from outside Britain, and an early chapter mentioned "lands where Muggleborn children received no letters" (quoting from memory).

Comment author: 30 July 2014 02:00:42AM *  0 points [-]

Also, Magical Britain keeps Muggles out, going so far as to enforce this by not even allowing Muggles to know that Magical Britain exists. I highly doubt that Muggle Britain would do that to potential illegal immigrants even if it did have the technology...

Comment author: 02 July 2014 02:39:31AM *  5 points [-]

No, that would be evidence for it. I know you are trying to show I am having it both ways, but I am not. Think of the full tree of possibilities: ultimatum/no-ultimatum, bluff/real, rational-refusal/irrational-refusal. If a real ultimatum had been issued and Saddam had then refused for irrational reasons, that would be strong evidence against the plan, because that's the situation which the plan predicted would go well. And that's the situation Punoxysm thought he'd found, but he hadn't.
(Actually, you're the second person today to think I was doing something like that. I mentioned on IRC that in late 2013 I had correctly predicted to Gawker that the black market BMR would soon be busted by law enforcement (as most of its employees indeed were within two months or so, while setting up the successor black market Utopia), mentioning that, among other warning signs, BMR had never mentioned detecting attempts by law enforcement to infiltrate it. Someone quoted at me that surely 'absence of evidence is evidence of absence': surely if BMR had claimed to be seeing law enforcement infiltration I would have considered that evidence for infiltration, so how could I turn around and argue that the lack of such claims was also evidence for infiltration?

Yes, this is a good criticism, in a binary context. But this was more complex than a binary observation; there were at least three possibilities: (1) law enforcement was not trying to infiltrate at all, (2) they were trying and had failed, or (3) they were trying and had succeeded. BMR's silence was evidence that it hadn't spotted any attacks, so it was evidence that law enforcement was not trying, but it was also evidence for the proposition that law enforcement was trying and succeeding. A priori, the former was massively improbable, since BMR was old and notorious and it's inconceivable that LE was not actively trying to bust it, while the latter was quite probable and had just been done to Silk Road. Hence, observing BMR's silence pushed the infiltration outcome to a high posterior, while the not-trying outcome remained pretty unlikely. Of course, for an obscure small marketplace, the reasoning would run the other way: because it starts off more likely to be ignored than infiltrated, silence is golden. I'm thinking of titling any writeup "The Pig That Didn't Oink".)
Comment author: 03 July 2014 12:59:06PM 1 point [-]

Incidentally, the same argument also applies to Governor Earl Warren's statement quoted in Absence of evidence is evidence of absence. He can be seen as arguing that there are at least three possibilities: (1) there is no fifth column, (2) there is a fifth column and it is supposed to carry out sabotage independent of an invasion, and (3) there is a fifth column and it is supposed to aid a Japanese invasion of the West Coast. In case (2), you would expect to have seen sabotage; in cases (1) and (3), you wouldn't, because if the fifth column were known to exist by the time of the invasion, it would be much less effective. Thus, while observing no sabotage is evidence against a fifth column existing, it is evidence in favor of a fifth column existing and being intended to support an invasion.

I recently heard Eliezer claim that this was giving Warren too much credit, when someone pointed out an interpretation similar to this, but I'm pretty sure this argument was represented in Warren's brain (if not in explicit words) when he made his statement, even if it's pretty plausible that his choice of words was influenced by making it sound as if the absence of sabotage actually supported the contention that there was a fifth column. In particular, Warren doesn't say that the lack of subversive activity convinces him that there is a fifth column; he says that it convinces him "that the sabotage we are to get, the Fifth Column activities are to get, are timed just like Pearl Harbor was timed". Moreover, he claims that there are reasons to think (1) very unlikely, namely that, he alleges, the Axis powers all use fifth columns everywhere else:

> To assume that the enemy has not planned fifth column activities for us in a wave of sabotage is simply to live in a fool's paradise.
> These activities, whether you call them "fifth column activities" or "sabotage" or "war behind the lines upon civilians," or whatever you may call it, are just as much an integral part of Axis warfare as any of their military and naval operations. When I say that I refer to all of the Axis powers with which we are at war. [...] Those activities are now being used actively in the war in the Pacific, in every field of operations about which I have read. They have unquestionably, gentlemen, planned such activities for California. For us to believe to the contrary is just not realistic.

That is, he claims that (1) would be very surprising given the Axis powers' behavior elsewhere. On the other hand, he suggests that (3) fits a pattern of surprise attacks:

> [...] It convinces me more than perhaps any other factor that the sabotage that we are to get, the fifth column activities that we are to get, are timed just like Pearl Harbor was timed and just like the invasion of France, and of Denmark, and of Norway, and all of those other countries.

And later, he explicitly argues that you wouldn't expect to have seen sabotage in case (3):

> If there were sporadic sabotage at this time or if there had been for the last 2 months, the people of California or the Federal authorities would be on the alert to such an extent that they could not possibly have any real fifth column activities when the M-day comes.

So he has the pieces there for a correct Bayesian argument that a fifth column still has high posterior probability after seeing no sabotage, and that a fifth column intended to support an invasion has higher posterior than prior probability: a low prior probability of (1), a (comparatively) high prior probability of (3), and an argument that (3) predicts the evidence nearly as well as (1) does. I'm not saying his premises are true, just that the fact that he claims all of them suggests that his brain did in fact represent the correct argument.
The fact that he doesn't say that this argument convinces him "more than anything" that there is a fifth column, but rather says that it convinces him that the sabotage will be timed like Pearl Harbor (and France, Denmark, and Norway), further supports this. Though, as noted above, while I think that his brain did represent the correct argument, it does seem plausible that his words were chosen so as to suggest the alternative interpretation as well.

Comment author: 18 April 2014 10:48:30AM 3 points [-]

I like how "MIRI" is written on the whiteboard behind him (in all but the first video). (This comment is in the spirit of ignoring core messages and focusing on some minor tidbit.)

Comment author: 20 April 2014 02:33:46AM *  2 points [-]

The true message of the first video is even more subliminal: the whiteboard behind him shows some math recently developed by MIRI, along with a (rather boring) diagram of Botworld :-)

Comment author: 28 February 2014 06:02:33AM -2 points [-]

I asked a question in the l-zombies thread, but got no reply. Maybe you can answer here.

Comment author: 28 February 2014 08:35:50AM *  0 points [-]

Sorry about that; I've had limited time to spend on this, and have mostly come down on the side of trying to get more of my previous thinking out there rather than replying to comments. (It's a tradeoff where neither of the options is good, but I'll try to at least improve my number of replies.) I've replied there. (Actually, now that I've spent some time writing that reply, I realize that I should probably just have pointed to Coscott's existing reply in this thread.)

Comment author: 12 February 2014 12:15:34AM -1 points [-]

Why stop at the written-program level? What if you are about to type the final semicolon in the description of a simulated human? When does it become an l-zombie or, alternatively, conscious? What about the day before you go to your office to finish the program? Maybe at the moment you made the decision to write this program?
Where is the magical boundary? Is "finished program" just a convenient Schelling point?

In response to a comment on L-zombies! (L-zombies?)

Comment author: 28 February 2014 08:31:08AM 2 points [-]

I'm not sure which of the following two questions you meant to ask (though I guess probably the second one), so I'll answer both:

(a) "Under what circumstances is something (either an l-zombie or conscious)?" I am not saying that something is an l-zombie only if someone has actually written out the code of the program; for the purposes of this post, I assume that all natural numbers exist as platonic objects, and therefore all observers in programs that someone could in principle write and run exist at least as l-zombies.

(b) "When is a program an l-zombie, and when is it conscious?" The naive view would be that the program has to actually be run in the physical world; if you've written a program and then deleted the source without running it, it wouldn't be conscious. But as to what exactly the rule is that you can use to look at, say, a cellular automaton (as a model of physical reality) and ask whether the conscious experience inside a given Turing machine is "instantiated inside" that automaton, I don't have one to propose. I do think that's a weak point of the l-zombies view, and one reason that I'd assign measureless Tegmark IV a higher a priori probability.

Comment author: 28 February 2014 05:06:42AM 9 points [-]

(I'm posting this comment under this public throwaway account for emotional reasons.)

What made you think the torture hypothetical was a good idea? When I read a math book, I expect to read about math. If the writer insists on using the most horrifying example possible to illustrate their math for no reason at all, that's extremely rude. Granted, sometimes there might be a good reason to mention horrifying things, like if you're explicitly making a point about scope-insensitive moral intuitions being untrustworthy.
But if you're just doing some ordinary, mundane decision theory, you can just say "lose $50" or "utility −10". Then your more sensitive readers might actually learn some new math instead of wasting time and energy desperately trying not to ask whether life is worth living, for fear that the answer is obviously that it isn't.

Comment author: 28 February 2014 08:13:23AM *  9 points [-]

Thank you for the feedback, and sorry for causing you distress! I genuinely did not consider that this choice could cause distress, though it should have occurred to me, and I apologize.

On how I came to think that it might be a good idea (as opposed to missing that it might be a bad idea): While there's math in this post, the point is really the philosophy rather than the math (whose role is just to help think more clearly about the philosophy, e.g. to see that PBDT fails in the same way as NBDT on this example). The original counterfactual mugging was phrased in terms of dollars, and one thing I wondered about in the early discussions was whether thinking in terms of these low stakes made people think differently than they would if something really important were at stake. I'm reconstructing, since it's been a while, but I believe that's what made me rephrase it in terms of the whole world being at stake. Later, I chose the torture as something that, on a scale I'd reflectively endorse (as opposed, I acknowledge, to actual psychology), is much less important than the fate of the world, but still important. But I entirely agree that for the purposes of this post, "paying $1" (any small negative effect) would have made the point just as well.

## The sin of updating when you can change whether you exist

28 February 2014 01:25AM

Trigger warning: In a thought experiment in this post, I used a hypothetical torture scenario without thinking, even though it wasn't necessary to make my point. Apologies, and thanks to an anonymous user for pointing this out. I'll try to be more careful in the future.

Should you pay up in the counterfactual mugging?

I've always found the argument about self-modifying agents compelling: If you expected to face a counterfactual mugging tomorrow, you would want to choose to rewrite yourself today so that you'd pay up. Thus, a decision theory that didn't pay up wouldn't be reflectively consistent; an AI using such a theory would decide to rewrite itself to use a different theory.

But is this the only reason to pay up? This might make a difference: Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and so you shouldn't.

I've since changed my mind, and I've recently talked about part of the reason for this, when I introduced the concept of an l-zombie, or logical philosophical zombie, a mathematically possible conscious experience that isn't physically instantiated and therefore isn't actually consciously experienced. (Obligatory disclaimer: I'm not claiming that the idea that "some mathematically possible experiences are l-zombies" is likely to be true, but I think it's a useful concept for thinking about anthropics, and I don't think we should rule out l-zombies given our present state of knowledge. More in the l-zombies post and in this post about measureless Tegmark IV.) Suppose that Omega's coin had come up the other way, and Omega had turned the sky green. Then you and I would be l-zombies. But if Omega was able to make a confident guess about the decision we'd make if confronted with the counterfactual mugging (without simulating us, so that we continue to be l-zombies), then our decisions would still influence what happens in the actual physical world. Thus, if l-zombies say "I have conscious experiences, therefore I physically exist", and update on this fact, and if the decisions they make based on this influence what happens in the real world, a lot of utility may potentially be lost. Of course, you and I aren't l-zombies, but the mathematically possible versions of us who have grown up under a green sky are, and they reason the same way as you and me—it's not possible to have only the actual conscious observers reason that way. Thus, you should pay up even in the blue-sky mugging.

But that's only part of the reason I changed my mind. The other part is that while in the counterfactual mugging the answer you get if you try to use Bayesian updating at least looks kinda sensible, there are other thought experiments in which updating in the straightforward way makes you obviously batshit crazy. That's what I'd like to talk about today.

*

The kind of situation I have in mind involves being able to influence whether you exist, or more precisely, influence whether the version of you making the decision exists as a conscious observer (or whether it's an l-zombie).

Suppose that you wake up and Omega explains to you that it kidnapped you and some of your friends back in 2014 and put you into suspension; it's now the year 2100. It then hands you a little box with a red button, and tells you that if you press that button, Omega will slowly torture you and your friends to death; otherwise, you'll be able to live out a more or less normal and happy life (or to commit painless suicide, if you prefer). Furthermore, it explains that one of two things has happened: either (1) humanity has undergone a positive intelligence explosion, and Omega has predicted that you will press the button; or (2) humanity has wiped itself out, and Omega has predicted that you will not press the button. In any other scenario, Omega would still have woken you up at the same time, but wouldn't have given you the button. Finally, if humanity has wiped itself out, Omega won't let you try to "reboot" it; in this case, you and your friends will be the last humans.

There's a correct answer to what to do in this situation, and it isn't to decide that Omega's just given you anthropic superpowers to save the world. But that's what you get if you try to update in the most naive way: If you press the button, then (2) becomes extremely unlikely, since Omega is really really good at predicting. Thus, the true world is almost certainly (1); you'll get tortured, but humanity survives. For great utility! On the other hand, if you decide to not press the button, then by the same reasoning, the true world is almost certainly (2), and humanity has wiped itself out. Surely you're not selfish enough to prefer that?

The correct answer, clearly, is that your decision whether to press the button doesn't influence whether humanity survives, it only influences whether you get tortured to death. (Plus, of course, whether Omega hands you the button in the first place!) You don't want to get tortured, so you don't press the button. Updateless reasoning gets this right.

*

Let me spell out the rules of the naive Bayesian decision theory ("NBDT") I used there, in analogy with Simple Updateless Decision Theory (SUDT). First, let's set up our problem in the SUDT framework. To simplify things, we'll pretend that FOOM and DOOM are the only possible things that can happen to humanity. In addition, we'll assume that there's a small probability $\textstyle \varepsilon$ that Omega makes a mistake when it tries to predict what you will do if given the button. Thus, the relevant possible worlds are $\textstyle \Omega = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{correct},\mathrm{incorrect}\}$. The precise probabilities you assign to these don't matter very much; I'll pretend that FOOM and DOOM are equiprobable, $\textstyle \mathbb{P}(x,\mathrm{incorrect}) = \varepsilon/2$ and $\textstyle \mathbb{P}(x,\mathrm{correct}) = (1-\varepsilon)/2$.

There's only one situation in which you need to make a decision, $\textstyle \mathcal{I} = \{*\}$; I won't try to define NBDT when there is more than one situation. Your possible actions in this situation are to press or to not press the button, $\textstyle \mathcal{A}(*) = \{P,\neg P\}$, so the only possible policies are $\textstyle \pi_P$, which presses the button ($\textstyle \pi_P(*) = P$), and $\textstyle \pi_{\neg P}$, which doesn't ($\textstyle \pi_{\neg P}(*) = \neg P$); $\textstyle \Pi = \{\pi_P,\pi_{\neg P}\}$.

There are four possible outcomes, specifying (a) whether humanity survives and (b) whether you get tortured: $\textstyle \mathcal{O} = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{torture},\neg\mathrm{torture}\}$. Omega only hands you the button if FOOM and it predicts you'll press it, or DOOM and it predicts you won't. Thus, the only cases in which you'll get tortured are $\textstyle o((\mathrm{foom},\mathrm{correct}),\pi_P) = (\mathrm{foom},\mathrm{torture})$ and $\textstyle o((\mathrm{doom},\mathrm{incorrect}),\pi_P) = (\mathrm{doom},\mathrm{torture})$. For any other $\textstyle x\in\{\mathrm{foom},\mathrm{doom}\}$, $\textstyle y\in\{\mathrm{correct},\mathrm{incorrect}\}$, and $\textstyle \pi\in\Pi$, we have $\textstyle o((x,y),\pi) = (x,\neg\mathrm{torture})$.

Finally, let's define our utility function by $u((\mathrm{foom},\neg\mathrm{torture})) = L$, $u((\mathrm{foom},\mathrm{torture})) = L-1$, $u((\mathrm{doom},\neg\mathrm{torture})) = -L$, and $u((\mathrm{doom},\mathrm{torture})) = -L-1$, where $\textstyle L$ is a very large number.

This suffices to set up an SUDT decision problem. There are only two possible worlds $\textstyle \omega\in\Omega$ where $\textstyle u(o(\omega,\pi_P))$ differs from $\textstyle u(o(\omega,\pi_{\neg P}))$, namely $\textstyle (\mathrm{foom},\mathrm{correct})$ and $\textstyle (\mathrm{doom},\mathrm{incorrect})$, where $\textstyle \pi_P$ results in torture and $\textstyle \pi_{\neg P}$ doesn't. In each of these cases, the utility of $\textstyle \pi_P$ is lower (by one) than that of $\textstyle \pi_{\neg P}$. Hence, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi_P))] < \mathbb{E}[u(o(\boldsymbol{\omega},\pi_{\neg P}))]$, implying that SUDT says you should choose $\textstyle \pi_{\neg P}$.
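For concreteness, the SUDT calculation can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not code from the post; the particular values $\varepsilon = 0.01$ and $L = 10^6$ are arbitrary assumptions.

```python
# Illustrative sketch of the SUDT calculation for the button problem.
# EPS and L are assumed example values, not taken from the post.
EPS, L = 0.01, 10**6

WORLDS = [(x, y) for x in ("foom", "doom")
                 for y in ("correct", "incorrect")]

def prior(world):
    # P(x, correct) = (1 - eps)/2, P(x, incorrect) = eps/2
    return (1 - EPS) / 2 if world[1] == "correct" else EPS / 2

def outcome(world, policy):
    # You get tortured only if you press, and only in the two worlds in
    # which Omega actually hands you the button and you do press it.
    x, y = world
    tortured = policy == "press" and (x, y) in [("foom", "correct"),
                                                ("doom", "incorrect")]
    return (x, tortured)

def u(out):
    x, tortured = out
    return (L if x == "foom" else -L) - (1 if tortured else 0)

def sudt_value(policy):
    # Expected utility under the *prior*: no updating anywhere.
    return sum(prior(w) * u(outcome(w, policy)) for w in WORLDS)

print(sudt_value("press"))  # ≈ -0.5: pressing only ever adds torture
print(sudt_value("not"))    # ≈ 0.0, so SUDT prefers not pressing
```

As the post says, the two policies differ only in the two worlds where pressing leads to torture, so the values differ by exactly 1/2 and SUDT picks $\pi_{\neg P}$.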

*

For NBDT, we need to know how to update, so we need one more ingredient: a function specifying in which worlds you exist as a conscious observer. In anticipation of future discussions, I'll write this as a function $\textstyle \mu(i;\omega,\pi)$, which gives the "measure" ("amount of magical reality fluid") of the conscious observation $\textstyle i\in\mathcal{I}$ if policy $\textstyle \pi\in\Pi$ is executed in the possible world $\textstyle \omega\in\Omega$. In our case, $\textstyle i = *$ and $\textstyle \mu(*;\omega,\pi)\in\{0,1\}$, indicating non-existence and existence, respectively. We can interpret $\textstyle \mu(i;\omega,\pi)$ as the conditional probability of making observation $\textstyle i$, given that the true world is $\textstyle \omega$, if plan $\textstyle \pi$ is executed. In our case, $\textstyle \mu(*;(\mathrm{foom},\mathrm{correct}),\pi_P) =$ $\textstyle \mu(*;(\mathrm{foom},\mathrm{incorrect}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{correct}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{incorrect}),\pi_P) = 1$, and $\textstyle \mu(*;\omega,\pi) = 0$ in all other cases.

Now, we can use Bayes' theorem to calculate the posterior probability of a possible world, given information $\textstyle i = *$ and policy $\textstyle \pi$: $\textstyle \mathbb{P}(\omega\mid i;\pi) = \mathbb{P}(\omega)\cdot\mu(i;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(i;\omega',\pi)$. NBDT tells us to choose the policy $\textstyle \pi$ that maximizes the posterior expected utility, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid i;\pi]$.

In our case, we have $\textstyle \mathbb{P}((\mathrm{foom},\mathrm{correct}) \mid *;\pi_P) = \mathbb{P}((\mathrm{doom},\mathrm{correct}) \mid *;\pi_{\neg P}) = 1-\varepsilon$ and $\textstyle \mathbb{P}((\mathrm{doom},\mathrm{incorrect}) \mid *;\pi_P) = \mathbb{P}((\mathrm{foom},\mathrm{incorrect}) \mid *;\pi_{\neg P}) = \varepsilon$. Thus, if we press the button, our expected utility is dominated by the near-certainty of humanity surviving, whereas if we don't, it's dominated by humanity's near-certain doom, and NBDT says we should press.
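The NBDT computation can be run through the same toy model. The following sketch restates the model so the snippet stands alone; again, $\varepsilon = 0.01$ and $L = 10^6$ are assumed illustrative values, not from the post.

```python
# NBDT sketch: condition on existing as the observer who was handed the
# button (mu = 1), then maximize posterior expected utility.
EPS, L = 0.01, 10**6
WORLDS = [(x, y) for x in ("foom", "doom") for y in ("correct", "incorrect")]

def prior(world):
    return (1 - EPS) / 2 if world[1] == "correct" else EPS / 2

def outcome(world, policy):
    x, y = world
    tortured = policy == "press" and (x, y) in [("foom", "correct"),
                                                ("doom", "incorrect")]
    return (x, tortured)

def u(out):
    x, tortured = out
    return (L if x == "foom" else -L) - (1 if tortured else 0)

def mu(world, policy):
    # mu(*; w, pi) = 1 iff Omega hands you the button: its (possibly
    # mistaken) prediction of your policy matches the FOOM/DOOM branch.
    x, y = world
    predicted = policy if y == "correct" else ("not" if policy == "press" else "press")
    return 1 if (x == "foom") == (predicted == "press") else 0

def posterior(world, policy):
    z = sum(prior(w) * mu(w, policy) for w in WORLDS)  # = 1/2 here
    return prior(world) * mu(world, policy) / z

def nbdt_value(policy):
    return sum(posterior(w, policy) * u(outcome(w, policy)) for w in WORLDS)

print(nbdt_value("press"))  # ~ (1 - 2*EPS)*L - 1: "near-certain FOOM"
print(nbdt_value("not"))    # ~ -(1 - 2*EPS)*L: "near-certain DOOM"
```

For large $L$, the pressing value dominates, reproducing NBDT's claim to anthropic superpowers.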

*

But maybe it's not updating that's bad, but NBDT's way of implementing it? After all, we get the clearly wacky results only if our decisions can influence whether we exist, and perhaps the way that NBDT extends the usual formula to this case happens to be the wrong way to extend it.

One thing we could try is to mark a possible world $\textstyle \omega$ as impossible only if $\textstyle \mu(*;\omega,\pi) = 0$ for all policies $\textstyle \pi$ (rather than: for the particular policy $\textstyle \pi$ whose expected utility we are computing). But this seems very ad hoc to me. (For example, this could depend on which set of possible actions $\textstyle \mathcal{A}(*)$ we consider, which seems odd.)

There is a much more principled possibility, which I'll call pseudo-Bayesian decision theory, or PBDT. PBDT can be seen as re-interpreting updating as saying that you're indifferent about what happens in possible worlds in which you don't exist as a conscious observer, rather than ruling out those worlds as impossible given your evidence. (A version of this idea was recently brought up in a comment by drnickbone, though I'd thought of this idea myself during my journey towards my current position on updating, and I imagine it has also appeared elsewhere, though I don't remember any specific instances.) I have more than one objection to PBDT, but the simplest one to argue is that it doesn't solve the problem: it still believes that it has anthropic superpowers in the problem above.

Formally, PBDT says that we should choose the policy $\textstyle \pi$ that maximizes $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$ (where the expectation is with respect to the prior, not the updated, probabilities). In other words, we set the utility of any outcome in which we don't exist as a conscious observer to zero; we can see PBDT as SUDT with modified outcome and utility functions.

When our existence is independent of our decisions—that is, if $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$—then it turns out that PBDT and NBDT are equivalent, i.e., PBDT implements Bayesian updating. That's because in that case, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid *;\pi] =$ $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega\mid *;\pi)$ $\textstyle = \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi)$. If $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$, then the whole denominator doesn't depend on $\textstyle \pi$, so the fraction is maximized if and only if the numerator is. But the numerator is $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) =$ $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$, exactly the quantity that PBDT says should be maximized.

Unfortunately, although in our problem above $\mu(*;\omega,\pi)$ does depend on $\pi$, the denominator as a whole still doesn't: For both $\pi_P$ and $\pi_{\neg P}$, there is exactly one possible world with probability $(1-\varepsilon)/2$ and one possible world with probability $\varepsilon/2$ in which $*$ is a conscious observer, so we have $\textstyle\sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi) = 1/2$ for both $\pi\in\Pi$. Thus, PBDT gives the same answer as NBDT, by the same mathematical argument as in the case where we can't influence our own existence. If you think of PBDT as SUDT with the utility function $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then intuitively, PBDT can be thought of as reasoning: "Sure, I can't influence whether humanity is wiped out; but I can influence whether I'm an l-zombie or a conscious observer; and who cares what happens to humanity if I'm not? Best to press the button, since getting tortured in a world where there's been a positive intelligence explosion is much better than life without torture if humanity has been wiped out."
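As a numerical sanity check, the PBDT expectation can be computed directly in the same toy model (restated here so the snippet stands alone; $\varepsilon = 0.01$ and $L = 10^6$ are assumed illustrative values):

```python
# PBDT sketch: maximize E[u * mu] under the prior, i.e. assign zero utility
# to worlds in which you are an l-zombie. EPS and L are assumed values.
EPS, L = 0.01, 10**6
WORLDS = [(x, y) for x in ("foom", "doom") for y in ("correct", "incorrect")]
prior = lambda w: (1 - EPS) / 2 if w[1] == "correct" else EPS / 2

def u_of(world, policy):
    # Utility of the outcome: +/-L for FOOM/DOOM, minus 1 if tortured.
    x, y = world
    tortured = policy == "press" and (x, y) in [("foom", "correct"),
                                                ("doom", "incorrect")]
    return (L if x == "foom" else -L) - (1 if tortured else 0)

def mu(world, policy):
    # 1 iff Omega hands you the button in this world under this policy.
    x, y = world
    predicted = policy if y == "correct" else ("not" if policy == "press" else "press")
    return 1 if (x == "foom") == (predicted == "press") else 0

def pbdt_value(policy):
    return sum(prior(w) * mu(w, policy) * u_of(w, policy) for w in WORLDS)

print(pbdt_value("press"))  # ~ ((1 - 2*EPS)*L - 1) / 2
print(pbdt_value("not"))    # ~ -(1 - 2*EPS)*L / 2
```

Each PBDT value comes out as exactly half the corresponding NBDT posterior expectation, reflecting the constant normalizer of 1/2, so PBDT, like NBDT, says to press.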

I think that's a pretty compelling argument against PBDT, but even leaving it aside, I don't like PBDT at all. I see two possible justifications for PBDT: You can either say that $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ is your real utility function—you really don't care about what happens in worlds where the version of you making the decision doesn't exist as a conscious observer—or you can say that your real preferences are expressed by $u(o(\omega,\pi))$, and multiplying by $\mu(*;\omega,\pi)$ is just a mathematical trick to express a steelmanned version of Bayesian updating. If your preferences really are given by $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then fine, and you should be maximizing $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\omega,\pi)]$ (because you should be using (S)UDT), and you should press the button. Some kind of super-selfish agent, who doesn't care a fig even about a version of itself that is exactly the same up till five seconds ago (but then wasn't handed the button) could indeed have such preferences. But I think these are wacky preferences, and you don't actually have them. (Furthermore, if you did have them, then $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ would be your actual utility function, and you should be writing it as just $u(o(\omega,\pi))$, where $o(\omega,\pi)$ must now give information about whether $*$ is a conscious observer.)

If multiplying by $\mu(*;\omega,\pi)$ is just a trick to implement updating, on the other hand, then I find it strange that it introduces a new concept that doesn't occur at all in classical Bayesian updating, namely the utility of a world in which $*$ is an l-zombie. We've set this to zero, which is no loss of generality because classical utility functions don't change their meaning if you add or subtract a constant, so whenever you have a utility function where all worlds in which $*$ is an l-zombie have the same utility $u_0$, then you can just subtract $u_0$ from all utilities (without changing the meaning of the utility function), and get a function where that utility is zero. But that means that the utility functions I've been plugging into PBDT above do change their meaning if you add a constant to them. You can set up a problem where the agent has to decide whether to bring itself into existence or not (Omega creates it iff it predicts that the agent will press a particular button), and in that case the agent will decide to do so iff the world has utility greater than zero—clearly not invariant under adding and subtracting a constant. I can't find any concept like the utility of not existing in my intuitions about Bayesian updating (though I can find such a concept in my intuitions about utility, but regarding that see the previous paragraph), so if PBDT is just a mathematical trick to implement these intuitions, where does that utility come from?
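The non-invariance is easy to exhibit in the self-creation problem just described. In the following sketch there is a single possible world, and the agent exists ($\mu = 1$) only under the policy that presses, since Omega creates it iff it predicts pressing; the utilities and the shift are hypothetical numbers chosen for illustration.

```python
# PBDT in the self-creation problem: one possible world; mu = 1 if the
# agent presses (Omega creates it), 0 otherwise. Numbers are hypothetical.
def pbdt_choice(u_world, shift=0.0):
    values = {
        "press": (u_world + shift) * 1,  # agent exists; world counts fully
        "not":   (u_world + shift) * 0,  # agent is an l-zombie; world zeroed
    }
    return max(values, key=values.get)

print(pbdt_choice(5.0))               # presses: world utility > 0
print(pbdt_choice(5.0, shift=-10.0))  # doesn't press: same preferences, shifted
```

Shifting every utility by the same constant flips the decision, whereas a classical utility function is supposed to be invariant under adding a constant.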

I'm not aware of a way of implementing updating in general SUDT-style problems that does better than NBDT, PBDT, and the ad-hoc idea mentioned above, so for now I've concluded that in general, trying to update is just hopeless, and we should be using (S)UDT instead. In classical decision problems, where there are no acausal influences, (S)UDT will of course behave exactly as if it did do a Bayesian update; thus, in a sense, using (S)UDT can also be seen as a reinterpretation of Bayesian updating (in this case just as updateless utility maximization in a world where all influence is causal), and that's the way I think about it nowadays.

Comment author: 24 February 2014 10:31:36PM *  0 points [-]

This is a good introduction; however, by representing the outcomes as just "+" and "-" you greatly simplify the range of possible utility functions, and so force SUDT to make some controversial decisions (basically, to accept the counterfactual mugging). The key issue is that your decider can give no special preference to good or bad outcomes in his own world (a world the decider knows he occupies) versus other worlds (ones which the decider knows he doesn't occupy).

Suppose instead that the decider has an outcome space with four outcomes "+Me", "-Me", "+NotMe", "-NotMe". Here, "+Me" represents a good singularity which the decider himself will get to enjoy, as opposed to "-Me" which is a bad singularity (such as an unfriendly AI which tortures the decider for the next billion years). The outcomes "+NotMe" and "-NotMe" also represent positive and negative singularities, but in worlds which the decider himself doesn't inhabit. Assume that u(+Me) > u(+NotMe) > u(-Me), and also that u(+NotMe) = u(-NotMe), because the decider doesn't care about worlds that he doesn't belong to (from the point of view of his decisions, it's exactly like they don't exist).

Then, in the counterfactual mugging, when approached by Omega, the decider knows he is in a world where the coin has fallen Heads, so he picks the policy which maximizes utility for such worlds: in short he chooses "H" rather than "T". This increases the probability of -NotMe as opposed to +NotMe, but as we've seen, the decider doesn't care about that.
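To make this concrete, here is a hypothetical sketch of the calculation. The mapping of policies to outcomes and all the payoff numbers are made up for illustration; the only structural assumptions are u(+Me) > u(+NotMe) = u(-NotMe) > u(-Me) and that the Tails branch produces "NotMe" outcomes:

```python
# Hypothetical sketch: with "selfish" utilities (u(+NotMe) == u(-NotMe)),
# even an updateless expected-utility calculation favors policy "H".
# All numbers and the policy-to-outcome mapping are illustrative.

P_HEADS = 0.5

u = {"+Me": 100, "-Me": -1000, "+NotMe": 0, "-NotMe": 0}

# outcome[(policy, coin)]: which of the four outcomes obtains.
# On Heads the decider inhabits the resulting world ("Me");
# on Tails he does not ("NotMe").
outcome = {
    ("H", "heads"): "+Me",     # H is best for the world the decider is in...
    ("H", "tails"): "-NotMe",  # ...at the cost of the counterfactual world
    ("T", "heads"): "-Me",
    ("T", "tails"): "+NotMe",
}

def updateless_eu(policy):
    """Sum over both coin branches, with no update on being approached."""
    return (P_HEADS * u[outcome[(policy, "heads")]]
            + (1 - P_HEADS) * u[outcome[(policy, "tails")]])

# The Tails branch contributes 0 either way, so the selfish utilities
# make "H" the updateless choice too.
print(max(["H", "T"], key=updateless_eu))  # H
```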

Here's a possible objection: By selecting "H", the decider is condemning lots of other versions or analogues of himself (in other possible worlds where Omega didn't approach him), and his utility function might care about this. On the other hand, he might also reason like this "Analogues of me still aren't me: I still care much more about whether I get tortured than whether all those analogues do. I still pick H".

In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. Relative to one utility function it looks quite reasonable to accept the mugging, whereas relative to another utility function it is reasonable to reject it. Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).

Comment author: 25 February 2014 12:43:26AM 1 point [-]

In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. [...] Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).

I'll first give an answer to a relatively literal reading of your comment, and then one to what IMO you are "really" getting at.

Answer to a literal reading: I believe that what you value is part of the problem definition, it's not the decision theory's job to constrain that. For example, if you prefer DOOM to FOOM, (S)UDT doesn't say that your utilities are wrong, it just says you should choose (H). And if we postulate that someone doesn't care whether there's a positive intelligence explosion if they don't get to take part in it (not counting near-copies), then they should choose (H) as well.

But I disagree that this means that (S)UDT doesn't solve the counterfactual mugging. It's not like the copy-selfless utility function I discuss in the post automatically makes clear whether we should choose (H) or (T): If we went with the usual intuition that you should update on your evidence and then use the resulting probabilities in your expected utility calculation, then even if you are completely selfless, you will choose (H) in order to do the best for the world. But (S)UDT says that if you have these utilities, you should choose (T). So it would seem that the version of the counterfactual mugging discussed in the post exhibits the problem, and (S)UDT comes down squarely on the side of one of the potential solutions.
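The divergence between updating and updatelessness for a selfless utility function can be sketched with toy numbers (these payoffs are illustrative, not taken from the post):

```python
# Stylized counterfactual mugging with illustrative payoffs (not from the
# post). Omega flips a fair coin and approaches the decider only on Heads.
# Policy "T" sacrifices utility in the Heads branch for a larger gain in
# the counterfactual Tails branch; policy "H" is best within Heads alone.

P_HEADS = 0.5

utility = {
    ("H", "heads"): 0,      # refuse: no loss in the Heads world
    ("H", "tails"): 0,      # ...but no reward in the Tails world either
    ("T", "heads"): -100,   # pay up in the Heads world
    ("T", "tails"): 10000,  # rewarded in the Tails world
}

def updateless_eu(policy):
    """(S)UDT: average over both branches without conditioning."""
    return (P_HEADS * utility[(policy, "heads")]
            + (1 - P_HEADS) * utility[(policy, "tails")])

def updated_eu(policy):
    """Classical updating: condition on being approached, which here
    implies the coin came up Heads."""
    return utility[(policy, "heads")]

print(max(["H", "T"], key=updateless_eu))  # T: SUDT accepts the mugging
print(max(["H", "T"], key=updated_eu))     # H: the updater refuses
```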

Answer to the "real" point: But of course, what I read you as "really" saying is that we could re-interpret our intuition that we should use updated probabilities as meaning that our actual utility function is not the one we would write down naively, but a version where the utilities of all outcomes in which the observer-moment making the decision isn't consciously experienced are replaced by a constant. In the case of the counterfactual mugging, this transformation gives exactly the same result as if we had updated our probabilities. So in a sense, when I say that SUDT comes down on the side of one of the solutions, I am implicitly using a rule for how to go from "naive" utilities to utilities-to-use-in-SUDT: namely, the rule "just use the naive utilities". And when I use my arguments about l-zombies to argue that choosing (T) is the right solution to the counterfactual mugging, I need to argue why this rule is correct.
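This reinterpretation can also be sketched with toy numbers. Replacing the utility of every outcome the deciding observer-moment doesn't experience with a constant makes updateless maximization reproduce the updated choice, since the unexperienced branch then contributes the same amount under every policy (the payoffs below are illustrative assumptions):

```python
# Illustrative sketch of the reinterpretation: replace the utility of every
# outcome the deciding observer-moment doesn't experience with a constant C,
# then maximize updatelessly. The payoff numbers are made up.

P_HEADS = 0.5
C = 0.0  # the constant; any value yields the same argmax

utility = {  # hypothetical payoffs with the usual mugging structure
    ("H", "heads"): 0, ("H", "tails"): 0,
    ("T", "heads"): -100, ("T", "tails"): 10000,
}

def transformed(policy, coin):
    """The decider only experiences the Heads branch; every other
    outcome's utility is replaced by the constant C."""
    return utility[(policy, coin)] if coin == "heads" else C

def updateless_eu(policy, u):
    return P_HEADS * u(policy, "heads") + (1 - P_HEADS) * u(policy, "tails")

# The Tails branch contributes 0.5 * C for every policy, so the argmax
# is exactly the one updating on Heads would give.
print(max(["H", "T"], key=lambda p: updateless_eu(p, transformed)))  # H
```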

In terms of clarity of meaning, I have to say that I don't feel too bad about not spelling out that the utility function is just what you would normally call your utility function, but in terms of the strength of my arguments, I agree that the possibility of re-interpreting updating in terms of utility functions is something that needs to be addressed for my argument from l-zombies to be compelling. It just happens to be one of the many things I haven't managed to address in my updateless anthropics posts so far.

In brief, my reasons are twofold: First, I've asked myself, suppose that it actually were the case that I were an l-zombie, but could influence what happens in the real world; what would my actual values be then? And the answer is, I definitely don't completely stop caring. And second, this transformation doesn't give back exactly what you would have gotten by updating in all anthropic problems, which makes the case for it suspect. The situations I have in mind are when your decision determines whether you are a conscious observer: in this case, how you decide depends on the utility you assign to outcomes in which you don't exist, something that doesn't have any interpretation in terms of updating. If the only reason I adopt these utilities is to somehow implement my intuitions about updating, it seems very odd to suddenly have this new number influencing my decisions.
