## The sin of updating when you can change whether you exist

7 28 February 2014 01:25AM

Trigger warning: In a thought experiment in this post, I used a hypothetical torture scenario without thinking, even though it wasn't necessary to make my point. Apologies, and thanks to an anonymous user for pointing this out. I'll try to be more careful in the future.

Should you pay up in the counterfactual mugging?

I've always found the argument about self-modifying agents compelling: If you expected to face a counterfactual mugging tomorrow, you would want to choose to rewrite yourself today so that you'd pay up. Thus, a decision theory that didn't pay up wouldn't be reflectively consistent; an AI using such a theory would decide to rewrite itself to use a different theory.

But is this the only reason to pay up? This might make a difference: Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and so you shouldn't.

I've since changed my mind, and I've recently talked about part of the reason for this, when I introduced the concept of an l-zombie, or logical philosophical zombie, a mathematically possible conscious experience that isn't physically instantiated and therefore isn't actually consciously experienced. (Obligatory disclaimer: I'm not claiming that the idea that "some mathematically possible experiences are l-zombies" is likely to be true, but I think it's a useful concept for thinking about anthropics, and I don't think we should rule out l-zombies given our present state of knowledge. More in the l-zombies post and in this post about measureless Tegmark IV.) Suppose that Omega's coin had come up the other way, and Omega had turned the sky green. Then you and I would be l-zombies. But if Omega was able to make a confident guess about the decision we'd make if confronted with the counterfactual mugging (without simulating us, so that we continue to be l-zombies), then our decisions would still influence what happens in the actual physical world. Thus, if l-zombies say "I have conscious experiences, therefore I physically exist", and update on this fact, and if the decisions they make based on this influence what happens in the real world, a lot of utility may potentially be lost. Of course, you and I aren't l-zombies, but the mathematically possible versions of us who have grown up under a green sky are, and they reason the same way as you and me—it's not possible to have only the actual conscious observers reason that way. Thus, you should pay up even in the blue-sky mugging.

But that's only part of the reason I changed my mind. The other part is that while in the counterfactual mugging, the answer you get if you try to use Bayesian updating at least looks kinda sensible, there are other thought experiments in which doing so in the straight-forward way makes you obviously bat-shit crazy. That's what I'd like to talk about today.

*

The kind of situation I have in mind involves being able to influence whether you exist, or more precisely, influence whether the version of you making the decision exists as a conscious observer (or whether it's an l-zombie).

Suppose that you wake up and Omega explains to you that it's kidnapped you and some of your friends back in 2014, and put you into suspension; it's now the year 2100. It then hands you a little box with a red button, and tells you that if you press that button, Omega will slowly torture you and your friends to death; otherwise, you'll be able to live out a more or less normal and happy life (or to commit painless suicide, if you prefer). Furthermore, it explains that one of two things have happened: Either (1) humanity has undergone a positive intelligence explosion, and Omega has predicted that you will press the button; or (2) humanity has wiped itself out, and Omega has predicted that you will not press the button. In any other scenario, Omega would still have woken you up at the same time, but wouldn't have given you the button. Finally, if humanity has wiped itself out, it won't let you try to "reboot" it; in this case, you and your friends will be the last humans.

There's a correct answer to what to do in this situation, and it isn't to decide that Omega's just given you anthropic superpowers to save the world. But that's what you get if you try to update in the most naive way: If you press the button, then (2) becomes extremely unlikely, since Omega is really really good at predicting. Thus, the true world is almost certainly (1); you'll get tortured, but humanity survives. For great utility! On the other hand, if you decide to not press the button, then by the same reasoning, the true world is almost certainly (2), and humanity has wiped itself out. Surely you're not selfish enough to prefer that?

The correct answer, clearly, is that your decision whether to press the button doesn't influence whether humanity survives, it only influences whether you get tortured to death. (Plus, of course, whether Omega hands you the button in the first place!) You don't want to get tortured, so you don't press the button. Updateless reasoning gets this right.

*

Let me spell out the rules of the naive Bayesian decision theory ("NBDT") I used there, in analogy with Simple Updateless Decision Theory (SUDT). First, let's set up our problem in the SUDT framework. To simplify things, we'll pretend that FOOM and DOOM are the only possible things that can happen to humanity. In addition, we'll assume that there's a small probability $\textstyle \varepsilon$ that Omega makes a mistake when it tries to predict what you will do if given the button. Thus, the relevant possible worlds are $\textstyle \Omega = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{correct},\mathrm{incorrect}\}$. The precise probabilities you assign to these doesn't matter very much; I'll pretend that FOOM and DOOM are equiprobable, $\textstyle \mathbb{P}(x,\mathrm{incorrect}) = \varepsilon/2$ and $\textstyle \mathbb{P}(x,\mathrm{correct}) = (1-\varepsilon)/2$.

There's only one situation in which you need to make a decision, $\textstyle \mathcal{I} = \{*\}$; I won't try to define NBDT when there is more than one situation. Your possible actions in this situation are to press or to not press the button, $\textstyle \mathcal{A}(*) = \{P,\neg P\}$, so the only possible policies are $\textstyle \pi_P$, which presses the button ($\textstyle \pi_P(*) = P$), and $\textstyle \pi_{\neg P}$, which doesn't ($\textstyle \pi_{\neg P}(*) = \neg P$); $\textstyle \Pi = \{\pi_P,\pi_{\neg P}\}$.

There are four possible outcomes, specifying (a) whether humanity survives and (b) whether you get tortured: $\textstyle \mathcal{O} = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{torture},\neg\mathrm{torture}\}$. Omega only hands you the button if FOOM and it predicts you'll press it, or DOOM and it predicts you won't. Thus, the only cases in which you'll get tortured are $\textstyle o((\mathrm{foom},\mathrm{correct}),\pi_P) = (\mathrm{foom},\mathrm{torture})$ and $\textstyle o((\mathrm{doom},\mathrm{incorrect}),\pi_P) = (\mathrm{doom},\mathrm{torture})$. For any other $\textstyle x\in\{\mathrm{foom},\mathrm{doom}\}$, $\textstyle y\in\{\mathrm{correct},\mathrm{incorrect}\}$, and $\textstyle \pi\in\Pi$, we have $\textstyle o((x,y),\pi) = (x,\neg\mathrm{torture})$.

Finally, let's define our utility function by $u((\mathrm{foom},\neg\mathrm{torture})) = L$, $u((\mathrm{foom},\mathrm{torture})) = L-1$, $u((\mathrm{doom},\neg\mathrm{torture})) = -L$, and $u((\mathrm{doom},\mathrm{torture})) = -L-1$, where $\textstyle L$ is a very large number.

This suffices to set up an SUDT decision problem. There are only two possible worlds $\textstyle \omega\in\Omega$ where $\textstyle u(o(\omega,\pi_P))$ differs from $\textstyle u(o(\omega,\pi_{\neg P}))$, namely $\textstyle (\mathrm{foom},\mathrm{correct})$ and $\textstyle (\mathrm{doom},\mathrm{incorrect})$, where $\textstyle \pi_P$ results in torture and $\textstyle \pi_{\neg P}$ doesn't. In each of these cases, the utility of $\textstyle \pi_P$ is lower (by one) than that of $\textstyle \pi_{\neg P}$. Hence, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi_P))] < \mathbb{E}[u(o(\boldsymbol{\omega},\pi_{\neg P}))]$, implying that SUDT says you should choose $\textstyle \pi_{\neg P}$.

*

For NBDT, we need to know how to update, so we need one more ingredient: a function specifying in which worlds you exist as a conscious observer. In anticipation of future discussions, I'll write this as a function $\textstyle \mu(i;\omega,\pi)$, which gives the "measure" ("amount of magical reality fluid") of the conscious observation $\textstyle i\in\mathcal{I}$ if policy $\textstyle \pi\in\Pi$ is executed in the possible world $\textstyle \omega\in\Omega$. In our case, $\textstyle i = *$ and $\textstyle \mu(*;\omega,\pi)\in\{0,1\}$, indicating non-existence and existence, respectively. We can interpret $\textstyle \mu(i;\omega,\pi)$ as the conditional probability of making observation $\textstyle i$, given that the true world is $\textstyle \omega$, if plan $\textstyle \pi$ is executed. In our case, $\textstyle \mu(*;(\mathrm{foom},\mathrm{correct}),\pi_P) =$ $\textstyle \mu(*;(\mathrm{foom},\mathrm{incorrect}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{correct}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{incorrect}),\pi_P) = 1$, and $\textstyle \mu(*;\omega,\pi) = 0$ in all other cases.

Now, we can use Bayes' theorem to calculate the posterior probability of a possible world, given information $\textstyle i = *$ and policy $\textstyle \pi$: $\textstyle \mathbb{P}(\omega\mid i;\pi) = \mathbb{P}(\omega)\cdot\mu(i;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(i;\omega',\pi)$. NBDT tells us to choose the policy $\textstyle \pi$ that maximizes the posterior expected utility, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid i;\pi]$.

In our case, we have $\textstyle \mathbb{P}((\mathrm{foom},\mathrm{correct}) \mid *;\pi_P) = \mathbb{P}((\mathrm{doom},\mathrm{correct}) \mid *;\pi_{\neg P}) = 1-\varepsilon$ and $\textstyle \mathbb{P}((\mathrm{doom},\mathrm{incorrect}) \mid *;\pi_P) = \mathbb{P}((\mathrm{foom},\mathrm{incorrect}) \mid *;\pi_{\neg P}) = \varepsilon$. Thus, if we press the button, our expected utility is dominated by the near-certainty of humanity surviving, whereas if we don't, it's dominated by humanity's near-certain doom, and NBDT says we should press.

*

But maybe it's not updating that's bad, but NBDT's way of implementing it? After all, we get the clearly wacky results only if our decisions can influence whether we exist, and perhaps the way that NBDT extends the usual formula to this case happens to be the wrong way to extend it.

One thing we could try is to mark a possible world $\textstyle \omega$ as impossible only if $\textstyle \mu(*;\omega,\pi) = 0$ for all policies $\textstyle \pi$ (rather than: for the particular policy $\textstyle \pi$ whose expected utility we are computing). But this seems very ad hoc to me. (For example, this could depend on which set of possible actions $\textstyle \mathcal{A}(*)$ we consider, which seems odd.)

There is a much more principled possibility, which I'll call pseudo-Bayesian decision theory, or PBDT. PBDT can be seen as re-interpreting updating as saying that you're indifferent about what happens in possible worlds in which you don't exist as a conscious observer, rather than ruling out those worlds as impossible given your evidence. (A version of this idea was recently brought up in a comment by drnickbone, though I'd thought of this idea myself during my journey towards my current position on updating, and I imagine it has also appeared elsewhere, though I don't remember any specific instances.) I have more than one objection to PBDT, but the simplest one to argue is that it doesn't solve the problem: it still believes that it has anthropic superpowers in the problem above.

Formally, PBDT says that we should choose the policy $\textstyle \pi$ that maximizes $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$ (where the expectation is with respect to the prior, not the updated, probabilities). In other words, we set the utility of any outcome in which we don't exist as a conscious observer to zero; we can see PBDT as SUDT with modified outcome and utility functions.

When our existence is independent on our decisions—that is, if $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$—then it turns out that PBDT and NBDT are equivalent, i.e., PBDT implements Bayesian updating. That's because in that case, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid *;\pi] =$ $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega\mid *;\pi)$ $\textstyle = \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi)$. If $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$, then the whole denominator doesn't depend on $\textstyle \pi$, so the fraction is maximized if and only if the numerator is. But the numerator is $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) =$ $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\omega,\pi)]$, exactly the quantity that PBDT says should be maximized.

Unfortunately, although in our problem above $\mu(*;\omega,\pi)$ does depend of $\pi$, the denominator as a whole still doesn't: For both $\pi_P$ and $\pi_{\neg P}$, there is exactly one possible world with probability $(1-\varepsilon)/2$ and one possible world with probability $\varepsilon/2$ in which $*$ is a conscious observer, so we have $\textstyle\sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi) = 1/2$ for both $\pi\in\Pi$. Thus, PBDT gives the same answer as NBDT, by the same mathematical argument as in the case where we can't influence our own existence. If you think of PBDT as SUDT with the utility function $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then intuitively, PBDT can be thought of as reasoning, "Sure, I can't influence whether humanity is wiped out; but I can influence whether I'm an l-zombie or a conscious observer; and who cares what happens to humanity if I'm not? Best to press to button, since getting tortured in a world where there's been a positive intelligence explosion is much better than life without torture if humanity has been wiped out."

I think that's a pretty compelling argument against PBDT, but even leaving it aside, I don't like PBDT at all. I see two possible justifications for PBDT: You can either say that $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ is your real utility function—you really don't care about what happens in worlds where the version of you making the decision doesn't exist as a conscious observer—or you can say that your real preferences are expressed by $u(o(\omega,\pi))$, and multiplying by $\mu(*;\omega,\pi)$ is just a mathematical trick to express a steelmanned version of Bayesian updating. If your preferences really are given by $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then fine, and you should be maximizing $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\omega,\pi)]$ (because you should be using (S)UDT), and you should press the button. Some kind of super-selfish agent, who doesn't care a fig even about a version of itself that is exactly the same up till five seconds ago (but then wasn't handed the button) could indeed have such preferences. But I think these are wacky preferences, and you don't actually have them. (Furthermore, if you did have them, then $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ would be your actual utility function, and you should be writing it as just $u(o(\omega,\pi))$, where $o(\omega,\pi)$ must now give information about whether $*$ is a conscious observer.)

If multiplying by $\mu(*;\omega,\pi)$ is just a trick to implement updating, on the other hand, then I find it strange that it introduces a new concept that doesn't occur at all in classical Bayesian updating, namely the utility of a world in which $*$ is an l-zombie. We've set this to zero, which is no loss of generality because classical utility functions don't change their meaning if you add or subtract a constant, so whenever you have a utility function where all worlds in which $*$ is an l-zombie have the same utility $u_0$, then you can just subtract $u_0$ from all utilities (without changing the meaning of the utility function), and get a function where that utility is zero. But that means that the utility functions I've been plugging into PBDT above do change their meaning if you add a constant to them. You can set up a problem where the agent has to decide whether to bring itself into existence or not (Omega creates it iff it predicts that the agent will press a particular button), and in that case the agent will decide to do so iff the world has utility greater than zero—clearly not invariant under adding and subtracting a constant. I can't find any concept like the utility of not existing in my intuitions about Bayesian updating (though I can find such a concept in my intuitions about utility, but regarding that see the previous paragraph), so if PBDT is just a mathematical trick to implement these intuitions, where does that utility come from?

I'm not aware of a way of implementing updating in general SUDT-style problems that does better than NBDT, PBDT, and the ad-hoc idea mentioned above, so for now I've concluded that in general, trying to update is just hopeless, and we should be using (S)UDT instead. In classical decision problems, where there are no acausal influences, (S)UDT will of course behave exactly as if it did do a Bayesian update; thus, in a sense, using (S)UDT can also be seen as a reinterpretation of Bayesian updating (in this case just as updateless utility maximization in a world where all influence is causal), and that's the way I think about it nowadays.

## SUDT: A toy decision theory for updateless anthropics

14 23 February 2014 11:50PM

The best approach I know for thinking about anthropic problems is Wei Dai's Updateless Decision Theory (UDT). We aren't yet able to solve all problems that we'd like to—for example, when it comes to game theory, the only games we have any idea how to solve are very symmetric ones—but for many anthropic problems, UDT gives the obviously correct solution. However, UDT is somewhat underspecified, and cousin_it's concrete models of UDT based on formal logic are rather heavyweight if all you want is to figure out the solution to a simple anthropic problem.

In this post, I introduce a toy decision theory, Simple Updateless Decision Theory or SUDT, which is most definitely not a replacement for UDT but makes it easy to formally model and solve the kind of anthropic problems that we usually apply UDT to. (And, of course, it gives the same solutions as UDT.) I'll illustrate this with a few examples.

This post is a bit boring, because all it does is to take a bit of math that we already implicitly use all the time when we apply updateless reasoning to anthropic problems, and spells it out in excruciating detail. If you're already well-versed in that sort of thing, you're not going to learn much from this post. The reason I'm posting it anyway is that there are things I want to say about updateless anthropics, with a bit of simple math here and there, and while the math may be intuitive, the best thing I can point to in terms of details are the posts on UDT, which contain lots of irrelevant complications. So the main purpose of this post is to save people from having to reverse-engineer the simple math of SUDT from the more complex / less well-specified math of UDT.

(I'll also argue that Psy-Kosh's non-anthropic problem is a type of counterfactual mugging, I'll use the concept of l-zombies to explain why UDT's response to this problem is correct, and I'll explain why this argument still works if there aren't any l-zombies.)

*

I'll introduce SUDT by way of a first example: the counterfactual mugging. In my preferred version, Omega appears to you and tells you that it has thrown a very biased coin, which had only a 1/1000 chance of landing heads; however, in this case, the coin has in fact fallen heads, which is why Omega is talking to you. It asks you to choose between two options, (H) and (T). If you choose (H), Omega will create a Friendly AI; if you choose (T), it will destroy the world. However, there is a catch: Before throwing the coin, Omega made a prediction about which of these options you would choose if the coin came up heads (and it was able to make a highly confident prediction). If the coin had come up tails, Omega would have destroyed the world if it's predicted that you'd choose (H), and it would have created a Friendly AI if it's predicted (T). (Incidentally, if it hadn't been able to make a confident prediction, it would just have destroyed the world outright.)

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) You choose (H) if coin falls heads Positive intelligence explosion Humanity wiped out You choose (T) if coin falls heads Humanity wiped out Positive intelligence explosion

In this example, we are considering two possible worlds:  and . We write  (no pun intended) for the set of all possible worlds; thus, in this case, . We also have a probability distribution over , which we call . In our example,  and .

In the counterfactual mugging, there is only one situation you might find yourself in in which you need to make a decision, namely when Omega tells you that the coin has fallen heads. In general, we write  for the set of all possible situations in which you might need to make a decision; the  stands for the information available to you, including both sensory input and your memories. In our case, we'll write , where  is the single situation where you need to make a decision.

For every , we write  for the set of possible actions you can take if you find yourself in situation . In our case,. A policy (or "plan") is a function  that associates to every situation  an action  to take in this situation. We write  for the set of all policies. In our case, , where  and .

Next, there is a set of outcomes, , which specify all the features of what happens in the world that make a difference to our final goals, and the outcome function , which for every possible world  and every policy  specifies the outcome  that results from executing  in the world . In our case,  (standing for FAI and DOOM), and  and .

Finally, we have a utility function . In our case,  and . (The exact numbers don't really matter, as long as , because utility functions don't change their meaning under affine transformations, i.e. when you add a constant to all utilities or multiply all utilities by a positive number.)

Thus, an SUDT decision problem consists of the following ingredients: The sets ,  and  of possible worlds, situations you need to make a decision in, and outcomes; for every , the set  of possible actions in that situation; the probability distribution ; and the outcome and utility functions  and . SUDT then says that you should choose a policy  that maximizes the expected utility , where  is the expectation with respect to , and  is the true world.

In our case,  is just the probability of the good outcome , according to the (prior) distribution . For , that probability is 1/1000; for , it is 999/1000. Thus, SUDT (like UDT) recommends choosing (T).

If you set up the problem in SUDT like that, it's kind of hidden why you could possibly think that's not the right thing to do, since we aren't distinguishing situations  that are "actually experienced" in a particular possible world ; there's nothing in the formalism that reflects the fact that Omega never asks us for our choice if the coin comes up tails. In my post on l-zombies, I've argued that this makes sense because even if there's no version of you that actually consciously experiences being in the heads world, this version still exists as a Turing machine and the choices that it makes influence what happens in the real world. If all mathematically possible experiences exist, so that there aren't any l-zombies, but some experiences are "experienced more" (have more "magical reality fluid") than others, the argument is even clearer—even if there's some anthropic sense in which, upon being told that the coin fell heads, you can conclude that you should assign a high probability of being in the heads world, the same version of you still exists in the tails world, and its choices influence what happens there. And if everything is experienced to the same degree (no magical reality fluid), the argument is clearer still.

*

From Vladimir Nesov's counterfactual mugging, let's move on to what I'd like to call Psy-Kosh's probably counterfactual mugging, better known as Psy-Kosh's non-anthropic problem. This time, you're not alone: Omega gathers you together with 999,999 other advanced rationalists, all well-versed in anthropic reasoning and SUDT. It places each of you in a separate room. Then, as before, it throws a very biased coin, which has only a 1/1000 chance of landing heads. If the coin does land heads, then Omega asks all of you to choose between two options, (H) and (T). If the coin falls tails, on the other hand, Omega chooses one of you at random and asks that person to choose between (H) and (T). If the coin lands heads and you all choose (H), Omega will create a Friendly AI; same if the coin lands tails, and the person who's asked chooses (T); else, Omega will destroy the world.

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) Everyone chooses (H) if asked Positive intelligence explosion Humanity wiped out Everyone chooses (T) if asked Humanity wiped out Positive intelligence explosion Different people choose differently Humanity wiped out (Depends on who is asked)

We'll assume that all of you prefer a positive FOOM over a gloomy DOOM, which means that all of you have the same values as far as the outcomes of this little dilemma are concerned: , as before, and all of you have the same utility function, given by  and . As long as that's the case, we can apply SUDT to find a sensible policy for everybody to follow (though when there is more than one optimal policy, and the different people involved can't talk to each other, it may not be clear how one of the policies should be chosen).

This time, we have a million different people, who can in principle each make an independent decision about what to answer if Omega asks them the question. Thus, we have . Each of these people can choose between (H) and (T), so  for every person , and a policy  is a function that returns either (H) or (T) for every . Obviously, we're particularly interested in the policies  and  satisfying  and  for all .

The possible worlds are , and their probabilities are  and . The outcome function is as follows: ,  for ,  if , and  otherwise.

What does SUDT recommend? As in the counterfactual mugging,  is the probability of the good outcome , under policy . For , the good outcome can only happen if the coin falls heads: in other words, with probability . If , then the good outcome can not happen if the coin falls heads, because in that case everybody gets asked, and at least one person chooses (T). Thus, in this case, the good outcome will happen only if the coin comes up tails and the randomly chosen person answers (T); this probability is , where  is the number of people answering (T). Clearly, this is maximized for , where ; moreover, in this case we get the probability , which is better than for , so SUDT recommends the plan .

Again, when you set up the problem in SUDT, it's not even obvious why anyone might think this wasn't the correct answer. The reason is that if Omega asks you, and you update on the fact that you've been asked, then after updating, you are quite certain that the coin has landed heads: yes, your prior probability was only 1/1000, but if the coin has landed tails, the chances that you would be asked was only one in a million, so the posterior odds are about 1000:1 in favor of heads. So, you might reason, it would be best if everybody chose (H); and moreover, all the people in the other rooms will reason the same way as you, so if you choose (H), they will as well, and this maximizes the probability that humanity survives. This relies on the fact that the others will choose the same way as you, but since you're all good rationalists using the same decision theory, that's going to be the case.

But in the worlds where the coin comes up tails, and Omega chooses someone else than you, the version of you that gets asked for its decision still "exists"... as an l-zombie. You might think that what this version of you does or doesn't do doesn't influence what happens in the real world; but if we accept the argument from the previous paragraph that your decisions are "linked" to those of the other people in the experiment, then they're still linked if the version of you making the decision is an l-zombie: If we see you as a Turing machine making a decision, that Turing machine should reason, "If the coin came up tails and someone else was chosen, then I'm an l-zombie, but the person who is actually chosen will reason exactly the same way I'm doing now, and will come to the same decision; hence, my decision influences what happens in the real world even in this case, and I can't do an update and just ignore those possible worlds."

I call this the "probably counterfactual mugging" because in the counterfactual mugging, you are making your choice because of its benefits in a possible world that is ruled out by your observations, while in the probably counterfactual mugging, you're making it because of its benefits in a set of possible worlds that is made very improbable by your observations (because most of the worlds in this set are ruled out). As with the counterfactual mugging, this argument is just all the stronger if there are no l-zombies because all mathematically possible experiences are in fact experienced.

*

As a final example, let's look at what I'd like to call Eliezer's anthropic mugging: the anthropic problem that inspired Psy-Kosh's non-anthropic one. This time, you're alone again, except that there's many of you: Omega is creating a million copies of you. It flips its usual very biased coin, and if that coin falls heads, it places all of you in exactly identical green rooms. If the coin falls tails, it places one of you in a green room, and all the others in red rooms. It then asks all copies in green rooms to choose between (H) and (T); if your choice agrees with the coin, FOOM, else DOOM.

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) Green roomers choose (H) Positive intelligence explosion Humanity wiped out Green roomers choose (T) Humanity wiped out Positive intelligence explosion

Our possible worlds are back to being , with probabilities  and . We are also back to being able to make a choice in only one particular situation, namely when you're a copy in a green room: . Actions are , outcomes , utilities  and , and the outcome function is given by  and . In other words, from SUDT's perspective, this is exactly identical to the situation with the counterfactual mugging, and thus the solution is the same: Once more, SUDT recommends choosing (T).

On the other hand, the reason why someone might think that (H) could be the right answer is closer to that for Psy-Kosh's probably counterfactual mugging: After waking up in a green room, what should be your posterior probability that the coin has fallen heads? Updateful anthropic reasoning says that you should be quite sure that it has fallen heads. If you plug those probabilities into an expected utility calculation, it comes out as in Psy-Kosh's case, heavily favoring (H).

But even if these are good probabilities to assign epistemically (to satisfy your curiosity about what the world probably looks like), in light of the arguments from the counterfactual and the probably counterfactual muggings (where updating definitely is the right thing to do epistemically, but plugging these probabilities into the expected utility calculation gives the wrong result), it doesn't seem strange to me to come to the conclusion that choosing (T) is correct in Eliezer's anthropic mugging as well.

## I like simplicity, but not THAT much

15 14 February 2014 07:51PM

Followup to: L-zombies! (L-zombies?)
Reply to: Coscott's Preferences without Existence; Paul Christiano's comment on my l-zombies post

In my previous post, I introduced the idea of an "l-zombie", or logical philosophical zombie: A Turing machine that would simulate a conscious human being if it were run, but that is never run in the real, physical world, so that the experiences that this human would have had, if the Turing machine were run, aren't actually consciously experienced.

One common reply to this is to deny the possibility of logical philosophical zombies just like the possibility of physical philosophical zombies: to say that every mathematically possible conscious experience is in fact consciously experienced, and that there is no kind of "magical reality fluid" that makes some of these be experienced "more" than others. In other words, we live in the Tegmark Level IV universe, except that unlike Tegmark argues in his paper, there's no objective measure on the collection of all mathematical structures, according to which some mathematical structures somehow "exist more" than others (and, although IIRC that's not part of Tegmark's argument, according to which the conscious experiences in some mathematical structures could be "experienced more" than those in other structures). All mathematically possible experiences are experienced, and to the same "degree".

So why is our world so orderly? There's a mathematically possible continuation of the world that you seem to be living in, where purple pumpkins are about to start falling from the sky. Or the light we observe coming in from outside our galaxy is suddenly replaced by white noise. Why don't you remember ever seeing anything as obviously disorderly as that?

And the answer to that, of course, is that among all the possible experiences that get experienced in this multiverse, there are orderly ones as well as non-orderly ones, so the fact that you happen to have orderly experiences isn't in conflict with the hypothesis; after all, the orderly experiences have to be experienced as well.

One might be tempted to argue that it's somehow more likely that you will observe an orderly world if everybody who has conscious experiences at all, or if at least most conscious observers, see an orderly world. (The "most observers" version of the argument assumes that there is a measure on the conscious observers, a.k.a. some kind of magical reality fluid.) But this requires the use of anthropic probabilities, and there is simply no (known) system of anthropic probabilities that gives reasonable answers in general. Fortunately, we have an alternative: Wei Dai's updateless decision theory (which was motivated in part exactly by the problem of how to act in this kind of multiverse). The basic idea is simple (though the details do contain devils): We have a prior over what the world looks like; we have some preferences about what we would like the world to look like; and we come up with a plan for what we should do in any circumstance we might find ourselves in that maximizes our expected utility, given our prior.

*

In this framework, Coscott and Paul suggest, everything adds up to normality if, instead of saying that some experiences objectively exist more, we happen to care more about some experiences than about others. (That's not a new idea, of course, or the first time this has appeared on LW -- for example, Wei Dai's What are probabilities, anyway? comes to mind.) In particular, suppose we just care more about experiences in mathematically really simple worlds -- or more precisely, places in mathematically simple worlds that are mathematically simple to describe (since there's a simple program that runs all Turing machines, and therefore all mathematically possible human experiences, always assuming that human brains are computable). Then, even though there's a version of you that's about to see purple pumpkins rain from the sky, you act in a way that's best in the world where that doesn't happen, because that world has so much lower K-complexity, and because you therefore care so much more about what happens in that world.

There's something unsettling about that, which I think deserves to be mentioned, even though I do not think it's a good counterargument to this view. This unsettling thing is that on priors, it's very unlikely that the world you experience arises from a really simple mathematical description. (This is a version of a point I also made in my previous post.) Even if the physicists had already figured out the simple Theory of Everything, which is a super-simple cellular automaton that accords really well with experiments, you don't know that this simple cellular automaton, if you ran it, would really produce you. After all, imagine that somebody intervened in Earth's history so that orchids never evolved, but otherwise left the laws of physics the same; there might still be humans, or something like humans, and they would still run experiments and find that they match the predictions of the simple cellular automaton, so they would assume that if you ran that cellular automaton, it would compute them -- except it wouldn't, it would compute us, with orchids and all. Unless, of course, it does compute them, and a special intervention is required to get the orchids.

So you don't know that you live in a simple world. But, goes the obvious reply, you care much more about what happens if you do happen to live in the simple world. On priors, it's probably not true; but it's best, according to your values, if all people like you act as if they live in the simple world (unless they're in a counterfactual mugging type of situation, where they can influence what happens in the simple world even if they're not in the simple world themselves), because if the actual people in the simple world act like that, that gives the highest utility.

You can adapt an argument that I was making in my l-zombies post to this setting: Given these preferences, it's fine for everybody to believe that they're in a simple world, because this will increase the correspondence between map and territory for the people that do live in simple worlds, and that's who you care most about.

*

I mostly agree with this reasoning. I agree that Tegmark IV without a measure seems like the most obvious and reasonable hypothesis about what the world looks like. I agree that there seems no reason for there to be a "magical reality fluid". I agree, therefore, that on the priors that I'd put into my UDT calculation for how I should act, it's much more likely that true reality is a measureless Tegmark IV than that it has some objective measure according to which some experiences are "experienced less" than others, or not experienced at all. I don't think I understand things well enough to be extremely confident in this, but my odds would certainly be in favor of it.

Moreover, I agree that if this is the case, then my preferences are to care more about the simpler worlds, making things add up to normality; I'd want to act as if purple pumpkins are not about to start falling from the sky, precisely because I care more about the consequences my actions have in more orderly worlds.

But.

*

Imagine this: Once you finish reading this article, you hear a bell ringing, and then a sonorous voice announces: "You do indeed live in a Tegmark IV multiverse without a measure. You had better deal with it." And then it turns out that it's not just you who's heard that voice: Every single human being on the planet (who didn't sleep through it, isn't deaf etc.) has heard those same words.

On the hypothesis, this is of course about to happen to you, though only in one of those worlds with high K-complexity that you don't care about very much.

So let's consider the following possible plan of action: You could act as if there is some difference between "existence" and "non-existence", or perhaps some graded degree of existence, until you hear those words and confirm that everybody else has heard them as well, or until you've experienced one similarly obviously "disorderly" event. So until that happens, you do things like invest time and energy into trying to figure out what the best way to act is if it turns out that there is some magical reality fluid, and into trying to figure out what a non-confused version of something like a measure on conscious experience could look like, and you act in ways that don't kill you if we happen to not live in a measureless Tegmark IV. But once you've had a disorderly experience, just a single one, you switch over to optimizing for the measureless mathematical multiverse.

If the degree to which you care about worlds is really proportional to their K-complexity, with respect to what you and I would consider a "simple" universal Turing machine, then this would be a silly plan; there is very little to be gained from being right in worlds that have that much higher K-complexity. But when I query my intuitions, it seems like a rather good plan:

• Yes, I care less about those disorderly worlds. But not as much less as if I valued them by their K-complexity. I seem to be willing to tap into my complex human intuitions to refer to the notion of "single obviously disorderly event", and assign the worlds with a single such event, and otherwise low K-complexity, not that much lower importance than the worlds with actual low K-complexity.
• And if I imagine that the confused-seeming notions of "really physically exists" and "actually experienced" do have some objective meaning independent of my preferences, then I care much more about the difference between "I get to 'actually experience' a tomorrow" and "I 'really physically' get hit by a car today" than I care about the difference between the world with true low K-complexity and the worlds with a single disorderly event.

In other words, I agree that on the priors I put into my UDT calculation, it's much more likely that we live in measureless Tegmark IV; but my confidence in this isn't extreme, and if we don't, then the difference between "exists" and "doesn't exist" (or "is experienced a lot" and "is experienced only infinitesimally") is very important; much more important than the difference between "simple world" and "simple world plus one disorderly event" according to my preferences if we do live in a Tegmark IV universe. If I act optimally according to the Tegmark IV hypothesis in the latter worlds, that still gives me most of the utility that acting optimally in the truly simple worlds would give me -- or, more precisely, the utility differential isn't nearly as large as if there is something else going on, and I should be doing something about it, and I'm not.

This is the reason why I'm trying to think seriously about things like l-zombies and magical reality fluid. I mean, I don't even think that these are particularly likely to be exactly right even if the measureless Tegmark IV hypothesis is wrong; I expect that there would be some new insight that makes even more sense than Tegmark IV, and makes all the confusion go away. But trying to grapple with the confused intuitions we currently have seems at least a possible way to make progress on this, if it should be the case that there is in fact progress to be made.

*

Here's one avenue of investigation that seems worthwhile to me, and wouldn't without the above argument. One thing I could imagine finding, that could make the confusion go away, would be that the intuitive notion of "all possible Turing machines" is just wrong, and leads to outright contradictions (e.g., to inconsistencies in Peano Arithmetic, or something similarly convincing). Lots of people have entertained the idea that concepts like the real numbers don't "really" exist, and only the behavior of computable functions is "real"; perhaps not even that is real, and true reality is more restricted? (You can reinterpret many results about real numbers as results about computable functions, so maybe you could reinterpret results about computable functions as results about these hypothetical weaker objects that would actually make mathematical sense.) So it wouldn't be the case after all that there is some Turing machine that computes the conscious experiences you would have if pumpkins started falling from the sky.

Does the above make sense? Probably not. But I'd say that there's a small chance that maybe yes, and that if we understood the right kind of math, it would seem very obvious that not all intuitively possible human experiences are actually mathematically possible (just as obvious as it is today, with hindsight, that there is no Turing machine which takes a program as input and outputs whether this program halts). Moreover, it seems plausible that this could have consequences for how we should act. This, together with my argument above, make me think that this sort of thing is worth investigating -- even if my priors are heavily on the side of expecting that all experiences exist to the same degree, and ordinarily this difference in probabilities would make me think that our time would be better spent on investigating other, more likely hypotheses.

*

Leaving aside the question of how I should act, though, does all of this mean that I should believe that I live in a universe with l-zombies and magical reality fluid, until such time as I hear that voice speaking to me?

I do feel tempted to try to invoke my argument from the l-zombies post that I prefer the map-territory correspondences of actually existing humans to be correct, and don't care about whether l-zombies have their map match up with the territory. But I'm not sure that I care much more about actually existing humans being correct, if the measureless mathematical multiverse hypothesis is wrong, than I care about humans in simple worlds being correct, if that hypothesis is right. So I think that the right thing to do may be to have a subjective belief that I most likely do live in the measureless Tegmark IV, as long as that's the view that seems by far the least confused -- but continue to spend resources on investigating alternatives, because on priors they don't seem sufficiently unlikely to make up for the potential great importance of getting this right.

## L-zombies! (L-zombies?)

19 07 February 2014 06:30PM

Reply to: Benja2010's Self-modification is the correct justification for updateless decision theory; Wei Dai's Late great filter is not bad news

"P-zombie" is short for "philosophical zombie", but here I'm going to re-interpret it as standing for "physical philosophical zombie", and contrast it to what I call an "l-zombie", for "logical philosophical zombie".

A p-zombie is an ordinary human body with an ordinary human brain that does all the usual things that human brains do, such as the things that cause us to move our mouths and say "I think, therefore I am", but that isn't conscious. (The usual consensus on LW is that p-zombies can't exist, but some philosophers disagree.) The notion of p-zombie accepts that human behavior is produced by physical, computable processes, but imagines that these physical processes don't produce conscious experience without some additional epiphenomenal factor.

An l-zombie is a human being that could have existed, but doesn't: a Turing machine which, if anybody ever ran it, would compute that human's thought processes (and its interactions with a simulated environment); that would, if anybody ever ran it, compute the human saying "I think, therefore I am"; but that never gets run, and therefore isn't conscious. (If it's conscious anyway, it's not an l-zombie by this definition.) The notion of l-zombie accepts that human behavior is produced by computable processes, but supposes that these computational processes don't produce conscious experience without being physically instantiated.

Actually, there probably aren't any l-zombies: The way the evidence is pointing, it seems like we probably live in a spatially infinite universe where every physically possible human brain is instantiated somewhere, although some are instantiated less frequently than others; and if that's not true, there are the "bubble universes" arising from cosmological inflation, the branches of many-worlds quantum mechanics, and Tegmark's "level IV" multiverse of all mathematical structures, all suggesting again that all possible human brains are in fact instantiated. But (a) I don't think that even with all that evidence, we can be overwhelmingly certain that all brains are instantiated; and, more importantly actually, (b) I think that thinking about l-zombies can yield some useful insights into how to think about worlds where all humans exist, but some of them have more measure ("magical reality fluid") than others.

So I ask: Suppose that we do indeed live in a world with l-zombies, where only some of all mathematically possible humans exist physically, and only those that do have conscious experiences. How should someone living in such a world reason about their experiences, and how should they make decisions — keeping in mind that if they were an l-zombie, they would still say "I have conscious experiences, so clearly I can't be an l-zombie"?

If we can't update on our experiences to conclude that someone having these experiences must exist in the physical world, then we must of course conclude that we are almost certainly l-zombies: After all, if the physical universe isn't combinatorially large, the vast majority of mathematically possible conscious human experiences are not instantiated. You might argue that the universe you live in seems to run on relatively simple physical rules, so it should have high prior probability; but we haven't really figured out the exact rules of our universe, and although what we understand seems compatible with the hypothesis that there are simple underlying rules, that's not really proof that there are such underlying rules, if "the real universe has simple rules, but we are l-zombies living in some random simulation with a hodgepodge of rules (that isn't actually ran)" has the same prior probability; and worse, if you don't have all we do know about these rules loaded into your brain right now, you can't really verify that they make sense, since there is some mathematically possible simulation whose initial state has you remember seeing evidence that such simple rules exist, even if they don't; and much worse still, even if there are such simple rules, what evidence do you have that if these rules were actually executed, they would produce you? Only the fact that you, like, exist, but we're asking what happens if we don't let you update on that.

I find myself quite unwilling to accept this conclusion that I shouldn't update, in the world we're talking about. I mean, I actually have conscious experiences. I, like, feel them and stuff! Yes, true, my slightly altered alter ego would reason the same way, and it would be wrong; but I'm right...

...and that actually seems to offer a way out of the conundrum: Suppose that I decide to update on my experience. Then so will my alter ego, the l-zombie. This leads to a lot of l-zombies concluding "I think, therefore I am", and being wrong, and a lot of actual people concluding "I think, therefore I am", and being right. All the thoughts that are actually consciously experienced are, in fact, correct. This doesn't seem like such a terrible outcome. Therefore, I'm willing to provisionally endorse the reasoning "I think, therefore I am", and to endorse updating on the fact that I have conscious experiences to draw inferences about physical reality — taking into account the simulation argument, of course, and conditioning on living in a small universe, which is all I'm discussing in this post.

NB. There's still something quite uncomfortable about the idea that all of my behavior, including the fact that I say "I think therefore I am", is explained by the mathematical process, but actually being conscious requires some extra magical reality fluid. So I still feel confused, and using the word l-zombie in analogy to p-zombie is a way of highlighting that. But this line of reasoning still feels like progress. FWIW.

But if that's how we justify believing that we physically exist, that has some implications for how we should decide what to do. The argument is that nothing very bad happens if the l-zombies wrongly conclude that they actually exist. Mostly, that also seems to be true if they act on that belief: mostly, what l-zombies do doesn't seem to influence what happens in the real world, so if only things that actually happen are morally important, it doesn't seem to matter what the l-zombies decide to do. But there are exceptions.

Consider the counterfactual mugging: Accurate and trustworthy Omega appears to you and explains that it just has thrown a very biased coin that had only a 1/1000 chance of landing heads. As it turns out, this coin has in fact landed heads, and now Omega is offering you a choice: It can either (A) create a Friendly AI or (B) destroy humanity. Which would you like? There is a catch, though: Before it threw the coin, Omega made a prediction about what you would do if the coin fell heads (and it was able to make a confident prediction about what you would choose). If the coin had fallen tails, it would have created an FAI if it has predicted that you'd choose (B), and it would have destroyed humanity if it has predicted that you would choose (A). (If it hadn't been able to make a confident prediction about what you would choose, it would just have destroyed humanity outright.)

There is a clear argument that, if you expect to find yourself in a situation like this in the future, you would want to self-modify into somebody who would choose (B), since this gives humanity a much larger chance of survival. Thus, a decision theory stable under self-modification would answer (B). But if you update on the fact that you consciously experience Omega telling you that the coin landed heads, (A) would seem to be the better choice!

One way of looking at this is that if the coin falls tails, the l-zombie that is told the coin landed heads still exists mathematically, and this l-zombie now has the power to influence what happens in the real world. If the argument for updating was that nothing bad happens even though the l-zombies get it wrong, well, that argument breaks here. The mathematical process that is your mind doesn't have any evidence about whether the coin landed heads or tails, because as a mathematical object it exists in both possible worlds, and it has to make a decision in both worlds, and that decision affects humanity's future in both worlds.

Back in 2010, I wrote a post arguing that yes, you would want to self-modify into something that would choose (B), but that that was the only reason why you'd want to choose (B). Here's a variation on the above scenario that illustrates the point I was trying to make back then: Suppose that Omega tells you that it actually threw its coin a million years ago, and if it had fallen tails, it would have turned Alpha Centauri purple. Now throughout your history, the argument goes, you would never have had any motive to self-modify into something that chooses (B) in this particular scenario, because you've always known that Alpha Centauri isn't, in fact, purple.

But this argument assumes that you know you're not a l-zombie; if the coin had in fact fallen tails, you wouldn't exist as a conscious being, but you'd still exist as a mathematical decision-making process, and that process would be able to influence the real world, so you-the-decision-process can't reason that "I think, therefore I am, therefore the coin must have fallen heads, therefore I should choose (A)." Partly because of this, I now accept choosing (B) as the (most likely to be) correct choice even in that case. (The rest of my change in opinion has to do with all ways of making my earlier intuition formal getting into trouble in decision problems where you can influence whether you're brought into existence, but that's a topic for another post.)

However, should you feel cheerful while you're announcing your choice of (B), since with high (prior) probability, you've just saved humanity? That would lead to an actual conscious being feeling cheerful if the coin has landed heads and humanity is going to be destroyed, and an l-zombie computing, but not actually experiencing, cheerfulness if the coin has landed heads and humanity is going to be saved. Nothing good comes out of feeling cheerful, not even alignment of a conscious' being's map with the physical territory. So I think the correct thing is to choose (B), and to be deeply sad about it.

You may be asking why I should care what the right probabilities to assign or the right feelings to have are, since these don't seem to play any role in making decisions; sometimes you make your decisions as if updating on your conscious experience, but sometimes you don't, and you always get the right answer if you don't update in the first place. Indeed, I expect that the "correct" design for an AI is to fundamentally use (more precisely: approximate) updateless decision theory (though I also expect that probabilities updated on the AI's sensory input will be useful for many intermediate computations), and "I compute, therefore I am"-style reasoning will play no fundamental role in the AI. And I think the same is true for humans' decisions — the correct way to act is given by updateless reasoning. But as a human, I find myself unsatisfied by not being able to have a picture of what the physical world probably looks like. I may not need one to figure out how I should act; I still want one, not for instrumental reasons, but because I want one. In a small universe where most mathematically possible humans are l-zombies, the argument in this post seems to give me a justification to say "I think, therefore I am, therefore probably I either live in a simulation or what I've learned about the laws of physics describes how the real world works (even though there are many l-zombies who are thinking similar thoughts but are wrong about them)."

And because of this, even though I disagree with my 2010 post, I also still disagree with Wei Dai's 2010 post arguing that a late Great Filter is good news, which my own 2010 post was trying to argue against. Wei argued that if Omega gave you a choice between (A) destroying the world now and (B) having Omega destroy the world a million years ago (so that you are never instantiated as a conscious being, though your choice as an l-zombie still influences the real world), then you would choose (A), to give humanity at least the time it's had so far. Wei concluded that this means that if you learned that the Great Filter is in our future, rather than our past, that must be good news, since if you could choose where to place the filter, you should place it in the future. I now agree with Wei that (A) is the right choice, but I don't think that you should be happy about it. And similarly, I don't think you should be happy about news that tells you that the Great Filter is later than you might have expected.

## Results from MIRI's December workshop

45 15 January 2014 10:29PM

Last week (Dec. 14-20), MIRI ran its 6th research workshop on logic, probability, and reflection. Writing up mathematical results takes time, and in the past, it's taken quite a while for results from these workshops to become available even in draft form. Because of this, at the December workshop, we tried something new: taking time during and in the days immediately after the workshop to write up results in quick and somewhat dirty form, while they still feel fresh and exciting.

In total, there are seven short writeups. Here's a list, with short descriptions of each. Before you get started on these writeups, you may want to read John Baez's blog post about the workshop, which gives an introduction to the two main themes of the workshop.

## Naturalistic trust among AIs: The parable of the thesis advisor's theorem

24 15 December 2013 08:32AM

Eliezer and Marcello's article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb's theorem, are problematic for an agent using logical reasoning. One of these desiderata is naturalistic trust: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, also uses PA for its reasoning. Our AI looks at the theorem prover's display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it's on the prover's display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that there exists a proof of the lemma, by Löb's theorem this doesn't imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head different from proofs checked somewhere in the environment. (The above isn't fully formal, but the formal details can be filled in.)

At the MIRI's December workshop (which started today), we've been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn't consider to be entirely sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there's some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is "safe" to take this action (e.g., the action doesn't blow up the planet). Nik suggests to relax this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; OR (ii) proven in PA that it's provable in PA that the action is safe; OR (iii) proven in PA that it's provable in PA that it's provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover's display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that it's provable that A -> safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that it's provable that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it's provable that A is provable, it can use the same sort of reasoning to prove that it's provable that it's provable that safe(X), and again take action X.

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, "trusts" that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or -- etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are "trustworthy" in this sense -- if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some n such that it's "provable^n" that A, and that's not enough for it to use the lemma. If the theorem prover flashed n on its screen together with A, everything would be fine and dandy; but if the AI doesn't know n, it's not able to use the theorem prover's work. So it still seems that the AI is unwilling to "trust" another system that reasons just like the AI itself.

I want to try to shed some light on this obstacle by giving an intuition for why the AI's behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that's been bothering you, and she suggests that it's an easy consequence of a theorem in cohistonomical tomolopy. You haven't heard of this theorem before, and find it rather surprising, so you ask for the proof.

"Well," she says, "I've heard it from my thesis advisor."

"Oh," you say, "fair enough. Um--"

"Yes?"

"Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she's convinced herself of."

"Oh, I see! Well, fair enough. I'd still like to understand why this theorem is true, though. You wouldn't happen to know your advisor's proof, would you?"

"Ah, as a matter of fact, I do! She's heard it from her thesis advisor."

"..."

"Something the matter?"

"Er, have you considered..."

"Oh! I'm glad you asked! In fact, I've been curious myself, and yes, it does happen to be the case that there's an infinitely descending chain of thesis advisors all of which have established the truth of this theorem solely by having heard it from the previous advisor in the chain." (This parable takes place in a world without a big bang -- human history stretches infinitely far into the past.) "But never to worry -- they've all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction -- my advisor wouldn't have accepted it from her advisor without checking his reasoning first, and he would have accepted it from his advisor without checking, etc."

"Uh, doesn't it bother you that nobody has ever, like, actually proven the theorem?"

"Whatever in the world are you talking about? I've proven it myself! In fact, I just told you that infinitely many people have each proved it in slightly different ways -- for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used her advisor instead..."

This can't literally happen with a sound proof system, but the reason is that that a system like PA can only accept things as true if they have been proven in a system weaker than PA -- i.e., because we have Löb's theorem. Our mathematician's advisor would have to use a weaker system than the mathematician herself, and the advisor's advisor a weaker system still; this sequence would have to terminate after a finite time (I don't have a formal proof of this, but I'm fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems), and so someone will actually have to have proved the actual theorem on the object level.

So here's my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn't get passed on indefinitely -- you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that's not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there's something that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you've found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I can now tell you one way to do something like this: Nik's idea, which I was talking about above. Remember how our AI "trusts" the theorem prover that flashes the number n which says how many times you have to iterate "that it's provable in PA that", but doesn't "trust" the prover that's exactly the same except it doesn't tell you this number? That's the thing that decreases. If the theorem prover actually establishes A by observing a different theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.

The cool thing about Nik's idea is that it doesn't change mathematical strength while going down the chain. In fact, it's not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves PA proves A, then it follows that PA proves A.

I can guess what Eliezer's reaction to my argument here might be: The problem I've been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn't equal our expectation of the limit. #infiniteworldproblems

-- Eliezer, not coincidentally after talking to me

I'm not going to be able to resolve that argument in this post, but briefly: I agree that we probably live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I'm talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to "work" properly in these infinite worlds, or we would be able to turn our story of the thesis advisors' proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema "if PA* proves phi, then phi" for every formula phi; this is easily seen to be inconsistent by Löb's theorem, but if we didn't know that yet, we could translate the story of the thesis advisors (which are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail -- as long as we make sure that these infinite worlds, and their properties that we're using in our arguments, really can be formalized in standard mathematics, of course.

## Meetup : Bristol meetup

4 15 October 2013 08:04PM

## Discussion article for the meetup : Bristol meetup

WHEN: 20 October 2013 02:00:00PM (+0100)

WHERE: Hodgkin House, 3 Meridian Place, Bristol BS8 1JG

We'll have another meetup in Bristol this upcoming Sunday, October 20, at the student house where I live. We'll officially start at 2pm to hopefully make it reasonably convenient for everyone, but at least two of us will be there from around 12, so if you want to come earlier and hang out a bit more, let me know! (benja.fallenstein@gmail.com, or PM me.)

I'll put up a LessWrong sign outside saying this, but please call me at 07463169075 or (from around 2pm on) ring the buzzer marked "Basement" when you arrive.

Also, whether or not you can attend this time, if you're interested in future meetups, please join the Google group for organizing meetup times!

## Meetup : Second Bristol meetup & mailing list for future meetups

2 06 June 2013 04:47PM

## Discussion article for the meetup : Second Bristol meetup & mailing list for future meetups

WHEN: 16 June 2013 03:00:00PM (+0100)

WHERE: Hodgkin House, 3 Meridian Place, Bristol BS8 1JG

At our lovely first meetup (four people came, if I count myself), I unfortunately forgot to take the opportunity to sort out when a good time for the next meetup would be. Sorry!

Since I'm about to be away for a while and I think others are leaving for the summer as well, I decided to just be bold once more and announce a time and hope that somebody else is free as well. But to make it easier to find good times in the future, please join the Google Group I've just created!

Last time, we ended up sitting in the cafe for hours without consuming much, so for this meetup I've booked the dining room at the student house where I live, which should be a quiet and comfortable place to talk. I'll also put up a LessWrong sign outside saying this, but please ring the buzzer marked "Basement", or you can call me at +43-660-1461996 (unfortunately I don't have a UK mobile yet, but if you ring just once, I'll come up and meet you).

Time & date is Sunday, the 16th of June, starting at 3pm. Hope that I didn't pick a terrible time and someone will be able to join me! :-)

## Meetup : First Bristol meetup

4 17 May 2013 04:52PM

## Discussion article for the meetup : First Bristol meetup

WHEN: 25 May 2013 03:00:00PM (+0100)

WHERE: Friska Queens Road (on the Clifton triangle), Bristol

Back in 2010, Bristol had 4000+ unique LW visitors, but we've never had a meetup -- let's try and see what happens! I'll be in the Friska on Queens Road (on the Clifton triangle, right next to the university campus) on Saturday the 25th at 3pm, with a LessWrong sign and a paperback of HPMOR. Anyone going to join me? :-)