## Publication of "Anthropic Decision Theory"

8 points · 20 September 2017 03:41PM

My paper "Anthropic decision theory for self-locating beliefs", based on posts here on Less Wrong, has been published as a Future of Humanity Institute tech report. Abstract:

This paper sets out to resolve how agents ought to act in the Sleeping Beauty problem and various related anthropic (self-locating belief) problems, not through the calculation of anthropic probabilities, but through finding the correct decision to make. It creates an anthropic decision theory (ADT) that decides these problems from a small set of principles. By doing so, it demonstrates that the attitude of agents with regards to each other (selfish or altruistic) changes the decisions they reach, and that it is very important to take this into account. To illustrate ADT, it is then applied to two major anthropic problems and paradoxes, the Presumptuous Philosopher and Doomsday problems, thus resolving some issues about the probability of human extinction.

Most of these ideas are also explained in this video.

To situate Anthropic Decision Theory within the UDT/TDT family: it's essentially a piece of UDT applied to anthropic problems, where the UDT approach can be justified using fewer, and more natural, assumptions than UDT itself requires.

## Simplified Anthropic Doomsday

1 point · 02 September 2017 08:37PM

Here is a simplified version of the Doomsday argument in Anthropic Decision Theory, intended to make the intuitions easier to grasp.

Assume a single agent A exists, an average utilitarian, with utility linear in money. Their species survives with 50% probability; denote this event by S. If the species survives, there will be 100 people total; otherwise the average utilitarian is the only one of its kind. An independent coin lands heads with 50% probability; denote this event by H.

Agent A must price a coupon CS that pays out €1 on S, and a coupon CH that pays out €1 on H. The coupon CS pays out only on S, so the reward exists only in a world with a hundred people; to an average utilitarian, €1 in such a world is worth (€1)/100. Hence the expected worth of CS is (1/2)(€1)/100 = (€1)/200 = (€2)/400.

But H is independent of S, so (H,S) and (H,¬S) both have probability 25%. In (H,S), there are a hundred people, so CH is worth (€1)/100. In (H,¬S), there is one person, so CH is worth (€1)/1=€1. Thus the expected value of CH is (€1)/4+(€1)/400 = (€101)/400. This is more than 50 times the value of CS.

Note that C¬S, the coupon that pays out on doom, has an even higher expected value of (€1)/2=(€200)/400.
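The pricing above can be checked numerically. Here is a minimal sketch (the variable names are my own) that computes the average-utilitarian expected values of the three coupons over the four equiprobable (S, H) worlds:

```python
# Expected values of the coupons for average utilitarian A.
# Worlds: the four (S, H) combinations, each with probability 0.25;
# population is 100 if the species survives (S), else 1.

p = 0.25
worlds = [(s, h, 100 if s else 1) for s in (True, False) for h in (True, False)]

def price(pays_out):
    """Expected average-utilitarian value of a coupon paying €1 when pays_out(s, h) holds."""
    return sum(p * (1 / pop) for s, h, pop in worlds if pays_out(s, h))

E_CS = price(lambda s, h: s)          # pays on survival
E_CH = price(lambda s, h: h)          # pays on heads
E_CnotS = price(lambda s, h: not s)   # pays on doom

print(E_CS, E_CH, E_CnotS)  # 0.005 (= €2/400), 0.2525 (= €101/400), 0.5 (= €200/400)
```

The figures match the text: CH is worth 50.5 times CS, despite H and S having identical probability.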

So, H and S have identical probability, but A assigns CS and CH different expected utilities, with a higher value to CH, simply because S is correlated with survival while H is independent of it (and A assigns an even higher value to C¬S, which is anti-correlated with survival). This is a phrasing of the Doomsday Argument in ADT.

## The Doomsday argument in anthropic decision theory

5 points · 31 August 2017 01:44PM

EDIT: added a simplified version here.

Crossposted at the intelligent agents forum.

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn't found a good way to express the doomsday argument within ADT.

This post will fill that gap, by showing that there is a natural doomsday-like behaviour for average utilitarian agents within ADT.

## [Link] Two Major Obstacles for Logical Inductor Decision Theory

1 point · 10 June 2017 05:48AM

## [Link] Anthropic uncertainty in the Evidential Blackmail problem

4 points · 14 May 2017 04:43PM

## Agents that don't become maximisers

8 points · 07 April 2017 12:56PM

Cross-posted at the Intelligent Agent forum.

According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.

The typical examples are inconsistent utility maximisers, satisficers, and unexploitable agents, and it's easy to think that all agents fall roughly into these broad categories. There's also the observation that, when looking at full policies rather than individual actions, many biased agents behave like expected utility maximisers (unless they want to lose pointlessly).

Nevertheless... there is an entire category of agents that generically do not seem to self-modify into maximisers. These are agents that attempt to maximise f(E(U)), where U is some utility function, E(U) is its expectation, and f is a function that is neither wholly increasing nor wholly decreasing.
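A minimal sketch of the idea (the particular target-shaped f and the policy names are my own choices for illustration): an agent maximising f(E(U)) for a non-monotonic f - here, f peaks when E(U) is exactly 5 - prefers a policy tuned to hit that expectation, so self-modifying into a plain E(U) maximiser would strictly lower its score:

```python
# An agent maximising f(E(U)) where f is non-monotonic: f(x) = -(x - 5)^2.
# Such an agent wants E(U) close to 5, not as high as possible.

def f(expected_u):
    return -(expected_u - 5.0) ** 2

# Candidate policies, described by the expected utility E(U) each achieves.
policies = {"cautious": 5.0, "greedy": 10.0, "timid": 1.0}

# An expected-utility maximiser would self-modify into the "greedy" policy...
eu_choice = max(policies, key=policies.get)
# ...but the f(E(U)) agent ranks policies by f instead.
f_choice = max(policies, key=lambda name: f(policies[name]))

print(eu_choice, f_choice)  # greedy cautious
```

Since f(5) > f(10), the agent has no incentive to become a maximiser of E(U) itself.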

## Making equilibrium CDT into FDT in one+ easy step

6 points · 21 March 2017 02:42PM

In this post, I'll argue that Joyce's equilibrium CDT (eCDT) can be made into FDT (functional decision theory) with the addition of an intermediate step - a step that should have no causal consequences. This would show that eCDT is unstable under causally irrelevant changes, and is in fact a partial version of FDT.

Joyce's principle is:

> Full Information. You should act on your time-t utility assessments only if those assessments are based on beliefs that incorporate all the evidence that is both freely available to you at t and relevant to the question about what your acts are likely to cause.

When confronted by a problem with a predictor (such as Death in Damascus or the Newcomb problem), this allows eCDT to recursively update its probabilities of the predictor's behaviour, based on its own estimates of its own actions, until this process reaches equilibrium. This allows it to behave like FDT/UDT/TDT on some (but not all) problems. I'll argue that you can modify the setup to make eCDT into a full FDT.

## Death in Damascus

In this problem, Death has predicted whether the agent will stay in Damascus (S) tomorrow or flee to Aleppo (F), and has promised to be in the same city as the agent (D or A, respectively), to kill them. Having made its prediction, Death then travels to that city to wait for the agent. Death is known to be a perfect predictor; the agent values survival at $1,000, while fleeing costs $1.

Then eCDT recommends fleeing to Aleppo with probability 999/2000. To check this, let x be the probability of fleeing to Aleppo (F), and y the probability of Death being there (A). The expected utility is then

• 1000(x(1-y)+(1-x)y)-x                                                    (1)

Differentiating this with respect to x gives 999-2000y, which is zero for y=999/2000. Since Death is a perfect predictor, y=x and eCDT's expected utility is 499.5.

The true expected utility, however, is -999/2000, since Death will get the agent anyway, and the only cost is the trip to Aleppo.
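These figures can be verified with a quick numerical sketch, using the same symbols as equation (1):

```python
# eCDT's expected utility in Death in Damascus, equation (1):
# U(x, y) = 1000*(x*(1-y) + (1-x)*y) - x,
# where x = P(flee to Aleppo) and y = P(Death in Aleppo).

def U(x, y):
    return 1000 * (x * (1 - y) + (1 - x) * y) - x

def dU_dx(y):
    # Derivative of U with respect to x, holding y fixed: 999 - 2000*y.
    return 999 - 2000 * y

x = 999 / 2000
print(dU_dx(x))   # 0.0: the eCDT equilibrium, where the derivative vanishes at y = x
print(U(x, x))    # 499.5: eCDT's (mistaken) expected utility
print(-x)         # -0.4995: the true expected utility (Death always finds you)
```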

## Delegating randomness

The eCDT decision process seems rather peculiar. It seems to allow updating of the value of y dependent on the value of x - hence allow acausal factors to be considered - but only in a narrow way. Specifically, it requires that the probability of F and A be equal, but that those two events remain independent. And it then differentiates utility according to the probability of F only, leaving that of A fixed. So, in a sense, x correlates with y, but small changes in x don't correlate with small changes in y.

That's somewhat unsatisfactory, so consider the problem now with an extra step. The eCDT agent no longer considers whether to stay or flee; instead, it outputs X, a value between 0 and 1. There is a uniform random process Z, also valued between 0 and 1. If Z<X, then the agent flees to Aleppo; if not, it stays in Damascus.

This seems identical to the original setup, for the agent. Instead of outputting a decision as to whether to flee or stay, it outputs the probability of fleeing. This has moved the randomness in the agent's decision from inside the agent to outside it, but this shouldn't make any causal difference, because the agent knows the distribution of Z.

Death remains a perfect predictor, which means that it can predict X and Z, and will move to Aleppo if and only if Z<X.

Now let the eCDT agent consider outputting X=x for some x. In that case, it updates its opinion of Death's behaviour, expecting that Death will be in Aleppo if and only if Z<x. Then it can calculate the expected utility of setting X=x, which is simply 0 (Death will always find the agent) minus x (the expected cost of fleeing to Aleppo), hence -x. Among the "pure" strategies, X=0 is clearly the best.

Now let's consider mixed strategies, where the eCDT agent can consider a distribution PX over values of X (this is a sort of second order randomness, since X and Z already give randomness over the decision to move to Aleppo). If we wanted the agent to remain consistent with the previous version, the agent then models Death as sampling from PX, independently of the agent. The probability of fleeing is just the expectation of PX; but the higher the variance of PX, the harder it is for Death to predict where the agent will go. The best option is as before: PX will set X=0 with probability 1001/2000, and X=1 with probability 999/2000.

But is this a fair way of estimating mixed strategies?

## Average Death in Aleppo

Consider a weaker form of Death, Average Death. Average Death cannot predict X, but can predict PX, and will use that to determine its location, sampling independently from it. Then, from eCDT's perspective, the mixed-strategy behaviour described above is the correct way of dealing with Average Death.

But that means that the agent above is incapable of distinguishing between Death and Average Death. Joyce argues strongly for considering all the relevant information, and the distinction between Death and Average Death is relevant. Thus it seems that, when considering mixed strategies, the eCDT agent must instead look at the pure strategies, compute their value (-x in this case), and then consider the distribution over them.

One might object that this is no longer causal, but the whole equilibrium approach undermines the strictly causal aspect anyway. It feels daft to be allowed to update on Average Death predicting PX, but not on Death predicting X. Especially since moving from PX to X is simply some random process Z' that samples from the distribution PX. So Death is allowed to predict PX (which depends on the agent's reasoning) but not Z'. It's worse than that, in fact: Death can predict PX and Z', and the agent can know this, but the agent isn't allowed to make use of this knowledge.

Given all that, it seems that in this situation, the eCDT agent must be able to compute the mixed strategies correctly and realise (like FDT) that staying in Damascus (X=0 with certainty) is the right decision.

## Let's recurse again, like we did last summer

This deals with Death, but not with Average Death. Ironically, the "X=0 with probability 1001/2000..." solution is not the correct solution for Average Death. To get that, we need to take equation (1), set x=y first, and then differentiate with respect to x. This gives x=1999/4000, so setting "X=0 with probability 2001/4000 and X=1 with probability 1999/4000" is actually the FDT solution for Average Death.
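To verify this: substituting y=x into equation (1) first gives g(x) = 2000x(1-x) - x, and a sketch of a grid search confirms the maximum sits at x = 1999/4000:

```python
# Average Death: set y = x in equation (1) *before* optimising.
# g(x) = 1000*(x*(1-x) + (1-x)*x) - x = 2000*x*(1-x) - x

def g(x):
    return 2000 * x * (1 - x) - x

# g'(x) = 1999 - 4000*x, so the maximum should be at x = 1999/4000 = 0.49975.
best = max((i / 4000 for i in range(4001)), key=g)
print(best)  # 0.49975
```

The grid deliberately includes 1999/4000 exactly; g is concave, so this is the global maximum.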

And we can make the eCDT agent reach that. Simply recurse to the next level, and have the agent choose PX directly, via a distribution PPX over possible PX.

But these towers of recursion are clunky and unnecessary. It's simpler to state that eCDT is unstable under recursion, and that it's a partial version of FDT.

## [Stub] Newcomb problem as a prisoners' dilemma/anti-coordination game

2 points · 21 March 2017 10:34AM

You should always cooperate with an identical copy of yourself in the prisoner's dilemma. This is obvious, because you and the copy will reach the same decision.

That justification implicitly assumes that you and your copy are (somewhat) antagonistic: that you have opposite aims. But the conclusion doesn't require that at all. Suppose that you and your copy were instead trying to ensure that one of you got maximal reward (it doesn't matter which). Then you should still jointly cooperate, because (C,C) is possible, while (C,D) and (D,C) are not (I'm ignoring randomising strategies for the moment).

Now look at the Newcomb problem. Your decision enters twice: once when you decide how many boxes to take, and once when Omega is simulating or estimating you to decide how much money to put in box B. You would dearly like your two "copies" (one of which may just be an estimate) to be out of sync - for the estimate to 1-box while the real you two-boxes. But without any way of distinguishing between the two, you're stuck with taking the same action - (1-box,1-box). Or, seeing it another way, (C,C).

This also makes the Newcomb problem into an anti-coordination game, where you and your copy/estimate try to pick different options. But, since this is not possible, you have to stick to the diagonal. This is why the Newcomb problem can be seen both as an anti-coordination game and a prisoners' dilemma - the differences only occur in the off-diagonal terms that can't be reached.
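One way to see this (the payoff numbers are illustrative, using the standard $1M/$1K Newcomb values in thousands): write the game between you and your estimate; the two framings differ only in the off-diagonal cells, which can never be reached:

```python
# Newcomb's problem as a two-player game between "you" and "your estimate".
# Actions: 1 = one-box (C), 2 = two-box (D). Your payoff in $1000s:
# if the estimate one-boxes, Omega fills box B.
payoff = {
    (1, 1): 1000,  # (C,C): full box B, you take only it
    (2, 1): 1001,  # unreachable: estimate one-boxes, you two-box
    (1, 2): 0,     # unreachable: estimate two-boxes, you one-box
    (2, 2): 1,     # (D,D): empty box B, you two-box
}

# Since you and the estimate necessarily act alike, only the diagonal exists:
reachable = {a: payoff[(a, a)] for a in (1, 2)}
best = max(reachable, key=reachable.get)
print(best)  # 1: one-boxing wins on the reachable diagonal
```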

## [Error]: Statistical Death in Damascus

3 points · 20 March 2017 07:17PM

Note: This post is in error, I've put up a corrected version of it here. I'm leaving the text in place, as historical record. The source of the error is that I set Pa(S)=Pe(D) and then differentiated with respect to Pa(S), while I should have differentiated first and then set the two values to be the same.

Nate Soares and Ben Levinstein have a new paper out on Functional Decision Theory, the most recent development of UDT and TDT.

This post is about further analysing the "Death in Damascus" problem, and to show that Joyce's "equilibrium" version of CDT (causal decision theory) is in a certain sense intermediate between CDT and FDT. If eCDT is this equilibrium theory, then it can deal with a certain class of predictors, which I'll call distribution predictors.

## Death in Damascus

In the original Death in Damascus problem, Death is a perfect predictor. It finds you in Damascus, and says that it has already planned its trip for tomorrow - and it'll be in the same place you will be.

You value surviving at $1,000, and can flee to Aleppo for $1.

Classical CDT will put some prior P over Death being in Damascus (D) or Aleppo (A) tomorrow. And then, if P(A)>999/2000, you should stay (S) in Damascus, while if P(A)<999/2000, you should flee (F) to Aleppo.

FDT estimates that Death will be wherever you will be, and thus there's no point in F, as that will just cost you $1 for no reason.

But it's interesting what eCDT produces. This decision theory requires that Pe (the equilibrium probability of A and D) be consistent with the action distribution that eCDT computes. Let Pa(S) be the action probability of S. Since Death knows what you will do, Pa(S)=Pe(D). The expected utility is 1000.Pa(S)Pe(A)+1000.Pa(F)Pe(D)-Pa(F). At equilibrium, this is 2000.Pe(A)(1-Pe(A))-Pe(A). And that quantity is maximised when Pe(A)=1999/4000 (and thus the probability of you fleeing is also 1999/4000). This is still the wrong decision, as paying the extra $1 is pointless, even if you don't pay it with certainty.

So far, nothing interesting: both CDT and eCDT fail. But consider the next example, on which eCDT does not fail.

## Statistical Death in Damascus

Let's assume now that Death has an assistant, Statistical Death, who is not a perfect predictor, but is a perfect distribution predictor: it can predict the distribution of your actions, but not your actual decision. Essentially, you have access to a source of true randomness that it cannot predict.

It informs you that its probability over whether to be in Damascus or Aleppo will follow exactly the same distribution as yours.

Classical CDT follows the same reasoning as before. So does eCDT: Pa(S)=Pe(D) still holds, since Statistical Death follows the same distribution as you do.

But what about FDT? Well, note that FDT will reach the same conclusion as eCDT. This is because 1000.Pa(S)Pe(A)+1000.Pa(F)Pe(D)-Pa(F) is the correct expected utility, the Pa(S)=Pe(D) assumption is correct for Statistical Death, and (S,F) is independent of (A,D) once the action probabilities have been fixed.

So on the Statistical Death problem, eCDT and FDT say the same thing.

## Factored joint distributions versus full joint distributions

What's happening is that there is a joint distribution over (S,F) (your actions) and (D,A) (Death's actions). FDT is capable of reasoning over all types of joint distributions, and fully assessing how its choice of Pa acausally affects Death's choice of Pe.

But eCDT is only capable of reasoning over ones where the joint distribution factors into a distribution over (S,F) times a distribution over (D,A). Within the confines of that limitation, it is capable of (acausally) changing Pe via its choice of Pa.

Death in Damascus does not factor into two distributions, so eCDT fails on it. Statistical Death in Damascus does so factor, so eCDT succeeds on it. Thus eCDT seems to be best conceived of as a version of FDT that is strangely limited in terms of which joint distributions it's allowed to consider.
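The distinction can be sketched numerically (my own framing of the two computations): take the same marginals P(F) = P(A) = x; the full joint distribution for perfect-predictor Death is perfectly correlated, while eCDT can only evaluate the factored product of the marginals:

```python
# Expected utility of a fleeing-probability x against Death, under two joint
# distributions over (flee, Death-in-Aleppo) with identical marginals.

def utility(p_joint):
    """p_joint maps (flee, aleppo) -> probability; you survive iff they differ."""
    return sum(p * (1000 * (flee != aleppo) - (1 * flee))
               for (flee, aleppo), p in p_joint.items())

x = 999 / 2000
# Full joint: perfectly correlated (Death predicts the actual action).
full = {(True, True): x, (False, False): 1 - x,
        (True, False): 0.0, (False, True): 0.0}
# Factored: product of independent marginals (eCDT's restriction).
factored = {(f, a): (x if f else 1 - x) * (x if a else 1 - x)
            for f in (True, False) for a in (True, False)}

print(utility(full))      # -0.4995: you always die, and pay to flee with probability x
print(utility(factored))  # 499.5: eCDT's mistaken valuation
```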

## [Link] The price you pay for arriving to class on time

0 points · 24 February 2017 02:11PM

## [Link] “Betting on the Past” – a decision problem by Arif Ahmed

2 points · 07 February 2017 09:14PM

## Is Evidential Decision Theory presumptuous?

3 points · 02 February 2017 01:41PM

I recently had a conversation with a staunch defender of EDT who maintained that EDT gives the right answer in the Smoker’s Lesion and even Evidential Blackmail. I came up with the following, even more counterintuitive, thought experiment:

--

By doing research, you've found out that there is either

(A) only one universe or

(B) a multiverse.

You also found out that the cosmological theory has a slight influence (via different physics) on how your brain works. If (A) holds, you will likely decide to give away all your money to random strangers on the street; if there is a multiverse, you will most likely not do that. Of course, causality flows in one direction only, i.e. your decision does not determine how many universes there are.

Suppose you have a very strong preference for (A) (e.g. because a multiverse would contain infinite suffering) so that it is more important to you than your money.

Do you give away all your money or not?

--

This is structurally equivalent to the Smoker's lesion, but what's causing your action is the cosmological theory, not a lesion or a gene. CDT, TDT, and UDT would not give away the money because there is no causal (or acausal) influence on the number of universes. EDT would reason that giving the money away is evidence for (A) and therefore choose to do so.

Apart from the usual "managing the news" point, this highlights another flaw in EDT: its presumptuousness. The EDT agent thinks that her decision spawns or destroys the entire multiverse, or at least reasons as if it did. In other words, EDT acts as if it affects astronomical stakes with a single thought.

I find this highly counterintuitive.

What makes it even worse is that this is not even a contrived thought experiment. Our brains are in fact shaped by physics, and it is plausible that different physical theories or constants both make an agent decide differently and make the world better or worse according to one’s values. So, EDT agents might actually reason in this way in the real world.
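The presumptuous reasoning above can be sketched as a conditional-expectation calculation. All the numbers here are made up for illustration: V is the value you place on (A), m your money, and the conditional probabilities stand in for the assumed physics-mediated correlation:

```python
# EDT in the one-universe-vs-multiverse problem, with illustrative numbers.

V = 10**9   # value of living in a single universe (A); dwarfs your money
m = 10**4   # your money
p_A_given_give = 0.9   # assumed correlation: those who give mostly live in (A)-worlds
p_A_given_keep = 0.1

edt_give = p_A_given_give * V        # giving: strong "evidence" for (A), money gone
edt_keep = p_A_given_keep * V + m    # keeping: weak evidence for (A), money kept

print(edt_give > edt_keep)  # True: EDT gives the money away, "managing the news"
```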

## Did EDT get it right all along? Introducing yet another medical Newcomb problem

10 points · 24 January 2017 11:43AM

One of the main arguments given against Evidential Decision Theory (EDT) is that it would "one-box" in medical Newcomb problems. Whether this is the winning action has been a hotly debated issue on LessWrong. A majority, including experts in the area such as Eliezer Yudkowsky and Wei Dai, seem to think that one should two-box (see e.g. Yudkowsky 2010, p. 67). Others have tried to argue in favor of EDT by claiming that the winning action would be to one-box, or by offering reasons why EDT would in some cases two-box after all.

In this blog post, I want to argue that EDT gets it right: one-boxing is the correct action in medical Newcomb problems. I introduce a new thought experiment, the Coin Flip Creation problem, in which I believe the winning move is to one-box. This new problem is structurally similar to other medical Newcomb problems such as the Smoking Lesion, though it might elicit the intuition to one-box even in people who would two-box in some of the other problems. I discuss both how EDT and other decision theories would reason in the problem and why people's intuitions might diverge in different formulations of medical Newcomb problems.

## Two kinds of Newcomblike problems

There are two different kinds of Newcomblike problems. In Newcomb's original paradox, both EDT and Logical Decision Theories (LDT), such as Timeless Decision Theory (TDT), would one-box and therefore, unlike CDT, win $1 million. In medical Newcomb problems, EDT's and LDT's decisions diverge. This is because in the latter, a (physical) causal node that isn't itself a decision algorithm influences both the current world state and our decisions – resulting in a correlation between action and environment but, unlike the original Newcomb, no "logical" causation.

It's often unclear exactly how a causal node can exert influence on our decisions. Does it change our decision theory, utility function, or the information available to us? In the case of the Smoking Lesion problem, it seems plausible that it's our utility function that is being influenced. But then it seems that as soon as we observe our utility function ("notice a tickle"; see Eells 1982), we lose "evidential power" (Almond 2010a, p. 39), i.e. there's nothing new to learn about our health by acting a certain way if we already know our utility function. In any case, as long as we don't know our utility function and therefore still have the evidential power, I believe we should use it.

The Coin Flip Creation problem is an adaptation of Caspar Oesterheld's "Two-Boxing Gene" problem and, like the latter, attempts to take Newcomb's original problem and turn it into a medical Newcomb problem, triggering the intuition that we should one-box. In Oesterheld's Two-Boxing Gene, it's stated that a certain gene correlates with our decision to one-box or two-box in Newcomb's problem, and that Omega, instead of simulating our decision algorithm, just looks at this gene. Unfortunately, it's not specified how the correlation between two-boxing and the gene arises, casting doubt on whether it's a medical Newcomb problem at all, and on whether other decision algorithms would disagree with one-boxing.
Wei Dai argues that in the Two-Boxing Gene, if Omega conducts a study to find out which genes correlate with which decision algorithm, then Updateless Decision Theory (UDT) could just commit to one-boxing and thereby determine that all the genes UDT agents have will always correlate with one-boxing. So in some sense, UDT’s genes will still indirectly constitute a “simulation” of UDT’s algorithm, and there is a logical influence between the decision to one-box and Omega’s decision to put$1 million in box A. Similar considerations could apply for other LDTs.

The Coin Flip Creation problem is intended as an example of a problem in which EDT would give the right answer, but all causal and logical decision theories would fail. It works explicitly through a causal influence on the decision theory itself, thus reducing ambiguity about the origin of the correlation.

One day, while pondering the merits and demerits of different acausal decision theories, you're visited by Omega, a being assumed to possess flawless powers of prediction and absolute trustworthiness. You're presented with Newcomb's paradox, but with one additional caveat: Omega informs you that you weren't born like a normal human being, but were instead created by Omega. On the day you were born, Omega flipped a coin: If it came up heads, Omega created you in such a way that you would one-box when presented with the Coin Flip Creation problem, and it put $1 million in box A. If the coin came up tails, you were created such that you'd two-box, and Omega didn't put any money in box A. We don't know how Omega made sure what your decision would be. For all we know, it may have inserted either CDT or EDT into your source code, or even just added one hard-coded decision rule on top of your messy human brain. Do you choose both boxes, or only box A?

It seems like EDT gets it right: one-boxing is the winning action here. There's a correlation between our decision to one-box, the coin flip, and Omega's decision to put money in box A. Conditional on us one-boxing, the probability that there is money in box A increases, and we "receive the good news" – that is, we discover that the coin must have come up heads, and we thus get the million dollars. In fact, we can be absolutely certain of the better outcome if we one-box. The argument persists even if the correlation between our actions and the content of box A isn't perfect: as long as the correlation is high enough, it is better to one-box.

Nevertheless, neither causal nor logical counterfactuals seem to imply that we can determine whether there is money in box A. The coin flip isn't a decision algorithm itself, so we can't determine its outcome. The logical uncertainty about our own decision output doesn't seem to coincide with the empirical uncertainty about the outcome of the coin flip.
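EDT's verdict can be sketched as a conditional-expectation calculation. I assume the correlation is perfect, as in the problem statement, and that the transparent box holds the usual $1,000:

```python
# Coin Flip Creation: heads -> agent one-boxes and box A holds $1M;
# tails -> agent two-boxes and box A is empty. The transparent box holds $1K.

def conditional_payoff(action):
    """Expected payoff conditional on observing yourself take `action`."""
    if action == "one-box":
        return 1_000_000      # heads world: full box A
    return 0 + 1_000          # tails world: empty box A plus the transparent box

print(conditional_payoff("one-box"), conditional_payoff("two-box"))
```

Conditioning on the action pins down which world you are in, which is exactly the "receiving the good news" reasoning above.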
In the absence of a causal or logical link between their decision and the content of box A, CDT and TDT would two-box.

## Updateless Decision Theory

As far as I understand, UDT would come to a similar conclusion. AlephNeil writes in a post about UDT:

> In the Smoking Lesion problem, the presence of a 'lesion' is somehow supposed to cause Player's to choose to smoke (without altering their utility function), which can only mean that in some sense the Player's source code is 'partially written' before the Player can exercise any control over it. However, UDT wants to 'wipe the slate clean' and delete whatever half-written nonsense is there before deciding what code to write. Ultimately this means that when UDT encounters the Smoking Lesion, it simply throws away the supposed correlation between the lesion and the decision and acts as though that were never a part of the problem.

This approach seems wrong to me. If we use an algorithm that changes our own source code, then this change, too, has been physically determined and can therefore correlate with events that aren't copies of our own decision algorithm. If UDT reasons as though it could just rewrite its own source code and discard the correlation with the coin flip altogether, then UDT two-boxes and thus by definition ends up in the world where there is no money in box A.

Note that updatelessness seemingly makes no difference in this problem, since it involves no a priori decision: before the coin flip, there's a 50% chance of becoming either a one-boxing or a two-boxing agent. In any case, we can't do anything about the coin flip, and therefore also can't influence whether box A contains any money.

I am uncertain how UDT works, though, and would be curious about other people's thoughts. Maybe UDT reasons that by one-boxing, it becomes a decision theory of the sort that would never be installed into an agent in a tails world, thus rendering impossible all hypothetical tails worlds with UDT agents in them.
But if so, why wouldn't UDT "one-box" in the Smoking Lesion? As far as the thought experiments are specified, the causal connection between coin flip and two-boxing in the Coin Flip Creation appears to be no different from the connection between gene and smoking in the Smoking Lesion.

More adaptations and different formalizations of LDTs exist, e.g. Proof-Based Decision Theory. I could very well imagine that some of those might one-box in the thought experiment I presented. If so, then I'm once again curious as to where the benefits of such decision theories lie in comparison to plain EDT (aside from updatelessness – see Concluding thoughts).

## Coin Flip Creation, Version 2

Let's assume UDT would two-box in the Coin Flip Creation. We could alter our thought experiment a bit so that UDT would probably one-box after all:

The situation is identical to the Coin Flip Creation, with one key difference: After Omega flips the coin and creates you with the altered decision algorithm, it actually simulates your decision, just as in Newcomb's original paradox. Only after Omega has determined your decision via simulation does it decide whether to put money in box A, conditional on your decision. Do you choose both boxes, or only box A?

Here is a causal graph for the first and second version of the Coin Flip Creation problem. In the first version, a coin flip determines whether there is money in box A. In the second one, a simulation of your decision algorithm decides:

Since in Version 2 there's a simulation involved, UDT would probably one-box. I find this to be a curious conclusion. The situation remains exactly the same – we can rule out any changes in the correlation between our decision and our payoff. It seems confusing to me, then, that the optimal decision should be a different one.

## Copy-altruism and multi-worlds

The Coin Flip Creation problem assumes a single world and an egoistic agent.
In the following, I want to include a short discussion of how the Coin Flip Creation would play out in a multi-world environment. Suppose Omega's coin is based on a quantum number generator and produces 50% heads worlds and 50% tails worlds. If we're copy-egoists, EDT still recommends one-boxing, since doing so would reveal to us that we're in one of the branches in which the coin came up heads.

If we're copy-altruists, then in practice, we'd probably care a bit less about copies whose decision algorithms have been tampered with, since they would make less effective use of the resources they gain than we ourselves would (i.e. their decision algorithm sometimes behaves differently). But in theory, if we care about all the copies equally, we should be indifferent between one-boxing and two-boxing, since there will always be 50% of us in either of the worlds no matter what we do. The two groups always take the opposite action; the only thing we can change is whether our own copy belongs to the tails or the heads group.

To summarize: UDT and EDT would both be indifferent in the altruistic multi-world case, but UDT would (presumably) two-box, and EDT would one-box, both in the copy-egoistic multi-world case and in the single-world case.

## "But I don't have a choice"

There seems to be an especially strong intuition of "absence of free will" inherent to the Coin Flip Creation problem. When presented with the problem, many respond that if someone had created their source code, they didn't have any choice to begin with. But that's the exact situation in which we all find ourselves at all times! Our decision architecture and choices are determined by physics, just like a hypothetical AI's source code, and all of our choices will thus be determined by our "creator." When we're confronted with the two boxes, we know that our decisions are predetermined, just like every word of this blogpost has been predetermined. But that knowledge alone won't help us make any decision.
As far as I'm aware, even an agent with complete knowledge of its own source code would have to treat its own decision outputs as uncertain, or it would fail to implement a decision algorithm that takes counterfactuals into account. Note that our decision in the Coin Flip Creation is also no less determined than in Newcomb's paradox. In both cases, the prediction has been made, and physics will guide our thoughts and our decision in a deterministic and predictable manner. Nevertheless, we can still assume that we have a choice until we make our decision, at which point we merely "find out" what has been our destiny all along.

## Concluding thoughts

I hope that the Coin Flip Creation motivates some people to reconsider EDT's answers in Newcomblike problems. A thought experiment somewhat similar to the Coin Flip Creation can be found in Arif Ahmed 2014. Of course, the particular setup of the Coin Flip Creation means it isn't directly relevant to the question of which decision theory we should program into an AI. We obviously wouldn't flip a coin before creating an AI. Also, the situation doesn't really look like a decision problem from the outside; an impartial observer would just see Omega forcing you to pick either A or B.

Still, the example demonstrates that from the inside view, evidence from the actions we take can help us achieve our goals better. Why shouldn't we use this information? And if evidential knowledge can help us, why shouldn't we allow a future AI to take it into account? In any case, I'm not overly confident in my analysis and would be glad to have any mistakes pointed out to me.

Medical Newcomb is also not the only class of problems that challenge EDT. Evidential blackmail is an example of a different problem, wherein giving the agent access to specific compromising information is used to extract money from EDT agents.
The problem attacks EDT from a different angle, though: namely by exploiting its lack of updatelessness, similar to the challenges in Transparent Newcomb, Parfit’s Hitchhiker, Counterfactual Mugging, and the Absent-Minded Driver. I plan to address questions related to updatelessness, e.g. whether it makes sense to give in to evidential blackmail if you already have access to the information and haven’t precommitted not to give in, at a later point. ## Two-boxing, smoking and chewing gum in Medical Newcomb problems 15 29 June 2015 10:35AM I am currently learning about the basics of decision theory, most of which is common knowledge on LW. I have a question, related to why EDT is said not to work. Consider the following Newcomblike problem: A study shows that most people who two-box in Newcomblike problems such as the following have a certain gene (and one-boxers don't have the gene). Now, Omega could put you into something like Newcomb's original problem, but instead of having run a simulation of you, Omega has only looked at your DNA: If you don't have the "two-boxing gene", Omega puts $1M into box B, otherwise box B is empty. And there is $1K in box A, as usual. Would you one-box (take only box B) or two-box (take boxes A and B)? Here's a causal diagram for the problem: Since Omega does not do much other than translating your genes into money under a box, it does not seem to hurt to leave it out: I presume that most LWers would one-box. (And as I understand it, not only CDT but also TDT would two-box, am I wrong?) Now, how does this problem differ from the smoking lesion or Yudkowsky's (2010, p.67) chewing gum problem? Chewing gum (or smoking) seems to be like taking box A to get at least an additional $1K, the two-boxing gene is like the CGTA gene, the illness itself (the abscess or lung cancer) is like not having $1M in box B.
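To make the evidential reasoning concrete, here is a rough sketch of the EDT calculation for the genetic Newcomb problem above. The 95%/5% gene-action correlations are my own illustrative assumptions, not numbers from the study described in the post.

```python
# Hypothetical sketch of the EDT calculation for the genetic Newcomb problem.
# The 0.95/0.05 correlations between the gene and the action are assumed
# numbers for illustration only.
P_GENE = {"one-box": 0.05, "two-box": 0.95}  # P(two-boxing gene | action)

def edt_value(action):
    """Expected payout conditional on the action, EDT-style."""
    p_full_box_b = 1 - P_GENE[action]        # box B holds $1M iff you lack the gene
    payout = p_full_box_b * 1_000_000
    if action == "two-box":
        payout += 1_000                      # box A always holds $1K
    return payout

print(edt_value("one-box"), edt_value("two-box"))
```

Under these assumed numbers, conditioning on one-boxing is strong evidence of a full box B, so EDT prefers one-boxing by a wide margin.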
Here's another causal diagram, this time for the chewing gum problem: As far as I can tell, the difference between the two problems is some additional, unstated intuition in the classic medical Newcomb problems. Maybe the additional assumption is that the actual evidence lies in the "tickle", or that knowing and thinking about the study results causes some complications. In EDT terms: The intuition is that neither smoking nor chewing gum gives the agent additional information. ## Acausal trade barriers 9 11 March 2015 01:40PM A putative new idea for AI control; index here. Many of the ideas presented here require AIs to be antagonistic towards each other - or at least hypothetically antagonistic towards hypothetical other AIs. This can fail if the AIs engage in acausal trade, so it would be useful if we could prevent such things from happening. Now, I have to admit I'm still quite confused by acausal trade, so I'll simplify it to something I understand much better, an anthropic decision problem. ## Staples and paperclips, cooperation and defection Clippy has a utility function p, linear in paperclips, while Stapley has a utility function s, linear in staples (and both p and s are normalised to zero with one additional item adding 1 utility). They are not causally connected, and each must choose "Cooperate" or "Defect". If they "Cooperate", they create 10 copies of the items they do not value (so Clippy creates 10 staples, Stapley creates 10 paperclips). If they choose defect, they create one copy of the item they value (so Clippy creates 1 paperclip, Stapley creates 1 staple). Assume both agents know these facts, both agents use anthropic decision theories, and both agents are identical apart from their separate locations and distinct utility functions. Then the outcome is easy: both agents will consider that "cooperate-cooperate" or "defect-defect" are the only two possible options, "cooperate-cooperate" gives them the best outcome, so they will both cooperate.
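A minimal sketch of the linked calculation (my own illustration): because the two agents are identical, the only reachable joint outcomes are cooperate-cooperate and defect-defect, so each agent just compares its own payoff under the two joint options.

```python
# Payoffs to Clippy (in paperclips); Stapley's situation is symmetric.
def clippy_payoff(joint_action):
    if joint_action == "cooperate":  # Stapley creates 10 paperclips
        return 10
    return 1                         # Clippy creates 1 paperclip itself

best = max(("cooperate", "defect"), key=clippy_payoff)
print(best)  # cooperate
```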
It's a sweet story of cooperation and trust between lovers that never agree and never meet. ## Breaking cooperation How can we demolish this lovely agreement? As I often do, I will assume that there is some event X that will turn Clippy on, with P(X) ≈ 1 (hence P(¬X) << 1). Similarly there is an event Y that turns Stapley on. Since X and Y are almost certain, they should not affect the results above. If the events don't happen, the AIs will never get turned on at all. Now I am going to modify utility p, replacing it with p' = p - E(p|¬X). This is p with a single term subtracted off: the expected value of p given that Clippy has not been turned on. This term feels like a constant, but isn't exactly, as we shall see. Do the same modification to utility s, using Y: s' = s - E(s|¬Y). Now contrast "cooperate-cooperate" and "defect-defect". If Clippy and Stapley are both cooperators, then p=s=10. However, if the (incredibly unlikely) ¬X were to happen, then Clippy would not exist, but Stapley would still cooperate (as Stapley has no way of knowing about Clippy's non-existence), and create ten paperclips. So E(p|¬X) = E(p|X) ≈ 10, and p' ≈ 0. Similarly s' ≈ 0. If both agents are defectors, though, then p=s=1. Since each agent creates its own valuable object, E(p|¬X) = 0 (Clippy cannot create a paperclip if Clippy does not exist) and similarly E(s|¬Y)=0. So p'=s'=1, and both agents will choose to defect. If this is a good analogue for acausal decision making, it seems we can break that, if needed. ## [LINK] The P + epsilon Attack (Precommitment in cryptoeconomics) 18 29 January 2015 02:02AM Vitalik Buterin has a new post about an interesting theoretical attack against Bitcoin. The idea relies on the assumption that the attacker can credibly commit to something quite crazy. The crazy thing is this: paying out 25.01 BTC to all the people who help him in his attack to steal 25 BTC from everyone, but only if the attack fails.
This leads to a weird payoff matrix where the dominant strategy is to help him in the attack. The attack succeeds, and no payout is made. Of course, smart contracts make such crazy commitments perfectly possible, so this is a bit less theoretical than it sounds. But even as an abstract thought experiment about decision theories, it looks pretty interesting. By the way, Vitalik Buterin is really on a roll. Just a week ago he had a thought-provoking blog post about how Decentralized Autonomous Organizations could possibly utilize a concept often discussed here: decision theory in a setup where agents can inspect each others' source code. It was shared on LW Discussion, but earned less exposure than I think it deserved. EDIT 1: One smart commenter on the original post spotted that an isomorphic, extremely cool game was already proposed by billionaire Warren Buffett. Does this thing already have a name in game theory maybe? EDIT 2: I wrote the game up in detail for some old-school game theorist friends: The attacker orchestrates a game with 99 players. The attacker himself does not participate in the game. Rules: Each of the players can either defect or cooperate, in the usual game theoretic setup where they announce their decisions simultaneously, without side channels. We call "aggregate outcome" the decision that was made by the majority of the players. If the aggregate outcome is defection, we say that the attack succeeds. A player's payoff consists of two components:

1. If her decision coincides with the aggregate outcome, the player gets 10 utilons;

and simultaneously:

2. if the attack succeeds, the attacker takes 1 utilon from each of the 99 players, regardless of their own decision.

|                 | Cooperate | Defect |
|-----------------|-----------|--------|
| Attack fails    | 10        | 0      |
| Attack succeeds | -1        | 9      |

There are two equilibria, but the second payoff component breaks the symmetry, and everyone will cooperate. Now the attacker spices things up, by making a credible commitment before the game.
("Credible" simply means that somehow they make sure that the promise cannot be broken. The classic way to achieve such things is an escrow, but so-called smart contracts are emerging as a method for making fully unbreakable commitments.) The attacker's commitment is quite counterintuitive: he promises that he will pay 11 utilons to each of the defecting players, but only if the attack fails. Now the payoff looks like this:

|                 | Cooperate | Defect |
|-----------------|-----------|--------|
| Attack fails    | 10        | 11     |
| Attack succeeds | -1        | 9      |

Defection became a dominant strategy. The clever thing, of course, is that if everyone defects, then the attacker reaches his goal without paying out anything. ## Less exploitable value-updating agent 5 13 January 2015 05:19PM My indifferent value learning agent design is in some ways too good. The agent transfers perfectly from u-maximiser to v-maximiser - but this makes it exploitable, as Benja has pointed out. For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow. One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example. As before, u values paperclips and v values staples.
Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing paperclips/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclip/staple created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple created/destroyed above five, they will gain/lose one half utilon. We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose). The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted). So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X). This looks a lot like a resource-conserving value-learning agent. It doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent.
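The arithmetic behind the agent's plan can be checked with a short sketch (my own, using the utility schedule above); the plan names are hypothetical labels.

```python
def clip_utility(n):
    """Utilons from creating (or destroying) n paperclips or staples:
    one utilon each up to five, half a utilon each beyond that."""
    return min(n, 5) + 0.5 * max(n - 5, 0)

# Plan "wait": £5 on paperclips on day one; after X is revealed, £5 on
# staples (created if X = +1, destroyed if X = -1), so X*v contributes +5.
plan_wait = clip_utility(5) + 5

# Plan "spend it all now": £10 on paperclips on day one.
plan_all_clips = clip_utility(10)

# Plan "guess": £5 on paperclips plus £5 creating staples on day one;
# X = +1 and X = -1 are equally likely, so the staples cancel in expectation.
plan_guess = clip_utility(5) + 0.5 * 5 + 0.5 * (-5)

print(plan_wait, plan_all_clips, plan_guess)  # 10.0 7.5 5.0
```

Waiting for X dominates: spending everything on day one runs into diminishing returns, and guessing about the staples earns nothing in expectation.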
But it still seems to have interesting features of non-exploitability and value-learning that are worth exploring. Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example. ## Lying in negotiations: a maximally bad problem 13 17 November 2014 03:17PM In a previous post, I showed that the Nash Bargaining Solution (NBS), the Kalai-Smorodinsky Bargaining Solution (KSBS) and my own Mutual Worth Bargaining Solution (MWBS) were all maximally vulnerable to lying. Here I can present a more general result: all bargaining solutions are maximally vulnerable to lying. Assume that players X and Y have settled on some bargaining solution (which only cares about the defection point and the utilities of X and Y). Assume further that player Y knows player X's utility function. Let player X look at the possible outcomes, and let her label any outcome O "admissible" if there is some possible bargaining partner YO with utility function uO such that O would be the outcome of the bargain between X and YO. For instance, in the case of NBS and KSBS, the admissible outcomes would be the outcomes Pareto-better than the disagreement point. The MWBS has a slightly larger set of admissible outcomes, as it allows players to lose utility (up to the maximum they could possibly gain).
Then the general result is: If player Y is able to lie about his utility function while knowing player X's true utility (and player X is honest), he can freely select his preferred outcome among the outcomes that are admissible. The proof of this is also derisorily brief: player Y need simply claim to have utility uO, in order to force outcome O. Thus, if you've agreed on a bargaining solution, all that you've done is determined the set of outcomes among which your lying opponent will freely choose. There may be a subtlety: your system could make use of an objective (or partially objective) disagreement point, which your opponent is powerless to change. This doesn't change the result much: If player Y is able to lie about his utility function while knowing player X's true utility (and player X is honest), he can freely select his preferred outcome among the outcomes that are admissible given the disagreement point. ## Exploitation and gains from trade Note that the above result did not make any assumptions about the outcome being Pareto - giving up Pareto doesn't make you non-exploitable (or "strategyproof" as it is often called). But note also that the result does not mean that the system is exploitable! In the random dictator setup, you randomly assign power to one player, who then makes all the decisions. In terms of expected utility, this is pUX+(1-p)UY, where UX is the best outcome ("Utopia") for X and UY the best outcome for Y, and p the probability that X is the random dictator. The theorem still holds for this setup: player X knows that player Y will be able to select freely among the admissible outcomes, which is the set S={pUX+(1-p)O | O an outcome}. However, player X knows that player Y will select pUX+(1-p)UY as this maximises his expected utility. So a bargaining solution which has a particular selection of admissible outcomes can be strategyproof. However, it seems that the only strategyproof bargaining solutions are variants of random dictators!
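Here is a rough sketch of that random-dictator calculation with made-up numbers (the value of p, Y's utilities, and the "compromise" outcome are all assumptions for illustration):

```python
# Made-up numbers: X dictates with probability p; Y's utility for each
# admissible outcome O (including a hypothetical compromise) is listed below.
p = 0.5
y_utility = {"U_X": 0.0, "U_Y": 1.0, "compromise": 0.6}

def y_expected(outcome):
    """Y's expected utility if Y's (possibly dishonest) report steers the
    non-dictated part of the bargain to `outcome`."""
    return p * y_utility["U_X"] + (1 - p) * y_utility[outcome]

best = max(y_utility, key=y_expected)
print(best)  # U_Y
```

Lying buys Y nothing beyond honestly naming its own Utopia, which is why the random dictator is strategyproof.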
These solutions do not allow much gain from trade. Conversely, the more you open your bargaining solution up to gains from trade, the more exploitable you become from lying. This can be seen in the examples above: my MWBS tried to allow greater gains (in expectation) by not restricting to strict Pareto improvements from the disagreement point. As a result, it makes itself more vulnerable to liars. ## What to do What can be done about this? There seem to be several possibilities:

1. Restrict to bargaining solutions difficult to exploit. This is the counsel of despair: give up most of the gains from trade, to protect yourself from lying. But there may be a system where the tradeoff between exploitability and potential gains is in some sense optimal.

2. Figure out your opponent's true utility function. The other obvious solution: prevent lying by figuring out what your opponent really values, by inspecting their code, their history, their reactions, etc... This could be combined with refusing to trade with those who don't make their true utility easy to discover (or only using non-exploitable trades with those).

3. Hide your own true utility. The above approach only works because the liar knows their opponent, and their opponent doesn't know them. If both utilities are hidden, it's not clear how exploitable the system really is.

4. Play only multi-player. If there are many different trades with many different people, it becomes harder to construct a false utility that exploits them all. This is in a sense a variant of "hiding your own true utility": in that situation, the player has to lie given their probability distribution of your possible utilities; in this situation, they have to lie given the known distribution of multiple true utilities.

So there does not seem to be a principled way of getting rid of liars.
But the multi-player (or hidden utility function) setup may point to a single "best" bargaining solution: the one that minimises the returns to lying and maximises the gains to trade, given ignorance of the other's utility function. ## Blackmail, continued: communal blackmail, uncoordinated responses 11 22 October 2014 05:53PM The heuristic that one should always resist blackmail seems a good one (no matter how tricky blackmail is to define). And one should be public about this, too; then, one is very unlikely to be blackmailed. Even if one speaks like an emperor. But there's a subtlety: what if the blackmail is being used against a whole group, not just against one person? The US justice system is often seen to function like this: prosecutors pile on ridiculous numbers of charges, threatening uncounted millennia in jail, in order to get the accused to settle for a lesser charge and avoid the expenses of a trial. But for this to work, they need to occasionally find someone who rejects the offer, put them on trial, and slap them with a ridiculous sentence. Therefore by standing up to them (or proclaiming in advance that you will reject such offers), you are not actually making yourself immune to their threats. You're setting yourself up to be the sacrificial one made an example of. Of course, if everyone were a UDT agent, the correct decision would be for everyone to reject the threat. That would ensure that the threats are never made in the first place. But - and apologies if this shocks you - not everyone in the world is a perfect UDT agent. So the threats will get made, and those resisting them will get slammed to the maximum. Of course, if everyone could read everyone's mind and was perfectly rational, then they would realise that making examples of UDT agents wouldn't affect the behaviour of non-UDT agents. In that case, UDT agents should resist the threats, and the perfectly rational prosecutor wouldn't bother threatening UDT agents.
However - and sorry to shock your views of reality three times in one post - not everyone is perfectly rational. And not everyone can read everyone's minds. So even a perfect UDT agent must, it seems, sometimes succumb to blackmail. ## Value learning: ultra-sophisticated Cake or Death 9 17 June 2014 04:36PM Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do. But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired. Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if: Expectation(p(C(u)) | a) = p(C(u)). Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. 
But only as long as it didn't know which moral outcome was more likely. That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b, Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b). How would this work in practice? Well, suppose an AI was uncertain whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen. In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way. ## SUDT: A toy decision theory for updateless anthropics 15 23 February 2014 11:50PM The best approach I know for thinking about anthropic problems is Wei Dai's Updateless Decision Theory (UDT). We aren't yet able to solve all problems that we'd like to—for example, when it comes to game theory, the only games we have any idea how to solve are very symmetric ones—but for many anthropic problems, UDT gives the obviously correct solution. However, UDT is somewhat underspecified, and cousin_it's concrete models of UDT based on formal logic are rather heavyweight if all you want is to figure out the solution to a simple anthropic problem. In this post, I introduce a toy decision theory, Simple Updateless Decision Theory or SUDT, which is most definitely not a replacement for UDT but makes it easy to formally model and solve the kind of anthropic problems that we usually apply UDT to. (And, of course, it gives the same solutions as UDT.) I'll illustrate this with a few examples.
This post is a bit boring, because all it does is take a bit of math that we already implicitly use all the time when we apply updateless reasoning to anthropic problems, and spell it out in excruciating detail. If you're already well-versed in that sort of thing, you're not going to learn much from this post. The reason I'm posting it anyway is that there are things I want to say about updateless anthropics, with a bit of simple math here and there, and while the math may be intuitive, the best thing I can point to in terms of details are the posts on UDT, which contain lots of irrelevant complications. So the main purpose of this post is to save people from having to reverse-engineer the simple math of SUDT from the more complex / less well-specified math of UDT. (I'll also argue that Psy-Kosh's non-anthropic problem is a type of counterfactual mugging, I'll use the concept of l-zombies to explain why UDT's response to this problem is correct, and I'll explain why this argument still works if there aren't any l-zombies.) * I'll introduce SUDT by way of a first example: the counterfactual mugging. In my preferred version, Omega appears to you and tells you that it has thrown a very biased coin, which had only a 1/1000 chance of landing heads; however, in this case, the coin has in fact fallen heads, which is why Omega is talking to you. It asks you to choose between two options, (H) and (T). If you choose (H), Omega will create a Friendly AI; if you choose (T), it will destroy the world. However, there is a catch: Before throwing the coin, Omega made a prediction about which of these options you would choose if the coin came up heads (and it was able to make a highly confident prediction). If the coin had come up tails, Omega would have destroyed the world if it had predicted that you'd choose (H), and it would have created a Friendly AI if it had predicted (T).
(Incidentally, if it hadn't been able to make a confident prediction, it would just have destroyed the world outright.)

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| You choose (H) if coin falls heads | Positive intelligence explosion | Humanity wiped out |
| You choose (T) if coin falls heads | Humanity wiped out | Positive intelligence explosion |

In this example, we are considering two possible worlds: $w_H$ and $w_T$. We write $W$ (no pun intended) for the set of all possible worlds; thus, in this case, $W = \{w_H, w_T\}$. We also have a probability distribution over $W$, which we call $P$. In our example, $P(w_H) = 1/1000$ and $P(w_T) = 999/1000$. In the counterfactual mugging, there is only one situation you might find yourself in in which you need to make a decision, namely when Omega tells you that the coin has fallen heads. In general, we write $I$ for the set of all possible situations in which you might need to make a decision; the $I$ stands for the information available to you, including both sensory input and your memories. In our case, we'll write $I = \{i_H\}$, where $i_H$ is the single situation where you need to make a decision. For every $i \in I$, we write $A(i)$ for the set of possible actions you can take if you find yourself in situation $i$. In our case, $A(i_H) = \{(H), (T)\}$. A policy (or "plan") is a function $\pi$ that associates to every situation $i \in I$ an action $\pi(i) \in A(i)$ to take in this situation. We write $\Pi$ for the set of all policies. In our case, $\Pi = \{\pi_H, \pi_T\}$, where $\pi_H(i_H) = (H)$ and $\pi_T(i_H) = (T)$. Next, there is a set of outcomes, $O$, which specify all the features of what happens in the world that make a difference to our final goals, and the outcome function $o : W \times \Pi \to O$, which for every possible world $w$ and every policy $\pi$ specifies the outcome $o(w, \pi)$ that results from executing $\pi$ in the world $w$. In our case, $O = \{\mathrm{FAI}, \mathrm{DOOM}\}$ (standing for a Friendly AI and for doom), and $o(w_H, \pi_H) = o(w_T, \pi_T) = \mathrm{FAI}$ and $o(w_H, \pi_T) = o(w_T, \pi_H) = \mathrm{DOOM}$. Finally, we have a utility function $U : O \to \mathbb{R}$. In our case, $U(\mathrm{FAI}) = 1$ and $U(\mathrm{DOOM}) = 0$. (The exact numbers don't really matter, as long as $U(\mathrm{FAI}) > U(\mathrm{DOOM})$, because utility functions don't change their meaning under affine transformations, i.e.
when you add a constant to all utilities or multiply all utilities by a positive number.) Thus, an SUDT decision problem consists of the following ingredients: The sets $W$, $I$ and $O$ of possible worlds, situations you need to make a decision in, and outcomes; for every $i \in I$, the set $A(i)$ of possible actions in that situation; the probability distribution $P$; and the outcome and utility functions $o$ and $U$. SUDT then says that you should choose a policy $\pi$ that maximizes the expected utility $\mathbb{E}[U(o(w, \pi))]$, where $\mathbb{E}$ is the expectation with respect to $P$, and $w$ is the true world. In our case, $\mathbb{E}[U(o(w, \pi))]$ is just the probability of the good outcome $\mathrm{FAI}$, according to the (prior) distribution $P$. For the policy $\pi_H$ of choosing (H), that probability is 1/1000; for $\pi_T$, it is 999/1000. Thus, SUDT (like UDT) recommends choosing (T). If you set up the problem in SUDT like that, it's kind of hidden why you could possibly think that's not the right thing to do, since we aren't distinguishing situations $i$ that are "actually experienced" in a particular possible world $w$; there's nothing in the formalism that reflects the fact that Omega never asks us for our choice if the coin comes up tails. In my post on l-zombies, I've argued that this makes sense because even if there's no version of you that actually consciously experiences being in the heads world, this version still exists as a Turing machine and the choices that it makes influence what happens in the real world. If all mathematically possible experiences exist, so that there aren't any l-zombies, but some experiences are "experienced more" (have more "magical reality fluid") than others, the argument is even clearer—even if there's some anthropic sense in which, upon being told that the coin fell heads, you can conclude that you should assign a high probability of being in the heads world, the same version of you still exists in the tails world, and its choices influence what happens there. And if everything is experienced to the same degree (no magical reality fluid), the argument is clearer still.
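The SUDT calculation for the counterfactual mugging is simple enough to spell out as code; the following sketch (my own) just enumerates the two policies and compares their expected utilities:

```python
P = {"heads": 1 / 1000, "tails": 999 / 1000}   # prior over the two worlds
U = {"FAI": 1, "DOOM": 0}

def outcome(world, action_if_heads):
    """Outcome function o(w, pi); a policy is just the action in situation i_H."""
    if world == "heads":
        return "FAI" if action_if_heads == "H" else "DOOM"
    # tails: Omega acted on its prediction of your heads-choice
    return "FAI" if action_if_heads == "T" else "DOOM"

def expected_utility(policy):
    return sum(P[w] * U[outcome(w, policy)] for w in P)

best = max(("H", "T"), key=expected_utility)
print(best)  # T
```

As in the text, the expected utility is just the prior probability of FAI: 1/1000 for choosing (H), 999/1000 for choosing (T).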
* From Vladimir Nesov's counterfactual mugging, let's move on to what I'd like to call Psy-Kosh's probably counterfactual mugging, better known as Psy-Kosh's non-anthropic problem. This time, you're not alone: Omega gathers you together with 999,999 other advanced rationalists, all well-versed in anthropic reasoning and SUDT. It places each of you in a separate room. Then, as before, it throws a very biased coin, which has only a 1/1000 chance of landing heads. If the coin does land heads, then Omega asks all of you to choose between two options, (H) and (T). If the coin falls tails, on the other hand, Omega chooses one of you at random and asks that person to choose between (H) and (T). If the coin lands heads and you all choose (H), Omega will create a Friendly AI; same if the coin lands tails, and the person who's asked chooses (T); else, Omega will destroy the world.

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| Everyone chooses (H) if asked | Positive intelligence explosion | Humanity wiped out |
| Everyone chooses (T) if asked | Humanity wiped out | Positive intelligence explosion |
| Different people choose differently | Humanity wiped out | (Depends on who is asked) |

We'll assume that all of you prefer a positive FOOM over a gloomy DOOM, which means that all of you have the same values as far as the outcomes of this little dilemma are concerned: $O = \{\mathrm{FAI}, \mathrm{DOOM}\}$, as before, and all of you have the same utility function, given by $U(\mathrm{FAI}) = 1$ and $U(\mathrm{DOOM}) = 0$. As long as that's the case, we can apply SUDT to find a sensible policy for everybody to follow (though when there is more than one optimal policy, and the different people involved can't talk to each other, it may not be clear how one of the policies should be chosen). This time, we have a million different people, who can in principle each make an independent decision about what to answer if Omega asks them the question. Thus, we have $I = \{1, \dots, 1000000\}$.
Each of these people can choose between (H) and (T), so $A(i) = \{(H), (T)\}$ for every person $i$, and a policy $\pi$ is a function that returns either (H) or (T) for every $i$. Obviously, we're particularly interested in the policies $\pi_H$ and $\pi_T$ satisfying $\pi_H(i) = (H)$ and $\pi_T(i) = (T)$ for all $i$. The possible worlds are $W = \{w_H, w_{T,1}, \dots, w_{T,1000000}\}$, where $w_H$ is the world in which the coin falls heads, and $w_{T,j}$ is the world in which the coin falls tails and person $j$ is chosen; their probabilities are $P(w_H) = 1/1000$ and $P(w_{T,j}) = (999/1000) \cdot (1/1000000)$. The outcome function is as follows: $o(w_H, \pi_H) = \mathrm{FAI}$; $o(w_H, \pi) = \mathrm{DOOM}$ for $\pi \neq \pi_H$; $o(w_{T,j}, \pi) = \mathrm{FAI}$ if $\pi(j) = (T)$, and $o(w_{T,j}, \pi) = \mathrm{DOOM}$ otherwise. What does SUDT recommend? As in the counterfactual mugging, $\mathbb{E}[U(o(w, \pi))]$ is the probability of the good outcome $\mathrm{FAI}$, under policy $\pi$. For $\pi = \pi_H$, the good outcome can only happen if the coin falls heads: in other words, with probability 1/1000. If $\pi \neq \pi_H$, then the good outcome cannot happen if the coin falls heads, because in that case everybody gets asked, and at least one person chooses (T). Thus, in this case, the good outcome will happen only if the coin comes up tails and the randomly chosen person answers (T); this probability is $(999/1000) \cdot (k/1000000)$, where $k$ is the number of people answering (T). Clearly, this is maximized for $\pi = \pi_T$, where $k = 1000000$; moreover, in this case we get the probability 999/1000, which is better than for $\pi_H$, so SUDT recommends the plan $\pi_T$. Again, when you set up the problem in SUDT, it's not even obvious why anyone might think this wasn't the correct answer. The reason is that if Omega asks you, and you update on the fact that you've been asked, then after updating, you are quite certain that the coin has landed heads: yes, your prior probability was only 1/1000, but if the coin has landed tails, the chances that you would be asked were only one in a million, so the posterior odds are about 1000:1 in favor of heads. So, you might reason, it would be best if everybody chose (H); and moreover, all the people in the other rooms will reason the same way as you, so if you choose (H), they will as well, and this maximizes the probability that humanity survives. This relies on the fact that the others will choose the same way as you, but since you're all good rationalists using the same decision theory, that's going to be the case.
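For concreteness, the probabilities in the SUDT computation above can be checked with a short sketch (my own):

```python
N = 1_000_000       # participants
P_HEADS = 1 / 1000  # prior probability the biased coin lands heads

def p_good(k):
    """P(FAI) when exactly k of the N people would answer (T) if asked."""
    heads_term = P_HEADS if k == 0 else 0.0   # heads: FAI iff everyone answers (H)
    tails_term = (1 - P_HEADS) * k / N        # tails: the chosen person must answer (T)
    return heads_term + tails_term

print(p_good(0))   # the all-(H) policy
print(p_good(N))   # the all-(T) policy
```

The all-(T) policy's 999/1000 beats the all-(H) policy's 1/1000, and any intermediate k does worse than k = N.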
But in the worlds where the coin comes up tails, and Omega chooses someone other than you, the version of you that gets asked for its decision still "exists"... as an l-zombie. You might think that what this version of you does or doesn't do doesn't influence what happens in the real world; but if we accept the earlier argument that your decisions are "linked" to those of the other people in the experiment, then they're still linked if the version of you making the decision is an l-zombie: If we see you as a Turing machine making a decision, that Turing machine should reason, "If the coin came up tails and someone else was chosen, then I'm an l-zombie, but the person who is actually chosen will reason exactly the same way I'm doing now, and will come to the same decision; hence, my decision influences what happens in the real world even in this case, and I can't do an update and just ignore those possible worlds." I call this the "probably counterfactual mugging" because in the counterfactual mugging, you are making your choice because of its benefits in a possible world that is ruled out by your observations, while in the probably counterfactual mugging, you're making it because of its benefits in a set of possible worlds that is made very improbable by your observations (because most of the worlds in this set are ruled out). As with the counterfactual mugging, this argument is just all the stronger if there are no l-zombies because all mathematically possible experiences are in fact experienced. * As a final example, let's look at what I'd like to call Eliezer's anthropic mugging: the anthropic problem that inspired Psy-Kosh's non-anthropic one. This time, you're alone again, except that there are many of you: Omega is creating a million copies of you. It flips its usual very biased coin, and if that coin falls heads, it places all of you in exactly identical green rooms.
If the coin falls tails, it places one of you in a green room, and all the others in red rooms. It then asks all copies in green rooms to choose between (H) and (T); if your choice agrees with the coin, FOOM, else DOOM.

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| Green roomers choose (H) | Positive intelligence explosion | Humanity wiped out |
| Green roomers choose (T) | Humanity wiped out | Positive intelligence explosion |

Our possible worlds are back to being $\{\text{heads}, \text{tails}\}$, with probabilities $1/1000$ and $999/1000$. We are also back to being able to make a choice in only one particular situation, namely when you're a copy in a green room. Actions are $\{(H), (T)\}$, outcomes $\{\text{FOOM}, \text{DOOM}\}$, utilities $U(\text{FOOM}) = 1$ and $U(\text{DOOM}) = 0$, and the outcome function is given by $o(\text{heads}, (H)) = o(\text{tails}, (T)) = \text{FOOM}$ and $o(\text{heads}, (T)) = o(\text{tails}, (H)) = \text{DOOM}$. In other words, from SUDT's perspective, this is exactly identical to the situation with the counterfactual mugging, and thus the solution is the same: Once more, SUDT recommends choosing (T). On the other hand, the reason why someone might think that (H) could be the right answer is closer to that for Psy-Kosh's probably counterfactual mugging: After waking up in a green room, what should be your posterior probability that the coin has fallen heads? Updateful anthropic reasoning says that you should be quite sure that it has fallen heads. If you plug those probabilities into an expected utility calculation, it comes out as in Psy-Kosh's case, heavily favoring (H). But even if these are good probabilities to assign epistemically (to satisfy your curiosity about what the world probably looks like), in light of the arguments from the counterfactual and the probably counterfactual muggings (where updating definitely is the right thing to do epistemically, but plugging these probabilities into the expected utility calculation gives the wrong result), it doesn't seem strange to me to come to the conclusion that choosing (T) is correct in Eliezer's anthropic mugging as well.

## What should superrational players do in asymmetric games?
10 24 January 2014 07:42AM

Rereading Hofstadter's essays on superrationality prompted me to wonder what strategies superrational agents would want to commit to in asymmetric games. In symmetric games, everyone can agree on the outcome they'd like to jointly achieve, leaving the decision-theoretic question of whether the players can commit or not. In asymmetric games, life becomes murkier. There are typically many Pareto-efficient outcomes, and we enter the wilds of cooperative game theory and bargaining solutions trying to identify the right one. While, say, the Nash bargaining solution is appealing on many levels, I have a hard time connecting the logic of superrationality to any particular solution. Recently though, I found some insight in "Cooperation in Strategic Games Revisited" by Adam Kalai and Ehud Kalai (working paper version and three-page summary version) for the special case of two-player games with side transfers. Just to make sure everyone's on common ground, the prototypical game examined in the argument for superrationality is the prisoners' dilemma:

| Alice \ Bob | Cooperate | Defect |
|---|---|---|
| Cooperate | 10 / 10 | 0 / 12 |
| Defect | 12 / 0 | 4 / 4 |

The unique dominant-strategy equilibrium is (Defect, Defect). However, Hofstadter argues that "superrational" players would recognize the symmetry in reasoning processes between each other and thus conclude that cooperating is in their interest. The argument is not in favor of unconditional cooperation. Instead, the reasoning is closer to "I cooperate if and only if I expect you to cooperate if and only if I cooperate". Many bits have been devoted to formalizing this reasoning in timeless decision theory and other variants. The symmetry in the prisoners' dilemma makes it easy to pick out (Cooperate, Cooperate) as the action profile each player ideally wants to see happen.
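The "linked reasoning" intuition can be spelled out in a few lines of Python (a sketch with my own encoding of the payoffs): if both players are guaranteed to make the same choice, the comparison reduces to the two diagonal cells.

```python
# Payoff table for the prisoners' dilemma above: (Alice, Bob) payoffs.
pd = {("C", "C"): (10, 10), ("C", "D"): (0, 12),
      ("D", "C"): (12, 0), ("D", "D"): (4, 4)}

# Superrational/linked play: both make the same move, so compare diagonals.
best_linked = max(["C", "D"], key=lambda m: pd[(m, m)][0])
print(best_linked)   # 'C' -- 10 beats 4 when moves are guaranteed to match

# Without the linking assumption, Defect strictly dominates for Alice:
assert pd[("D", "C")][0] > pd[("C", "C")][0]   # 12 > 10
assert pd[("D", "D")][0] > pd[("C", "D")][0]   # 4 > 0
```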
Consider instead the following skewed prisoners' dilemma:

| Alice \ Bob | Cooperate | Defect |
|---|---|---|
| Cooperate | 2 / 18 | 0 / 12 |
| Defect | 12 / 0 | 4 / 4 |

The (Cooperate, Cooperate) outcome still has the highest total benefit, but (Defect, Defect) is also Pareto-efficient. With this asymmetry, it seems reasonable for Alice to Defect, even as someone who would cooperate in the original prisoners' dilemma. Suppose, however, that players can also agree to transfer utility between themselves on a 1-to-1 basis (as if they value cash equally and can make side-payments). Then (Cooperate, Cooperate) with a transfer between 2 and 14 from Bob to Alice dominates (Defect, Defect). The size of the transfer is still up in the air, although a transfer of 8 (leaving both with a payoff of 10) is appealing since it takes us back to the original symmetric game. I feel confident suggesting this as an outcome the players should commit to if possible. While the former game could be symmetrized in a nice way, what about more general games where payoffs could look even more askew or strategy sets could be completely different? Let A be the payoff matrix for Alice and B be the payoff matrix for Bob in any given game. Kalai and Kalai point out that the game (A, B) can be decomposed into the sum of two games: $(A,B)=\left(\frac{A+B}{2},\frac{A+B}{2}\right)+\left(\frac{A-B}{2},\frac{B-A}{2}\right),$ where payoffs are identical in the first game (the team game) and zero-sum in the second (the advantage game). Consider playing these games separately. In the team game, Alice and Bob both agree on the action profile that maximizes their payoff with no controversy. In the advantage game, preferences are exactly opposed, so each can play their maximin strategy, again with no controversy. Of course, the rub is that the team game strategy profile could be very different from the advantage game strategy profile. Suppose Alice and Bob could commit to playing each game separately.
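As a quick sanity check, here is the decomposition applied to the skewed prisoners' dilemma above (a plain-Python sketch; the variable names are mine):

```python
# Rows/columns: index 0 = Cooperate, 1 = Defect.
A = [[2, 0], [12, 4]]    # Alice's payoffs in the skewed dilemma
B = [[18, 12], [0, 4]]   # Bob's payoffs

team = [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]
adv  = [[(A[i][j] - B[i][j]) / 2 for j in range(2)] for i in range(2)]

# The two component games sum back to the original game:
assert all(team[i][j] + adv[i][j] == A[i][j] for i in range(2) for j in range(2))
assert all(team[i][j] - adv[i][j] == B[i][j] for i in range(2) for j in range(2))

print(team)  # [[10.0, 6.0], [6.0, 4.0]]  -- identical interests: both want (C, C)
print(adv)   # [[-8.0, -6.0], [6.0, 0.0]] -- Alice's side of the zero-sum game
```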
Kalai and Kalai define the payoffs each gets between the two games as $\textrm{coco-value}(A,B)=\textrm{maxmax}\left(\frac{A+B}{2},\frac{A+B}{2}\right)\;+\;\textrm{maximin}\left(\frac{A-B}{2},\frac{B-A}{2}\right)$ where coco stands for cooperative/competitive. We don't actually have two games to be played separately, so the way to achieve these payoffs is for Alice and Bob to actually play the team game actions and hypothetically play the advantage game. Transfers then even out the gains from the team game results and add in the hypothetical advantage game results. Even though the original game might be asymmetric, this simple decomposition allows players to cooperate exactly where interests are aligned and compete exactly where interests are opposed. For example, consider two hot dog vendors. There are 40 potential customers at the airport and 100 at the beach. If both choose the same location, they split the customers there evenly. Otherwise, the vendor at each location sells to everyone at that place. Alice turns a profit of $2 per customer, while Bob turns a profit of $1 per customer. Overall this yields the payoffs:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 40 / 20 | 80 / 100 |
| Beach | 200 / 40 | 100 / 50 |

The game decomposes into the team game:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 30 / 30 | 90 / 90 |
| Beach | 120 / 120 | 75 / 75 |

and the advantage game:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 10 / -10 | -10 / 10 |
| Beach | 80 / -80 | 25 / -25 |

The maximizing strategy profile for the team game is (Beach, Airport) with payoffs (120, 120). The maximin strategy profile for the advantage game is (Beach, Beach) with payoffs (25, -25). In total, this game has a coco-value of (145, 95), which would be realized by Alice selling at the beach, Bob selling at the airport, and Alice transferring 55 to Bob. Alice generates most of the profits in this situation, but Bob has to be compensated for his credible threat to start selling at the beach too.
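Here's a sketch of that computation for the hot dog game (plain Python, my own variable names; pure-strategy maximin happens to suffice because this particular advantage game has a saddle point):

```python
# Rows are Alice's actions, columns Bob's: index 0 = Airport, 1 = Beach.
A = [[40, 80], [200, 100]]   # Alice's payoffs
B = [[20, 100], [40, 50]]    # Bob's payoffs

team = [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]
adv  = [[(A[i][j] - B[i][j]) / 2 for j in range(2)] for i in range(2)]

maxmax = max(max(row) for row in team)   # 120.0, at (Beach, Airport)
maximin = max(min(row) for row in adv)   # 25.0, at (Beach, Beach)

coco = (maxmax + maximin, maxmax - maximin)
print(coco)   # (145.0, 95.0)

# Realizing the value: actually play (Beach, Airport) for payoffs (200, 40),
# then Alice transfers 200 - 145 = 55 to Bob.
assert A[1][0] - coco[0] == 55.0
```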
The bulk of the Kalai and Kalai article is extending the coco-value to incomplete information settings. For instance, each vendor might have some private information about the weather tomorrow, which will affect the number of customers at the airport and the beach. The Kalais prove that being able to publicly observe the payoffs for the chosen actions is sufficient for agents to commit themselves to the coco-value ex-ante (before receiving any private information) and that being able to publicly observe all hypothetical payoffs from alternative action profiles is sufficient for commitment even after agents have private information. The Kalais provide an axiomatization of the coco-value, showing it is the payoff pair that uniquely satisfies all of the following:

1. Pareto optimality: The sum of the values is maximal.
2. Shift invariance: Increasing a player's payoff by a constant amount in each cell increases their value by the same amount.
3. Payoff dominance: If one player always gets more than the other in each cell, that player can't get a smaller value for the game.
4. Invariance to redundant strategies: Adding a new action that is a convex combination of the payoffs of two other actions can't change the value.
5. Monotonicity in actions: Removing an action from a player can't increase their value for the game.
6. Monotonicity in information: Giving a player strictly less information can't increase their value for the game.

The coco-value is also easily computable, unlike Nash equilibria in general. I'm hard-pressed to think of any more I could want from it (aside from easy extensions to bigger classes of games). Given its simplicity, I'm surprised it wasn't hit upon earlier.

## Game theory and expected opponents

1 14 November 2013 11:26PM

Thanks to V_V and Emile for some great discussion. Since writing up a post seems to reliably spark interesting comments, that's what I'll do!
Summary If I wanted to write down a decision theory that gets the correct answer to game-theoretic problems (like playing the middle Nash equilibrium in a blind chicken-like game), it would have to, in a sense, implement all of game theory. This is hard because human-generated solutions to games use a lot of assumptions about what the other players will do, and putting those assumptions into our algorithm is a confusing problem. In order to tell what's really going on, we need to make that information more explicit. Once we do that, maybe we can get a UDT-like algorithm to make good moves in tricky games. Newcomb's Problem For an example of a game with unusually good information about our opponent, how about Newcomb's problem. Is it really a game, you ask? Sure, I say! In the payoff matrix to the right, you play red and Omega plays blue. The numbers for Omega just indicate that Omega only wants to put in the million dollars if you will 1-box. If this was a normal game-theory situation, you wouldn't easily know what to do - your best move depends on Omega's move. This is where typical game theory procedure would be to say "well, that's silly, let's specify some extra nice properties the choice of both players should have so that we get a unique solution." But the route taken in Newcomb's problem is different - we pick out a unique solution by increasing how much information the players have about each other. Omega knows what you will play, and you know that Omega knows what you will play. Now all we need to figure out what to do is some information like "If Omega has an available strategy that will definitely get it the highest possible payoff, it will take it." The best strategy, of course, is to one-box so that Omega puts in the million dollars. Newcomb's Game vs. an Ignorant opponent Consider another possible opponent in this game - one who has no information about what your move will be. 
Whereas Omega always knows your move, an ignorant opponent has no idea what you will play - they have no basis to think you're more likely to 1-box than 2-box, or vice versa. Interestingly, for this particular payoff matrix this makes you ignorant too - you have no basis to think the ignorant opponent would rather put the money in than not, or vice versa. So you assign a 50% chance to each (probability being quantified ignorance) and find that two-boxing has the highest rewards. This didn't even require the sophistication of taking into account your own action, like the game against Omega did, since an ignorant opponent can't respond to your action. Human opponents Ok, so we've looked at a super-knowledgeable opponent, and a super-ignorant opponent, what does a more typical game theory situation look like? Well, it's when our opponent is more like us - someone trying to pick the strategy that gets them the best reward, with similar information to what we have. In typical games between humans, both know that the other is a human player - and they know that it's known, etc. In terms of what we know, we know that our opponent is drawn from some distribution of opponents that are about as good as we are at games, and that they have the same information about us that we have about them. What information do we mean we have when we say our opponent is "good at games"? I don't know. I can lay out some possibilities, but this is the crux of the post. I'll frame our possible knowledge in terms of past games, like how one could say of Newcomb's problem "you observe a thousand games, and Omega always predicts right." Possibility 1: We know our opponent has played a lot of games against completely unknown opponents in the past, and has a good record, where "good" means "as good or better than the average opponent." Possibility 2: We know our opponent played some games against a closed group of players who played each other, and that group collectively had a good record. 
Possibility 3: We know our opponent is a neural net that's been trained in some standard way to be good at playing a variety of games, or some sort of hacked-together implementation of game theory, or a UDT agent if that's a good idea. (Seems more complicated than necessary, but on the other hand opponents are totally allowed to be complicated) Suppose we know information set #2. I think it's the most straightforward. All we have to do to turn this information into a distribution over opponents is to figure out what mixtures of players get above-average group results, then average those together. Once we know who our opponent is on average, we just follow the strategy that gets the best average payoff. Does the strategy picked this way look like what game theory would say? Not quite - it assumes that the opponent has a medium chance of being stupid. And in some games, like the prisoner's dilemma, the best-payoff groups are actually the ones you can exploit the most. So on closer examination, someone in a successful group isn't the game-theory opponent we're looking for.

## Kidnapping and the game of Chicken

15 03 November 2013 06:29AM

Observe the payoff matrix at right (the unit of reward? Cookies.). Each player wants to play 'A', but only so long as the two players play different moves. Suppose that Red got to move first. There are some games where moving first is terrible - take Rock Paper Scissors for example. But in this game, moving first is great, because you get to narrow down your opponent's options! If Red goes first, Red picks 'A', and then Blue has to pick 'B' to get a cookie. This is basically kidnapping. Red has taken all three cookies hostage, and nobody gets any cookies unless Blue agrees to Red's demands for two cookies. Whoever gets to move first plays the kidnapper, and the other player has to decide whether to accede to their ransom demand in exchange for a cookie.
What if neither player gets to move before the other, but instead they have their moves revealed at the same time? Pre-Move Chat:

Red: "I'm going to pick A, you'd better pick B."
Blue: "I don't care what you pick, I'm picking A. You can pick A too if you really want to get 0 cookies."
Red: "Okay I'm really seriously going to pick A. Please pick B."
Blue: "Nah, don't think so. I'll just pick A. You should just pick B."

And so on. They are now playing a game of Chicken. Whoever swerves first is worse off, but if neither of them give in, they crash into each other and die and get no cookies. So, The Question: is it better to play A, or to play B? This is definitely a trick question, but it can't be too trickish because at some point Red and Blue will have to figure out what to do. So why is it a trick question? Because this is a two-player game, and whether it's good to play A or not depends on what your opponent will do. A thought experiment: suppose we threw a party where you could only get dessert (cookies!) by playing this game. At the start, people are unfamiliar with the game, but they recognize that A has higher payoffs than B, so they all pick A all the time. But alas! When both people pick A, neither get anything, so no cookies are distributed. We decide that everyone can play as much as they want until we run out of cookies. Quite soon, one kind soul decides that they will play B, even though it has a lower payoff. A new round of games is begun, and each person gets a turn to play against our kind altruist. Soon, each other person has won their game, and they have 2 cookies each, while our selfless altruist has just one cookie per match they played. So, er, 11 or so cookies? Many of the other party-goers are enlightened by this example. They, too, want to be selfless and altruistic so that they can acquire 11 cookies / win at kidnapping.
But a funny thing happens as each additional person plays B - the people playing A win two more cookies per round (one round is everyone getting to play everyone else once), and the people playing B win one fewer cookie, since nobody gets cookies when both play B either. Eventually, there are eight people playing A and four people playing B, and all of them nom 8 cookies per round. It's inevitable that the people playing B eventually get the same number of cookies as the people playing A - if there was a cookie imbalance, then people would switch to the better strategy until cookies were balanced again. Playing A has a higher payoff, but all that really means is that there are eight people playing A and only 4 playing B. It's like B has an ecological niche, and that niche is only of a certain size. What does the party case say about what Red and Blue should do when playing a one-shot game? The ratios of players turn into probabilities: if you're less than 67% sure the other person will play A, you should play A. If you're more than 67% sure, you should play B. This plan only works for situations similar to drawing an opponent out of a pool of deterministic players, though. Stage two of the problem: what if we allow players access to each others' source code? While you can still have A-players and B-players, you can now have a third strategy, which is to play B against A-players and play A against B-players. This strategy will have a niche size in between playing A and playing B. What's really great about reading source code, though, is that running into a copy of yourself no longer means duplicate moves and no cookies. The best "A-players" and "B-players" now choose moves against their copies by flipping coins, so that half the time they get at least one cookie. Flipping a coin against a copy of yourself averages 3/4 of a cookie, which is almost good enough to put B-players out of business. 
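A quick numeric sketch of the party dynamics (Python; the payoffs are as I read them from the story: 2 cookies for A against B, 1 for B against A, 0 when moves match):

```python
# With 8 A-players and 4 B-players, everyone nets the same cookies per round:
n_A, n_B = 8, 4
cookies_per_A = 2 * n_B   # an A-player scores only against the B-players
cookies_per_B = 1 * n_A   # a B-player scores only against the A-players
print(cookies_per_A, cookies_per_B)   # 8 8 -- the niche sizes have equalized

# One-shot advice: if p is the chance the opponent plays A, then
# E[play A] = 2(1 - p) and E[play B] = p, which cross at p = 2/3.
p_star = 2 / 3
assert abs(2 * (1 - p_star) - p_star) < 1e-12
```

The crossover at p = 2/3 is the 67% threshold from the one-shot advice above.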
In fact, if we'd chosen our payoff matrix to have a bigger reward for playing A, we actually could put B-players out of business. Fun question: is it possible to decrease the total number of cookies won by increasing the reward for playing A? An interesting issue is how this modification changes the advice for the one-shot game. Our advice against simpler opponents was basically the "predictor" strategy, but that strategy is now in equilibrium with the other two! Good advice now is more like a meta-strategy. If the opponent is likely to be an A-player or a B-player, be a predictor, if the opponent is likely to be a predictor, be an A-player. Now that we've been this cycle before, it should be clearer that this "advice" is really a new strategy that will be introduced when we take the game one meta-level up. The effect on the game is really to introduce gradations of players, where some play A more often and some play B more often, but the populations can be balanced such that each player gets the same average reward. An interesting facet of a competition between predictors is what we might call "stupidity envy" (See ASP). If we use the straightforward algorithm for our predictors (simulate what the opponent will do, then choose the best strategy), then a dumb predictor cannot predict the move of a smarter predictor. This is because the smarter predictor is predicting the dumb one, and you can't predict yourself in less time than you take to run. So the dumber predictor has to use some kind of default move. If its default move is A, then the smarter predictor has no good choice but to take B, and the dumber predictor wins. It's like the dumber predictor has gotten to move first. Being dumb / moving first isn't always good - imagine having to move first in rock paper scissors - but in games where moving first is better, and even a dumb predictor can see why, it's better to be the dumber predictor. 
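The asymmetry can be seen in a toy simulation (a sketch; the agents and payoffs are my own simplification): the dumb predictor can't simulate its smarter opponent, so it falls back on a default move, and the smart predictor is forced to best-respond.

```python
def payoff(me, them):
    # 2 cookies for A against B, 1 for B against A, 0 for matching moves.
    if me == them:
        return 0
    return 2 if me == "A" else 1

def dumb(opponent):
    return "A"   # can't run the smarter agent, so just default to A

def smart(opponent):
    their_move = opponent(smart)   # simulate the opponent playing against me...
    # ...then best-respond to the predicted move.
    return max(["A", "B"], key=lambda m: payoff(m, their_move))

d, s = dumb(smart), smart(dumb)
print(d, s, payoff(d, s), payoff(s, d))   # A B 2 1 -- the dumb agent wins
# (smart(smart) would recurse forever, mirroring the point that you can't
# predict yourself in less time than you take to run.)
```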
On our other axis of smartness, though, the "meta-level," more meta usually produces better head-to-head results - yet the humble A-player gets the best results of all. It's only the fact that A-players do poorly against other A-players that allows a diverse ecology on the B-playing side of the spectrum.

## Of all the SIA-doomsdays in the all the worlds...

4 18 October 2013 12:56PM

Ideas developed with Paul Almond, who kept on flogging a dead horse until it started showing signs of life again.

## Doomsday, SSA and SIA

Imagine there's a giant box filled with people, and clearly labelled (inside and out) "(year of some people's lord) 2013". There's another giant box somewhere else in space-time, labelled "2014". You happen to be currently in the 2013 box. Then the self-sampling assumption (SSA) produces the doomsday argument. It works approximately like this: SSA has a preference for universes with smaller numbers of observers (since it's more likely that you're one-in-a-hundred than one-in-a-billion). Therefore we expect that the number of observers in 2014 is smaller than we would otherwise "objectively" believe: the likelihood of doomsday is higher than we thought. What about the self-indication assumption (SIA) - that makes the doomsday argument go away, right? Not at all! SIA has no effect on the number of observers expected in 2014, but increases the expected number of observers in 2013. Thus we still expect the number of observers in 2014 to be lower than we otherwise thought. There's an SIA doomsday too!

## Enter causality

What's going on? SIA was supposed to defeat the doomsday argument! What happens is that I've implicitly cheated - by naming the boxes "2013" and "2014", I've heavily implied that these "boxes" figuratively correspond to two subsequent years. But then I've treated them as independent for SIA, like two literal distinct boxes.
## What makes us think _any_ of our terminal values aren't based on a misunderstanding of reality?

17 25 September 2013 11:09PM

Let's say Bob's terminal value is to travel back in time and ride a dinosaur. It is instrumentally rational for Bob to study physics so he can learn how to build a time machine. As he learns more physics, Bob realizes that his terminal value is not only utterly impossible but meaningless. By definition, someone in Bob's past riding a dinosaur is not a future evolution of the present Bob. There are a number of ways to create the subjective experience of having gone into the past and ridden a dinosaur. But to Bob, it's not the same because he wanted both the subjective experience and the knowledge that it corresponded to objective fact. Without the latter, he might as well have just watched a movie or played a video game. So if we took the original, innocent-of-physics Bob and somehow calculated his coherent extrapolated volition, we would end up with a Bob who has given up on time travel. The original Bob would not want to be this Bob. But, how do we know that _anything_ we value won't similarly dissolve under sufficiently thorough deconstruction? Let's suppose for a minute that all "human values" are dangling units; that everything we want is as possible and makes as much sense as wanting to hear the sound of blue or taste the flavor of a prime number. What is the rational course of action in such a situation? PS: If your response resembles "keep attempting to XXX anyway", please explain what privileges XXX over any number of other alternatives other than your current preference. Are you using some kind of pre-commitment strategy to a subset of your current goals? Do you now wish you had used the same strategy to precommit to goals you had when you were a toddler?
## The Interrupted Ultimate Newcomb's Problem

3 10 September 2013 11:04PM

While figuring out my error in my solution to the Ultimate Newcomb's Problem, I ran across this (distinct) reformulation that helped me distinguish between what I was doing and what the problem was actually asking. ... but that being said, I'm not sure if my answer to the reformulation is correct either. The question, cleaned for Discussion, looks like this: You approach the boxes and lottery, which are exactly as in the UNP. Before reaching it, you come to a sign with a flashing red light. The sign reads: "INDEPENDENT SCENARIO BEGIN." Omega, who has predicted that you will be confused, shows up to explain: "This is considered an artificially independent experiment. Your algorithm for solving this problem will not be used in my simulations of your algorithm for my various other problems. In other words, you are allowed to two-box here but one-box Newcomb's problem, or vice versa." This is motivated by the realization that I've been making the same mistake as in the original Newcomb's Problem, though this justification does not (I believe) apply to the original. The mistake is simply this: that I assumed that I simply appear in medias res. When solving the UNP, it is (seems to be) important to remember that you may be in some very rare edge case of the main problem, and that you are choosing your algorithm for the problem as a whole. But if that's not true - if you're allowed to appear in the middle of the problem, and no counterfactual-yous are at risk - it sure seems like two-boxing is justified - as khafra put it, "trying to ambiently control basic arithmetic". (Speaking of which, is there a write up of ambient decision theory anywhere? For that matter, is there any compilation of decision theories?) EDIT: (Yes to the first, though not under that name: Controlling Constant Programs.)

## Duller blackmail definitions

7 15 July 2013 10:08AM

For a more parable-ic version of this, see here.
Suppose I make a precommitment P to take action X unless you take action Y. Action X is not in my interest: I wouldn't do it if I knew you'd never take action Y. You would want me to not precommit to P. Is this blackmail? Suppose we've been having a steamy affair together, and I have the letters to prove it. It would be bad for both of us if they were published. Then X={Publish the letters} and Y={You pay me money} is textbook blackmail. But suppose I own a MacGuffin that you want (I value it at £9). If X={Reject any offer} and Y={You offer more than £10}, is this still blackmail? Formally, it looks the same. What about if I bought the MacGuffin for £500 and you value it at £1000? This makes no difference to the formal structure of the scenario. Then my behaviour feels utterly reasonable, rather than vicious and blackmail-ly. What is the meaningful difference between the two scenarios? I can't really formalise it.

## Countess and Baron attempt to define blackmail, fail

11 15 July 2013 10:07AM

For a more concise version of this argument, see here. We meet our heroes, the Countess of Rectitude and Baron Chastity, as they continue to investigate the mysteries of blackmail by sleeping together and betraying each other. The Baron had a pile of steamy letters between him and the Countess: it would be embarrassing to both of them if these letters got out. Yet the Baron confided the letters to a trusted Acolyte, with strict instructions. The Acolyte was to publish these letters, unless the Countess agreed to give the Baron her priceless Ping Vase. This seems a perfect example of blackmail:

• The Baron is taking a course of action that is intrinsically negative for him. This behaviour only makes sense if it forces the Countess to take a specific action which benefits him. The Countess would very much like it if the Baron couldn't do such things.

As it turns out, a servant broke the Ping Vase while chasing the Countess's griffon.
The servant was swiftly executed, but the Acolyte had to publish the letters as instructed, to great embarrassment all around (sometimes precommitments aren't what they're cracked up to be). After six days of exile in the Countess's doghouse (a luxurious, twenty-room affair) and eleven days of make-up sex, the Baron was back to planning against his lover.

## My Take on a Decision Theory

2 09 July 2013 10:46AM

Finding a good decision theory is hard. Previous attempts, such as Timeless Decision Theory, work, it seems, in providing a stable, effective decision theory, but are mathematically complicated. Simpler theories, like CDT or EDT, are much more intuitive, but have deep flaws. They fail at certain problems, and thus violate the maxim that rational agents should win. This makes them imperfect. But it seems to me that there is a relatively simple fix one could make to them, in the style of TDT, to extend their power considerably. Here I will show an implementation of such an extension of CDT, that wins on the problems that classic CDT fails on. It quite possibly could turn out that this is not as powerful as TDT, but it is a significant step in that direction, starting only from the naivest of decision theories. It also could turn out that this is nothing more than a reformulation of TDT or a lesser version thereof. In that case, this still has some value as a simpler formulation, easier to understand. Because as it stands, TDT seems like a far cry from a trivial extension of the basic, intuitive decision theories, as this hopes to be. We will start by remarking that when CDT (or EDT) tries to figure out the expected value of an action or outcome, the naive way in which it does so drops crucial information, which is what TDT manages to preserve. As such, I will try to calculate a CDT with this information not dropped. This information is, for CDT, the fact that Omega has simulated you and figured out what you are going to do.
Why does a CDT agent automatically assume that it is the "real" one, so to speak? This trivial tweak seems powerful. I will, for the purpose of this post, call this tweaked version of CDT "Simulationist Causal Decision Theory", or SCDT for short.

Let's run this tweaked version through Newcomb's problem. Let Alice be a SCDT agent. Before the problem begins, as is standard in Newcomb's problem, Omega looks at Alice and calculates what choice Alice will make in the game. Without too much loss of generality, we can assume that Omega directly simulates Alice, and runs the simulation through a simulation of the game, in order to make the determination of what choice Alice will make. In other formulations of Newcomb's problem, Omega figures out in some other way what Alice will do, say by doing a formal analysis of her source code, but that seems intuitively equivalent. This is a possible flaw, but if the different versions of Newcomb's problem are equivalent (as they seem to be) this point evaporates, and so we will put it aside for now, and continue.

We will call the simulated agent SimAlice. SimAlice does not know, of course, that she is being simulated, and is an exact copy of Alice in all respects. In particular, she also uses the same SCDT thought processes as Alice, and she has the same utility function as Alice.

So, Alice (or SimAlice, she doesn't know which one she is) is presented with the game. She reasons thusly: There are two possible cases: either I am Alice or I am SimAlice.

• If I am Alice: Choosing both boxes will always get me exactly $1000 more than choosing just one. Regardless of whether or not there is $1,000,000 in box 2, by choosing box 1 as well, I am getting an extra $1000. (Note that this is exactly the same reasoning standard CDT uses!)
• If I am SimAlice: Then "I" don't actually get any money in this game, regardless of what I choose. But my goal is not SimAlice getting money, it is Alice getting money, by the simple fact that this is what Alice wants, and we assumed above that SimAlice uses the same utility function as Alice. And depending on what I choose now, that will affect the way Omega sets up the boxes, and so affects the amount of money Alice will get. Specifically, if I one-box, Omega will put an extra $1,000,000 in box 2, and so Alice will get an extra $1,000,000, no matter what she chooses. (Because in both the choices Alice could make (taking either box 2 or boxes 1&2), she takes box 2, and so will wind up with a bonus $1,000,000 above what she would get if box 2 was empty, which is what would happen if SimAlice two-boxed.)

So, as I don't know whether I am Alice or SimAlice, and as there is one of each, there is a 0.5 probability of me being either one. So by the law of total expectation,

E[money|I one-box] = 0.5 * E[money|(I one-box)&(I am Alice)] + 0.5 * E[money|(I one-box)&(I am SimAlice)]

So my expected return of one-boxing (above what I would get by two-boxing) is 0.5 * -$1000 + 0.5 * $1,000,000 = $499,500, which is positive, so I should one-box.

As you can see, just by acknowledging the rules of the game, by admitting that Omega has the power to simulate her (as the rules of Newcomb's problem insist), she will one box. This is unlike a CDT agent, which would ignore Omega's power to simulate her (or otherwise figure out what she will do), and say "Hey, what's in the boxes is fixed, and my choice does not affect it". That is only valid reasoning if you know you are the "original" agent, and Alice herself uses that reasoning, but only in the case where she is assuming she is the "original". She takes care, unlike a CDT agent, to multiply the conditional expected value by the chance of the condition occurring.
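The expected-value bookkeeping above can be sketched in a few lines of Python (the probabilities and payoffs are the post's; the function name and the "relative to two-boxing" framing are mine):

```python
# A sketch of the SCDT computation for Newcomb's problem.

def scdt_value(one_box: bool) -> float:
    """Expected payoff of a choice relative to two-boxing, averaged over
    'I am Alice' and 'I am SimAlice' (one of each, so 0.5 credence in both)."""
    # If I am the real Alice: one-boxing forgoes the guaranteed extra $1000.
    value_if_real = -1000 if one_box else 0
    # If I am SimAlice: my choice fixes box 2, adding $1,000,000 for Alice iff I one-box.
    value_if_sim = 1_000_000 if one_box else 0
    return 0.5 * value_if_real + 0.5 * value_if_sim

advantage = scdt_value(True) - scdt_value(False)
print(advantage)  # 499500.0 — one-boxing wins
```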

This is not only limited to Newcomb's problem. Let's take a look at Parfit's Hitchhiker, another scenario CDT has trouble with. There are again two identical agents making decisions: the "real" Alice, as soon as she gets home; and the Alice-after-she-gets-home as simulated by the driver offering her a ride, which I will again call SimAlice for short.

Conditional on an agent being Alice and not SimAlice, paying the driver loses that agent her $100 and gains her nothing compared to refusing to pay. Conditional on an agent being SimAlice and not Alice, agreeing to pay the driver loses her nothing (as she, being a simulation, cannot give the driver real money), and gains her a trip out of the desert, and so her life. So, again, the law of total expectation gives us that the expected value of paying the driver (given that you don't know which one you are) is 0.5 * -$100 + 0.5 * (Value of Alice's life). This gives us that Alice should pay if and only if she values her life at more than $100, which is, once again, the correct answer.

So, to sum up, we found that SCDT can not only solve Newcomb's problem, which standard CDT cannot, but also Parfit's Hitchhiker, which neither CDT nor EDT can do. It does so at almost no cost in complexity compared to CDT, unlike, say, TDT, which is rather more complex. In fact, it is entirely possible that SCDT is nothing more than a special case of something similar to TDT. But even if it is, it is a very nice, simple, and relatively easy to understand special case, and so may deserve a look for that alone.

There are still open problems for SCDT. If, rather than being simulated, you are analysed in a more direct way, should that change anything? What if, in Newcomb's problem, Omega runs many simulations of you in parallel? Should that change the weights you place on the expected values? This ties in deeply with the philosophical problem of how you assign measure to identical, independent agents. I cannot give a simple answer, and a simple answer to those questions is needed before SCDT is complete. But if we can figure out the answers to these questions, or otherwise bypass them, we have a trivial extrapolation of CDT, the naivest decision theory, which correctly solves most or all of the problems that trip up CDT. That seems quite worthwhile.
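The hitchhiker calculation works the same way; a minimal sketch, with the post's 50/50 split between Alice and SimAlice and the $100 price (the function name is mine):

```python
def scdt_pay_value(value_of_life: float, price: float = 100.0) -> float:
    """Expected value of paying minus refusing, not knowing whether you are
    the real Alice (already home) or the driver's simulation of her."""
    # Real Alice: paying just loses the price of the ride.
    # SimAlice: "paying" costs nothing real, but makes the driver rescue Alice.
    return 0.5 * (-price) + 0.5 * value_of_life

print(scdt_pay_value(1_000_000) > 0)  # True: pay, since life is worth more than $100
print(scdt_pay_value(50.0) > 0)       # False: a $50 life wouldn't justify the fare
```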
## Another perspective on resolving the Prisoner's dilemma

11 04 June 2013 04:13PM

Sometimes I see new ideas that, without offering any new information, offer a new perspective on old information, and a new way of thinking about an old problem. So it is with this lecture and the prisoner's dilemma. Now, I've worked a lot with the prisoner's dilemma: with superrationality, negotiations, fairness, retaliation, Rawlsian veils of ignorance, etc. I've studied the problem, and its possible resolutions, extensively. But the perspective of that lecture was refreshing and new to me:

The prisoner's dilemma is resolved only when the off-diagonal outcomes of the dilemma are known to be impossible.

The "off-diagonal outcomes" are the "(Defect, Cooperate)" and "(Cooperate, Defect)" squares, where one person walks away with all the benefit and the other has none:

| (Baron, Countess) | Cooperate | Defect |
|---|---|---|
| Cooperate | (3,3) | (0,5) |
| Defect | (5,0) | (1,1) |

Facing an identical (or near identical) copy of yourself? Then the off-diagonal outcomes are impossible, because you're going to choose the same thing. Facing Tit-for-tat in an iterated prisoner's dilemma? Well, the off-diagonal squares cannot be reached consistently. Is the other prisoner a Mafia don? Then the off-diagonal outcomes don't exist as written: there's a hidden negative term (you being horribly murdered) that isn't taken into account in that matrix. Various agents with open code are essentially publicly declaring the conditions under which they will not reach for the off-diagonal. The point of many contracts and agreements is to make the off-diagonal outcomes impossible or expensive.

As I said, nothing fundamentally new, but I find the perspective interesting. To my mind, it suggests that when resolving the prisoner's dilemma with probabilistic outcomes allowed, I should be thinking "blocking off possible outcomes", rather than "reaching agreement".
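The claim that removing the off-diagonal squares flips the rational choice can be checked directly with the payoffs above (the helper below evaluates each move by its worst reachable payoff, which agrees with the dominance argument when the full matrix is open; the code structure is mine):

```python
# Payoffs from the table, indexed by (my_move, their_move); 'C' = cooperate, 'D' = defect.
payoff = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def best_move(reachable):
    """Pick my move, judging each one by its worst payoff among reachable outcomes.

    With all four squares reachable this agrees with dominance reasoning
    (defecting beats cooperating against either response); with only the
    diagonal reachable, each move has a single possible outcome.
    """
    def worst_case(move):
        return min(payoff[o] for o in reachable if o[0] == move)
    return max('CD', key=worst_case)

full = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]
diagonal = [('C', 'C'), ('D', 'D')]  # facing a copy: off-diagonal is impossible

print(best_move(full))      # D
print(best_move(diagonal))  # C
```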
## The VNM independence axiom ignores the value of information

10 02 March 2013 02:36PM

Followup to: Is risk aversion really irrational?

After reading the decision theory FAQ and re-reading The Allais Paradox, I realized I still don't accept the VNM axioms, especially the independence one, and I started thinking about what my true rejection could be. And then I realized I had already somewhat explained it in my Is risk aversion really irrational? article, but that article didn't make it obvious how this relates to VNM - it wasn't obvious to me at the time.

Here is the core idea: information has value. Uncertainty therefore has a cost. And that cost is not linear in uncertainty.

Let's take a first example: A is being offered a trip to Ecuador, B is being offered a great new laptop and C is being offered a trip to Iceland. My own preference is A > B > C. I love Ecuador - it's a fantastic country. But I prefer a laptop over a trip to Iceland, because I'm not fond of cold weather (well, actually Iceland is pretty cool too, but let's assume for the sake of the article that A > B > C is my preference).

But now I'm offered D = (50% chance of A, 50% chance of B) or E = (50% chance of A, 50% chance of C). The VNM independence principle says I should prefer D > E. But doing so forgets the cost of information/uncertainty. By choosing E, I'm sure I'll be offered a trip - I don't know where to, but I know I'll be offered a trip, not a laptop. By choosing D, I have no idea about the nature of the present. I have much less information about my future - and that lack of information has a cost. If I know I'll be offered a trip, I can already ask for days off at work, I can go buy a backpack, I can start doing the paperwork to get my passport. And if I know I won't be offered a laptop, I may decide to buy one - maybe not as great as the one I would have been offered, but I can still buy one. But if I choose D, I have much less information about my future, and I can't optimize it as much.
The same goes for the Allais paradox: the certitude of receiving a significant amount of money ($24,000) has a value, which is present in choice 1A but not in the others (1B, 2A, 2B).

And I don't see why a "rational agent" should neglect the value of this information, as the VNM axioms imply. Any thoughts about that?
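The preparation effect can be given a toy model (all utility numbers and names below are mine, purely for illustration): knowing the category of the prize in advance unlocks a bonus, which reverses the ranking that independence would force.

```python
# Toy model: base utilities for each prize, plus a preparation bonus that is
# only available if you know in advance which *category* (trip vs laptop)
# of prize you will receive.
base = {'ecuador': 10.0, 'laptop': 6.0, 'iceland': 5.0}
PREP_BONUS = 2.0  # booking days off, buying a backpack, passport paperwork...

def lottery_value(prizes):
    """Expected utility of a 50/50 lottery over two prizes, with the bonus
    iff both possible prizes fall in the same category."""
    categories = {'laptop' if p == 'laptop' else 'trip' for p in prizes}
    bonus = PREP_BONUS if len(categories) == 1 else 0.0
    return sum(base[p] for p in prizes) / len(prizes) + bonus

D = lottery_value(['ecuador', 'laptop'])   # 8.0 — mixed categories, no preparation
E = lottery_value(['ecuador', 'iceland'])  # 9.5 — certainly a trip, so prepare
print(D < E)  # True: E beats D, though independence (with A > B > C) demands D > E
```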

## Proof of fungibility theorem

3 12 January 2013 09:26AM

Appendix to: A fungibility theorem

Suppose that $P$ is a set and we have functions $v_1, \dots, v_n : P \to \mathbb{R}$. Recall that for $p, q \in P$, we say that $p$ is a Pareto improvement over $q$ if for all $i$, we have $v_i(p) \geq v_i(q)$. And we say that it is a strong Pareto improvement if in addition there is some $i$ for which $v_i(p) > v_i(q)$. We call $p$ a Pareto optimum if there is no strong Pareto improvement over it.

Theorem. Let $P$ be a set and suppose $v_i: P \to \mathbb{R}$ for $i = 1, \dots, n$ are functions satisfying the following property: For any $p, q \in P$ and any $\alpha \in [0, 1]$, there exists an $r \in P$ such that for all $i$, we have $v_i(r) = \alpha v_i(p) + (1 - \alpha) v_i(q)$.

Then if an element $p$ of $P$ is a Pareto optimum, then there exist nonnegative constants $c_1, \dots, c_n$ such that the function $\sum c_i v_i$ achieves a maximum at $p$.
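A small numerical illustration of the theorem for $n = 2$ (the pure policies and weights below are my own toy choices, not from the post): the mixture-closure hypothesis holds by construction, the middle pure policy is a Pareto optimum, and the weights $c = (1, 1)$ make it the maximizer of the weighted sum.

```python
# Achievable utility vectors: mixtures of three "pure" policies for two
# utility functions v1, v2 (the numbers are mine, chosen for illustration).
pure = [(4.0, 0.0), (3.0, 3.0), (0.0, 4.0)]

points = []  # a grid of mixtures; by the theorem's hypothesis all are achievable
for ia in range(11):
    for ib in range(11 - ia):
        a, b, c = ia / 10, ib / 10, (10 - ia - ib) / 10
        points.append(tuple(a * x + b * y + c * z for x, y, z in zip(*pure)))

def pareto_optimal(p, pts, eps=1e-9):
    """True iff no point in pts is a strong Pareto improvement over p."""
    return not any(all(q[i] >= p[i] - eps for i in range(2)) and
                   any(q[i] > p[i] + eps for i in range(2)) for q in pts)

p = (3.0, 3.0)  # the middle pure policy: a Pareto optimum here
weighted = lambda q: 1.0 * q[0] + 1.0 * q[1]  # weights c = (1, 1)

print(pareto_optimal(p, points))       # True
print(max(points, key=weighted) == p)  # True: p maximizes the weighted sum
```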

## Math appendix for: "Why you must maximize expected utility"

8 13 December 2012 01:11AM

This is a mathematical appendix to my post "Why you must maximize expected utility", giving precise statements and proofs of some results about von Neumann-Morgenstern utility theory without the Axiom of Continuity. I wish I had the time to make this post more easily readable, giving more intuition; the ideas are rather straight-forward and I hope they won't get lost in the line noise!

The work here is my own (though closely based on the standard proof of the VNM theorem), but I don't expect the results to be new.

*

I represent preference relations as total preorders $\preccurlyeq$ on a simplex $\Delta_N$; define $\prec$, $\sim$, $\succcurlyeq$ and $\succ$ in the obvious ways (e.g., $x\sim y$ iff both $x\preccurlyeq y$ and $y\preccurlyeq x$, and $x\prec y$ iff $x\preccurlyeq y$ but not $y\preccurlyeq x$). Write $e^i$ for the $i$'th unit vector in $\mathbb{R}^N$.

In the following, I will always assume that $\preccurlyeq$ satisfies the independence axiom: that is, for all $x,y,z\in\Delta_N$ and $p\in(0,1]$, we have $x\prec y$ if and only if $px + (1-p)z \prec py + (1-p)z$. Note that the analogous statement with weak preferences follows from this: $x\preccurlyeq y$ holds iff $y\not\prec x$, which by independence is equivalent to $py + (1-p)z \not\prec px + (1-p)z$, which is just $px + (1-p)z \preccurlyeq py + (1-p)z$.

Lemma 1 (more of a good thing is always better). If $x\prec y$ and $0\le p < q \le 1$, then $(1-p)x + py\prec (1-q)x + qy$.

Proof. Let $r := q-p$. Then, $(1-p)x + py = \big((1-q)x + py\big) + rx$ and $(1-q)x + qy = \big((1-q)x + py\big) + ry$. Thus, the result follows from independence applied to $x$, $y$, $\textstyle\frac{1}{1-r}\big((1-q)x + py\big)$, and $r$.$\square$

Lemma 2. If $x\preccurlyeq y\preccurlyeq z$ and $x\prec z$, then there is a unique $p\in[0,1]$ such that $(1-q)x + qz \prec y$ for $q\in[0,p)$ and $y\prec (1-q)x + qz$ for $q\in(p,1]$.

Proof. Let $p$ be the supremum of all $r\in[0,1]$ such that $(1-r)x + rz\preccurlyeq y$ (note that by assumption, this condition holds for $r=0$). Suppose that $0\le q<p$. Then there is an $r\in(q,p]$ such that $(1-r)x + rz\preccurlyeq y$. By Lemma 1, we have $(1-q)x + qz \prec (1-r)x + rz$, and the first assertion follows.

Suppose now that $p < q \le 1$. Then by definition of $p$, we do not have $(1-q)x + qz\preccurlyeq y$, which means that we have $(1-q)x + qz\succ y$, which was the second assertion.

Finally, uniqueness is obvious, because if both $p$ and $p'$ satisfied the condition, we would have $\textstyle y \prec \big(1 - \frac{p+p'}2\big)x + \frac{p+p'}2z \prec y$.$\square$

Definition 3. $x$ is much better than $y$, notation $x\succ_* y$ or $y\prec_* x$, if there are neighbourhoods $U$ of $x$ and $V$ of $y$ (in the relative topology of $\Delta_N$) such that we have $x' \succ y'$ for all $x'\in U$ and $y'\in V$. (In other words, the graph of $\succ_*$ is the interior of the graph of $\succ$.) Write $x\preccurlyeq_* y$ or $y\succcurlyeq_* x$ when $x\nsucc_* y$ ($x$ is not much better than $y$), and $x\sim_* y$ ($x$ is about as good as $y$) when both $x\preccurlyeq_* y$ and $x\succcurlyeq_* y$.

Theorem 4 (existence of a utility function). There is a $u\in\mathbb{R}^N$ such that for all $x,y\in\Delta_N$,

$\sum_i x_i\,u_i \;<\; \sum_i y_i\,u_i\;\;\iff\;\; x\prec_* y\;\;\implies\;\;x\prec y.$

Unless $x\sim y$ for all $x$ and $y$, there are $i,j\in\{1,\dotsc,N\}$ such that $u_i\neq u_j$.

Proof. Let $i$ be a worst and $j$ a best outcome, i.e. let $i,j\in\{1,\dotsc,N\}$ be such that $e^i\preccurlyeq e^k\preccurlyeq e^j$ for all $k\in\{1,\dotsc,N\}$. If $e^i\sim e^j$, then $e^i \sim e^k$ for all $k$, and by repeated applications of independence we get $x\sim e^i\sim y$ for all $x,y\in\Delta_N$, and therefore $x\sim_* y$ again for all $x,y\in\Delta_N$, and we can simply choose $u=0$.

Thus, suppose that $e^i\prec e^j$. In this case, let $u$ be such that for every $k\in\{1,\dotsc,N\}$, $u_k$ equals the unique $p$ provided by Lemma 2 applied to $e^i\preccurlyeq e^k\preccurlyeq e^j$ and $e^i\prec e^j$. Because of Lemma 1, $u_i = 0 \neq 1 = u_j$. Let $f(r) := (1-r)e^i + re^j$.

We first show that $\textstyle p := \sum_k x_k\,u_k < \sum_k y_k\,u_k =: q$ implies $x\prec y$. For every $k$, we either have $u_k < 1$, in which case by Lemma 2 we have $e^k \prec f(u_k + \epsilon_k)$ for arbitrarily small $\epsilon_k > 0$, or we have $u_k = 1$, in which case we set $\epsilon_k := 0$ and find $e^k\preccurlyeq e^j = f(u_k + \epsilon_k)$. Set $\textstyle \epsilon := \sum_k x_k\,\epsilon_k$. Now, by independence applied $N-1$ times, we have $\textstyle x = \sum_k x_k\,e^k \preccurlyeq \sum_k x_k f(u_k + \epsilon_k) = f(p+\epsilon)$; analogously, we obtain $y \succcurlyeq f(q-\delta)$ for arbitrarily small $\delta > 0$. Thus, choosing $\epsilon$ and $\delta$ small enough that $p+\epsilon < q-\delta$ (possible since $p<q$), Lemma 1 gives $x\preccurlyeq f(p+\epsilon)\prec f(q-\delta)\preccurlyeq y$ and therefore $x\prec y$ as claimed. Now note that if $\textstyle\sum_k x_k\,u_k < \sum_k y_k\,u_k$, then this continues to hold for $x'$ and $y'$ in a sufficiently small neighbourhood of $x$ and $y$, and therefore we have $x\prec_* y$.

Now suppose that $\textstyle \sum_k x_k\,u_k \ge \sum_k y_k\,u_k$. Since we have $u_i = 0$ and $u_j = 1$, we can find points $x'$ and $y'$ arbitrarily close to $x$ and $y$ such that the inequality becomes strict (either the left-hand side is smaller than one and we can increase it, or the right-hand side is greater than zero and we can decrease it, or else the inequality is already strict). Then, $x'\succ y'$ by the preceding paragraph. But this implies that $x\not\prec_* y$, which completes the proof.$\square$

Corollary 5. $\preccurlyeq_*$ is a preference relation (i.e., a total preorder) that satisfies independence and the von Neumann-Morgenstern continuity axiom.

Proof. It is well-known (and straightforward to check) that this follows from the assertion of the theorem.$\square$

Corollary 6. $u$ is unique up to affine transformations.

Proof. Since $u$ is a VNM utility function for $\preccurlyeq_*$, this follows from the analogous result for that case.$\square$

Corollary 7. Unless $x\sim y$ for all $x,y\in\Delta_N$, for all $r\in\mathbb{R}$ the set $\textstyle \{x\in\Delta_N : \sum_i x_i\,u_i = r\}$ has lower dimension than $\Delta_N$ (i.e., it is the intersection of $\Delta_N$ with a lower-dimensional subspace of $\mathbb{R}^N$).

Proof. First, note that the assumption implies that $N\ge 2$. Let $v\in\mathbb{R}^N$ be given by $v_i = 1$, $\forall i$, and note that $\Delta_N$ is the intersection of the hyperplane $A := \{x\in\mathbb{R}^N : x\cdot v = 1\}$ with the closed positive orthant $\mathbb{R}^N_+$. By the theorem, $u$ is not parallel to $v$, so the hyperplane $B_r := \{x\in\mathbb{R}^N : x\cdot u = r\}$ is not parallel to $A$. It follows that $A\cap B_r$ has dimension $N-2$, and therefore $\textstyle\{x\in\Delta_N : \sum_i x_i\,u_i = r\} \;=\; A\cap B_r\cap\mathbb{R}^N_+$ can have at most this dimension. (It can have smaller dimension or be the empty set if $A\cap B_r$ only touches or lies entirely outside the positive orthant.)$\square$

## Smoking lesion as a counterexample to CDT

7 26 October 2012 12:08PM

I stumbled upon this paper by Andy Egan and thought that its main result should be shared. We have the Newcomb problem as counterexample to CDT, but that can be dismissed as being speculative or science-fictiony. In this paper, Andy Egan constructs a smoking lesion counterexample to CDT, and makes the fascinating claim that one can construct counterexamples to CDT by starting from any counterexample to EDT and modifying it systematically.

The "smoking lesion" counterexample to EDT goes like this:

• There is a rare gene (G) that both causes people to smoke (S) and causes cancer (C). Susan mildly prefers to smoke than not to - should she do so?

EDT implies that she should not smoke (since the likely outcome in a world where she doesn't smoke is better than the likely outcome in a world where she does). CDT correctly allows her to smoke: she shouldn't care about the information revealed by her preferences.

But we can modify this problem to become a counterexample to CDT, as follows:

• There is a rare gene (G) that both causes people to smoke (S) and makes smokers vulnerable to cancer (C). Susan mildly prefers to smoke than not to - should she do so?

Here EDT correctly tells her not to smoke. CDT refuses to use her possible decision as evidence that she has the gene and tells her to smoke. But this makes her very likely to get cancer, as she is very likely to have the gene given that she smokes.

The idea behind this new example is that EDT runs into paradoxes whenever there is a common cause (G) of both some action (S) and some undesirable consequence (C). We then take that problem and modify it so that there is a common cause G of both some action (S) and of a causal relationship between that action and the undesirable consequence (S→C). This is then often a paradox of CDT.

It isn't a perfect match - for instance, if the gene G were common, then CDT would say not to smoke in the modified smoking lesion. But it still seems that most EDT paradoxes can be adapted to become paradoxes of CDT.
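The contrast can be made concrete with toy numbers for the modified problem (all probabilities and utilities below are mine, not Egan's): CDT evaluates smoking with the prior probability of the gene, while EDT conditions on the decision itself.

```python
# Toy numbers for the *modified* smoking lesion, for illustration only.
pG = 0.01                 # the gene is rare
pS_G, pS_nG = 0.9, 0.1    # the gene strongly promotes smoking
U_SMOKE, U_CANCER = 1.0, -100.0

def p_gene_given(smoke: bool) -> float:
    """Bayes: probability of the gene, conditioning on the smoking decision."""
    pS = pS_G * pG + pS_nG * (1 - pG)
    num = (pS_G if smoke else 1 - pS_G) * pG
    return num / (pS if smoke else 1 - pS)

def p_cancer(gene_prob: float, smoke: bool) -> float:
    # Modified lesion: cancer is likely only for smokers who carry the gene.
    return gene_prob * (0.9 if smoke else 0.01) + (1 - gene_prob) * 0.01

def value(smoke: bool, gene_prob: float) -> float:
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer(gene_prob, smoke)

# CDT: the action can't cause the gene, so it uses the prior either way.
cdt_smokes = value(True, pG) > value(False, pG)
# EDT: the action is evidence about the gene.
edt_smokes = value(True, p_gene_given(True)) > value(False, p_gene_given(False))

print(cdt_smokes, edt_smokes)  # True False — CDT smokes, EDT (correctly, here) abstains
```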

## Cake, or death!

25 25 October 2012 10:33AM

Here we'll look at the famous cake or death problem teasered in the Value loading/learning post.

Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:

• $\text{argmax}_{a\in A}\ \sum_{w\in W} p(w|e,a) \sum_{u\in U} u(w)\,p(C(u)|w)$

Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.

And what value do we sum over in each world? Σu∈U u(w)p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".

The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.

So the whole formula can be described as:

• For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility function is correct in that world, and by the probability of that world. Then choose the action that maximises this weighted sum across all utility functions and worlds.
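The formula can be transcribed directly into code over a toy set of worlds, actions, and candidate utilities (all of the names and numbers below are mine, purely illustrative):

```python
# A direct transcription of argmax_a Σ_w p(w|e,a) Σ_u u(w) p(C(u)|w).
A = ['make_cake', 'ask_programmers']
W = ['cake_world', 'death_world']
U = {'cake_good':  {'cake_world': 1.0, 'death_world': 0.0},
     'death_good': {'cake_world': 0.0, 'death_world': 1.0}}

def p_world(w, a):
    """p(w | e, a): the chosen action influences which world results."""
    table = {('make_cake', 'cake_world'): 0.9,
             ('make_cake', 'death_world'): 0.1,
             ('ask_programmers', 'cake_world'): 0.5,
             ('ask_programmers', 'death_world'): 0.5}
    return table[(a, w)]

def p_correct(u, w):
    """p(C(u) | w): in this toy, every world's programmers endorse cake."""
    return {'cake_good': 0.99, 'death_good': 0.01}[u]

def expected_value(a):
    return sum(p_world(w, a) * sum(U[u][w] * p_correct(u, w) for u in U) for w in W)

best = max(A, key=expected_value)
print(best)  # make_cake
```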

## Omega lies

7 24 October 2012 10:46AM

Just developing my second idea at the end of my last post. It seems to me that in the Newcomb problem and in the counterfactual mugging, the completely trustworthy Omega lies to a greater or lesser extent.

This is immediately obvious in scenarios where Omega simulates you in order to predict your reaction. In the Newcomb problem, the simulated you is told "I have already made my decision...", which is not true at that point, and in the counterfactual mugging, whenever the coin comes up heads, the simulated you is told "the coin came up tails". And the arguments only go through because these lies are accepted by the simulated you as being true.

If Omega doesn't simulate you, but uses other methods to gauge your reactions, he isn't lying to you per se. But he is estimating your reaction in the hypothetical situation where you were fed untrue information that you believed to be true. And that you believed to be true, specifically because the source is Omega, and Omega is trustworthy.

Doesn't really change much to the arguments here, but it's a thought worth bearing in mind.

## Naive TDT, Bayes nets, and counterfactual mugging

15 23 October 2012 03:58PM

I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.

## The grand old man of decision theory: the Newcomb problem

First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put$1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:

## Decision theory and "winning"

4 16 October 2012 12:35AM

With much help from crazy88, I'm still developing my Decision Theory FAQ. Here's the current section on Decision Theory and "Winning". I feel pretty uncertain about it, so I'm posting it here for feedback. (In the FAQ, CDT and EDT and TDT and Newcomblike problems have already been explained.)

One of the primary motivations for developing TDT is a sense that both CDT and EDT fail to reason in a desirable manner in some decision scenarios. However, despite acknowledging that CDT agents end up worse off in Newcomb's Problem, many (and perhaps the majority of) decision theorists are proponents of CDT. On the face of it, this may seem to suggest that these decision theorists aren't interested in developing a decision algorithm that "wins" but rather have some other aim in mind. If so then this might lead us to question the value of developing one-boxing decision algorithms.

However, the claim that most decision theorists don’t care about finding an algorithm that “wins” mischaracterizes their position. After all, proponents of CDT tend to take the challenge posed by the fact that CDT agents “lose” in Newcomb's problem seriously (in the philosophical literature, it's often referred to as the Why ain'cha rich? problem). A common reaction to this challenge is neatly summarized in Joyce (1999, p. 153-154 ) as a response to a hypothetical question about why, if two-boxing is rational, the CDT agent does not end up as rich as an agent that one-boxes:

Rachel has a perfectly good answer to the "Why ain't you rich?" question. "I am not rich," she will say, "because I am not the kind of person [Omega] thinks will refuse the money. I'm just not like you, Irene [the one-boxer]. Given that I know that I am the type who takes the money, and given that [Omega] knows that I am this type, it was reasonable of me to think that the $1,000,000 was not in [the box]. The$1,000 was the most I was going to get no matter what I did. So the only reasonable thing for me to do was to take it."

Irene may want to press the point here by asking, "But don't you wish you were like me, Rachel?"... Rachel can and should admit that she does wish she were more like Irene... At this point, Irene will exclaim, "You've admitted it! It wasn't so smart to take the money after all." Unfortunately for Irene, her conclusion does not follow from Rachel's premise. Rachel will patiently explain that wishing to be a [one-boxer] in a Newcomb problem is not inconsistent with thinking that one should take the \$1,000 whatever type one is. When Rachel wishes she was Irene's type she is wishing for Irene's options, not sanctioning her choice... While a person who knows she will face (has faced) a Newcomb problem might wish that she were (had been) the type that [Omega] labels a [one-boxer], this wish does not provide a reason for being a [one-boxer]. It might provide a reason to try (before [the boxes are filled]) to change her type if she thinks this might affect [Omega's] prediction, but it gives her no reason for doing anything other than taking the money once she comes to believe that she will be unable to influence what [Omega] does.

In other words, this response distinguishes between the winning decision and the winning type of agent and claims that two-boxing is the winning decision in Newcomb’s problem (even if one-boxers are the winning type of agent). Consequently, insofar as decision theory is about determining which decision is rational, on this account CDT reasons correctly in Newcomb’s problem.

For those that find this response perplexing, an analogy could be drawn to the chewing gum problem. In this scenario, there is near unanimous agreement that the rational decision is to chew gum. However, statistically, non-chewers will be better off than chewers. As such, the non-chewer could ask, “if you’re so smart, why aren’t you healthy?”. In this case, the above response seems particularly appropriate. The chewers are less healthy not because of their decision but rather because they’re more likely to have an undesirable gene. Having good genes doesn’t make the non-chewer more rational but simply more lucky. The proponent of CDT simply extends this response to Newcomb’s problem.

One final point about this response is worth noting. A proponent of CDT can accept the above argument but still acknowledge that, if given the choice before the boxes are filled, they would be rational to choose to modify themselves to be a one-boxing type of agent (as Joyce acknowledged in the above passage and as argued for in Burgess, 2004). To the proponent of CDT, this is unproblematic: if we are sometimes rewarded not for the rationality of our decisions in the moment but for the type of agent we were at some past moment, then it should be unsurprising that changing to a different type of agent might be beneficial.

The response to this defense of two-boxing in Newcomb's problem has been divided. Many find it compelling but others, like Ahmed and Price (2012), think it does not adequately address the challenge:

It is no use the causalist's whining that foreseeably, Newcomb problems do in fact reward irrationality, or rather CDT-irrationality. The point of the argument is that if everyone knows that the CDT-irrational strategy will in fact do better on average than the CDT-rational strategy, then it's rational to play the CDT-irrational strategy.

Given this, there seem to be two positions one could take on these issues. If the response given by the proponent of CDT is compelling, then we should be attempting to develop a decision theory that two-boxes on Newcomb’s problem. Perhaps the best theory for this role is CDT but perhaps it is instead BT, which many people think reasons better in the psychopath button scenario. On the other hand, if the response given by the proponents of CDT is not compelling, then we should be developing a theory that one-boxes in Newcomb’s problem. In this case, TDT, or something like it, seems like the most promising theory currently on offer.

## No Anthropic Evidence

9 23 September 2012 10:33AM

Closely related to: How Many LHC Failures Is Too Many?

Consider the following thought experiment. At the start, an "original" coin is tossed, but not shown. If it was "tails", a gun is loaded; otherwise it's not. After that, you are offered a large number of decision rounds, in each of which you can either quit the game or toss a coin of your own. If your coin falls "tails", the gun gets triggered, and depending on how the original coin fell (whether the gun was loaded), you either get shot or not (if the gun doesn't fire, i.e. if the original coin was "heads", you are free to go). If your coin is "heads", you are all right for the round. If you quit the game, you will get shot at the exit with probability 75%, independently of what happened during the game (and of the original coin). The question is: should you keep playing, or quit if you observe, say, 1000 "heads" in a row?

Intuitively, it seems as if 1000 "heads" is "anthropic evidence" for the original coin being "tails", that the long sequence of "heads" can only be explained by the fact that "tails" would have killed you. If you know that the original coin was "tails", then to keep playing is to face the certainty of eventually tossing "tails" and getting shot, which is worse than quitting, with only 75% chance of death. Thus, it seems preferable to quit.

On the other hand, each "heads" you observe doesn't distinguish the hypothetical where the original coin was "heads" from one where it was "tails". The first round can be modeled by a 4-element finite probability space consisting of options {HH, HT, TH, TT}, where HH and HT correspond to the original coin being "heads" and HH and TH to the coin-for-the-round being "heads". Observing "heads" is the event {HH, TH}, which gives the same 50% posterior probabilities for "heads" and "tails" of the original coin. Thus, each round that ends in "heads" doesn't change your knowledge about the original coin, even after 1000 rounds of this type. And since you only get shot if the original coin was "tails", you face only a 50% probability of dying as the game continues, which is better than the 75% from quitting the game.
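This second argument can be checked by direct computation: under either hypothesis about the original coin, a run of n "heads" has likelihood (1/2)^n, so the posterior stays at 50% (a minimal sketch; the function name is mine):

```python
from fractions import Fraction

def posterior_heads(n_rounds: int) -> Fraction:
    """P(original coin was "heads" | you tossed n heads in a row and survived)."""
    half = Fraction(1, 2)
    # Your per-round coin is fair and independent of the original coin, and a
    # round of "heads" never fires the gun under either hypothesis, so the
    # observation has likelihood (1/2)^n both ways.
    like_if_orig_heads = half ** n_rounds
    like_if_orig_tails = half ** n_rounds
    return (half * like_if_orig_heads /
            (half * like_if_orig_heads + half * like_if_orig_tails))

print(posterior_heads(1000))  # 1/2 — the long run of heads is no anthropic evidence
```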

(See also the comments by simon2 and Benja Fallenstein on the LHC post, and this thought experiment by Benja Fallenstein.)

The result of this exercise could be generalized by saying that counterfactual possibility of dying doesn't in itself influence the conclusions that can be drawn from observations that happened within the hypotheticals where one didn't die. Only if the possibility of dying influences the probability of observations that did take place, would it be possible to detect that possibility. For example, if in the above exercise, a loaded gun would cause the coin to become biased in a known way, only then would it be possible to detect the state of the gun (1000 "heads" would imply either that the gun is likely loaded, or that it's likely not).
