## Publication of "Anthropic Decision Theory"

4 20 September 2017 03:41PM

My paper "Anthropic decision theory for self-locating beliefs", based on posts here on Less Wrong, has been published as a Future of Humanity Institute tech report. Abstract:

This paper sets out to resolve how agents ought to act in the Sleeping Beauty problem and various related anthropic (self-locating belief) problems, not through the calculation of anthropic probabilities, but through finding the correct decision to make. It creates an anthropic decision theory (ADT) that decides these problems from a small set of principles. By doing so, it demonstrates that the attitude of agents with regards to each other (selfish or altruistic) changes the decisions they reach, and that it is very important to take this into account. To illustrate ADT, it is then applied to two major anthropic problems and paradoxes, the Presumptuous Philosopher and Doomsday problems, thus resolving some issues about the probability of human extinction.

Most of these ideas are also explained in this video.

To situate Anthropic Decision Theory within the UDT/TDT family: it's basically a piece of UDT applied to anthropic problems, where the UDT approach can be justified by using generally fewer, and more natural, assumptions than UDT does.

## Simplified Anthropic Doomsday

1 02 September 2017 08:37PM

Here is a simplified version of the Doomsday argument in Anthropic decision theory, to get easier intuitions.

Assume a single agent A exists, an average utilitarian, with utility linear in money. Their species survives with 50% probability; denote this event by S. If the species survives, there will be 100 people total; otherwise the average utilitarian is the only one of its kind. An independent coin lands heads with 50% probability; denote this event by H.

Agent A must price a coupon CS that pays out €1 on S, and a coupon CH that pays out €1 on H. The coupon CS pays out only on S, thus the reward only exists in a world where there are a hundred people, thus if S happens, the coupon CS is worth (€1)/100. Hence its expected worth is (€1)/200=(€2)/400.

But H is independent of S, so (H,S) and (H,¬S) both have probability 25%. In (H,S), there are a hundred people, so CH is worth (€1)/100. In (H,¬S), there is one person, so CH is worth (€1)/1=€1. Thus the expected value of CH is (€1)/4+(€1)/400 = (€101)/400. This is more than 50 times the value of CS.

Note that C¬S, the coupon that pays out on doom, has an even higher expected value of (€1)/2=(€200)/400.

So, H and S have identical probability, but A assigns CS and CH different expected utilities, with a higher value to CH, simply because S is correlated with survival and H is independent of it (and A assigns an ever higher value to C¬S, which is anti-correlated with survival). This is a phrasing of the Doomsday Argument in ADT.

## The Doomsday argument in anthropic decision theory

5 31 August 2017 01:44PM

EDIT: added a simplified version here.

Crossposted at the intelligent agents forum.

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn't found a good way to express the doomsday argument within ADT.

This post will remedy that hole, by showing how there is a natural doomsday-like behaviour for average utilitarian agents within ADT.

## [Link] Anthropic uncertainty in the Evidential Blackmail problem

4 14 May 2017 04:43PM

## [Link] How the Simulation Argument Dampens Future Fanaticism

6 09 September 2016 01:17PM

Very comprehensive analysis by Brian Tomasik on whether (and to what extent) the simulation argument should change our altruistic priorities. He concludes that the possibility of ancestor simulations somewhat increases the comparative importance of short-term helping relative to focusing on shaping the "far future".

Another important takeaway:

[...] rather than answering the question “Do I live in a simulation or not?,” a perhaps better way to think about it (in line with Stuart Armstrong's anthropic decision theory) is “Given that I’m deciding for all subjectively indistinguishable copies of myself, what fraction of my copies lives in a simulation and how many total copies are there?"

## Newcomb versus dust specks

-1 12 May 2016 03:02AM

You're given the option to torture everyone in the universe, or inflict a dust speck on everyone in the universe. Either you are the only one in the universe, or there are 3^^^3 perfect copies of you (far enough apart that you will never meet.) In the latter case, all copies of you are chosen, and all make the same choice. (Edit: if they choose specks, each person gets one dust speck. This was not meant to be ambiguous.)

As it happens, a perfect and truthful predictor has declared that you will choose torture iff you are alone.

What do you do?

How does your answer change if the predictor made the copies of you conditional on their prediction?

## Computation complexity of AGI design

6 02 February 2015 08:05PM

Summary of main point: I argue that there is a significant probability that creating de novo AGI is an intractable problem. Evolution only solved this problem because of anthropic reasons. Conclusions are drawn regarding priorities in AI risk research.

Sketch of main argument: There are suggestive relations between AGI and NP-completeness. These relations lead me to hypothesize that AGI programs posses large Levin-Kolmogorov complexity which implies that producing them is a computationally intractable problem. The timing of events in the evolution of human intelligence seems to be consistent with the assumption evolution's success is anthropic, if we postulate human intelligence as arising from a combination of two modules: an "easy" (low complexity) module and a "hard" (high complexity) module. Therefore, creating superhuman intelligence will require reverse engineering the human brain and be limited to improving the "easy" module (since creating a better "hard" module is again computationally intractable).

# AGI and P vs. NP

There are several arguments the AGI problem is of a similar "flavor" to problems that are NP-complete.

The first argument is rather vague but IMO still compelling. Many class separations in complexity theory (P vs. NP, L vs. P, R vs. RE) hinge on the existence of a complete language. This means there is a single problem solving which under the stronger resource constraints would lead to solving all problems in the larger class. Similarly, Goedel incompleteness means there is no single algorithm (a program which terminates on all inputs) for proving all provable theorems. It feels like there is a principle of mathematics which rules out algorithms that are "too good to be true": a single "magic wand" to solve all problems. In a similar way, AGI is a "magic wand": it solves "all" problems because you can simply delegate them to the AGI.

Another argument has to do with Solomonoff induction. Solomonoff induction is incomputable but it becomes computable if we set a limit T on the run-time of the "hypotheses" (programs) we consider. However, the resulting computable induction carries an
O(T 2T) slow-down penalty (the time it takes to run all possible hypotheses). On the other hand, the problem is easy modulo P# and tractable given an NP-complete oracle given certain assumptions on the required probability accuracy.

Yet another argument goes through logical uncertainty. The latter is widely suspected to be an important component of AGI and there is a compelling relation between it and P vs. NP.

What does all of it mean? We certainly don't need an NP-oracle to construct an AGI since humans are "A"GIs and (presumably) there are no NP-oracles in our brain. To shed light on this, it is useful to take the quantitative point of view on AGI. Namely, there is a metric which rates programs according to how "intelligent" they are. From this point-of-view, an AGI is just a program which ranks high on this metric. The first such metric was suggested by Legg and Hutter and I improved on their construction by combining it with UDT.

This way the AGI design problem becomes an optimization problem: find a program with an intelligence metric as high as possible. The NP-connection now suggests the following conjecture: the AGI optimization program is of exponential complexity in program length. Of course we don't necessarily need the best program of a given length but the impression remains that AGI design is hard in some rigorous complexity theoretic sense. In particular, I'm guessing there should be a relation between the intelligence (in the precise quantitative sense) of a program and its Levin-Kolmogorov complexity.

# The anthropic scenario

If we buy into the conjecture above, a glaring problem appears: if AGI design is so hard, how come evolution succeeded in it? After all, evolution is also a process with bounded computing resources. The only explanation that seems to remain is the anthropic one: evolution's a priori probability of success was insanely low but in an infinite universe it still succeeds infinitely many times and we observe one of these times for the obvious reason.

This explanation produces probabilistic predictions regarding the timing of events. For example, if there was no cosmological upper bound on when intelligence can appear we would expect it would appear extremely late. This is not the case in our universe (on a cosmological time scale). However, this is not difficult to explain since there is a relatively short time window in the lifetime of the universe in which suitable planets revolving suitable stars exist. In particular, on Earth in 0.6 billion years there won't be trees any more and in 1.1 billion years there won't oceans.

As well known, in scenarios with hard steps that are overcome anthropically, the hard steps are expected to be distributed on the timeline approximately uniformly. This seems to conflict with the most intuitive location of the intelligence hard step: somewhere between chimp and human. However, the apparent discrepancy goes away if we consider a model with two coupled "intelligence modules": an "easy" module E which is susceptible to non-anthropic evolutionary optimization and a "hard" module H which contains most of the Levin-Kolmogorov complexity and whose appearance is the hard step in question.

Before the hard step, an early version E1 of E co-evolves with a module h which performs a similar function to H but does it much worse (imagine a rough heuristic which works for many of the cases in a relatively narrow domain). During the hard step, H appears "out of the blue" due to sheer anthropic luck after which the E1-h "wire" is replaced by an E1-H wire. After the hard step, natural selection proceeds to transform E1 into its final version E2. This picture seems to be consistent with hard step happening to our chimp-like ancestor after which natural selection rapidly transformed the result into homo sapiens sapiens.

This scenario would be undermined if there was an "E-like" property of our ancestors which evolved shortly before the presumed hard step. What can this property be? The best candidate I can think of is the evolution of hands. Apparently, hands evolved 100 millions years ago. The ratio between this number and the remaining 600 million years doesn't seem to be small enough to rule out the anthropic scenario. The argument is made stronger if we take into account that there is an extinction event every 100 million years or so which means we can't reasonably expect a much larger time difference.

# Consequences for future of mankind

If AGI is a computationally intractable problem, we won't be able to solve it "fairly" in the near future. However, we can use the existing solution: homo sapiens sapiens. This means reverse engineering the brain and either modifying it (improving module E) or extracting (emulating) H and writing E from scratch. It is not clear how much intelligence improvement to expect: on the one hand we're stuck with the current H on the other hand E might still have lots of room for improvement (which is intuitively likely). It is not clear whether the monopole (singleton) or multipole scenario is more likely. It feels to me that a singleton will require rewriting E and it will be easier to start tweaking it therefore multipole superhuman intelligence will be first.

Reverse engineering and modifying the brain is a project which is likely to require considerable resources and encounter enormous legal barriers. As opposed to de novo AGI, it is difficult to imagine it accomplished by a small group or any private organization. The most likely scenario seems to be a major government project in the spirit of Manhattan, Apollo or LHC. The currently prevailing culture / system of beliefs makes it extremely unlikely for the government of a liberal country to undertake such a project if the technology was available. If this circumstance doesn't change, the first government to try will be an authoritarian one like China. Such a government will ensure the resulting superhumans will have extreme built-in loyalty*, resulting in a world-wide superdictatorship. Therefore, the highest priority seems to be changing culture in a way that will ensure a supportive public opinion for a future friendly superintelligence project. Another high priority is continuing to develop the abstract mathematical theory to better understand the likelihood of this and other scenarios.

* I am assuming (or hoping) that no government will be stupid enough to try it before brain reverse engineering identifies the "utility function module"

EDIT: The treatment of anthropics in this post is unforgivably oversimplified. I'm hoping to a write a UDT-based analysis later. Also, thanks to Mark Friedenbach for point out the extremely relevant paper by Shulman and Bostrom.

## Selfish preferences and self-modification

4 14 January 2015 08:42AM

One question I've had recently is "Are agents acting on selfish preferences doomed to having conflicts with other versions of themselves?" A major motivation of TDT and UDT was the ability to just do the right thing without having to be tied up with precommitments made by your past self - and to trust that your future self would just do the right thing, without you having to tie them up with precommitments. Is this an impossible dream in anthropic problems?

In my recent post, I talked about preferences where "if you are one of two copies and I give the other copy a candy bar, your selfish desires for eating candy are unfulfilled." If you would buy a candy bar for a dollar but not buy your copy a candy bar, this is exactly a case of strategy ranking depending on indexical information.

This dependence on indexical information is inequivalent with UDT, and thus incompatible with peace and harmony.

To be thorough, consider an experiment where I am forked into two copies, A and B. Both have a button in front of them, and 10 candies in their account. If A presses the button, it deducts 1 candy from A. But if B presses the button, it removes 1 candy from B and gives 5 candies to A.

Before the experiment begins, I want my descendants to press the button 10 times (assuming candies come in units such that my utility is linear). In fact, after the copies wake up but before they know which is which, they want to press the button!

The model of selfish preferences that is not UDT-compatible looks like this: once A and B know who is who, A wants B to press the button but B doesn't want to do it. And so earlier, I should try and make precommitments to force B to press the button.

But suppose that we simply decided to use a different model. A model of peace and harmony and, like, free love, where I just maximize the average (or total, if we specify an arbitrary zero point) amount of utility that myselves have. And so B just presses the button.

(It's like non-UDT selfish copies can make all Pareto improvements, but not all average improvements)

Is the peace-and-love model still a selfish preference? It sure seems different from the every-copy-for-themself algorithm. But on the other hand, I'm doing it for myself, in a sense.

And at least this way I don't have to waste time with precomittment. In fact, self-modifying to this form of preferences is such an effective action that conflicting preferences are self-destructive. If I have selfish preferences now but I want my copies to cooperate in the future, I'll try to become an agent who values copies of myself - so long as they date from after the time of my self-modification.

If you recall, I made an argument in favor of averaging the utility of future causal descendants when calculating expected utility, based on this being the fixed point of selfish preferences under modification when confronted with Jan's tropical paradise. But if selfish preferences are unstable under self-modification in a more intrinsic way, this rather goes out the window.

Right now I think of selfish values as a somewhat anything-goes space occupied by non-self-modified agents like me and you. But it feels uncertain. On the mutant third hand, what sort of arguments would convince me that the peace-and-love model actually captures my selfish preferences?

## How many people am I?

3 15 December 2014 06:11PM

Strongly related: the Ebborians

Imagine mapping my brain into two interpenetrating networks. For each brain cell, half of it goes to one map and half to the other. For each connection between cells, half of each connection goes to one map and half to the other. We can call these two mapped out halves Manfred One and Manfred Two. Because neurons are classical, as I think, both of these maps change together. They contain the full pattern of my thoughts. (This situation is even more clear in the Ebborians, who can literally split down the middle.)

So how many people am I? Are Manfred One and Manfred Two both people? Of course, once we have two, why stop there - are there thousands of Manfreds in here, with "me" as only one of them? Put like that it sounds a little overwrought - what's really going on here is the question of what physical system corresponds to "I" in english statements like "I wake up." This may matter.

The impact on anthropic probabilities is somewhat straightforward. With everyday definitions of "I wake up," I wake up just once per day no matter how big my head is. But if the "I" in that sentence is some constant-size physical pattern, then "I wake up" is an event that happens more times if my head is bigger. And so using the variable people-number definition, I expect to wake up with a gigantic head.

The impact on decisions is less big. If I'm in this head with a bunch of other Manfreds, we're all on the same page - it's a non-anthropic problem of coordinated decision-making. For example, if I were to make any monetary bets about my head size, and then donate profits to charity, no matter what definition I'm using, I should bet as if my head size didn't affect anthropic probabilities. So to some extent the real point of this effect is that it is a way anthropic probabilities can be ill-defined. On the other hand, what about preferences that depend directly on person-numbers like how to value people with different head sizes? Or for vegetarians, should we care more about cows than chickens, because each cow is more animals than a chicken is?

According to my common sense, it seems like my body has just one person in it. Why does my common sense think that? I think there are two answers, one unhelpful and one helpful.

The first answer is evolution. Having kids is an action that's independent of what physical system we identify with "I," and so my ancestors never found modeling their bodies as being multiple people useful.

The second answer is causality. Manfred One and Manfred Two are causally distinct from two copies of me in separate bodies but the same input/output. If a difference between the two separated copies arose somehow, (reminiscent of Dennett's factual account) henceforth the two bodies would do and say different things and have different brain states. But if some difference arises between Manfred One and Manfred Two, it is erased by diffusion.

Which is to say, the map that is Manfred One is statically the same pattern as my whole brain, but it's causally different. So is "I" the pattern, or is "I" the causal system?

In this sort of situation I am happy to stick with common sense, and thus when I say me, I think the causal system is referring to the causal system. But I'm not very sure.

Going back to the Ebborians, one interesting thing about that post is the conflict between common sense and common sense - it seems like common sense that each Ebborian is equally much one person, but it also seems like common sense that if you looked at an Ebborian dividing, there doesn't seem to be a moment where the amount of subjective experience should change, and so amount of subjective experience should be proportional to thickness. But as it is said, just because there are two opposing ideas doesn't mean one of them is right.

On the questions of subjective experience raised in that post, I think this mostly gets cleared up by precise description an  anthropic narrowness. I'm unsure of the relative sizes of this margin and the proof, but the sketch is to replace a mysterious "subjective experience" that spans copies with individual experiences of people who are using a TDT-like theory to choose so that they individually achieve good outcomes given their existence.

## [Resolved] Is the SIA doomsday argument wrong?

5 13 December 2014 06:01AM

[EDIT: I think the SIA doomsday argument works after all, and my objection to it was based on framing the problem in a misguided way. Feel free to ignore this post or skip to the resolution at the end.]

ORIGINAL POST:

Katja Grace has developed a kind of doomsday argument from SIA combined with the Great Filter. It has been discussed by Robin HansonCarl Shulman, and Nick Bostrom. The basic idea is that if the filter comes late, there are more civilizations with organisms like us than if the filter comes early, and more organisms in positions like ours means a higher expected number of (non-fake) experiences that match ours. (I'll ignore simulation-argument possibilities in this post.)

I used to agree with this reasoning. But now I'm not sure, and here's why. Your subjective experience, broadly construed, includes knowledge of a lot of Earth's history and current state, including when life evolved, which creatures evolved, the Earth's mass and distance from the sun, the chemical composition of the soil and atmosphere, and so on. The information that you know about your planet is sufficient to uniquely locate you within the observable universe. Sure, there might be exact copies of you in vastly distant Hubble volumes, and there might be many approximate copies of Earth in somewhat nearer Hubble volumes. But within any reasonable radius, probably what you know about Earth requires that your subjective experiences (if veridical) could only take place on Earth, not on any other planet in our Hubble volume.

If so, then whether there are lots of human-level extraterrestrials (ETs) or none doesn't matter anthropically, because none of those ETs within any reasonable radius could contain your exact experiences. No matter how hard or easy the emergence of human-like life is in general, it can happen on Earth, and your subjective experiences can only exist on Earth (or some planet almost identical to Earth).

A better way to think about SIA is that it favors hypotheses containing more copies of our Hubble volume within the larger universe. Within a given Hubble volume, there can be at most one location where organisms veridically perceive what we perceive.

Katja's blog post on the SIA doomsday draws orange boxes with humans waving their hands. She has us update on knowing we're in the human-level stage, i.e., that we're one of those orange boxes. But we know much more: We know that we're a particular one of those boxes, which is easily distinguished from the others based on what we observe about the world. So any hypothesis that contains us at all will have the same number of boxes containing us (namely, just one box). Hence, no anthropic update.

Am I missing something? :)

RESOLUTION:

The problem with my argument was that I compared the hypothesis "filter is early and you exist on Earth" against "filter is late and you exist on Earth". If the hypotheses already say that you exist on Earth, then there's no more anthropic work to be done. But the heart of the anthropic question is whether an early or late filter predicts that you exist on Earth at all.

Here's an oversimplified example. Suppose that the hypothesis of "early filter" tells us that there are four planets, exactly one of which contains life. "Late filter" says there are four planets, all of which contain life. Suppose for convenience that if life exists on Earth at all, you will exist on Earth. Then P(you exist | early filter) = 1/4 while P(you exist | late filter) = 1. This is where the doomsday update comes from.

## More marbles and Sleeping Beauty

4 23 November 2014 02:00AM

I

Previously I talked about an entirely uncontroversial marble game: I flip a coin, and if Tails I give you a black marble, if Heads I flip another coin to either give you a white or a black marble.

The probabilities of seeing the two marble colors are 3/4 and 1/4, and the probabilities of Heads and Tails are 1/2 each.

The marble game is analogous to how a 'halfer' would think of the Sleeping Beauty problem - the claim that Sleeping Beauty should assign probability 1/2 to Heads relies on the claim that your information for the Sleeping Beauty problem is the same as your information for the marble game - same possible events, same causal information, same mutual exclusivity and exhaustiveness relations.

So what's analogous to the 'thirder' position, after we take into account that we have this causal information? Is it some difference in causal structure, or some non-causal anthropic modification, or something even stranger?

As it turns out, nope, it's the same exact game, just re-labeled.

In the re-labeled marble game you still have two unknown variables (represented by flipping coins), and you still have a 1/2 chance of black and Tails, a 1/4 chance of black and Heads, and a 1/4 chance of white and Heads.

And then to get the thirds, you ask the question "If I get a black marble, what is the probability of the faces of the first coin?" Now you update to P(Heads|black)=1/3 and P(Tails|black)=2/3.

II

Okay, enough analogies. What's going on with these two positions in the Sleeping Beauty problem?

1:            2:

Here are two different diagrams, which are really re-labelings of the same diagram. The first labeling is the problem where P(Heads|Wake) = 1/2. The second labeling is the problem where P(Heads|Wake) = 1/3. The question at hand is really - which of these two math problems corresponds to the word problem / real world situation?

As a refresher, here's the text of the Sleeping Beauty problem that I'll use: Sleeping Beauty goes to sleep in a special room on Sunday, having signed up for an experiment. A coin is flipped - if the coin lands Heads, she will only be woken up on Monday. If the coin lands Tails, she will be woken up on both Monday and Tuesday, but with memories erased in between. Upon waking up, she then assigns some probability to the coin landing Heads, P(Heads|Wake).

Diagram 1:  First a coin is flipped to get Heads or Tails. There are two possible things that could be happening to her, Wake on Monday or Wake on Tuesday. If the coin landed Heads, then she gets Wake on Monday. If the coin landed Tails, then she could either get Wake on Monday or Wake on Tuesday (in the marble game, this was mediated by flipping a second coin, but in this case it's some unspecified process, so I've labeled it [???]).  Because all the events already assume she Wakes, P(Heads|Wake) evaluates to P(Heads), which just as in the marble game is 1/2.

This [???] node here is odd, can we identify it as something natural? Well, it's not Monday/Tuesday, like in diagram 2 - there's no option that even corresponds to Heads & Tuesday. I'm leaning towards the opinion that this node is somewhat magical / acausal, just hanging around because of analogy to the marble game. So I think we can take it out. A better causal diagram with the halfer answer, then, might merely be Coin -> (Wake on Monday / Wake on Tuesday), where Monday versus Tuesday is not determined at all by a causal node, merely informed probabilistically to be mutually exclusive and exhaustive.

Diagram 2:  A coin is flipped, Heads or Tails, and also it could be either Monday or Tuesday. Together, these have a causal effect on her waking or not waking - if Heads and Monday, she Wakes, but if Heads and Tuesday, she Doesn't wake. If Tails, she Wakes. Her pre-Waking prior for Heads is 1/2, but upon waking, the event Heads, Tuesday, Don't Wake gets eliminated, and after updating P(Heads|Wake)=1/3.

There's a neat asymmetry here. In diagram 1, when the coin was Heads she got the same outcome no matter the value of [???], and only when the coin was Tails were there really two options. In Diagram 2, when the coin is Heads, two different things happen for different values of the day, while if the coin is Tails the same thing happens no matter the day.

Do these seem like accurate depictions of what's going on in these two different math problems? If so, I'll probably move on to looking closer at what makes the math problem correspond to the word problem.

5 14 November 2014 07:02AM

Nick Bostrom's self-sampling assumption treats us as a random sample from a set of observers, but this framework raises several paradoxes. Instead, why not treat the stuff we observe to be a random sample from the set of all stuff that exists? I elaborate on this proposal in a new essay subsection: "SSA on physics rather than observers?" At first glance, it seems to work better than any of the mainstream schools of anthropics. Comments are welcome.

Has this idea been suggested before? I noticed that Robin Hanson proffered something similar way back in 1998 (four years before Bostrom's Anthropic Bias). I'm surprised Hanson's proposal hasn't received more attention in the academic literature.

## Anthropic decision theory for selfish agents

8 21 October 2014 03:56PM

Consider Nick Bostrom's Incubator Gedankenexperiment, phrased as a decision problem. In my mind, this provides the purest and simplest example of a non-trivial anthropic decision problem. In an otherwise empty world, the Incubator flips a coin. If the coin comes up heads, it creates one human, while if the coin comes up tails, it creates two humans. Each created human is put into one of two indistinguishable cells, and there's no way for created humans to tell whether another human has been created or not. Each created human is offered the possibility to buy a lottery ticket which pays 1$if the coin has shown tails. What is the maximal price that you would pay for such a lottery ticket? (Utility is proportional to Dollars.) The two traditional answers are 1/2$ and 2/3$. We can try to answer this question for agents with different utility functions: total utilitarians; average utilitarians; and selfish agents. UDT's answer is that total utilitarians should pay up to 2/3$, while average utilitarians should pay up to 1/2$; see Stuart Armstrong's paper and Wei Dai's comment. There are some heuristic ways to arrive at UDT prescpriptions, such as asking "What would I have precommited to?" or arguing based on reflective consistency. For example, a CDT agent that expects to face Counterfactual Mugging-like situations in the future (with predictions also made in the future) will self-modify to become an UDT agent, i.e., one that pays the counterfactual mugger. Now, these kinds of heuristics are not applicable to the Incubator case. It is meaningless to ask "What maximal price should I have precommited to?" or "At what odds should I bet on coin flips of this kind in the future?", since the very point of the Gedankenexperiment is that the agent's existence is contingent upon the outcome of the coin flip. Can we come up with a different heuristic that leads to the correct answer? Imagine that the Incubator's subroutine that is responsible for creating the humans is completely benevolent towards them (let's call this the "Benevolent Creator"). (We assume here that the humans' goals are identical, such that the notion of benevolence towards all humans is completely unproblematic.) The Benevolent Creator has the power to program a certain maximal price the humans pay for the lottery tickets into them. A moment's thought shows that this leads indeed to UDT's answers for average and total utilitarians. For example, consider the case of total utilitarians. If the humans pay x$ for the lottery tickets, the expected utility is 1/2*(-x) + 1/2*2*(1-x). So indeed, the break-even price is reached for x=2/3.

But what about selfish agents? For them, the Benevolent Creator heuristic is no longer applicable. Since the humans' goals do not align, the Creator cannot share them. As Wei Dai writes, the notion of selfish values does not fit well with UDT. In Anthropic decision theory, Stuart Armstrong argues that selfish agents should pay up to 1/2$(Sec. 3.3.3). His argument is based on an alleged isomorphism between the average utilitarian and the selfish case. (For instance, donating 1$ to each human increases utility by 1 for both average utilitarian and selfish agents, while it increases utility by 2 for total utilitarians in the tails world.) Here, I want to argue that this is incorrect and that selfish agents should pay up to 2/3$for the lottery tickets. (Needless to say that all the bold statements I'm about to make are based on an "inside view". An "outside view" tells me that Stuart Armstrong has thought much more carefully about these issues than I have, and has discussed them with a lot of smart people, which I haven't, so chances are my arguments are flawed somehow.) In order to make my argument, I want to introduce yet another heuristic, which I call the Submissive Gnome. Suppose each cell contains a gnome which is already present before the coin is flipped. As soon as it sees a human in its cell, it instantly adopts the human's goal. From the gnome's perspective, SIA odds are clearly correct: Since a human is twice as likely to appear in the gnome's cell if the coin shows tails, Bayes' Theorem implies that the probability of tails is 2/3 from the gnome's perspective once it has seen a human. Therefore, the gnome would advise the selfish human to pay up to 2/3$ for a lottery ticket that pays 1$in the tails world. I don't see any reason why the selfish agent shouldn't follow the gnome's advice. From the gnome's perspective, the problem is not even "anthropic" in any sense, there's just straightforward Bayesian updating. Suppose we want to use the Submissive Gnome heuristic to solve the problem for utilitarian agents. (ETA: Total/average utilitarianism includes the well-being and population of humans only, not of gnomes.) The gnome reasons as follows: "With probability 2/3, the coin has shown tails. For an average utilitarian, the expected utility after paying x$ for a ticket is 1/3*(-x)+2/3*(1-x), while for a total utilitarian the expected utility is 1/3*(-x)+2/3*2*(1-x). Average and total utilitarians should thus pay up to 2/3$and 4/5$, respectively." The gnome's advice disagrees with UDT and the solution based on the Benevolent Creator. Something has gone terribly wrong here, but what? The mistake in the gnome's reasoning here is in fact perfectly isomorphic to the mistake in the reasoning leading to the "yea" answer in Psy-Kosh's non-anthropic problem.

Things become clear if we look at the problem from the gnome's perspective before the coin is flipped. Assume, for simplicity, that there are only two cells and gnomes, 1 and 2. If the coin shows heads, the single human is placed in cell 1 and cell 2 is left empty. Since the humans don't know in which cell they are, neither should the gnomes know. So from each gnome's perspective, there are four equiprobable "worlds": it can be in cell 1 or 2 and the coin flip can result in heads or tails. We assume, of course, that the two gnomes are, like the humans, sufficiently similar such that their decisions are "linked".

We can assume that the gnomes already know what utility functions the humans are going to have. If the humans will be (total/average) utilitarians, we can then even assume that the gnomes already are so, too, since the well-being of each human is as important as that of any other. Crucially, then, for both utilitarian utility functions, the question whether the gnome is in cell 1 or 2 is irrelevant. There is just one "gnome advice" that is given identically to all (one or two) humans. Whether this advice is given by one gnome or the other or both of them is irrelevant from both gnomes' perspective. The alignment of the humans' goals leads to alignment of the gnomes' goals. The expected utility of some advice can simply be calculated by taking probability 1/2 for both heads and tails, and introducing a factor of 2 in the total utilitarian case, leading to the answers 1/2 and 2/3, in accordance with UDT and the Benevolent Creator.

The situation looks different if the humans are selfish. We can no longer assume that the gnomes already have a utility function. The gnome cannot yet care about that human, since with probability 1/4 (if the gnome is in cell 2 and the coin shows heads) there will not be a human to care for. (By contrast, it is already possible to care about the average utility of all humans there will be, which is where the alleged isomorphism between the two cases breaks down.) It is still true that there is just one "gnome advice" that is given identically to all (one or two) humans, but the method for calculating the optimal advice now differs. In three of the four equiprobable "worlds" the gnome can live in, a human will appear in its cell after the coin flip. Two out of these three are tail worlds, so the gnome decides to advise paying up to 2/3$for the lottery ticket if a human appears in its cell. There is a way to restore the equivalence between the average utilitarian and the selfish case. If the humans will be selfish, we can say that the gnome cares about the average well-being of the three humans which will appear in its cell with equal likelihood: the human created after heads, the first human created after tails, and the second human created after tails. The gnome expects to adopt each of these three humans' selfish utility function with probability 1/4. It makes thus sense to say that the gnome cares about the average well-being of these three humans. This is the correct correspondence between selfish and average utilitarian values and it leads, again, to the conclusion that the correct advise is to pay up to 2/3$ for the lottery ticket.

In Anthropic Bias, Nick Bostrom argues that each human should assign probability 1/2 to the coin having shown tails ("SSA odds"). He also introduces the possible answer 2/3 ("SSA+SIA", nowadays usually simply called "SIA") and refutes it. SIA odds have been defended by Olum. The main argument against SIA is the Presumptuous Philosopher. Main arguments for SIA and against SSA odds are that SIA avoids the Doomsday Argument1, which most people feel has to be wrong, that SSA odds depend on whom you consider to be part of your "reference class", and furthermore, as pointed out by Bostrom himself, that SSA odds allow for acausal superpowers.

The consensus view on LW seems to be that much of the SSA vs. SIA debate is confused and due to discussing probabilities detached from decision problems of agents with specific utility functions. (ETA: At least this was the impression I got. Two commenters have expressed scepticism about whether this is really the consensus view.) I think that "What are the odds at which a selfish agent should bet on tails?" is the most sensible translation of "What is the probability that the coin has shown tails?" into a decision problem. Since I've argued that selfish agents should take bets following SIA odds, one can employ the Presumptuous Philosopher argument against my conclusion: it seems to imply that selfish agents, like total but unlike average utilitarians, should bet at extreme odds on living in a extremely large universe, even if there's no empirical evidence in favor of this. I don't think this counterargument is very strong. However, since this post is already quite lengthy, I'll elaborate more on this if I get encouraging feedback for this post.

1 At least its standard version. SIA comes with its own Doomsday conclusions, cf. Katja Grace's thesis Anthropic Reasoning in the Great Filter.

## Simulation argument meets decision theory

14 24 September 2014 10:47AM

Person X stands in front of a sophisticated computer playing the decision game Y which allows for the following options: either press the button "sim" or "not sim". If she presses "sim", the computer will simulate X*_1, X*_2, ..., X*_1000 which are a thousand identical copies of X. All of them will face the game Y* which - from the standpoint of each X* - is indistinguishable from Y. But the simulated computers in the games Y* don't run simulations. Additionally, we know that if X presses "sim" she receives a utility of 1, but "not sim" would only lead to 0.9. If X*_i (for i=1,2,3..1000)  presses "sim" she receives 0.2, with "not sim" 0.1. For each agent it is true that she does not gain anything from the utility of another agent despite the fact she and the other agents are identical! Since all the agents are identical egoists facing the apparently same situation, all of them will take the same action.

Now the game starts. We face a computer and know all the above. We don't know whether we are X or any of the X*'s, should we now press "sim" or "not sim"?

EDIT: It seems to me that "identical" agents with "independent" utility functions were a clumsy set up for the above question, especially since one can interpret it as a contradiction. Hence, it might be better to switch to identical egoists whereas each agent only cares about her receiving money (linear monetary value function). If X presses "sim" she will be given 10$(else 9$) in the end of the game; each X* who presses "sim" receives 2$(else 1$), respectively. Each agent in the game wants to maximize the expected monetary value they themselves will hold in their own hand after the game. So, intrinsically, they don't care how much money the other copies make.
To spice things up: What if the simulation will only happen a year later? Are we then able to "choose" which year it is?

## The Great Filter is early, or AI is hard

19 29 August 2014 04:17PM

Attempt at the briefest content-full Less Wrong post:

Once AI is developed, it could "easily" colonise the universe. So the Great Filter (preventing the emergence of star-spanning civilizations) must strike before AI could be developed. If AI is easy, we could conceivably have built it already, or we could be on the cusp of building it. So the Great Filter must predate us, unless AI is hard.

## Anthropics doesn't explain why the Cold War stayed Cold

6 20 August 2014 07:23PM

(Epistemic status: There are some lines of argument that I haven’t even started here, which potentially defeat the thesis advocated here. I don’t go into them because this is already too long or I can’t explain them adequately without derailing the main thesis. Similarly some continuations of chains of argument and counterargument begun here are terminated in the interest of focussing on the lower-order counterarguments. Overall this piece probably overstates my confidence in its thesis. It is quite possible this post will be torn to pieces in the comments—possibly by my own aforementioned elided considerations. That’s good too.)

I

George VI, King of the United Kingdom, had five siblings. That is, the father of current Queen Elizabeth II had as many siblings as on a typical human hand. (This paragraph is true, and is not a trick; in particular, the second sentence of this paragraph really is trying to disambiguate and help convey the fact in question and relate it to prior knowledge, rather than introduce an opening for some sleight of hand so I can laugh at you later, or whatever fear such a suspiciously simple proposition might engender.)

Let it be known.

II

Exactly one of the following stories is true:

Story One

Recently I hopped on Facebook and saw the following post:

“I notice that I am confused about why a nuclear war never occurred. Like, I think (knowing only the very little I know now) that if you had asked me, at the start of the Cold War or something, the probability that it would eventually lead to a nuclear war, I would've said it was moderately likely. So what's up with that?”

The post had 14 likes. In the comments, the most-Liked explanation was:

“anthropically you are considerably more likely to live in a world where there never was a fullscale nuclear war”

That comment had 17 Likes. The second-most-liked comment that offered an explanation had 4 Likes.

Story Two

## Quickly passing through the great filter

10 06 July 2014 06:50PM

To quickly escape the great filter should we flood our galaxy with radio signals?  While communicating with fellow humans we already send out massive amounts of information that an alien civilization could eventually pickup, but should we engage in positive SETI?  Or, if you fear the attention of dangerous aliens, should we set up powerful long-lived solar or nuclear powered automated radio transmitters in the desert and in space that stay silent so long as they receive a yearly signal from us, but then if they fail to get the no-go signal because our civilization has fallen, continuously transmit our dead voice to the stars?  If we do destroy ourselves it would be an act of astronomical altruism to warn other civilizations of our fate especially if we broadcasted news stories from just before our demise, e.g. physicists excited about a new high energy experiment.

## Dissolving the Thread of Personal Identity

12 25 May 2014 06:36AM

(Background: I got interested in anthropics about a week ago. It has tormented my waking thoughts ever since in a cycle of “be confused, develop idea, work it out a bit, realize that it fails, repeat” and it is seriously driving me berserk by this point. While drawing a bunch of “thread of personal continuity” diagrams to try to flesh out my next idea, I suspected that it was a fairly nonsensical idea, came up with a thought experiment that showed it was definitely a nonsensical idea, realized I was trying to answer the question “Is there any meaningful sense in which I can expect to wake up as myself tomorrow, rather than Brittany Spears?”, kept thinking anyways for about an hour, and eventually came up with this possible reduction of personal identity over time. It differs somewhat from Kaj Sotala’s. And I still have no idea what the hell to do about anthropics, but I figured I should write up this intermediate result. It takes the form of a mental dialogue with myself, because that’s what happened.)

Doubt: Hang on, this whole notion of “thread of personal continuity” looks sort of fishy. Self, can you try to clarify what it is?

Self: Let’s see… I have a causal link to my past and future self, and this causal link is the thread of personal identity!

Current Me: Please notice Past Self’s use of the cached thought from “Timeless Identity” even though it doesn’t fit.

Doubt: Causal links can’t possibly be the thread of personal continuity. Your state at time t+1 is not just caused by your state at time t, lots of events in your surroundings also cause the t+1 state as well. A whole hell of a lot of stuff has a causal link to you. That can’t possibly be it. And when you die, alive you has a causal link to dead you.

Doubt: And another thing, personal continuity isn’t just an on-off thing. There’s a gradient to it.

Self: What do you mean?

Doubt: Let’s say you get frozen by cryonics, and then revived a century later.

Self: Sure.

Doubt: Let’s say you know that you will be revived with exactly the same set of memories, preferences, thought patterns, etc, that you have currently. As you are beginning the process, what is your subjective credence that you will wake up a century later?

Self: Fairly close to 1.

Doubt: Now, let’s say they could recover all the information from your brain except your extreme love for chocolate, so when your brain is restored, they patch in a generic average inclination for chocolate. What is your subjective credence that you will wake up a century later?

Self: Fairly close to 1.

Doubt: Let’s say that all your inclinations and thought patterns and other stuff will be restored fully, but they can’t bring back memories. You will wake up with total amnesia. What is your… you get the idea.

Self: Oh crap. I… I really don’t know. 0.6??? But then again, this is the situation that several real-life people have found themselves in… Huh.

Doubt: For this one, inclinations and thought patterns and many of your memories are unrecoverable, so when your brain is restored, you only have a third of your memories, a strong belief that you are the same person that was cryopreserved, and a completely different set of… everything else except for the memories and the belief in personal continuity. P(I wake up a century later)?

Self: Quite low. ~0.1.

Self: But I see your point. For that whole personal identity/waking up as yourself thing, it isn’t a binary trait, it’s a sliding scale of belief that I’ll keep on existing which depends on the magnitude of the difference between myself and the being that wakes up. If upload!me were fed through a lossy compression algorithm and then reconstructed, my degree of belief in continuing to exist would depend on how lossy it was.

Doubt: Now you realize that the “thread of subjective experience” doesn’t actually exist. There are just observer-moments. What would it even mean for something to have a “thread of subjective experience”?

Self: (Taps into intuition) What about that big rock over there? Forget “subjective”, that rock has a “thread of existence”. That rock will still be the same rock if it is moved 3 feet to the left, that rock will still be the same rock if a piece of it is chipped off, that rock will still be the same rock if it gets covered in moss, but that rock will cease to be a rock if a nuke goes off, turning it into rock vapor! I don’t know what the hell the “thread of existence” is, but I know it has to work like that rock!!

Doubt: So you’re saying that personal identity over time works like the Ship of Theseus?

Self: Exactly! We’ve got a fuzzy category, like “this ship” or “this rock” or “me”, and there’s stuff that we know falls in the category, stuff that we know doesn’t fall in the category, and stuff for which we aren’t sure whether it falls in the category! And the thing changes over time, and as long as it stays within certain bounds, we will still lump it into the same category.

Doubt: Huh. So this “thread of existence” comes from the human tendency to assign things into fuzzy categories. So when a person goes to sleep at night, they know that in the morning, somebody extremely similar to themselves will be waking up, and that somebody falls into the fuzzy cluster that the person falling asleep labels “I”. As somebody continues through life, they know that two minutes from now, there will be a person that is similar enough to fall into the “I” cluster.

Doubt: But there’s still a problem. 30yearfuture!me will probably be different enough from present!me to fall outside the “I” category. If I went to sleep, and I knew that 30yearfuture!me woke up, I’d consider that to be tantamount to death. The two of us would share only a fraction of our memories, and he would probably have a different set of preferences, values, and thought patterns. How does this whole thing work when versions of yourself further out than a few years from your present self don’t fall in the “I” cluster in thingspace?

Self: That’s not too hard. The “I” cluster shifts over time as well. If you compare me at time t and me at time t+1, they would both fall within the “I” cluster at time t, but the “I” cluster of time t+1 is different enough to accommodate “me” at time t+2. It’s like this rock.

Doubt: Not the rock again.

Self: Quiet. If you had this rock, and 100yearfuture!thisrock side by side, they would probably not be recognizable as the same rock, but there is a continuous series of intermediates leading from one to the other, each of which would be recognizable as the same rock as its immediate ancestors and descendants.

Self: If there is a continuous series of intermediates that doesn’t happen too fast, leading from me to something very nonhuman, I will anticipate eventually experiencing what the nonhuman thing does, while if there is a discontinuous jump, I won’t anticipate experiencing anything at all.

Doubt: Huh.

Self: So that’s where the feeling of the “thread of personal identity” comes from. We have a fuzzy category labeled “I”, anticipate experiencing the sorts of things that probable future beings who fall in that category will experience, and in everyday life, there aren’t fast jumps to spots outside of the “I” category, so it feels like you’ve stayed in the same category the whole time.

Doubt: You’ll have to unpack “anticipate experiencing the sorts of things that probable future beings who fall in that category will experience”. Why?

Self: Flippant answer: If we didn’t work that way, evolution would have killed us a long time ago. Actual answer: Me at time t+1 experiences the same sorts of things as me at time t anticipated, so when me at time t+1 anticipates that me at time t+2 will experience something, it will probably happen. Looking backwards, anticipations of past selves frequently match up with the experiences of slightly-less-past selves, so looking forwards, the anticipations of my current self are likely to match up with the experiences of the future being who falls in the “I” category.

Doubt: Makes sense.

Self: You’ll notice that this also defuses the anthropic trilemma (for humans, at least). There is a 1 in a billion chance of the quantum random number generator generating the winning lottery ticket. But then a trillion copies are made, but you at time (right after the generator returned the winning number) has a trillion expected near-future beings who fall within the “I” category, so the 1 in a billion probability is split up a trillion ways among all of them. P(loser) is about 1, P(specific winner clone) is 1 in a quintillion. All the specific winner clones are then merged, and since a trillion different hypotheses each with a 1 in a quintillion probability all predict the same series of observed future events from time(right after you merge) onwards, P(series of experiences following from winning the quantum lottery) is 1 in a billion.

Doubt: Doesn’t this imply that anthropic probabilities depend on how big a boundary the mind draws around stuff it considers “I”?

Self: Yes. Let’s say we make 2 copies of a mind, and a third “copy” produced by running the mind through a lossy compression algorithm, and uncompressing it. A blue screen will be shown to one of the perfect mind copies (which may try to destroy it). A mind that considered the crappy copy to fall in the “I” category would predict a 1/3 chance of seeing the blue screen, while a mind that only considers near-perfect copies of itself as “I” would predict a 1/2 chance of seeing the blue screen, because the mind with the broad definition of “I” seriously considers the possibility of waking up as the crappy copy, while the mind with the narrow definition of “I” doesn’t.

Doubt: This seems to render probability useless.

Self: It means that probabilities of the form (I will observe X) are mind-dependent. Different minds given the same data will disagree on the probability of that statement, because they have different reference classes for the word “I”. Probabilities of the form (reality works like X)… to be honest, I don’t know. Anthropics is still extremely aggravating. I haven’t figured out the human version of anthropics (using the personal continuity notion) yet, I especially haven’t figured out how it’s going to work if you have a AI which doesn’t assign versions of itself to a fuzzy category labeled “I”, and I’m distrustful of how UDT seems like it’s optimizing over the entire tegmark 4 multiverse when there’s a chance that our reality is the only one there is, in which case it seems like you’d need probabilities of the form (reality works like X) and some way to update far away from the Boltzmann Brain hypothesis. This above section may be confused or flat-out wrong.

## The sin of updating when you can change whether you exist

8 28 February 2014 01:25AM

Trigger warning: In a thought experiment in this post, I used a hypothetical torture scenario without thinking, even though it wasn't necessary to make my point. Apologies, and thanks to an anonymous user for pointing this out. I'll try to be more careful in the future.

Should you pay up in the counterfactual mugging?

I've always found the argument about self-modifying agents compelling: If you expected to face a counterfactual mugging tomorrow, you would want to choose to rewrite yourself today so that you'd pay up. Thus, a decision theory that didn't pay up wouldn't be reflectively consistent; an AI using such a theory would decide to rewrite itself to use a different theory.

But is this the only reason to pay up? This might make a difference: Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and so you shouldn't.

I've since changed my mind, and I've recently talked about part of the reason for this, when I introduced the concept of an l-zombie, or logical philosophical zombie, a mathematically possible conscious experience that isn't physically instantiated and therefore isn't actually consciously experienced. (Obligatory disclaimer: I'm not claiming that the idea that "some mathematically possible experiences are l-zombies" is likely to be true, but I think it's a useful concept for thinking about anthropics, and I don't think we should rule out l-zombies given our present state of knowledge. More in the l-zombies post and in this post about measureless Tegmark IV.) Suppose that Omega's coin had come up the other way, and Omega had turned the sky green. Then you and I would be l-zombies. But if Omega was able to make a confident guess about the decision we'd make if confronted with the counterfactual mugging (without simulating us, so that we continue to be l-zombies), then our decisions would still influence what happens in the actual physical world. Thus, if l-zombies say "I have conscious experiences, therefore I physically exist", and update on this fact, and if the decisions they make based on this influence what happens in the real world, a lot of utility may potentially be lost. Of course, you and I aren't l-zombies, but the mathematically possible versions of us who have grown up under a green sky are, and they reason the same way as you and me—it's not possible to have only the actual conscious observers reason that way. Thus, you should pay up even in the blue-sky mugging.

But that's only part of the reason I changed my mind. The other part is that while in the counterfactual mugging, the answer you get if you try to use Bayesian updating at least looks kinda sensible, there are other thought experiments in which doing so in the straight-forward way makes you obviously bat-shit crazy. That's what I'd like to talk about today.

*

The kind of situation I have in mind involves being able to influence whether you exist, or more precisely, influence whether the version of you making the decision exists as a conscious observer (or whether it's an l-zombie).

Suppose that you wake up and Omega explains to you that it's kidnapped you and some of your friends back in 2014, and put you into suspension; it's now the year 2100. It then hands you a little box with a red button, and tells you that if you press that button, Omega will slowly torture you and your friends to death; otherwise, you'll be able to live out a more or less normal and happy life (or to commit painless suicide, if you prefer). Furthermore, it explains that one of two things have happened: Either (1) humanity has undergone a positive intelligence explosion, and Omega has predicted that you will press the button; or (2) humanity has wiped itself out, and Omega has predicted that you will not press the button. In any other scenario, Omega would still have woken you up at the same time, but wouldn't have given you the button. Finally, if humanity has wiped itself out, it won't let you try to "reboot" it; in this case, you and your friends will be the last humans.

There's a correct answer to what to do in this situation, and it isn't to decide that Omega's just given you anthropic superpowers to save the world. But that's what you get if you try to update in the most naive way: If you press the button, then (2) becomes extremely unlikely, since Omega is really really good at predicting. Thus, the true world is almost certainly (1); you'll get tortured, but humanity survives. For great utility! On the other hand, if you decide to not press the button, then by the same reasoning, the true world is almost certainly (2), and humanity has wiped itself out. Surely you're not selfish enough to prefer that?

The correct answer, clearly, is that your decision whether to press the button doesn't influence whether humanity survives, it only influences whether you get tortured to death. (Plus, of course, whether Omega hands you the button in the first place!) You don't want to get tortured, so you don't press the button. Updateless reasoning gets this right.

*

Let me spell out the rules of the naive Bayesian decision theory ("NBDT") I used there, in analogy with Simple Updateless Decision Theory (SUDT). First, let's set up our problem in the SUDT framework. To simplify things, we'll pretend that FOOM and DOOM are the only possible things that can happen to humanity. In addition, we'll assume that there's a small probability $\textstyle \varepsilon$ that Omega makes a mistake when it tries to predict what you will do if given the button. Thus, the relevant possible worlds are $\textstyle \Omega = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{correct},\mathrm{incorrect}\}$. The precise probabilities you assign to these doesn't matter very much; I'll pretend that FOOM and DOOM are equiprobable, $\textstyle \mathbb{P}(x,\mathrm{incorrect}) = \varepsilon/2$ and $\textstyle \mathbb{P}(x,\mathrm{correct}) = (1-\varepsilon)/2$.

There's only one situation in which you need to make a decision, $\textstyle \mathcal{I} = \{*\}$; I won't try to define NBDT when there is more than one situation. Your possible actions in this situation are to press or to not press the button, $\textstyle \mathcal{A}(*) = \{P,\neg P\}$, so the only possible policies are $\textstyle \pi_P$, which presses the button ($\textstyle \pi_P(*) = P$), and $\textstyle \pi_{\neg P}$, which doesn't ($\textstyle \pi_{\neg P}(*) = \neg P$); $\textstyle \Pi = \{\pi_P,\pi_{\neg P}\}$.

There are four possible outcomes, specifying (a) whether humanity survives and (b) whether you get tortured: $\textstyle \mathcal{O} = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{torture},\neg\mathrm{torture}\}$. Omega only hands you the button if FOOM and it predicts you'll press it, or DOOM and it predicts you won't. Thus, the only cases in which you'll get tortured are $\textstyle o((\mathrm{foom},\mathrm{correct}),\pi_P) = (\mathrm{foom},\mathrm{torture})$ and $\textstyle o((\mathrm{doom},\mathrm{incorrect}),\pi_P) = (\mathrm{doom},\mathrm{torture})$. For any other $\textstyle x\in\{\mathrm{foom},\mathrm{doom}\}$, $\textstyle y\in\{\mathrm{correct},\mathrm{incorrect}\}$, and $\textstyle \pi\in\Pi$, we have $\textstyle o((x,y),\pi) = (x,\neg\mathrm{torture})$.

Finally, let's define our utility function by $u((\mathrm{foom},\neg\mathrm{torture})) = L$, $u((\mathrm{foom},\mathrm{torture})) = L-1$, $u((\mathrm{doom},\neg\mathrm{torture})) = -L$, and $u((\mathrm{doom},\mathrm{torture})) = -L-1$, where $\textstyle L$ is a very large number.

This suffices to set up an SUDT decision problem. There are only two possible worlds $\textstyle \omega\in\Omega$ where $\textstyle u(o(\omega,\pi_P))$ differs from $\textstyle u(o(\omega,\pi_{\neg P}))$, namely $\textstyle (\mathrm{foom},\mathrm{correct})$ and $\textstyle (\mathrm{doom},\mathrm{incorrect})$, where $\textstyle \pi_P$ results in torture and $\textstyle \pi_{\neg P}$ doesn't. In each of these cases, the utility of $\textstyle \pi_P$ is lower (by one) than that of $\textstyle \pi_{\neg P}$. Hence, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi_P))] < \mathbb{E}[u(o(\boldsymbol{\omega},\pi_{\neg P}))]$, implying that SUDT says you should choose $\textstyle \pi_{\neg P}$.

*

For NBDT, we need to know how to update, so we need one more ingredient: a function specifying in which worlds you exist as a conscious observer. In anticipation of future discussions, I'll write this as a function $\textstyle \mu(i;\omega,\pi)$, which gives the "measure" ("amount of magical reality fluid") of the conscious observation $\textstyle i\in\mathcal{I}$ if policy $\textstyle \pi\in\Pi$ is executed in the possible world $\textstyle \omega\in\Omega$. In our case, $\textstyle i = *$ and $\textstyle \mu(*;\omega,\pi)\in\{0,1\}$, indicating non-existence and existence, respectively. We can interpret $\textstyle \mu(i;\omega,\pi)$ as the conditional probability of making observation $\textstyle i$, given that the true world is $\textstyle \omega$, if plan $\textstyle \pi$ is executed. In our case, $\textstyle \mu(*;(\mathrm{foom},\mathrm{correct}),\pi_P) =$ $\textstyle \mu(*;(\mathrm{foom},\mathrm{incorrect}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{correct}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{incorrect}),\pi_P) = 1$, and $\textstyle \mu(*;\omega,\pi) = 0$ in all other cases.

Now, we can use Bayes' theorem to calculate the posterior probability of a possible world, given information $\textstyle i = *$ and policy $\textstyle \pi$: $\textstyle \mathbb{P}(\omega\mid i;\pi) = \mathbb{P}(\omega)\cdot\mu(i;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(i;\omega',\pi)$. NBDT tells us to choose the policy $\textstyle \pi$ that maximizes the posterior expected utility, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid i;\pi]$.

In our case, we have $\textstyle \mathbb{P}((\mathrm{foom},\mathrm{correct}) \mid *;\pi_P) = \mathbb{P}((\mathrm{doom},\mathrm{correct}) \mid *;\pi_{\neg P}) = 1-\varepsilon$ and $\textstyle \mathbb{P}((\mathrm{doom},\mathrm{incorrect}) \mid *;\pi_P) = \mathbb{P}((\mathrm{foom},\mathrm{incorrect}) \mid *;\pi_{\neg P}) = \varepsilon$. Thus, if we press the button, our expected utility is dominated by the near-certainty of humanity surviving, whereas if we don't, it's dominated by humanity's near-certain doom, and NBDT says we should press.

*

But maybe it's not updating that's bad, but NBDT's way of implementing it? After all, we get the clearly wacky results only if our decisions can influence whether we exist, and perhaps the way that NBDT extends the usual formula to this case happens to be the wrong way to extend it.

One thing we could try is to mark a possible world $\textstyle \omega$ as impossible only if $\textstyle \mu(*;\omega,\pi) = 0$ for all policies $\textstyle \pi$ (rather than: for the particular policy $\textstyle \pi$ whose expected utility we are computing). But this seems very ad hoc to me. (For example, this could depend on which set of possible actions $\textstyle \mathcal{A}(*)$ we consider, which seems odd.)

There is a much more principled possibility, which I'll call pseudo-Bayesian decision theory, or PBDT. PBDT can be seen as re-interpreting updating as saying that you're indifferent about what happens in possible worlds in which you don't exist as a conscious observer, rather than ruling out those worlds as impossible given your evidence. (A version of this idea was recently brought up in a comment by drnickbone, though I'd thought of this idea myself during my journey towards my current position on updating, and I imagine it has also appeared elsewhere, though I don't remember any specific instances.) I have more than one objection to PBDT, but the simplest one to argue is that it doesn't solve the problem: it still believes that it has anthropic superpowers in the problem above.

Formally, PBDT says that we should choose the policy $\textstyle \pi$ that maximizes $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$ (where the expectation is with respect to the prior, not the updated, probabilities). In other words, we set the utility of any outcome in which we don't exist as a conscious observer to zero; we can see PBDT as SUDT with modified outcome and utility functions.

When our existence is independent on our decisions—that is, if $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$—then it turns out that PBDT and NBDT are equivalent, i.e., PBDT implements Bayesian updating. That's because in that case, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid *;\pi] =$ $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega\mid *;\pi)$ $\textstyle = \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi)$. If $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$, then the whole denominator doesn't depend on $\textstyle \pi$, so the fraction is maximized if and only if the numerator is. But the numerator is $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) =$ $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\omega,\pi)]$, exactly the quantity that PBDT says should be maximized.

Unfortunately, although in our problem above $\mu(*;\omega,\pi)$ does depend of $\pi$, the denominator as a whole still doesn't: For both $\pi_P$ and $\pi_{\neg P}$, there is exactly one possible world with probability $(1-\varepsilon)/2$ and one possible world with probability $\varepsilon/2$ in which $*$ is a conscious observer, so we have $\textstyle\sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi) = 1/2$ for both $\pi\in\Pi$. Thus, PBDT gives the same answer as NBDT, by the same mathematical argument as in the case where we can't influence our own existence. If you think of PBDT as SUDT with the utility function $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then intuitively, PBDT can be thought of as reasoning, "Sure, I can't influence whether humanity is wiped out; but I can influence whether I'm an l-zombie or a conscious observer; and who cares what happens to humanity if I'm not? Best to press to button, since getting tortured in a world where there's been a positive intelligence explosion is much better than life without torture if humanity has been wiped out."

I think that's a pretty compelling argument against PBDT, but even leaving it aside, I don't like PBDT at all. I see two possible justifications for PBDT: You can either say that $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ is your real utility function—you really don't care about what happens in worlds where the version of you making the decision doesn't exist as a conscious observer—or you can say that your real preferences are expressed by $u(o(\omega,\pi))$, and multiplying by $\mu(*;\omega,\pi)$ is just a mathematical trick to express a steelmanned version of Bayesian updating. If your preferences really are given by $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then fine, and you should be maximizing $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\omega,\pi)]$ (because you should be using (S)UDT), and you should press the button. Some kind of super-selfish agent, who doesn't care a fig even about a version of itself that is exactly the same up till five seconds ago (but then wasn't handed the button) could indeed have such preferences. But I think these are wacky preferences, and you don't actually have them. (Furthermore, if you did have them, then $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ would be your actual utility function, and you should be writing it as just $u(o(\omega,\pi))$, where $o(\omega,\pi)$ must now give information about whether $*$ is a conscious observer.)

If multiplying by $\mu(*;\omega,\pi)$ is just a trick to implement updating, on the other hand, then I find it strange that it introduces a new concept that doesn't occur at all in classical Bayesian updating, namely the utility of a world in which $*$ is an l-zombie. We've set this to zero, which is no loss of generality because classical utility functions don't change their meaning if you add or subtract a constant, so whenever you have a utility function where all worlds in which $*$ is an l-zombie have the same utility $u_0$, then you can just subtract $u_0$ from all utilities (without changing the meaning of the utility function), and get a function where that utility is zero. But that means that the utility functions I've been plugging into PBDT above do change their meaning if you add a constant to them. You can set up a problem where the agent has to decide whether to bring itself into existence or not (Omega creates it iff it predicts that the agent will press a particular button), and in that case the agent will decide to do so iff the world has utility greater than zero—clearly not invariant under adding and subtracting a constant. I can't find any concept like the utility of not existing in my intuitions about Bayesian updating (though I can find such a concept in my intuitions about utility, but regarding that see the previous paragraph), so if PBDT is just a mathematical trick to implement these intuitions, where does that utility come from?

I'm not aware of a way of implementing updating in general SUDT-style problems that does better than NBDT, PBDT, and the ad-hoc idea mentioned above, so for now I've concluded that in general, trying to update is just hopeless, and we should be using (S)UDT instead. In classical decision problems, where there are no acausal influences, (S)UDT will of course behave exactly as if it did do a Bayesian update; thus, in a sense, using (S)UDT can also be seen as a reinterpretation of Bayesian updating (in this case just as updateless utility maximization in a world where all influence is causal), and that's the way I think about it nowadays.

## SUDT: A toy decision theory for updateless anthropics

15 23 February 2014 11:50PM

The best approach I know for thinking about anthropic problems is Wei Dai's Updateless Decision Theory (UDT). We aren't yet able to solve all problems that we'd like to—for example, when it comes to game theory, the only games we have any idea how to solve are very symmetric ones—but for many anthropic problems, UDT gives the obviously correct solution. However, UDT is somewhat underspecified, and cousin_it's concrete models of UDT based on formal logic are rather heavyweight if all you want is to figure out the solution to a simple anthropic problem.

In this post, I introduce a toy decision theory, Simple Updateless Decision Theory or SUDT, which is most definitely not a replacement for UDT but makes it easy to formally model and solve the kind of anthropic problems that we usually apply UDT to. (And, of course, it gives the same solutions as UDT.) I'll illustrate this with a few examples.

This post is a bit boring, because all it does is to take a bit of math that we already implicitly use all the time when we apply updateless reasoning to anthropic problems, and spells it out in excruciating detail. If you're already well-versed in that sort of thing, you're not going to learn much from this post. The reason I'm posting it anyway is that there are things I want to say about updateless anthropics, with a bit of simple math here and there, and while the math may be intuitive, the best thing I can point to in terms of details are the posts on UDT, which contain lots of irrelevant complications. So the main purpose of this post is to save people from having to reverse-engineer the simple math of SUDT from the more complex / less well-specified math of UDT.

(I'll also argue that Psy-Kosh's non-anthropic problem is a type of counterfactual mugging, I'll use the concept of l-zombies to explain why UDT's response to this problem is correct, and I'll explain why this argument still works if there aren't any l-zombies.)

*

I'll introduce SUDT by way of a first example: the counterfactual mugging. In my preferred version, Omega appears to you and tells you that it has thrown a very biased coin, which had only a 1/1000 chance of landing heads; however, in this case, the coin has in fact fallen heads, which is why Omega is talking to you. It asks you to choose between two options, (H) and (T). If you choose (H), Omega will create a Friendly AI; if you choose (T), it will destroy the world. However, there is a catch: Before throwing the coin, Omega made a prediction about which of these options you would choose if the coin came up heads (and it was able to make a highly confident prediction). If the coin had come up tails, Omega would have destroyed the world if it's predicted that you'd choose (H), and it would have created a Friendly AI if it's predicted (T). (Incidentally, if it hadn't been able to make a confident prediction, it would just have destroyed the world outright.)

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) You choose (H) if coin falls heads Positive intelligence explosion Humanity wiped out You choose (T) if coin falls heads Humanity wiped out Positive intelligence explosion

In this example, we are considering two possible worlds:  and . We write  (no pun intended) for the set of all possible worlds; thus, in this case, . We also have a probability distribution over , which we call . In our example,  and .

In the counterfactual mugging, there is only one situation you might find yourself in in which you need to make a decision, namely when Omega tells you that the coin has fallen heads. In general, we write  for the set of all possible situations in which you might need to make a decision; the  stands for the information available to you, including both sensory input and your memories. In our case, we'll write , where  is the single situation where you need to make a decision.

For every , we write  for the set of possible actions you can take if you find yourself in situation . In our case,. A policy (or "plan") is a function  that associates to every situation  an action  to take in this situation. We write  for the set of all policies. In our case, , where  and .

Next, there is a set of outcomes, , which specify all the features of what happens in the world that make a difference to our final goals, and the outcome function , which for every possible world  and every policy  specifies the outcome  that results from executing  in the world . In our case,  (standing for FAI and DOOM), and  and .

Finally, we have a utility function . In our case,  and . (The exact numbers don't really matter, as long as , because utility functions don't change their meaning under affine transformations, i.e. when you add a constant to all utilities or multiply all utilities by a positive number.)

Thus, an SUDT decision problem consists of the following ingredients: The sets ,  and  of possible worlds, situations you need to make a decision in, and outcomes; for every , the set  of possible actions in that situation; the probability distribution ; and the outcome and utility functions  and . SUDT then says that you should choose a policy  that maximizes the expected utility , where  is the expectation with respect to , and  is the true world.

In our case,  is just the probability of the good outcome , according to the (prior) distribution . For , that probability is 1/1000; for , it is 999/1000. Thus, SUDT (like UDT) recommends choosing (T).

If you set up the problem in SUDT like that, it's kind of hidden why you could possibly think that's not the right thing to do, since we aren't distinguishing situations  that are "actually experienced" in a particular possible world ; there's nothing in the formalism that reflects the fact that Omega never asks us for our choice if the coin comes up tails. In my post on l-zombies, I've argued that this makes sense because even if there's no version of you that actually consciously experiences being in the heads world, this version still exists as a Turing machine and the choices that it makes influence what happens in the real world. If all mathematically possible experiences exist, so that there aren't any l-zombies, but some experiences are "experienced more" (have more "magical reality fluid") than others, the argument is even clearer—even if there's some anthropic sense in which, upon being told that the coin fell heads, you can conclude that you should assign a high probability of being in the heads world, the same version of you still exists in the tails world, and its choices influence what happens there. And if everything is experienced to the same degree (no magical reality fluid), the argument is clearer still.

*

From Vladimir Nesov's counterfactual mugging, let's move on to what I'd like to call Psy-Kosh's probably counterfactual mugging, better known as Psy-Kosh's non-anthropic problem. This time, you're not alone: Omega gathers you together with 999,999 other advanced rationalists, all well-versed in anthropic reasoning and SUDT. It places each of you in a separate room. Then, as before, it throws a very biased coin, which has only a 1/1000 chance of landing heads. If the coin does land heads, then Omega asks all of you to choose between two options, (H) and (T). If the coin falls tails, on the other hand, Omega chooses one of you at random and asks that person to choose between (H) and (T). If the coin lands heads and you all choose (H), Omega will create a Friendly AI; same if the coin lands tails, and the person who's asked chooses (T); else, Omega will destroy the world.

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) Everyone chooses (H) if asked Positive intelligence explosion Humanity wiped out Everyone chooses (T) if asked Humanity wiped out Positive intelligence explosion Different people choose differently Humanity wiped out (Depends on who is asked)

We'll assume that all of you prefer a positive FOOM over a gloomy DOOM, which means that all of you have the same values as far as the outcomes of this little dilemma are concerned: , as before, and all of you have the same utility function, given by  and . As long as that's the case, we can apply SUDT to find a sensible policy for everybody to follow (though when there is more than one optimal policy, and the different people involved can't talk to each other, it may not be clear how one of the policies should be chosen).

This time, we have a million different people, who can in principle each make an independent decision about what to answer if Omega asks them the question. Thus, we have . Each of these people can choose between (H) and (T), so  for every person , and a policy  is a function that returns either (H) or (T) for every . Obviously, we're particularly interested in the policies  and  satisfying  and  for all .

The possible worlds are , and their probabilities are  and . The outcome function is as follows: ,  for ,  if , and  otherwise.

What does SUDT recommend? As in the counterfactual mugging,  is the probability of the good outcome , under policy . For , the good outcome can only happen if the coin falls heads: in other words, with probability . If , then the good outcome can not happen if the coin falls heads, because in that case everybody gets asked, and at least one person chooses (T). Thus, in this case, the good outcome will happen only if the coin comes up tails and the randomly chosen person answers (T); this probability is , where  is the number of people answering (T). Clearly, this is maximized for , where ; moreover, in this case we get the probability , which is better than for , so SUDT recommends the plan .

Again, when you set up the problem in SUDT, it's not even obvious why anyone might think this wasn't the correct answer. The reason is that if Omega asks you, and you update on the fact that you've been asked, then after updating, you are quite certain that the coin has landed heads: yes, your prior probability was only 1/1000, but if the coin has landed tails, the chances that you would be asked was only one in a million, so the posterior odds are about 1000:1 in favor of heads. So, you might reason, it would be best if everybody chose (H); and moreover, all the people in the other rooms will reason the same way as you, so if you choose (H), they will as well, and this maximizes the probability that humanity survives. This relies on the fact that the others will choose the same way as you, but since you're all good rationalists using the same decision theory, that's going to be the case.

But in the worlds where the coin comes up tails, and Omega chooses someone else than you, the version of you that gets asked for its decision still "exists"... as an l-zombie. You might think that what this version of you does or doesn't do doesn't influence what happens in the real world; but if we accept the argument from the previous paragraph that your decisions are "linked" to those of the other people in the experiment, then they're still linked if the version of you making the decision is an l-zombie: If we see you as a Turing machine making a decision, that Turing machine should reason, "If the coin came up tails and someone else was chosen, then I'm an l-zombie, but the person who is actually chosen will reason exactly the same way I'm doing now, and will come to the same decision; hence, my decision influences what happens in the real world even in this case, and I can't do an update and just ignore those possible worlds."

I call this the "probably counterfactual mugging" because in the counterfactual mugging, you are making your choice because of its benefits in a possible world that is ruled out by your observations, while in the probably counterfactual mugging, you're making it because of its benefits in a set of possible worlds that is made very improbable by your observations (because most of the worlds in this set are ruled out). As with the counterfactual mugging, this argument is just all the stronger if there are no l-zombies because all mathematically possible experiences are in fact experienced.

*

As a final example, let's look at what I'd like to call Eliezer's anthropic mugging: the anthropic problem that inspired Psy-Kosh's non-anthropic one. This time, you're alone again, except that there's many of you: Omega is creating a million copies of you. It flips its usual very biased coin, and if that coin falls heads, it places all of you in exactly identical green rooms. If the coin falls tails, it places one of you in a green room, and all the others in red rooms. It then asks all copies in green rooms to choose between (H) and (T); if your choice agrees with the coin, FOOM, else DOOM.

 Coin falls heads (chance = 1/1000) Coin falls tails (chance = 999/1000) Green roomers choose (H) Positive intelligence explosion Humanity wiped out Green roomers choose (T) Humanity wiped out Positive intelligence explosion

Our possible worlds are back to being , with probabilities  and . We are also back to being able to make a choice in only one particular situation, namely when you're a copy in a green room: . Actions are , outcomes , utilities  and , and the outcome function is given by  and . In other words, from SUDT's perspective, this is exactly identical to the situation with the counterfactual mugging, and thus the solution is the same: Once more, SUDT recommends choosing (T).

On the other hand, the reason why someone might think that (H) could be the right answer is closer to that for Psy-Kosh's probably counterfactual mugging: After waking up in a green room, what should be your posterior probability that the coin has fallen heads? Updateful anthropic reasoning says that you should be quite sure that it has fallen heads. If you plug those probabilities into an expected utility calculation, it comes out as in Psy-Kosh's case, heavily favoring (H).

But even if these are good probabilities to assign epistemically (to satisfy your curiosity about what the world probably looks like), in light of the arguments from the counterfactual and the probably counterfactual muggings (where updating definitely is the right thing to do epistemically, but plugging these probabilities into the expected utility calculation gives the wrong result), it doesn't seem strange to me to come to the conclusion that choosing (T) is correct in Eliezer's anthropic mugging as well.

## Intelligence Metrics with Naturalized Induction using UDT

13 21 February 2014 12:23PM

Followup to: Intelligence Metrics and Decision Theory
Related to: Bridge Collapse: Reductionism as Engineering Problem

A central problem in AGI is giving a formal definition of intelligence. Marcus Hutter has proposed AIXI as a model of perfectly intelligent agent. Legg and Hutter have defined a quantitative measure of intelligence applicable to any suitable formalized agent such that AIXI is the agent with maximal intelligence according to this measure.

Legg-Hutter intelligence suffers from a number of problems I have previously discussed, the most important being:

• The formalism is inherently Cartesian. Solving this problem is known as naturalized induction and it is discussed in detail here.
• The utility function Legg & Hutter use is a formalization of reinforcement learning, while we would like to consider agents with arbitrary preferences. Moreover, a real AGI designed with reinforcement learning would tend to wrestle control of the reinforcement signal from the operators (there must be a classic reference on this but I can't find it. Help?). It is straightword to tweak to formalism to allow for any utility function which depends on the agent's sensations and actions, however we would like to be able to use any ontology for defining it.
Orseau and Ring proposed a non-Cartesian intelligence metric however their formalism appears to be too general, in particular there is no Solomonoff induction or any analogue thereof, instead a completely general probability measure is used.

My attempt at defining a non-Cartesian intelligence metric ran into problems of decision-theoretic flavor. The way I tried to used UDT seems unsatisfactory, and later I tried a different approach related to metatickle EDT.

In this post, I claim to accomplish the following:
• Define a formalism for logical uncertainty. When I started writing this I thought this formalism might be novel but now I see it is essentially the same as that of Benja.
• Use this formalism to define a non-constructive formalization of UDT. By "non-constructive" I mean something that assigns values to actions rather than a specific algorithm like here.
• Apply the formalization of UDT to my quasi-Solomonoff framework to yield an intelligence metric.
• Slightly modify my original definition of the quasi-Solomonoff measure so that the confidence of the innate model becomes a continuous rather than discrete parameter. This leads to an interesting conjecture.
• Propose a "preference agnostic" variant as an alternative to Legg & Hutter's reinforcement learning.
• Discuss certain anthropic and decision-theoretic aspects.

# Logical Uncertainty

The formalism introduced here was originally proposed by Benja.

Fix a formal system F. We want to be able to assign probabilities to statements s in F, taking into account limited computing resources. Fix D a natural number related to the amount of computing resources that I call "depth of analysis".

Define P0(s) := 1/2 for all s to be our initial prior, i.e. each statement's truth value is decided by a fair coin toss. Now define
PD(s) := P0(s | there are no contradictions of length <= D).

Consider X to be a number in [0, 1] given by a definition in F. Then dk(X) := "The k-th digit of the binary expansion of X is 1" is a statement in F. We define ED(X) := Σk 2-k PD(dk(X)).

## Remarks

• Clearly if s is provable in F then for D >> 0, PD(s) = 1. Similarly if "not s" is provable in F then for D >> 0,
PD(s) = 0.
• If each digit of X is decidable in F then lim-> inf ED(X) exists and equals the value of X according to F.
• For s of length > D, PD(s) = 1/2 since no contradiction of length <= D can involve s.
• It is an interesting question whether lim-> inf PD(s) exists for any s. It seems false that this limit always exists and equals 0 or 1, i.e. this formalism is not a loophole in Goedel incompleteness. To see this consider statements that require a high (arithmetical hierarchy) order halting oracle to decide.
• In computational terms, D corresponds to non-deterministic spatial complexity. It is spatial since we assign truth values simultaneously to all statements so in any given contradiction it is enough to retain the "thickest" step. It is non-deterministic since it's enough for a contradiction to exists, we don't have an actual computation which produces it. I suspect this can be made more formal using the Curry-Howard isomorphism, unfortunately I don't understand the latter yet.

# Non-Constructive UDT

Consider A a decision algorithm for optimizing utility U, producing an output ("decision") which is an element of C. Here U is just a constant defined in F. We define the U-value of c in C for A at depth of analysis D to be
VD(c, A; U) := ED(U | "A produces c" is true). It is only well defined as long as "A doesn't produce c" cannot be proved at depth of analysis D i.e. PD("A produces c") > 0. We define the absolute U-value of c for A to be
V(cAU) := ED(c, A)(U | "A produces c" is true) where D(c, A) := max {D | PD("A produces c") > 0}. Of course D(cA) can be infinite in which case Einf(...) is understood to mean limD -> inf ED(...).

For example V(cAU) yields the natural values for A an ambient control algorithm applied to e.g. a simple model of Newcomb's problem.  To see this note that given A's output the value of U can be determined at low depths of analysis whereas the output of A requires a very high depth of analysis to determine.

# Naturalized Induction

Our starting point is the "innate model" N: a certain a priori model of the universe including the agent G. This model encodes the universe as a sequence of natural numbers Y = (yk) which obeys either specific deterministic or non-deterministic dynamics or at least some constraints on the possible histories. It may or may not include information on the initial conditions. For example, N can describe the universe as a universal Turing machine M (representing G) with special "sensory" registers e. N constraints the dynamics to be compatible with the rules of the Turing machine but leaves unspecified the behavior of e. Alternatively, N can contain in addition to M a non-trivial model of the environment. Or N can be a cellular automaton with the agent corresponding to a certain collection of cells.

However, G's confidence in N is limited: otherwise it wouldn't need induction. We cannot start with 0 confidence: it's impossible to program a machine if you don't have even a guess of how it works. Instead we introduce a positive real number t which represents the timescale over which N is expected to hold. We then assign to each hypothesis H about Y (you can think about them as programs which compute yk given yj for j < k; more on that later) the weight QS(H) := 2-L(H(1 - e-t(H)/t). Here L(H) is the length of H's encoding in bits and t(H) is the time during which H remains compatible with N. This is defined for N of deterministic / constraint type but can be generalized to stochastic N

The weights QS(H) define a probability measure on the space of hypotheses which induces a probability measure on the space of histories Y. Thus we get an alternative to Solomonoff induction which allows for G to be a mechanistic part of the universe, at the price of introducing N and t

## Remarks

• Note that time is discrete in this formalism but t is continuous.
• Since we're later going to use logical uncertainties wrt the formal system F, it is tempting to construct the hypothesis space out of predicates in F rather than programs.

# Intelligence Metric

• The decoding Q: {Y} -> {bit-string} of the agent G from the universe Y. For example Q can read off the program loaded into M at time k=0.
• A utility function U: {Y} -> [0, 1] representing G's preferences. U has to be given by a definition in F. Note that N provides the ontology wrt which U is defined.
It seems tempting to define the intelligence to be EQS(U | Q), the conditional expectation value of U for a given value of Q in the quasi-Solomonoff measure. However, this is wrong for roughly the same reasons EDT is wrong (see previous post for details).

Instead, we define I(Q0) := EQS(Emax(U(Y(H)) | "Q(Y(H)) = Q0" is true)). Here the subscript max stands for maximal depth of analysis, as in the construction of absolute UDT value above.

## Remarks

• IMO the correct way to look at this is intelligence metric = value of decision for the decision problem "what should I program into my robot?". If N is a highly detailed model including "me" (the programmer of the AI), this literally becomes the case. However for theoretical analysis it is likely to be more convenient to work with simple N (also conceptually it leaves room for a "purist" notion of agent's intelligence, decoupled from the fine details of its creator).
• As opposed to usual UDT, the algorithm (H) making the decision (Q) is not known with certainty. I think this represents a real uncertainty that has to be taken into account in decision problems in general: the decision-maker doesn't know her own algorithm. Since this "introspective uncertainty" is highly correlated with "indexical" uncertainty (uncertainty about the universe), it prevents us from absorbing the later into the utility function as proposed by Coscott
• For high values of t, G can improve its understanding of the universe by bootstrapping the knowledge it already has. This is not possible for low values of t. In other words, if I cannot trust my mind at all, I cannot deduce anything. This leads me to an interesting conjecture: There is a a critical value t* of t from which this bootstrapping becomes possible (the positive feedback look of knowledge becomes critical). I(Q) is non-smooth at t* (phase transition).
• If we wish to understand intelligence, it might be beneficial to decouple it from the choice of preferences. To achieve this we can introduce the preference formula as an unknown parameter in N. For example, if G is realized by a machine M, we can connect M to a data storage E whose content is left undetermined by N. We can then define U to be defined by the formula encoded in E at time k=0. This leads to I(Q) being a sort of "general-purpose" intelligence while avoiding the problems associated with reinforcement learning.
• As opposed to Legg-Hutter intelligence, there appears to be no simple explicit description for Q* maximizing I(Q) (e.g. among all programs of given length). This is not surprising, since computational cost considerations come into play. In this framework it appears to be inherently impossible to decouple the computational cost considerations: G's computations have to be realized mechanistically and therefore cannot be free of time cost and side-effects.
• Ceteris paribus, Q* deals efficiently with problems like counterfactual mugging. The "ceteris paribus" conditional is necessary here since because of cost and side-effects of computations it is difficult to make absolute claims. However, it doesn't deal efficiently with counterfactual mugging in which G doesn't exist in the "other universe". This is because the ontology used for defining U (which is given by N) assumes G does exist. At least this is the case for simple ontologies like described above: possibly we can construct N in which G might or might not exist. Also, if G uses a quantum ontology (i.e. N describes the universe in terms of a wavefunction and U computes the quantum expectation value of an operator) then it does take into account other Everett universes in which G doesn't exist.
• For many choices of N (for example if the G is realized by a machine M), QS-induction assigns well-defined probabilities to subjective expectations, contrary to what is expected from UDT. However:
• This is not the case for all N. In particular, if N admits destruction of M then M's sensations after the point of destruction are not well-defined. Indeed, we better allow for destruction of M if we want G's preferences to behave properly in such an event. That is, if we don't allow it we get a "weak anvil problem" in the sense that G experiences an ontological crisis when discovering its own mortality and the outcome of this crisis is not obvious. Note though that it is not the same as the original ("strong") anvil problem, for example G might come to the conclusion the dynamics of "M's ghost" will be some sort of random.
• These probabilities probably depend significantly on N and don't amount to an elegant universal law for solving the anthropic trilemma.
• Indeed this framework is not completely "updateless", it is "partially updated" by the introduction of N and t. This suggests we might want the updates to be minimal in some sense, in particular t should be t*.
• The framework suggests there is no conceptual problem with cosmologies in which Boltzmann brains are abundant. Q* wouldn't think it is a Boltzmann brain since the long address of Boltzmann brains within the universe makes the respective hypotheses complex thus suppressing them, even disregarding the suppression associated with N. I doubt this argument is original but I feel the framework validates it to some extent.

## I like simplicity, but not THAT much

15 14 February 2014 07:51PM

Followup to: L-zombies! (L-zombies?)
Reply to: Coscott's Preferences without Existence; Paul Christiano's comment on my l-zombies post

In my previous post, I introduced the idea of an "l-zombie", or logical philosophical zombie: A Turing machine that would simulate a conscious human being if it were run, but that is never run in the real, physical world, so that the experiences that this human would have had, if the Turing machine were run, aren't actually consciously experienced.

One common reply to this is to deny the possibility of logical philosophical zombies just like the possibility of physical philosophical zombies: to say that every mathematically possible conscious experience is in fact consciously experienced, and that there is no kind of "magical reality fluid" that makes some of these be experienced "more" than others. In other words, we live in the Tegmark Level IV universe, except that unlike Tegmark argues in his paper, there's no objective measure on the collection of all mathematical structures, according to which some mathematical structures somehow "exist more" than others (and, although IIRC that's not part of Tegmark's argument, according to which the conscious experiences in some mathematical structures could be "experienced more" than those in other structures). All mathematically possible experiences are experienced, and to the same "degree".

So why is our world so orderly? There's a mathematically possible continuation of the world that you seem to be living in, where purple pumpkins are about to start falling from the sky. Or the light we observe coming in from outside our galaxy is suddenly replaced by white noise. Why don't you remember ever seeing anything as obviously disorderly as that?

And the answer to that, of course, is that among all the possible experiences that get experienced in this multiverse, there are orderly ones as well as non-orderly ones, so the fact that you happen to have orderly experiences isn't in conflict with the hypothesis; after all, the orderly experiences have to be experienced as well.

One might be tempted to argue that it's somehow more likely that you will observe an orderly world if everybody who has conscious experiences at all, or if at least most conscious observers, see an orderly world. (The "most observers" version of the argument assumes that there is a measure on the conscious observers, a.k.a. some kind of magical reality fluid.) But this requires the use of anthropic probabilities, and there is simply no (known) system of anthropic probabilities that gives reasonable answers in general. Fortunately, we have an alternative: Wei Dai's updateless decision theory (which was motivated in part exactly by the problem of how to act in this kind of multiverse). The basic idea is simple (though the details do contain devils): We have a prior over what the world looks like; we have some preferences about what we would like the world to look like; and we come up with a plan for what we should do in any circumstance we might find ourselves in that maximizes our expected utility, given our prior.

*

In this framework, Coscott and Paul suggest, everything adds up to normality if, instead of saying that some experiences objectively exist more, we happen to care more about some experiences than about others. (That's not a new idea, of course, or the first time this has appeared on LW -- for example, Wei Dai's What are probabilities, anyway? comes to mind.) In particular, suppose we just care more about experiences in mathematically really simple worlds -- or more precisely, places in mathematically simple worlds that are mathematically simple to describe (since there's a simple program that runs all Turing machines, and therefore all mathematically possible human experiences, always assuming that human brains are computable). Then, even though there's a version of you that's about to see purple pumpkins rain from the sky, you act in a way that's best in the world where that doesn't happen, because that world has so much lower K-complexity, and because you therefore care so much more about what happens in that world.

There's something unsettling about that, which I think deserves to be mentioned, even though I do not think it's a good counterargument to this view. This unsettling thing is that on priors, it's very unlikely that the world you experience arises from a really simple mathematical description. (This is a version of a point I also made in my previous post.) Even if the physicists had already figured out the simple Theory of Everything, which is a super-simple cellular automaton that accords really well with experiments, you don't know that this simple cellular automaton, if you ran it, would really produce you. After all, imagine that somebody intervened in Earth's history so that orchids never evolved, but otherwise left the laws of physics the same; there might still be humans, or something like humans, and they would still run experiments and find that they match the predictions of the simple cellular automaton, so they would assume that if you ran that cellular automaton, it would compute them -- except it wouldn't, it would compute us, with orchids and all. Unless, of course, it does compute them, and a special intervention is required to get the orchids.

So you don't know that you live in a simple world. But, goes the obvious reply, you care much more about what happens if you do happen to live in the simple world. On priors, it's probably not true; but it's best, according to your values, if all people like you act as if they live in the simple world (unless they're in a counterfactual mugging type of situation, where they can influence what happens in the simple world even if they're not in the simple world themselves), because if the actual people in the simple world act like that, that gives the highest utility.

You can adapt an argument that I was making in my l-zombies post to this setting: Given these preferences, it's fine for everybody to believe that they're in a simple world, because this will increase the correspondence between map and territory for the people that do live in simple worlds, and that's who you care most about.

*

I mostly agree with this reasoning. I agree that Tegmark IV without a measure seems like the most obvious and reasonable hypothesis about what the world looks like. I agree that there seems no reason for there to be a "magical reality fluid". I agree, therefore, that on the priors that I'd put into my UDT calculation for how I should act, it's much more likely that true reality is a measureless Tegmark IV than that it has some objective measure according to which some experiences are "experienced less" than others, or not experienced at all. I don't think I understand things well enough to be extremely confident in this, but my odds would certainly be in favor of it.

Moreover, I agree that if this is the case, then my preferences are to care more about the simpler worlds, making things add up to normality; I'd want to act as if purple pumpkins are not about to start falling from the sky, precisely because I care more about the consequences my actions have in more orderly worlds.

But.

*

Imagine this: Once you finish reading this article, you hear a bell ringing, and then a sonorous voice announces: "You do indeed live in a Tegmark IV multiverse without a measure. You had better deal with it." And then it turns out that it's not just you who's heard that voice: Every single human being on the planet (who didn't sleep through it, isn't deaf etc.) has heard those same words.

On the hypothesis, this is of course about to happen to you, though only in one of those worlds with high K-complexity that you don't care about very much.

So let's consider the following possible plan of action: You could act as if there is some difference between "existence" and "non-existence", or perhaps some graded degree of existence, until you hear those words and confirm that everybody else has heard them as well, or until you've experienced one similarly obviously "disorderly" event. So until that happens, you do things like invest time and energy into trying to figure out what the best way to act is if it turns out that there is some magical reality fluid, and into trying to figure out what a non-confused version of something like a measure on conscious experience could look like, and you act in ways that don't kill you if we happen to not live in a measureless Tegmark IV. But once you've had a disorderly experience, just a single one, you switch over to optimizing for the measureless mathematical multiverse.

If the degree to which you care about worlds is really proportional to their K-complexity, with respect to what you and I would consider a "simple" universal Turing machine, then this would be a silly plan; there is very little to be gained from being right in worlds that have that much higher K-complexity. But when I query my intuitions, it seems like a rather good plan:

• Yes, I care less about those disorderly worlds. But not as much less as if I valued them by their K-complexity. I seem to be willing to tap into my complex human intuitions to refer to the notion of "single obviously disorderly event", and assign the worlds with a single such event, and otherwise low K-complexity, not that much lower importance than the worlds with actual low K-complexity.
• And if I imagine that the confused-seeming notions of "really physically exists" and "actually experienced" do have some objective meaning independent of my preferences, then I care much more about the difference between "I get to 'actually experience' a tomorrow" and "I 'really physically' get hit by a car today" than I care about the difference between the world with true low K-complexity and the worlds with a single disorderly event.

In other words, I agree that on the priors I put into my UDT calculation, it's much more likely that we live in measureless Tegmark IV; but my confidence in this isn't extreme, and if we don't, then the difference between "exists" and "doesn't exist" (or "is experienced a lot" and "is experienced only infinitesimally") is very important; much more important than the difference between "simple world" and "simple world plus one disorderly event" according to my preferences if we do live in a Tegmark IV universe. If I act optimally according to the Tegmark IV hypothesis in the latter worlds, that still gives me most of the utility that acting optimally in the truly simple worlds would give me -- or, more precisely, the utility differential isn't nearly as large as if there is something else going on, and I should be doing something about it, and I'm not.

This is the reason why I'm trying to think seriously about things like l-zombies and magical reality fluid. I mean, I don't even think that these are particularly likely to be exactly right even if the measureless Tegmark IV hypothesis is wrong; I expect that there would be some new insight that makes even more sense than Tegmark IV, and makes all the confusion go away. But trying to grapple with the confused intuitions we currently have seems at least a possible way to make progress on this, if it should be the case that there is in fact progress to be made.

*

Here's one avenue of investigation that seems worthwhile to me, and wouldn't without the above argument. One thing I could imagine finding, that could make the confusion go away, would be that the intuitive notion of "all possible Turing machines" is just wrong, and leads to outright contradictions (e.g., to inconsistencies in Peano Arithmetic, or something similarly convincing). Lots of people have entertained the idea that concepts like the real numbers don't "really" exist, and only the behavior of computable functions is "real"; perhaps not even that is real, and true reality is more restricted? (You can reinterpret many results about real numbers as results about computable functions, so maybe you could reinterpret results about computable functions as results about these hypothetical weaker objects that would actually make mathematical sense.) So it wouldn't be the case after all that there is some Turing machine that computes the conscious experiences you would have if pumpkins started falling from the sky.

Does the above make sense? Probably not. But I'd say that there's a small chance that maybe yes, and that if we understood the right kind of math, it would seem very obvious that not all intuitively possible human experiences are actually mathematically possible (just as obvious as it is today, with hindsight, that there is no Turing machine which takes a program as input and outputs whether this program halts). Moreover, it seems plausible that this could have consequences for how we should act. This, together with my argument above, make me think that this sort of thing is worth investigating -- even if my priors are heavily on the side of expecting that all experiences exist to the same degree, and ordinarily this difference in probabilities would make me think that our time would be better spent on investigating other, more likely hypotheses.

*

Leaving aside the question of how I should act, though, does all of this mean that I should believe that I live in a universe with l-zombies and magical reality fluid, until such time as I hear that voice speaking to me?

I do feel tempted to try to invoke my argument from the l-zombies post that I prefer the map-territory correspondences of actually existing humans to be correct, and don't care about whether l-zombies have their map match up with the territory. But I'm not sure that I care much more about actually existing humans being correct, if the measureless mathematical multiverse hypothesis is wrong, than I care about humans in simple worlds being correct, if that hypothesis is right. So I think that the right thing to do may be to have a subjective belief that I most likely do live in the measureless Tegmark IV, as long as that's the view that seems by far the least confused -- but continue to spend resources on investigating alternatives, because on priors they don't seem sufficiently unlikely to make up for the potential great importance of getting this right.

## L-zombies! (L-zombies?)

22 07 February 2014 06:30PM

Reply to: Benja2010's Self-modification is the correct justification for updateless decision theory; Wei Dai's Late great filter is not bad news

"P-zombie" is short for "philosophical zombie", but here I'm going to re-interpret it as standing for "physical philosophical zombie", and contrast it to what I call an "l-zombie", for "logical philosophical zombie".

A p-zombie is an ordinary human body with an ordinary human brain that does all the usual things that human brains do, such as the things that cause us to move our mouths and say "I think, therefore I am", but that isn't conscious. (The usual consensus on LW is that p-zombies can't exist, but some philosophers disagree.) The notion of p-zombie accepts that human behavior is produced by physical, computable processes, but imagines that these physical processes don't produce conscious experience without some additional epiphenomenal factor.

An l-zombie is a human being that could have existed, but doesn't: a Turing machine which, if anybody ever ran it, would compute that human's thought processes (and its interactions with a simulated environment); that would, if anybody ever ran it, compute the human saying "I think, therefore I am"; but that never gets run, and therefore isn't conscious. (If it's conscious anyway, it's not an l-zombie by this definition.) The notion of l-zombie accepts that human behavior is produced by computable processes, but supposes that these computational processes don't produce conscious experience without being physically instantiated.

Actually, there probably aren't any l-zombies: The way the evidence is pointing, it seems like we probably live in a spatially infinite universe where every physically possible human brain is instantiated somewhere, although some are instantiated less frequently than others; and if that's not true, there are the "bubble universes" arising from cosmological inflation, the branches of many-worlds quantum mechanics, and Tegmark's "level IV" multiverse of all mathematical structures, all suggesting again that all possible human brains are in fact instantiated. But (a) I don't think that even with all that evidence, we can be overwhelmingly certain that all brains are instantiated; and, more importantly actually, (b) I think that thinking about l-zombies can yield some useful insights into how to think about worlds where all humans exist, but some of them have more measure ("magical reality fluid") than others.

So I ask: Suppose that we do indeed live in a world with l-zombies, where only some of all mathematically possible humans exist physically, and only those that do have conscious experiences. How should someone living in such a world reason about their experiences, and how should they make decisions — keeping in mind that if they were an l-zombie, they would still say "I have conscious experiences, so clearly I can't be an l-zombie"?

If we can't update on our experiences to conclude that someone having these experiences must exist in the physical world, then we must of course conclude that we are almost certainly l-zombies: After all, if the physical universe isn't combinatorially large, the vast majority of mathematically possible conscious human experiences are not instantiated. You might argue that the universe you live in seems to run on relatively simple physical rules, so it should have high prior probability; but we haven't really figured out the exact rules of our universe, and although what we understand seems compatible with the hypothesis that there are simple underlying rules, that's not really proof that there are such underlying rules, if "the real universe has simple rules, but we are l-zombies living in some random simulation with a hodgepodge of rules (that isn't actually ran)" has the same prior probability; and worse, if you don't have all we do know about these rules loaded into your brain right now, you can't really verify that they make sense, since there is some mathematically possible simulation whose initial state has you remember seeing evidence that such simple rules exist, even if they don't; and much worse still, even if there are such simple rules, what evidence do you have that if these rules were actually executed, they would produce you? Only the fact that you, like, exist, but we're asking what happens if we don't let you update on that.

I find myself quite unwilling to accept this conclusion that I shouldn't update, in the world we're talking about. I mean, I actually have conscious experiences. I, like, feel them and stuff! Yes, true, my slightly altered alter ego would reason the same way, and it would be wrong; but I'm right...

...and that actually seems to offer a way out of the conundrum: Suppose that I decide to update on my experience. Then so will my alter ego, the l-zombie. This leads to a lot of l-zombies concluding "I think, therefore I am", and being wrong, and a lot of actual people concluding "I think, therefore I am", and being right. All the thoughts that are actually consciously experienced are, in fact, correct. This doesn't seem like such a terrible outcome. Therefore, I'm willing to provisionally endorse the reasoning "I think, therefore I am", and to endorse updating on the fact that I have conscious experiences to draw inferences about physical reality — taking into account the simulation argument, of course, and conditioning on living in a small universe, which is all I'm discussing in this post.

NB. There's still something quite uncomfortable about the idea that all of my behavior, including the fact that I say "I think therefore I am", is explained by the mathematical process, but actually being conscious requires some extra magical reality fluid. So I still feel confused, and using the word l-zombie in analogy to p-zombie is a way of highlighting that. But this line of reasoning still feels like progress. FWIW.

But if that's how we justify believing that we physically exist, that has some implications for how we should decide what to do. The argument is that nothing very bad happens if the l-zombies wrongly conclude that they actually exist. Mostly, that also seems to be true if they act on that belief: mostly, what l-zombies do doesn't seem to influence what happens in the real world, so if only things that actually happen are morally important, it doesn't seem to matter what the l-zombies decide to do. But there are exceptions.

Consider the counterfactual mugging: Accurate and trustworthy Omega appears to you and explains that it just has thrown a very biased coin that had only a 1/1000 chance of landing heads. As it turns out, this coin has in fact landed heads, and now Omega is offering you a choice: It can either (A) create a Friendly AI or (B) destroy humanity. Which would you like? There is a catch, though: Before it threw the coin, Omega made a prediction about what you would do if the coin fell heads (and it was able to make a confident prediction about what you would choose). If the coin had fallen tails, it would have created an FAI if it has predicted that you'd choose (B), and it would have destroyed humanity if it has predicted that you would choose (A). (If it hadn't been able to make a confident prediction about what you would choose, it would just have destroyed humanity outright.)

There is a clear argument that, if you expect to find yourself in a situation like this in the future, you would want to self-modify into somebody who would choose (B), since this gives humanity a much larger chance of survival. Thus, a decision theory stable under self-modification would answer (B). But if you update on the fact that you consciously experience Omega telling you that the coin landed heads, (A) would seem to be the better choice!

One way of looking at this is that if the coin falls tails, the l-zombie that is told the coin landed heads still exists mathematically, and this l-zombie now has the power to influence what happens in the real world. If the argument for updating was that nothing bad happens even though the l-zombies get it wrong, well, that argument breaks here. The mathematical process that is your mind doesn't have any evidence about whether the coin landed heads or tails, because as a mathematical object it exists in both possible worlds, and it has to make a decision in both worlds, and that decision affects humanity's future in both worlds.

Back in 2010, I wrote a post arguing that yes, you would want to self-modify into something that would choose (B), but that that was the only reason why you'd want to choose (B). Here's a variation on the above scenario that illustrates the point I was trying to make back then: Suppose that Omega tells you that it actually threw its coin a million years ago, and if it had fallen tails, it would have turned Alpha Centauri purple. Now throughout your history, the argument goes, you would never have had any motive to self-modify into something that chooses (B) in this particular scenario, because you've always known that Alpha Centauri isn't, in fact, purple.

But this argument assumes that you know you're not a l-zombie; if the coin had in fact fallen tails, you wouldn't exist as a conscious being, but you'd still exist as a mathematical decision-making process, and that process would be able to influence the real world, so you-the-decision-process can't reason that "I think, therefore I am, therefore the coin must have fallen heads, therefore I should choose (A)." Partly because of this, I now accept choosing (B) as the (most likely to be) correct choice even in that case. (The rest of my change in opinion has to do with all ways of making my earlier intuition formal getting into trouble in decision problems where you can influence whether you're brought into existence, but that's a topic for another post.)

However, should you feel cheerful while you're announcing your choice of (B), since with high (prior) probability, you've just saved humanity? That would lead to an actual conscious being feeling cheerful if the coin has landed heads and humanity is going to be destroyed, and an l-zombie computing, but not actually experiencing, cheerfulness if the coin has landed heads and humanity is going to be saved. Nothing good comes out of feeling cheerful, not even alignment of a conscious' being's map with the physical territory. So I think the correct thing is to choose (B), and to be deeply sad about it.

You may be asking why I should care what the right probabilities to assign or the right feelings to have are, since these don't seem to play any role in making decisions; sometimes you make your decisions as if updating on your conscious experience, but sometimes you don't, and you always get the right answer if you don't update in the first place. Indeed, I expect that the "correct" design for an AI is to fundamentally use (more precisely: approximate) updateless decision theory (though I also expect that probabilities updated on the AI's sensory input will be useful for many intermediate computations), and "I compute, therefore I am"-style reasoning will play no fundamental role in the AI. And I think the same is true for humans' decisions — the correct way to act is given by updateless reasoning. But as a human, I find myself unsatisfied by not being able to have a picture of what the physical world probably looks like. I may not need one to figure out how I should act; I still want one, not for instrumental reasons, but because I want one. In a small universe where most mathematically possible humans are l-zombies, the argument in this post seems to give me a justification to say "I think, therefore I am, therefore probably I either live in a simulation or what I've learned about the laws of physics describes how the real world works (even though there are many l-zombies who are thinking similar thoughts but are wrong about them)."

And because of this, even though I disagree with my 2010 post, I also still disagree with Wei Dai's 2010 post arguing that a late Great Filter is good news, which my own 2010 post was trying to argue against. Wei argued that if Omega gave you a choice between (A) destroying the world now and (B) having Omega destroy the world a million years ago (so that you are never instantiated as a conscious being, though your choice as an l-zombie still influences the real world), then you would choose (A), to give humanity at least the time it's had so far. Wei concluded that this means that if you learned that the Great Filter is in our future, rather than our past, that must be good news, since if you could choose where to place the filter, you should place it in the future. I now agree with Wei that (A) is the right choice, but I don't think that you should be happy about it. And similarly, I don't think you should be happy about news that tells you that the Great Filter is later than you might have expected.

10 [deleted] 21 July 2013 11:26AM

A post by Nick Land who some of you are probably already following either on his blog Outside In or at Urban Future.

There is a ‘problem’ that has been nagging at me for a long time – which is that there hasn’t been a long time. It’s Saturday, with no one around, or getting drunk, or something, so I’ll run it past you. Cosmology seems oddly childish.

An analogy might help. Among all the reasons for super-sophisticated atheistic materialists to deride Abrahamic creationists, the most arithmetically impressive is the whole James Ussher 4004 BC thing. The argument is familiar to everyone: 6,027 years — Ha!

Creationism is a topic for another time. The point for now is just: 13.7 billion years – Ha! Perhaps this cosmological consensus estimate for the age of the universe is true. I’m certainly not going to pit my carefully-rationed expertise in cosmo-physics against it. But it’s a stupidly short amount of time. If this is reality, the joke’s on us. Between Ussher’s mid-17th century estimate and (say) Hawking’s late 20th century one, the difference is just six orders of magnitude. It’s scarcely worth getting out of bed for. Or the crib.

For anyone steeped in Hindu Cosmology – which locates us 1.56 x 10^14 years into the current Age of Brahma – or Lovecraftian metaphysics, with its vaguer but abysmally extended eons, the quantity of elapsed cosmic time, according to the common understanding of our present scientific establishment, is cause for claustrophobia. Looking backward, we are sealed in a small room, with the wall of the original singularity pressed right up against us. (Looking forward, things are quite different, and we will get to that.)

There are at least three ways in which the bizarre youthfulness of the universe might be imagined:

1. Consider first the disconcerting lack of proportion between space and time. The universe contains roughly 100 billion galaxies, each a swirl of 100 billion stars. That makes Sol one of 10^22 stars in the cosmos, but it has lasted for something like a third of the life of the universe. Decompose the solar system and the discrepancy only becomes more extreme. The sun accounts for 99.86% of the system’s mass, and the gas giants incorporate 99% of the remainder, yet the age of the earth is only fractionally less than that of the sun. Earth is a cosmic time hog. In space it is next to nothing, but in time it extends back through a substantial proportion of the Stelliferous Era, so close to the origin of the universe that it is belongs to the very earliest generations of planetary bodies. Beyond it stretch incomprehensible immensities, but before it there is next to nothing.

2. Compared to the intensity of time (backward) extension is of vanishing insignificance. The unit of Planck time – corresponding to the passage of a photon across a Planck length — is about 5.4 x 10^-44 seconds. If there is a true instant, that is it. A year consists of less the 3.2 x 10^7 seconds, so cosmological consensus estimates that there have been approximately 432 339 120 000 000 000 seconds since the Big Bang, which for our purposes can be satisfactorily rounded to 4.3 x 10^17. The difference between a second and the age of the universe is smaller that that between a second and a Planck Time tick by nearly 27 orders of magnitude. In other words, if a Planck Time-sensitive questioner asked “When did the Big Bang happen?” and you answered “Just now” — in clock time — you’d be almost exactly right. If you had been asked to identify a particular star from among the entire stellar population of the universe, and you picked it out correctly, your accuracy would still be hazier by 5 orders of magnitude. Quite obviously, there haven’t been enough seconds since the Big Bang to add up to a serious number – less than one for every 10,000 stars in the universe.

3. Isotropy gets violated by time orientation like a Detroit muni-bond investor. In a universe dominated by dark energy – like ours – expansion lasts forever. The Stelliferous Era is predicted to last for roughly 100 trillion years, which is over 7,000 times the present age of the universe. Even the most pessimistic interpretation of the Anthropic Principle, therefore, places us only a fractional distance from the beginning of time. The Degenerate Era, post-dating star-formation, then extends out to 10^40 years, by the end of which time all baryonic matter will have decayed, and even the most radically advanced forms of cosmic intelligence will have found existence becoming seriously challenging. Black holes then dominate out to 10^60 years, after which the Dark Era begins, lasting a long time. (Decimal exponents become unwieldy for these magnitudes, making more elaborate modes of arithmetical notation expedient. We need not pursue it further.) The take-away: the principle of Isotropy holds that we should not find ourselves anywhere special in the universe, and yet we do – right at the beginning. More implausibly still, we are located at the very beginning of an infinity (although anthropic selection might crop this down to merely preposterous improbability).

Intuitively, this is all horribly wrong, although intuitions have no credible authority, and certainly provide no grounds for contesting rigorously assembled scientific narratives.  Possibly — I should concede most probably — time is simply ridiculous, not to say profoundly insulting. We find ourselves glued to the very edge of the Big Bang, as close to neo-natal as it is arithmetically possible to be.

That’s odd, isn’t it?

## Caught in the glare of two anthropic shadows

17 04 July 2013 07:54PM

This article consists of original new research, so would not get published on Wikipedia!

The previous post introduced the concept of the anthropic shadow: the fact that certain large and devastating disasters cannot be observed in the historical record, because if they had happened, we wouldn't be around to observe them. This absence forms an “anthropic shadow”.

But that was the result for a single category of disasters. What would happen if we consider two independent classes of disasters? Would we see a double shadow, or would one ‘overshadow’ the other?

To answer that question, we’re going to have to analyse the anthropic shadow in more detail, and see that there are two separate components to it:

• The first is the standard effect: humanity cannot have developed a technological civilization, if there were large catastrophes in the recent past.
• The second effect is the lineage effect: humanity cannot have developed a technological civilization, if there was another technological civilization in the recent past that survived to today (or at least, we couldn't have developed the way we did).

To illustrate the difference between the two, consider the following model. Segment time into arbitrarily “eras”. In a given era, a large disaster may hit with probability q, or a small disaster may independently hit with probability q (hence with probability q2, there will be both a large and a small disaster). A small disaster will prevent a technological civilization from developing during that era; a large one will prevent such a civilization from developing in that era or the next one.

If it is possible for a technological civilization to develop (no small disasters that era, no large ones in the preceding era, and no previous civilization), then one will do so with probability p. We will assume p constant: our model will only span a time frame where p is unchanging (maybe it's over the time period after the rise of big mammals?)

10 04 July 2013 07:52PM

From a paper by Milan M. Ćirković, Anders Sandberg, and Nick Bostrom:

We describe a signiﬁcant practical consequence of taking anthropic biases into account in deriving predictions for rare stochastic catastrophic events. The risks associated with catastrophes such as asteroidal/cometary impacts, supervolcanic episodes, and explosions of supernovae/gamma-ray bursts are based on their observed frequencies. As a result, the frequencies of catastrophes that destroy or are otherwise incompatible with the existence of observers are systematically underestimated. We describe the consequences of this anthropic bias for estimation of catastrophic risks, and suggest some directions for future work.

There cannot have been a large disaster on Earth in the last millennia, or we wouldn't be around to see it. There can't have been a very large disaster on Earth in the last ten thousand years, or we wouldn't be around to see it. There can't have been a huge disaster on Earth in the last million years, or we wouldn't be around to see it. There can't have been a planet-destroying disaster on Earth... ever.

Thus the fact that we exist precludes us seeing certain types of disasters in the historical record; as we get closer and closer to the present day, the magnitude of the disasters we can see goes down. These missing disasters form the "anthropic shadow", somewhat visible in the top right of this diagram:

Hence even though it looks like the risk is going down (the magnitude is diminishing as we approach the present), we can't rely on this being true: it could be a purely anthropic effect.

## UFAI cannot be the Great Filter

37 22 December 2012 11:26AM

[Summary: The fact we do not observe (and have not been wiped out by) an UFAI suggests the main component of the 'great filter' cannot be civilizations like ours being wiped out by UFAI. Gentle introduction (assuming no knowledge) and links to much better discussion below.]

### Introduction

The Great Filter is the idea that although there is lots of matter, we observe no "expanding, lasting life", like space-faring intelligences. So there is some filter through which almost all matter gets stuck before becoming expanding, lasting life. One question for those interested in the future of humankind is whether we have already 'passed' the bulk of the filter, or does it still lie ahead? For example, is it very unlikely matter will be able to form self-replicating units, but once it clears that hurdle becoming intelligent and going across the stars is highly likely; or is it getting to a humankind level of development is not that unlikely, but very few of those civilizations progress to expanding across the stars. If the latter, that motivates a concern for working out what the forthcoming filter(s) are, and trying to get past them.

One concern is that advancing technology gives the possibility of civilizations wiping themselves out, and it is this that is the main component of the Great Filter - one we are going to be approaching soon. There are several candidates for which technology will be an existential threat (nanotechnology/'Grey goo', nuclear holocaust, runaway climate change), but one that looms large is Artificial intelligence (AI), and trying to understand and mitigate the existential threat from AI is the main role of the Singularity Institute, and I guess Luke, Eliezer (and lots of folks on LW) consider AI the main existential threat.

The concern with AI is something like this:

1. AI will soon greatly surpass us in intelligence in all domains.
2. If this happens, AI will rapidly supplant humans as the dominant force on planet earth.
3. Almost all AIs, even ones we create with the intent to be benevolent, will probably be unfriendly to human flourishing.

Or, as summarized by Luke:

... AI leads to intelligence explosion, and, because we don’t know how to give an AI benevolent goals, by default an intelligence explosion will optimize the world for accidentally disastrous ends. A controlled intelligence explosion, on the other hand, could optimize the world for good. (More on this option in the next post.)

So, the aim of the game needs to be trying to work out how to control the future intelligence explosion so the vastly smarter-than-human AIs are 'friendly' (FAI) and make the world better for us, rather than unfriendly AIs (UFAI) which end up optimizing the world for something that sucks.

### 'Where is everybody?'

So, topic. I read this post by Robin Hanson which had a really good parenthetical remark (emphasis mine):

Yes, it is possible that the extremely difficultly was life’s origin, or some early step, so that, other than here on Earth, all life in the universe is stuck before this early extremely hard step. But even if you find this the most likely outcome, surely given our ignorance you must also place a non-trivial probability on other possibilities. You must see a great filter as lying between initial planets and expanding civilizations, and wonder how far along that filter we are. In particular, you must estimate a substantial chance of “disaster”, i.e., something destroying our ability or inclination to make a visible use of the vast resources we see. (And this disaster can’t be an unfriendly super-AI, because that should be visible.)

This made me realize an UFAI should also be counted as an 'expanding lasting life', and should be deemed unlikely by the Great Filter.

Another way of looking at it: if the Great Filter still lies ahead of us, and a major component of this forthcoming filter is the threat from UFAI, we should expect to see the UFAIs of other civilizations spreading across the universe (or not see anything at all, because they would wipe us out to optimize for their unfriendly ends). That we do not observe it disconfirms this conjunction.

[Edit/Elaboration: It also gives a stronger argument - as the UFAI is the 'expanding life' we do not see, the beliefs, 'the Great Filter lies ahead' and 'UFAI is a major existential risk' lie opposed to one another: the higher your credence in the filter being ahead, the lower your credence should be in UFAI being a major existential risk (as the many civilizations like ours that go on to get caught in the filter do not produce expanding UFAIs, so expanding UFAI cannot be the main x-risk); conversely, if you are confident that UFAI is the main existential risk, then you should think the bulk of the filter is behind us (as we don't see any UFAIs, there cannot be many civilizations like ours in the first place, as we are quite likely to realize an expanding UFAI).]

A much more in-depth article and comments (both highly recommended) was made by Katja Grace a couple of years ago. I can't seem to find a similar discussion on here (feel free to downvote and link in the comments if I missed it), which surprises me: I'm not bright enough to figure out the anthropics, and obviously one may hold AI to be a big deal for other-than-Great-Filter reasons (maybe a given planet has a 1 in a googol chance of getting to intelligent life, but intelligent life 'merely' has a 1 in 10 chance of successfully navigating an intelligence explosion), but this would seem to be substantial evidence driving down the proportion of x-risk we should attribute to AI.

What do you guys think?

## Why (anthropic) probability isn't enough

19 13 December 2012 04:09PM

A technical report of the Future of Humanity Institute (authored by me), on why anthropic probability isn't enough to reach decisions in anthropic situations. You also have to choose your decision theory, and take into account your altruism towards your copies. And these components can co-vary while leaving your ultimate decision the same - typically, EDT agents using SSA will reach the same decisions as CDT agents using SIA, and altruistic causal agents may decide the same way as selfish evidential agents.

## Anthropics: why probability isn't enough

This paper argues that the current treatment of anthropic and self-locating problems over-emphasises the importance of anthropic probabilities, and ignores other relevant and important factors, such as whether the various copies of the agents in question consider that they are acting in a linked fashion and whether they are mutually altruistic towards each other. These issues, generally irrelevant for non-anthropic problems, come to the forefront in anthropic situations and are at least as important as the anthropic probabilities: indeed they can erase the difference between different theories of anthropic probability, or increase their divergence. These help to reinterpret the decisions, rather than probabilities, as the fundamental objects of interest in anthropic problems.

## SIA, conditional probability and Jaan Tallinn's simulation tree

11 12 November 2012 05:24PM

If you're going to use anthropic probability, use the self indication assumption (SIA) - it's by far the most sensible way of doing things.

Now, I am of the strong belief that probabilities in anthropic problems (such as the Sleeping Beauty problem) are not meaningful - only your decisions matter. And you can have different probability theories but still always reach the decisions if you have different theories as to who bears the responsibility of the actions of your copies, or how much you value them - see anthropic decision theory (ADT).

But that's a minority position - most people still use anthropic probabilities, so it's worth taking a more through look at what SIA does and doesn't tell you about population sizes and conditional probability.

This post will aim to clarify some issues with SIA, especially concerning Jaan Tallinn's simulation-tree model which he presented in exquisite story format at the recent singularity summit. I'll be assuming basic familiarity with SIA, and will run away screaming from any questions concerning infinity. SIA fears infinity (in a shameless self plug, I'll mention that anthropic decision theory runs into far less problems with infinities; for instance a bounded utility function is a sufficient - but not necessary - condition to ensure that ADT give you sensible answers even with infinitely many copies).

But onwards and upwards with SIA! To not-quite-infinity and below!

## SIA does not (directly) predict large populations

One error people often make with SIA is to assume that it predicts a large population. It doesn't - at least not directly. What SIA predicts is that there will be a large number of agents that are subjectively indistinguishable from you. You can call these subjectively indistinguishable agents the "minimal reference class" - it is a great advantage of SIA that it will continue to make sense for any reference class you choose (as long as it contains the minimal reference class).

The SIA's impact on the total population is indirect: if the size of the total population is correlated with that of the minimal reference class, SIA will predict a large population. A correlation is not implausible: for instance, if there are a lot of humans around, then the probability that one of them is you is much larger. If there are a lot of intelligent life forms around, then the chance that humans exist is higher, and so on.

In most cases, we don't run into problems with assuming that SIA predicts large populations. But we have to bear in mind that the effect is indirect, and the effect can and does break down in many cases. For instance imagine that you knew you had evolved on some planet, but for some odd reason, didn't know whether your planet had a ring system or not. You have managed to figure out that the evolution of life on planets with ring systems is independent of the evolution of life on planets without. Since you don't know which situation you're in, SIA instructs you to increase the probability of life on ringed and on non-ringed planets (so far, so good - SIA is predicting generally larger populations).

And then one day you look up at the sky and see:

## SIA fears (expected) infinity

6 12 November 2012 05:23PM

It's well known that the Self-Indication Assumption (SIA) has problems with infinite populations (one of the reasons I strongly recommend not using the probability as the fundamental object of interest, but instead the decision, as in anthropic decision theory).

SIA also has problems with arbitrarily large finite populations, at least in some cases. What cases are these? Imagine that we had these (non-anthropic) probabilities for various populations:

p0, p1, p2, p3, p4...

Now let us apply the anthropic correction from SIA; before renormalising, we have these weights for different population levels:

0, p1, 2p2, 3p3, 4p4...

To renormalise, we need to divide by the sum 0 + p1 + 2p2 + 3p3 + 4p4... This is actually the expected population! (note: we are using the population as a proxy for the size of the reference class of agents who are subjectively indistinguishable from us; see this post for more details)

So using SIA is possible if and only if the (non-anthropic) expected population is finite (and non-zero).

Note that it is possible for the anthropic expected population to be infinite! For instance if pj is C/j3, for some constant C, then the non-anthropic expected population is finite (being the infinite sum of C/j2). However once we have done the SIA correction, we can see that the SIA-corrected expected population is infinite (being the infinite sum of some constant times 1/j).

## The REAL SIA doomsday

8 07 September 2012 11:11AM

Many thanks to Paul Almond for developing the initial form of this argument.

My previous post was somewhat confusing and potentially misleading (and the idea hadn't fully gelled in my mind). But here is a much easier way of seeing what the SIA doomsday really is.

Imagine if your parents had rolled a dice to decide how many children to have. Knowing only this, SIA implies that the dice was more likely to have been a "6" that a "1" (because there is a higher chance of you existing in that case). But, now following the family tradition, you decide to roll a dice for your children. SIA now has no impact: the dice is equally likely to be any number. So SIA predicts high numbers in the past, and no preferences for the future.

This can be generalised into an SIA "doomsday":

• Everything else being equal, SIA implies that the population growth rate in your past is likely to be higher than the rate in the future; i.e. it predicts an observed decline, not in population, but in population growth rates.

## Papers framing anthropic questions as decision problems?

3 26 April 2012 12:40AM

A few weeks ago at a Seattle LW meetup, we were discussing the Sleeping Beauty problem and the Doomsday argument. We talked about how framing Sleeping Beauty problem as a decision problem basically solves it and then got the idea of using same heuristic on the Doomsday problem. I think you would need to specify more about the Doomsday setup than is usually done to do this.

We didn't spend a lot of time on it, but it got me thinking: Are there papers on trying to gain insight into the Doomsday problem and other anthropic reasoning problems by framing them as decision problems? I'm surprised I haven't seen this approach talked about here before. The idea seems relatively simple, so perhaps there is some major problem that I'm not seeing.

## Self-locating beliefs across identity fission

1 03 March 2012 03:14AM

Fred’s home planet, Sunday, is surrounded by two moons, Monday and Tuesday. Tonight, while Fred is asleep, his body will be scanned and destroyed; then a signal is sent to both Monday and Tuesday where he will be recreated from local matter.

A lot of ink has been spent on how to describe scenarios like this. Should we say that Fred will find himself both on Monday and on Tuesday? Which of the persons awakening on the two moons is identical to the person going to sleep on Sunday? In this paper, I want to look at a different question: what should Fred’s successors believe when they awaken on Monday and on Tuesday? More precisely, how should their beliefs be related to Fred’s beliefs before he went to sleep on Sunday?

Also see: Sleeping Beauty problem.

## Anthropic Decision Theory VI: Applying ADT to common anthropic problems

3 06 November 2011 11:50AM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this and previous posts 1 2 3 4 5 6.

Having presented ADT previously, I'll round off this mini-sequence by showing how it behaves with common anthropic problems, such as the Presumptuous Philosopher, Adam and Eve problem, and the Doomsday argument.

## The Presumptuous Philosopher

The Presumptuous Philosopher was introduced by Nick Bostrom as a way of pointing out the absurdities in SIA. In the setup, the universe either has a trillion observers, or a trillion trillion trillion observers, and physics is indifferent as to which one is correct. Some physicists are preparing to do an experiment to determine the correct universe, until a presumptuous philosopher runs up to them, claiming that his SIA probability makes the larger one nearly certainly the correct one. In fact, he will accept bets at a trillion trillion to one odds that he is in the larger universe, repeatedly defying even strong experimental evidence with his SIA probability correction.

What does ADT have to say about this problem? Implicitly, when the problem is discussed, the philosopher is understood to be selfish towards any putative other copies of himself (similarly, Sleeping Beauty is often implicitly assumed to be selfless, which may explain the diverge of intuitions that people have on the two problems). Are there necessarily other similar copies? Well, in order to use SIA, the philosopher must believe that there is nothing blocking the creation of presumptuous philosophers in the larger universe; for if there was, the odds would shift away from the larger universe (in the extreme case when only one presumptuous philosopher is allowed in any universe, SIA finds them equi-probable). So the expected number of presumptuous philosophers in the larger universe is a trillion trillion times greater than the expected number in the small universe.

1 05 November 2011 01:31PM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this, subsequent, and previous posts 1 2 3 4 5 6.

Now that we've seen what the 'correct' decision is for various Sleeping Beauty Problems, let's see a decision theory that reaches the same conclusions.

Identical copies of Sleeping Beauty will make the same decision when faced with same situations (technically true until quantum and chaotic effects cause a divergence between them, but most decision processes will not be sensitive to random noise like this). Similarly, Sleeping Beauty and the random man on the street will make the same decision when confronted with a twenty pound note: they will pick it up. However, while we could say that the first situation is linked, the second is coincidental: were Sleeping Beauty to refrain from picking up the note, the man on the street would not so refrain, while her copy would.

The above statement brings up subtle issues of causality and counterfactuals, a deep philosophical debate. To sidestep it entirely, let us recast the problem in programming terms, seeing the agent's decision process as a deterministic algorithm. If agent α is an agent that follows an automated decision algorithm A, then if A knows its own source code (by quining for instance), it might have a line saying something like:

Module M: If B is another algorithm, belonging to agent β, identical with A ('yourself'), assume A and B will have identical outputs on identical inputs, and base your decision on this.

## Anthropic Decision Theory IV: Solving Selfish and Average-Utilitarian Sleeping Beauty

0 04 November 2011 10:55AM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this, subsequent, and previous posts 1 2 3 4 5 6.

In the previous post, I looked at a decision problem when Sleeping Beauty was selfless or a (copy-)total utilitarian. Her behaviour was reminiscent of someone following SIA-type odds. Here I'll look at situations where her behaviour is SSA-like.

## Altruistic average utilitarian Sleeping Beauty

In the incubator variant, consider the reasoning of an Outside/Total agent who is an average utilitarian (and there are no other agents in the universe apart from the Sleeping Beauties).

"If the various Sleeping Beauties decide to pay £x for the coupon, they will make -£x in the heads world. In the tails world, they will each make £(1-x) each, so an average of £(1-x). This give me an expected utility of £0.5(-x+(1-x))= £(0.5-x), so I would want them to buy the coupon for any price less than £0.5."

And this will then be the behaviour the agents will follow, by consistency. Thus they would be behaving as if they were following SSA odds, and putting equal probability on the heads versus tails world.

## Anthropic Decision Theory III: Solving Selfless and Total Utilitarian Sleeping Beauty

3 03 November 2011 10:04AM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this, subsequent, and previous posts 1 2 3 4 5 6.

## Consistency

In order to transform the Sleeping Beauty problem into a decision problem, assume that every time she is awoken, she is offered a coupon that pays out £1 if the coin fell tails. She must then decide at what cost she is willing to buy that coupon.

The very first axiom is that of temporal consistency. If your preferences are going to predictably change, then someone will be able to exploit this, by selling you something now that they will buy back for more later, or vice versa. This axiom is implicit in the independence axiom in the von Neumann-Morgenstern axioms of expected utility, where non-independent decisions show inconsistency after partially resolving one of the lotteries. For our purposes, we will define it as:

## Anthropic Decision Theory II: Self-Indication, Self-Sampling and decisions

6 02 November 2011 10:03AM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this, subsequent, and previous posts 1 2 3 4 5 6.

In the last post, we saw the Sleeping Beauty problem, and the question was what probability a recently awoken or created Sleeping Beauty should give to the coin falling heads or tails and it being Monday or Tuesday when she is awakened (or whether she is in Room 1 or 2). There are two main schools of thought on this, the Self-Sampling Assumption and the Self-Indication Assumption, both of which give different probabilities for these events.

## The Self-Sampling Assumption

The self-sampling assumption (SSA) relies on the insight that Sleeping Beauty, before being put to sleep on Sunday, expects that she will be awakened in future. Thus her awakening grants her no extra information, and she should continue to give the same credence to the coin flip being heads as she did before, namely 1/2.

In the case where the coin is tails, there will be two copies of Sleeping Beauty, one on Monday and one on Tuesday, and she will not be able to tell, upon awakening, which copy she is. She should assume that both are equally likely. This leads to SSA:

## Anthropic decision theory I: Sleeping beauty and selflessness

10 01 November 2011 11:41AM

A near-final version of my Anthropic Decision Theory paper is available on the arXiv. Since anthropics problems have been discussed quite a bit on this list, I'll be presenting its arguments and results in this and subsequent posts 1 2 3 4 5 6.

Many thanks to Nick Bostrom, Wei Dai, Anders Sandberg, Katja Grace, Carl Shulman, Toby Ord, Anna Salamon, Owen Cotton-barratt, and Eliezer Yudkowsky.

## The Sleeping Beauty problem, and the incubator variant

The Sleeping Beauty problem is a major one in anthropics, and my paper establishes anthropic decision theory (ADT) by a careful analysis it. Therefore we should start with an explanation of what it is.

In the standard setup, Sleeping Beauty is put to sleep on Sunday, and awoken again Monday morning, without being told what day it is. She is put to sleep again at the end of the day. A fair coin was tossed before the experiment began. If that coin showed heads, she is never reawakened. If the coin showed tails, she is fed a one-day amnesia potion (so that she does not remember being awake on Monday) and is reawakened on Tuesday, again without being told what day it is. At the end of Tuesday, she is put to sleep for ever. This is illustrated in the next figure:

## [LINK] How Hard is Artificial Intelligence? The Evolutionary Argument and Observation Selection Effects

14 29 August 2011 05:27AM

If you're interested in evolution, anthropics, and AI timelines -- or in what the Singularity Institute has been producing lately -- you might want to check out this new paper, by SingInst research fellow Carl Shulman and FHI professor Nick Bostrom.

The paper:

How Hard is Artificial Intelligence? The Evolutionary Argument and Observation Selection Effects

The abstract:

Several authors have made the argument that because blind evolutionary processes produced human intelligence on Earth, it should be feasible for clever human engineers to create human-level artificial intelligence in the not-too-distant future.  This evolutionary argument, however, has ignored the observation selection effect that guarantees that observers will see intelligent life having arisen on their planet no matter how hard it is for intelligent life to evolve on any given Earth-like planet.  We explore how the evolutionary argument might be salvaged from this objection, using a variety of considerations from observation selection theory and analysis of specific timing features and instances of convergent evolution in the terrestrial evolutionary record.  We find that a probabilistic version of the evolutionary argument emerges largely intact once appropriate corrections have been made.

I'd be interested to hear LW-ers' takes on the content; Carl, too, would much appreciate feedback.

View more: Next