A summary of Savage's foundations for probability and utility.
Edit: I think the P2c I wrote originally may have been a bit too weak; fixed that. Never mind, rechecking, that wasn't needed.
More edits (now consolidated): Edited nontriviality note. Edited totality note. Added in the definition of numerical probability in terms of qualitative probability (though not the proof that it works). Also slight clarifications on implications of P6' and P6''' on partitions into equivalent and almost-equivalent parts, respectively.
One very late edit, June 2: Even though we don't get countable additivity, we still want a σ-algebra rather than just an algebra (this is needed for some of the proofs in the "partition conditions" section that I don't go into here). Also noted nonemptiness of gambles.
The idea that rational agents act in a manner isomorphic to expected-utility maximizers is often used here, typically justified with the Von Neumann-Morgenstern theorem. (The last of Von Neumann and Morgenstern's axioms, the independence axiom, can be grounded in a Dutch book argument.) But the Von Neumann-Morgenstern theorem assumes that the agent already measures its beliefs with (finitely additive) probabilities. This in turn is often justified with Cox's theorem (valid so long as we assume a "large world", which is implied by e.g. the existence of a fair coin). But Cox's theorem assumes as an axiom that the plausibility of a statement is taken to be a real number, a very large assumption! I have also seen this justified here with Dutch book arguments, but these all seem to assume that we are already using some notion of expected utility maximization (which is not only somewhat circular, but also a considerably stronger assumption than that plausibilities are measured with real numbers).
There is a way of grounding both (finitely additive) probability and utility simultaneously, however, as detailed by Leonard Savage in his Foundations of Statistics (1954). In this article I will state the axioms and definitions he gives, give a summary of their logical structure, and suggest a slight modification (which is equivalent mathematically but slightly more philosophically satisfying). I would also like to ask the question: To what extent can these axioms be grounded in Dutch book arguments or other more basic principles? I warn the reader that I have not worked through all the proofs myself and I suggest simply finding a copy of the book if you want more detail.
Peter Fishburn later showed in Utility Theory for Decision Making (1970) that the axioms set forth here actually imply that utility is bounded.
(Note: The versions of the axioms and definitions in the end papers are formulated slightly differently from the ones in the text of the book, and in the 1954 version have an error. I'll be using the ones from the text, though in some cases I'll reformulate them slightly.)
Dead men tell tales: falling out of love with SIA
SIA is the Self Indication Assumption, an anthropic theory about how we should reason about the universe given that we exist. I used to love it; the argument that I've found most convincing about SIA was the one I presented in this post. Recently, I've been falling out of love with SIA, and moving more towards a UDT version of anthropics (objective probabilities and total impact of your decision being of a specific type, including in all copies of you and enemies with the same decision process). So it's time I revisit my old post, and find the hole.
The argument rested on the plausible sounding assumption that creating extra copies and killing them is no different from if they hadn't existed in the first place. More precisely, it rested on the assumption that if I was told "You are not one of the agents I am about to talk about. Extra copies were created to be destroyed," it was exactly the same as hearing "Extra copies were created to be destroyed. And you're not one of them."
But I realised that from the UDT/TDT perspective, there is a great difference between the two situations, if I have the time to update decisions in the course of the sentence. Consider the following three scenarios:
- Scenario 1 (SIA):
Two agents are created, then one is destroyed with 50% probability. Each living agent is entirely selfish, with utility linear in money, and the dead agent gets nothing. Every survivor will be presented with the same bet. Then you should take the SIA 2:1 odds that you are in the world with two agents. This is the scenario I was assuming.
- Scenario 2 (SSA):
Two agents are created, then one is destroyed with 50% probability. Each living agent is entirely selfish, with utility linear in money, and the dead agent is altruistic towards his survivor. This is similar to my initial intuition in this post. Note that all agents have the same utility function: "as long as I live, I care about myself, but after I die, I'll care about the other guy", so you can't distinguish them based on their utility. As before, every survivor will be presented with the same bet.
Here, once you have been told the scenario, but before knowing whether anyone has been killed, you should pre-commit to taking 1:1 odds that you are in the world with two agents. And in UDT/TDT precommitting is the same as making the decision.
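The two sets of odds above can be reproduced by counting in two different ways. The sketch below is only an illustration. It assumes that the 2:1 figure corresponds to weighting each world by its number of survivors, and that the 1:1 figure corresponds to weighting each world only by its probability; the function name is hypothetical.

```python
from fractions import Fraction

def two_agent_world_odds():
    """Return (per_survivor, per_world) probabilities of the two-agent world."""
    p_world = Fraction(1, 2)  # one agent is destroyed with 50% probability
    survivors = {"two_agents": 2, "one_agent": 1}

    # Weight each world by how many survivors are in it (assumed reading of 2:1).
    weight = {world: p_world * n for world, n in survivors.items()}
    per_survivor = weight["two_agents"] / sum(weight.values())

    # Weight each world only by its probability (assumed reading of 1:1).
    per_world = p_world
    return per_survivor, per_world

per_survivor, per_world = two_agent_world_odds()
print(per_survivor, per_world)
```

The first number is 2/3 (odds of 2:1) and the second is 1/2 (odds of 1:1), so the difference between the scenarios lies entirely in whose outcomes are being counted.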
Techniques for probability estimates
Utility maximization often requires determining a probability of a particular statement being true. But humans are not utility maximizers and often refuse to give precise numerical probabilities. Nevertheless, their actions reflect a "hidden" probability. For example, even someone who refused to give a precise probability for Barack Obama's re-election would probably jump at the chance to take a bet in which ey lost $5 if Obama wasn't re-elected but won $5 million if he was; such decisions demand that the decider covertly be working off of at least a vague probability.
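To make the "hidden" probability concrete, here is a minimal sketch of the break-even probability for the bet described above. It assumes utility linear in money, which the text does not state, and the function name is hypothetical.

```python
from fractions import Fraction

def break_even_probability(loss, gain):
    """Probability of winning at which the bet has zero expected value.

    Solves p * gain - (1 - p) * loss = 0 for p, assuming utility is
    linear in money (an assumption of this sketch).
    """
    return Fraction(loss, loss + gain)

p = break_even_probability(5, 5_000_000)
print(p, float(p))
```

Anyone who accepts the bet is acting as if their probability of re-election is at least this (very small) number, even if they refuse to state a probability.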
When untrained people try to translate vague feelings like "It seems Obama will probably be re-elected" into a precise numerical probability, they commonly fall into certain traps and pitfalls that make their probability estimates inaccurate. Calling a probability estimate "inaccurate" causes philosophical problems, but these problems can be resolved by remembering that probability is "subjectively objective" - that although a mind "hosts" a probability estimate, that mind does not arbitrarily determine the estimate, but rather calculates it according to mathematical laws from available evidence. These calculations require too much computational power to use outside the simplest hypothetical examples, but they provide a standard by which to judge real probability estimates. They also suggest tests by which one can judge probabilities as well-calibrated or poorly-calibrated: for example, a person who constantly assigns 90% confidence to eir guesses but only guesses the right answer half the time is poorly calibrated. So calling a probability estimate "accurate" or "inaccurate" has a real philosophical grounding.
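The calibration test mentioned above can be sketched as a simple table of stated confidence against observed hit rate. The data here is made up to match the example in the text (90% stated confidence, right half the time); the function name is hypothetical.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Map each stated confidence to the observed fraction of correct guesses.

    forecasts: iterable of (stated_confidence, was_correct) pairs.
    """
    buckets = defaultdict(list)
    for confidence, correct in forecasts:
        buckets[confidence].append(1 if correct else 0)
    return {c: sum(hits) / len(hits) for c, hits in sorted(buckets.items())}

# Illustrative data only: ten guesses at 90% confidence, five of them right.
overconfident = [(0.9, i % 2 == 0) for i in range(10)]
table = calibration_table(overconfident)
print(table)
```

A well-calibrated forecaster would show hit rates close to the stated confidence in every row; here the 0.9 row shows 0.5.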
There exist several techniques that help people translate vague feelings of probability into more accurate numerical estimates. Most of them translate probabilities from forms without immediate consequences (which the brain supposedly processes for signaling purposes) to forms with immediate consequences (which the brain supposedly processes while focusing on those consequences).
Iterated Sleeping Beauty and Copied Minds
Before I move on to a summation post listing the various raised thought experiments and paradoxes related to mind copying, I would like to draw attention to a particular point regarding the notion of "subjective probability".
In my earlier discussion post on the subjective experience of a forked person, I compared the scenario where one copy is awakened in the future to the Sleeping Beauty thought experiment. And really, it describes any such process, because there will inevitably be a time gap, however short, between the time of fork and the copy's subjective awakening: no copy mechanism can be instant.
In the traditional Sleeping Beauty scenario, there are two parties: Beauty and the Experimenter. The Experimenter has access to a sleep-inducing drug that also resets Beauty's memory to the state at t=0. Suppose Beauty is put to sleep at t=0, and then a fair coin is tossed. If the coin comes heads, Beauty is woken up at t=1, permanently. If the coin comes tails, Beauty is woken up at t=1, questioned, memory-wiped, and then woken up again at t=2, this time permanently.
In this experiment, intuitively, Beauty's subjective anticipation of the coin coming tails, without access to any information other than the conditions of the experiment, should be 2/3. I won't be arguing here whether this particular answer is right or wrong: the discussion has been raised many times before, and on Less Wrong as well. I'd like to point out one property of the experiment that differentiates it from other probability-related tasks: erasure of information, which renders the whole experiment a non-experiment.
In Bayesian theory, the (prior) probability of an outcome is the measure of our anticipation of it to the best of our knowledge. Bayesians think of experiments as a way to get new information, and update their probabilities based on the information gained. However, in the Sleeping Beauty experiment, Beauty gains no new information from waking up at any time, in any outcome. She has the exact same mind-state at any point of awakening that she had at t=0, and is for all intents and purposes the exact same person at any such point. As such, we can ask Beauty, "If we perform the experiment, what is your anticipation of waking up in the branch where the coin landed tails?", and she can give the same answer without actually performing the experiment.
So how does it map to the mind-copying problem? In a very straightforward way.
Let's modify the experiment this way: at t=0, Beauty's state is backed up. Let's suppose that she is then allowed to live her normal life, but the time-slices are large enough that she dies within the course of a single round. (Say, she has a normal human lifespan and the time between successive iterations is 200 years.) However, at t=1, a copy of Beauty is created in the state at which the original was at t=0, a coin is tossed, and if and only if it comes tails, another copy is created at t=2.
If Beauty knows the condition of this experiment, no matter what answer she would give in the classic formulation of the problem, I don't expect it to change here. The two formulations are, as far as I can see, equivalent.
However, in both cases, from the Experimenter's point of view, the branching points are independent events, which allows us to construct scenarios that question the straightforward interpretation of "subjective probability". And for this, I refer to the last experiment in my earlier post.
Imagine you have an indestructible machine that restores one copy of you from backup every 200 years. In this scenario, it seems you should anticipate waking up with equal probability between now and the end of time. But it's inconsistent with the formulation of probability for discrete outcomes: we end up with a diverging series, and as the length of the experiment approaches infinity (ignoring real-world cosmology for the moment), the subjective probability of every individual outcome (finding yourself at t=1, finding yourself at t=2, etc.) approaches 0. The equivalent classic formulation is a setup where the Experimenter is programmed to wake Beauty after every time-slice and unconditionally put her back to sleep.
This is not the only possible "diverging Sleeping Beauty" problem. Suppose that at t=1, Beauty is woken up permanently with probability 1/2 (like in the classic experiment), at t=2 with probability 1/3, at t=3 with probability 1/4, and so on; otherwise she is put back to sleep. The probability that she is still in the experiment after stage n is then 1/2 × 2/3 × ... × n/(n+1) = 1/(n+1). In this case, while it seems almost certain that she will eventually wake up permanently (in the same sense that it is "almost certain" that a fair random number generator will eventually output any given value), the expected time of her permanent awakening is still infinite, because the series 1/2 + 1/3 + 1/4 + ... diverges.
In the case of a converging series of probabilities of remaining asleep (for example, if it's decided by a coin toss at each iteration whether Beauty is put back to sleep, in which case the series is 1/2 + 1/4 + 1/8 + ... = 1), Beauty can give a subjective expected value: the average time at which she expects to be woken up permanently.
In the general case, let E_i be the event "the experiment continues at stage i" (that is, Beauty is not permanently awakened at stage i, or in the alternate formulation, more copies are created beyond that point). If we extrapolate the notion of "subjective probability" that leads us to the answer 2/3 in the classic formulation, then the definition is meaningful if and only if the series of objective probabilities ∑_{i=1}^{∞} P(E_i) converges. It doesn't have to converge to 1; otherwise we just need to renormalize the calculations. But given that the randomizing events are independent, nothing forces the series to converge.
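The convergence condition can be checked numerically by looking at partial sums. The sketch below compares two choices of P(E_i): the coin-toss case from the text, where P(E_i) = 1/2^i, and an assumed harmonic example, P(E_i) = 1/(i+1), chosen here only as a simple divergent series. The helper names are hypothetical.

```python
def partial_sum(p_continue, n_terms):
    """Sum P(E_i) for i = 1..n_terms, where p_continue(i) returns P(E_i)."""
    return sum(p_continue(i) for i in range(1, n_terms + 1))

def coin_toss(i):
    # Experiment continues past stage i only if i coin tosses all say "sleep".
    return 0.5 ** i

def harmonic(i):
    # Assumed example of a divergent case: P(E_i) = 1 / (i + 1).
    return 1.0 / (i + 1)

for n in (10, 1_000, 100_000):
    print(n, partial_sum(coin_toss, n), partial_sum(harmonic, n))
```

The coin-toss column settles near 1, while the harmonic column keeps growing (roughly like the logarithm of n), so no renormalization can turn it into a probability distribution over stages.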
Even if we reformulate the experiment in terms of decision theory, it's not clear how it will help us. If the bet is "win 1 utilon if you get your iteration number right", the probability of winning it in a divergent case is 0 at any given iteration. And yet, if all cases are perfectly symmetric information-wise so that you make the same decision over and over again, you'll eventually get the answer right, with exactly one of you winning the bet, no matter what your "decision function" is, even if it's simply something like "return 42;". Even a stopped clock is right sometimes, in this case once.
It would be tempting, seeing this, to discard the notion of "subjective anticipation" altogether as ill-defined. But that seems to me like tossing out the Born probabilities just because we go from Copenhagen to MWI. If I'm forked, I expect to continue my experience as either the original or the copy with a probability of 1/2 -- whatever that means. If I'm asked to participate in the classic Sleeping Beauty experiment, and to observe the once-flipped coin at every point I wake up, I will expect to see tails with a probability of 2/3 -- again, whatever that means.
The situations described here have a very specific set of conditions. We're dealing with complete information erasure, which prevents any kind of Bayesian update and in fact makes the situation completely symmetric from the decision agent's perspective. We're also dealing with an anticipation all the way into infinity, which cannot occur in practice due to the finite lifespan of the universe. And yet, I'm not sure what to do with the apparent need to update my anticipations for times arbitrarily far into the future, for an arbitrarily large number of copies, for outcomes with an arbitrarily high degree of causal removal from my current state, which may fail to occur, before the sequence of events that can lead to them is even put into motion.
If a tree falls on Sleeping Beauty...
Several months ago, we had an interesting discussion about the Sleeping Beauty problem, which runs as follows:
Sleeping Beauty volunteers to undergo the following experiment. On Sunday she is given a drug that sends her to sleep. A fair coin is then tossed just once in the course of the experiment to determine which experimental procedure is undertaken. If the coin comes up heads, Beauty is awakened and interviewed on Monday, and then the experiment ends. If the coin comes up tails, she is awakened and interviewed on Monday, given a second dose of the sleeping drug, and awakened and interviewed again on Tuesday. The experiment then ends on Tuesday, without flipping the coin again. The sleeping drug induces a mild amnesia, so that she cannot remember any previous awakenings during the course of the experiment (if any). During the experiment, she has no access to anything that would give a clue as to the day of the week. However, she knows all the details of the experiment.
Each interview consists of one question, “What is your credence now for the proposition that our coin landed heads?”
In the end, the fact that there were so many reasonable-sounding arguments for both sides, and so much disagreement about a simple-sounding problem among above-average rationalists, should have set off major alarm bells. Yet only a few people pointed this out; most commenters, including me, followed the silly strategy of trying to answer the question, and I did so even after I noticed that my intuition could see both answers as being right depending on which way I looked at it, which in retrospect would have been a perfect time to say “I notice that I am confused” and backtrack a bit…
And on reflection, considering my confusion rather than trying to consider the question on its own terms, it seems to me that the problem (as it’s normally stated) is completely a tree-falling-in-the-forest problem: a debate about the normatively “correct” degree of credence which only seemed like an issue because any conclusions about what Sleeping Beauty “should” believe weren’t paying their rent, were disconnected from any expectation of feedback from reality about how right they were.
What Makes My Attempt Special?
A crucial question towards the beginning of any research project is, why should my group succeed in elucidating an answer to a question where others may have tried and failed?
Here's how I'm going about dividing the possible worlds, but I'm interested to see if anyone has any other strategies. First, the whole question is conditional on nobody having already answered the particular question you're interested in. So, you first need an exhaustive literature review, which should scale in intensity based on how much effort you expect to actually expend on the project. Still nothing? These are the remaining possibilities:
1) Nobody else has ever thought of your question, even though all of the pieces of knowledge needed to formulate it have been known for years. If the field has many people involved, the probability of this is vanishingly small and you should systematically disabuse yourself of your fantasies if you think like this often. Still... if true, the prognosis: a good sign.
2) Nobody else has ever thought of your question, because it wouldn't have been ask-able without pieces of knowledge that were discovered just recently. This is common in fast-paced fields and it's why they can be especially exciting. The prognosis: a good sign, but work quickly!
3) Others have thought of your question, but didn't think it was interesting enough to devote serious attention to. We should take this seriously, as how informed others choose to allocate their attention is one of our better approximations to real prediction markets. So, the prognosis: bad sign. Figure out whether you can not only answer your question but validate its usefulness / importance, too.
4) Others have thought of your question, thought it was interesting, but have never tried to answer it because of resource or technical constraints, which you do not face. Prognosis: probably the best-case scenario.
5) Others have thought of your question and run the relevant tests, but failed to get any consistent / reliable results. It'd be nice if there were no publication bias but of course there is--people are much more likely to publish statistically significant, positive results. Due to this bias, it is sometimes hard to tell precisely how many dead skeletons and dismembered brains line your path, and because of this uncertainty you must assign this possibility a non-zero probability. The prognosis: a bad sign, but do you feel lucky?
6) Others have thought of your question, run the relevant tests, and failed to get consistent / reliable results, but used a different method than the one you will use. Your new tech might clear up some of the murkiness, but it's important here to be precise about which specific issues your method solves and which it doesn't. The prognosis: all things equal, a good sign.
These are the considerations we make when we decide whether to pursue a given topic. But even if you do choose to pursue the question, some of these possibilities have policy recommendations for how to proceed. For example, using new tech, even if it's not necessarily demonstrably better in all cases, seems like a good idea given the possibility of #6.
A Proof of Occam's Razor
Related to: Occam's Razor
If the Razor is defined as, “On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis,” or stated in another way, "As the complexity of your hypotheses goes to infinity, their probability goes to zero," then it can be proven from a few assumptions.
1) The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. Note that this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid for all cases.
2) A complexity measure is assigned to hypotheses in such a way that there are or may be some hypotheses which are as simple as possible, and these are assigned the complexity measure of 1, while hypotheses considered to be more complex are assigned higher integer values such as 2, 3, 4, and so on. Note that apart from this, we can define the complexity measure in any way we like, for example as the number of words used by the hypothesis, or in another way, as the shortest program which can output the hypothesis in a given programming language (e.g. the language of the hypotheses might be English but their simplicity measured according to a programming language; Eliezer Yudkowsky follows this way in the linked article.) Many other definitions would be possible. The proof is valid for all definitions that follow the conditions laid out.
3) The complexity measure should also be defined in such a way that there are a finite number of hypotheses given the measure of 1, a finite number given the measure of 2, a finite number given the measure of 3, and so on. Note that this condition is not difficult to satisfy; it would be satisfied by either of the definitions mentioned in condition 2, and in fact by any reasonable definition of simplicity and complexity. The proof would not be valid without this condition precisely because if simplicity were understood in such a way as to allow for an infinite number of hypotheses with minimum simplicity, the Razor would not be valid for that understanding of simplicity.
The Razor follows of necessity from these three conditions. To explain any data, there will be in general infinitely many mutually exclusive hypotheses which could fit the data. Suppose we assign prior probabilities for all of these hypotheses. Given condition 3, it will be possible to find the average probability for hypotheses of complexity 1 (call it x1), the average probability for hypotheses of complexity 2 (call it x2), the average probability for hypotheses of complexity 3 (call it x3), and so on. Now consider the infinite sum “x1 + x2 + x3…” Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it will converge to a value less than 1, since if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, it would have converged to exactly 1—because probability theory demands that the sum of all the probabilities of all our mutually exclusive hypotheses should be exactly 1.
Now, x1 is a finite real number. So in order for this series to converge, there must be only a finite number of later terms in the series equal to or greater than x1. There will therefore be some complexity value, y1, such that all hypotheses with a complexity value greater than y1 have an average probability of less than x1. Likewise for x2: there will be some complexity value y2 such that all hypotheses with a complexity value greater than y2 have an average probability of less than x2. Leaving the derivation for the reader, it would also follow that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis of complexity value 2, and so on.
From this it is clear that on average, or as the complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which was our definition of the Razor.
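As a concrete illustration (not a proof), the sketch below builds one prior that satisfies the three conditions and prints the average probability x_n at each complexity level. The specific choices are assumptions made for the example: 2^n hypotheses at complexity n, and total prior mass 2^-n at that level.

```python
from fractions import Fraction

def hypotheses_at(n):
    """Assumed number of hypotheses with complexity n (finite, per condition 3)."""
    return 2 ** n

def total_mass_at(n):
    """Assumed total prior probability given to all hypotheses of complexity n."""
    return Fraction(1, 2 ** n)

def average_probability(n):
    """The x_n of the text: average prior probability at complexity n."""
    return total_mass_at(n) / hypotheses_at(n)

averages = [average_probability(n) for n in range(1, 11)]
mass = sum(total_mass_at(n) for n in range(1, 11))
print(averages[:4])
print(mass)
```

Under these assumptions the total mass approaches 1 as more levels are included, the sum of the averages stays below 1, and each x_n is smaller than the one before it. The argument in the text is more general: it only guarantees that finitely many later averages can be as large as any given x_n.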
N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, according to some of the comments. However, I didn't remove anything because it would make the comments difficult to understand for later readers.
Applied Bayes' Theorem: Reading People
Or, how to recognize Bayes' theorem when you meet one making small talk at a cocktail party.
Knowing the theory of rationality is good, but it is of little use unless we know how to apply it. Unfortunately, humans tend to be poor at applying raw theory, instead needing several examples before it becomes instinctive. I found some very useful examples in the book Reading People: How to Understand People and Predict Their Behavior - Anytime, Anyplace. While I didn't think that it communicated the skill of actually reading people very well, I did notice that it did have one chapter (titled "Discovering Patterns: Learning to See the Forest, Not Just the Trees") that could almost have been a collection of Less Wrong posts. It also serves as an excellent example of applying Bayes' theorem in every-day life.
In "What is Bayesianism?" I said that the first core tenet of Bayesianism is "Any given observation has many different possible causes". Reading People says:
If this book could deliver but one message, it would be that to read people effectively you must gather enough information about them to establish a consistent pattern. Without that pattern, your conclusions will be about as reliable as a tarot card reading.
In fact, the author is saying that Bayes' theorem applies when you're trying to read people (if this is not immediately obvious, just keep reading). Any particular piece of evidence about a person could have various causes. For example, in a later chapter we are offered a list of possible reasons for why someone may have dressed inappropriately for an occasion. They might (1) be seeking attention, (2) lack common sense, (3) be self-centered and insensitive to others, (4) be trying to show that they are spontaneous, rebellious, or nonconformists and don't care what other people think, (5) not have been taught how to dress and act appropriately, (6) be trying to imitate someone they admire, (7) value comfort and convenience over all else, or (8) simply not have the right attire for the occasion.
Similarly, very short hair on a man might indicate that he (1) is in the military, or was at some point in his life, (2) works for an organization that demands very short hair, such as a police force or fire department, (3) is trendy, artistic or rebellious, (4) is conservative, (5) is undergoing or recovering from a medical treatment, (6) thinks he looks better with short hair, (7) plays sports, or (8) keeps his hair short for practical reasons.
So much for reading people being easy. This, again, is the essence of Bayes' theorem: even though somebody being in the military might almost certainly mean that they'd have short hair, their having short hair does not necessarily mean that they are in the military. On the other hand, if someone has short hair, is clearly knowledgeable about weapons and tactics, displays a no-nonsense attitude, is in good shape, and has a very Spartan home... well, though it's still not for certain, it seems likely to me that of all the people having all of these attributes, quite a few of them are in the military or in similar occupations.
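The short-hair example can be put into numbers. All three inputs below are hypothetical figures chosen only for illustration; neither the book nor this post gives any. The sketch shows that even when nearly everyone in the military has short hair, short hair alone leaves the probability of military service low.

```python
from fractions import Fraction

def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem: P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = prior * p_evidence_given_h
    return numerator / (numerator + (1 - prior) * p_evidence_given_not_h)

# Hypothetical numbers, for illustration only.
p_military = Fraction(1, 100)               # assumed base rate of military service
p_short_given_military = Fraction(95, 100)  # assumed: almost all have short hair
p_short_given_other = Fraction(30, 100)     # assumed: many civilians do too

p_military_given_short = posterior(
    p_military, p_short_given_military, p_short_given_other
)
print(p_military_given_short, float(p_military_given_short))
```

With these made-up inputs the posterior is about 3%. Each further piece of evidence that is much more common among military people than among others would be folded in the same way, with the previous posterior as the new prior, which is why a consistent pattern matters more than any single observation.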
Conditioning on Observers
Response to Beauty quips, "I'd shut up and multiply!"
Related to The Presumptuous Philosopher's Presumptuous Friend, The Absent-Minded Driver, Sleeping Beauty gets counterfactually mugged
This is somewhat introductory. Observers play a vital role in the classic anthropic thought experiments, most notably the Sleeping Beauty and Presumptuous Philosopher gedankens. Specifically, it is remarkably common to condition simply on the existence of an observer, in spite of the continuity problems this raises. The source of confusion appears to be based on the distinction between the probability of an observer and the expected number of observers, with the former not being a linear function of problem definitions.
There is a related difference between the expected gain of a problem and the expected gain per decision, which has been exploited in more complex counterfactual mugging scenarios. As in the case of the 1/2 or 1/3 confusion, the issue is the number of decisions that are expected to be made, and recasting problems so that there is at most one decision provides a clear intuition pump.
Beauty quips, "I'd shut up and multiply!"
When it comes to probability, you should trust probability laws over your intuition. Many people got the Monty Hall problem wrong because their intuition was bad. You can get the solution to that problem using probability laws that you learned in Stats 101 -- it's not a hard problem. Similarly, there has been a lot of debate about the Sleeping Beauty problem. Again, though, that's because people are starting with their intuition instead of letting probability laws lead them to understanding.
The Sleeping Beauty Problem
On Sunday she is given a drug that sends her to sleep. A fair coin is then tossed just once in the course of the experiment to determine which experimental procedure is undertaken. If the coin comes up heads, Beauty is awakened and interviewed on Monday, and then the experiment ends. If the coin comes up tails, she is awakened and interviewed on Monday, given a second dose of the sleeping drug, and awakened and interviewed again on Tuesday. The experiment then ends on Tuesday, without flipping the coin again. The sleeping drug induces a mild amnesia, so that she cannot remember any previous awakenings during the course of the experiment (if any). During the experiment, she has no access to anything that would give a clue as to the day of the week. However, she knows all the details of the experiment.
Each interview consists of one question, "What is your credence now for the proposition that our coin landed heads?"
Two popular solutions have been proposed: 1/3 and 1/2
The 1/3 solution
From Wikipedia:
Suppose this experiment were repeated 1,000 times. We would expect to get 500 heads and 500 tails. So Beauty would be awoken 500 times after heads on Monday, 500 times after tails on Monday, and 500 times after tails on Tuesday. In other words, only in a third of the cases would heads precede her awakening. So the right answer for her to give is 1/3.
Yes, it's true that only in a third of cases would heads precede her awakening.
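The counting in the quoted argument can be written out directly. This sketch uses the expected counts from the quote (half the runs heads, half tails) rather than simulating coin flips; the function name is hypothetical.

```python
from fractions import Fraction

def awakening_counts(n_experiments):
    """Expected number of awakenings of each kind over n_experiments runs."""
    heads_runs = n_experiments // 2
    tails_runs = n_experiments - heads_runs
    return {
        ("heads", "Monday"): heads_runs,
        ("tails", "Monday"): tails_runs,
        ("tails", "Tuesday"): tails_runs,
    }

counts = awakening_counts(1000)
total = sum(counts.values())
heads_share = Fraction(counts[("heads", "Monday")], total)
print(counts, total, heads_share)
```

This confirms the frequency claim: of 1,500 awakenings, 500 follow heads. Whether that frequency is the right answer to the credence question is the point under dispute.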
Radford Neal (a statistician!) argues that 1/3 is the correct solution.
This [the 1/3] view can be reinforced by supposing that on each awakening Beauty is offered a bet in which she wins 2 dollars if the coin lands Tails and loses 3 dollars if it lands Heads. (We suppose that Beauty knows such a bet will always be offered.) Beauty would not accept this bet if she assigns probability 1/2 to Heads. If she assigns a probability of 1/3 to Heads, however, her expected gain is 2 × (2/3) − 3 × (1/3) = 1/3, so she will accept, and if the experiment is repeated many times, she will come out ahead.
Neal is correct (about the gambling problem).
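Neal's arithmetic can be checked in a few lines. The per-awakening figures follow the quote; the per-experiment figure is an addition of this sketch, computed from the fact that the bet is offered once after heads and twice after tails.

```python
from fractions import Fraction

WIN_IF_TAILS = 2    # dollars won per bet if the coin landed tails
LOSS_IF_HEADS = 3   # dollars lost per bet if the coin landed heads

def gain_per_awakening(p_heads):
    """Expected gain of one bet for someone who assigns p_heads to heads."""
    return WIN_IF_TAILS * (1 - p_heads) - LOSS_IF_HEADS * p_heads

def gain_per_experiment():
    """Expected total gain over one run: one bet after heads, two after tails."""
    half = Fraction(1, 2)  # fair coin
    return half * (-LOSS_IF_HEADS * 1) + half * (WIN_IF_TAILS * 2)

print(gain_per_awakening(Fraction(1, 3)))  # Neal's figure
print(gain_per_awakening(Fraction(1, 2)))
print(gain_per_experiment())
```

Always accepting is profitable over a whole run even with a fair coin, because the winning bet is placed twice. The betting result therefore depends on how many bets are made as well as on the probability of heads.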
These two arguments for the 1/3 solution appeal to intuition and make no obvious mathematical errors. So why are they wrong?
Let's first start with probability laws and show why the 1/2 solution is correct. Just like with the Monty Hall problem, once you understand the solution, the wrong answer will no longer appeal to your intuition.
The 1/2 solution
P(Beauty woken up at least once | heads) = P(Beauty woken up at least once | tails) = 1. Because of the amnesia, all Beauty knows when she is woken up is that she has woken up at least once. That event had the same probability of occurring under either coin outcome. Thus, P(heads | Beauty woken up at least once) = 1/2. You can use Bayes' rule to see this if it's unclear.
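Written out, the Bayes' rule step looks like this. It is only a sketch of the calculation stated above; the symbols W ("Beauty is woken up at least once"), H (heads) and T (tails) are labels introduced here for brevity.

```latex
\begin{align*}
P(H \mid W) &= \frac{P(W \mid H)\,P(H)}{P(W \mid H)\,P(H) + P(W \mid T)\,P(T)} \\
            &= \frac{1 \cdot \tfrac{1}{2}}{1 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{2}}
             = \frac{1}{2}.
\end{align*}
```

Because both likelihoods equal 1, the evidence W leaves the prior of 1/2 unchanged.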
Here's another way to look at it:
If it landed heads then Beauty is woken up on Monday with probability 1.
If it landed tails then Beauty is woken up on Monday and Tuesday. From her perspective, these days are indistinguishable. She doesn't know if she was woken up the day before, and she doesn't know if she'll be woken up the next day. Thus, we can view Monday and Tuesday as exchangeable here.
A probability tree can help with the intuition (this is a probability tree corresponding to an arbitrary wake up day):
If Beauty was told the coin came up heads, then she'd know it was Monday. If she was told the coin came up tails, then she'd think there is a 50% chance it's Monday and a 50% chance it's Tuesday. Of course, when Beauty is woken up she is not told the result of the flip, but she can calculate the probability of each.
When she is woken up, she's somewhere on the second set of branches. We have the following joint probabilities: P(heads, Monday)=1/2; P(heads, not Monday)=0; P(tails, Monday)=1/4; P(tails, Tuesday)=1/4; P(tails, not Monday or Tuesday)=0. Thus, P(heads)=1/2.
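The marginal for heads can be checked by summing the joint probabilities listed above. The following is a minimal sketch; the dictionary layout and the `marginal` helper are illustrative choices, not part of the original argument.

```python
from fractions import Fraction

# Joint probabilities for (coin, day) at an arbitrary awakening, as listed in the text.
# The zero-probability cells are omitted.
joint = {
    ("heads", "Monday"): Fraction(1, 2),
    ("tails", "Monday"): Fraction(1, 4),
    ("tails", "Tuesday"): Fraction(1, 4),
}

def marginal(coin):
    """Sum the joint probabilities over days for the given coin outcome."""
    return sum(p for (c, _day), p in joint.items() if c == coin)

p_heads = marginal("heads")
p_tails = marginal("tails")
print(p_heads, p_tails)
```

The cells sum to 1, and each coin outcome carries a total probability of 1/2 under this tree.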
Where the 1/3 arguments fail
The 1/3 argument says that with heads there is 1 interview, with tails there are 2 interviews, and therefore the probability of heads is 1/3. However, the argument would only hold if all 3 interview days were equally likely. That's not the case here: on a wake-up day, heads & Monday is more likely than tails & Monday, for example.
Neal's argument fails because he changed the problem. "on each awakening Beauty is offered a bet in which she wins 2 dollars if the coin lands Tails and loses 3 dollars if it lands Heads." In this scenario, she would make the bet twice if tails came up and once if heads came up. That has nothing to do with probability about the event at a particular awakening. The fact that she should take the bet doesn't imply that heads is less likely. Beauty just knows that she'll win the bet twice if tails landed. We double count for tails.
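To see how much of the bet's appeal comes from the double counting, one can compute the expected gain per experiment directly. This is a rough sketch using the stakes from Neal's example and a fair coin; the function name and the "settled once" comparison case are assumptions added here for illustration.

```python
from fractions import Fraction

WIN_IF_TAILS = 2   # dollars won per accepted bet when the coin is tails
LOSS_IF_HEADS = 3  # dollars lost per accepted bet when the coin is heads

def expected_gain_per_experiment(bets_if_heads, bets_if_tails):
    """Expected total gain over one run of the experiment with a fair coin."""
    half = Fraction(1, 2)
    return half * (-LOSS_IF_HEADS * bets_if_heads) + half * (WIN_IF_TAILS * bets_if_tails)

# Neal's version: the bet is settled at every awakening (1 on heads, 2 on tails).
every_awakening = expected_gain_per_experiment(1, 2)
# Same stakes, but settled only once per experiment (hypothetical comparison case).
once_per_experiment = expected_gain_per_experiment(1, 1)
print(every_awakening, once_per_experiment)
```

With the bet settled at every awakening the expected gain is +1/2 dollar per experiment; with the same stakes settled once it is -1/2 dollar. The sign flips purely because tails pays out twice, not because the coin is any less likely to land heads.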
Imagine I said "if you guess heads and you're wrong nothing will happen, but if you guess tails and you're wrong I'll punch you in the stomach." In that case, you will probably guess heads. That doesn't mean your credence for heads is 1 -- it just means I added a greater penalty to the other option.
Consider changing the problem to something more extreme. Here, we start with heads having probability 0.99 and tails having probability 0.01. If heads comes up we wake Beauty up once. If tails, we wake her up 100 times. Thirder logic would go like this: if we repeated the experiment 1000 times, we'd expect her to be woken up 990 times after heads on Monday, 10 times after tails on Monday (day 1), 10 times after tails on Tuesday (day 2), ..., and 10 times after tails on day 100. In other words, in ~50% of the cases heads would precede her awakening. So the right answer for her to give is 1/2.
Of course, this would be absurd reasoning. Beauty knows heads has a 99% chance initially. But when she wakes up (which she was guaranteed to do regardless of whether heads or tails came up), she suddenly thinks they're equally likely? What if we made it even more extreme and woke her up even more times on tails?
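The awakening counts in this extreme variant are easy to tally. Below is a small sketch of the thirder-style counting; the function name and parameters are hypothetical and only reproduce the arithmetic described above.

```python
from fractions import Fraction

def heads_share_of_awakenings(runs, p_heads, wakes_if_heads, wakes_if_tails):
    """Expected fraction of awakenings that follow heads, using the thirder's counting."""
    heads_awakenings = runs * p_heads * wakes_if_heads
    tails_awakenings = runs * (1 - p_heads) * wakes_if_tails
    return heads_awakenings / (heads_awakenings + tails_awakenings)

# Extreme variant from the text: P(heads) = 0.99, 1 awakening on heads, 100 on tails.
extreme = heads_share_of_awakenings(1000, Fraction(99, 100), 1, 100)
# Original problem: fair coin, 1 awakening on heads, 2 on tails.
original = heads_share_of_awakenings(1000, Fraction(1, 2), 1, 2)
print(extreme, float(extreme), original)
```

The extreme case gives 990/1990 = 99/199, just under one half, while the same counting applied to the original problem gives 1/3.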
Implausible consequence of 1/2 solution?
Nick Bostrom presents the Extreme Sleeping Beauty problem:
This is like the original problem, except that here, if the coin falls tails, Beauty will be awakened on a million subsequent days. As before, she will be given an amnesia drug each time she is put to sleep that makes her forget any previous awakenings. When she awakes on Monday, what should be her credence in HEADS?
He argues:
The adherent of the 1/2 view will maintain that Beauty, upon awakening, should retain her credence of 1/2 in HEADS, but also that, upon being informed that it is Monday, she should become extremely confident in HEADS:
P+(HEADS) = 1,000,001/1,000,002
This consequence is itself quite implausible. It is, after all, rather gutsy to have credence 0.999999% in the proposition that an unobserved fair coin will fall heads.
It's correct that, upon awakening on Monday (and not knowing it's Monday), she should retain her credence of 1/2 in heads.
However, if she is informed it's Monday, it's unclear what she should conclude. Why was she informed it was Monday? Consider two alternatives.
Disclosure process 1: regardless of the result of the coin toss she will be informed it's Monday on Monday with probability 1
Under disclosure process 1, her credence of heads on Monday is still 1/2.
Disclosure process 2: if heads she'll be woken up and informed that it's Monday. If tails, she'll be woken up on Monday and one million subsequent days, and only be told the specific day on one randomly selected day.
Under disclosure process 2, if she's informed it's Monday, her credence of heads is 1,000,001/1,000,002. However, this is not implausible at all. It's correct. This statement is misleading: "It is, after all, rather gutsy to have credence 0.999999% in the proposition that an unobserved fair coin will fall heads." Beauty isn't predicting what will happen on the flip of a coin, she's predicting what did happen after receiving strong evidence that it's heads.
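The two disclosure processes can be compared with a short Bayes' rule calculation. This is a sketch under stated assumptions: the likelihoods are read off the descriptions above (under process 2, tails gives a 1 in 1,000,001 chance that the disclosed day is Monday), and the function and variable names are introduced here.

```python
from fractions import Fraction

def posterior_heads(prior_heads, p_told_monday_if_heads, p_told_monday_if_tails):
    """Bayes' rule for P(heads | informed it is Monday)."""
    num = prior_heads * p_told_monday_if_heads
    den = num + (1 - prior_heads) * p_told_monday_if_tails
    return num / den

prior = Fraction(1, 2)
days_if_tails = 1_000_001  # Monday plus one million subsequent days

# Process 1: she is told "Monday" on Monday with probability 1 under either outcome.
process_1 = posterior_heads(prior, 1, 1)
# Process 2: on tails the day is disclosed on one uniformly chosen day (assumed uniform).
process_2 = posterior_heads(prior, 1, Fraction(1, days_if_tails))
print(process_1, process_2)
```

Process 1 leaves the credence at 1/2, while process 2 yields 1,000,001/1,000,002, matching the figure in Bostrom's argument.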
ETA (5/9/2010 5:38AM)
If we want to replicate the situation 1000 times, we shouldn't end up with 1500 observations. The correct way to replicate the awakening decision is to use the probability tree I included above. You'd end up with expected cell counts of 500, 250, 250, instead of 500, 500, 500.
Suppose at each awakening, we offer Beauty the following wager: she'd lose $1.50 if heads but win $1 if tails. She is asked for a decision on that wager at every awakening, but we only accept her last decision. Thus, if tails we'll accept her Tuesday decision (but won't tell her it's Tuesday). If her credence of heads is 1/3 at each awakening, then she should take the bet. If her credence of heads is 1/2 at each awakening, she shouldn't take the bet. If we repeat the experiment many times, she'd be expected to lose money if she accepts the bet every time.
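The numbers behind this wager can be worked out explicitly. The sketch below separates the gain Beauty would compute from her credence from the actual long-run average per experiment; the names are illustrative, and the only inputs are the stakes given above and the fair coin.

```python
from fractions import Fraction

LOSS_IF_HEADS = Fraction(3, 2)  # she loses $1.50 if heads
WIN_IF_TAILS = Fraction(1)      # she wins $1 if tails

def perceived_gain(credence_heads):
    """Expected gain of the wager as Beauty would compute it from her credence."""
    return (1 - credence_heads) * WIN_IF_TAILS - credence_heads * LOSS_IF_HEADS

# Only the last decision counts, so the bet is settled exactly once per experiment,
# and the fair coin determines the actual long-run average per experiment.
actual_gain_per_experiment = Fraction(1, 2) * WIN_IF_TAILS - Fraction(1, 2) * LOSS_IF_HEADS

print(perceived_gain(Fraction(1, 3)), perceived_gain(Fraction(1, 2)), actual_gain_per_experiment)
```

A credence of 1/3 makes the wager look favorable (+1/6 dollar), a credence of 1/2 makes it look unfavorable (-1/4 dollar), and the actual average if she always accepts is -1/4 dollar per experiment.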
The problem with the logic that leads to the 1/3 solution is it counts twice under tails, but the question was about her credence at an awakening (interview).
ETA (5/10/2010 10:18PM ET)
Recall the Wikipedia argument for 1/3 quoted earlier:
Suppose this experiment were repeated 1,000 times. We would expect to get 500 heads and 500 tails. So Beauty would be awoken 500 times after heads on Monday, 500 times after tails on Monday, and 500 times after tails on Tuesday. In other words, only in a third of the cases would heads precede her awakening. So the right answer for her to give is 1/3.
Another way to look at it: the denominator is not a sum of mutually exclusive events. Typically we use counts to estimate probabilities as follows: the numerator is the number of times the event of interest occurred, and the denominator is the number of times that event could have occurred.
For example, suppose Y can take values 1, 2 or 3 and follows a multinomial distribution with probabilities p1, p2 and p3=1-p1-p2, respectively. If we generate n values of Y, we could estimate p1 by taking the ratio of #{Y=1}/(#{Y=1}+#{Y=2}+#{Y=3}). As n goes to infinity, the ratio will converge to p1. Notice the events in the denominator are mutually exclusive and exhaustive. The denominator is determined by n.
The thirder solution to the Sleeping Beauty problem has as its denominator sums of events that are not mutually exclusive. The denominator is not determined by n. For example, if we repeat it 1000 times, and we get 400 heads, our denominator would be 400+600+600=1600 (even though it was not possible to get 1600 heads!). If we instead got 550 heads, our denominator would be 550+450+450=1450. Our denominator is outcome dependent, where here the outcome is the occurrence of heads. What does this ratio converge to as n goes to infinity? I surely don't know. But I do know it's not the posterior probability of heads.
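The outcome dependence of the denominator can be made concrete with a few lines of code. This is just a sketch of the counting described above; the function name is hypothetical.

```python
def thirder_denominator(n_runs, n_heads):
    """Total awakenings counted: one per heads run, two per tails run."""
    n_tails = n_runs - n_heads
    return n_heads + n_tails + n_tails

# The denominator changes with the number of heads observed in 1000 runs.
for n_heads in (400, 500, 550):
    print(n_heads, thirder_denominator(1000, n_heads))
```

For 400, 500 and 550 heads out of 1000 runs the denominators are 1600, 1500 and 1450 respectively, so the denominator is not fixed by the number of runs alone.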