Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Bayesianism for humans: prosaic priors

22 BT_Uytya 02 September 2014 09:45PM


There are two insights from Bayesianism which occurred to me and which I hadn't seen anywhere else before.
I like lists in the two posts linked above, so for the sake of completeness, I'm going to add my two cents to a public domain. This post is about the second penny, the first one is here.

Prosaic Priors

The second insight can be formulated as «the dull explanations are more likely to be correct because they tend to have high prior probability.»

Why is that? 

1) Almost by definition! Some property X is 'banal' if X applies to a lot of people in an disappointingly mundane way, not having any redeeming features which would make it more rare (and, hence, interesting).

In the other words, X is banal iff base rate of X is high. Or, you can say, prior probability of X is high.

1.5) Because of Occam's Razor and burdensome details. One way to make something boring more exciting is to add interesting details: some special features which will make sure that this explanation is about you as opposed to 'about almost anybody'.

This could work the other way around: sometimes the explanation feels unsatisfying exactly because it was shaved of any unnecessary and (ultimately) burdensome details.

2) Often, the alternative of a mundane explanation is something unique and custom made to fit the case you are interested in. And anybody familiar with overfitting and conjunction fallacy (and the fact that people tend to love coherent stories with blinding passion1) should be very suspicious about such things. So, there could be a strong bias against stale explanations, which should  be countered.

* * *

I fully grokked this when being in process of CBT-induced soul-searching; usage in this context still looks the most natural to me, but I believe that the area of application of this heuristic is wider.


1) I'm fairly confident that I'm an introvert. Still, sometimes I can behave like an extrovert. I was interested in the causes of this "extroversion activation", as I called it2. I suspected that I really had two modes of functioning (with "introversion" being the default one), and some events — for example, mutual interest (when I am interested in a person I was talking to, and xe is interested in me) or feeling high-status — made me switch between them.

Or, you know, it could be just reduction in a social anxiety, which makes people more communicative. Increased anxiety levels wasn't a new element to be postulated; I already knew I had it, yet I was tempted to make up new mental entities, and prosaic explanation about anxiety managed to avoid me for a while.

2) I find it hard to do something I consider worthwhile while on a spring break, despite having lots of a free time. I tend to make grandiose plans — I should meet new people! I should be more involved in sports! I should start using Anki! I should learn Lojban! I should practice meditation! I should read these textbooks including doing most of exercises! — and then fail to do almost anything. Yet I manage to do some impressive stuff during academic term, despite having less time and more commitments.

This paradoxical situation calls for explanation.

The first hypothesis that came to my mind was about activation energy. It takes effort to go  from "procrastinating" to "doing something"; speaking more generally, you can say that it takes effort to go from "lazy day" to "productive day". During the academic term, I am forced to make most of my days productive: I have to attend classes, do homework, etc. And, already having done something good, I can do something else as well. During spring break, I am deprived of that natural structure, and, hence I am on my own in terms of starting doing something I find worthwhile.

The alternative explanation: I was tired. Because, you know, vacation comes right after midterms, and I tend to go all out while preparing for midterms. I am exhausted, my energy and willpower are scarce, so it's no wonder I am having trouble utilizing it.

(I don't really believe in the latter explanation (I think that my situation is caused by several factors, including two outlined above), so it is also an example of descriptive "probable enough" hypothesis)

3) This example comes from Slate Star Codex. Nerds tend to find aversive many group bonding activities usual people supposedly enjoy, such as patriotism, prayer, team sports, and pep rallies. Supposedly, they should feel (with a tear-jerking passion of thousand exploding suns) the great unity with their fellow citizens, church-goers, teammates or pupils respectively, but instead they feel nothing.

Might it be that nerds are unable to enjoy these activities because something is broken inside their brains? One could be tempted to construct an elaborate argument involving autism spectrum and a mild case of schizoid personality disorder. In other words, this calls for postulating a rare form of autism which affects only some types of social behaviour (perception of group activities), leaving other types unchanged.

Or, you know, maybe nerds just don't like the group they are supposed to root for. Maybe nerds don't feel unity and relationship to The Great Whole because they don't feel like they truly belong here.

As Scott put it, "It’s not that we lack the ability to lose ourselves in an in-group, it’s that all the groups people expected us to lose ourselves in weren’t ones we could imagine as our in-group by any stretch of the imagination"3.

4) This example comes from this short comic titled "Sherlock Holmes in real life".

5) Scott Aaronson uses something similar to the Hanlon's Razor to explain that the lack of practical expertise of CS theorists aren't caused by arrogance or something like that:

"If theorists don’t have as much experience building robots as they should have, don’t know as much about large software projects as they should  know, etc., then those are all defects to add to the long list of their other, unrelated defects.  But it would be a mistake to assume that they failed to acquire this knowledge because of disdain for practical people, rather than for mundane reasons like busyness or laziness."

* * *

...and after this the word "prosaic" quickly turned into an awesome compliment. Like, "so, this hypothesis explains my behaviour well; but is it boring enough?", or "your claim is refreshingly dull; I like it!".

1. If you had read Thinking: Fast and Slow, you probably know what I mean. If you hadn't, you can look at narrative fallacy in order to get a general idea.
2. Which was, as I now realize, an excellent way to deceive myself via using word with a lot of hidden assumptions. Taboo your words, folks!
3. As a side note, my friend proposed an alternative explanation: the thing is, often nerds are defined as "sort of people who dislike pep rallies". So, naturally, we have "usual people" who like pep rallies and "nerds" who avoid them. And then "nerds dislike pep rallies" is tautology rather than something to be explained.

Bayesianism for humans: "probable enough"

38 BT_Uytya 02 September 2014 09:44PM

There are two insights from Bayesianism which occurred to me and which I hadn't seen anywhere else before.
I like lists in the two posts linked above, so for the sake of completeness, I'm going to add my two cents to a public domain. Second penny is here.

"Probable enough"

When you have eliminated the impossible, whatever  remains is often more improbable than your having made a mistake in one  of your impossibility proofs.

Bayesian way of thinking introduced me to the idea of "hypothesis which is probably isn't true, but probable enough to rise to the level of conscious attention" — in other words, to the situation when P(H) is notable but less than 50%.

Looking back, I think that the notion of taking seriously something which you don't think is true was alien to me. Hence, everything was either probably true or probably false; things from the former category were over-confidently certain, and things from the latter category were barely worth thinking about.

This model was correct, but only in a formal sense.

Suppose you are living in Gotham, the city famous because of it's crime rate and it's masked (and well-funded) vigilante, Batman. Recently you had read The Better Angels of Our Nature: Why Violence Has Declined by Steven Pinker, and according to some theories described here, Batman isn't good for Gotham at all.

Now you know, for example, the theory of Donald Black that "crime is, from the point of view of the perpetrator, the pursuit of justice". You know about idea that in order for crime rate to drop, people should perceive their law system as legitimate. You suspect that criminals beaten by Bats don't perceive the act as a fair and regular punishment for something bad, or an attempt to defend them from injustice; instead the act is perceived as a round of bad luck. So, the criminals are busy plotting their revenge, not internalizing civil norms.

You believe that if you send your copy of book (with key passages highlighted) to the person connected to Batman, Batman will change his ways and Gotham will become much more nice in terms of homicide rate. 

So you are trying to find out Batman's secret identity, and there are 17 possible suspects. Derek Powers looks like a good candidate: he is wealthy, and has a long history of secretly delegating illegal-violence-including tasks to his henchmen; however, his motivation is far from obvious. You estimate P(Derek Powers employs Batman) as 20%. You have very little information about other candidates, like Ferris Boyle, Bruce Wayne, Roland Daggett, Lucius Fox or Matches Malone, so you assign an equal 5% to everyone else.

In this case you should pick Derek Powers as your best guess when forced to name only one candidate (for example, if you forced to send the book to someone today), but also you should be aware that your guess is 80% likely to be wrong. When making expected utility calculations, you should take Derek Powers more seriously than Lucius Fox, but only by 15% more seriously.

In other words, you should take maximum a posteriori probability hypothesis into account while not deluding yourself into thinking that now you understand everything or nothing at all. Derek Powers hypothesis probably isn't true; but it is useful.

Sometimes I find it easier to reframe question from "what hypothesis is true?" to "what hypothesis is probable enough?". Now it's totally okay that your pet theory isn't probable but still probable enough, so doubt becomes easier. Also, you are aware that your pet theory is likely to be wrong (and this is nothing to be sad about), so the alternatives come to mind more naturally.

These "probable enough" hypothesis can serve as a very concise summaries of state of your knowledge when you simultaneously outline the general sort of evidence you've observed, and stress that you aren't really sure. I like to think about it like a rough, qualitative and more System1-friendly variant of Likelihood ratio sharing.

Planning Fallacy

The original explanation of planning fallacy (proposed by Kahneman and Tversky) is about people focusing on a most optimistic scenario when asked about typical one (instead of trying to do an Outside VIew). If you keep the distinction between "probable" and "probable enough" in mind, you can see this claim in a new light.

Because the most optimistic scenario is the most probable and the most typical one, in a certain sense.

The illustration, with numbers pulled out of thin air, goes like this: so, you want to visit a museum.

The first thing you need to do is to get dressed and take your keys and stuff. Usually (with 80% probability) you do this very quick, but there is a weak possibility of your museum ticket having been devoured by an entropy monster living on your computer table.

The second thing is to catch bus. Usually (p = 80%), bus is on schedule, but sometimes it can be too early or too late. After this, the bus could (20%) or could not (80%) get stuck in a traffic jam.

Finally, you need to find a museum building. You've been there before once, so you sorta remember your route, yet still could be lost with 20% probability.

And there you have it: P(everything is fine) = 40%, and probability of every other scenario is 10% or even less. "Everything is fine" is probable enough, yet likely to be false. Supposedly, humans pick MAP hypothesis and then forget about every other scenario in order to save computations.

Also, "everything is fine" is a good description of your plan. If your friend asks you, "so how are you planning to get to the museum?", and you answer "well, I catch the bus, get stuck in a traffic jam for 30 agonizing minutes, and then just walk from here", your friend is going  to get a completely wrong idea about dangers of your journey. So, in a certain sense, "everything is fine" is a typical scenario. 

Maybe it isn't human inability to pick the most likely scenario which should be blamed. Maybe it is false assumption that "most likely == likely to be correct" which contributes to this ubiquitous error.

In this case you would be better off having picked the "something will go wrong, and I will be late", instead of "everything will be fine".

So, sometimes you are interested in the best specimen out of your hypothesis space, sometimes you are interested in a most likely thingy (and it doesn't matter how vague it would be), and sometimes there are no shortcuts, and you have to do an actual expected utility calculation.

A proposed inefficiency in the Bitcoin markets

3 Liron 27 December 2013 03:48AM
Salviati: Simplicio, do you think the Bitcoin markets are efficient?

Simplicio: If you'd asked me two years ago, I would have said yes. I know hindsight is 20/20, but even at the time, I think the fact that relatively few people were trading it would have risen to prominence in my analysis.

Salviati: And what about today?

Simplicio: Today, it seems like there's no shortage of trading volume. The hedge funds of the world have heard of Bitcoin, and had their quants do their fancy analyses on it, and they actively trade it.

Salviati: Well, I'm certainly not a quant, but I think I've spotted a systematic market inefficiency. Would you like to hear it?

Simplicio: Nah, I'm good.

Salviati: Did you hear what I said? I think I've spotted an exploitable pattern of price movements in a $10 Billion market. If I'm right, it could make us a lot of money.

Simplicio: Sure, but you won't convince me that whatever pattern you're thinking of is a "reliable" one.

Salviati: Come on, you don't even know what my argument is.

Simplicio: But I know how your argument is going to be structured. First you're going to identify some property of Bitcoin prices in past data. Then you'll explain some causal model you have which supposedly accounts for why prices have had that property in the past. Then you'll say that your model will continue to account for that same property in future Bitcoin prices.

Salviati: Yeah, so? What's wrong with that?

Simplicio: The problem is that you are not a trained quant, and therefore, your brain is not capable of bringing a worthwhile property of Bitcoin prices to your attention.

Salviati: Dude, I just want to let you know because this happens often and no one else is ever going to say anything: you're being a dick.

Simplicio: Look, quants are good at their job. To a first approximation, quants are like perfect Bayesian reasoners who maintain a probability distribution over the "reliability" of every single property of Bitcoin prices that you and I are capable of formulating. So this argument you're going to make to me, a quant has already made to another quant, and the other quant has incorporated it into his hedge fund's trading algorithms.

Salviati: Fine, but so what if quants have already figured out my argument for themselves? We can make money on it too.

Simplicio: No, we can't. I told you I'm pretty confident that the market is efficient, i.e. anti-inductive, meaning the quants of the world haven't left behind any reliable patterns that an armchair investor like you can detect and profit from.

Salviati: Would you just shut up and let me say my argument?

Simplicio: Whatever, knock yourself out.

Salviati: Ok, here goes. Everyone knows Bitcoin prices are volatile, right?

Simplicio: Yeah, highly volatile. But at any given moment, you don't know if the volatility is going to move the price up or down next. From your state of knowledge, it looks like a random walk. If today's Bitcoin price is $1000, then tomorrow's price is as likely to be $900 as it is to be $1100.

Salviati: I agree that the Random Walk Hypothesis provides a good model of prices in efficient markets, and that the size of a each step in a random walk provides a good model of price volatility in efficient markets.

Simplicio: See, I told you you wouldn't convince me.

Salviati: Ah, but my empirical observation of Bitcoin prices is inconsistent with the Random Walk hypothesis. So the only thing I'm led to conclude is that the Bitcoin market is not efficient.

Simplicio: What do you mean "inconsistent"?

Salviati: I mean Bitcoin's past prices don't look much like a random walk. They look more like a random walk on a log scale. If today's price is $1000, then tomorrow's price is equally likely to be $900 or $1111. So if I buy $1000 of Bitcoin today, I expect to have 0.5($900) + 0.5($1111) = $1005.50 tomorrow.

Simplicio: How do you know that? Did you write a script to loop through Bitcoin's daily closing price on Mt. Gox and simulate the behavior of a Bayesian reasoner with a variable-step-size random-walk prior and a second Bayesian reasoner with a variable-step-size log-random-walk prior, and thus calculate a much higher Bayesian Score for the log-random-walk model?

Salviati: Yeah, I did.

Simplicio: That's very virtuous of you.

[This is a fictional dialogue. The truth is, I was too lazy to do that. Can someone please do that? I would much appreciate it. --Liron.]

Salviati: So, have I convinced you that the market is anti-inductive now?

Simplicio: Well, you've empirically demonstrated that the log Random Walk Hypothesis was a good model for predicting Bitcoin prices in the past. But that's just a historical pattern. My original point was that you're not qualified to evaluate which historical patterns are *reliable* patterns. The Bitcoin markets are full of pattern-annihilating forces, and you're not qualified to evaluate which past-data-fitting models are eligible for future-data-fitting.

Salviati: Ok, I'm not saying you have to believe that the future accuracy of log-Random-Walk will probably be higher than the future accuracy of linear Random Walk. I'm just saying you should perform a Bayesian update in the direction of that conclusion.

Simplicio: Ok, but the only reason the update has nonzero strength is because I assigned an a-priori chance of 10% to the set of possible worlds wherein Bitcoin markets were inefficient, and that set of possible worlds gives a higher probability that a model like your log-Random-Walk model would fit the price data well. So I update my beliefs to promote the hypothesis that Bitcoin is inefficient, and in particular that it is inefficient in a log-Random-Walk way.

Salviati: Thanks. And hey, guess what: I think I've traced the source of the log-Random-Walk regularity.

Simplicio: I'm surprised you waited this long to mention that.

Salviati: I figured that if I mentioned it earlier, you'd snap back about how efficient markets sever the causal connection between would-be price-regularity-causing dynamics, and actual prices.

Simplicio: Fair enough.

Salviati: Anyway, the reason Bitcoin prices follow a log-Random-Walk is because they reflect the long-term Expected Value of Bitcoin's actual utility.

Simplicio: Bitcoin has no real utility.

Salviati: It does. It's liquid in novel, qualitatively different ways. It's kind of anonymous. It's a more stable unit of account than the official currencies of some countries.

Simplicio: Come on, how much utility is all that really worth in expectation?

Salviati: I don't know. The Bitcoin economy could be anywhere from hundreds of millions of dollars, to trillions of dollars. Our belief about the long-term future value of a single BTC is spread out across a range whose 90% confidence interval is something like [$10, $100,000] for 1BTC.

Simplicio: Are you saying it's spread out over the interval [$10, $100,000] in a uniform distribution?

Salviati: Nope, it's closer to a bell curve centered at $1000 on a log scale. It gives equal probability of ~10% both to the $10-100 range and to the $10,000-100,000 range.

Simplicio: How do you know that everyone's beliefs are shaped like that?

Salviati: Because everyone has a causal model in their head with a node for "order of magnitude of Bitcoin's value", and that node varies in the characteristically linear fashion of a Bayes net.

Simplicio: I don't feel confident in that explanation.

Salviati: Then take whatever explanation you give yourself to explain the effectiveness of Fermi estimates. Those output a bell curve on a log scale too, and seems like estimating Bitcoin's future value should have a lot of methodology in common with doing back-of-the-envelope calculations about the blast radius of a nuclear bomb.

Simplicio: Alright.

Salviati: So the causality of Bitcoin prices roughly looks like this:

[Beliefs about order of magnitude of Bitcoin's future value] --> [Beliefs about Bitcoin's future price] --> [Trading decisions]

Simplicio: Okay, I see how the first node can fluctuate a lot in reaction to daily news events, and that would have a disproportionately high effect on the last node. But how can an efficient market avoid that kind of log-scale fluctuation? Efficient markets always reflect a consensus estimate of an asset's price, and it's rational to arrive at an estimate that fluctuates on a log scale!

Salviati: Actually, I think a truly efficient market shouldn't just skip around across orders of magnitudes, just because expectations of future prices do. I think truly efficient markets show some degree of "drag", which should be invisible in typical cases like publicly-traded stocks, but become noticeable in cases of order-of-magnitude value-uncertainty like Bitcoin.

Simplicio: So you think you're the only one smart enough to notice that it's worth trading Bitcoin so as to create drag on Bitcoin's log-scale random walk?

Salviati: Yeah, I think maybe I am.

Salviati is claiming that his empirical observations show a lack of drag on Bitcoin price shifts, which would be actionable evidence of inefficiency. Discuss.

Bayes for Schizophrenics: Reasoning in Delusional Disorders

88 Yvain 13 August 2012 07:22PM

Related to: The Apologist and the Revolutionary, Dreams with Damaged Priors

Several years ago, I posted about V.S. Ramachandran's 1996 theory explaining anosognosia through an "apologist" and a "revolutionary".

Anosognosia, a condition in which extremely sick patients mysteriously deny their sickness, occurs during right-sided brain injury but not left-sided brain injury. It can be extraordinarily strange: for example, in one case, a woman whose left arm was paralyzed insisted she could move her left arm just fine, and when her doctor pointed out her immobile arm, she claimed that was her daughter's arm even though it was obviously attached to her own shoulder. Anosognosia can be temporarily alleviated by squirting cold water into the patient's left ear canal, after which the patient suddenly realizes her condition but later loses awareness again and reverts back to the bizarre excuses and confabulations.

Ramachandran suggested that the left brain is an "apologist", trying to justify existing theories, and the right brain is a "revolutionary" which changes existing theories when conditions warrant. If the right brain is damaged, patients are unable to change their beliefs; so when a patient's arm works fine until a right-brain stroke, the patient cannot discard the hypothesis that their arm is functional, and can only use the left brain to try to fit the facts to their belief.

In the almost twenty years since Ramachandran's theory was published, new research has kept some of the general outline while changing many of the specifics in the hopes of explaining a wider range of delusions in neurological and psychiatric patients. The newer model acknowledges the left-brain/right-brain divide, but adds some new twists based on the Mind Projection Fallacy and the brain as a Bayesian reasoner.

continue reading »

Fallacies as weak Bayesian evidence

59 Kaj_Sotala 18 March 2012 03:53AM

Abstract: Exactly what is fallacious about a claim like ”ghosts exist because no one has proved that they do not”? And why does a claim with the same logical structure, such as ”this drug is safe because we have no evidence that it is not”, seem more plausible? Looking at various fallacies – the argument from ignorance, circular arguments, and the slippery slope argument - we find that they can be analyzed in Bayesian terms, and that people are generally more convinced by arguments which provide greater Bayesian evidence. Arguments which provide only weak evidence, though often evidence nonetheless, are considered fallacies.

As a Nefarious Scientist, Dr. Zany is often teleconferencing with other Nefarious Scientists. Negotiations about things such as ”when we have taken over the world, who's the lucky bastard who gets to rule over Antarctica” will often turn tense and stressful. Dr. Zany knows that stress makes it harder to evaluate arguments logically. To make things easier, he would like to build a software tool that would monitor the conversations and automatically flag any fallacious claims as such. That way, if he's too stressed out to realize that an argument offered by one of his colleagues is actually wrong, the software will work as backup to warn him.

Unfortunately, it's not easy to define what counts as a fallacy. At first, Dr. Zany tried looking at the logical form of various claims. An early example that he considered was ”ghosts exist because no one has proved that they do not”, which felt clearly wrong, an instance of the argument from ignorance. But when he programmed his software to warn him about sentences like that, it ended up flagging the claim ”this drug is safe, because we have no evidence that it is not”. Hmm. That claim felt somewhat weak, but it didn't feel obviously wrong the way that the ghost argument did. Yet they shared the same structure. What was the difference?

The argument from ignorance

Related posts: Absence of Evidence is Evidence of Absence, But Somebody Would Have Noticed!

One kind of argument from ignorance is based on negative evidence. It assumes that if the hypothesis of interest were true, then experiments made to test it would show positive results. If a drug were toxic, tests of toxicity of reveal this. Whether or not this argument is valid depends on whether the tests would indeed show positive results, and with what probability.

With some thought and help from AS-01, Dr. Zany identified three intuitions about this kind of reasoning.

1. Prior beliefs influence whether or not the argument is accepted.

A) I've often drunk alcohol, and never gotten drunk. Therefore alcohol doesn't cause intoxication.

B) I've often taken Acme Flu Medicine, and never gotten any side effects. Therefore Acme Flu Medicine doesn't cause any side effects.

Both of these are examples of the argument from ignorance, and both seem fallacious. But B seems much more compelling than A, since we know that alcohol causes intoxication, while we also know that not all kinds of medicine have side effects.

2. The more evidence found that is compatible with the conclusions of these arguments, the more acceptable they seem to be.

C) Acme Flu Medicine is not toxic because no toxic effects were observed in 50 tests.

D) Acme Flu Medicine is not toxic because no toxic effects were observed in 1 test.

C seems more compelling than D.

3. Negative arguments are acceptable, but they are generally less acceptable than positive arguments.

E) Acme Flu Medicine is toxic because a toxic effect was observed (positive argument)

F) Acme Flu Medicine is not toxic because no toxic effect was observed (negative argument, the argument from ignorance)

Argument E seems more convincing than argument F, but F is somewhat convincing as well.

"Aha!" Dr. Zany exclaims. "These three intuitions share a common origin! They bear the signatures of Bayonet reasoning!"

"Bayesian reasoning", AS-01 politely corrects.

"Yes, Bayesian! But, hmm. Exactly how are they Bayesian?"

continue reading »

The Ellsberg paradox and money pumps

10 fool 28 January 2012 05:34PM

Followup to: The Savage theorem and the Ellsberg paradox

In the previous post, I presented a simple version of Savage's theorem, and I introduced the Ellsberg paradox. At the end of the post, I mentioned a strong Bayesian thesis, which can be summarised: "There is always a price to pay for leaving the Bayesian Way."1 But not always, it turns out. I claimed that there was a method that is Ellsberg-paradoxical, therefore non-Bayesian, but can't be money-pumped (or "Dutch booked"). I will present the method in this post.

I'm afraid this is another long post. There's a short summary of the method at the very end, if you want to skip the jibba jabba and get right to the specification. Before trying to money-pump it, I'd suggest reading at least the two highlighted dialogues.

Ambiguity aversion

To recap the Ellsberg paradox: there's an urn with 30 red balls and 60 other balls that are either green or blue, in unknown proportions. Most people, when asked to choose between betting on red or on green, choose red, but, when asked to choose between betting on red-or-blue or on green-or-blue, choose green-or-blue. For some people this behaviour persists even after due calculation and reflection. This behaviour is non-Bayesian, and is the prototypical example of ambiguity aversion.

There were some major themes that came out in the comments on that post. One theme was that I Fail Technical Writing Forever. I'll try to redeem myself.

Another theme was that the setup given may be a bit too symmetrical. The Bayesian answer would be indifference, and really, you can break ties however you want. However the paradoxical preferences are typically strict, rather than just tie-breaking behaviour. (And when it's not strict, we shouldn't call it ambiguity aversion.) One suggestion was to add or remove a couple of red balls. Speaking for myself, I would still make the paradoxical choices.

A third theme was that ambiguity aversion might be a good heuristic if betting against someone who may know something you don't. Now, no such opponent was specified, and speaking for myself, I'm not inferring one when I make the paradoxical choices. Still, let me admit that it's not contrived to infer a mischievous experimenter from the Ellsberg setup. One commentator puts it better than me:

Betting generally includes an adversary who wants you to lose money so they win in. Possibly in psychology experiments [this might not apply] ... But generally, ignoring the possibility of someone wanting to win money off you when they offer you a bet is a bad idea.

Now betting is supposed to be a metaphor for options with possibly unknown results. In which case sometimes you still need to account for the possibility that the options were made available by an adversary who wants you to choose badly, but less often. And you should also account for the possibility that they were from other people who wanted you to choose well, or that the options were not determined by any intelligent being or process trying to predict your choices, so you don't need to account for an anticorrelation between your choice and the best choice. Except for your own biases.

We can take betting on the Ellsberg urn as a stand-in for various decisions under ambiguous circumstances. Ambiguity aversion can be Bayesian if we assume the right sort of correlation between the options offered and the state of the world, or the right sort of correlation between the choice made and the state of the world. In that case just about anything can Bayesian. But sometimes the opponent will not have extra information, nor extra power. There might not even be any opponent as such. If we assume there are no such correlations, then ambiguity aversion is non-Bayesian.

The final theme was: so what? Ambiguity aversion is just another cognitive bias. One commentator specifically complained that I spent too much time talking about various abstractions and not enough time talking about how ambiguity aversion could be money-pumped. I will fix that now: I claim that ambiguity aversion cannot be money-pumped, and the rest of this post is about my claim.

I'll start with a bit of name-dropping and some whig history, to make myself sound more credible than I really am2. In the last twenty years or so many models of ambiguity averse reasoning have been constructed. Choquet expected utility3 and maxmin expected utility4 were early proposed models of ambiguity aversion. Later multiplier preferences5 were the result of applying the ideas of robust control to macroeconomic models. This results in ambiguity aversion, though it was not explicitly motivated by the Ellsberg paradox. More recently, variational preferences6 generalises both multiplier preferences and maxmin expected utility. What I'm going to present is a finitary case of variational preferences, with some of my own amateur mathematical fiddling for rhetorical purposes.

Probability intervals

The starting idea is simple enough, and may have already occurred to some LW readers. Instead of using a prior probability for events, can we not use an interval of probabilities? What should our betting behaviour be for an event with probability 50%, plus or minus 10%?

There are some different ways of filling in the details. So to be quite clear, I'm not proposing the following as the One True Probability Theory, and I am not claiming that the following is descriptive of many people's behaviour. What follows is just one way of making ambiguity aversion work, and perhaps the simplest way. This makes sense, given my aim: I should just describe a simple method that leaves the Bayesian Way, but does not pay.

Now, sometimes disjoint ambiguous events together make an event with known probability. Or even a certainty, as in an event and its negation. If we want probability intervals to be additive (and let's say that we do) then what we really want are oriented intervals. I'll use +- or -+ (pronounced: plus-or-minus, minus-or-plus) to indicate two opposite orientations. So, if P(X) = 1/2 +- 1/10, then P(not X) = 1/2 -+ 1/10, and these add up to 1 exactly.

Such oriented intervals are equivalent to ordered pairs of numbers. Sometimes it's more helpful to think of them as oriented intervals, but sometimes it's more helpful to think of them as pairs. So 1/2 +- 1/10 is the pair (3/5,2/5). And 1/2 -+ 1/10 is (2/5,3/5), the same numbers in the opposite order. The sum of these is (1,1), which is 1 exactly.

You may wonder, if we can use ordered pairs, can we use triples, or longer lists? Yes, this method can be made to work with those too. And we can still think in terms of centre, length, and orientation. The orientation can go off in all sorts of directions, instead of just two. But for my purposes, I'll just stick with two.

You might also ask, can we set P(X) = 1/2 +- 1/2? No, this method just won't handle it. A restriction of this method is that neither of the pair can be 0 or 1, except when they're both 0 or both 1. The way we will be using these intervals, 1/2 +- 1/2 would be the extreme case of ambiguity aversion. 1/2 +- 1/10 represents a lesser amount of ambiguity aversion, a sort of compromise between worst-case and average-case behaviour.

To decide among bets (having the same two outcomes), compute their probability intervals. Sometimes, the intervals will not overlap. Then it's unambiguous which is more likely, so it's clear what to pick. In general, whether they overlap or not, pick the one with the largest minimum -- though we will see there are three caveats when they do overlap. If P(X) = 1/2 +- 1/10, we would be indifferent between a bet on X and on not X: the minimum is 2/5 in either case. If P(Y) = 1/2 exactly, then we would strictly prefer a bet on Y to a bet on X.

Which leads to the first caveat: sometimes, given two options, it's strictly better to randomise. Let's suppose Y represents a fair coin. So P(Y) = 1/2 exactly, as we said. But also, Y is independent of X. P(X and Y) = 1/4 +- 1/20, and so on. This means that P((X and not Y) or (Y and not X)) = 1/2 exactly also. So we're indifferent between a bet on X and a bet on not X, but we strictly prefer the randomised bet.

In general, randomisation will be strictly better if you have two choices with overlapping intervals of opposite orientations. The best randomisation ratio will be the one that gives a bet with zero-length interval.

Now let us reconsider the Ellsberg urn. We did say the urn can be a metaphor for various situations. Generally these situations will not be symmetrical. But, even in symmetrical scenarios, we should still re-think how we apply the principle of indifference. I argue that the underlying idea is really this: if our information has a symmetry, then our decisions should have that same symmetry. If we switch green and blue, our information about the Ellsberg urn doesn't change. The situation is indistinguishable, so we should behave the same way. It follows that we should be indifferent between a bet on green and a bet on blue. Then, for the Bayesian, it follows that P(red) = P(green) = P(blue) = 1/3. Period.

But for us, there is a degree of freedom, even in this symmetrical situation. We know what the probability of red is, so of course P(red) = 1/3 exactly. But we can set, say7, P(green) = 1/3 +- 1/9, and P(blue) = 1/3 -+ 1/9. So we get P(red or green) = 2/3 +- 1/9, P(red or blue) = 2/3 -+ 1/9, P(green or blue) = 2/3 exactly, and of course P(red or green or blue) = 1 exactly.

So: red is 1/3 exactly, but the minimum of green is 2/9. (green or blue) is 2/3 exactly, but the minimum of (red or blue) is 5/9. So choose red over green, and (green or blue) over (red or blue). That's the paradoxical behaviour. Note that neither pair of choices offered in the Ellsberg paradox has the type of overlap that favours randomisation.

Once we have a decision procedure for the two-outcome case, then we can tack on any utility function, as I explained in the previous post. The result here is what you would expect: we get oriented expected utility intervals, obtained by multiplying the oriented probability intervals by the utility. When deciding, we pick the one whose interval has the largest minimum. So for example, a bet which pays 15U on red (using U for "utils", the abstract units of measurement of the utility function) has expected utility 5U exactly. A bet which pays 18U on green has expected utility 6U +- 2U, the minimum is 4U. So pick the bet on red over that.

Operationally, probability is associated with the "fair price" at which we are willing to bet. A probability interval indicates that there is no fair price. Instead we have a spread: we buy bets at their low price and sell at their high price. At least, we do that if we have no outstanding bets, or more generally, if the expected utility interval on our outstanding bets has zero-length. The second caveat is that if this interval has length, then it affects our price: we also sell bets of the same orientation at their low price, and buy bets of the opposite orientation at their high price, until the length of this interval is used up. The midpoint of the expected utility interval on our outstanding bets will be irrelevant.

This can be confusing, so it's time for an analogy.


If you are Bayesian and risk-neutral (and if bets pay in "utils" rather than cash, you are risk-neutral by definition) then outstanding bets have no effect on further betting behaviour. However, if you are risk-averse, as is the most common case, then this is no longer true. The more money you've already got on the line, the less willing you will be to bet.

But besides risk attitude, there could also be interference effects from non-monetary payouts. For example, if you are dealing in boots, then you wouldn't buy a single boot for half the price of a pair, and neither would you sell one of your boots for half the price of a pair. Unless you happened to already have unmatched boots, then you would sell those at a lower price, or buy boots of the opposite orientation at a higher price, until you had no more unmatched boots. If you were otherwise risk-neutral with respect to boots, then your behaviour would not depend on the number of pairs you have, just on the number and orientation of your unmatched boots.

This closely resembles the non-Bayesian behaviour above. In fact, for the Ellsberg urn, we could just say that a bet on red is worth a pair of boots, a bet on green is worth two left boots, and a bet on blue is worth two right boots. Without saying anything further, it's clear that we would strictly prefer red (a pair) over green (two lefts), but we would also strictly prefer green-or-blue (two pairs) over red-or-blue (one left and three rights). That's the paradoxical behaviour, but you know you can't money-pump boots.

A: I'll buy that pair of boots for 30 zorkmids.
B: Okay, here's your pair of boots.
A: And here's your 30 zorkmids. Thank you.
B: Thank you. Say, didn't you just buy an identical pair this morning?
A: Yeah, I did. Then a dingo ate the right one. I've got the left one here. Never worn.
B: How narratively convenient! How much would you sell it for?
A: Hmm, how about 10 zorkmids?
B: Really, 10 zorkmids? So, do you think right boots are more valuable than left boots?
A: No, of course not. Why?
B: Arbitrage!
A: Gesundheit.
B: Thanks. I'll buy a left boot from you for 10 zorkmids.
A: Great! Here's your left boot.
B: And here's your 10 zorkmids. Thank you.
A: Thank you!
B: And I'll buy a right boot from you for 10 zorkmids.
A: Errrm... Sorry? Why would I agree to that?
B: You just sold me a left boot for 10 zorkmids. Well, you yourself said rights aren't more valuable than lefts. So, logically, you should be willing to sell me a right boot for 10 zorkmids.
A: What? No.

Boots' rule

So much for the static case. But what do we do with new information? How do we handle conditional probabilities?

We still get P(A|B) by dividing P(A and B) by P(B). It will be easier to think in terms of pairs here. So for example P(red) = 1/3 exactly = (1/3,1/3) and P(red or green) = 2/3 +- 1/9 = (7/9,5/9), so P(red|red or green) = (3/7,3/5) = 18/35 -+ 3/35. And similarly P(green|red or green) = (1/3 +- 1/9)/(2/3 +- 1/9) = 17/35 +- 3/35.

This rule covers the dynamic passive case, where we update probabilities based on what we observe, before betting. The third and final caveat is in the active case, when information comes in between bets. Now, we saw that the length and orientation of the interval on expected utility of outstanding bets affects further betting behaviour. There is actually a separate update rule for this quantity. It is about as simple as it gets: do nothing. The interval can change when we make choices, and its midpoint can shift due to external events, but its length and orientation do not update.

You might expect the update rule for this quantity to follow from the way the expected utility updates, which follows from the way probability updates. But it has a mind of its own. So even if we are keeping track of our bets, we'd still need to keep track of this extra variable separately.

Sometimes it may be easier to think in terms of the total expected utility interval of our outstanding bets, but sometimes it may be easier to think of this in terms of having a "virtual" interval that cancels the change in the length and orientation of the "real" expected utility interval. The midpoint of this virtual interval is irrelevant and can be taken to always be zero. So, on update, compute the prior expected utility interval of outstanding bets, subtract the posterior expected utility interval from it, and add this difference to the virtual interval. Reset its midpoint to zero, keeping only the length and orientation.

That can also be confusing, so let's have another analogy.

Yo' mama's so illogical...

I recently came across this example by Mark Machina:

M: Children, I only have one treat, I can only give it to one of you.
I: Me, mama!
J: No, give it to me!
M: No. Rather than give it to either of you, it's better if I toss a coin. Heads, it goes to Irina, tails, it goes to Joey.
M: Heads. Irina gets it.
J: But mama!
M: Fair is fair.
I: Yeah Joey!
J: But mama, you yourself said it's better to toss a coin than to give it to either of us. So, logically, instead of giving it to Irina you should toss a coin again.
M: Nice try, Joey.

Instead of giving the treat to either child, she strictly prefers to toss a coin and give the treat to the winner. But after the coin is tossed, she strictly prefers to give the treat to the winner rather than toss again.

This cannot be explained in terms of maximising expected utility, in the typical sense of "utility". And of course only known probabilities are involved here, so there's no question as to whether her beliefs are probabilistically sophisticated or not. But it could be said that she is still maximising the expected value of an extended objective function. This extended objective function does not just consider who gets a treat, but also considers who "had a fair chance". She is unfair if she gives the treat to either child outright, but fair if she tosses a coin. That fairness doesn't go away when the result of the coin toss is known.

Or something like that. There are surely other ways of dissecting the mother's behaviour. But no matter what, it's going to have to take the coin toss into account, even though the coin, in and of itself, has no relevance to the situation.

Let's go back to the urn. Green and blue have the type of overlap that favours randomisation: P((green and heads) or (blue and tails)) = 1/3 exactly. A bet paying 9U on this event has expected utility of 3U exactly. Let's say we took this bet. Now say the coin comes up heads. We can update the probabilities as per above. The answer is that P(green) = 1/3 +- 1/9 as it was before. That makes sense because it's an independent event: knowing the result of the coin toss gives no information about the urn. The difference is that we now have an outstanding bet that pays 9U if the ball is green. The expected utility would therefore be 3U +- 1U. Except, the expected utility interval was zero-length before the coin was tossed, so it remains zero-length. Equivalently, the virtual interval becomes -+ 1U, so that the effective total is 3U exactly. (In this example, the midpoint of the expected utility interval didn't change either. That's not generally the case.) A bet randomised on a new coin toss would have expected utility 3U, plus the virtual interval of -+ 1U, for an effective total of 3U -+ 1U. So we would strictly prefer to keep the bet on green rather than re-randomise.

Let's compare this with a trivial example: let's say we took a bet that pays 9U if the ball drawn from the urn is green. The expected utility of this bet is 3U +- 1U. For some unrelated reason, a coin is tossed, and it comes up heads. The coin has also nothing to do with the urn or my bet. I still have a bet of 9U on green, and its expected utility is still 3U +- 1U.

But the difference between these two examples is just in the counterfactual: if the coin had come up tails, in the first example I would have had a bet of 9U on blue, and in the second example I would have had a bet of 9U on green. But the coin came up heads, and in both examples I end up with a bet of 9U on green. The virtual interval has some spooky dependency on what could have happened, just like "had a fair chance". It is the ghost of a departed bet.

I expect many on LW are wondering what happened. There was supposed to be a proof that anything that isn't Bayesian can be punished. Actually, this threat comes with some hidden assumptions, which I hope these analogies have helped to illustrate. A boot is an example of something which has no fair price, even if a pair of boots has one. A mother with two children and one treat is an example where some counterfactuals are not forgotten. The hidden assumptions fail in our case, just as they can fail in these other contexts where Bayesianism is not at issue. This can be stated more rigorously8, but that is basically how it's possible. Now We Know. And Knowing is Half the Battle.


  1. Taken almost verbatim from Eliezer Yudkowsky's post on the Allais paradox.
  2. And footnotes pointing to some tangentially relevant journal articles make me sound extra credible.
  3. For Choquet expected utility see: D. Schmeidler, Subjective probability and expected utility without additivity, Econometrica 57 (1989) pp 571-587.
  4. For maxmin expected utility see: I. Gilboa and D. Schmeidler, Maxmin expected utility with a non-unique prior, J. Math. Econ. 18 (1989) pp 141-153.
  5. For multiplier preferences see: L.P. Hansen and T.J. Sargeant, Robust control and model uncertainty, Amer. Econ. Rev. 91 (2001) pp 60-66.
  6. For variational preferences see: F. Maccheroni, M. Marinacci, and A. Rustichini, Dynamic variational preferences, J. Econ. Theory 128 (2006) pp 4-44.
  7. Any length between 0 and 1/3 works. But here's where I pulled 1/9 from: a Bayesian might assign exactly 1/61 prior probability to the 61 possible urn compositions, and the result is roughly approximated by the Laplacian rule of succession, which prescribes a pseudocount of one green and one blue ball. A similar thing with probability intervals is roughly approximated by using a pseudocount of 3/2 +- 1/2 green and 3/2 -+ 1/2 blue balls.
  8. To quickly relate this back to Savage's rules: rules 1 and 3 guarantee that there's no static money pump. Rule 2 then is supposed to guarantee that there is no dynamic money pump. But it is stronger than necessary for that purpose. I claim that this method obeys rules 1, 3, and a weaker version of rule 2, and that it is dynamically consistent. For dynamic consistency of variational preferences in general, see footnotes above. This method is a special case, for which I wrote up a simpler proof.

Appendix A: method summary

  • Events are assigned a pair of prior probabilities, which can also be thought of as an oriented probability interval. e.g. (3/5,2/5) can also be thought of as 1/2 +- 1/10.
  • Neither side of the pair can be 0 or 1, except when they're both 0 or both 1.
  • Each side of the pair is additive: if A and B are disjoint, and P(A) = (x,y), and P(B) = (u,v), then P(A or B) = (x+u,y+v).
  • Each side of the pair updates by Bayes' rule: if P(A and B) = (x,y), and P(B) = (u,v), then P(A|B) = (x/u,y/v).
  • Given a utility function, each bet will then have an expected utility interval: multiply the probability intervals by the utility for each possible outcome.
  • There is also a virtual expected utility interval to keep track of. The midpoint of this interval is always zero.
  • When updating the virtual expected utility interval, compute the prior expected utility interval of the outstanding bet(s), subtract the posterior expected utility interval from it, and add this difference to the virtual expected utility interval. Throw away the midpoint (reset the midpoint of the interval to zero, keeping just the length and orientation).
  • To decide among bets: compute the expected utility intervals of each of them -- including already outstanding bets, and including the virtual expected utility interval. Rank them according to the minimum values of the intervals.
  • Implicitly when presented with options we are also presented with the option to randomise among them, and sometimes this is strictly better than any of the pure options.

Appendix B: obligatory image for LW posts on this topic

All your Bayes are belong to us

The Savage theorem and the Ellsberg paradox

13 fool 14 January 2012 07:06PM

Followup to: A summary of Savage's foundation for probability and utility.

In 1961, Daniel Ellsberg, most famous for leaking the Pentagon Papers, published the decision-theoretic paradox which is now named after him 1. It is a cousin to the Allais paradox. They both involve violations of an independence or separability principle. But they go off in different directions: one is a violation of expected utility, while the other is a violation of subjective probability. The Allais paradox has been discussed on LW before, but when I do a search it seems that the first discussion of the Ellsberg paradox on LW was my comments on the previous post 2. It seems to me that from a Bayesian point of view, the Ellsberg paradox is the greater evil.

But I should first explain what I mean by a violation of expected utility versus subjective probability, and for that matter, what I mean by Bayesian. I will explain a special case of Savage's representation theorem, which focuses on the subjective probability side only. Then I will describe Ellsberg's paradox. In the next episode, I will give an example of how not to be Bayesian. If I don't get voted off the island at the end of this episode.

Rationality and Bayesianism

Bayesianism is often taken to involve the maximisation of expected utility with respect to a subjective probability distribution. I would argue this label only sticks to the subjective probability side. But mainly, I wish to make a clear division between the two sides, so I can focus on one.

Subjective probability and expected utility are certainly related, but they're still independent. You could be perfectly willing and able to assign belief numbers to all possible events as if they were probabilities. That is, your belief assignment obeys all the laws of probability, including Bayes' rule, which is, after all, what the -ism is named for. You could do all that, but still maximise something other than expected utility. In particular, you could combine subjective probabilities with prospect theory, which has also been discussed on LW before. In that case you may display Allais-paradoxical behaviour but, as we will see, not Ellsberg-paradoxical behaviour. The rationalists might excommunicate you, but it seems to me you should keep your Bayesianist card.

On the other hand your behaviour could be incompatible with any subjective probability distribution. But you could still maximise utility with respect to something other than subjective probability. In particular, when faced with known probabilities, you would be maximising expected utility in the normal sense. So you can not exhibit any Allais-paradoxical behaviour, because the Allais paradox involves only objective lotteries. But you may exhibit, as we will see, Ellsberg-paradoxical behaviour. I would say you are not Bayesian.

So a non-Bayesian, even the strictest frequentist, can still be an expected utility maximiser, and a perfect Bayesian need not be an expected utility maximiser. What I'm calling Bayesianist is just the idea that we should reason with our subjective beliefs the same way that we reason with objective probabilities. This also has been called having "probabilistically sophisticated" beliefs, if you prefer to avoid the B-word, or don't like the way I'm using it.

In a lot of what follows, I will bypass utility by only considering two outcomes. Utility functions are only unique up to a constant offset and a positive scale factor. With two outcomes, they evaporate entirely. The question of maximising expected utility with respect to a subjective probability distribution reduces to the question of maximising the probability, according to that distribution, of getting the better of the two outcomes. (And if the two outcomes are equal, there is nothing to maximise.)

And on the flip side, if we have a decision method for the two-outcome case, Bayesian or otherwise, then we can always tack on a utility function. The idea of utility is just that any intermediate outcome is equivalent to an objective lottery between better and worse outcomes. So if we want, we can use a utility function to reduce a decision problem with any (finite) number of outcomes to a decision problem over the best and worst outcomes in question.

Savage's representation theorem

Let me recap some of the previous post on Savage's theorem. How might we defend Bayesianism? We could invoke Cox's theorem. This starts by assuming possible events can be assigned real numbers corresponding to some sort of belief level on someone's part, and that there are certain functions over these numbers corresponding to logical operations. It can be proven that, if someone's belief functions obey some simple rules, then that person acts as if they were reasoning with subjective probability. Now, while the rules for belief functions are intuitive, the background assumptions are pretty sketchy. It is not at all clear why these mathematical constructs are requirements of rationality.

One way to justify those constructs is to argue in terms of choices a rational person must make. We imagine someone is presented with choices among various bets on uncertain events. Their level of belief in these events can be gauged by which bets they choose. But if we're going to do that anyway, then, as it turns out, we can just give some simple rules directly about these choices, and bypass the belief functions entirely. This was Leonard Savage's approach 3. To quote a comment on the previous post: "This is important because agents in general don't have to use beliefs or goals, but they do all have to choose actions."

Savage's approach actually covers both subjective probability and expected utility. The previous post discusses both, whereas I am focusing on the former. This lets me give a shorter exposition, and I think a clearer one.

We start by assuming some abstract collection of possible bets. We suppose that when you are offered two bets from this collection, you will choose one over the other, or express indifference.

As discussed, we will only consider two outcomes. So all bets have the same payout, the difference among them is just their winning conditions. It is not specified what it is that you win. But it is assumed that, given the choice between winning unconditionally and losing unconditionally, you would choose to win.

It is assumed that the collection of bets form what is called a boolean algebra. This just means we can consider combinations of bets under boolean operators like "and", "or", or "not". Here I will use brackets to indicate these combinations. (A or B) is a bet that wins under the conditions that make either A win, or B win, or both win. (A but not B) wins whenever A wins but B doesn't. And so on.

If you are rational, your choices must, it is claimed, obey some simple rules. If so, it can be proven that you are choosing as if you had a assigned subjective probabilities to bets. Savage's axioms for choosing among bets are 4:

  1. If you choose A over B, you shall not choose B over A; and, if you do not choose A over B, and do not choose B over C, you shall not choose A over C.
  2. If you choose A over B, you shall also choose (A but not B) over (B but not A); and conversely, if you choose (A but not B) over (B but not A), you shall also choose A over B.
  3. You shall not choose A over (A or B).
  4. If you choose A over B, then you shall be able to specify a finite sequence of bets C1, C2, ..., Cn, such that it is guaranteed that one and only one of the C's will win, and such that, for any one of the C's, you shall still choose (A but not C) over (B or C).

Rule 1 is a coherence requirement on rational choice. It is requires your preferences to be a total pre-order. One objection to Cox's theorem is that levels of belief could be incomparable. This objection does not apply to rule 1 in this context because, as we discussed above, we're talking about choices of bets, not beliefs. Faced with choices, we choose. A rational person's choices must be non-circular.

Rule 2 is an independence requirement. It demands that when you compare two bets, you ignore the possibilty that they could both win. In those circumstances you would be indifferent between the two anyway. The only possibilities that are relevant to the comparison are the ones where one bet wins and the other doesn't. So, you ought to compare A to B the same way you compare (A but not B) to (B but not A). Savage called this rule the Sure-thing principle.

Rule 3 is a dominance requirement on rational choice. It demands that you not choose something that cannot do better under any circumstance: whenever A would win, so would (A or B). Note that you might judge (B but not A) to be impossible a priori. So, you might legitimately express indifference between A and (A or B). We can only say it is never legitimate to choose A over (A or B).

Rule 4 is the most complicated. Luckily it's not going to be relevant to the Ellsberg paradox. Call it Mostly Harmless and forget this bit if you want.

What rule 4 says is that if you choose A over B, you must be willing to pay a premium for your choice. Now, we said there are only two outcomes in this context. Here, the premium is paid in terms of other bets. Rule 4 demands that you give a finite list of mutually exclusive and exhaustive events, and still be willing to choose A over B if we take any event on your list, cut it from A, and paste it to B. You can list as many events as you need to, but it must be a finite list.

For example, if you thought A was much more likely than B, you might pull out a die, and list the 6 possible outcomes of one roll. You would also be willing to choose (A but not a roll of 1) over (B or a roll of 1), (A but not a roll of 2) over (B or a roll of 2), and so on. If not, you might list the 36 possible outcomes of two consecutive rolls, and be willing to choose (A but not two rolls of 1) over (B or two rolls of 1), and so on. You could go to any finite number of rolls.

In fact rule 4 is pretty liberal, it doesn't even demand that every event on your list be equiprobable, or even independent of the A and B in question. It just demands that the events be mutually exclusive and exhaustive. If you are not willing to specify some such list of events, then you ought to express indifference between A and B.

If you obey rules 1-3, then that is sufficient for us construct a sort of qualitative subjective probability out of your choices. It might not be quantitative: for one thing, there could be infinitessimally likely beliefs. Another thing is that there might be more than one way to assign numbers to beliefs. Rule 4 takes care of these things. If you obey rule 4 also, then we can assign a subjective probability to every possible bet, prove that you choose among bets as if you were using those probabilities, and also prove that it is the only probability assignment that matches your choices. And, on the flip side, if you are choosing among bets based on a subjective probability assignment, then it is easy to prove you obey rules 1-3, as well as rule 4 if the collection of bets is suitably infinite, like if a fair die is avaialble to bet on.

Savage's theorem is impressive. The background assumptions involve just the concept of choice, and no numbers at all. There are only a few simple rules. Even rule 4 isn't really all that hard to understand and accept. A subjective probability distribution appears seemingly out of nowhere. In the full version, a utility function appears out of nowhere too. This theorem has been called the crowning glory of decision theory.

The Ellsberg paradox

Let's imagine there is an urn containing 90 balls. 30 of them are red, and the other 60 are either green or blue, in unknown proportion. We will draw a ball from the urn at random. Let us bet on the colour of this ball. As above, all bets have the same payout. To be specific, let's say you get pie if you win, and a boot to the head if you lose. The first question is: do you prefer to bet that the colour will be red, or that it will be green? The second question is: do you prefer to bet that it will be (red or blue), or that it will be (green or blue)?

The most common response5 is to choose red over green, and (green or blue) over (red or blue). And that's all there is to it. Paradox! 6

  30 60
Red Green Blue

A pie BOOT BOOT   A is preferred to B

C pie BOOT pie   D is preferred to C
D BOOT pie pie


If choices were based solely on an assignment of subjective probability, then because the three colours are mutually exclusive, P(red or blue) = P(red) + P(blue), and P(green or blue) = P(green) + P(blue). So, since P(red) > P(green) then P (red or blue) > P(green or blue), but instead we have P(red or blue) < P(green or blue).

Knowing Savage's representation theorem, we expect to get a formal contradiction from the 4 rules above plus the 2 expressed choices. Something has to give, so we'd like to know which rules are really involved. You can see that we are talking only about rule 2, the Sure-thing principle. It says we shall compare (red or blue) to (green or blue) the same way as we compare red to green.

This behaviour has been called ambiguity aversion. Now, perhaps this is just a cognitive bias. It wouldn't be the first time that people behave a certain way, but the analysis of their decisions shows a clear error. And indeed, when explained, some people do repent of their sins against Bayes. They change their choices to obey rule 2. But others don't. To quote Ellsberg:

...after rethinking all their 'offending' decisions in light of [Savage's] axioms, a number of people who are not only sophisticated but reasonable decide that they wish to persist in their choices. This includes people who previously felt a 'first order commitment' to the axioms, many of them surprised and some dismayed to find that they wished, in these situations, to violate the Sure-thing Principle. Since this group included L.J. Savage, when last tested by me (I have been reluctant to try him again), it seems to deserve respectful consideration.

I include myself in the group that thinks rule 2 is what should be dropped. But I don't have any dramatic (de-)conversion story to tell. I was somewhat surprised, but not at all dismayed, and I can't say I felt much if any prior commitment to the rules. And as to whether I'm sophisticated or reasonable, well never mind! Even if there are a number of other people who are all of the above, and even if Savage himself may have been one of them for a while, I do realise that smart people can be Just Plain Wrong. So I'd better have something more to say for myself.

Well, red obviously has a probability of 1/3. Our best guess is to apply the principle of indifference to also assign probability 1/3 to green or blue. But our best guess is not necessarily a good guess. The probabilities we assign to red, and to (green or blue), are objective. We're guessing the probability of green, and of (red or blue). It seems wise to take this difference into account when choosing what to bet on, doesn't it? And surely it will be all the more wise when dealing with real-life, non- symetrical situations where we can't even appeal to the principle of indifference.

Or maybe I'm just some fool talking jibba jabba. Against this sort of talk, the LW post on the Allais paradox presents a version of Howard Raiffa's dynamic inconsistency argument. This makes no references to internal thought processes, it is a purely external argument about the decisions themselves. As stated in that post, "There is always a price to pay for leaving the Bayesian Way." 7 This is expanded upon in an earlier post:

Sometimes you must seek an approximation; often, indeed. This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.

Bayesianism's coherence and uniqueness proofs cut both ways ... anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).

Now even if you believe this about the Allais paradox, I've argued that this doesn't really have much to do with Bayesianism one way or the other. The Ellsberg paradox is what actually strays from the Path. So, does God also punish ambiguity aversion?

Tune in next time8, when I present a two-outcome decision method that obeys rules 1, 3, and 4, and even a weaker form of rule 2. But it exhibits ambiguity aversion, in gross violation of the original rule 2, so that it's not even approximately Bayesian. I will try to present it in a way that advocates for its internal cognitive merit. But the main thing 9 is that, externally, it is dynamically consistent. We do not get booked, by the Dutch or any other nationality.



  1. Ellsberg's original paper is: Risk, ambiguity, and the Savage axioms, Quarterly Journal of Economics 75 (1961) pp 643-669
  2. Some discussion followed, in which I did rather poorly. Actually I had to admit defeat. Twice. But, as they say: fool me once, shame on me; fool me twice, won't get fooled again!
  3. Savage presents his theorem in his book: The Foundations of Statistics, Wiley, New York, 1954.
  4. To compare to Savage's setup: for the two outcome case, we deal directly with "actions" or equivalently "events", here called "bets". We can dispense with "states"; in particular we don't have to demand that the collection of bets be countably complete, or even a power-set algebra of states, just that it be some boolean algebra. Savage's axioms of course have a descriptive interpretation, but it is their normativity that is at issue here, so I state them as "you shall". Rules 1-3 are his P1-P3, and 4 is P6. P4 and P7 are irrelevant in the two- outcome case. P5 is included in the background assumption that you would choose to win. I do not call this normative, because the payoff wasn't specified.
  5. Ellsberg originally proposed this just as a thought experiment, and canvassed various victims for their thoughts under what he called "absolutely non-expiremental conditions". He used $100 and $0 instead of pie and a boot to the head. Which is dull of course, but it shouldn't make a difference10. The experiment has since been repeated under more experimental conditions. The expirementers also invariably opt for the more boring cash payouts.
  6. Some people will say this isn't "really" a paradox. Meh.
  7. Actually, I inserted "to pay". It wasn't in the original post. But it should have been.
  8. Sneak preview
  9. As a great decision theorist once said, "Stupid is as stupid does."
  10. ...or should it? Savage's rule P4 demands that it shall not. And the method I have in mind obeys this rule. But it turns out this is another rule that God won't enforce. And that's yet another post, if I get to it at all.


Poker with Lennier

15 HonoreDB 15 November 2011 10:21PM

In J. Michael Straczynski's science fiction TV show Babylon 5, there's a character named Lennier. He's pretty Spock-like: he's a long-lived alien who avoids displaying emotion and feels superior to humans in intellect and wisdom. He's sworn to always speak the truth. In one episode, he and another character, the corrupt and rakish Ambassador Mollari, are chatting. Mollari is bored. But then Lennier mentions that he's spent decades studying probability. Mollari perks up, and offers to introduce him to this game the humans call poker.

continue reading »

Amanda Knox: post mortem

23 gwern 20 October 2011 04:10PM

Continuing my interest in tracking real-world predictions, I notice that the recent acquittal of Knox & Sollecito offers an interesting opportunity - specifically, many LessWrongers gave probabilities for guilt back in 2009 in komponisto’s 2 articles:

Both were interesting exercises, and it’s time to do a followup. Specifically, there are at least 3 new pieces of evidence to consider:

  1. the failure of any damning or especially relevant evidence to surface in the ~2 years since (see also: the hope function)
  2. the independent experts’ report on the DNA evidence
  3. the freeing of Knox & Sollecito, and continued imprisonment of Rudy Guede (with reduced sentence)

Point 2 particularly struck me (the press attributes much of the acquittal to the expert report, an acquittal I had not expected to succeed), but other people may find the other 2 points or unmentioned news more weighty.

continue reading »

MSF Theory: Another Explanation of Subjectively Objective Probability

14 potato 30 July 2011 07:46PM

Before I read Probability is in the Mind and Probability is Subjectively Objective I was a realist about probabilities; I was a frequentest. After I read them, I was just confused. I couldn't understand how a mind could accurately say the probability of getting a heart in a standard deck of playing cards was not 25%. It wasn't until I tried to explain the contrast between my view and the subjective view in a comment on Probability is Subjectively Objective that I realized I was a subjective Bayesian all along. So, if you've read Probability is in the Mind and read Probability is Subjectively Objective but still feel a little confused, hopefully, this will help.

I should mention that I'm not sure that EY would agree with my view of probability, but the view to be presented agrees with EY's view on at least these propositions:

  • Probability is always in a mind, not in the world.
  • The probability that an agent should ascribe to a proposition is directly related to that agent's knowledge of the world.
  • There is only one correct probability to assign to a proposition given your partial knowledge of the world.
  • If there is no uncertainty, there is no probability.

And any position that holds these propositions is a non-realist-subjective view of probability. 



Imagine a pre-shuffled deck of playing cards and two agents (they don't have to be humans), named "Johnny" and "Sally", which are betting 1 dollar each on the suit of the top card. As everyone knows, 1/4 of the cards in a playing card deck are hearts. We will name this belief F1; F1 stands for "1/4 of the cards in the deck are hearts.". Johnny and Sally both believe F1. F1 is all that Johnny knows about the deck of cards, but sally knows a little bit more about this deck. Sally also knows that 8 of the top 10 cards are hearts. Let F2 stand for "8 out of the 10 top cards are hearts.". Sally believes F2. John doesn't know whether or not F2. F1 and F2 are beliefs about the deck of cards and they are either true or false.

So, sally bets that the top card is a heart and Johnny bets against her, i.e., she puts her money on "Top card is a heart." being true; he puts his money on "~The top card is a heart." being true. After they make their bets, one could imagine Johnny making fun of Sally; he might say something like: "Are you nuts? You know, I have a 75% chance of winning. 1/4 of the cards are hearts; you can't argue with that!" Sally might reply: "Don't forget that the probability you assign to '~The top card is a heart.' depends on what you know about the deck. I think you would agree with me that there is an 80% chance that 'The top card is a heart' if you knew just a bit more about the state of the deck."

To be undecided about a proposition is to not know which possible world you are in; am I in the possible world where that proposition is true, or in the one where it is false? Both Johnny and Sally are undecided about "The top card is a heart."; their model of the world splits at that point of representation. Their knowledge is consistent with being in a possible world where the top card is a heart, or in a possible world where the top card is not a heart. The more statements they decide on, the smaller the configuration space of possible worlds they think they might find themselves in; deciding on a proposition takes a chunk off of that configuration space, and the content of that proposition determines the shape of the eliminated chunk; Sally's and Johnny's beliefs constrain their respective expected experiences, but not all the way to a point. The trick when constraining one's space of viable worlds, is to make sure that the real world is among the possible worlds that satisfy your beliefs. Sally still has the upper hand, because her space of viably possible worlds is smaller than Johnny's. There are many more ways you could arrange a standard deck of playing cards that satisfies F1 than there are ways to arrange a deck of cards that satisfies F1 and F2. To be clear, we don't need to believe that possible worlds actually exist to accept this view of belief; we just need to believe that any agent capable of being undecided about a proposition is also capable of imagining alternative ways the world could consistently turn out to be, i.e., capable of imagining possible worlds.

For convenience, we will say that a possible world W, is viable for an agent A, if and only if, W satisfies A's background knowledge of decided propositions, i.e., A thinks that W might be the world it finds itself in.

Of the possible worlds that satisfy F1, i.e., of the possible worlds where "1/4 of the cards are hearts" is true, 3/4 of them also satisfy "~The top card is a heart." Since Johnny holds that F1, and since he has no further information that might put stronger restrictions on his space of viable worlds, he ascribes a 75% probability to "~The top card is a heart." Sally, however, holds that F2 as well as F1. She knows that of the possible worlds that satisfy F1 only 1/4 of them satisfy "The top card is a heart." But she holds a proposition that constrains her space of viably possible worlds even further, namely F2. Most of the possible worlds that satisfy F1 are eliminated as viable worlds if we hold that F2 as well, because most of the possible worlds that satisfy F1 don't satisfy F2. Of the possible worlds that satisfy F2 exactly 80% of them satisfy "The top card is a heart." So, duh, Sally assigns an 80% probability to "The top card is a heart." They give that proposition different probabilities, and they are both right in assigning their respective probabilities; they don't disagree about how to assign probabilities, they just have different resources for doing so in this case. P(~The top card is a heart|F1) really is 75% and P(The top card is a heart|F2) really is 80%.

This setup makes it clear (to me at least) that the right probability to assign to a proposition depends on what you know. The more you know, i.e., the more you constrain the space of worlds you think you might be in, the more useful the probability you assign. The probability that an agent should ascribe to a proposition is directly related to that agent's knowledge of the world.

This setup also makes it easy to see how an agent can be wrong about the probability it assigns to a proposition given its background knowledge. Imagine a third agent, named "Billy", that has the same information as Sally, but say's that there's a 99% chance of "The top card is a heart." Billy doesn't have any information that further constrains the possible worlds he thinks he might find himself in; he's just wrong about the fraction of possible worlds that satisfy F2 that also satisfy "The top card is a heart.". Of all the possible worlds that satisfy F2 exactly 80% of them satisfy "The top card is a heart.", no more, no less. There is only one correct probability to assign to a proposition given your partial knowledge.

The last benefit of this way of talking I'll mention is that it makes probability's dependence on ignorance clear. We can imagine another agent that knows the truth value of every proposition, lets call him "FSM". There is only one possible world that satisfies all of FSM's background knowledge; the only viable world for FSM is the real world. Of the possible worlds that satisfy FSM's background knowledge, either all of them satisfy "The top card is a heart." or none of them do, since there is only one viable world for FSM. So the only probabilities FSM can assign to "The top card is a heart." are 1 or 0. In fact, those are the only probabilities FSM can assign to any proposition. If there is no uncertainty, there is no probability.

The world knows whether or not any given proposition is true (assuming determinism). The world itself is never uncertain, only the parts of the world that we call agents can be uncertain. Hence, Probability is always in a mind, not in the world. The probabilities that the universe assigns to a proposition are always 1 or 0, for the same reasons FSM only assigns a 1 or 0, and 1 and 0 aren't really probabilities.

In conclusion, I'll risk the hypothesis that: Where 0≤x≤1, "P(a|b)=x" is true, if and only if, of the possible worlds that satisfy "b", x of them also satisfy "a". Probabilities are propositional attitudes, and the probability value (or range of values) you assign to a proposition is representative of the fraction of possible worlds you find viable that satisfy that proposition. You may be wrong about the value of that fraction, and as a result you may be wrong about the probability you assign.

We may call the position summarized by the hypothesis above "Modal Satisfaction Frequency theory", or "MSF theory".

View more: Next