# How Much Evidence Does It Take?

**Followup to:** What is Evidence?

Previously, I defined *evidence* as "an event entangled, by links of cause and effect, with whatever you want to know about", and *entangled* as "happening differently for different possible states of the target". So how much entanglement—how much evidence—is required to support a belief?

Let's start with a question simple enough to be mathematical: how hard would you have to entangle yourself with the lottery in order to win? Suppose there are seventy balls, drawn without replacement, and six numbers to match for the win. Then there are 131,115,985 possible winning combinations, hence a randomly selected ticket would have a 1/131,115,985 probability of winning (0.0000007%). To win the lottery, you would need evidence *selective* enough to visibly favor one combination over 131,115,984 alternatives.
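The count of combinations is easy to check. Here is a minimal sketch using only Python's standard library; the variable name is illustrative and does not come from the original post.

```python
from math import comb

# Ways to choose 6 balls out of 70 when the order of the draw does not matter.
combinations = comb(70, 6)
print(combinations)                    # 131115985

# Chance that a single randomly chosen ticket wins, as a percentage.
print(f"{100 / combinations:.8f}%")    # 0.00000076%
```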

Suppose there are some tests you can perform which discriminate, probabilistically, between winning and losing lottery numbers. For example, you can punch a combination into a little black box that always beeps if the combination is the winner, and has only a 1/4 (25%) chance of beeping if the combination is wrong. In Bayesian terms, we would say the *likelihood ratio* is 4 to 1. This means that the box is 4 times as likely to beep when we punch in a correct combination, compared to how likely it is to beep for an incorrect combination.

There are still a whole lot of possible combinations. If you punch in 20 incorrect combinations, the box will beep on 5 of them by sheer chance (on average). If you punch in all 131,115,985 possible combinations, then while the box is certain to beep for the one winning combination, it will also beep for 32,778,996 losing combinations (on average).

So this box doesn't let you win the lottery, but it's better than nothing. If you used the box, your odds of winning would go from 1 in 131,115,985 to 1 in 32,778,997. You've made some progress toward finding your target, the truth, within the huge space of possibilities.

Suppose you can use another black box to test combinations *twice,* *independently.* Both boxes are certain to beep for the winning ticket. But the chance of a box beeping for a losing combination is 1/4 *independently* for each box; hence the chance of *both* boxes beeping for a losing combination is 1/16. We can say that the *cumulative* evidence, of two independent tests, has a likelihood ratio of 16:1. The number of losing lottery tickets that pass both tests will be (on average) 8,194,749.
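The effect of independence can also be checked by simulation. The sketch below is an assumption-laden toy: `box_beeps` and `pass_rate_for_losers` are hypothetical names, and the box is modelled as nothing more than a 25% coin flip for losing combinations.

```python
import random

def box_beeps(is_winner: bool, rng: random.Random) -> bool:
    """One black box: always beeps for the winner, 25% of the time otherwise."""
    return is_winner or rng.random() < 0.25

def pass_rate_for_losers(boxes: int, trials: int = 200_000, seed: int = 0) -> float:
    """Fraction of losing combinations that make every one of `boxes` boxes beep."""
    rng = random.Random(seed)
    passed = sum(
        all(box_beeps(False, rng) for _ in range(boxes))
        for _ in range(trials)
    )
    return passed / trials

print(pass_rate_for_losers(1))  # close to 1/4
print(pass_rate_for_losers(2))  # close to 1/16
```

Because each box draws its own random number, the pass rates multiply, which is all that "independent" means here.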

Since there are 131,115,985 possible lottery tickets, you might guess that you need evidence whose strength is around 131,115,985 to 1—an event, or series of events, which is 131,115,985 times more likely to happen for a winning combination than a losing combination. Actually, this amount of evidence would only be enough to give you an *even* chance of winning the lottery. Why? Because if you apply a filter of that power to 131 million losing tickets, there will be, on average, one losing ticket that passes the filter. The winning ticket will also pass the filter. So you'll be left with two tickets that passed the filter, only one of them a winner. 50% odds of winning, if you can only buy one ticket.

A better way of viewing the problem: In the beginning, there is 1 winning ticket and 131,115,984 losing tickets, so your odds of winning are 1:131,115,984. If you use a single box, the odds of it beeping are 1 for a winning ticket and 0.25 for a losing ticket. So we multiply 1:131,115,984 by 1:0.25 and get 1:32,778,996. Adding another box of evidence multiplies the odds by 1:0.25 again, so now the odds are 1 winning ticket to 8,194,749 losing tickets.
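The odds bookkeeping in this paragraph can be written out as a short sketch. The function name `odds_after_boxes` is made up for illustration; exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

LOSING_TICKETS = 131_115_984          # 131,115,985 combinations minus the one winner
LIKELIHOOD_RATIO = Fraction(4, 1)     # a beep is 4 times as likely for the winner

def odds_after_boxes(boxes: int) -> Fraction:
    """Odds of winning (winner : losers) after `boxes` independent beeps."""
    prior_odds = Fraction(1, LOSING_TICKETS)
    return prior_odds * LIKELIHOOD_RATIO ** boxes

for boxes in range(3):
    # The reciprocal of the odds is the number of losing tickets per winner.
    print(boxes, 1 / odds_after_boxes(boxes))
# 0 131115984
# 1 32778996
# 2 8194749
```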

It is convenient to measure evidence in bits—not like bits on a hard drive, but mathematician's bits, which are conceptually different. Mathematician's bits are the logarithms, base 1/2, of probabilities. For example, if there are four possible outcomes A, B, C, and D, whose probabilities are 50%, 25%, 12.5%, and 12.5%, and I tell you the outcome was "D", then I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.
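As a small illustration of the four-outcome example, the sketch below computes the bits conveyed by each outcome. The helper `bits` is a hypothetical name; taking the negative base-2 logarithm is the same thing as taking the logarithm base 1/2.

```python
from math import log2

outcomes = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

def bits(probability: float) -> float:
    """Information, in bits, conveyed by learning an outcome of this probability."""
    return -log2(probability)   # equivalent to the logarithm base 1/2

for name, p in outcomes.items():
    print(name, bits(p))
# A 1.0
# B 2.0
# C 3.0
# D 3.0
```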

It so happens that 131,115,984 is slightly less than 2 to the 27th power. So 14 boxes or 28 bits of evidence (an event 268,435,456 times more likely to happen if the ticket-hypothesis is true than if it is false) would shift the odds from 1:131,115,984 to 268,435,456:131,115,984, which reduces to about 2:1. Odds of 2 to 1 mean two chances to win for each chance to lose, so the *probability* of winning with 28 bits of evidence is about 2/3. Adding another box, another 2 bits of evidence, would take the odds to about 8:1. Adding yet another two boxes would take the odds of winning to about 128:1.
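To see how boxes, bits, odds, and probability line up, here is a rough sketch. The helper names are invented for illustration, and the only inputs are figures already given in the post.

```python
from fractions import Fraction
from math import log2

LOSING_TICKETS = 131_115_984   # figure from the post
BITS_PER_BOX = log2(4)         # a 4:1 likelihood ratio is worth 2 bits

def odds_of_winning(boxes: int) -> Fraction:
    """Posterior odds (win : lose) after `boxes` independent beeps."""
    return Fraction(4 ** boxes, LOSING_TICKETS)

def probability(odds: Fraction) -> float:
    """Convert odds in favor into a probability."""
    return float(odds / (1 + odds))

print(log2(LOSING_TICKETS))    # just under 27
for boxes in (14, 15, 17):
    odds = odds_of_winning(boxes)
    print(boxes, boxes * BITS_PER_BOX, float(odds), probability(odds))
```

The exact odds come out slightly above the round figures: roughly 2.05:1 for 14 boxes, 8.2:1 for 15, and 131:1 for 17, because 131,115,984 is a little under 2 to the 27th.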

So if you want to license a *strong belief* that you will win the lottery—arbitrarily defined as less than a 1% probability of being wrong—34 bits of evidence about the winning combination should do the trick.
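The 34-bit figure can be recovered by a simple search. This is only a sketch under the post's assumptions (independent boxes, each worth a 4:1 likelihood ratio); `min_boxes` is a hypothetical helper.

```python
from fractions import Fraction

LOSING_TICKETS = 131_115_984   # figure from the post

def min_boxes(max_error: Fraction) -> int:
    """Smallest number of beeping boxes that pushes P(wrong) below `max_error`."""
    boxes = 0
    while True:
        odds = Fraction(4 ** boxes, LOSING_TICKETS)   # win : lose
        p_wrong = 1 / (1 + odds)
        if p_wrong < max_error:
            return boxes
        boxes += 1

boxes_needed = min_boxes(Fraction(1, 100))
print(boxes_needed, "boxes =", 2 * boxes_needed, "bits")   # 17 boxes = 34 bits
```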

In general, the rules for weighing "how much evidence it takes" follow a similar pattern: The larger the *space of possibilities* in which the hypothesis lies, or the more unlikely the hypothesis seems *a priori* compared to its neighbors, or the more confident you wish to be, the more evidence you need.

You cannot defy the rules; you cannot form accurate beliefs based on inadequate evidence. Let's say you've got 10 boxes lined up in a row, and you start punching combinations into the boxes. You cannot stop on the first combination that gets beeps from all 10 boxes, saying, "But the odds of that happening for a losing combination are a million to one! I'll just ignore those ivory-tower Bayesian rules and stop here." On average, 131 losing tickets will pass such a test for every winner. Considering the space of possibilities and the prior improbability, you jumped to a too-strong conclusion based on insufficient evidence. That's not a pointless bureaucratic regulation, it's math.
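A quick check of the ten-box case, as a sketch using only the post's numbers. The figure of 131 corresponds to rounding the filter to a million to one; with the exact factor of 4 to the 10th (1,048,576) it comes out nearer 125. Either way the conclusion is the same.

```python
LOSING_TICKETS = 131_115_984               # figure from the post

false_pass_rate = 0.25 ** 10               # exactly 1 / 1,048,576
expected_false_passes = LOSING_TICKETS * false_pass_rate

# Posterior probability that a combination passing all ten boxes is the winner.
p_win = 1 / (1 + expected_false_passes)

print(expected_false_passes)               # about 125
print(LOSING_TICKETS / 1_000_000)          # about 131, using "a million to one"
print(p_win)                               # under 1%
```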

Of course, you can still *believe* based on inadequate evidence, if that is your whim; but you will not be able to believe *accurately.* It is like trying to drive your car without any fuel, because you don't believe in the silly-dilly fuddy-duddy concept that it ought to take fuel to go places. It would be so much more *fun,* and so much less expensive, if we just decided to repeal the law that cars need fuel. Isn't it just obviously better for everyone? Well, you can try, if that is your whim. You can even shut your eyes and pretend the car is moving. But to *really* arrive at accurate beliefs requires evidence-fuel, and the further you want to go, the more fuel you need.

Part of the sequence *Map and Territory*

Next post: "Occam's Razor"

Previous post: "What is Evidence?"

## Comments (30)

I'd be happy to buy lots of lottery tickets that had a 1/132 chance of winning, given the typical payoff structure of lotteries of the kind you describe.

To act rationally, it isn't enough to arrive at the correct (probabilities of) beliefs; to act on a belief, the degree of belief you need in it might not be very great.

Given the strong tendency to collapse all degrees of belief into a two-point scale (yea or nay), I suspect that our intuitions about how much one has to believe in something in order to act accordingly are often too stringent, since the actual strengths of our beliefs are so often much too large.

(Note: "often" doesn't mean "always" or even "usually".)

Of course acting on beliefs is a decision theory matter. You don't have terribly much to lose by buying a losing lottery ticket, but you have a very large amount to gain if it wins, so yes 1/132 chance of winning sounds well worth $20 or so.

This also shows why independently replicated scientific experiments (more independent boxes) are more important than experiments with high p-values (boxes with better likelihood ratios).

But the p-values go exponentially close to one with the size of the study. If you had three studies that used 11 boxes, vs. one with 33, you'd get exactly the same posterior probability for the ticket being a winner.

In other words, more experiments are exponentially more valuable than higher p-values, but higher p-values are exponentially cheaper.

Anders, I'm not sure I'd agree with that, because of publication bias. I'd feel much better about a single experiment that reported p < 0.001 than three experiments that reported p < 0.05.

Yes, publication bias matters. But it also applies to the p<0.001 experiment - if we have just a single publication, should we believe that the effect is true and just one group has done the experiment, or that the effect is false and publication bias has prevented the publication of the negative results? If we had a few experiments (even with different results) it would be easier to estimate this than in the one published experiment case.

Let's do a check. Assume a worst case scenario where nobody publishes false results at all.

To get three p < 0.05 studies if the hypothesis is false requires on average 60 experiments. This is a lot but is within the realms of possibility if the issue is one which many people are interested in, so there is still grounds for scepticism of this result.

To get one p < 0.001 study if the hypothesis is false requires on average 1000 experiments. This is pretty implausible, so I would be much happier to treat this result as an indisputable fact, even in a field with many vested interests (assuming everything else about the experiment is sound).

One too many zeros in the p value there. The 1,000 figure matches p<0.001, which is also what Anders mentioned. (So your point is fine.)

Thanks

This is assuming proper methodology and statistics so that the p-value actually matches the chance of the result arising by chance. In practice, since even your best judgment of the methodology is not going to account for certainty in the soundness of the experiment, I would say that a p-value of 0.001 constitutes considerably less than 10 bits of evidence, because the odds that something was wrong with the experiment are better than the odds that the results were coincidental. Multiple experiments with lower cumulative p-value can still be stronger evidence if they all make adjustments to account for possible sources of error.

Running "1000 experiments" if you don't have to publish negative results, can mean just slicing data until you find something. Someone with a large data set can just do this 100% of the time.

A replication is more informative, because it's not subject to nearly as much "find something new and publish it" bias.

Sorry, ignore my erratum above, I was wrong. I mixed up odds and probability, they are different things.

Byrnema hosted an IRC Meeting about this post and I uploaded a transcript of the conversation on the wiki. If this was the wrong place to put the transcript let me know and I will move it.

The conversation went pretty well, in my opinion, and we plan on having a similar one next week.

The lottery is a good example, but the large numbers make it hard to follow the math without a calculator. Is there a simpler example you could add with lower numbers that we can hold in our heads?

Here you say that bits = log(P(E|H)/P(E)). Everywhere else, you used bits = log(P(E|H)/P(E|!H)). They're very different.

Compare to this complaint heard in a fictitious physics classroom: "Now you say joules = 1/2 m v^2. But earlier you said joules = G m1 m2 / r and next you are going to say joules = m c^2. They are very different."

In the example I cited, P(I tell you outcome is D | outcome is D) = 1 and P(I tell you outcome is D | outcome is not D) = 0 (roughly). Thus log(P(E|H)/P(E)) = 3 and log(P(E|H)/P(E|!H)) = infinity. Log is base 2. Probability-bits and Odds-ratio-bits really are very different units, and Eliezer confusingly described them as the same thing. They are not interchangeable like 1/2 m v^2, G m1 m2 / r, and m c^2.

I may be missing something here (and the karma voting patterns suggest that I am). But I will repeat my claim - perhaps with more clarity:

Bits are bits, just as joules are joules. But just as you can use joules as a unit to quantify different kinds of energy (kinetic, potential, relativistic), you can use bits as a unit to quantify different kinds of information (log odds-ratio, log likelihood ratio, channel capacity (in some fixed amount of time), entropy of a message source). Each of these kinds of information is measured in the same unit - bits.

You can measure evidence in bits, and you can measure the information content of the answer to a question in bits. The two are calculated using different formulas, because they are different things. Just as potential and kinetic energy are different things.

You are correct that bits can be used to measure different things. The problem here is that probabilities and odds ratios describe the exact same thing in different ways. A joule of potential energy is not the same thing as a joule of kinetic energy, but they can be converted to each other at a 1:1 ratio. A probability-bit measures the same thing as an odds-ratio-bit, but is a different quantity (a probability-bit is always greater than 1 odds-ratio-bit, and can be up to infinity odds-ratio-bits). A "bit of evidence" does not unambiguously tell someone whether you mean probability-bit or odds-ratio-bit, and Eliezer does not distinguish between them properly.

1 probability bit in favor of a hypothesis gives you a posterior probability of 1/2^(n-1) from a prior of 1/2^n. n probability bits gives you a posterior of 1 from the same prior.

1 odds ratio bit in favor of a hypothesis gives you a posterior odds ratio of 1:2^(n-1) from a prior of 1:2^n. n odds ratio bits gives you a posterior odds ratio of 1:1 (probability 1/2) from the same prior. It takes infinity odds ratio bits to give you a posterior probability of 1.

As the prior probability approaches 0, the types of bits become interchangeable.
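A small sketch may make the two bookkeeping schemes in this thread concrete. The function names are invented for illustration, and n = 10 is an arbitrary choice; a "probability bit" is taken to double the probability, and an "odds ratio bit" to double the odds.

```python
from fractions import Fraction

def after_probability_bits(prior_probability: Fraction, bits: int) -> Fraction:
    """Each probability bit doubles the probability (which cannot exceed 1)."""
    return prior_probability * 2 ** bits

def after_odds_bits(prior_odds: Fraction, bits: int) -> Fraction:
    """Each odds ratio bit doubles the odds in favor."""
    return prior_odds * 2 ** bits

def odds_to_probability(odds: Fraction) -> Fraction:
    return odds / (1 + odds)

n = 10
prior_probability = Fraction(1, 2 ** n)   # prior of 1/2^n
prior_odds = Fraction(1, 2 ** n)          # prior of 1:2^n

print(after_probability_bits(prior_probability, n))           # 1
print(odds_to_probability(after_odds_bits(prior_odds, n)))    # 1/2
```

With the same number of bits, one scheme reaches certainty and the other reaches even odds, which is the gap being argued about.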

Clearly you understand me now, and I think that I understand you.

OK, if what is at issue here is whether Eliezer was sufficiently clear, then I'll bow out. Obviously, he was not sufficiently clear from your viewpoint. I will say, though, that your comment is the first time I have seen the word "evidence" used by a Bayesian for anything other than a log odds ratio.

Log odds evidence has the virtue that it is additive (when independent). On the other hand, your idea of a log probability meaning of 'evidence' has the virtue that a question can be decided by a finite amount of evidence.

Eliezer used it to mean log probability in the section that I quoted. That was what I was complaining about.

Ok, I think you are misinterpreting, but I see what you mean. When EY writes:

I take this as illustrating the definition of bits in general, rather than bits of 'evidence'. But, yes, I agree with you now that placing that explanation in a paragraph with that lead sentence promising a definition of 'evidence' - well it definitely could have been written more clearly.

You unfortunately forgot to mention the cost of a ticket in the lottery and the payout in the lottery. If the payout is high enough that the expected payout of the ticket is greater than or equal to the cost of the ticket, then the lottery makes sense to play. Since each ticket in that case has a payout equal to or greater than its cost, it makes sense to buy up all of the possible combinations to ensure a win.

He's talking about epistemology, not decision theory. Decision theory depends on a whole host of factors other than the probability of the desired outcome. I would buy a $1 lottery ticket if it were clear that it represented a 1/8,194,749 chance of winning $131,115,985. Epistemologically, however, I would be astonished if something happened besides me being $1 poorer.

Maybe I'm confused, but isn't log_2(131,115,984) about 26.9, and not greater than 27?

Ok I see, so do you always just add one bit?

Just to be clear for my sake, the log_2 of the likelihood ratio is how many bits that piece of evidence is worth?

edit: should I take no one correcting me as no one knowing, or being right?

The number of false bleeps $N$ is distributed almost exactly Poisson with $\lambda = 1$. The important figure is not the expected number of bleeps ($E[N+1]$, which is indeed 2). It's the expected probability that a random bleep is the true one, $E[1/(N+1)]$. At the moment I can't find an analytic solution (and a short search suggests none is known), but a computation shows the result is around 63.2%, much better than 50%. Similarly, with 14 boxes (arguably "28 bits of evidence"), the chance of winning is about 79.1% on average, much better than $2/3$.
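The computation described in this comment can be sketched as follows, assuming the number of false bleeps is Poisson with mean equal to the number of losing tickets times the false-bleep probability. The function names are hypothetical. For a Poisson count, the series for the expected value of 1/(N+1) sums to (1 - e^-λ)/λ, which is used here only as a cross-check.

```python
from math import exp, factorial

def expected_share(lam: float, terms: int = 60) -> float:
    """E[1/(N+1)] for N ~ Poisson(lam), by summing the series term by term."""
    return sum(exp(-lam) * lam ** k / factorial(k) / (k + 1) for k in range(terms))

def closed_form(lam: float) -> float:
    """The same expectation from the closed-form sum of the Poisson series."""
    return (1 - exp(-lam)) / lam

LOSING_TICKETS = 131_115_984
lam_even = LOSING_TICKETS / 131_115_985    # filter of strength 131,115,985:1
lam_14_boxes = LOSING_TICKETS / 4 ** 14    # 14 boxes, i.e. 28 bits

print(round(expected_share(lam_even), 3))      # 0.632
print(round(expected_share(lam_14_boxes), 3))  # 0.791
```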

Huh?