EHeller comments on Don't You Care If It Works? - Part 1 - Less Wrong

4 Post author: Jacobian 29 July 2015 02:32PM

Comment author: EHeller 29 July 2015 05:02:46PM *  5 points [-]

It is true that optional stopping won't change Bayes rule updates (which is easy enough to show). It's also true that optional stopping does affect frequentist tests (different sampling distributions). The broader question is "which behavior is better?"
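To make the frequentist half of that concrete, here is a minimal simulation sketch. The setup is an assumption chosen for illustration and is not from the post or the linked paper: a two-sided z-test with known sigma = 1, the null exactly true, and a look at the data after every observation from n = 10 to n = 100.

```python
import math
import random

def peeking_vs_fixed(trials=2000, n_min=10, n_max=100, z_crit=1.96, seed=0):
    """Simulate a two-sided z-test on N(0, 1) data, so the null (mean 0) is true.

    Returns (fixed_rate, peeking_rate): how often the test rejects when it is
    run once at n_max, versus how often it rejects at some look when it is run
    after every observation from n_min onward.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(trials):
        total = 0.0
        crossed = False
        # Keep sampling to n_max even after a crossing, so both procedures
        # are scored on the same data.
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            # z statistic for known sigma = 1 is sum / sqrt(n)
            if n >= n_min and abs(total) / math.sqrt(n) > z_crit:
                crossed = True
        if abs(total) / math.sqrt(n_max) > z_crit:
            fixed_hits += 1
        if crossed:
            peeking_hits += 1
    return fixed_hits / trials, peeking_hits / trials

fixed_rate, peeking_rate = peeking_vs_fixed()
print(f"fixed n: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

The fixed-n rejection rate should sit near the nominal 5%, while the peeking rate comes out well above it, which is the "different sampling distributions" point.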

p-hacking is when statisticians use optional stopping to make their results look more significant (by not reporting their stopping rule). As it turns out you in fact can "posterior hack" Bayesians - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040

Edit: Also Debrah Mayo's Error Statistics book contains a demonstration that optional stopping can cause a Bayesian to construct confidence intervals that never contain the true parameter value. Weirdly, those Bayesians can be posterior hacked even if you tell them about the stopping rule, because they don't think it matters.

Comment author: RichardKennaway 29 July 2015 08:20:36PM *  1 point [-]

p-hacking is when statisticians use optional stopping to make their results look more significant (by not reporting their stopping rule). As it turns out you in fact can "posterior hack" Bayesians - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040

That is not my understanding of the term "optional stopping" (nor, more significantly, is it that of Jaynes). Optional stopping is the process of collecting data, computing your preferred measure of resultiness as you go, and stopping the moment it passes your criterion for reporting it, whether that is p<0.05, or a Bayes factor above 3, or anything else. (If it never passes the criterion, you just never report it.) That is but one of the large arsenal of tools available to the p-hacker: computing multiple statistics from the data in the hope of finding one that passes the criterion, thinking up more hypotheses to test, selective inclusion or omission of "outliers", fitting a range of different models, and so on. And of these, optional stopping is surely the least effective, for as Jaynes remarks in "Probability Theory as Logic", it is practically impossible to sample long enough to produce substantial support for a hypothesis deviating substantially from the truth.

All of those other methods of p-hacking involve concealing the real hypothesis, which is the collection of all the hypotheses that were measured against the data. It is like dealing a bridge hand and showing that it supports astoundingly well the hypothesis that that bridge hand would be dealt. In machine learning terms, the hypothesis is being covertly trained on the data, then tested on how well it fits the data. No measure of the latter, whether frequentist or Bayesian, is a measure of how well the hypothesis will fit new data.

Comment author: EHeller 29 July 2015 08:30:11PM 0 points [-]

If you look at the paper, what you call optional stopping is what the authors called "data peeking."

In their simulations, the authors first took a sample of 20 and computed the Bayes factor, and then could selectively continue to add data up to 30 (stopping when they reached an "effect" or 30 samples). The paper's point is that this does skew the Bayes factor (it doubles the chance of getting a Bayes factor > 3).
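A rough sketch of that kind of simulation follows. It is not the paper's procedure: the Bayes factor here is a simple likelihood ratio between two point hypotheses on unit-variance normal data, and the effect size `MU_ALT` is a made-up value, so the rates it prints will not match the paper's numbers.

```python
import math
import random

MU_ALT = 0.5  # hypothetical effect size under H1; not taken from the paper

def bayes_factor(total, n, mu=MU_ALT):
    """BF for H1: N(mu, 1) against H0: N(0, 1), given the sum of n observations."""
    return math.exp(mu * total - n * mu * mu / 2.0)

def simulate_null(trials=5000, n_start=20, n_max=30, threshold=3.0, seed=1):
    """Return (rate without peeking, rate with peeking) of BF > threshold under H0."""
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n_start))
        hit = bayes_factor(total, n_start) > threshold
        fixed_hits += hit
        n = n_start
        # Data peeking: keep adding one observation until "effect" or n_max.
        while not hit and n < n_max:
            total += rng.gauss(0.0, 1.0)
            n += 1
            hit = bayes_factor(total, n) > threshold
        peek_hits += hit
    return fixed_hits / trials, peek_hits / trials

fixed_rate, peek_rate = simulate_null()
print(f"BF > 3 at n=20 only: {fixed_rate:.3f}; with peeking up to 30: {peek_rate:.3f}")
```

Peeking can only raise the hit rate here, since the first look is the same in both procedures. It cannot raise it without limit, though: when the null is true the expected Bayes factor at any bounded stopping time is 1, so by Markov's inequality the chance of ending above 3 is at most 1/3.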

Comment author: ike 29 July 2015 09:08:45PM 0 points [-]

It skews the Bayes factor when the hypothesis is in fact not true. The times that the hypothesis is true should balance out to make the calibration correct overall.

Comment author: EHeller 29 July 2015 09:25:06PM 1 point [-]

In practice what p-hacking is about is convincing the world of an effect, so you are trying to create bias toward any data looking like a novel effect. Stopping rules/data peeking accomplish this just as much for Bayes as for frequentist inference (though if the frequentist knows about the stopping rule they can adjust in a way that Bayesians can't), which is my whole point.

Whether or not the Bayesian calibration is overall correct depends not just on the Bayes factor but the prior.

Comment author: ike 29 July 2015 09:45:54PM 2 points [-]

Whether or not the Bayesian calibration is overall correct depends not just on the Bayes factor but the prior.

It depends only on the prior. I consider all these "stopping rule paradoxes" disguised cases where you give the Bayesian a bad prior, and the frequentist formula encodes a better prior.

In practice what p-hacking is about is convincing the world of an effect, so you are trying to create bias toward any data looking like a novel effect.

You still wouldn't have more chances of showing a novel effect than you thought you would when you went into the experiment, if your priors are correct. If you say "I'll stop when I have a novel effect", do this many times, and then look at all the times you found a novel effect, 95% of the time the effect should actually be true. If this is wrong, you must have bad priors.
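Here is a small simulation sketch of that claim. Everything in it is an illustrative assumption: two point hypotheses (mean 0 versus mean 0.5, unit variance), a 50/50 prior that is correct by construction because the truth is drawn from it, and a cap of 200 observations per study.

```python
import math
import random

def run_study(h1_true, rng, mu=0.5, prior=0.5, target=0.95, n_max=200):
    """Sample until the posterior for H1 reaches target or n_max is hit.

    Returns the posterior probability of H1 at the moment of stopping.
    """
    log_odds = math.log(prior / (1.0 - prior))
    true_mean = mu if h1_true else 0.0
    posterior = prior
    for _ in range(n_max):
        x = rng.gauss(true_mean, 1.0)
        log_odds += mu * x - mu * mu / 2.0  # log likelihood ratio of one draw
        posterior = 1.0 / (1.0 + math.exp(-log_odds))
        if posterior >= target:
            break
    return posterior

def calibration(trials=4000, prior=0.5, target=0.95, seed=2):
    rng = random.Random(seed)
    reported = []  # (posterior at stopping, whether H1 was really true)
    for _ in range(trials):
        h1_true = rng.random() < prior  # nature draws the truth from the prior
        posterior = run_study(h1_true, rng, prior=prior, target=target)
        if posterior >= target:  # only "novel effects" get reported
            reported.append((posterior, h1_true))
    mean_posterior = sum(p for p, _ in reported) / len(reported)
    frac_true = sum(t for _, t in reported) / len(reported)
    return mean_posterior, frac_true, len(reported)

mean_posterior, frac_true, n_reported = calibration()
print(f"{n_reported} reported effects: mean posterior {mean_posterior:.3f}, "
      f"fraction actually true {frac_true:.3f}")
```

Under these assumptions the fraction of reported effects that are real should track the average reported posterior (a little above 95%, because the posterior overshoots the threshold at the stopping step), even though every study stopped the moment it looked good.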

Comment author: EHeller 29 July 2015 09:57:18PM 0 points [-]

It depends only on the prior. I consider all these "stopping rule paradoxes" disguised cases where you give the Bayesian a bad prior, and the frequentist formula encodes a better prior.

Then you are doing a very confusing thing that isn't likely to give much insight. Frequentist inference and Bayesian inference are different and it's useful to at least understand both ideas (even if you reject frequentism).

Frequentists are bounding their error with various forms of the law of large numbers; they aren't coherently integrating evidence. So saying the "frequentist encodes a better prior" is to miss the whole point of how frequentist statistics works.

And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior. Most people who advocate Bayesian statistics in experiments advocate sharing bayes factors, not posteriors, in order to abstract away the problem of prior construction.

Comment author: ike 29 July 2015 10:38:01PM 0 points [-]

And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior.

Let me put it differently. Yes, your chance of getting a bayes factor of >3 is 1.8% with data peeking, as opposed to 1% without; but your chance of getting a higher factor also goes down, because you stop as soon as you reach 3. Your expected bayes factor is necessarily 1 weighted over your prior; you expect to find evidence for neither side. Changing the exact distribution of your results won't change that.

Comment author: RichardKennaway 30 July 2015 12:55:27PM 1 point [-]

Your expected bayes factor is necessarily 1

Should that say, rather, that its expected log is zero? A factor of n being as likely as a factor of 1/n.

Comment author: Anders_H 05 August 2015 04:45:07AM *  0 points [-]

My original response to this was wrong and has been deleted.

I don't think this has anything to do with logs, but rather that it is about the difference between probabilities and odds. Specifically, the Bayes factor works on the odds scale, but the proof for conservation of expected evidence is on the regular probability scale.

If you consider the posterior under all possible outcomes of the experiment, the ratio of the posterior probability to the prior probability will on average be 1 (when weighted by the probability of the outcome under your prior). However, the ratio of the posterior probability to the prior probability is not the same thing as the Bayes factor.

If you multiply the Bayes factor by the prior odds, and then transform the resulting quantity (i.e. the posterior) from the odds scale to a probability, and then divide by the prior probability, the resulting quantity will on average be 1.

However, this is too complicated and doesn't seem like a property that gives any additional insight on the Bayes factor.
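A tiny exact check of the distinction, using a made-up binary experiment (the prior of 1/3 and the likelihoods of 3/4 and 1/4 are hypothetical numbers picked only to keep the fractions small):

```python
from fractions import Fraction as F

prior = F(1, 3)                             # hypothetical prior P(H)
p_x_given_h = {1: F(3, 4), 0: F(1, 4)}      # hypothetical P(x | H)
p_x_given_not_h = {1: F(1, 4), 0: F(3, 4)}  # hypothetical P(x | not H)

expected_posterior = F(0)
expected_bf_marginal = F(0)
expected_bf_under_null = F(0)
for x in (0, 1):
    # Probability of outcome x under the prior (the "marginal").
    marginal = prior * p_x_given_h[x] + (1 - prior) * p_x_given_not_h[x]
    posterior = prior * p_x_given_h[x] / marginal
    bayes_factor = p_x_given_h[x] / p_x_given_not_h[x]
    expected_posterior += marginal * posterior
    expected_bf_marginal += marginal * bayes_factor
    expected_bf_under_null += p_x_given_not_h[x] * bayes_factor

print(expected_posterior / prior)  # posterior/prior averages to 1
print(expected_bf_marginal)        # 13/9, not 1
print(expected_bf_under_null)      # 1
```

In this toy case the posterior-to-prior ratio averages to exactly 1 under the prior-weighted outcome distribution, while the Bayes factor averaged the same way does not. The Bayes factor does average to 1 when outcomes are weighted by their probability under not-H alone.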

Comment author: ike 30 July 2015 02:42:39PM 0 points [-]

That's probably a better way of putting it. I'm trying to intuitively capture the idea of "no expected evidence"; you can frame that in multiple ways.

Comment author: ike 29 July 2015 10:11:30PM 0 points [-]

Then you are doing a very confusing thing that isn't likely to give much insight. Frequentist inference and Bayesian inference are different and it's useful to at least understand both ideas (even if you reject frequentism).

I think I understand frequentism. My claim here was that the specific claim of "the stopping rule paradox proves that frequentism does better than Bayes" is wrong, or is no stronger than the standard objection that Bayes relies on having good priors.

So saying the "frequentist encodes a better prior" is to miss the whole point of how frequentist statistics works.

What I meant is that you can get the same results as the frequentist in the stopping rule case if you adopt a particular prior. I might not be able to show that rigorously, though.

And the point in the paper I linked has nothing to do with the prior, it's about the bayes factor, which is independent of the prior.

That paper only calculates what happens to the bayes factor when the null is true. There's nothing that implies the inference will be wrong.

There are a couple of different versions of the stopping rule cases. Some are disguised priors, and some don't affect calibration/inference or any Bayesian metrics.

Comment author: EHeller 29 July 2015 10:25:09PM *  0 points [-]

That paper only calculates what happens to the bayes factor when the null is true. There's nothing that implies the inference will be wrong.

That is the practical problem for statistics (the null is true, but the experimenter desperately wants it to be false). Everyone wants their experiment to be a success. The goal of this particular form of p-hacking is to increase the chance that you get a publishable result. The goal of the p-hacker is to increase the probability of type 1 error. A publication rule based on Bayes factors instead of p-values is still susceptible to optional stopping.

You seem to be saying that a rule based on posteriors would not be susceptible to such hacking?

Comment author: ike 29 July 2015 10:51:29PM 1 point [-]

You seem to be saying that a rule based on posteriors would not be susceptible to such hacking?

I'm saying that all inferences are still correct. So if your prior is correct/well calibrated, then your posterior is as well. If you end up with 100 studies that all found an effect for different things at a posterior of 95%, 5% of them should be wrong.

The goal of the p-hacker is to increase the probability of type 1 error.

So what I should say is that the Bayesian doesn't care about the frequency of type 1 errors. If you're going to criticise that, you can do so without regard to stopping rules. I gave an example in a different reply of hacking bayes factors, now I'll give one with hacking posteriors:

Two kinds of coins: one fair, one 10%H/90%T. There are 1 billion of the fair ones, and 1 of the other kind. You take a coin, flip it 10 times, then say which coin you think it is. The Bayesian gets the biased coin, and no matter what he flips, will conclude that the coin is fair with overwhelming probability. The frequentist gets the coin, gets ~9 tails, and says "no way is this fair". There, the frequentist does better because the Bayesian's prior is bad (I said there are a billion fair ones and only one biased one, but only looked at the biased ones).
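The arithmetic behind "no matter what he flips" can be checked directly. This is a minimal sketch using only the numbers stated in the example; the function name and its parameters are mine.

```python
def posterior_biased(heads, tails, n_fair=10**9, n_biased=1, p_heads_biased=0.1):
    """Posterior probability that the flipped coin is the biased one."""
    prior_odds = n_biased / n_fair
    like_biased = p_heads_biased ** heads * (1 - p_heads_biased) ** tails
    like_fair = 0.5 ** (heads + tails)
    posterior_odds = prior_odds * like_biased / like_fair
    return posterior_odds / (1 + posterior_odds)

# Ten tails is the strongest possible evidence for the biased coin in 10 flips.
for heads in (0, 1):
    print(heads, "heads:", posterior_biased(heads, 10 - heads))
```

Even ten tails in a row only multiplies the odds by 1.8^10, roughly 357, which is nowhere near enough to overcome prior odds of one in a billion.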

It doesn't matter if you always conclude with 95% posterior that the null is false when it is true, as long as you have 20 times as many cases that the null is actually false. Yes, this opens you up to being tricked; but if you're worried about deliberate deception, you should include a prior over that. If you're worried about publication bias when reading other studies, include a prior over that, etc.

Comment author: Anders_H 02 August 2015 10:24:08PM 0 points [-]
Comment author: Jacobian 29 July 2015 06:38:28PM *  0 points [-]

Since I don't want this to spiral into another stopping rule argument, allow me to try and dissolve a confusing point that the discussions get stuck on.

What makes the Bayesian "lose" in the cases proposed by Mayo and Simonsohn isn't the inference, it's the scoring rule. A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist's decision rule about accepting hypotheses at a 5% p-value wins, it's not his inference that magically becomes superior.

Allow me to steal an analogy from my friend Simon: I'm running a Bayesian Casino in Vegas. Debrah Mayo comes to my casino every day with $31. She bets $1 on a coin flip, then bets $2 if she loses, then $4 and so on until she either wins $1 or loses all $31 if 5 flips go against her. I obviously think that by conservation of expected money in a coin flip this deal is fair, but Prof. Mayo tells me that I'm a sucker because I lose more days than I win. I tell her that I care about dollars, not days, but she replies that if she had more money in her pocket, she could make sure I have a losing day with arbitrarily high probability! I smile and ask her if she wants a drink.
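The numbers in the analogy are easy to verify exactly. A short sketch, assuming fair flips and even-money payouts as the story implies (the function name is mine):

```python
from fractions import Fraction

def doubling_day(n_flips):
    """Gambler bets 1, 2, 4, ... on fair flips, quitting at the first win.

    Returns (bankroll needed, P(gambler wins $1), casino's expected profit).
    """
    bankroll = 2 ** n_flips - 1            # 1 + 2 + ... + 2^(n_flips - 1)
    p_all_lost = Fraction(1, 2 ** n_flips)
    p_win = 1 - p_all_lost
    # Casino gains the whole bankroll on a bust and pays out $1 otherwise.
    casino_expected = p_all_lost * bankroll - p_win * 1
    return bankroll, p_win, casino_expected

for flips in (5, 10):
    bankroll, p_win, casino_expected = doubling_day(flips)
    print(f"${bankroll}: gambler wins with probability {p_win}, "
          f"casino expects {casino_expected} dollars")
```

With $31 the casino has a losing day 31 times out of 32, and with a larger bankroll that fraction climbs toward 1, yet the expected dollars stay at exactly zero in both cases.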

Comment author: ike 29 July 2015 09:39:58PM *  2 points [-]

You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis.

This is wrong, unless I've misunderstood you. Imagine the prior for hypothesis H is p, hence the prior for ~H is 1-p. If you have a 99% chance of generating a 20:1 likelihood for H, then your prior must be bounded below by .99*(20p/(19p+1)). (The second factor is the posterior for H if you have a 20:1 likelihood.) So we have the inequality p > .99*(20p/(19p+1)). I was lazy and used http://www.wolframalpha.com/input/?i=p%3E+.99*%2820p%29%2F%2819p%2B1%29%2C+0%3Cp%3C1 to solve it, which tells me that p must be at least 0.989474.

So you can only expect to generate strong evidence for a hypothesis if you're already pretty sure of it, which is just as it should be.

I may have bungled these calculations, doing them quickly, though.
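For anyone who wants to re-check the bound without Wolfram Alpha, the inequality can be solved by hand: dividing both sides by p (for p > 0) and rearranging gives p >= (q*L - 1)/(L - 1), where q is the chance of seeing the evidence and L is the likelihood ratio. A minimal sketch (the function name and the symbols q and L are mine):

```python
def min_prior(q, lr):
    """Smallest prior p satisfying p >= q * lr * p / ((lr - 1) * p + 1).

    q  : probability of observing evidence with likelihood ratio lr for H
    lr : that likelihood ratio (greater than 1)
    """
    return (q * lr - 1) / (lr - 1)

print(round(min_prior(0.99, 20), 6))
```

For q = 0.99 and L = 20 this is 18.8/19, which agrees with the 0.989474 figure.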

Comment author: Jacobian 30 July 2015 01:31:39AM *  0 points [-]

Edit: removed for misunderstanding ike's question and giving an irrelevant answer. Huge thanks to ike for teaching me math.

Comment author: ike 30 July 2015 01:44:28AM *  1 point [-]

That's exactly what I used it for in my calculation, I didn't misunderstand that. Your computation of "conservation of expected evidence" simply does not work unless your prior is extremely high to begin with. Put simply, you cannot be 99% sure that you'll later change your current belief in H of p to anything greater than 100*p/99, which places a severe lower bound on p for a likelihood ratio of 20:1.

Comment author: Jacobian 30 July 2015 09:05:50PM 2 points [-]

Yes! It worked! I learned something by getting embarrassed online!!!

ike, you're absolutely correct. I applied conservation of expected evidence to likelihood ratios instead of to posterior probabilities, and thus didn't realize that the prior puts bounds on expected likelihood ratios. This also means that the numbers I suggested (1% of 1:2000, 99% of 20:1) define the prior precisely at 98.997%.
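The 98.997% figure can be reproduced with a few lines. This sketch assumes the experiment has exactly two outcomes, one with likelihood ratio 20:1 for H and one with 1:2000, and that the first outcome has 99% probability under the prior; the function and variable names are mine.

```python
from fractions import Fraction

def implied_prior(p_first, lr_first, lr_second):
    """Prior on H implied by a two-outcome experiment.

    p_first   : prior-weighted probability of the first outcome
    lr_first  : P(first | H) / P(first | not H)
    lr_second : P(second | H) / P(second | not H)
    """
    # Solve a / b = lr_first and (1 - a) / (1 - b) = lr_second for a and b.
    b = (1 - lr_second) / (lr_first - lr_second)  # P(first outcome | not H)
    a = lr_first * b                              # P(first outcome | H)
    # p * a + (1 - p) * b = p_first, solved for p.
    return (p_first - b) / (a - b)

prior = implied_prior(Fraction(99, 100), Fraction(20), Fraction(1, 2000))
print(prior, float(prior))
```

The two likelihood ratios pin down how likely each outcome is under H and under not-H, and the 99% requirement then leaves only one prior consistent with them.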

I'm going to leave the fight to defend the reputation of Bayesian inference to you and go do some math exercises.

Comment author: Lumifer 29 July 2015 07:06:44PM *  2 points [-]

A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward.

Hold on. Let's say I hire a Bayesian statistician to produce some estimate for me. I do not care about "scoring" or "reward", all I care about is my estimate and how accurate it is. Now you are going to tell me that in 99% of the cases your estimate will be wrong and that's fine because there is a slight chance that you'll be really really sure of the opposite conclusion?

I'm running a Bayesian Casino in Vegas. Debrah Mayo comes to my casino every day with $31.

Why, that's such a frequentist approach X-/

Let's change the situation slightly. You are running the Bayesian Casino and Debrah Mayo comes to your casino once with, say, $1023 in her pocket. Will I lend you money to bet against her? No, I will not. The distribution matters beyond simple expected means.

Comment author: EHeller 29 July 2015 08:50:46PM *  1 point [-]

Reminds me of this bit from a Wasserman paper http://ba.stat.cmu.edu/journal/2006/vol01/issue03/wasserman.pdf

van Nostrand: Of course. I remember each problem quite clearly. And I recall that on each occasion I was quite thorough. I interrogated you in detail, determined your model and prior and produced a coherent 95 percent interval for the quantity of interest.

Pennypacker: Yes indeed. We did this many times and I paid you quite handsomely.

van Nostrand: Well earned money I’d say. And it helped win you that Nobel.

Pennypacker: Well they retracted the Nobel and they took away my retirement savings.

... van Nostrand: Whatever are you talking about?

Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.

van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence not coverage!

Comment author: ike 29 July 2015 09:41:27PM 0 points [-]

Now you are going to tell me that in 99% of the cases your estimate will be wrong

No. Your calibration is still perfect if your priors are perfect. You can only get to that "99% chance of getting strong evidence for hypothesis" if you're already very sure of that hypothesis (the math is in my other reply in this thread).

Comment author: EHeller 29 July 2015 08:40:41PM 0 points [-]

What makes the Bayesian "lose" in the cases proposed by Mayo and Simonsohn isn't the inference, it's the scoring rule. A Bayesian scores himself on total calibration, "number of times my 95% confidence interval includes the truth" is just a small part of it. You can generate an experiment that has a high chance (let's say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything, this occasional 2000:1 confidence is the Bayesian's reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist's decision rule about accepting hypotheses at a 5% p-value wins, it's not his inference that magically becomes superior.

Sometimes we might care about "total calibration" I guess, but sometimes we care about being actually calibrated in the rationalist sense. Sometimes we want a 95% confidence interval to mean that doing this 100 times will include the true value about 95 times.

My point was this idea that the stopping rule doesn't matter is more complicated than calculating a Bayes factor and saying "look, the stopping rule doesn't change the Bayes factor."

Comment author: ike 29 July 2015 09:56:27PM 1 point [-]

My point was this idea that the stopping rule doesn't matter is more complicated than calculating a Bayes factor and saying "look, the stopping rule doesn't change the Bayes factor."

The stopping rule won't change the expectation of the Bayes factor.

Sometimes we want a 95% confidence interval to mean that doing this 100 times will include the true value about 95 times.

If your prior is correct, then your 95% credibility interval will, in fact, be well calibrated and be correct 95% of the time. I argued at length on tumblr that most or all of the force of the stopping rule objection to Bayes is a disguised "you have a bad prior" situation. If you're willing to ask the question that way, you can generate similar cases without stopping rules as well. For instance, imagine there are two kinds of coins; ones that land on heads 100% of the time, and ones that land on heads 5% of the time. (The rest are tails.) You have one flip with the coin. Oh, one more thing: I tell you that there are 1 billion coins of the first kind, and only one of the second kind.

You flip the coin once. It's easy to show that there's an overwhelming probability of getting a 20:1 likelihood in favor of the first coin. Why is this problematic?

I can and have given a similar case for 95% credibility intervals as opposed to Bayes factors, which I'll write out if you're interested.