
gwern comments on Open thread, August 19-25, 2013 - Less Wrong Discussion

Post author: David_Gerard 19 August 2013 06:58AM




Comment author: gwern 03 September 2013 04:38:47PM 0 points

Eh, I'm not sure the idea of 'double-spending' really applies here. In the multiple comparisons case, you're spending all your budget on detecting the observed effect size and getting high-power/reducing-Type-II-errors (if there's an effect lurking there, you'll find it!), but you then can't buy as much Type I error reduction as you want.

This could be fine in some applications. For example, when I'm A/B testing visual changes to gwern.net, I don't care if I commit a Type I error, because if I replace one doohickey with another doohickey and they work equally well (the null hypothesis), all I've lost is a little time. I'm worried about coming up with an improvement, testing the improvement, and mistakenly believing it isn't an improvement when actually it is.

The problem with multiple comparisons comes when people don't realize they've used up their budget and believe they really have controlled alpha errors at 5% or whatever - when they think they've had their cake and eaten it too.

I guess a better financial analogy would be more like "you spend all your money on the new laptop you need for work, but not having checked your bank account balance, promise to take your friends out for dinner tomorrow"?
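To make the used-up-budget point concrete, here's a quick Python sketch (not from the original comment; it assumes independent tests of true null hypotheses, under which each p-value is uniform on [0, 1]) showing how the family-wise error rate blows past the nominal 5%:

```python
import random

random.seed(0)

def familywise_error_rate(k, alpha=0.05, sims=20000):
    """Estimate the chance of at least one false positive when
    running k independent tests of true null hypotheses.
    Under the null, each p-value is uniform on [0, 1]."""
    hits = 0
    for _ in range(sims):
        if any(random.random() < alpha for _ in range(k)):
            hits += 1
    return hits / sims

# Nominal alpha is 5% per test, but across 10 tests the
# family-wise rate is roughly 1 - 0.95**10, i.e. about 40%.
print(familywise_error_rate(1))   # ~0.05
print(familywise_error_rate(10))  # ~0.40
```

The analytic answer for 10 tests is 1 - 0.95^10 ≈ 0.40, which is what the simulation recovers.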

Comment author: Lumifer 03 September 2013 05:27:53PM 0 points

I am a bit confused -- is the framework for this thread observation (where the number of samples is pretty much the only thing you can affect pre-analysis) or experiment design (where you can greatly affect which data you collect)?

I ask because I'm intrigued by the idea of trading off Type I errors against Type II errors, but I'm not sure it's possible in the observation context without introducing bias.

Comment author: gwern 03 September 2013 06:57:26PM 0 points

I'm not sure about this observation vs experiment design dichotomy you're thinking of. I think of power analysis as something which can be done both before an experiment to design it and understand what the data could tell one, and post hoc, to understand why you did or did not get a result and to estimate things for designing the next experiment.

Comment author: Lumifer 03 September 2013 07:20:53PM 0 points

Well, I think of statistical power as the ability to distinguish signal from noise. If you expect signal of a particular strength you need to find ways to reduce the noise floor to below that strength (typically through increasing sample size).
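The "reduce the noise floor through sample size" point can be checked numerically -- a small Python sketch (not in the original comment) showing that the standard deviation of a sample mean falls as 1/sqrt(n):

```python
import math
import random
import statistics

random.seed(1)

def sd_of_sample_mean(n, sigma=1.0, sims=4000):
    """Empirical standard deviation of the mean of n Gaussian draws."""
    means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
             for _ in range(sims)]
    return statistics.stdev(means)

# The noise floor drops as 1/sqrt(n): quadrupling n halves it.
for n in (10, 40, 160):
    print(n, round(sd_of_sample_mean(n), 3), round(1 / math.sqrt(n), 3))
```

Each empirical value lands close to the theoretical sigma/sqrt(n), so halving the noise floor costs four times the samples.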

However my standard way of thinking about this is: we have data, we build a model, we evaluate how good the model output is. Building a model, say, via some sort of maximum likelihood, gives you "the" fitted model with specific chances to commit a Type I or a Type II error. But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?

Comment author: gwern 03 September 2013 07:28:38PM 0 points

But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?

Model-building seems like a separate topic. Power analysis is for particular approaches, where I certainly can trade off Type I against Type II. Here's a simple example for a two-group t-test, where I accept a higher Type I error rate and immediately see my Type II go down (power go up):

R> power.t.test(n=40, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5981
    alternative = two.sided

NOTE: n is number in *each* group

R> power.t.test(n=40, delta=0.5, sig.level=0.10)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.1
          power = 0.7163
    alternative = two.sided

NOTE: n is number in *each* group

In exchange for accepting 10% Type I rather than 5%, I see my Type II fall from 1-0.60=40% to 1-0.72=28%. Tada, I have traded off errors and as far as I know, the t-test remains exactly as unbiased as it ever was.
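For anyone without R handy, the same trade-off can be reproduced in Python using the normal approximation (a rough stand-in for power.t.test, not the exact noncentral-t calculation, so it overstates the power slightly):

```python
import math
from statistics import NormalDist

def approx_power(n, delta, sd=1.0, sig_level=0.05):
    """Power of a two-sided two-sample test with n per group,
    via the normal approximation to the noncentral t."""
    nd = NormalDist()
    ncp = (delta / sd) * math.sqrt(n / 2)   # noncentrality parameter
    z = nd.inv_cdf(1 - sig_level / 2)       # two-sided critical value
    # chance the test statistic lands in either rejection region
    return nd.cdf(ncp - z) + nd.cdf(-ncp - z)

print(round(approx_power(40, 0.5, sig_level=0.05), 3))  # 0.609 (exact t: 0.5981)
print(round(approx_power(40, 0.5, sig_level=0.10), 3))  # 0.723 (exact t: 0.7163)
```

Loosening sig_level from 0.05 to 0.10 raises the power here too, just as in the R output.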

Comment author: Lumifer 03 September 2013 08:10:21PM 0 points

I am not explaining myself well. Let me try again.

To even talk about Type I / II errors you need two things -- a hypothesis or a prediction (generally, output of a model, possibly implicit) and reality (unobserved at prediction time). Let's keep things very simple and deal with binary variables: say we have an object foo and we want to know whether it belongs to class bar (or does not belong to it). We have a model, maybe simple and even trivial, which, when fed the object foo, outputs the probability of it belonging to class bar. Let's say this probability is 92%.

Now, at this point we are still in the probability land. Saying that "foo belongs to class bar with a probability of 92%" does not subject us to Type I / II errors. It's only when we commit to the binary outcome and say "foo belongs to class bar, full stop" that they appear.

The point is that in probability land you can't trade off Type I error against Type II -- you just have the probability (or a full distribution in the more general case). It's the commitment to a certain outcome on the basis of an arbitrarily picked threshold that gives rise to them. And so it is that threshold (e.g. traditionally 5%) that determines the trade-off between errors. Changing the threshold changes the trade-off, but this doesn't affect the model and its output -- it's all post-prediction interpretation.
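A toy Python version of this (my own illustration, with made-up Gaussian scores standing in for model output): the model and its scores never change, only the cutoff moves, and the two error rates trade off against each other.

```python
import random

random.seed(2)

# Simulated model scores: objects in class bar score higher on average.
bar     = [random.gauss(1.0, 1.0) for _ in range(5000)]  # truly in bar
not_bar = [random.gauss(0.0, 1.0) for _ in range(5000)]  # truly not in bar

def error_rates(threshold):
    """Commit to 'foo is in bar' whenever its score >= threshold."""
    type1 = sum(s >= threshold for s in not_bar) / len(not_bar)  # false positives
    type2 = sum(s < threshold for s in bar) / len(bar)           # false negatives
    return type1, type2

# Same scores, same model: raising the cutoff lowers Type I
# errors while raising Type II errors, and vice versa.
for t in (0.0, 0.5, 1.0):
    print(t, error_rates(t))
```

Nothing about the fitted model is biased or re-estimated here; the trade-off lives entirely in the post-prediction threshold, which is the point being made above.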

Comment author: gwern 03 September 2013 09:39:24PM 0 points

So you're trying to talk about overall probability distributions in a Bayesian framework? I haven't ever done power analysis with that approach, so I don't know what would be analogous to Type I and II errors and whether one can trade them off; in fact, the only paper I can recall discussing how one does it is Kruschke's paper (starting on pg11) - maybe he will be helpful?

Comment author: Lumifer 04 September 2013 01:10:28AM 0 points

Not necessarily in the Bayesian framework, though it's kinda natural there. You can think in terms of complete distributions within the frequentist framework perfectly well, too.

The issue that we started with was of statistical power, right? While it's technically defined in terms of the usual significance (=rejecting the null hypothesis), you can think about it in broader terms. Essentially it's the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
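That broader reading of power -- the rate at which you actually detect a real signal -- can be checked by brute force. A Python sketch (my own, using a hand-rolled pooled t statistic and an approximate two-sided 5% cutoff of ~1.99 for df = 78):

```python
import math
import random
import statistics

random.seed(3)

def two_sample_t(x, y):
    """Pooled two-sample t statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def empirical_power(n=40, delta=0.5, sims=2000, t_crit=1.99):
    """Fraction of simulated experiments whose t statistic clears the
    cutoff; t_crit ~1.99 approximates the two-sided 5% value for df = 78."""
    rejects = 0
    for _ in range(sims):
        x = [random.gauss(delta, 1) for _ in range(n)]
        y = [random.gauss(0, 1) for _ in range(n)]
        rejects += abs(two_sample_t(x, y)) > t_crit
    return rejects / sims

print(empirical_power())  # close to power.t.test's 0.5981
```

The detection rate over repeated experiments comes out near the 0.60 power that power.t.test reported earlier in the thread.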

Thanks for the paper, I've seen it before but didn't have a handy link to it.

Comment author: gwern 04 September 2013 05:13:44PM 0 points

You can think in terms of complete distributions within the frequentist framework perfectly well, too.

Does anyone do that, though?

Essentially it's the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.

Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits; then the sample size & effect size interact to say how many bits each n contains. So a binary variable contains a lot less than a continuous variable, a shift in a rare observation like 90/10 is going to be harder to detect than a shift in a 50/50 split, etc. That's not stuff I know a lot about.
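One crude way to put a number on the "bits per observation" intuition (my own illustration, and entropy is only a proxy here, not a full account of detection power) is Shannon entropy:

```python
import math

def entropy_bits(p):
    """Shannon entropy of a binary variable with P(success) = p."""
    if p in (0, 1):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# A 50/50 binary observation carries a full bit; a 90/10 one
# carries less than half a bit, so each sample tells you less.
print(round(entropy_bits(0.5), 3))  # 1.0
print(round(entropy_bits(0.9), 3))  # 0.469
```

This at least matches the 50/50-vs-90/10 asymmetry: the rarer the event, the fewer bits each observation contributes on average.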

Comment author: Lumifer 04 September 2013 05:44:30PM 0 points

Does anyone do that, though?

Well, sure. The frequentist approach, aka mainstream statistics, deals with distributions all the time and the arguments about particular tests or predictions being optimal, or unbiased, or asymptotically true, etc. are all explicitly conditional on characteristics of underlying distributions.

Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits;

Yes, something like that. Take a look at Fisher information, e.g. "The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends."
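For the Bernoulli case, the quoted definition works out to a simple closed form, I(p) = 1/(p(1-p)) -- a quick sketch (my own, not from the quoted source):

```python
def fisher_information(p):
    """Fisher information of one Bernoulli(p) observation about p:
    I(p) = 1 / (p * (1 - p))."""
    return 1 / (p * (1 - p))

# By the Cramer-Rao bound, Var(p_hat) >= 1 / (n * I(p)), which for
# the sample proportion is exactly p * (1 - p) / n.
n = 100
for p in (0.5, 0.9):
    print(p, round(fisher_information(p), 3), p * (1 - p) / n)
```

So the information per observation, and with it the attainable estimation precision for a fixed n, follows directly from the underlying distribution's parameter.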