
ciphergoth comments on Open thread, August 19-25, 2013 - Less Wrong Discussion

2 Post author: David_Gerard 19 August 2013 06:58AM


Comment author: ciphergoth 03 September 2013 12:19:50PM 1 point

Many thanks for this!

So in broad strokes: the smaller a correlation is, the more samples you need to detect it, so the more samples you take, the more correlations you can detect. For five different human traits, the graph plots the number of samples against the number of correlations detected, on a log/log scale; from that we infer that a similar slope is likely for intelligence, and so we can use it to estimate how many samples we'll need to find some number of SNPs for intelligence. Am I handwaving in the right direction?
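As a rough illustration of that slope (my own sketch, not from the thread; the Fisher z approximation and the specific alpha/power values are assumptions), the sample size needed to detect a correlation blows up as the correlation shrinks:

```python
from math import atanh, ceil
from statistics import NormalDist

def n_to_detect_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a true correlation r
    at two-sided significance level alpha with the given power, via
    the Fisher z transformation: n ~ ((z_a + z_b) / atanh(r))^2 + 3."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / atanh(r)) ** 2 + 3)

for r in (0.5, 0.1, 0.05, 0.01):
    print(r, n_to_detect_correlation(r))
```

Each halving of the correlation roughly quadruples the required sample, which is why sample size against detections looks like a straight line on a log/log plot.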

Comment author: gwern 03 September 2013 03:26:54PM 0 points

so the more samples you take, the more correlations you can detect.

Yes, although I'd phrase this more as 'the more samples you take, the bigger your "budget", which you can then spend on better estimates of a single variable or, if you prefer, acceptable-quality estimates of several variables'.

Which one you want depends on what you're doing. Sometimes you want one variable, other times you want more than one variable. In my self-experiments, I tend to spend my entire budget on getting good power on detecting changes in a single variable (but I could have spent my data budget in several ways: on smaller alphas, or smaller effect sizes, or detecting changes to multiple variables). Genomics studies like these, however, aren't interested so much in singling out any particular gene and studying it in close detail as in finding 'any relevant gene at all, and as many as possible'.
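To make the "budget" concrete (my own sketch, not gwern's numbers; the normal-approximation formula, sd = 1, and the example values are assumptions), here is the per-group sample size for a two-sample comparison of means, showing the same kinds of purchases: a smaller alpha (e.g. a Bonferroni split across several variables) or a smaller detectable effect, each at a higher cost in samples:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for detecting a
    difference of delta (in sd units) between two group means."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / delta) ** 2)

print(n_per_group(0.5))                   # baseline: one variable, alpha = 0.05
print(n_per_group(0.5, alpha=0.05 / 10))  # Bonferroni-split alpha over 10 variables
print(n_per_group(0.25))                  # or spend it on a smaller effect size
```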

Comment author: ciphergoth 03 September 2013 04:01:06PM 1 point

And there's a "budget" because if you "double-spend", you end up with the XKCD green acne jelly beans?

Comment author: gwern 03 September 2013 04:38:47PM 0 points

Eh, I'm not sure the idea of 'double-spending' really applies here. In the multiple-comparisons case, you're spending all your budget on detecting the observed effect size and getting high power/reducing Type II errors (if there's an effect lurking there, you'll find it!), but you then can't buy as much Type I error reduction as you want.

This could be fine in some applications. For example, when I'm A/B testing visual changes to gwern.net, I don't care if I commit a Type I error, because if I replace one doohickey with another doohickey and they work equally well (the null hypothesis), all I've lost is a little time. What I'm worried about is coming up with an improvement, testing it, and mistakenly believing it isn't an improvement when actually it is (a Type II error).

The problem with multiple comparisons comes when people don't realize they've used up their budget and believe they really have controlled alpha errors at 5% or whatever -- when they think they've had their cake and eaten it too.

I guess a better financial analogy would be more like "you spend all your money on the new laptop you need for work, but not having checked your bank account balance, promise to take your friends out for dinner tomorrow"?

Comment author: Lumifer 03 September 2013 05:27:53PM 0 points

I am a bit confused -- is the framework for this thread observation (where the number of samples is pretty much the only thing you can affect pre-analysis) or experiment design (where you can greatly affect which data you collect)?

I ask because I'm intrigued by the idea of trading off Type I errors against Type II errors, but I'm not sure it's possible in the observation context without introducing bias.

Comment author: gwern 03 September 2013 06:57:26PM 0 points

I'm not sure about this observation vs experiment design dichotomy you're thinking of. I think of power analysis as something which can be done both before an experiment to design it and understand what the data could tell one, and post hoc, to understand why you did or did not get a result and to estimate things for designing the next experiment.

Comment author: Lumifer 03 September 2013 07:20:53PM 0 points

Well, I think of statistical power as the ability to distinguish signal from noise. If you expect a signal of a particular strength, you need to find ways to reduce the noise floor to below that strength (typically by increasing the sample size).
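That noise-floor picture can be simulated directly (my sketch, not Lumifer's; the 0.1 signal size and sd = 1 are arbitrary assumptions): the spread of the sample mean is the floor a fixed signal has to clear, and it falls as 1/sqrt(n):

```python
import random
from statistics import mean, stdev

random.seed(42)

def noise_floor(n, signal=0.1, noise_sd=1.0, trials=2000):
    """Empirical standard deviation of the sample mean: the 'noise
    floor' that a true signal of the given size must rise above."""
    means = [mean(random.gauss(signal, noise_sd) for _ in range(n))
             for _ in range(trials)]
    return stdev(means)

print(round(noise_floor(25), 3))   # floor near 0.2: a 0.1 signal is buried
print(round(noise_floor(400), 3))  # floor near 0.05: the same signal stands out
```

Quadrupling the sample halves the floor, so a signal invisible at n = 25 becomes clearly detectable at n = 400.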

However, my standard way of thinking about this is: we have data, we build a model, we evaluate how good the model output is. Building a model, say via some sort of maximum likelihood, gives you "the" fitted model with specific chances of committing a Type I or a Type II error. But can you trade off chances of Type I errors against chances of Type II errors other than by crudely adding bias to the model output?

Comment author: gwern 03 September 2013 07:28:38PM 0 points

But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?

Model-building seems like a separate topic. Power analysis is for particular approaches, where I certainly can trade off Type I against Type II. Here's a simple example for a two-group t-test, where I accept a higher Type I error rate and immediately see my Type II go down (power go up):

R> power.t.test(n=40, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5981
    alternative = two.sided

NOTE: n is number in *each* group

R> power.t.test(n=40, delta=0.5, sig.level=0.10)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.1
          power = 0.7163
    alternative = two.sided

NOTE: n is number in *each* group

In exchange for accepting 10% Type I rather than 5%, I see my Type II fall from 1-0.60=40% to 1-0.72=28%. Tada, I have traded off errors and as far as I know, the t-test remains exactly as unbiased as it ever was.
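The same trade-off can be checked outside R (a sketch of mine using scipy's noncentral t distribution, which is the same calculation power.t.test performs; scipy is my substitution, not part of the thread):

```python
from scipy.stats import nct, t

def two_sample_power(n, delta, sd=1.0, sig_level=0.05):
    """Power of a two-sided two-sample t-test with n per group,
    via the noncentral t distribution (as in R's power.t.test)."""
    df = 2 * n - 2                        # degrees of freedom
    ncp = delta / sd * (n / 2) ** 0.5     # noncentrality parameter
    t_crit = t.ppf(1 - sig_level / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

print(round(two_sample_power(40, 0.5, sig_level=0.05), 4))  # ~0.5981, matching R
print(round(two_sample_power(40, 0.5, sig_level=0.10), 4))  # ~0.7163
```

Loosening sig_level from 0.05 to 0.10 raises power from roughly 0.60 to 0.72 -- exactly the Type I / Type II exchange described above.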

Comment author: Lumifer 03 September 2013 08:10:21PM 0 points

I am not explaining myself well. Let me try again.

To even talk about Type I / II errors you need two things: a hypothesis or a prediction (generally, the output of a model, possibly implicit) and reality (unobserved at prediction time). Let's keep things very simple and deal with binary variables: say we have an object foo and we want to know whether it belongs to class bar (or does not). We have a model, maybe simple and even trivial, which, when fed the object foo, outputs the probability of it belonging to class bar. Let's say this probability is 92%.

Now, at this point we are still in the probability land. Saying that "foo belongs to class bar with a probability of 92%" does not subject us to Type I / II errors. It's only when we commit to the binary outcome and say "foo belongs to class bar, full stop" that they appear.

The point is that in probability land you can't trade off Type I errors against Type II -- you just have the probability (or a full distribution in the more general case). It's the commitment to a certain outcome on the basis of an arbitrarily picked threshold that gives rise to them. And so it is that threshold (e.g. traditionally 5%) that determines the trade-off between errors. Changing the threshold changes the trade-off, but this doesn't affect the model or its output; it's all post-prediction interpretation.
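This threshold picture can be simulated in a few lines (my sketch; the Gaussian scores and threshold values are arbitrary assumptions): one fixed "model", where only the decision threshold moves, and the two error rates trade off against each other:

```python
import random

random.seed(1)

# model scores for objects that truly are / are not in class bar;
# the model is fixed -- only the decision threshold changes
in_bar = [random.gauss(1.0, 1.0) for _ in range(10_000)]
not_bar = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def error_rates(threshold):
    type_i = sum(s > threshold for s in not_bar) / len(not_bar)  # false positives
    type_ii = sum(s <= threshold for s in in_bar) / len(in_bar)  # false negatives
    return type_i, type_ii

for thr in (0.0, 0.5, 1.0):
    print(thr, error_rates(thr))  # raising thr: Type I falls, Type II rises
```

Nothing about the fitted model changes between rows; the whole Type I / Type II trade-off lives in the post-prediction threshold.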

Comment author: gwern 03 September 2013 09:39:24PM 0 points

So you're trying to talk about overall probability distributions in a Bayesian framework? I haven't ever done power analysis with that approach, so I don't know what would be analogous to Type I and II errors or whether one can trade them off; in fact, the only paper I can recall discussing how one does it is Kruschke's paper (starting on p. 11) -- maybe he will be helpful?