The Logic of the Hypothesis Test: A Steel Man
Related to: Beyond Bayesians and Frequentists
Update: This comment by Cyan clearly explains the mistake I made - I forgot that an ordering of the hypothesis space is necessary for hypothesis testing to work. I'm not entirely convinced that NHST can't be recast in some "thin" theory of induction, which may well change the details of the actual test, but I have no idea how to formalize this notion of a "thin" theory, and most of the commenters either 1) misunderstood my aim (my fault, not theirs) or 2) don't think it can be formalized.
I'm teaching an econometrics course this semester, and one of the things I'm trying to do is make sure that my students actually understand the logic of the hypothesis test. You can motivate it in terms of controlling false positives, but that sort of interpretation doesn't seem to be generally applicable. Another motivation is a simple deductive syllogism with a small but very important inductive component. I'm borrowing the idea from something we discussed in a course I took with Mark Kaiser - he called it the "nested syllogism of experimentation." I think it applies equally well to most or even all hypothesis tests. It goes something like this:
1. Either the null hypothesis or the alternative hypothesis is true.
2. If the null hypothesis is true, then the data has a certain probability distribution.
3. Under this distribution, our sample is extremely unlikely.
4. Therefore under the null hypothesis, our sample is extremely unlikely.
5. Therefore the null hypothesis is false.
6. Therefore the alternative hypothesis is true.
An example looks like this:
Suppose we have a random sample of size n from a population with a normal distribution that has unknown mean μ and unknown variance σ². Then:
1. Either H₀: μ = μ₀ or H₁: μ ≠ μ₀, where μ₀ is some constant.
2. Construct the test statistic t = (x̄ − μ₀)/(s/√n), where n is the sample size, x̄ is the sample mean, and s is the sample standard deviation.
3. Under the null hypothesis, t has a Student's t distribution with n − 1 degrees of freedom.
4. The p-value of our observed t is really small under the null hypothesis (e.g. less than 0.05).
5. Therefore the null hypothesis is false.
6. Therefore the alternative hypothesis is true.
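The six steps can be carried out numerically. The sketch below is my own illustration with made-up numbers (the data, μ₀ = 5.0, and the 0.05 level are all hypothetical choices); 2.365 is the standard two-sided 5% critical value for a t distribution with 7 degrees of freedom.

```python
import math

# Hypothetical sample; H0: mu = 5.0 vs H1: mu != 5.0
sample = [5.8, 6.1, 5.5, 6.4, 5.9, 6.2, 5.7, 6.0]
mu0 = 5.0

n = len(sample)
xbar = sum(sample) / n                                    # sample mean
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample std dev

# Step 2: the test statistic
t = (xbar - mu0) / (s / math.sqrt(n))

# Steps 3-4: under H0, t follows a Student t distribution with n-1 = 7
# degrees of freedom; 2.365 is the table cutoff for two-sided alpha = 0.05.
t_crit = 2.365
reject = abs(t) > t_crit

print(f"t = {t:.3f}, reject H0: {reject}")
```

If `reject` is true, steps 5 and 6 of the syllogism kick in: we declare the null false and the alternative true.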
What's interesting to me about this process is that it almost tries to avoid induction altogether. Only the move from step 4 to 5 seems anything like an inductive argument. The rest is purely deductive - though admittedly it takes a couple premises in order to quantify just how likely our sample was and that surely has something to do with induction. But it's still a bit like solving the problem of induction by sweeping it under the rug then putting a big heavy deduction table on top so no one notices the lumps underneath.
This sounds like it's a criticism, but actually I think it might be a virtue to minimize the amount of induction in your argument. Suppose you're really uncertain about how to handle induction. Maybe you see a lot of plausible sounding approaches, but you can poke holes in all of them. So instead of trying to actually solve the problem of induction, you set out to come up with a process which is robust to alternative views of induction. Ideally, if one or another theory of induction turns out to be correct, you'd like it to do the least damage possible to any specific inductive inferences you've made. One way to do this is to avoid induction as much as possible so that you prevent "inductive contamination" spreading to everything you believe.
That's exactly what hypothesis testing seems to do. You start with a set of premises and keep deriving logical conclusions from them until you're forced to say "this seems really unlikely if a certain hypothesis is true, so we'll assume that the hypothesis is false" in order to get any further. Then you just keep on deriving logical conclusions with your new premise. Bayesians start yelling about the base rate fallacy in the inductive step, but they're presupposing their own theory of induction. If you're trying to be robust to inductive theories, why should you listen to a Bayesian instead of anyone else?
Now does hypothesis testing actually accomplish induction that is robust to philosophical views of induction? Well, I don't know - I'm really just spitballing here. But it does seem to be a useful steel man.
Frequentist vs Bayesian breakdown: interpretation vs inference
Suppose we have two different human beings, Connor and Diane, who agree to interpret their subjective anticipations as probabilities, thereby commonly earning them the title "Bayesian". On a particular project or venture, they might disagree on whether to use Trick A or Trick B to decide the next step in the project. It might be that Trick A is commonly labelled a "Frequentist inference method" and B is a "Bayesian inference method". Why might they disagree?
As far as I can see, there are 3 disagreements that get labelled "Bayesian vs Frequentist" debates, and conflating them is a problem:
(1) Whether to interpret all subjective anticipations as probabilities.
(2) Whether to interpret all probabilities as subjective anticipations.
(3) Whether, on a particular project, to use Statistical Trick B instead of Statistical Trick A to infer the best course of action, when B is commonly labelled a "Bayesian method" and A is a "Frequentist method".
(Regarding 3, UC Berkeley professor Michael Jordan offers a good heuristic for how statistical tricks get labelled as Bayesian or Frequentist, in terms of which terms in a loss function one treats as fixed or variable. I recommend watching the first twenty minutes of his video lecture on this if you're not familiar.)
The question "is Connor a Bayesian or a Frequentist?" is commonly posed as though Connor's position on 1, 2, and 3 must be either "yes, yes, yes" or "no, no, no". I don't believe that has to be the case. For example, my position is:
(1) - Yes. Insofar as we have subjective anticipations, I agree normatively that they should behave and update as probabilities.
(2) - Don't care much. Expressions like P(X|Y) and P(X and Y) are useful for denoting both subjective anticipations and proportions of a whole, in particular proportions of real future events. Whether to use the word "probability" for both is a terminological question. Personally I try to reserve the word "probability" for subjective anticipations and say "proportion" for proportions of real future events, but this is word choice. Unfortunately this word choice is strongly associated and confused with positions on (1) and (3).
(3) - It depends. In statistical inference, we commonly consider data sets x, world models M, and parameters θ that specify the model M more precisely. I consider the separation of belief into M and θ to be purely formal. When guessing the next data set y, one considers expressions of the form P(x|M,θ) in some way. If I'm already very confident in a specific world model M, and expect θ to actually vary from situation to situation, I'll probably try to estimate the parameters θ from x in a way that has the best expected success rate across all possible data sets M would generate. You might say here that I "trust the model more than the data" (though what I really don't trust are the changing model parameters), and this is a trick commonly referred to as "Frequentist". If I'm not confident in the model M, or expect the parameters θ to be the same in many future situations, I'll probably try to estimate M,θ from x in a way that has the best expected success rate given x. You might say here that I "trust the data more than the model", and label this a "Bayesian" trick.
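To make the contrast in (3) concrete, here is a minimal sketch for estimating a normal mean (my own illustration, not from Jordan's lecture; the data, the known variance, and the prior are all hypothetical assumptions):

```python
import math

x = [2.1, 1.7, 2.4, 2.0, 1.9]   # hypothetical data set
sigma2 = 0.25                    # observation variance, assumed known
n = len(x)
xbar = sum(x) / n

# "Frequentist" trick: the sample mean, chosen because it performs well
# (unbiased, minimum variance) across all data sets the model
# N(mu, sigma2) would generate.
mu_hat_freq = xbar

# "Bayesian" trick: condition on this particular x. With a conjugate
# prior mu ~ N(m0, v0), the posterior mean is a precision-weighted
# average of the prior mean and the sample mean.
m0, v0 = 0.0, 1.0                # hypothetical prior
post_prec = 1 / v0 + n / sigma2
mu_hat_bayes = (m0 / v0 + n * xbar / sigma2) / post_prec
```

The Bayesian estimate is shrunk from the sample mean toward the prior mean; how much it shrinks depends on how much you "trust the data more than the model", i.e. on the prior variance v0.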
Throughout (3), since my position in (1) is not changing, a member of the Bayes Tribe will say I'm "really a Bayesian all along", but I don't want to continue with this conflation of position names. It's true that if I use the "Frequentist trick", it will be because I've updated in favor of it, i.e. my subjective confidence levels in the various theory elements are appropriate for it.
... But from now on, when the term "Bayesian" or "Frequentist" arises in a debate, my plan is to taboo the terms immediately and either dissolve the issue into (1), (2), and (3) above, or change the conversation if people don't have the energy or interest for that length of discussion.
Do people agree with this breakdown? I think I could be persuaded otherwise and would of course appreciate it if I were :)
ETA: I think the wisdom to treat beliefs as anticipation controllers and update our confidences based on evidence might be too precious to alienate people from it with the label "Bayesian", especially if the label is as ambiguous as my breakdown has found it to be.
Michael Jordan dissolves Bayesian vs Frequentist inference debate [video lecture]
UC Berkeley professor Michael Jordan, a leading researcher in machine learning, has a great reduction of the question "Are your inferences Bayesian or Frequentist?". The reduction is basically "Which term are you varying in the loss function?". He calls this the "decision theoretic perspective" on the debate, and uses this terminology well in keeping with LessWrong interests.
I don't have time to write a top-level post about this (maybe someone else does?), but I quite liked the lecture, and thought I should at least post the link!
http://videolectures.net/mlss09uk_jordan_bfway/
The discussion gets much clearer starting at the 10:11 slide, which you can click on and skip to if you like, but I watched the first 10 minutes anyway to get a sense of his general attitude.
Enjoy! I recommend watching while you eat, if it saves you time and the food's not too distracting :)