What I was saying was sort of vague, so I'm going to formalize here.
Data is coming from some random process X(θ,ω), where θ parameterizes the process and ω captures all the randomness. Let's suppose that for any particular θ, living in the set Θ of parameters where the model is well-defined, it's easy to sample from X(θ,ω). We don't put any particular structure (in particular, cardinality assumptions) on Θ. Since we're being frequentists here, nature's parameter θ' is fixed and unknown. We only get to work with the realization of the random process that actually happens, X' = X(θ',ω').
We have some sort of analysis t(⋅) that returns a scalar; applying it to the random data gives us the random variable t(X(θ,ω)), which is still parameterized by θ and still easy to sample from. We pick some null hypothesis Θ0 ⊂ Θ, usually for scientific or convenience reasons.
We want some measure of how weird/surprising the value t(X') is if θ' were actually in Θ0. One way to do this, if we have a simple null hypothesis Θ0 = { θ0 }, is to calculate the p-value p(X') = P(t(X(θ0,ω)) ≥ t(X')). This can clearly be approximated using samples from t(X(θ0,ω)).
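To make the approximation concrete, here is a minimal sketch under assumptions that are mine, not part of the setup above: X(θ,ω) is n i.i.d. draws from Normal(θ, 1), and t is the sample mean. The names `sample_x`, `t_stat`, and `mc_p_value` are hypothetical.

```python
import random
import statistics

def sample_x(theta, n, rng):
    """Hypothetical model: n i.i.d. draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def t_stat(x):
    """Hypothetical analysis t(.): the sample mean."""
    return statistics.fmean(x)

def mc_p_value(t_obs, theta0, n, n_sims=2000, seed=0):
    """Estimate P(t(X(theta0, omega)) >= t_obs) by simulating under theta0."""
    rng = random.Random(seed)
    hits = sum(t_stat(sample_x(theta0, n, rng)) >= t_obs for _ in range(n_sims))
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sims + 1)

# Stand-in for the observed data X' (here generated with theta' = 0.5).
x_obs = sample_x(0.5, 25, random.Random(1))
p_hat = mc_p_value(t_stat(x_obs), theta0=0.0, n=25)
```

The estimate gets more precise as `n_sims` grows; nothing here depends on being able to write down the distribution of t in closed form, only on being able to sample from it.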
For composite null hypotheses, I guessed that using p(X') = sup_{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ t(X')) would work. Paraphrasing jsteinhardt, if Θ0 = { θ01, ..., θ0n }, you could approximate p(X') using samples from t(X(θ01,ω)), ..., t(X(θ0n,ω)), but it's not clear what to do when Θ0 has infinite cardinality. I see two ways forward.
One is approximating p(X') by doing the above computation over a finite subset of points in Θ0, chosen by gridding or at random. This should give an approximate lower bound on the p-value, since it might miss θ where the observed data look unexceptional. If the approximate p-value leads you to fail to reject the null, you can believe it; if it leads you to reject the null, you might be less sure and might want to continue trying more points in Θ0. Maybe this is what jsteinhardt means by saying it "doesn't terminate"?
The other way forward might be to use features of t and Θ0, which we do have some control over, to simplify the expression sup_{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ c). Say, if t(X(θ,ω)) is convex in θ for any ω and Θ0 is a convex bounded polytope living in some Euclidean space, then the supremum only depends on how P(t(X(θ0,ω)) ≥ c) behaves at a finite number of points.
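The first approach (a finite subset of Θ0) can be sketched as follows. The assumptions are mine and purely illustrative: the data are n i.i.d. Normal(θ, 1) draws, t is the sample mean, and the composite null is Θ0 = [-1, 0], gridded at 11 points. `mc_tail_prob` and `grid_sup_p_value` are hypothetical names.

```python
import random
import statistics

def mc_tail_prob(t_obs, theta0, n, n_sims, rng):
    """Estimate P(t(X(theta0, omega)) >= t_obs), where t is the sample mean
    of n i.i.d. Normal(theta0, 1) draws (a hypothetical model)."""
    hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(theta0, 1.0) for _ in range(n)]
        if statistics.fmean(x) >= t_obs:
            hits += 1
    return hits / n_sims

def grid_sup_p_value(t_obs, grid, n, n_sims=1000, seed=0):
    """Largest estimated tail probability over a finite subset of Theta0.
    Returns (approximate p-value, grid point attaining it)."""
    rng = random.Random(seed)
    estimates = [(mc_tail_prob(t_obs, th, n, n_sims, rng), th) for th in grid]
    return max(estimates)

# Hypothetical composite null Theta0 = [-1, 0], gridded at 11 points.
grid = [-1.0 + 0.1 * k for k in range(11)]
p_hat, theta_hat = grid_sup_p_value(t_obs=0.3, grid=grid, n=25)
```

Because the maximum is taken over a subset of Θ0, adding grid points can only push the underlying quantity up toward the true supremum (Monte Carlo noise aside), which is why a rejection based on a coarse grid deserves less trust than a failure to reject.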
So yeah, things are far more complicated than I claimed, as I now realize after working through it. But you can do sensible things even with a composite null.
Yup I agree with all of that. Nice explanation!
I asked this question on Facebook here, and got some interesting answers, but I thought it would be interesting to ask LessWrong and get a larger range of opinions. I've modified the list of options somewhat.
What explains why some classification, prediction, and regression methods are common in academic social science, while others are common in machine learning and data science?
For instance, I've encountered probit models in some academic social science, but not in machine learning.
Similarly, I've encountered support vector machines, artificial neural networks, and random forests in machine learning, but not in academic social science.
The main algorithms that I believe are common to academic social science and machine learning are the most standard regression algorithms: linear regression and logistic regression.
Possibilities that come to mind:
(0) My observation is wrong and/or the whole question is misguided.
(1) The focus in machine learning is on algorithms that can perform well on large data sets. Thus, for instance, probit models may be academically useful but don't scale up as well as logistic regression.
(2) Academic social scientists take time to catch up with new machine learning approaches. Of the methods mentioned above, random forests and support vector machines were introduced as recently as 1995. Neural networks are older, but their practical implementation is about as recent. Moreover, the practical implementations of these algorithms in the standard statistical software and packages that academics rely on are even more recent. (This relates to point (4)).
(3) Academic social scientists are focused on publishing papers, where the goal is generally to determine whether a hypothesis is true. Therefore, they rely on approaches that have clear rules for hypothesis testing and for establishing statistical significance (see also this post of mine). Many of the new machine learning approaches don't have clearly defined statistical approaches for significance testing. Also, the strength of machine learning approaches lies more in exploration than in testing already formulated hypotheses (this relates to point (5)).
(4) Some of the new methods are complicated to code, and academic social scientists don't know enough mathematics, computer science, or statistics to cope with the methods (this may change if they're taught more about these methods in graduate school, but the relative newness of the methods is a factor here, relating to (2)).
(5) It's hard to interpret the results of fancy machine learning tools in a manner that yields social scientific insight. The results of a linear or logistic regression can be interpreted somewhat intuitively: the parameters (coefficients) associated with individual features describe the extent to which those features affect the output variable. Modulo issues of feature scaling, larger coefficients mean those features play a bigger role in determining the output. Pairwise and listwise R^2 values provide additional insight on how much signal and noise there is in individual features. But if you're looking at a neural network, it's quite hard to infer human-understandable rules from that. (The opposite direction is not too hard: it is possible to convert human-understandable rules to a decision tree and then to use a neural network to approximate that, and add appropriate fuzziness. But the neural networks we obtain as a result of machine learning optimization may be quite different from those that we can interpret as humans). To my knowledge, there haven't been attempts to reinterpret neural network results in human-understandable terms, though Sebastian Kwiatkowski's comment on my Facebook post points to an example where the results of naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as "small bedroom", were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit). But Kwiatkowski's comment also pointed to other instances where the machine's algorithms weren't human-interpretable.
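As a small illustration of the "modulo issues of feature scaling" caveat in (5), here is a sketch on made-up data (the distance feature, its units, and the helper names `ols_slope` and `standardize` are all hypothetical). It fits a one-feature least-squares line with the same feature recorded in metres and in kilometres.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of the least-squares line for a single feature."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = statistics.fmean(v), statistics.pstdev(v)
    return [(a - m) / s for a in v]

rng = random.Random(0)
# Made-up data: the outcome depends on a distance recorded in metres.
dist_m = [rng.uniform(0, 5000) for _ in range(200)]
y = [0.002 * d + rng.gauss(0, 1) for d in dist_m]
dist_km = [d / 1000 for d in dist_m]

slope_m = ols_slope(dist_m, y)    # change in y per metre
slope_km = ols_slope(dist_km, y)  # change in y per kilometre
# After standardizing, the coefficient no longer depends on the unit.
slope_std_m = ols_slope(standardize(dist_m), y)
slope_std_km = ols_slope(standardize(dist_km), y)
```

The raw coefficient changes by a factor of 1000 purely because of the unit, so comparing raw coefficient sizes across features is only meaningful once the features are on comparable scales.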
What's your personal view on my main question, and on any related issues?