ChristianKl comments on [QUESTION]: Academic social science and machine learning - Less Wrong

11 Post author: VipulNaik 19 July 2014 03:13PM


Comment author: ChristianKl 19 July 2014 07:54:05PM 3 points [-]

Given that we are on LessWrong, you missed a core one: Academic social science uses a bunch of frequentist statistics that perform well if your goal is to prove that your thesis has "statistical significance" but that aren't useful for learning what's true. Machine learning algorithms don't give you p-values.

Comment author: gwern 20 July 2014 03:37:01AM 5 points [-]

Do they even give you Bayesian forms of summary values like Bayes factors?

(This is actually a relevant concern for me now: my magnesium self-experiment has finished, and the results are really surprising. To check my linear model, I tried looking at what a random forest might say; it mostly agrees with the analysis... except it also places a lot of importance on a covariate that the linear model does better without: the fit improves when that covariate is discarded entirely. What does this mean? I dunno. There's no statistic like a p-value I can use to interpret this.)
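(For what it's worth, one way to probe what a fitted model thinks matters is permutation importance: refit-free, you shuffle one column and see how much the error degrades. The sketch below is purely illustrative with made-up data, not my actual experiment; it shows how a covariate that is merely correlated with a real predictor can carry no importance in a linear fit even though a forest might latch onto it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends only on x0; x1 is a correlated nuisance covariate.
n = 500
x0 = rng.normal(size=n)
x1 = 0.7 * x0 + rng.normal(scale=0.5, size=n)   # correlated with x0, no direct effect
X = np.column_stack([x0, x1])
y = 2.0 * x0 + rng.normal(scale=0.3, size=n)

def mse_of_fit(X, y):
    """Least-squares fit with intercept; return in-sample mean squared error."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.mean((y - design @ beta) ** 2)

base = mse_of_fit(X, y)
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])        # break this column's link to y
    print(f"x{j}: MSE rises from {base:.3f} to {mse_of_fit(Xp, y):.3f}")
```

Permuting x0 wrecks the fit; permuting x1 barely moves it, even though x1 tracks x0 closely.)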

Comment author: [deleted] 20 July 2014 08:28:01PM 6 points [-]

You can turn any kind of analysis (which returns a scalar) into a p-value by generating a zillion fake data sets assuming the null hypothesis, analysing them all, and checking for what fraction of the fake data sets your statistic exceeds that for the real data set.
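A minimal sketch of that recipe, for a simple null you can sample from (all names and the toy statistic here are my own invention, not any particular analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_p_value(real_stat, simulate_null, statistic, n_sims=10_000):
    """Fraction of null simulations whose statistic meets or exceeds the observed one."""
    null_stats = np.array([statistic(simulate_null()) for _ in range(n_sims)])
    # +1 in numerator and denominator avoids reporting an exact-zero p-value
    return (1 + np.sum(null_stats >= real_stat)) / (1 + n_sims)

# Toy example: null = 30 draws from N(0, 1); statistic = |sample mean|.
real_data = rng.normal(loc=1.0, size=30)        # actually drawn with a shifted mean
stat = lambda x: abs(np.mean(x))
p = mc_p_value(stat(real_data), lambda: rng.normal(size=30), stat)
print(f"Monte Carlo p-value: {p:.4f}")
```

The statistic can be anything scalar — a classifier's test accuracy, a regression coefficient, a forest's importance score — as long as you can simulate datasets under the null.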

Comment author: jsteinhardt 25 July 2014 02:57:37PM 1 point [-]

This doesn't sound true to me. How do you know the underlying distribution of the null when it's just something like "these variables are independent"?

Comment author: othercriteria 28 July 2014 01:32:39AM 1 point [-]

If you're working with composite hypotheses, replace "your statistic" with "the supremum of your statistic over the relevant set of hypotheses".

Comment author: jsteinhardt 29 July 2014 02:04:40AM 1 point [-]

If there are infinitely many hypotheses in the set then the algorithm in the grandparent doesn't terminate :).

Comment author: othercriteria 29 July 2014 05:02:05PM *  1 point [-]

What I was saying was sort of vague, so I'm going to formalize here.

Data is coming from some random process X(θ,ω), where θ parameterizes the process and ω captures all the randomness. Let's suppose that for any particular θ, living in the set Θ of parameters where the model is well-defined, it's easy to sample from X(θ,ω). We don't put any particular structure (in particular, cardinality assumptions) on Θ. Since we're being frequentists here, nature's parameter θ' is fixed and unknown. We only get to work with the realization of the random process that actually happens, X' = X(θ',ω').

We have some sort of analysis t(⋅) that returns a scalar; applying it to the random data gives us the random variables t(X(θ,ω)), which is still parameterized by θ and still easy to sample from. We pick some null hypothesis Θ0 ⊂ Θ, usually for scientific or convenience reasons.

We want some measure of how weird/surprising the value t(X') is if θ' were actually in Θ0. One way to do this, if we have a simple null hypothesis Θ0 = { θ0 }, is to calculate the p-value p(X') = P(t(X(θ0,ω)) ≥ t(X')). This can clearly be approximated using samples from t(X(θ0,ω)).

For composite null hypotheses, I guessed that using p(X') = sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ t(X')) would work. Paraphrasing jsteinhardt, if Θ0 = { θ01, ..., θ0n }, you could approximate p(X') using samples from t(X(θ01,ω)), ..., t(X(θ0n,ω)), but it's not clear what to do when Θ0 has infinite cardinality. I see two ways forward.

One is approximating p(X') by doing the above computation over a finite subset of points in Θ0, chosen by gridding or at random. This should give an approximate lower bound on the p-value, since it might miss θ where the observed data look unexceptional. If the approximate p-value leads you to fail to reject the null, you can believe it; if it leads you to reject the null, you might be less sure and might want to continue trying more points in Θ0. Maybe this is what jsteinhardt means by saying it "doesn't terminate"?

The other way forward might be to use features of t and Θ0, which we do have some control over, to simplify the expression sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ c). Say, if t(X(θ,ω)) is convex in θ for any ω and Θ0 is a convex bounded polytope living in some Euclidean space, then the supremum only depends on how P(t(X(θ0,ω)) ≥ c) behaves at a finite number of points.
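The gridding approach is easy to write down concretely. Here is a toy sketch (model, statistic, and grid all chosen by me for illustration): the composite null is X ~ N(θ, 1) with θ ∈ [−1, 0], the statistic is the sample mean, and we take the sup of Monte Carlo p-values over a grid of null parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_p_at(theta, real_stat, statistic, n=30, n_sims=2000):
    """Monte Carlo p-value under the simple null X ~ N(theta, 1)."""
    sims = np.array([statistic(rng.normal(loc=theta, size=n))
                     for _ in range(n_sims)])
    return np.mean(sims >= real_stat)

# Composite null: theta in [-1, 0]; grid it and take the sup of the p-values.
statistic = np.mean
real_data = rng.normal(loc=0.8, size=30)        # truth lies outside the null set
real_stat = statistic(real_data)
grid = np.linspace(-1.0, 0.0, 11)
p_sup = max(mc_p_at(th, real_stat, statistic) for th in grid)
print(f"sup-over-grid p-value: {p_sup:.4f}")
```

As noted above, this is only an approximate lower bound on the true sup: a finer grid (or a θ between grid points) can only raise it, so rejections are the conclusions to double-check.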

So yeah, things are far more complicated than I claimed, as I now realize working through it. But you can do sensible things even with a composite null.

Comment author: jsteinhardt 30 July 2014 05:09:40AM 0 points [-]

Yup I agree with all of that. Nice explanation!

Comment author: ChristianKl 20 July 2014 07:09:09AM 2 points [-]

I don't have knowledge on random forests in particular but I did learn a little bit about machine learning in bioinformatics classes.

As far as I understand, you can train your machine learning algorithm on one set of data and then see how well it predicts the values of a different set of data. That gives you values for the sensitivity and specificity of your model, from which you can build a receiver operating characteristic (ROC) plot. You can also do things like seeing whether you get a different model if you build it on a different subset of your data. That can tell you whether your model is robust.
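To make that concrete, here is a small self-contained sketch (toy scores I made up standing in for a model's held-out predictions): sensitivity and specificity at one threshold, plus AUC computed in its rank form, i.e. the probability that a random positive outscores a random negative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy held-out scores: class-1 examples tend to score higher than class-0 ones.
n = 400
labels = rng.integers(0, 2, size=n)
scores = labels * 1.0 + rng.normal(size=n)      # stand-in for model output

def roc_auc(labels, scores):
    """AUC as P(random positive score > random negative score), ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold):
    pred = scores >= threshold
    sens = np.mean(pred[labels == 1])            # true positive rate
    spec = np.mean(~pred[labels == 0])           # true negative rate
    return sens, spec

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC = {auc:.3f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Sweeping the threshold and plotting sensitivity against (1 − specificity) traces out the ROC curve; the AUC summarizes it in one number without committing to any threshold.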

The idea of p-values is to decide whether or not your model is true. In general that's not what machine learning folks are concerned with. They know that their model is a model and not reality, and they care about the receiver operating characteristic.

Comment author: IlyaShpitser 21 July 2014 05:32:16PM 1 point [-]

You don't know what you are talking about.

Comment author: othercriteria 22 July 2014 01:03:06PM *  1 point [-]

The grandchild comment suggests that he does, at least to the level of a typical user (though not a researcher or developer) of these methods.