Cyan comments on A Fervent Defense of Frequentist Statistics - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (125)
I've been thinking about what program, exactly, is being defended here, and I think a good name for it might be "prior-less learning". To me, all procedures under the prior-less umbrella have a "minimax optimality" feel to them. Some approaches search for explicitly minimax-optimal procedures; but even more broadly, all such approaches aim to secure guarantees (possibly probabilistic) that the worst-case performance of a given procedure is as limited as possible within some contemplated set of possible states of the world. I have a couple of things to say about such ideas.
First, for the non-probabilistically guaranteed methods: these are relatively few and far between, and for any such procedure it must be ensured that the loss that is being guaranteed is relevant to the problem at hand. That said, there is only one possible objection to them, and it is the same as one of my objections to prior-less probabilistically guaranteed methods. That objection applies generically to the minimaxity of the prior-less learning program: when strong prior information exists but is difficult to incorporate into the method, the results of the method can "leave money on the table", as it were. Sometimes this can be caught and fixed, generally in a post hoc and ad hoc way; sometimes not.
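To make the "money on the table" worry concrete, here is a toy sketch. Everything in it is made up for illustration: the two-action, two-state loss table, the 90/10 prior, and the restriction to non-randomized actions are all assumptions, not anything from the original post.

```python
import numpy as np

# Hypothetical loss table: rows are actions, columns are states of the world.
loss = np.array([
    [1.0, 1.0],   # "safe" action: same loss in every state
    [0.0, 3.0],   # "bold" action: great in state 0, bad in state 1
])

def minimax_action(loss):
    """Index of the action whose worst-case loss over states is smallest."""
    return int(np.argmin(loss.max(axis=1)))

def bayes_action(loss, prior):
    """Index of the action with smallest expected loss under the prior."""
    return int(np.argmin(loss @ np.asarray(prior, dtype=float)))

# Assumed strong prior information favouring state 0.
prior = np.array([0.9, 0.1])

a_mm = minimax_action(loss)
a_bayes = bayes_action(loss, prior)

# Expected loss given up (under the prior) by playing minimax instead of Bayes.
gap = loss[a_mm] @ prior - loss[a_bayes] @ prior
```

If the prior is right, the minimax action pays an extra 0.7 in expected loss here; if the prior is wrong, the minimax action is what caps the damage at 1 rather than 3. Which of those matters more is the disagreement in a nutshell.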
For probabilistically-guaranteed methods, there is an epistemic gap -- in principle -- in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world). But it's obvious that this is merely an in-principle objection -- after all, many such techniques can be and have been successfully applied to learn true things about the world. The important question is then: does the heretofore implicit principle justifying the bridging of this gap differ significantly from the principle justifying Bayesian learning?
Thanks a lot for the thoughtful comment. I've included some of my own thoughts below, along with some clarifications.
Do you think that online learning methods count as an example of this?
I think this is a valid objection, but I'll make two partial counter-arguments. The first is that, arguably, there may be some information that is not easy to incorporate as a prior but is easy to incorporate under some sort of minimax formalism. So Bayes may be forced to leave money on the table in the same way.
A more concrete response is that, often, an appropriate regularizer can incorporate similar information to what a prior would incorporate. I think the regularizer that I exhibited in Myth 6 is one example of this.
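A textbook instance of a regularizer standing in for a prior (not the regularizer from Myth 6, just the most familiar example) is ridge regression: the L2-penalized least-squares solution coincides with the posterior mean under a zero-mean Gaussian prior on the weights, provided the penalty is set to the ratio of noise variance to prior variance. The sketch below checks this numerically; the data, `sigma`, and `tau` are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 2.0  # assumed noise std and prior std
y = X @ w_true + sigma * rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def gaussian_posterior_mean(X, y, sigma, tau):
    """Posterior mean of w for y ~ N(Xw, sigma^2 I) and w ~ N(0, tau^2 I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
    return np.linalg.solve(precision, X.T @ y / sigma**2)

# The penalty that matches the prior is the variance ratio sigma^2 / tau^2.
w_ridge = ridge(X, y, lam=sigma**2 / tau**2)
w_bayes = gaussian_posterior_mean(X, y, sigma, tau)
```

The two estimates agree up to floating-point error. The correspondence only covers the point estimate; the Bayesian analysis also gives a posterior covariance that the regularized fit does not.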
I think it's important to distinguish between two (or maybe three) different types of probabilistic guarantees; I'm not sure whether you would consider all of the below "probabilistic" or whether some of them count as non-probabilistic, so I'll elaborate on each type.
The first, which I presume is what you are talking about, is when the probability is due to some assumed distribution over nature. In this case, if I'm willing to make such an assumption, then I'd rather just go the full-on Bayesian route, unless there's some compelling reason like computational tractability to eschew it. And indeed, there exist cases where, given distributional assumptions, we can infer the parameters efficiently using a frequentist estimation technique, while the Bayesian analog runs into NP-hardness obstacles, at least in some regimes. But there are other instances where the Bayesian method is far cheaper computationally than the go-to frequentist technique for the same problem (e.g. generative vs. discriminative models for syntactic parsing), so I only mean to bring this up as an example.
The second type of guarantee is in terms of randomness generated by the algorithm, without making any assumptions about nature (other than that we have access to a random number generator that is sufficiently independent from what we are trying to predict). I'm pretty happy with this sort of guarantee, since it requires fairly weak epistemic commitments.
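One standard example of this second type, chosen here for illustration rather than taken from the post, is the exponential-weights ("Hedge") forecaster from online learning. The learner picks one of N experts at random each round; for any sequence of losses in [0, 1], its expected regret against the best expert in hindsight is at most sqrt(T ln(N) / 2), where the expectation is only over the learner's own coin flips. The sketch below computes that expectation exactly instead of sampling; the function name and the learning-rate choice are assumptions of the sketch.

```python
import numpy as np

def hedge_expected_regret(losses):
    """Run exponential weights on a T x N array of losses in [0, 1].

    Returns (expected regret, theoretical bound). The expectation is over
    the algorithm's own random choice of expert in each round.
    """
    losses = np.asarray(losses, dtype=float)
    T, N = losses.shape
    eta = np.sqrt(8.0 * np.log(N) / T)  # learning rate tuned for a known horizon T
    cum = np.zeros(N)                   # cumulative loss of each expert so far
    expected_loss = 0.0
    for t in range(T):
        # Shift by the minimum before exponentiating for numerical stability.
        w = np.exp(-eta * (cum - cum.min()))
        p = w / w.sum()                 # distribution the learner samples from
        expected_loss += p @ losses[t]
        cum += losses[t]
    regret = expected_loss - cum.min()
    bound = np.sqrt(T * np.log(N) / 2.0)
    return regret, bound

# A deliberately awkward sequence: the two experts alternate being wrong.
alternating = np.array([[1.0, 0.0], [0.0, 1.0]] * 50)
regret, bound = hedge_expected_regret(alternating)
```

No distribution over the loss sequence is assumed anywhere; the only probability in the statement comes from the learner's sampling step.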
The third type of guarantee is somewhat in the middle: it is given by a partial constraint on the distribution. As an example, maybe I'm willing to assume knowledge of certain moments of the distribution. For sufficiently few moments, I can estimate them all accurately from empirical data, and I can even bound the error with high probability, making no assumption other than independence of my samples. In this case, as long as I'm okay with making the independence assumption, then I consider this guarantee to be pretty good as well (as long as I can bound the error introduced into the method by the inexact estimation of the moments, which there are good techniques for doing). I think the epistemic commitments for this type of method are, modulo making an independence assumption, not really any stronger than those for the second type of method, so I'm also fairly okay with this case.
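As a rough illustration of bounding the error in an estimated moment, here is a sketch using Hoeffding's inequality for the first moment. Note the extra assumption it needs beyond independence: each sample must lie in a known interval [lo, hi]. The function names and the default interval are hypothetical choices for this sketch.

```python
import math
import numpy as np

def hoeffding_radius(n, delta, lo=0.0, hi=1.0):
    """Half-width r with P(|sample mean - true mean| > r) <= delta,
    for n independent samples that each lie in [lo, hi]."""
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def mean_with_bound(samples, delta, lo=0.0, hi=1.0):
    """Empirical mean plus a Hoeffding error radius holding with prob. >= 1 - delta."""
    samples = np.asarray(samples, dtype=float)
    est = float(samples.mean())
    return est, hoeffding_radius(len(samples), delta, lo, hi)

# Example: 200 bounded observations, 95% confidence.
r_200 = hoeffding_radius(200, 0.05)
```

The radius depends only on the sample size, the confidence level, and the range, not on the shape of the distribution, which is what makes the epistemic commitment so light. Higher moments of bounded variables can be handled the same way by applying the bound to powers of the samples.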
If you can cook up examples of this, that would be helpful.