Interesting talk on Bayesians and frequentists

jsteinhardt

I recently started watching an interesting lecture by Michael Jordan on Bayesians and frequentists; he's a pretty successful machine learning expert that takes both views in his work. You can watch it here: http://videolectures.net/mlss09uk_jordan_bfway/. I found it interesting because his portrayal of frequentism is much different than the standard portrayal on lesswrong. It isn't about whether probabilities are frequencies or beliefs, it's about trying to get a good model versus trying to get rigorous guarantees of performance in a class of scenarios. So I wonder why the meme on lesswrong is that frequentists think probabilities are frequencies; in practice it seems to be more about how you approach a given problem. In fact, frequentists seem more "rational", as they're willing to use any tool that solves a problem instead of constraining themselves to methods that obey Bayes' rule.

In practice, it seems that while Bayes is the main tool for epistemic rationality, instrumental rationality should oftentimes be frequentist at the top level (with epistemic rationality, guided by Bayes, in turn guiding the specific application of a frequentist algorithm).

For instance, in many cases I should be willing to, once I have a sufficiently constrained search space, try different things until one of the works, without worrying about understanding why the specific thing I did worked (think shooting a basketball, or riffle shuffling a deck of cards). In practice, it seems like epistemic rationality is important for constraining a search space, and after that some sort of online learning algorithm can be applied to find the optimal action from within that search space. Of course, this isn't true when you only get one chance to do something, or extreme precision is required, but this is not often true in everyday life.

The main point of this thread is to raise awareness of the actual distinction between Bayesians and frequentists, and why it's actually reasonable to be both, since it seems like lesswrong is strongly Bayesian and there isn't even a good discussion of the fact that there are other methods out there.

It depends on what you mean by model selection. If you mean e.g. figuring out whether to use quadratics or cubics, then the standard solution that people cite is to use Bayesian Occam's razor, i.e. compute

p(Cubic | Data)/p(Quadratic | Data) = p(Data | Cubic)/p(Data | Quadratic) * p(Cubic)/p(Quadratic)

Where we compute the probabilities on the right-hand side by marginalizing over all cubics and quadratics. But the number you get out of this will depend strongly on how quickly the tails decay on your distribution over cubics and quadratics, so I don't find this particularly satisfying. (I'm not alone in this, although there are people who would disagree with me or propose various methods for choosing the prior distributions appropriately.)

If you mean something else, like figuring out what specific model to pick out from your entire space (e.g. picking a specific function to fit your data), then you can run into problems like having to compare probability masses to probability densities, or comparing measures with different dimensionality (e.g. densities on the line versus the plane); a more fundamental issue is that picking a specific model potentially ignores other features of your posterior distribution, like how concentrated the probability mass is about that model.

I would say that the most principled way to get a single model out at the end of the day is variational inference, which basically attempts to set parameters in order to minimize the relative entropy between the distribution implied by the parameters and the actual posterior distribution. I don't know a whole lot about this area, other than a couple papers I read, but it does seem like a good way to perform inference if you'd like to restrict yourself to considering a single model at a time.

OK, so you're saying that a big problem in model selection is coming up with good prior distributions for different classes of models, specifically those with different tail decays (it sounds like you think it could also be that the standard bayes framework is missing something). This is an interesting idea which I had heard about before, but didn't understand till now. Thank you for telling me about it.

I would say that when you have a somewhat dispersed posterior it is simply misleading to pick any specific model+parameters as your fit. The correct thin... (read more)

11

Interesting talk on Bayesians and frequentists

11

11

11

Interesting talk on Bayesians and frequentists

11

11