Comment author: TheMajor 20 October 2014 10:30:07PM 1 point [-]

I think that the claim that any prediction can be interpreted in this minimal and consistent framework, without exceptions whatsoever, is a rather strong claim; I don't think I want to claim much more than that (although I do want to add that if we have such a unique framework that is both minimal and complete when it comes to making predictions, then that seems like a very natural choice for Statistics with a capital S).

I don't think we're going to agree about the importance of computability without more context. I agree that every time I try to build myself a nice Bayesian algorithm I run into the problem of uncomputability, but personally I consider Bayesian statistics to be more of a method of evaluating algorithms than a method for creating them (although Bayesian statistics is by no means limited to this!).

As for your other questions: it is important to note that your issues are issues with Bayesian statistics as much as they are issues with any other form of prediction making. To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes' Theorem computable and provide the unknowns on the r.h.s. above. (As mentioned earlier, you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes' Theorem to find the unknowns. At least, I think this is possible; it has worked so far.) And indeed picking the prior and set of hypotheses is not an easy task; this is precisely what leads to different competing algorithms in the field of statistics.

Comment author: othercriteria 21 October 2014 06:21:41PM 0 points [-]

To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes' Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes' Theorem to find the unknowns.

Okay, this is the last thing I'll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and I guess a way to go from posterior to point estimation, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you're updating on the data twice and it's hard to argue that doing this is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesian as a foundation for statistics?

(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I'm giving up. It is supremely dishonest to pretend there's no trade-off present in this situation. And a Bayes-first education doesn't even give you the concepts to see what you gain and what you lose by being a Bayesian.)

Comment author: IlyaShpitser 16 October 2014 02:30:58PM *  3 points [-]

That's an interesting example, thanks for linking it. I read it carefully, and also some of Robins/Ritov CODA paper:

http://www.biostat.harvard.edu/robins/coda.pdf

and I think I get it. The example is phrased in the language of sampling/missing data, but for those in the audience familiar w/ Pearl, we can rephrase it as a causal inference problem. After all, causal inference is just another type of missing data problem.

We have a treatment A (a drug), and an outcome Y (death). Doctors assign A to some patients, but not others, based on their baseline covariates C. Then some patients die. The resulting data is an observational study, and we want to infer from it the effect of drug on survival, which we can obtain from p(Y | do(A=yes)).

We know in this case that p(Y | do(A=yes)) = sum_{C} p(Y | A=yes, C) p(C) (this is just what "adjusting for confounders" means).
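The adjustment formula can be checked directly on made-up data; everything below (the binary confounder, the propensity values, the outcome model) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational study: binary confounder C, treatment A, outcome Y.
n = 100_000
C = rng.binomial(1, 0.5, n)
# Doctors treat patients with C=1 more often.
pA = np.where(C == 1, 0.8, 0.2)
A = rng.binomial(1, pA)
# Invented outcome model: treatment helps, C hurts.
pY = 0.2 + 0.3 * A - 0.1 * C
Y = rng.binomial(1, pY)

# Naive (confounded) estimate: E[Y | A=1]
naive = Y[A == 1].mean()

# Adjustment formula: sum_C p(Y | A=1, C) p(C)
adjusted = sum(
    Y[(A == 1) & (C == c)].mean() * (C == c).mean()
    for c in (0, 1)
)
print(naive, adjusted)  # adjusted ≈ 0.2 + 0.3 - 0.1*0.5 = 0.45, naive is biased
```

The naive estimate is biased because treated patients disproportionately have C=1; the adjusted estimate recovers the interventional quantity.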

If we then had a parametric model for E[Y | A=yes,C], we could just fit that model and average (this is "likelihood based inference.") Larry and Jamie are worried about the (admittedly adversarial) situation where maybe the relationship between Y and A and C is really complicated, and any specific parametric model we might conceivably use will be wrong, while non-parametric methods may have issues due to the curse of dimensionality in moderate samples. But of course the way we specified the problem, we know p(A | C) exactly, because doctors told us the rule by which they assign treatments.

Something like the Horvitz/Thompson estimator which uses this (correct) model only, or other estimators which address issues with the H/T estimator by also using the conditional model for Y, may have better behavior in such settings. But importantly, these methods are exploiting a part of the model we technically do not need (p(A | C) does not appear in the above "adjustment for confounders" expression anywhere), because in this particular setting it happens to be specified exactly, while the parts of the models we do technically need for likelihood based inference to work are really complicated and hard to get right at moderate samples.
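A self-contained sketch of the Horvitz/Thompson estimator on invented data (all the numbers below are made up for illustration): it uses only the known assignment rule p(A | C), and no model at all for p(Y | A, C):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical study where the treatment-assignment rule p(A | C) is known exactly.
n = 100_000
C = rng.binomial(1, 0.5, n)
pA = np.where(C == 1, 0.8, 0.2)       # the known propensity score p(A=1 | C)
A = rng.binomial(1, pA)
Y = rng.binomial(1, 0.2 + 0.3 * A - 0.1 * C)

# Horvitz/Thompson (inverse-probability-weighted) estimate of E[Y | do(A=1)]:
# reweight treated outcomes by 1 / p(A=1 | C), touching no outcome model.
ht = np.mean(Y * (A == 1) / pA)
print(ht)  # should be close to the true interventional mean, 0.45
```

Note that `pA` appears nowhere in the adjustment-for-confounders expression, which is exactly the point being made: the estimator exploits a part of the model that likelihood-based inference does not need.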

But these kinds of estimators are not Bayesian. Of course arguably this entire setting is one Bayesians don't worry about (but maybe they should? These settings do come up).


The CODA paper apparently stimulated some subsequent Bayesian activity, e.g.:

http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/techreport2007_6326[0%5D.pdf

So, things are working as intended :).

Comment author: othercriteria 16 October 2014 05:58:06PM 2 points [-]

You're welcome for the link, and it's more than repaid by your causal inference restatement of the Robins-Ritov problem.

Of course arguably this entire setting is one Bayesians don't worry about (but maybe they should? These settings do come up).

Yeah, I think this is the heart of the confusion. When you encounter a problem, you can turn the Bayesian crank and it will always do the Right thing, but it won't always do the right thing. What I find disconcerting (as a Bayesian drifting towards frequentism) is that it's not obvious how to assess the adequacy of a Bayesian analysis from within the Bayesian framework. In principle, you can do this mindlessly by marginalizing over all the model classes that might apply, maybe? But in practice, a single model class usually gets picked by non-Bayesian criteria like "does the posterior depend on the data in the right way?" or "does the posterior capture the 'true model' from simulated data?". Or a Bayesian may (rightly or wrongly) decide that a Bayesian analysis is not appropriate in that setting.

Comment author: TheMajor 16 October 2014 12:07:15AM 2 points [-]

But nobody, least of all Bayesian statistical practitioners, does this.

Well, obviously. The same goes for physicists: nobody (other than some highly specialised teams working at particle accelerators) uses the Standard Model to compute the predictions of their models. Or for computer science: most computer scientists don't write code at the binary level, or explicitly give commands to individual transistors. Or chemists: just how many of the reaction equations do you think are being checked by solving the quantum mechanics? But just because the underlying theory doesn't give as good a results-vs-time trade-off as some simplified model does not mean that the underlying theory can be ignored altogether (in my particular examples above, note that the respective researchers do study the fundamentals, but then hardly ever need to apply them!). By studying the underlying (often mathematically elegant) theory first, one can later look at the messy real-world examples through the lens of this theory, and see how the tricks used in practice mostly make use of, but sometimes partly disagree with, the overarching theory. This is why studying theoretical Bayesian statistics is a good investment of time: afterwards all other parts of statistics become more accessible and intuitive, as the specific methods can be fitted into the overarching theory.

Of course, if you actually want to apply statistical methods to a real-world problem, I think that the frequentist toolbox is one of the best options available (in terms of results vs. effort). But it becomes easier to understand these algorithms (where they make which assumptions, where they use shortcuts/substitutions to approximate for the sake of computation, exactly where, how and why they might fail, etc.) if you become familiar with the minimal consistent framework for statistics, which to the best of my knowledge is Bayesian statistics.

Comment author: othercriteria 16 October 2014 01:32:48AM 2 points [-]

Have you seen the series of blog posts by Robins and Wasserman that starts here? In problems like the one discussed there (such as the high-dimensional ones that are commonly seen these days), Bayesian procedures, and more broadly any procedures that satisfy the likelihood principle, just don't work. The procedures that do work, according to frequentist criteria, do not arise from the likelihood, so it's hard to see how they could be approximations to a Bayesian solution.

You can also see this situation in the (frequentist) classic Theory of Point Estimation written by Lehmann and Casella. The text has four central chapters: "Unbiasedness", "Equivariance", "Average Risk Optimality", and "Minimaxity and Admissibility". Each of these introduces a principle for the design of estimators and then shows where this principle leads. "Average Risk Optimality" leads to Bayesian inference, but also to Bayes-Lite methods like empirical Bayes. But each of the other three chapters leads to its own theory, with its own collection of methods that are optimal under that theory. Bayesian statistics is an important and substantial part of the story told in that book, but it's not the whole story. Said differently, Bayesian statistics may be a framework for Bayesian procedures and a useful way of analyzing non-Bayesian statistics, but it is not the framework for all of statistics.

Comment author: TheMajor 15 October 2014 09:08:50PM 1 point [-]

I'm afraid I don't understand. (Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions - any quantity that behaves like we want a probability to behave can be described by Bayesian statistics. Therefore learning this general framework is useful when later looking at applications and most notably approximations. For what reasons do you suggest studying the approximation algorithms before studying the underlying framework?

Also you mention 'Bayesian procedures', I would like to clarify that I wasn't referring to any particular Bayesian algorithm but to the complete study of (uncomputable) ideal Bayesian statistics.

Comment author: othercriteria 15 October 2014 11:30:43PM 3 points [-]

(Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions - any quantity that behaves like we want a probability to behave can be described by Bayesian statistics.

But nobody, least of all Bayesian statistical practitioners, does this. They encounter data, get familiar with it, pick/invent a model, pick/invent a prior, run (possibly approximate) inference of the model against the data, check whether inference is doing something reasonable, and jump back to an earlier step and change something if it isn't. After however long this takes (if they don't give up), they might make some decision based on the (possibly approximate) posterior distribution they end up with. This decision might involve taking some actions in the wider world and/or writing a paper.

This is essentially the same workflow a frequentist statistician would use, and it's only reasonable that a lot of the ideas that work in one of these settings would be useful, if not obvious or well-motivated, in the other.

I know that philosophical underpinnings and underlying frameworks matter but to quote from a recent review article by Reid and Cox (2014):

A healthy interplay between theory and application is crucial for statistics, as no doubt for other fields. This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods. The very word foundations may, however, be a little misleading in that it suggests a solid base on which a large structure rests for its entire security. But foundations in the present context equally depend on and must be tested and revised in the light of experience and assessed by relevance to the very wide variety of contexts in which statistical considerations arise. It would be misleading to draw too close a parallel with the notion of a structure that would collapse if its foundations were destroyed.

Comment author: Lumifer 15 October 2014 06:30:49PM *  3 points [-]

I would advise looking into frequentist statistics before studying Bayesian statistics.

Actually, if you have the necessary math background, it will probably be useful to start by looking at why and how the frequentists and the Bayesians differ.

Some good starting points, in addition to Bayes, are Fisher information and Neyman-Pearson hypothesis testing. This paper by Gelman and Shalizi could be interesting as well.

Comment author: othercriteria 15 October 2014 07:04:46PM *  0 points [-]

Thanks for pointing out the Gelman and Shalizi paper. Just skimmed it so far, but it looks like it really captures the zeitgeist of what reasonably thoughtful statisticians think of the framework they're in the business of developing and using.

Plus, their final footnote, describing their misgivings about elevating Bayesianism beyond a tool in the hypothetico-deductive toolbox, is great:

Ghosh and Ramamoorthi (2003, p. 112) see a similar attitude as discouraging inquiries into consistency: ‘the prior and the posterior given by Bayes theorem [sic] are imperatives arising out of axioms of rational behavior – and since we are already rational why worry about one more’ criterion, namely convergence to the truth?

Comment author: TheMajor 15 October 2014 07:09:35AM 5 points [-]

The two examples you give (Bayesian statistics and calculus) are very good ones; I would definitely recommend becoming familiar with these. I am not sure how much is covered by the 'calculus' label, but I would recommend trying to understand on a gut level what a differential equation means (this is simpler than it might sound; solving them, on the other hand, is hard and often tedious). I believe vector calculus (linear algebra) and its combination with differential equations (linear ODEs of dimension at least two) are also covered by 'calculus'? Again, having the ability to solve them isn't that important in most fields (in my limited experience), but grasping what exactly is happening is very valuable.

If you are wholly unfamiliar with statistics then I would also advise looking into frequentist statistics after having studied Bayesian statistics - frequentist tools provide very accurate and easily computable approximations to Bayesian inference, and being able to recognise/use these is useful in most sciences (from social science all the way to theoretical physics).

Comment author: othercriteria 15 October 2014 06:18:50PM *  5 points [-]

I would advise looking into frequentist statistics before studying Bayesian statistics. Inference done under Bayesian statistics is curiously silent about anything besides the posterior probability, including whether the model makes sense for the data, whether the knowledge gained about the model is likely to match reality, etc. Frequentist concepts like consistency, coverage probability, ancillarity, model checking, etc., don't just apply to frequentist estimation; they can be used to assess and justify Bayesian procedures.
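Coverage probability is a concrete example of a frequentist check on a Bayesian procedure. A minimal sketch, with invented numbers (a coin-bias problem under a flat Beta(1,1) prior): does a 90% posterior credible interval actually cover the true parameter about 90% of the time under repeated sampling?

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)

# Invented setup: true bias 0.3, 50 flips per experiment, 2000 repeated experiments.
true_p, n_flips, n_reps = 0.3, 50, 2000
covered = 0
for _ in range(n_reps):
    heads = rng.binomial(n_flips, true_p)
    # Equal-tailed 90% credible interval from the Beta(1 + heads, 1 + tails) posterior.
    lo, hi = beta.ppf([0.05, 0.95], 1 + heads, 1 + n_flips - heads)
    covered += lo <= true_p <= hi
print(covered / n_reps)  # roughly 0.9 if frequentist coverage is adequate
```

In this well-behaved conjugate setting the credible interval does have near-nominal coverage; the point of the surrounding discussion is that this is something to be checked, not something guaranteed by the Bayesian machinery itself.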

If anything, Bayesian statistics should just be treated as a factory that churns out estimation procedures. By a corollary of the complete class theorem, this is also the only way you can get good estimation procedures.

ETA: Can I get comments in addition to (or instead of) down votes here? This is a topic I don't want to be mistaken about, so please tell me if I'm getting something wrong. Or rather if my comment is coming across as "boo Bayes", which calls out for punishment.

Comment author: Gunnar_Zarncke 15 October 2014 07:37:04AM 5 points [-]

My document of life-lessons spits out this (it has a focus on teaching children, but it aims high):

Key math insights of general value:

  • What is a number really - Peano's axioms

  • Equality (do the same to both sides, equivalence classes)

  • Negation and inversion (reversing any relationship in general)

  • Variables, functions, domains

  • Continuous functions

  • Limits, infinities (leads e.g. to real analysis)

  • Postponing operations (fractions, 'primitive functions', lazy evaluation)

  • Probability (enumerating paths that can be taken fractionally, Bayes rule)

  • Tracking errors (dealing with two or more functions/results at the same time)

  • Induction, proofs

  • Transformation into another space (Fourier, dual spaces, radix sort)

  • Representations of sequences and trees and graphs

  • Decomposition of plans and algorithms (O-notation)

  • Encoding of plans as numbers (Turing, Curry, Gödel)

The idea is to see the patterns behind the patterns (link in Einstein's Speed).
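The Bayes rule item above can be made concrete with a tiny calculation (all numbers invented for illustration): a test with 95% sensitivity and 90% specificity for a condition with 1% prevalence.

```python
# Bayes rule: p(disease | positive) = p(positive | disease) p(disease) / p(positive)
prior = 0.01              # p(disease), the base rate
sensitivity = 0.95        # p(positive | disease)
false_positive = 0.10     # p(positive | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(posterior)  # ≈ 0.088: most positive results are still false positives
```

The counterintuitive smallness of the posterior is exactly the kind of "pattern behind the pattern" worth teaching early.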

Comment author: othercriteria 15 October 2014 05:28:14PM 2 points [-]

This is really good and impressive. Do you have such a list for statistics?

Comment author: Douglas_Knight 15 October 2014 03:58:24AM 1 point [-]

The statement about percolation is true quite generally, not just for Erdős-Rényi random graphs, but also for the square grid. Above the critical threshold, the giant component is a positive proportion of the graph, and below the critical threshold, all components are finite.
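For the Erdős-Rényi case this is easy to see numerically. A quick simulation sketch (parameters invented for illustration), measuring the largest-component fraction of G(n, c/n) above and below the critical mean degree c = 1, using union-find:

```python
import numpy as np

rng = np.random.default_rng(3)

def largest_component_fraction(n, c):
    """Fraction of nodes in the largest component of G(n, c/n)."""
    parent = list(range(n))

    def find(x):
        # Root of x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sample roughly the right number of edges; the occasional self-loop or
    # duplicate pair is harmless for this sketch.
    m = rng.binomial(n * (n - 1) // 2, c / n)
    for u, v in rng.integers(0, n, size=(m, 2)):
        parent[find(u)] = find(v)

    roots = [find(x) for x in range(n)]
    _, counts = np.unique(roots, return_counts=True)
    return counts.max() / n

print(largest_component_fraction(10_000, 1.5))  # above threshold: a clear positive fraction
print(largest_component_fraction(10_000, 0.5))  # below threshold: a vanishing fraction
```

Above the threshold the fraction concentrates around the positive solution of s = 1 - e^{-cs}; below it, the largest component is only logarithmic in n.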

Comment author: othercriteria 15 October 2014 01:42:06PM 0 points [-]

The example I'm thinking about is a non-random graph on the square grid where west/east neighbors are connected and north/south neighbors aren't. Its density is asymptotically right at the critical threshold and could be pushed over by adding additional west/east non-neighbor edges. The connected components are neither finite nor giant.

Comment author: Sherincall 13 October 2014 07:47:14PM 3 points [-]

I've just enrolled in a 1 year applied mathematics Master's program. The program is easy, and I'm mostly doing it because it costs me nothing and a Master's degree is a good asset to have. I plan on working full time and not attending any classes, and I'm certain I still won't have any problems there.

However, coming from a CE background, I have no idea what to do for my thesis. I want it to be something from the fields of AI or Probability/Statistics, but I'm out of ideas. So, any suggestions as to what may be either fun or useful (preferably both) in those areas, that I should dedicate my spare time to?

Comment author: othercriteria 14 October 2014 05:11:21AM 4 points [-]

If you want a solid year-long project, find a statistical model you like and figure out how to do inference in it with variational Bayes. If this has been done, change finite parts of the model into infinite ones until you reach novelty or the model is no longer recognizable/tractable. At that point, either try a new model or instead try to make the VB inference online or parallelizable. Maybe target a NIPS-style paper and a ~30-page technical report in addition to whatever your thesis will look like.

And attend a machine learning class, if offered. There's a lot of lore in that field and you'll miss out if you do the read-the-book-work-each-problem thing that is alleged to work in math.

Comment author: Yvain 12 October 2014 03:35:52AM 9 points [-]

Anything I do with gender and sex is going to have lots of people yell at me. But if I keep it the same, it will be the same people as last year and I won't make new enemies.

Comment author: othercriteria 13 October 2014 02:58:06AM 2 points [-]

But to all of us perched on the back of Cthulhu, who is forever swimming left, is it the survey that will seem fixed and unchanging from our moving point of view?
