Andrew Gelman recently linked a new article entitled "Induction and Deduction in Bayesian Data Analysis." At his blog, he also described some of the comments made by reviewers and his rebuttal/discussion of those comments. It is interesting that he departs significantly from the common induction-based view of Bayesian approaches. As a practitioner myself, I am happiest about the discussion on model checking -- something one can definitely do in the Bayesian framework but which almost no one does. Model checking is to Bayesian data analysis as unit testing is to software engineering.

Added 03/11/12
Gelman has a new blog post today discussing another reaction to his paper and giving some additional details. Notably:

The basic idea of posterior predictive checking is, as they say, breathtakingly simple: (a) graph your data, (b) fit your model to data, (c) simulate replicated data (a Bayesian can always do this, because Bayesian models are always “generative”), (d) graph the replicated data, and (e) compare the graphs in (a) and (d). It makes me want to scream scream scream scream scream when statisticians’ philosophical scruples stop them from performing these five simple steps (or, to be precise, performing the simple steps (a), (c), (d), and (e), given that they’ve already done the hard part, which is step (b)).
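To make those five steps concrete, here is a minimal toy sketch of my own; the Poisson data and the conjugate Gamma prior below are invented for illustration, not anything from Gelman's paper:

```python
# Toy posterior predictive check: observe data, fit a simple Bayesian
# model, simulate replications, and compare graphs. All numbers invented.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.poisson(lam=4.0, size=50)           # (a) the observed data (toy example)

# (b) fit the model: Poisson likelihood with a Gamma(a0, b0) prior on the rate
a0, b0 = 2.0, 1.0                           # assumed prior hyperparameters
a_post, b_post = a0 + y.sum(), b0 + len(y)  # conjugate posterior Gamma(a_post, b_post)

# (c) simulate replicated data sets from the posterior predictive
n_rep = 20
lam_draws = rng.gamma(a_post, 1.0 / b_post, size=n_rep)
y_rep = np.array([rng.poisson(lam, size=len(y)) for lam in lam_draws])

# (a), (d), (e): graph the data, graph the replications, compare
fig, axes = plt.subplots(3, 7, figsize=(12, 5), sharex=True)
axes = axes.ravel()
axes[0].hist(y, bins=range(0, 12), color="black")
axes[0].set_title("observed")
for ax, rep in zip(axes[1:], y_rep):
    ax.hist(rep, bins=range(0, 12))
plt.tight_layout()
plt.show()
```

If the observed histogram looks like an outlier among the replications, something about the model deserves a closer look.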
13 comments

"Model checking is to Bayesian data analysis as unit testing is to software engineering." Could you go into more detail?

[anonymous]

In software engineering (I'm speaking only as someone who writes software as needed and has friends in professional software development, not as an expert myself), one of the problems is that an engineer or analyst will prematurely believe they have solved a particular software problem. Just because their code compiles and gives the result they expected on the simple inputs they can think of off the top of their head doesn't mean it is ready to be shipped to the customer. For that, one needs to design suites of tests that systematically check for bugs and mistakes at various levels of resolution. You need to have a game plan for designing these tests long before you finish writing the code, and you need to ruthlessly apply the standards of the tests across the board to your finished products.
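To make "unit testing" concrete, here's a toy sketch; the function and the test cases are invented for illustration, not taken from any real project:

```python
# Minimal example of a unit-tested function (invented for illustration).
import unittest

def normalize(weights):
    """Rescale a list of nonnegative weights so they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

class TestNormalize(unittest.TestCase):
    def test_simple_case(self):
        # the kind of input you try "off the top of your head"
        self.assertEqual(normalize([1, 1, 2]), [0.25, 0.25, 0.5])

    def test_degenerate_input(self):
        # the kind of input that only a systematic test plan catches
        with self.assertRaises(ValueError):
            normalize([0, 0, 0])

if __name__ == "__main__":
    unittest.main()
```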

In many places (academia is a large one; government labs, where I have also worked, are another) this sort of unit testing is just ignored. Analysts write software until they personally feel like it's done. One other human being might scan their eyes over it before it is declared "check-in ready" and used to tell Air Force policy experts how to interpret radar results. I think we should all understand that this is really bad and happens many thousands of times every single day. I can't tell you how many times I have discovered bugs in academic software upon which award-winning research papers were based. Journals rarely require you to submit the code, often you only have to "describe your algorithms mathematically" in the papers, and a ton of important subjective choices that some researcher made when analyzing the data get lost in translation.

I'm merely drawing a comparison between this and Bayesian data analysis. A lot of researchers tend to automatically believe their analysis is unassailably "rational" just because they had the foresight to use Bayesian methods rather than standard hypothesis testing. But this isn't so. Extremely implausible prior distributions can to a large extent be detected. Independence assumptions in the model can also be checked by bootstrapping useful test statistics from the posterior. These are simple things that almost everyone should be doing in a regular, ruthlessly systematic way any time they want to declare a statistical success in their research.
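As a toy illustration of catching an implausible prior, one can simulate from the prior predictive distribution and look at what the model claims is possible before any data arrive; the "heights" setup and the numbers below are invented:

```python
# Toy prior predictive check for an implausible prior (invented example).
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1000.0, size=10_000)   # a lazy "uninformative" prior on mean height (cm)
heights = rng.normal(mu, 10.0)              # prior predictive draws of an individual height

print("fraction of prior draws implying negative mean height:", np.mean(mu < 0))
print("fraction of simulated heights above 3 metres:", np.mean(heights > 300))
# If most of the prior mass sits on physically impossible values, the
# prior deserves a second look before the data are ever analyzed.
```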

As far as I'm concerned, model checking and unit testing are the "hygiene" of computational research.

However, there is one important difference between the software world and the statistical modelling world. While it is sometimes possible to produce a "bug-free" piece of software, it is never possible to formulate a statistical model that captures reality exactly; as Box said, "all models are wrong." The challenge in statistical modelling is to find a model which is the best trade-off between convenience (conceptual, mathematical, or computational) and verisimilitude. "Model checking" of some form or another is essential to this process; but it doesn't necessarily have to be standardized in a form analogous to a unit test. An alternative means toward the same end is an increased emphasis on model selection for different models of the same data, which can be put into a formalized statistical framework, although this is difficult to do in practice and hence is not very commonly done at present.

[anonymous]

While I think your comment is generally true, I feel that it's almost a disservice to emphasize this point. A huge number of problems in the statistical sciences could be overcome by just a tiny bit of uniformity among model checking procedures. If it were seen as "bad form" to submit a journal article without doing some model expansion checks, or without providing test-statistic analysis that goes beyond classical p-values, then the quality of publications would jump up. Even uniformity of the classical p-value testing would be helpful. I don't really like the use of classical p-values and test statistics, but they do say something about model validity. However, even in that domain, the test statistics are not always computed correctly; the way in which they were computed is rarely reported; and there are tons of systematic errors made by folks unfamiliar with the theory behind the statistical tests. Even if we had to continue using classical hypothesis testing, just getting people to apply the tests in a correct, systematic way would be a huge improvement. I would happily wager eating a stick of butter to get a world in which I didn't have to read statistical results and in my head be thinking, "Okay, how did these authors mess this up? Are they reporting the right thing? Did they just keep gathering data until they reached a significance level they wanted? Etc..."

Essentially, I think your comparison breaks down in one important way. While it may be possible to write software that is bug free, it's not as easy to prove that your code is as efficient as it needs to be, or that it will generalize to new use cases. Unit testing definitely focuses on proving correctness and bug-free-ness. But another, less directly objective part of it is proving that your code is well-suited to the computational task. Why did you pick the algorithm, design pattern, or language that you chose? If you truly design unit tests well, then some of the tests will also address slightly higher level issues like these, which are closer to the model checking issues.

Also, I think the flip-side to the Box quote is just as important: "All models are right; most are useless." This is discussed here.

Quote from page:

And in any reasonably large problem I will at some point discard a model and replace it with something new.

It's worth noting that a rigorous Bayesian approach does not license such a model-switch. The strict Bayesian starts with a prior, observes some evidence, and concludes with a new set of probabilities. By using this strategy Gelman is implicitly employing a vague, undefinable meta-model that exists only in his own brain. This isn't terrible, I suppose, if he gets good results, but it does mean that statistics is still as much an art as a science.

[This comment is no longer endorsed by its author]

Gelman quotes from Wikipedia:

Bayesian inference uses aspects of the scientific method, which involves collecting evidence that is meant to be consistent or inconsistent with a given hypothesis. As evidence accumulates, the degree of belief in a hypothesis ought to change. With enough evidence, it should become very high or very low. . . . Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed and calculates a numerical estimate of the degree of belief in the hypothesis after evidence has been observed. . . . Bayesian inference usually relies on degrees of belief, or subjective probabilities, in the induction process and does not necessarily claim to provide an objective method of induction.

He then writes:

This does not describe what I do in my applied work. I do go through models, sometimes starting with something simple and building up from there, other times starting with my first guess at a full model and then trimming it down until I can understand it in the context of data. And in any reasonably large problem I will at some point discard a model and replace it with something new (see Gelman and Shalizi 2011a,b, for more detailed discussion of this process and how it roughly fits in to the philosophies of Popper and Kuhn). But I do not make these decisions on altering, rejecting, and expanding models based on the posterior probability that a model is true. Rather, knowing ahead of time that my assumptions are false, I abandon a model when a new model allows me to incorporate new data or to fit existing data better.

I don't disagree with Gelman's statistical practice, but I disagree with his justification. Statistical models are models of our uncertainty about a particular problem. Model checks are a great way to check how well the model is actually modeling our uncertainty, and building models up in the fashion Gelman suggests is a great way to find a reasonable model for our uncertainty. Posterior model probabilities - when they can be calculated - are one way of assessing when we have found a better model, but they aren't the only way (and aren't necessarily the best way either).
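For reference (standard notation, nothing specific to Gelman's paper), the posterior probability of a candidate model $M_k$ is

$$p(M_k \mid y) \propto p(y \mid M_k)\, p(M_k), \qquad p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,$$

and the marginal-likelihood integral on the right is exactly what so often makes these probabilities hard to calculate in practice.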

If we 100% knew our priors (both the prior distribution and the rule by which we update on our prior distribution, i.e. the likelihood), then Gelman's methods would be useless. Just do the Bayesian update! But we don't actually know our priors, so we must take care to model them as accurately as we can, and Gelman's methods are pretty good at helping us do this.

[anonymous]

Quoting Gelman himself on page 77 of the linked paper:

If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.

In the full context of the paper, Gelman is noting this as a problem with standard Bayesian analysis. He doesn't argue, as I'm arguing, that we're trying to model our priors or the structure of our uncertainty, i.e. that we're trying to approximate the fully Bayesian answer.

[anonymous]

After going back and re-reading this, I realized your comments are more prescient than I gave them credit for in the past. I'm now struggling with the Gelman-Shalizi article (link). Do you know of any LessWrong sources that discuss this? I need to really sit back and think, but it seems to me that Gelman and Shalizi are making some serious mistakes here. And they are two of the best practitioners I know of. That scares me a great deal.

I don't know of any sources, short of an allusion or two in my comment history, but I don't recommend digging for them. One point I think I've made in the past is that an implication of viewing statistics as a method of modeling and thus approximating our uncertainty is that Gelman's posterior predictive checks have limits, though they're still useful. If posterior predictive checking tells you some part of your model is wrong but you otherwise have good reason to believe that part is an accurate representation of your true uncertainty, it might still be a good idea to leave that part alone.

As a practitioner myself, I am happiest about the discussion on model checking -- something one can definitely do in the Bayesian framework but which almost no one does.

Can you expand on that? I don't see Gelman addressing the problem in that paper. In fact, he booms his inability to do so, and says that no-one else can either. And the chapter on model checking in his book, "Bayesian Data Analysis", just labels the process "Judgement".

[anonymous]

I disagree -- are you referring to chapter 6 of BDA? In that chapter he spells out good ways of addressing the issue: the Bayesian analogs of classical hypothesis-testing statistics. Most important, though Gelman doesn't use this language, is the idea of devising test statistics that would falsify your model and then using bootstrapping methods to compare those test statistics, computed on data simulated from the posterior, to the same statistics computed on the observed data. In my own view, this is a shining success of Bayesian methods over frequentist methods. Bayesian analysis might give you intractable posterior distributions, and the test statistics that matter for falsifiability are hardly ever going to have convenient forms like the F-, t-, or chi-squared distributions naively advertised in the classical approaches. But computational methods like Metropolis-Hastings/Gibbs sampling and other advances in MCMC still let you bootstrap the test statistic even when its distribution is impossibly complicated. I think this advantage of Bayesian methods deserves to be more widely understood. The other notions mentioned in chapter 6 of BDA are graphical data analysis and measures for model expansion / predictive accuracy.
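To make the procedure concrete, here is a toy sketch -- not anything from BDA itself -- with a deliberately simple normal-mean model, a Metropolis sampler, and a lag-1 autocorrelation test statistic chosen because it would falsify an independence assumption; all numbers are invented:

```python
# Toy posterior predictive p-value via MCMC (invented example).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=200)            # stand-in for the observed data
sigma = 1.0                                   # scale assumed known, for simplicity

def log_post(mu):
    # flat prior on mu, so the log posterior is just the log likelihood
    return -0.5 * np.sum((y - mu) ** 2) / sigma ** 2

def lag1_autocorr(x):
    # the falsifying test statistic T(y)
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

# Metropolis sampling of the posterior for mu
mu, draws = 0.0, []
for _ in range(5000):
    proposal = mu + rng.normal(0.0, 0.2)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(mu):
        mu = proposal
    draws.append(mu)
draws = np.array(draws[1000:])                # discard burn-in

# Compare T on posterior-predictive replications to T on the observed data
T_obs = lag1_autocorr(y)
T_rep = np.array([lag1_autocorr(rng.normal(m, sigma, size=len(y)))
                  for m in draws[::10]])
print("posterior predictive p-value:", np.mean(T_rep >= T_obs))
```

An extreme p-value (near 0 or 1) says the observed data look unlike what the fitted model produces, at least as measured by that statistic.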

In the paper, it seemed that the part Gelman refused to address was the way in which the addition of model checking / going back to the drawing board ruined the logical coherence of the more usual inductive Bayesian arguments. I agree that he copped out here and didn't attempt to address the underlying philosophical problem -- all he did was point out that each of the other major alternatives has basically the same coherence problem, including inductive Bayes.

Yes, that's the chapter.

For a Bayesian to relinquish his original hypothesis that the distribution belonged to some family, he needs both a way to notice when the data are far too unlikely to have been produced from any member of that family at all, and a way to choose a different family that will fit better. The likelihood of the data given the prior distribution over the family's parameters is straightforwardly computable (or approximated by calculating various test statistics, when the question you're asking is "is this family of models completely wrong?"), but the process of choosing a new model is rather more murky. The small-worlders talk about judgement, reasonableness, and plausibility, while the large-worlders can at best talk about bounded-rational approximations to the universal prior, which in practice comes down to the same thing.