What math is essential to the art of rationality?

Capla

I will give it a shot (I recall reading a well-written explanation elsewhere on LW, and I don't expect to be as clear as what I read there).

In any estimation or prediction setting we are interested in making accurate probabilistic claims about the behaviour of our system of study. In particular we would like to give a description of how the system will behave in the future (for example: 'this drug cures patients 30% of the time'). This is captured by the posterior probability distribution.

Now if we have any algorithm whatsoever that makes statements about our situation and reliably gives the right answers (for example, most frequentist algorithms), then that algorithm must do something to convert data into predictions, i.e. it must spit out something that can be rewritten as an (ideal) posterior distribution (which captures all our knowledge and therefore our predictions).

Now Bayes' Law gives a formula for this posterior distribution: having observed data D, the posterior probability of any hypothesis A is P(A|D) = P(D|A)P(A)/P(D) (here P(D) is often written as sum(P(D|B)P(B)), where B runs over a complete set of mutually exclusive hypotheses). In order to give any accurate results whatsoever, an algorithm has to produce results that can be interpreted as the quantity above, P(A|D). This is starting to look Bayesian.
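To make the formula concrete, here is a minimal sketch in Python for the drug example. Everything specific in it is an assumption made for illustration: the three candidate cure rates, the equal priors, the data (3 cures in 10 patients), and the helper names `posterior` and `binomial_likelihood`.

```python
from math import comb


def posterior(priors, likelihoods):
    """Return P(A|D) for every hypothesis A, given P(A) and P(D|A)."""
    # P(D) = sum over B of P(D|B) P(B), with B running over all hypotheses.
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


def binomial_likelihood(rate, cured, patients):
    """P(D|A): probability of `cured` cures in `patients` trials at cure rate `rate`."""
    return comb(patients, cured) * rate**cured * (1 - rate) ** (patients - cured)


# Hypothetical hypothesis set: the drug cures 10%, 30% or 50% of patients.
cure_rates = [0.1, 0.3, 0.5]
priors = {r: 1 / len(cure_rates) for r in cure_rates}  # assumed equal priors
likelihoods = {r: binomial_likelihood(r, cured=3, patients=10) for r in cure_rates}

post = posterior(priors, likelihoods)
for rate, prob in post.items():
    print(f"P(cure rate = {rate} | D) = {prob:.3f}")
```

With only three hypotheses the sum in the denominator is trivial; the trouble described in the next paragraph starts when the set of hypotheses is too large to enumerate.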

But now one can raise the (valid) point that even though we might want to produce the distribution above, we don't have to explicitly use the equation above to determine it - and indeed this is what gives birth to the wide range of statistical algorithms (this along with the fact that the formula above is usually impossible to compute exactly - there are simply too many hypotheses. So we have to make approximations anyway, and different approximations lead to different algorithms). There are two important observations about this point:

1) There are many, many theoretically possible algorithms (by which I mean computer programs, not just limited to statistics). If you were to create an algorithm at random then it would most likely not produce (approximations of) posterior distributions in a reliable and efficient way. So to produce any statistical algorithm at all you have to somehow make use of the mathematics of probabilities, and a good starting point is to use Bayes' Theorem. Unless you know/can come up with a different formula for the l.h.s. (or at least a good new approximation) you have no choice but to approximate the r.h.s. (note that most of the time P(D|B) is known, so several of the terms on the r.h.s. are already known). This is part of the reason why many statistical algorithms can easily be interpreted and understood with Bayes' Theorem - quite a few of them are derived from Bayes' Theorem. For example, for the classical testing of a hypothesis we assume that we initially have a set of hypotheses that all have exactly the same prior probability, and furthermore assume that some hypothesis neatly fits the data (P(D|B) ~ 1 for some B). Then if for our favourite hypothesis H0 we find a small p-value (which I am loosely treating as P(D|H0) here), i.e. the model did not predict the data, we find a posterior of P(H0|D) = P(D|H0) / sum(P(D|B)), which is also very small (the denominator is of order 1, the numerator is small). Viewed from a Bayesian framework it is clear that it will therefore often suffice to compute P(D|H) rather than go for P(H|D), unless the data really is inexplicable by our hypotheses (P(D|B) is small for all B we consider) or our hypotheses do not have equal (or almost equal) prior probability (for example, the theory of general relativity is supported by quite a bit of data, so if a single experiment were to disagree with the theory, even with an intimidatingly low p-value, the posterior probability of the theory being right might still be large). By keeping Bayes' Law in mind at all times we not only understand why the algorithm (p-values) works but also exactly in which cases it breaks down, and we are even presented with a way to improve the algorithm in those cases.
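The p-value argument above can be checked with a few lines of arithmetic. This is only a sketch under loud assumptions: there are just two hypotheses (H0 and one rival that fits the data with P(D|B) = 1), and both the value 1e-3 for P(D|H0) and the prior 1 - 1e-6 for a "well-supported theory" are invented numbers, not measurements. The function name `posterior_h0` is likewise hypothetical.

```python
def posterior_h0(p_data_given_h0, prior_h0, p_data_given_alt=1.0):
    """P(H0|D) when the only rival hypothesis has likelihood p_data_given_alt."""
    prior_alt = 1.0 - prior_h0
    numerator = p_data_given_h0 * prior_h0
    # P(D) = P(D|H0) P(H0) + P(D|alt) P(alt)
    evidence = numerator + p_data_given_alt * prior_alt
    return numerator / evidence


p_small = 1e-3  # hypothetical P(D|H0): H0 predicted the data badly

# Case 1: equal priors, as classical testing implicitly assumes.
equal_priors = posterior_h0(p_small, prior_h0=0.5)

# Case 2: H0 is a theory with a lot of earlier support (invented prior).
well_supported = posterior_h0(p_small, prior_h0=1 - 1e-6)

print(f"equal priors:   P(H0|D) = {equal_priors:.6f}")
print(f"strong prior:   P(H0|D) = {well_supported:.6f}")
```

In the first case the posterior is about as small as P(D|H0) itself, so reporting P(D|H0) loses little. In the second case the same data leave H0 very probable, which is the breakdown described above.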

2) The more important reason that the formula above matters is that it is true, even if you decide to use a radically different way of summarising your information to make predictions. If an ideal Bayesian reasoner shows up, computes the posterior probability, and then proceeds to make predictions, then most of the time your algorithm and the reasoner are going to agree (since most of the time you both make valid predictions). So there has to be some part of Bayes' law that reflects what your algorithm does: if your algorithm works reliably, and Bayes' law gives a mathematical formula for what you should predict, then Bayes' law must be able to describe your algorithm, even if you do not yet know how (that is to say, if your algorithm predicts 'Hypothesis H is likely/true, now that we have seen data D' then there must be some part of the formula for P(H|D) that makes it large, otherwise you two would not agree!). This is the most valuable insight into Bayesian statistics that I know: if anybody computes anything at all that acts like a prediction/probability and reliably gets correct results, then the terms in Bayes' theorem must happen to line up just right to give similar predictions, so we can describe the algorithm (and maybe even improve it) by making precise which terms act how. By looking at exactly what predictions a statistical algorithm makes under which conditions, we can find out how the unknown terms on the r.h.s. of Bayes' Theorem should act to produce similar results (the predictions from our non-Bayesian algorithm were correct most of the time, so there has to be some behaviour of these terms that lets us mimic the results). This exposes the overt and hidden assumptions, and thereby the limitations, of the model, provides understanding and more insight (I hope that my example above of p-values illustrates this - I certainly learned more about why people weren't concerned about 5-sigma disagreements with GR by studying Bayes) and sometimes even suggests improvements. This is what I was trying to say earlier (but I was lazy so I tried to use fewer words; clearly this backfired) when I said things like "look at the messy real-world examples through the lens of [Bayesian statistics] and see how the tricks that are used in practice are mostly making use of but often partly disagree with the overarching theory".
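As a small illustration of "figuring out how the unknown terms should act", take the simplest non-Bayesian estimate of a cure rate: cured / patients. The sketch below (a toy with invented data, a coarse grid of hypotheses, and a hypothetical helper `grid_mode`) asks which prior makes the most probable hypothesis under Bayes' Theorem agree with that estimate. A flat prior does, and that flat prior is the hidden assumption; a different prior gives a different answer.

```python
def grid_mode(cured, patients, prior, steps=100):
    """Most probable cure rate on a grid, for a given (unnormalised) prior."""
    grid = [i / steps for i in range(steps + 1)]
    # Unnormalised posterior: prior(r) * P(D|r). Dividing by P(D) would
    # rescale every entry equally, so it does not move the maximum.
    weights = [prior(r) * r**cured * (1 - r) ** (patients - cured) for r in grid]
    best = max(range(len(grid)), key=lambda i: weights[i])
    return grid[best]


cured, patients = 3, 10  # invented data
mle = cured / patients   # the usual frequentist point estimate


def flat(r):
    """Assumed prior: every cure rate equally probable beforehand."""
    return 1.0


def skeptical(r):
    """Assumed prior that favours low cure rates (arbitrary choice)."""
    return (1 - r) ** 4


flat_map = grid_mode(cured, patients, flat)
skeptical_map = grid_mode(cured, patients, skeptical)

print(f"cured/patients estimate: {mle}")
print(f"mode under flat prior:   {flat_map}")
print(f"mode under skeptical prior: {skeptical_map}")
```

The agreement under the flat prior is what lets us read the simple estimate as a special case of the Bayesian answer, and the disagreement under the other prior shows where the two part ways.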

I should also remark that in this post so far there has been no mention whatsoever of computational efficiency, and it should not be surprising that actual Bayesian algorithms are therefore often slower (or even uncomputable) when compared to other options (like I said above, there are many possible algorithms. There is no reason why the correct one and the fast one should be the same). Therefore for real-world applications a non-Bayesian algorithm is often your best bet. But to study and understand the probabilities and general algorithms the Bayesian approach is superior - all because Bayes' Theorem is a Theorem.

Lastly I would like to remark that often people (including me, above) write as if Bayesian statistics and frequentist statistics are in active conflict, which is mostly but not fully correct. Frequentist tests (and other tests, I tend to forget that there are more than two schools of statistics) are often superior in practice, and should most definitely be used in practice. But given an algorithm, some peculiar black box that gives the right answers, or given a philosophical conundrum on probabilities, the correct thing to do is write down Bayes' Theorem and figure out how the unknowns should act to give the results we are faced with, i.e. figure out how an ideal Bayesian reasoner could come to the same conclusion. And it really is a shame that this doesn't always lead to quick algorithms in practice.

I hope this has cleared up some misunderstandings.
