JFK was not assassinated: prior probability zero events
A lot of my work involves tweaking the utility or probability of an agent to make it believe - or act as if it believed - impossible or almost impossible events. But we have to be careful about this; an agent that believes the impossible may not be so different from one that doesn't.
Consider for instance an agent that assigns a prior probability of zero to JFK ever having been assassinated. No matter what evidence you present to it, it will go on disbelieving the "non-zero gunmen theory".
Initially, the agent will behave very unusually. If it were in charge of JFK's security in Dallas before the shooting, it would have sent all the secret service agents home, because no assassination could happen. Immediately after the assassination, it would have disbelieved everything. The films would have been faked or misinterpreted; the witnesses, deluded; the dead body of the president, that of a twin or an actor. It would have had huge problems with the aftermath, trying to reject all the evidence of death, seeing a vast conspiracy to hide the truth of JFK's non-death - a conspiracy that must include the many other conspiracy theories, which have to be false flags, since they all agree with the wrong statement that the president was actually assassinated.
But as time went on, the agent's behaviour would start to become more and more normal. It would realise the conspiracy was incredibly thorough in its faking of the evidence. All avenues it pursued to expose them would come to naught. It would stop expecting people to come forward and confess the joke; it would stop expecting to find radical new evidence overturning the accepted narrative. After a while, it would start to expect the next new piece of evidence to be in favour of the assassination idea - because if a conspiracy has been faking things this well so far, then it should continue to do so in the future. Though it cannot change its view of the assassination, its expectations for observations converge towards the norm.
If it does a really thorough investigation, it might stop believing in a conspiracy at all. At some point, a miracle will become more likely than a perfect but undetectable conspiracy. It is very unlikely that Lee Harvey Oswald shot at JFK, missed, and the president's head exploded simultaneously from unrelated natural causes. But after a while, such a miraculous explanation will become more likely than anything else the agent can consider. This explanation opens the possibility of miracles; but again, if the agent is very thorough, it will fail to find evidence of other miracles, and will probably settle on "an unrepeatable miracle caused JFK's death in a way that is physically undetectable".
But then note that such an agent will have a probability distribution over future events that is almost indistinguishable from a normal agent that just believes the standard story of JFK being assassinated. The zero-prior has been negated, not in theory but in practice.
How to do proper probability manipulation
This section is still somewhat a work in progress.
So the agent believes one false fact about the world, but its expectation is otherwise normal. This can be both desirable and undesirable. The downside comes if we hope to control the agent forever by giving it a false fact.
To see the positive, ask why we would want an agent to believe impossible things in the first place. One example was an Oracle design where the Oracle didn't believe its output message would ever be read. There we wanted the Oracle to believe the message wouldn't be read, but not to believe anything else too weird about the world.
In terms of causality, if X designates the message being read at time t, and B and A are events before and after t, respectively, we want P(B|X)≈P(B) (probabilities about current facts in the world shouldn't change much), while P(A|X)≠P(A) is fine and often expected (the future should be different depending on whether the message is read).
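This condition is easy to see in a toy joint distribution. The sketch below is entirely hypothetical: a past fact B, a fair stochastic "wire" N independent of B, the reading event X determined by the wire alone, and a future fact A that depends on X. All the probabilities are made up for illustration.

```python
from fractions import Fraction

# Toy joint model (all numbers hypothetical): past fact B, a fair
# stochastic "wire" N independent of B, the message-read event X
# determined by N alone, and a future fact A that depends on X.
def joint():
    for b in (0, 1):
        pb = Fraction(3, 10) if b else Fraction(7, 10)   # P(B=1) = 0.3
        for n in (0, 1):
            pn = Fraction(1, 2)                          # fair wire
            x = n                                        # reading decided by the wire
            for a in (0, 1):
                pa = Fraction(9, 10) if a == x else Fraction(1, 10)
                yield (b, x, a), pb * pn * pa

def cond(pred, given):
    num = sum(p for w, p in joint() if pred(w) and given(w))
    den = sum(p for w, p in joint() if given(w))
    return num / den

p_b_given_not_read = cond(lambda w: w[0] == 1, lambda w: w[1] == 0)  # P(B|X=0)
p_a_given_not_read = cond(lambda w: w[2] == 1, lambda w: w[1] == 0)  # P(A|X=0)
p_a = cond(lambda w: w[2] == 1, lambda w: True)                      # P(A)
```

Conditioning on "not read" leaves the past fact's probability untouched (it stays at 0.3) but shifts the future fact's probability away from its unconditional value, which is exactly the pattern we want.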
In the JFK example, the agent eventually concluded "a miracle happened". I'll call this miracle a scrambling point. It's kind of a breakdown in causality: two futures are merged into one, given two different pasts. The two pasts are "JFK was assassinated" and "JFK wasn't assassinated", and their common scrambled future is "everything appears as if JFK was assassinated". The non-assassination belief has shifted the past but not the future.
For the Oracle, we want to do the reverse: we want the non-reading belief to shift the future but not the past. However, unlike the JFK assassination, we can try to build the scrambling point ourselves. That's why I always talk about messages going down noisy wires, or specific quantum events, or chaotic processes. If the past goes through a truly stochastic event (it doesn't matter whether there is true randomness, or just that the agent can't figure out the consequences), we can get what we want.
The Oracle idea will go wrong if the Oracle concludes that non-reading must imply something different about the past (maybe it can see through chaos in ways we thought it couldn't), just as the JFK assassination denier will continue to be crazy if it can't find a route to reach "everything appears as if JFK was assassinated".
But there is a break in the symmetry: the JFK assassination denier will eventually reach that point as long as the world is complex and stochastic enough, while the Oracle requires that the future probabilities be the same in all (realistic) past universes.
Now, once the Oracle's message has been read, the Oracle will find itself in the same situation as the other agent: believing an impossible thing. For Oracles, we can simply reset them. Other agents might have to behave more like the JFK assassination disbeliever. Though if we're careful, we can quantify things more precisely, as I attempted to do here.
Expect to know better when you know more
A seemingly trivial result that I haven't seen posted anywhere in this form, as far as I could find. It simply shows that we expect evidence to increase the posterior probability of the true hypothesis.
Let H be the true hypothesis/model/environment/distribution, and ~H its negation. Let e be the evidence we receive, taking values e1, e2, ..., en. Let pi=P(e=ei|H) and qi=P(e=ei|~H).
The expected posterior weighting of H, P(e|H), is Σpipi while the expected posterior weighting of ~H, P(e|~H), is Σqipi. Then since the pi and qi both sum to 1, Cauchy–Schwarz implies that
- E(P(e|H)) ≥ E(P(e|~H)).
Thus, in expectation, the probability of the evidence given the true hypothesis, is higher than or equal to the probability of the evidence given its negation.
This, however, doesn't mean that the Bayes factor - P(e|H)/P(e|~H) - must have expectation greater than one, since ratios of expectation are not the same as expectations of ratio. The Bayes factor given e=ei is (pi/qi). Thus the expected Bayes factor is Σ(pi/qi)pi. The negative logarithm is a convex function; hence by Jensen's inequality, -log[E(P(e|H)/P(e|~H))] ≤ -E[log(P(e|H)/P(e|~H))]. That last expectation is Σ(log(pi/qi))pi. This is the Kullback–Leibler divergence of P(e|~H) from P(e|H), and hence is non-negative. Thus log[E(P(e|H)/P(e|~H))] ≥ 0, and hence
- E(P(e|H)/P(e|~H)) ≥ 1.
Thus, in expectation, the Bayes factor, for the true hypothesis versus its negation, is greater than or equal to one.
Note that this is not true for the inverse. Indeed E(P(e|~H)/P(e|H)) = Σ(qi/pi)pi = Σqi = 1.
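These two expectations are easy to check numerically. A quick sketch with random finite distributions (the distributions and sizes below are arbitrary):

```python
import random

random.seed(0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# For random p (= P(e_i|H)) and q (= P(e_i|~H)), check that under e ~ p
# the expected Bayes factor is >= 1 while its inverse has expectation exactly 1.
for _ in range(1000):
    p = normalize([random.random() + 1e-6 for _ in range(5)])
    q = normalize([random.random() + 1e-6 for _ in range(5)])
    bayes_factor = sum((pi / qi) * pi for pi, qi in zip(p, q))      # E[p_e/q_e]
    inverse_factor = sum((qi / pi) * pi for pi, qi in zip(p, q))    # E[q_e/p_e]
    assert bayes_factor >= 1 - 1e-12
    assert abs(inverse_factor - 1) < 1e-9
```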
In the preceding proofs, ~H played no specific role, and hence
- For all K, E(P(e|H)) ≥ E(P(e|K)) and E(P(e|H)/P(e|K)) ≥ 1 (and E(P(e|K)/P(e|H)) = 1).
Thus, in expectation, the probability of the true hypothesis versus anything, is greater or equal in both absolute value and ratio.
Now we can turn to the posterior probability P(H|e). For e=ei, this is P(H)*P(e=ei|H)/P(e=ei). We can compute the expectation of P(e|H)/P(e) as above, using the non-negative Kullback–Leibler divergence of P(e) from P(e|H), and thus showing it has an expectation greater than or equal to 1. Hence:
- E(P(H|e)) ≥ P(H).
Thus, in expectation, the posterior probability of the true hypothesis is greater than or equal to its prior probability.
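This last inequality can also be checked numerically; a minimal sketch with random priors and likelihoods (all values arbitrary), taking the expectation over evidence drawn from the true hypothesis H:

```python
import random

random.seed(1)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Check E(P(H|e)) >= P(H), with the expectation over e ~ P(e|H).
for _ in range(1000):
    prior_h = random.random()
    p = normalize([random.random() + 1e-6 for _ in range(4)])  # P(e_i|H)
    q = normalize([random.random() + 1e-6 for _ in range(4)])  # P(e_i|~H)
    marginal = [prior_h * pi + (1 - prior_h) * qi for pi, qi in zip(p, q)]
    posterior = [prior_h * pi / mi for pi, mi in zip(p, marginal)]  # P(H|e_i)
    expected_posterior = sum(po * pi for po, pi in zip(posterior, p))
    assert expected_posterior >= prior_h - 1e-12
```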
The trouble with Bayes (draft)
Prerequisites
This post requires some knowledge of Bayesian and Frequentist statistics, as well as probability. It is intended to explain one of the more advanced concepts in statistical theory--Bayesian non-consistency--to non-statisticians, and although the level required is much less than would be needed to read some of the original papers on the topic[1], considerable background is still required.
The Bayesian dream
Bayesian methods are enjoying a well-deserved growth of popularity in the sciences. However, most practitioners of Bayesian inference, including most statisticians, see it as a practical tool. Bayesian inference has many desirable properties for a data analysis procedure: it allows for intuitive treatment of complex statistical models, which include models with non-iid data, random effects, high-dimensional regularization, covariance estimation, outliers, and missing data. Problems which have been the subject of Ph.D. theses and entire careers in the Frequentist school, such as mixture models and the multi-armed bandit problem, can be satisfactorily handled by introductory-level Bayesian statistics.
A more extreme point of view, the flavor of subjective Bayes best exemplified by Jaynes' famous book [2], and also by a sizable contingent of philosophers of science, elevates Bayesian reasoning to the methodology for probabilistic reasoning, in every domain, for every problem. One merely needs to encode one's beliefs as a prior distribution, and Bayesian inference will yield the optimal decision or inference.
To a philosophical Bayesian, the epistemological grounding of most statistics (including "pragmatic Bayes") is abysmal. The practice of data analysis is either dictated by arbitrary tradition and protocol on the one hand, or consists of users creatively employing a diverse "toolbox" of methods justified by a diverse mixture of incompatible theoretical principles like the minimax principle, invariance, asymptotics, maximum likelihood or *gasp* "Bayesian optimality." The result: a million possible methods exist for any given problem, and a million interpretations exist for any data set, all depending on how one frames the problem. Given one million different interpretations for the data, which one should *you* believe?
Why the ambiguity? Take the textbook problem of determining whether a coin is fair or weighted, based on the data obtained from, say, flipping it 10 times. Keep in mind, a principled approach to statistics decides the rule for decision-making before you see the data. So, what rule would you use for your decision? One rule is, "declare it to be weighted, if either 10/10 flips are heads or 0/10 flips are heads." Another rule is, "always declare it to be weighted." Or, "always declare it to be fair." All in all, there are 11 possible outcomes (supposing we only care about the total number of heads) and therefore there are 2^11 possible decision rules. We can probably rule out most of them as nonsensical, like, "declare it to be weighted if 5/10 are heads, and fair otherwise," since 5/10 seems like the fairest outcome possible. But among the remaining possibilities, there is no obvious way to choose the "best" rule. After all, the performance of the rule, defined as the probability that you will reach the correct conclusion from the data, depends on the unknown state of the world, i.e. the true probability of flipping heads for that particular coin.
The Bayesian approach "cuts" the Gordian knot of choosing the best rule, by assuming a prior distribution over the unknown state of the world. Under this prior distribution, one can compute the average performance of any decision rule, and choose the best one. For example, suppose your prior is that with probability 99.9999%, the coin is fair. Then the best decision rule would be to "always declare it to be fair!"
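This optimization can be brute-forced in a few lines. The sketch below assumes, purely for illustration, that a "weighted" coin lands heads with probability 0.8; rules are indexed by the number of heads observed.

```python
from itertools import product
from math import comb

prior_fair = 0.999999
p_weighted = 0.8          # hypothetical bias of a weighted coin (my assumption)

def binom(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A rule maps the number of heads (0..10) to a verdict; True = "weighted".
def expected_accuracy(rule):
    acc = 0.0
    for k in range(11):
        acc += prior_fair * binom(10, k, 0.5) * (not rule[k])
        acc += (1 - prior_fair) * binom(10, k, p_weighted) * rule[k]
    return acc

# Enumerate all 2^11 rules and pick the one with the best average performance.
best_rule = max(product([False, True], repeat=11), key=expected_accuracy)
```

Under this lopsided prior, the winning rule is indeed "always declare it fair": even ten heads in a row costs more in false alarms on fair coins than it gains on the vanishingly rare weighted ones.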
The Bayesian approach gives you the optimal decision rule for the problem, as soon as you come up with a model for the data and a prior for your model. But when you are looking at data analysis problems in the real world (as opposed to a probability textbook), the choice of model is rarely unambiguous. Hence, for me, the standard Bayesian approach does not go far enough--if there are a million models you could choose from, you still get a million different conclusions as a Bayesian.
Hence, one could argue that a "pragmatic" Bayesian who thinks up a new model for every problem is just as epistemologically suspect as any Frequentist. Only in the strongest form of subjective Bayesianism can one escape this ambiguity. The dream for the subjective Bayesian is to start out in life with a single model. A single prior. For the entire world. This "world prior" would contain the entirety of one's own life experience, and the grand total of human knowledge. Surely, writing out this prior is impossible. But the point is that a true Bayesian must behave (at least approximately) as if they were driven by such a universal prior. In principle, having such a universal prior (at least conceptually) solves the problem of choosing models and priors for particular problems: the priors and models you choose are determined by the posterior of your universal prior. For example, why did you decide on a linear model for your economics data? Because according to your universal posterior, your particular economic data is well-described by such a model with high probability.
The main practical consequence of the universal prior is that your inferences in one problem should be consistent with your inferences in another, related problem. Even if the subjective Bayesian never writes out a "grand model", their integrated approach to data analysis for related problems still distinguishes their approach from the piecemeal approach of frequentists, who tend to treat each data analysis problem as if it occurs in an isolated universe. (So I claim, though I cannot point to any real example of such a subjective Bayesian.)
Yet, even if the subjective Bayesian ideal could be realized, many philosophers of science (e.g. Deborah Mayo) would consider it just as ambiguous as non-Bayesian approaches, since even if you have an unambiguous procedure for forming personal priors, your priors are still going to differ from mine. I don't consider this a defect, since my worldview necessarily does differ from yours. My ultimate goal is to make the best decision for myself. That said, such egocentrism, even if rationally motivated, may indeed be poorly suited for a collaborative enterprise like science.
For me, the far more troublesome objection to the "Bayesian dream" is the question, "How would you actually go about constructing this prior that represents all of your beliefs?" Looking in the Bayesian literature, one does not find any convincing examples of a user of Bayesian inference managing to encode all (or even a tiny portion) of their beliefs in the form of a prior--in fact, for the most part, we see alarmingly little thought or justification being put into the construction of priors.
Nevertheless, I myself remained one of these "hardcore Bayesians", at least from a philosophical point of view, ever since I started learning about statistics. My faith in the "Bayesian dream" persisted even after spending three years in the Ph.D. program at Stanford (a department with a heavy bias towards Frequentism) and even after I personally started doing research in frequentist methods. (I see frequentist inference as a poor man's approximation for the ideal Bayesian inference.) Though I was aware of the Bayesian non-consistency results, I largely dismissed them as mathematical pathologies. And while we were still a long way from achieving universal inference, I held the optimistic view that improved technology and theory might one day finally make the "Bayesian dream" achievable. However, I could not find a way to ignore one particular example on Wasserman's blog[3], due to its relevance to very practical problems in causal inference. Eventually I thought of an even simpler counterexample, which devastated my faith in the possibility of constructing a universal prior. Perhaps a fellow Bayesian can find a solution to this quagmire, but I am not holding my breath.
The root of the problem is the extreme degree of ignorance we have about our world, the degree of surprisingness of many true scientific discoveries, and the relative ease with which we accept these surprises. If we consider this behavior rational (which I do), then the subjective Bayesian is obligated to construct a prior which captures this behavior. Yet, the diversity of possible surprises the model must be able to accommodate makes it practically impossible (if not mathematically impossible) to construct such a prior. The alternative is to reject all possibility of surprise, and refuse to update any faster than a universal prior would (extremely slowly), which strikes me as a rather poor epistemological policy.
In the rest of the post, I'll motivate my example, sketch out a few mathematical details (explaining them as best I can to a general audience), then discuss the implications.
Introduction: Cancer classification
Biology and medicine are currently adapting to the wealth of information we can obtain by using high-throughput assays: technologies which can rapidly read the DNA of an individual, measure the concentration of messenger RNA, metabolites, and proteins. In the early days of this "large-scale" approach to biology which began with the Human Genome Project, some optimists had hoped that such an unprecedented torrent of raw data would allow scientists to quickly "crack the genetic code." By now, any such optimism has been washed away by the overwhelming complexity and uncertainty of human biology--a complexity which has been made clearer than ever by the flood of data--and replaced with a sober appreciation that in the new "big data" paradigm, making a discovery becomes a much easier task than understanding any of those discoveries.
Enter the application of machine learning to this large-scale biological data. Scientists take these massive datasets containing patient outcomes, demographic characteristics, and high-dimensional genetic, neurological, and metabolic data, and analyze them using algorithms like support vector machines, logistic regression and decision trees to learn predictive models to relate key biological variables, "biomarkers", to outcomes of interest.
To give a specific example, take a look at this abstract from the Shipp. et. al. paper on detecting survival rates for cancer patients [4]:
Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell−receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.
The term "supervised learning" refers to any algorithm for learning a predictive model that predicts some outcome Y (which could be either categorical or numeric) from covariates or features X. In this particular paper, the authors used a relatively simple linear model (which they called "weighted voting") for prediction.
A linear model is fairly easy to interpret: it produces a single "score variable" via a weighted average of a number of predictor variables. Then it predicts the outcome (say "survival" or "no survival") based on a rule like, "Predict survival if the score is larger than 0." Yet, far more advanced machine learning models have been developed, including "deep neural networks" which are winning all of the image recognition and machine translation competitions at the moment. These "deep neural networks" are especially notorious for being difficult to interpret. Along with similarly complicated models, neural networks are often called "black box models": although you can get miraculously accurate answers out of the "box", peering inside won't give you much of a clue as to how it actually works.
Now it is time for the first thought experiment. Suppose a follow-up paper to the Shipp paper reports dramatically improved prediction for survival outcomes of lymphoma patients. The authors of this follow-up paper trained their model on a "training sample" of 500 patients, then used it to predict the five-year outcome of chemotherapy patients, on a "test sample" of 1000 patients. It correctly predicts the outcome ("survival" vs "no survival") on 990 of the 1000 patients.
Question 1: what is your opinion on the predictive accuracy of this model on the population of chemotherapy patients? Suppose that publication bias is not an issue (the authors of this paper designed the study in advance and committed to publishing) and suppose that the test sample of 1000 patients is "representative" of the entire population of chemotherapy patients.
Question 2: does your judgment depend on the complexity of the model they used? What if the authors used an extremely complex and counterintuitive model, and cannot even offer any justification or explanation for why it works? (Nevertheless, their peers have independently confirmed the predictive accuracy of the model.)
A Frequentist approach
The Frequentist answer to the thought experiment is as follows. The accuracy of the model is a probability p which we wish to estimate. The number of successes on the 1000 test patients is Binomial(p, 1000). Based on the data, one can construct a confidence interval: say, we are 99% confident that the accuracy is above 83%. What does 99% confident mean? I won't try to explain, but simply say that in this particular situation, "I'm pretty sure" that the accuracy of the model is above 83%.
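For the curious, the exact one-sided bound can be sketched with a short bisection; 83% is a valid but very conservative confidence statement, and the sharpest bound comes out far higher.

```python
from math import comb

# Sketch: exact (Clopper-Pearson style) one-sided 99% lower confidence
# bound for the accuracy p, after 990 successes in 1000 trials.
def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_conf_bound(k, n, alpha=0.01):
    # Find the p0 where P(X >= k | p0) = alpha by bisection; the tail
    # probability is increasing in p, so p0 is the 99% lower bound.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

bound = lower_conf_bound(990, 1000)   # roughly 0.98
```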
A Bayesian approach
The Bayesian interjects, "Hah! You can't explain what your confidence interval actually means!" He puts a uniform prior on the probability p. The posterior distribution of p, conditional on the data, is Beta(991, 11). This gives a 99% credible interval that p is in [0.978, 0.995]. You can actually interpret the interval in probabilistic terms, and it gives a much tighter interval as well. Seems like a Bayesian victory...?
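The quoted interval can be checked with a quick Monte Carlo draw from the posterior (stdlib only, so the quantiles are approximate):

```python
import random

random.seed(2)

# Posterior after 990 successes / 10 failures under a uniform prior on p
# is Beta(991, 11); estimate the central 99% credible interval by sampling.
draws = sorted(random.betavariate(991, 11) for _ in range(200_000))
lo = draws[int(0.005 * len(draws))]
hi = draws[int(0.995 * len(draws))]
```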
A subjective Bayesian approach
As I have argued before, a Bayesian approach which comes up with a model after hearing about the problem is bound to suffer from the same inconsistency and arbitrariness as any non-Bayesian approach. You might assume a uniform distribution for p in this problem... but what if yet another paper comes along with a similar prediction model? You would need a joint distribution for the current model and the new model. What if a theory comes along that could help explain the success of the current method? The parameter p might take on a new meaning in this context.
So as a subjective Bayesian, I argue that slapping a uniform prior on the accuracy is the wrong approach. But I'll stop short of actually constructing a Bayesian model of the entire world: let's say we want to restrict our attention to this particular issue of cancer prediction. We want to model the dynamics behind cancer and cancer treatment in humans. Needless to say, the model is still ridiculously complicated. However, I don't think it's out of reach of a well-funded, large collaborative effort of scientists.
Roughly speaking, the model can be divided into a distribution over theories of human biology and, conditional on the theory of biology, a coarse-grained model of an individual patient. The model would not include every cell, every molecule, etc., but it would contain many latent variables in addition to the variables measured in any particular cancer study. Let's call the variables actually measured in the study X, and the survival outcome Y.
Now here is the epistemologically correct way to answer the thought experiment. Take a look at the X's and Y's of the patients in the training and test set. Update your probabilistic model of human biology based on the data. Then take a look at the actual form of the classifier: it's a function f() mapping X's to Y's. The accuracy of the classifier is no longer a parameter: it's a quantity Pr[f(X) = Y] which has a distribution under your posterior. That is, for any given "theory of human biology", Pr[f(X) = Y] has a fixed value; over the distribution of possible theories of human biology (based on the data of the current study as well as all previous studies and your own beliefs), Pr[f(X) = Y] therefore has a distribution, and hence an average. But what will this posterior give you? Will you get something similar to the interval [0.978, 0.995] you got from the "pragmatic Bayes" approach?
Who knows? But I would guess not. My guess is that you would get a very different interval from [0.978, 0.995], because in this complex model there is no direct link between the empirical success rate of prediction and the quantity Pr[f(X) = Y]. But my intuition for this comes from the following simpler framework.
A non-parametric Bayesian approach
Instead of reasoning about a grand Bayesian model of biology, I now take a middle ground, and suggest that while we don't need to capture the entire latent dynamics of cancer, at the very least we should try to include the X's and the Y's in the model, instead of merely abstracting the whole experiment as a Binomial trial (as did the Frequentist and pragmatic Bayesian). Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of parameters in the model is not finite.
Since in this case Y is a binary outcome, a joint distribution can be decomposed as a marginal distribution over X, and a function g(x) giving the conditional probability that Y=1 given X=x. The marginal distribution is not so interesting or important for us, since it simply reflects the composition of the population of patients. For the purpose of this example, let us say that the marginal is known (e.g., a finite distribution over the population of US cancer patients). What we want to know is the probability of patient survival, and this is given by the function g(X) for the particular patient's X. Hence, we will mainly deal with constructing a prior over g(X).
To construct a prior, we need to think of intuitive properties of the survival probability function g(x). If x is similar to x', then we expect the survival probabilities to be similar. Hence the prior on g(x) should be over random, smooth functions. But we need to choose the smoothness so that the prior does not consist of almost-constant functions. Suppose for now that we choose a particular class of smooth functions (e.g. functions with a certain Lipschitz norm) and choose our prior to be uniform over functions of that smoothness. We could go further and put a prior on the smoothness hyperparameter, but for now we won't.
Now, although I assert my faithfulness to the Bayesian ideal, I still want to think about how whatever prior we choose would allow us to answer some simple thought experiments. Why is that? I hold that ideal Bayesian inference should capture and refine what I take to be "rational behavior." Hence, if a prior produces irrational outcomes, I reject that prior as not reflecting my beliefs.
Take the following thought experiment: we simply want to estimate the expected value of Y, E[Y]. Hence, we draw 100 patients independently with replacement from the population and record their outcomes: suppose the sum is 80 out of 100. The Frequentist (and pragmatic Bayesian) would end up concluding that with high probability/confidence/whatever, the expected value of Y is around 0.8, and I would hold that an ideal rationalist would come up with a similar belief. But what would our non-parametric model say? We would draw a random function g(x) conditional on our particular observations: we get a quantity E[g(X)] for each instantiation of g(x), and the distribution of E[g(X)]'s over the posterior allows us to make credible intervals for E[Y].
But what do we end up getting? One of two things happens. Either you choose too little smoothness, and E[g(X)] ends up concentrating around 0.5, no matter what data you put into the model. This is the phenomenon of Bayesian non-consistency, and a detailed explanation can be found in several of the listed references; but to put it briefly, sampling at a few isolated points gives you too little information about the rest of the function. This example is not as pathological as the ones used in the literature: if you sample infinitely many points, you will eventually get the posterior to concentrate around the true value of E[Y], but all the same, the convergence is ridiculously slow. Alternatively, you choose a super-high smoothness, and the posterior of E[g(X)] has a nice interval around the sample value, just like in the Binomial example. But now if you look at your posterior draws of g(x), you'll notice the functions are basically constants. Putting a prior on smoothness doesn't change things: the posterior on smoothness doesn't change, since you don't actually have enough data to determine the smoothness of the function. The posterior average of E[g(X)] is no longer always 0.5: it gets a little affected by the data, since within the 10% mass of the posterior corresponding to the smooth prior, the average of E[g(X)] is responding to the data. But you are still almost as slow as before in converging to the truth.
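The first failure mode can be seen analytically in an extreme "rough prior" toy model. The grid size, sample sizes, and iid-uniform prior below are my own assumptions for illustration: with independent Uniform(0,1) values of g at each grid point, observing one Bernoulli outcome at a point updates only that point, so 80 successes at 100 points barely move the posterior mean of E[g(X)].

```python
# Rough prior sketch (assumed setup): g iid Uniform(0,1) at each of m grid
# points. Observing Y_i ~ Bernoulli(g(x_i)) at one point updates only that
# point: the posterior there is Beta(2,1) after a success (mean 2/3) or
# Beta(1,2) after a failure (mean 1/3); every unobserved point keeps its
# prior mean of 1/2.
m, n_obs, n_successes = 1000, 100, 80
post_mean_Eg = (n_successes * (2 / 3)
                + (n_obs - n_successes) * (1 / 3)
                + (m - n_obs) * (1 / 2)) / m
# The data suggest E[Y] is about 0.8, yet the posterior mean of E[g(X)]
# has crawled from 0.50 only to 0.51.
```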
At the time that I started thinking about the above "uniform sampling" example, I was still convinced of a Bayesian resolution. Obviously, using a uniform prior over smooth functions is too naive: you can tell by seeing that the prior distribution over E[g(X)] is already highly concentrated around 0.5. How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.
But then I thought, what about more complicated problems than computing E[Y]? What if you have to compute the expectation of Y conditional on some complicated function of X taking on a certain value, i.e. E[Y|f(X) = 1]? In the frequentist world, you can easily compute E[Y|f(X)=1] by rejection sampling: get a sample of individuals, and average the Y's of the individuals whose X's satisfy f(X) = 1. But how could you formulate a prior that has the same property? For a finite collection of functions f, say {f1,...,f100}, you might be able to construct a prior for g(x) so that the posterior for E[g(X)|fi = 1] converges to the truth for every i in {1,...,100}. I don't know how to do so, but perhaps you know. But the frequentist intervals work for every function f! Can you construct a prior which can do the same?
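The frequentist rejection-sampling recipe is worth seeing concretely. A minimal sketch on a toy population (the population model and the selector f are my assumptions, not anything from a real study):

```python
import random

random.seed(4)

def f(x):
    return x > 0.7            # an arbitrary selector function (assumed)

def draw():
    # Toy population: X ~ Uniform(0,1), Y ~ Bernoulli(g(X)) with g(x) = x,
    # so the true E[Y | X > 0.7] is 0.85.
    x = random.random()
    y = 1 if random.random() < x else 0
    return x, y

# Rejection sampling: keep only individuals with f(X) = 1, average their Y's.
sample = [draw() for _ in range(100_000)]
kept = [y for x, y in sample if f(x)]
estimate = sum(kept) / len(kept)
```

The same few lines work unchanged for any measurable f, which is exactly the generality the text says a single prior struggles to match.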
I am happy to argue that a true Bayesian would not need consistency for every possible f in the mathematical universe. It is cool that frequentist inference works for such a general collection: but it may well be unnecessary for the world we live in. In other words, there may be functions f which are so ridiculous, that even if you showed me that empirically, E[Y|f(X)=1] = 0.9, based on data from 1 million patients, I would not believe that E[Y|f(X)=1] was close to 0.9. It is a counterintuitive conclusion, but one that I am prepared to accept.
Yet, the set of f's which are not so ridiculous, which in fact I might accept to be reasonable based on conventional science, may be so large as to render impossible the construction of a prior which could accommodate them all. And the Bayesian dream makes the far stronger demand that our prior not just capture our current understanding of science but also match the flexibility of rational thought. I hold that, given the appropriate evidence, rationalists can be persuaded to accept truths which they could not even imagine beforehand. Thinking about how we could possibly construct a prior to mimic this behavior, the Bayesian dream seems distant indeed.
Discussion
To be updated later... perhaps responding to some of your comments!
[1] Diaconis and Freedman, "On the Consistency of Bayes Estimates"
[2] ET Jaynes, Probability: the Logic of Science
[3] https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/
[4] Shipp et al. "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning." Nature
Comments on "When Bayesian Inference Shatters"?
I recently ran across this post, which gives a lighter discussion of a recent paper on Bayesian inference ("On the Brittleness of Bayesian Inference"). I don't understand it, but I'd like to, and it seems like the sort of paper other people here might enjoy discussing.
I am not a statistician, and this summary is based on the blog post (I haven't had time to read the paper yet), so please discount it accordingly. It looks like the paper focuses on the effects of priors and underlying models on the posterior distribution. Given a continuous distribution (or a discrete approximation of one) to be estimated from finite observations (of sufficiently high precision), and finite priors, the range of posterior estimates is the same as the range of the distribution to be estimated. Given models that are arbitrarily close in the total variation metric (I'm not familiar with it, but the impression I had was that, for finite accuracy, such models produce the same observations with arbitrarily similar probability), you can have posterior estimates that are arbitrarily distant (within the range of the distribution to be estimated) given the same information. My impression is that implicitly relying on the arbitrary precision of a prior can give updates that are diametrically opposed to the ones you'd get with different, but arbitrarily similar, priors.
First, of course, I want to know whether my summary is accurate, misses the point, or is simply wrong.
Second, I'd be interested in hearing discussions of the paper in general and whether it might have any immediate impact on practical applications.
Some other areas of discussion that would be of interest to me: I'm not entirely sure what 'sufficiently high precision' would be. I also have only a vague idea of the circumstances in which you'd be implicitly relying on the arbitrary precision of a prior. And I'm generally interested in hearing what people more experienced/intelligent than I am might have to say here.
To like, or not to like?
Do you like Shakespeare?
I've been reading the Paris Review interviews with famous authors of the 20th century. Famous authors don't always like other famous authors. Hemingway, Faulkner, Joyce, Fitzgerald — for all of them, you could find some famous author who found them unreadable. (Especially Joyce and Faulkner.)
Except Shakespeare. Everyone loved Shakespeare. In fact, those who mentioned Shakespeare sometimes said he was the best author who has ever lived.
How likely is this?
Supposing you inherited an AI project...
Supposing you have been recruited to be the main developer on an AI project. The previous developer died in a car crash and left behind an unfinished AI. It consists of:
A. A thoroughly documented scripting language specification that appears to be capable of representing any real-life program as a network diagram so long as you can provide the following:
A.1. A node within the network whose value you want to maximize or minimize.
A.2. Conversion modules that transform data about the real-world phenomena your network represents into a form that the program can read.
B. Source code from which a program can be compiled that will read scripts in the above language. The program outputs a set of values for each node that will optimize the output (you can optionally specify which nodes can and cannot be directly altered, and the granularity with which they can be altered).
It gives remarkably accurate answers for well-formulated questions. Where there is a theoretical limit to the accuracy of an answer to a particular type of question, its answer usually comes close to that limit, plus or minus some tiny rounding error.
Given that, what is the minimum set of additional features you believe would absolutely have to be implemented before this program can be enlisted to save the world and make everyone live happily forever? Try to be as specific as possible.
Book: AKA Shakespeare (an extended Bayesian investigation)
Disclaimer: I have not read this book. I'm posting it in the expectation that others may enjoy it as much as I'm sure I would if I had time to read it myself.
This looks interesting as an extended worked example of Bayesian reasoning (the "scientific approach" of the title).
AKA Shakespeare: A Scientific Approach to the Authorship Question
The goal of AKA Shakespeare is to analyze the Shakespeare Authorship Question in such a way that you, Dear Reader, can review the evidence for yourself and come to your own conclusions. You will be presented with three candidates for the great playwright and poet whom we know as “Shakespeare.” He was either the gentleman from Stratford-upon-Avon (referred to as “Stratford”), Edward de Vere, Earl of Oxford (referred to as “Oxford”), or a vague “somebody else” (such as Christopher Marlowe, Henry Neville, etc., referred to as “Ignotus”).

The book is built around 25 key questions. Concerning education, for instance, you are asked to infer Shakespeare's education level from his writings, and to compare that with the known (or more-or-less known, or speculated) education levels of Stratford, Oxford, and Ignotus. For each question, you are asked to express your opinions numerically. Rather than say “I strongly believe …,” you say, for instance, “I give 10 to 1 odds that …” You then enter your numbers in a chart in the book. Alternatively and preferably, you enter your numbers on the companion website aka-Shakespeare.com, which contains a program, “Prospero,” that will process your entries and return your resulting conclusions, expressed as probabilities that Shakespeare was Stratford, or Oxford, or Ignotus.

To accommodate a mix of information, debate, and speculation, AKA Shakespeare is written as a dialog involving four fictional characters who meet, drink, and talk in interesting locations, from Napa Valley to Big Sur, in Northern California. Beatrice, a professor of English literature, begins as a committed Stratfordian. Claudia, a detective-story writer, is skeptical. Her husband James (a once-successful engineer, now a less successful vintner) helps to identify the relevant questions. Martin is the scientist who develops and applies the necessary analytical procedures. (To see their portraits and biographies, open up aka-Shakespeare.com.)
Beatrice and Claudia end up agreeing that the leading candidate is de Vere, with Ignotus second and Stratford a very distant third. Beatrice’s entries lead to a final probability of 10^-13 (one chance in ten million million) that Shakespeare was the gentleman from Stratford-upon-Avon. Claudia’s entries lead to an even smaller probability. James ends with the wry remark: “We—in our smug presumed wisdom—wonder how any men or women could possibly have been so foolish as to believe that the Earth was flat. Maybe, in a hundred years’ time, people will wonder how otherwise sensible men and women could have believed that the works of Shakespeare were written by a butcher’s apprentice from a small town in Warwickshire!” You are encouraged to review the evidence for yourself. You may find that you agree with Beatrice, Claudia, and James. On the other hand, you may not.
http://amzn.com/0984261419
Edited to add:
There are many signs in the above block of text that this book is not up to Lesswrong standards. As gwern suggests, reading it should be done with an adversarial attitude.
I propose some more useful goals than finding someone for whom we can cheer loudly as a properly qualified member of our tribe: find worked examples that let you practice your art; find structured activities that will actually lead you to practice your art; try to critically assess arguments that use the tools we think powerful, then discuss your criticism on a forum like Lesswrong where your errors are likely to be discovered and your insights are likely to be rewarded (with tasty karma).
Unintentional bayesian
Growing up in a very religious country, I was indoctrinated thoroughly both at home and at school. I used to believe that some Christian beliefs made sense. When I was 14 years old or so, I began contemplating death – I said to myself, “Well, after I die I go to Hell or Heaven; the latter is preferable, so I'd better learn as soon as possible how I can make sure I'll go to Heaven.”
So I went on to read frantically about Christianity. With every iota of information processed, I strayed further from the religion. That is, the more I read, the less plausible anything pertaining to it seemed. “Where the hell is Hell? Can I visit before I die? Why doesn't God answer my prayers to tell me? Why do some people get to talk to God but not me?”, I asked. In retrospect, my greatest strength was genuine curiosity: I wanted to know as much as possible about the truthfulness of my religion.
The irony here is that wanting to become more Christian-like led to my abandoning of Christianity. But I continued to learn about other religions as well, thinking that one might be truer than another. Of course, none of them seemed even remotely plausible; I concluded that religions are false. I turned into an atheist without even knowing that that word existed!
Eventually I stumbled on some articles about non-religion and discovered that my lack of religious belief is called 'atheism'. Since then, I have abandoned more beliefs tied to, say, politics or nutrition, thanks to applying Bayesian probability to my hypotheses.
I had been an unintentional bayesian for my whole life!
Have you had any similar experiences?
PS: This is my first article. I am looking forward to hearing feedback on it.
Edit #1: I should have used the term 'rationalist' instead of 'bayesian' because I didn't apply Bayes' theorem explicitly.
What is the best paper explaining the superiority of Bayesianism over frequentism?
Question in title.
This is obviously subjective, but I figure there ought to be some "go-to" paper. Maybe I've even seen it once, but can't find it now and I don't know if there's anything better.
Links to multiple papers with different focus would be welcome. For my current purpose I have a preference for one that aims low and isn't too long.
[LINK] stats.stackexchange.com question about Shalizi's Bayesian Backward Arrow of Time paper
I haven't gotten an answer on this yet and I set up a bounty; I figured I'd link it here too in case any stats/physics people care to take a crack at it.
Request for feedback: paper on fine-tuning and the multiverse hypothesis
A while back, I posted in the "What are you working on?" thread about a paper I was working on. A few people wanted to see it once I have a complete draft, and I'm of course independently interested in obtaining feedback before I move on with it.
The paper doesn't presuppose much philosophical jargon that isn't easily googleable, I think. Math-wise, you need to be somewhat comfortable with basic conditional probabilities. I'm interested in finding out about any math errors, other non sequiturs, and other flaws in my discussion. I'd also like to find out about general impressions, such as what I should have spilled more or less ink on. Some notation is unfinished (subscripts, singular/plural first person, etc.), but it's thoroughly readable.
ABSTRACT: According to a standard form of the fine-tuning argument, the apparent anthropic fine-tuning of the physical constants and boundary conditions of our universe confirms the multiverse hypothesis. According to the inverse gambler’s fallacy objection, this view is mistaken: although the multiverse hypothesis makes the existence of a life-permitting universe more probable than it would be on a single-universe theory, it does not make it any more probable that our universe should be life-permitting, and thus is not confirmed by our total evidence. We examine recent replies to this objection and conclude that they all fall short, usually due to a shared weakness. We then show how a synthetic reply, obtained by combining independent insights from the literature, can overcome the weakness afflicting its predecessors.
If you'd like a slightly more detailed description before deciding whether or not to read the whole thing, see my post.
Here is the actual paper: DOCX PDF (on some computers, italicized Times New Roman looks weird in the PDF)
EDIT 5/9/12: Current draft (edited, shortened to 13.5K words) is here:
DOCX: http://bit.ly/Jc4pXr
PDF: http://bit.ly/Jdc7z3
NOTE: The paper occasionally makes use of the notion of a person as a metaphysical individual. Roughly and likely inaccurately, this is the concept of an individual essence that can only be instantiated once in a possible world and is partly independent of the physical pattern it inhabits (i.e. you can have different possible worlds that are physically identical but contain different individuals -- I think this is what Eliezer refers to as "the philosophical notion of indexical identity apart from pattern identity"). I personally find this concept unmotivated to say the least; it figures in the paper only because some of the arguments discussed rely on it; and it is inessential for my proposed reply. If you're going to weigh in on this, I'd rather you make suggestions as to how I could gracefully express that I find the concept unhelpful while still engaging with the arguments.
The Quick Bayes Table
This is an effort to make Bayes' Theorem available to people without heavy math skills. It is possible that this has already been invented, because it is just a direct result of expanding something I read in Yudkowsky’s Intuitive Explanation of Bayes' Theorem. In that case, excuse me for reinventing the wheel. Also, English is my second language.
When I read Yudkowsky’s Intuitive Explanation of Bayes Theorem, the notion of using decibels to measure the likelihood ratio of additional evidence struck me as extremely intuitive. But in the article, the notion was just a little footnote, and I wanted to check if this could be used to simplify the theorem.
It is harder to use logarithms than to apply Bayes' Theorem the normal way, but I remembered that before modern calculators were made, mathematicians carried around small tables of base-10 logarithms that saved them work in laborious multiplications and divisions, and I wondered if we could use the same trick to get quick approximations to Bayes' Theorem.
I calculated some numbers and produced this table in order to test my idea:
| Decibels | Probability | Odds |
|----------|-------------|--------|
| -30 | 0.1% | 1:1000 |
| -24 | 0.4% | 1:251 |
| -20 | 1% | 1:100 |
| -18 | 1.5% | 1:63 |
| -15 | 3% | 1:32 |
| -12 | 6% | 1:16 |
| -11 | 7% | 1:12.6 |
| -10 | 9% | 1:10 |
| -9 | 11% | 1:7.9 |
| -8 | 14% | 1:6.3 |
| -7 | 17% | 1:5 |
| -6 | 20% | 1:4 |
| -5 | 24% | 1:3.2 |
| -4 | 28% | 1:2.5 |
| -3 | 33% | 1:2 |
| -2 | 38% | 1:1.6 |
| -1 | 44% | 1:1.3 |
| 0 | 50% | 1:1 |
| +1 | 56% | 1.3:1 |
| +2 | 62% | 1.6:1 |
| +3 | 67% | 2:1 |
| +4 | 72% | 2.5:1 |
| +5 | 76% | 3.2:1 |
| +6 | 80% | 4:1 |
| +7 | 83% | 5:1 |
| +8 | 86% | 6.3:1 |
| +9 | 89% | 7.9:1 |
| +10 | 91% | 10:1 |
| +11 | 93% | 12.6:1 |
| +12 | 94% | 16:1 |
| +15 | 97% | 32:1 |
| +18 | 98.5% | 63:1 |
| +20 | 99% | 100:1 |
| +24 | 99.6% | 251:1 |
| +30 | 99.9% | 1000:1 |
This table's values are approximate for easier use. The odds approximately double every 3 dB (the real odds at 3 dB are 1.995:1) and are multiplied by exactly 10 every 10 dB.
To use this table, convert the prior probability to decibels (using the probability column) and the likelihood ratio to decibels (using the odds column), add the two decibel values, and read the approximate answer back off the probability column at the resulting decibel row. If the result falls between two rows, choose the one closest to 0.
For example, let's try to solve the problem in Yudkowsky’s article:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
A 1% prior gets us -20 dB in the table. For the likelihood ratio, 80% true positive versus 9.6% false positive is about an 8:1 ratio, +9 dB in the table. Adding both results, -20 dB + 9 dB = -11 dB, and that translates into 7% as the answer. The true answer is 7.8%, so this method manages to get close to the real answer with just a simple addition.
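The table lookups can also be carried out exactly. Here is a small Python sketch of the same mammography calculation, converting probabilities to decibels of odds (10 times the base-10 log of the odds), adding, and converting back; the helper names are mine, not from any standard library:

```python
import math

def to_db(p):
    """Probability -> decibels of odds: 10 * log10(p / (1 - p))."""
    return 10 * math.log10(p / (1 - p))

def from_db(db):
    """Decibels of odds -> probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

prior_db = to_db(0.01)                         # about -20.0 dB
likelihood_db = 10 * math.log10(0.80 / 0.096)  # about +9.2 dB

# Bayes' Theorem becomes a simple addition in decibel space.
posterior = from_db(prior_db + likelihood_db)  # about 0.078, the exact 7.8%
```

Without the table's rounding, the decibel addition reproduces the exact answer, which shows the 7% table result is purely an artifact of rounding to whole decibels.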
--
Yudkowsky says that the likelihood ratio doesn't tell the whole story about the possible results of a test, but I think we can use this method to get the rest of the story.
If you can read off the positive likelihood ratio as the meaning of a positive result, then you can get the negative likelihood ratio as the meaning of a negative result just by reworking the problem.
I'll use Yudkowsky's problem in order to explain myself. If 80% of women with breast cancer get positive mammographies, then 20% of them will get negative mammographies, and they will be false negatives. If 9.6% of women without breast cancer get positive mammographies, then 90.4% of them will get negative mammographies, true negatives.
The ratio between those two values gives us the meaning of a negative result: 20% false negative versus 90.4% true negative is between a 1:4 and a 1:5 ratio. We take the decibel value closest to 0, -6 dB. -20 dB - 6 dB = -26 dB. This value is between -24 dB and -30 dB, so the answer will be between 0.1% and 0.4%. The true answer is 0.2%, so it also works this way.
--
The positive likelihood ratio and the negative likelihood ratio are a good way of describing how a certain test adds additional data. We could describe the mammography test as a +9dB/-6dB test, and with only this information we know everything we need to know about the test. If the result is positive, it adds 9 dB to the evidence, and if it is negative, it subtracts 6 dB from it.
Simple and intuitive.
By the way, as decibels are used to measure physical quantities, not probabilities, I believe that renaming the unit would be appropriate in this case. What about DeciBayes?
Teaching Bayesian statistics? Looking for advice.
I am considering trying to get a job teaching statistics from a Bayesian perspective at the university or community college level, and I figured I should get some advice, both on whether or not that's a good idea and how to go about it.
Some background on myself: I just got my Masters in computational biology, to go along with a double Bachelors in Computer Science and Cell/Molecular Biology. I was in a PhD program but between enjoying teaching more than research and grad school making me unhappy, I decided to get the Masters instead. I've accumulated a bunch of experience as a teaching assistant (about six semesters) and I'm currently working as a Teaching Specialist (which is a fancy title for a full time TA). I'm now in my fourth semester of TAing biostatistics, which is pretty much just introductory statistics with biology examples. However, it's taught from a frequentist perspective.
I like learning, optimizing, teaching, and doing a good job of things I see people doing badly. I also seem to do dramatically better in highly structured environments. So, I've been thinking about trying to find a lecturer position teaching statistics from a Bayesian perspective. All of the really smart professors I know personally who have an opinion on the topic are Bayesians, Less Wrong as a community prefers Bayesianism, and I prefer it. This seems like a good way to get paid to do something I would enjoy and raise the rationality waterline while I'm at it.
So, the first question is whether this is the most efficient way to get paid to promote rationality. I did send in an application to the Center for Modern Rationality but I haven't heard back, so I'm guessing that isn't an option. Teaching Bayesian statistics seems like the second best bet, but there are probably other options I haven't thought of. I could teach biology or programming classes, but I think those would be less optimal uses of my skills.
Next, is this even a viable option for me, given my qualifications? I haven't taken any education classes to speak of (the class on how to be a TA might count but it was a joke). My job searches suggest that community colleges do hire people with Masters to teach, but universities mostly do not. I don't know what it takes to actually get hired in the current economic climate.
I'm also trying to figure out if this is the best career option given my skillset (or at least estimate the opportunity cost in terms of ease of finding jobs and compensation). I have a number of other potential options available: I could try to find a research position in bioinformatics or computational biology, or look for programming positions. Bioinformatics really means "analyzing sequence data", and that's something I've barely touched since undergrad; my thesis used existing gene alignments. I could probably brush up and learn the current tools if I wanted, but I have hardly any experience in that area. Computational biology might be a better bet, but it's a ridiculously varied field and so far I have not much enjoyed doing research.
I could probably look for programming jobs, but they would mostly not leverage my biology skills; while I am a very good programmer for a biologist, and a very good biologist for a programmer, I'm not amazing at either. I can actually program: my thesis project involved lots of Ruby scripts to generate and manipulate data prior to statistical analysis, and I've also written things like a flocking implementation and a simple vector graphics drawing program. Everything I've written has been just enough to do what I needed it to do. I did not teach myself to program in general, but I did teach myself Ruby, if that helps estimate my level of programming talent. Yudkowsky did just point out that programming potentially pays REALLY well, possibly better than any of my other career options, but that may be limited to very high talent and/or very experienced programmers.
Assuming it is a good idea for me to try to teach statistics, and assuming I have a reasonable shot at finding such a job, is it realistic to try to teach statistics from a Bayesian perspective to undergrads? Frequentist approaches are still pretty common, so the class would almost certainly have to cover them as well, which means there's a LOT of material to cover. Bayesian methods generally involve some amount of calculus, although I have found an introductory textbook which uses minimal calculus. That might be a bit much to cram into a single semester, especially depending on the quality of the students (physics majors can probably handle a lot more than community college Communications majors).
Speaking of books, what books would be good to teach from, and what books should I read to have enough background? I attempted Jaynes' Probability Theory: The Logic of Science but it was a bit too high level for me to fully understand. I have been working my way through Bolstad's Introduction to Bayesian Statistics which is what I would probably teach the course from. Are there any topics that Less Wrong thinks would be essential to cover in an introductory Bayesian statistics course?
Thanks in advance for all advice and suggestions!
The Principle of Maximum Entropy
After having read the related chapters of Jaynes' book I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However it immediately raised a large number of questions.
I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:
Can the maximum entropy principle be explained as a consistency requirement?
The constraint rule of the maximum entropy principle
I was wondering what you think about the principle of maximum entropy and its justifications.
Online Course in Evidence-Based Medicine
The Foundation for Blood Research has created an online course in Evidence-Based Medicine, aimed at "advanced high school science students, college students, nursing students, and 1st or 2nd year medical students." It focuses on evaluating research papers and applying statistics to medical diagnosis. I have taken this course, and it was useful practice in Bayesian reasoning.
The course involves working through a couple of case studies of ER patients. Students observe the patient, review research on relevant diagnostic tests, and calculate posterior probabilities given the available information. For instance: one case involves a woman who may have bacterial meningitis, but her spinal fluid test results are mixed. Students then read parts of a paper describing the success of different components of the spinal fluid test as predictors of meningitis.
The course is self-paced and highly modular, alternating between videos, multiple choice or calculation questions, and short written submissions. There is no in-course interaction between students taking the same course, but it is divided into "class sections" for the convenience of teachers who want to observe their students. It works well with Firefox and Safari, and slightly less well (but still easily usable) with Internet Explorer.
Anyone who is interested or wants more information, look at their website or ask me in the comments. Once a decent number of people have shown some interest, I will contact one of the site administrators and he'll set up an official class section for us.
EDIT: I have contacted the site administrator, we should have a class section available soon. Section name and info on how to log in will be posted shortly.
EDIT2: The course section is up: go to http://evidenceworksremote.com/courses/ and then find the Less Wrong Community course. When you click on the course listing you will be asked to register. Once you receive the acknowledging email, return to the course and enter the "enrollment" key: LW101 . I will be able to see your responses to the questions and possibly able to provide feedback. Once you have completed the course, Dr. Allan, who is one of the developers, would appreciate feedback by email.
Foundations of Inference
I've recently been getting into all of this wonderful Information Theory stuff and have come across a paper (thanks to John Salvatier) that was written by Kevin H. Knuth:
The paper sets up some intuitive minimal axioms for quantifying power sets and then (seems to) use them to derive Bayesian probability theory, information gain, and Shannon entropy. The paper also claims to use fewer assumptions than both Cox and Kolmogorov when choosing axioms. This seems like a significant foundation/unification. I'd like to hear whether others agree and what parts of the paper you think are the significant contributions.
If a 14 page paper is too long for you, I recommend skipping to the conclusion (starting at the bottom of page 12) where there is a nice picture representation of the axioms and a quick summary of what they imply.
Naming the Highest Virtue of Epistemic Rationality
Edit: Looking back at this a few years later. It is pretty embarrassing, but I'm going to leave it up.
Why don't we start treating the log2 of the probability you assign to the great conjunction, conditional on every available piece of information, as the best measure of your epistemic success? Let's call log_2(P(the great conjunction | your available information)) your "Bayesian competence". It is a deductive fact that no other proper scoring rule can give: Score(P(A|B)) + Score(P(B)) = Score(P(A&B)); and obviously, you should get the same score for assigning P(A|B) to A, after observing B, and assigning P(B) to B a priori, as you would get for assigning P(A&B) to A&B a priori. The great conjunction is the conjunction of all true statements expressible in your idiolect. Your available information may be treated as the ordered set of your retained stimuli.
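The additivity property claimed here for the log score is easy to check numerically. A minimal Python sketch, with made-up probabilities chosen purely for illustration:

```python
import math

# Arbitrary made-up probabilities for two events A and B.
p_b = 0.5          # P(B)
p_a_given_b = 0.2  # P(A|B)
p_a_and_b = p_b * p_a_given_b  # P(A&B), by the product rule

# The property in the text: Score(P(A|B)) + Score(P(B)) = Score(P(A&B)),
# which holds when Score is a logarithm (here, log base 2).
lhs = math.log2(p_a_given_b) + math.log2(p_b)
rhs = math.log2(p_a_and_b)
assert abs(lhs - rhs) < 1e-9
```

This only demonstrates that the logarithm satisfies the additivity condition, not the uniqueness claim, which requires the properness argument given in the post.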
If this doesn't make sense, or you aren't familiar with these ideas, check out Technical Explanation after checking out Intuitive Explanation.
It is standard LW doctrine that we should not name the highest value of rationality, and it is often defended quite brilliantly:
You may try to name the highest principle with names such as “the map that reflects the territory” or “experience of success and failure” or “Bayesian decision theory”. But perhaps you describe incorrectly the nameless virtue. How will you discover your mistake? Not by comparing your description to itself, but by comparing it to that which you did not name.
and of course also:
How can you improve your conception of rationality? Not by saying to yourself, “It is my duty to be rational.” By this you only enshrine your mistaken conception. Perhaps your conception of rationality is that it is rational to believe the words of the Great Teacher, and the Great Teacher says, “The sky is green,” and you look up at the sky and see blue. If you think: “It may look like the sky is blue, but rationality is to believe the words of the Great Teacher,” you lose a chance to discover your mistake.
These quotes are from the end of Twelve Virtues
Should we really be wondering whether there's a virtue higher than Bayesian competence? Is there really a probability worth worrying about that the description of Bayesian competence above is misunderstood? Is the description not simple enough to be mathematical? What mistake might I discover in my understanding of Bayesian competence by comparing it to that which I did not name, after I've already given a proof that Bayesian competence is proper, and that the restrictions (that score(P(B)*P(A|B)) = score(P(B)) + score(P(A|B)), and that the score must be a proper scoring rule) uniquely specify log_b?
I really want answers to these questions. I am still undecided about them; and change my mind about them far too often.
Of course, your bayesian competence is ridiculously difficult to compute. But I am not proposing the measure for practical reasons. I am proposing the measure to demonstrate that degree of rationality is an objective quantity that you could compute given the source code to the universe, even though there are likely no variables in the source that ever take on this value. This may be of little to no value to the most obsessively pragmatic practitioners of rationality. But it would be a very interesting result to philosophers of science and rationality.
Updated to better express the author's view, and take feedback into account. Apologies to any commenter whose comment may have been nullified.
The comment below:
The general reason Eliezer advocates not naming the highest virtue (as I understand it) is that there may be some type of problem for which bayesian updating (and the scoring rule referred to) yields the wrong answer. This idea sounds rather improbable to me, but there is a non-negligible probability that bayes will yield a wrong answer on some question. Not naming the virtue is supposed to be a reminder that if bayes ever gives the wrong answer, we go with the right answer, not bayes.
has changed my mind about the openness of the questions I asked.
Review article on Bayesian inference in physics
A nice article just appeared in Reviews of Modern Physics. It offers brief coverage of the fundamentals of Bayesian probability theory, the practical numerical techniques, a diverse collection of real-world examples of applications of Bayesian methods to data analysis, and even a section on Bayesian experimental design. The PDF is available here.
The abstract:
Rev. Mod. Phys. 83, 943–999 (2011)
Bayesian inference in physics
Received 8 December 2009; published 19 September 2011
Bayesian inference provides a consistent method for the extraction of information from physics experiments even in ill-conditioned circumstances. The approach provides a unified rationale for data analysis, which both justifies many of the commonly used analysis procedures and reveals some of the implicit underlying assumptions. This review summarizes the general ideas of the Bayesian probability theory with emphasis on the application to the evaluation of experimental data. As case studies for Bayesian parameter estimation techniques examples ranging from extra-solar planet detection to the deconvolution of the apparatus functions for improving the energy resolution and change point estimation in time series are discussed. Special attention is paid to the numerical techniques suited for Bayesian analysis, with a focus on recent developments of Markov chain Monte Carlo algorithms for high-dimensional integration problems. Bayesian model comparison, the quantitative ranking of models for the explanation of a given data set, is illustrated with examples collected from cosmology, mass spectroscopy, and surface physics, covering problems such as background subtraction and automated outlier detection. Additionally the Bayesian inference techniques for the design and optimization of future experiments are introduced. Experiments, instead of being merely passive recording devices, can now be designed to adapt to measured data and to change the measurement strategy on the fly to maximize the information of an experiment. The applied key concepts and necessary numerical tools which provide the means of designing such inference chains and the crucial aspects of data fusion are summarized and some of the expected implications are highlighted.
© 2011 American Physical Society
What are good techniques and resources for teaching Bayes' theorem hands-on?
Frequentist vs Bayesian breakdown: interpretation vs inference
Suppose we have two different human beings, Connor and Diane, who agree to interpret their subjective anticipations as probabilities, thereby commonly earning them the title "Bayesian". On a particular project or venture, they might disagree on whether to use Trick A or Trick B to decide the next step in the project. It might be that Trick A is commonly labelled a "Frequentist inference method" and B is a "Bayesian inference method". Why might they disagree?
As far as I can see, there are 3 disagreements that get labelled "Bayesian vs Frequentist" debates, and conflating them is a problem:
(1) Whether to interpret all subjective anticipations as probabilities.
(2) Whether to interpret all probabilities as subjective anticipations.
(3) Whether, on a particular project, to use Statistical Trick B instead of Statistical Trick A to infer the best course of action, when B is commonly labelled a "Bayesian method" and A is a "Frequentist method".
(Regarding 3, UC Berkeley professor Michael Jordan offers a good heuristic for how statistical tricks get labelled as Bayesian or Frequentist, in terms of which terms in a loss function one treats as fixed or variable. I recommend watching the first twenty minutes of his video lecture on this if you're not familiar.)
The question "is Connor a Bayesian or a Frequentist?" is commonly posed as though Connor's position on 1, 2, and 3 must be either "yes, yes, yes" or "no, no, no". I don't believe that has to be the case. For example, my position is:
(1) - Yes. Insofar as we have subjective anticipations, I agree normatively that they should behave and update as probabilities.
(2) - Don't care much. Expressions like P(X|Y) and P(X and Y) are useful for denoting both subjective anticipations and proportions of a whole, and in particular, proportions of real future events. Whether to use the word "probability" for both is a terminological question. Personally I try to reserve the word "probability" for subjective anticipations, and say "proportion" for proportions of real future events, but this is word choice. Unfortunately this word choice is strongly associated and confused with positions on (1) and (3).
(3) - It depends. In statistical inference, we commonly consider data sets x, world models M, and parameters θ that specify the model M more precisely. I consider the separation of belief into M and θ to be purely formal. When guessing the next data set y, one considers expressions of the form P(x|M,θ) in some way. If I'm already very confident in a specific world model M, and expect θ to actually vary from situation to situation, I'll probably try to estimate the parameters θ from x in a way that has the best expected success rate across all possible data sets M would generate. You might say here that I "trust the model more than the data" (though what I really don't trust are the changing model parameters), and this is a trick commonly referred to as "Frequentist". If I'm not confident in the model M, or expect the parameters θ to be the same in many future situations, I'll probably try to estimate M,θ from x in a way that has the best expected success rate assuming x. You might say here that I "trust the data more than the model", and label this a "Bayesian" trick.
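To make (3) concrete, here is a toy sketch of the two tricks for coin flips. Everything specific is invented for illustration: a Bernoulli model M with parameter θ = P(heads), and a Beta(5, 5) prior standing in for trust in the model parameters.

```python
# Toy version of the two "tricks" in (3), for coin flips.
def mle(heads, flips):
    # "Frequentist trick": maximum-likelihood estimate, tuned for average-case
    # behaviour across the data sets the model M would generate
    return heads / flips

def map_estimate(heads, flips, a=5, b=5):
    # "Bayesian trick": mode of the Beta(heads + a, flips - heads + b)
    # posterior, conditioned on this particular data set
    return (heads + a - 1) / (flips + a + b - 2)

print(mle(9, 10))           # 0.9: trusts the data
print(map_estimate(9, 10))  # ~0.72: pulled toward the prior
```

With 9 heads in 10 flips, the first estimate follows the data while the second is pulled toward the prior; which one you reach for is exactly the situational judgment described above.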
Throughout (3), since my position in (1) is not changing, a member of the Bayes Tribe will say I'm "really a Bayesian all along", but I don't want to continue with this conflation of position names. It's true that if I use the "Frequentist trick", it will be because I've updated in favor of it, i.e. my subjective confidence levels in the various theory elements are appropriate for it.
... But from now on, when the term "Bayesian" or "Frequentist" arises in a debate, my plan is to taboo the terms immediately, and proceed to either dissolve the issue into (1), (2), and (3) above, or change the conversation if people don't have the energy or interest for that length of conversation.
Do people agree with this breakdown? I think I could be persuaded otherwise and would of course appreciate it if I were :)
ETA: I think the wisdom to treat beliefs as anticipation controllers and update our confidences based on evidence might be too precious to alienate people from it with the label "Bayesian", especially if the label is as ambiguous as my breakdown has found it to be.
Michael Jordan dissolves Bayesian vs Frequentist inference debate [video lecture]
UC Berkeley professor Michael Jordan, a leading researcher in machine learning, has a great reduction of the question "Are your inferences Bayesian or Frequentist?". The reduction is basically "Which term are you varying in the loss function?". He calls this the "decision theoretic perspective" on the debate, and uses this terminology well in keeping with LessWrong interests.
I don't have time to write a top-level post about this (maybe someone else does?), but I quite liked the lecture, and thought I should at least post the link!
http://videolectures.net/mlss09uk_jordan_bfway/
The discussion gets much clearer starting at the 10:11 slide, which you can click on and skip to if you like, but I watched the first 10 minutes anyway to get a sense of his general attitude.
Enjoy! I recommend watching while you eat, if it saves you time and the food's not too distracting :)
Bayesian Reasoning Applied to House Selling: Listing Price
Like Yvain's parents, I am planning on moving house. Selling a house and buying a house involve making a lot of decisions based on limited information, which I thought would make a set of good exercises for the application of Bayesian reasoning. I need to decide what price to list my house for, determine how much time and money to put into fixing it up, choose a new home and then there's the two poker games of the final negotiations of the sale.
(I logged onto Less Wrong having just made the decision to consider posting this article, so I was kind of weirded out at first by the title of Yvain's post; but then I was relieved that the topic was somewhat different. I am used to coincidences, but they still push me a little toward the paranoid end of my spectrum, and I'll feel less stable for a few hours. I already know Google tracks me, and who knows what algorithms could be running given a bunch of computer scientists...?)
House Story
tl;dr: We're listing at the appraised value +10%.
A few years ago, we purchased a beautiful house. 'We' is my husband and I and my parents. We purchased the house because it includes a guest house where my parents can retire. However, my mom continues to postpone retirement and in the meantime my husband and I decided we would a) like more light, b) like a shorter commute and c) could purchase two homes we prefer for the price of this one -- my parents would enjoy a house on the water. (Great post and spot on about the features that matter, Yvain!)
I would be happy to sell the house for +5%, covering real estate fees and new flooring we put in. However, three houses in the cul de sac have sold this year for +10% and so we listed it at that price too. Our house is bigger than theirs but not as nice (they have granite and impressive entrances and we don't). On the other hand, having the guest house makes us special.
Via agent and potential buyer feedback, we're coming to realize that we might be lucky to sell the house for +5%. At this price level, people prefer a house that is impressive and in perfect condition.
Primary Bayesian Question
My primary question is the following: how should we decide to modify our listing price as we get more information?
First, I've read that if a house is priced correctly you'll get an average of one offer every 10 showings. So far we've had 2 showings without an offer. After how many showings should we reduce the price?
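As a sketch of how such an update could work: the 1-in-10 offer rate per showing comes from the rule of thumb above, while the 50/50 prior and the 1-in-40 offer rate for an overpriced house are invented assumptions, not data.

```python
# Posterior probability the house is priced right, given n offerless showings.
def p_priced_right(n_showings_no_offer,
                   prior=0.5, p_offer_right=0.10, p_offer_over=0.025):
    # Likelihood of n showings with no offer under each hypothesis
    like_right = (1 - p_offer_right) ** n_showings_no_offer
    like_over = (1 - p_offer_over) ** n_showings_no_offer
    num = prior * like_right
    return num / (num + (1 - prior) * like_over)

for n in (2, 10, 20, 30):
    print(n, round(p_priced_right(n), 3))
```

Under these made-up numbers, two offerless showings barely move the posterior, while twenty would be fairly strong evidence the price is too high.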
Second, the other three houses sold in 6 or 7 months. After how many months should we reduce the price?
Keep in mind, we don't have to move and I estimate that I would be willing to stay in this house for about +3% per year. In other words, I would be willing to wait 2 years for a higher offer if I could sell it for +3% more by doing so.
I anticipate that after posting this I will be embarrassed that it is so pecuniary. On the other hand, this makes it concrete and the problem in general doesn't have too many emotional factors. Any money we make over the first +5% can be used as a down payment for our next house after we pay our parents back. (I did feel embarrassed, so I took out the dollar values and replaced with relative percents.)
Attempt to explain Bayes without much maths, please review
My current favourite waste of time is the concept of Bayesian postmodernism. Just putting those two words together invokes a world of delightful wrangling, as approximately anyone who understands one won't understand or will have contempt for the other. (Though I found at least one person - a programmer who's studied philosophy - who got the idea straight away when I posted it to my blog, which is one more than I was expecting.) It is currently a page of incoherent notes and isn't necessarily actually useful for anything as yet and may never be.
Anyway, that's not my point today. My point today is that as part of this, I have to somehow explain Bayesian thinking in a nutshell to people who are highly intelligent, but have no mathematical knowledge and may actually be dyscalculic - but who can and do get the feel of things. I'm trying to get across that this is how learning works already and I just want to make them aware of it. I've run it past a couple of working artists who seemed to get the idea a bit. So I am posting this here for your technical correction.
If you think it's any good, please do run it past artists or critics of your acquaintance. (I'm glancing in AndrewHickey's direction right now.)
"The meaning of a thing is the way you should be influenced by it." - Vladimir Nesov
To explain what "Bayesian postmodernism" means, I first have to try to explain Bayesian epistemology.
- Probability does not exist outside in the world - it exists in your head. Things happen or not in the world; probability measures your knowledge of them.
- You know certain things, to some degree. New knowledge and experiences come in and affect your knowledge, pulling your degree of certainty of given ideas up or down.
- Bayes' Theorem is the equation describing precisely how much a new piece of information (on the probability of something holding in the world) must affect your knowledge. This is a mathematical theorem, true in the sense that 2+2=4 is true; this is a mathematical question with one right answer. If you know your "prior probability," and you know what the new information is, you know your new probability (the "posterior probability").
- The hard part is, of course, knowing what the hell your prior actually is, to any more useful specificity than "everything you think you know about everything."
- (Just to make it harder, the prior is not a number, but a probability distribution over a spectrum of possible alternatives.)
Bayesian epistemology is the notion of using this approach to map out the network of your degrees of certainty of your ideas and how they interact, and just how much a new idea should change your existing degrees of certainty.
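For readers who do want one worked number, here is the update rule as a tiny calculation (all three probabilities are invented purely for illustration):

```python
# A toy numeric run of the update rule described above.
prior = 0.04                 # how certain you were beforehand
p_evidence_if_true = 0.80    # how expected this evidence is if the idea is right
p_evidence_if_false = 0.10   # how expected it is anyway if the idea is wrong

# Bayes' Theorem: new certainty = prior * likelihood / total chance of evidence
p_evidence = prior * p_evidence_if_true + (1 - prior) * p_evidence_if_false
posterior = prior * p_evidence_if_true / p_evidence
print(posterior)  # ≈ 0.25: belief pulled up, but still far from certainty
```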
The application to criticism and understanding of art should be obvious to anyone with even an enthusiast's experience in the field. (And probably not to anyone without.) Postmodernism tells us we can't be certain of anything; Bayesianism tells us precisely how uncertain we should be.
Problems:
- Assigning meaningful numbers is tricky. It's hard enough having some sort of feel for how certain you feel a given notion is, let alone working out how those certainties should interact with mathematical rigour.
- The mathematics to build a Bayesian network properly can get quite hairy. Calculus tends not to be a strong point of art critics.
- The subject matter is subjective internal feelings about art. Two people could build plausible yet utterly incompatible Bayesian networks of subjective feelings, even given that art is intersubjective rather than purely subjective. (There is an interesting result called Aumann's Agreement Theorem which mathematically proves that two Bayesians who start from the same priors and have common knowledge of each other's estimates cannot "agree to disagree"; if they do, at least one must be wrong. But try to find two art enthusiasts who start from the same life experiences with the same personal inclinations. Thus, convincing others becomes an argument about bases.)
No human who claims to be a Bayesian actually has a network mapped out in their head. They're just doing their best. But that people (a) do this and (b) get useful results from it - even in number-based fields, rather than subjective feeling-based ones - is promising.
A word on competing approaches: The model that holds that probability exists in the world, which is the version found in common everyday popular usage and which your statistics textbooks probably taught you how to use, is the frequentist approach. This is a grab-bag of tools and statistical methods to apply to the problem. The easy part is you don't have to know your precise prior. The hard part is that different methods can get different answers, of which only one (if that) can be right, so you have to know which one to apply. The entire frequentist toolkit can be mathematically derived from the Bayesian approach. The Bayesian approach is increasingly popular in science and economics, because it gives the right answer if you have your prior right.
If the above only requires minor fixes, I may post-edit based on comments so I can just refer people to this link.
Despite the above section being what I've posted this here for discussion of, this is going to devolve into a thread about postmodernism. So I'll answer some of the obvious here.
- "Postmodernism" is not one coherent thing, but six or seven (so far that I've encountered). Per the name, it's a reaction against something called "modernism" in whatever field is being addressed.
- This also means that a lot of it makes no damn sense to the untrained reader unless you also understand what it's a reaction against.
- The name is given to both the methods and the results. Finding terrible postmodernism is about as hard as finding terrible punk rock, for the same reasons. (e.g. that Sokal pranked some idiots has negligible bearing on Derrida.)
- Science, and the Enlightenment in general, is a modernist project by nature (despite the Modernist movement per se claiming to be a reaction to Enlightenment thinking), so science fans and postmodernists have a natural culture clash.
- It was obvious to me, but I appear to be the first person in the world to explicitly note the Bayes structure of the postmodern approach. I noticed this, then I noticed that différance, insofar as Derrida actually admits there's a definition, looks very like how a Bayesian update would feel from the inside. I think. Maybe.
- Cav points out that I'm basically rebuilding PM in the shape of Bayes. I need to learn enough to attempt to do it the other way around too.
- I've either struck gold or I've struck crack.
Post-script: No-one's coughed up their own skull in horror yet, so I assume I haven't made any glaring technical errors and, modulo a few post-edits, this'll do for now. It's still too mathematical, but diagrams may help - maybe the next version will have some.
Nor has anyone started talking about postmodernism, to my surprise.
PPS: And I'm surprised no-one's disputed "No human who claims to be a Bayesian actually has a network mapped out in their head."
Against improper priors
An improper prior is essentially a prior probability distribution that cannot be normalized: spread over an infinite range, its density would have to be infinitesimal everywhere for the total to come to one. For example, the uniform prior over all real numbers is an improper prior, as there would be an infinitesimal probability of getting a result in any finite range. It's common to use improper priors when you have no prior information.
The mark of a good prior is that it gives a high probability to the correct answer. If I bet 1,000,000 to one that a coin will land on heads, and it lands on tails, it could be a coincidence, but I probably had a bad prior. A good prior is one that results in me not being very surprised.
With a proper prior, probability is conserved, and more probability mass in one place means less in another. If I'm less surprised when a coin lands on tails, I'm more surprised when it lands on heads. This isn't true with an improper prior. If I wanted to predict the value of a random real number, and used a normal distribution with a mean of zero and a standard deviation of one, I'd be pretty darn surprised if it doesn't end up being pretty close to zero, but I'd be infinitely surprised if I used a uniform distribution. No matter what the number is, it will be more surprising with the improper prior. Essentially, a proper prior is better in every way. (You could find exceptions to this, such as averaging a proper and an improper prior to get an improper prior that still has finite probabilities which just add up to 1/2, or using a proper prior that is zero in some places, but you can always make a proper prior that's better in every way than a given improper prior.)
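The "infinitely surprised" claim can be made concrete by measuring surprisal as negative log density, approximating the improper uniform prior by Uniform[-N, N], and letting N grow without bound (a sketch under that approximation):

```python
import math

def normal_surprisal(x, mu=0.0, sigma=1.0):
    # -log of the normal density at x
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

def uniform_surprisal(n):
    # -log of the Uniform[-n, n] density (the same at every point)
    return math.log(2 * n)

x = 5.0  # a genuinely surprising draw for a standard normal
print(normal_surprisal(x))       # ≈ 13.4, large but finite
print(uniform_surprisal(10**6))  # ≈ 14.5 already, and unbounded as N grows
```

For any fixed observation, the uniform approximation's surprisal eventually exceeds the normal prior's and diverges, matching the claim above.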
Dutch books also seem to be a popular way of showing what works and what doesn't, so here's a simple Dutch book argument against improper priors: I have two real numbers, x and y. Suppose each has a uniform distribution. I offer you a bet at 1:2 odds that x has the higher magnitude. They're equally likely to be higher, so you take it. I then show you the value of x. I offer you a new bet at 100:1 odds that y has the higher magnitude. You know y almost certainly has a higher magnitude than that, so you take it again. No matter what happens, I win.
You could try to get out of it by using a different prior, but I can just perform a transformation on it to get what I want. For example, if you choose a logarithmic prior for the magnitude, I can just take the magnitude of the log of the magnitude, and have a uniform distribution.
There are certainly uses for an improper prior. You can use it if the evidence is so great compared to the difference between it and the correct value that it isn't worth worrying about. You can also use it if you're not sure what another person's prior is, and you want to give a result that is at least as high as they'd get no matter how much their prior is spread out. That said, an improper prior is never actually correct, even in things that you have literally no evidence for.
Bayesian justice
"The mathematical mistakes that could be undermining justice"
They failed, though, to convince the jury of the value of the Bayesian approach, and Adams was convicted. He appealed twice unsuccessfully, with an appeal judge eventually ruling that the jury's job was "to evaluate evidence not by means of a formula... but by the joint application of their individual common sense."
But what if common sense runs counter to justice? For David Lucy, a mathematician at Lancaster University in the UK, the Adams judgment indicates a cultural tradition that needs changing. "In some cases, statistical analysis is the only way to evaluate evidence, because intuition can lead to outcomes based upon fallacies," he says.
Norman Fenton, a computer scientist at Queen Mary, University of London, who has worked for defence teams in criminal trials, has just come up with a possible solution. With his colleague Martin Neil, he has developed a system of step-by-step pictures and decision trees to help jurors grasp Bayesian reasoning (bit.ly/1c3tgj). Once a jury has been convinced that the method works, the duo argue, experts should be allowed to apply Bayes's theorem to the facts of the case as a kind of "black box" that calculates how the probability of innocence or guilt changes as each piece of evidence is presented. "You wouldn't question the steps of an electronic calculator, so why here?" Fenton asks.
It is a controversial suggestion. Taken to its logical conclusion, it might see the outcome of a trial balance on a single calculation. Working out Bayesian probabilities with DNA and blood matches is all very well, but quantifying incriminating factors such as appearance and behaviour is more difficult. "Different jurors will interpret different bits of evidence differently. It's not the job of a mathematician to do it for them," says Donnelly.
The linked paper is "Avoiding Probabilistic Reasoning Fallacies in Legal Practice using Bayesian Networks" by Norman Fenton and Martin Neil. The interesting parts, IMO, begin on page 9, where they argue for using the likelihood ratio as the key piece of information for evidence, and not simply raw probabilities; page 17, where a DNA example is worked out; and pages 21-25, on the key piece of evidence in the Bellfield trial: no one claiming a lost possession (nearly worthless evidence).
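The likelihood-ratio bookkeeping argued for there is just Bayes' theorem in odds form; a sketch with invented numbers, not taken from Fenton and Neil's examples:

```python
def update_odds(prior_odds, likelihood_ratio):
    # posterior odds = prior odds * likelihood ratio
    return prior_odds * likelihood_ratio

def odds_to_prob(odds):
    return odds / (1 + odds)

odds = 1 / 1000                 # prior odds of guilt (illustrative)
odds = update_odds(odds, 100)   # strong forensic match: LR = 100
odds = update_odds(odds, 1.1)   # nearly worthless evidence: LR close to 1
print(odds_to_prob(odds))       # ≈ 0.099, still far from proof of guilt
```

Note how the nearly worthless evidence (LR close to 1) barely moves the odds, which is exactly the point about the Bellfield possession evidence.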
Related reading: Inherited Improbabilities: Transferring the Burden of Proof, on Amanda Knox.
Psychologist making pseudo-claim that recent works "compromise the Bayesian point of view"
I have recently been corresponding with a friend who studies psychology regarding human cognition and the best underlying models for understanding it. His argument, summarized very briefly, is given by this quote:
Lastly, there has been a huge amount of research over the last two decades that shows human reasoning is 1) entirely constituted by emotion, and that it is 2) mostly unconscious and therefore out of our control. A lot of this research has seriously compromised the Bayesian point of view. I am referring to work done by Antonio Damasio, who demonstrated the essential role emotion plays in decision making (Descartes' Error), Timothy Wilson, who demonstrated the vital role of the unconscious (Strangers to Ourselves), and Jonathan Haidt, who demonstrated how moral reasoning is dictated by intuition and emotion (The Emotional Dog and its Rational Tail). I could go on and on here. I assume that you are familiar with this stuff. I'd just like to know how you would respond to this work from the point of view of your studies (in particular, those two points). I don't mean to get in a tit for tat debate here, just want the other side of the story.
I am having trouble synthesizing a response that captures the Bayesian point of view (and is sufficiently backed up by sources so that it will be useful for my friend rather than just gainsaying of the argument) because I am mostly a decision theory / probability person. Are these works of psychology and neuroscience really illustrating that human emotion governs decision making? What are some good neuroscience papers to read that deal with this, and how do Bayesians respond? It may be that everything he mentions above is a correct assessment (I don't know and don't have enough time to read the books right now), but that it is irrelevant if you want to make good decisions rather than just accept the types of decisions we already make.
Experiment: Knox case debate with Rolf Nelson
Recently, on the main section of the site, Raw_Power posted an article suggesting that we find "worthy opponents" to help us avoid mistakes.
As you may recall, Rolf Nelson disagrees with me about Amanda Knox -- rather sharply. Of course, the same can be said of lots of other people (if not so much here on Less Wrong). But Rolf isn't your average "guilter". Indeed, considering that he speaks fluent Bayesian, is one of the Singularity Institute's largest donors, and is also (as I understand it) signed up for cryonics, it's hard to imagine an "opponent" more "worthy". The Amanda Knox case may not be in the same category of importance as many other issues where Rolf and I probably agree; but my opinion on it is very confident, and it's the opposite of his. If we're both aspiring rationalists, at least one of us is doing something wrong.
As it turns out, Rolf is interested in having a debate with me on the subject, to see if one of us can help to change the other's mind. I'm setting this post up as an experiment, to see if LW can serve as a suitable venue for such an exercise. I hope it can: Less Wrong is almost unique in the extent to which the social norms governing discussion reflect and coincide with the requirements of personal epistemic rationality. (For example: "Do not believe you do others a favor if you accept their arguments; the favor is to you.") But I don't think we've yet tried an organized one-on-one debate -- so we'll see how it goes. If it proves too unwieldy or inappropriate for some other reason, we can always move to another venue.
Although the primary purpose of this post is a one-on-one debate between Rolf Nelson and myself, this is a LW Discussion post like any other, and it goes without saying that others are welcome and encouraged to comment. Just be aware that we, the main protagonists, will try to keep our discussion focused on each other's arguments. (Also, since our subject is an issue where there is already a strong LW consensus, one would prefer to avoid a sort of "gangup effect" where lots of people "pounce" on the person taking the contrarian position.)
With that, here we go...
Unconditionally Convergent Expected Utility
Expected utility can be expressed as the sum ΣP(Xn)U(Xn). Suppose P(Xn) = 2-n, and U(Xn) = (-2)n/n. Then expected utility = Σ2-n(-2)n/n = Σ(-1)n/n = -1+1/2-1/3+1/4-... = -ln(2). Except there's no obvious order to add it. You could just as well say it's -1+1/2+1/4+1/6+1/8-1/3+1/10+1/12+1/14+1/16-1/5+... = 0. The sum depends on the order you add it. This is known as conditional convergence.
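Both sums are easy to check numerically; the sketch below adds the series in its natural order and in the rearranged order described above (one odd-denominator term followed by four even-denominator terms):

```python
import math

def natural_order(n_terms):
    # partial sum of (-1)^n / n for n = 1..n_terms; converges to -ln 2
    return sum((-1) ** n / n for n in range(1, n_terms + 1))

def rearranged(n_blocks):
    # same terms, different order: -1, +1/2, +1/4, +1/6, +1/8, -1/3, +1/10, ...
    total = 0.0
    for k in range(1, n_blocks + 1):
        total -= 1.0 / (2 * k - 1)                    # -1, -1/3, -1/5, ...
        for m in range(4 * (k - 1) + 1, 4 * k + 1):
            total += 1.0 / (2 * m)                    # four even-denominator terms
    return total

print(natural_order(100000))  # ≈ -0.69315 ≈ -ln 2
print(rearranged(100000))     # ≈ 0
```

Same terms, two different limits: that is conditional convergence in action.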
This is clearly something we want to avoid. Suppose my priors have an unconditionally convergent expected utility. This would mean that ΣP(Xn)|U(Xn)| converges. Now suppose I observe evidence Y. ΣP(Xn|Y)|U(Xn)| = Σ|U(Xn)|P(Xn∩Y)/P(Y) ≤ Σ|U(Xn)|P(Xn)/P(Y) = 1/P(Y)·ΣP(Xn)|U(Xn)|. As long as P(Y) is nonzero, this must also converge.
If my prior expected utility is unconditionally convergent, then given any finite amount of evidence, so is my posterior.
This means I only have to come up with a nice prior, and I'll never have to worry about evidence breaking expected utility.
I suspect that this can be made even more powerful, and given any amount of evidence, finite or otherwise, I will almost surely have an unconditionally convergent posterior. Anyone want to prove it?
Now let's look at Pascal's Mugging. The problem here seems to be that someone could very easily make an arbitrarily powerful threat. However, in order for expected utility to converge unconditionally, either carrying out the threat must get unlikely faster than the disutility increases, or the making of the threat itself must get unlikely that fast. In other words, either someone threatening 3^^^3 people must be so unlikely to carry it out as to make the threat non-threatening, or such a threat must be so unlikely to be made in the first place that you don't have to worry about it.
Free Stats Textbook: Principles of Uncertainty
Joseph Kadane, emeritus at Carnegie Mellon, released his new statistics textbook Principles of Uncertainty as a free pdf. The book is written from a Bayesian perspective, covering basic probability, decision theory, conjugate distribution analysis, hierarchical modeling, MCMC simulation, and game theory. The focus is mathematical, but computation with R is touched on. A solid understanding of calculus seems sufficient to use the book. Curiously, the author devotes a fair number of pages to developing the McShane integral, which is equivalent to Lebesgue integration on the real line. There are lots of other unusual topics you don't normally see in an intermediate statistics textbook.
Having come across this today, I can't say whether it is actually very good or not, but the range of topics seems perfectly suited to Less Wrong readers.
The Joys of Conjugate Priors
(Warning: this post is a bit technical.)
Suppose you are a Bayesian reasoning agent. While going about your daily activities, you observe an event of type x. Because you're a good Bayesian, you have some internal parameter θ which represents your belief that x will occur.
Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of x is a datapoint, and thus you'll want to modify θ. But how much should this datapoint influence θ? Well, that will depend on how sure you are of θ in the first place. If you calculated θ based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of x shouldn't have much impact. But if your estimate of θ is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of θ.
Of course, when you reestimate θ, you'll also have to reestimate how confident you are in its value. Or, to put it a different way, you'll want to compute a new probability distribution over possible values of θ. This new distribution will be P(θ|x), and it can be computed using Bayes' rule:

P(θ|x) = P(x|θ) P(θ) / P(x),  where P(x) = ∫ P(x|θ') P(θ') dθ'
Here, since θ is a parameter used to specify the distribution from which x is drawn, it can be assumed that computing P(x|θ) is straightforward. P(θ) is your old distribution over θ, which you already have; it says how accurate you think different settings of the parameters are, and allows you to compute your confidence in any given value of θ. So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.
But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions. If this is your goal, then once you've computed the distribution P(θ|x), you can pick the value of θ that maximizes it. This will be your new parameter, and because you have the formula for P(θ|x), you'll know exactly how confident you are in this parameter.
In practice, picking the value of θ which maximizes P(θ|x) is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems. For simple enough distributions, you can use the EM algorithm, which is guaranteed to converge to a local optimum. But for more complicated distributions, even this method is intractable, and approximate algorithms must be used. Because of this concern, it's important to keep the distributions P(x|θ) and P(θ) simple. Choosing the distribution P(x|θ) is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.
It is assumed that the type of model is chosen before deciding on the form of the distribution P(θ). So how do you choose a good distribution for P(θ)? Notice that every time you see a new datapoint, you'll have to do the computation in the equation above. Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together. If these distributions are chosen poorly, P(θ|x) could get quite messy very quickly.
If you're a smart Bayesian agent, then, you'll pick P(θ) to be a conjugate prior to the distribution P(x|θ). The distribution P(θ) is conjugate to P(x|θ) if multiplying these two distributions together and normalizing results in another distribution of the same form as P(θ).
Let's consider a concrete example: flipping a biased coin. Suppose you use the bernoulli distribution to model your coin. Then it has a parameter θ which represents the probability of getting heads. Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails. Then the distribution of the outcome x of the coin flip looks like this:

P(x|θ) = θ^x (1-θ)^(1-x)
It turns out that the conjugate prior for the bernoulli distribution is something called the beta distribution. It has two parameters, α and β, which we call hyperparameters because they are parameters for a distribution over our parameters. (Eek!)
The beta distribution looks like this:

P(θ|α,β) = θ^(α-1) (1-θ)^(β-1) / B(α,β)
Since θ represents the probability of getting heads, it can take on any value between 0 and 1, and thus this function is normalized properly.
Suppose you observe a single coin flip and want to update your beliefs regarding θ. Since the denominator of the beta function in the equation above is just a normalizing constant, you can ignore it for the moment while computing P(θ|x), as long as you promise to normalize after completing the computation:

P(θ|x) ∝ P(x|θ) P(θ|α,β) ∝ θ^x (1-θ)^(1-x) · θ^(α-1) (1-θ)^(β-1) = θ^(x+α-1) (1-θ)^((1-x)+β-1)
Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the bernoulli distribution. Super cool, right?
If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for P(θ|α,β) looks remarkably similar to the non-factorial part of the binomial distribution. This suggests a form for the normalization constant:

P(θ|α,β) = [Γ(α+β) / (Γ(α) Γ(β))] θ^(α-1) (1-θ)^(β-1)
The beta and binomial distributions are almost identical. The biggest difference between them is that the beta distribution is a function of θ, with α and β as prespecified parameters, while the binomial distribution is a function of the number of heads, with the number of flips and θ as prespecified parameters. It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.
Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials. Recall that the gamma function is just a generalization of the factorial to the reals; thus, the beta distribution allows α and β to be any positive real number, while the binomial distribution is only defined for integers. As a final note on the beta distribution, the -1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s. For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf. It gives an actual derivation which shows that the first equation for the beta distribution is equivalent to the second, which is nice if you don't find the argument by analogy to the binomial distribution convincing.
So, what is the philosophical significance of the conjugate prior? Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to? No; there is deep philosophical significance to the form of the beta distribution.
Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically. If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly. This intuition is captured by the form of the conjugate prior. α and β can be viewed as keeping track of how many heads and tails you've seen, respectively. So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior. The beta distribution captures the difference between claiming that the coin has a 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.
Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be. Then you can choose values for α and β that represent your prior understanding of the coin. Higher values of α and β indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation. α and β will act like "imaginary data"; when you update your distribution over θ after observing a coin flip x, it will be like you already saw α heads and β tails before that coin flip.
If you want to express that you have no prior knowledge about the system, you can do so by setting α and β to 1. This will turn the beta distribution into a uniform distribution. You can also use the beta distribution to do add-N smoothing, by setting α and β to both be N+1. Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting θ to noise in the actual data.
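To make the "counting heads and tails" interpretation concrete, here is a minimal sketch (mine, not from the post) of beta-bernoulli conjugate updating; the function names are my own:

```python
# Conjugate updating for a bernoulli likelihood with a Beta(alpha, beta)
# prior: the posterior is again a beta distribution, obtained by simply
# adding the observed head and tail counts to the hyperparameters.
def update_beta(alpha, beta, flips):
    """Return posterior hyperparameters after observing `flips`,
    a sequence of 1s (heads) and 0s (tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails

def posterior_mean(alpha, beta):
    # Mean of a Beta(alpha, beta) distribution.
    return alpha / (alpha + beta)

# Start from a uniform prior (alpha = beta = 1) and observe 3 heads, 7 tails.
a, b = update_beta(1, 1, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(a, b)                  # -> 4 8
print(posterior_mean(a, b))  # -> 0.333...
```

Note how no integral ever needs to be computed: the entire update is two additions, which is exactly the computational payoff of conjugacy.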
In conclusion, the beta distribution, which is a conjugate prior to the bernoulli and binomial distributions, is super awesome. It makes it possible to do Bayesian reasoning in a computationally efficient manner, as well as having the philosophically satisfying interpretation of representing real or imaginary prior data. Other conjugate priors, such as the dirichlet prior for the multinomial distribution, are similarly cool.
Future Filters [draft]
See Katja Grace's article: http://hplusmagazine.com/2011/05/13/anthropic-principles-and-existential-risks/
There are two comments I want to make about the above article.
First: the resolution to God's Coin Toss seems fairly straightforward. I argue that the following scenario is formally equivalent to 'God's Coin Toss'
"Dr. Evil's Machine"
Dr. Evil has a factory for making clones. The factory has 1000 separate identical rooms. Every day, a clone is produced in each room at 9:00 AM. However, there is a 50% chance of malfunction, in which case 900 of the clones suddenly die by 9:30 AM, the remaining 100 are healthy and notice nothing. At the end of the day Dr. Evil ships off all the clones which were produced and restores the rooms to their original state.
You wake up at 10:00 AM and learn that you are one of the clones produced in Dr. Evil's factory, and you learn all of the information above. What is the probability that the machine malfunctioned today?
In the second reformulation, the answer is clear from Bayes' rule. Let P(M) be the probability of malfunction, and P(S) be the probability that you are alive at 10:00 AM. From the information given, we have
P(M) = 1/2
P(~M) = 1/2
P(S|M) = 1/10
P(S|~M) = 1
Therefore,
P(S) = P(S|M) P(M) + P(S|~M)P(~M) = (1/2)(1/10) + (1/2)(1) = 11/20
P(M|S) = P(S|M) P(M)/P(S) = (1/20)/(11/20) = 1/11
That is, given the information you have, you should conclude that the probability that the machine malfunctioned is 1/11.
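The arithmetic above is simple enough to check mechanically; here is a quick sketch (mine, not from the post) that recomputes the posterior with exact fractions:

```python
from fractions import Fraction

# Recompute P(M|S) for Dr. Evil's machine using exact rational arithmetic.
p_m = Fraction(1, 2)             # prior probability of malfunction
p_s_given_m = Fraction(1, 10)    # 100 of 1000 clones survive a malfunction
p_s_given_not_m = Fraction(1)    # everyone survives a normal day

p_s = p_s_given_m * p_m + p_s_given_not_m * (1 - p_m)
p_m_given_s = p_s_given_m * p_m / p_s

print(p_s)          # -> 11/20
print(p_m_given_s)  # -> 1/11
```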
The second comment concerns Grace's reasoning about future filters.
I will assume that the following model is a fair representation of Grace's argument about relative probabilities for the first and second filters.
Future Filter Model I
Given: universe with N planets, T time steps. Intelligent life can arise on a planet at most once.
At each time step:
- each surviving intelligent species becomes permanently visible to all other species with probability c (the third filter probability)
- each surviving intelligent species self-destructs with probability b (the second filter probability)
- each virgin planet produces an intelligent species with probability a (the first filter probability)
Suppose N=one billion, T=one million. Put uniform priors on a, b, c, and the current time t (an integer between 1 and T).
Your species appeared on your planet at unknown time step t_0. The current time t is also unknown. At the current time, no species has become permanently visible in the universe. Conditioned on this information, what is the posterior density for first filter parameter a?
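One way to get a feel for the posterior on a is brute-force simulation. The sketch below is my own, with N and T scaled far down from the billion-planet version so it runs quickly; it uses rejection sampling: draw (a, b, c, t) from the priors, simulate the three filter steps, and keep only runs in which no species has become visible by the current time.

```python
import random

random.seed(0)

def simulate(a, b, c, n_planets, t):
    """Run the filter model for t steps; return True if no species
    ever became permanently visible."""
    virgin = n_planets   # planets that have not yet produced life
    alive = 0            # surviving, not-yet-visible species
    for _ in range(t):
        # each surviving species becomes visible with probability c
        for _ in range(alive):
            if random.random() < c:
                return False
        # each surviving species self-destructs with probability b
        alive = sum(1 for _ in range(alive) if random.random() >= b)
        # each virgin planet produces a species with probability a
        new = sum(1 for _ in range(virgin) if random.random() < a)
        virgin -= new
        alive += new
    return True

N, T = 20, 20   # scaled-down stand-ins for a billion planets, a million steps
accepted_a = []
for _ in range(5000):
    a, b, c = random.random(), random.random(), random.random()
    t = random.randint(1, T)
    if simulate(a, b, c, N, t):
        accepted_a.append(a)

# The surviving samples of `a` approximate the posterior density for the
# first filter, conditioned on an empty-looking sky.
print(len(accepted_a), sum(accepted_a) / len(accepted_a))
```

This is only an illustration of the conditioning, not an answer to the question as posed: with N = one billion and T = one million the rejection rate would be astronomical, and a real analysis would need an analytic or importance-sampled treatment.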
Probability updating question - 99.9999% chance of tails, heads on first flip
This isn't intended as a full discussion, I'm just a little fuzzy on how a Bayesian update or any other kind of probability update would work in this situation.
You have a coin with a 99.9999% chance of coming up tails, and a 100% chance of coming up either tails or heads.
You've deduced these odds by studying the weight of the coin. You are 99% confident of your results. You have not yet flipped it.
You have no other information before flipping the coin.
You flip the coin once. It comes up heads.
How would you update your probability estimates?
(This isn't a homework assignment; rather, I was discussing with someone how strong the anthropic principle is. Unfortunately my mathematical abilities can't quite comprehend how to assemble this into any form I can work with.)
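Not being the author, here is one hedged way to set the problem up: treat "my weight analysis is right" and "my weight analysis is wrong" as two hypotheses, and, as an illustrative assumption the post does not make, let the coin be fair if the analysis is wrong:

```python
# Two-hypothesis Bayesian update sketch. The 50% heads rate under the
# "analysis is wrong" hypothesis is my assumption, not from the post.
p_right = 0.99           # prior confidence that the weight analysis is correct
p_wrong = 0.01
p_heads_if_right = 1e-6  # 99.9999% tails if the analysis is right
p_heads_if_wrong = 0.5   # assumed: a wrong analysis tells us nothing

p_heads = p_heads_if_right * p_right + p_heads_if_wrong * p_wrong
posterior_right = p_heads_if_right * p_right / p_heads

print(posterior_right)  # roughly 0.0002
```

Under these assumptions, a single heads almost entirely refutes the weight analysis, because heads is half a million times more likely if the analysis is wrong.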
A Problem with Human Intuition about Conventional Statistics:
As an aspiring scientist, I hold the Truth above all. As Hodgell once said, "That which can be destroyed by the truth should be." But what if the thing that is holding our pursuit of the Truth back is our own system? I will share an example of an argument I overheard between a theist and an atheist once - showing an instance where human intuition might fail us.
*General Transcript*
Atheist: Prove to me that God exists!
Theist: He obviously exists – can’t you see that plants growing, humans thinking, [insert laundry list here], is all His work?
Atheist: Those can easily be explained by evolutionary mechanisms!
Theist: Well prove to me that God doesn’t exist!
Atheist: I don’t have to! There may be an invisible pink unicorn baby flying around my head, but there probably is not. The fact that I can’t prove there is no unicorn doesn’t mean it exists!
Theist: That’s just complete reductio ad ridiculo, you could do infrared, polaroid, uv, vacuum scans, and if nothing appears it is statistically unlikely that the unicorn exists! But God is something metaphysical, you can’t do that with Him!
Atheist: Well Nietzsche killed metaphysics when he killed God. God is dead!
Theist: That is just words without argument. Can you actually…..
As one can see, the biggest problem is determining burden of proof. Statistically speaking, this is much like the problem of defining the null hypothesis.
A theist would define: H0 : God exists. Ha: God does not exist.
An atheist would define: H0: God does not exist. Ha: God does exist.
Both conclude that there is no significant evidence hinting at Ha over H0. Furthermore, and this is key, they both accept the null hypothesis. The correct statistical conclusion, when the evidence for the alternate hypothesis is insignificant, is that one fails to reject the null hypothesis. However, human intuition fails to grasp this concept; we think in black and white, and so we tend to accept the null hypothesis instead.
This is not so much a problem with statistics as it is with human intuition. Statistics usually take this form because simultaneous 100+ hypothesis considerations are taxing on the human brain. Therefore, we think of hypotheses to be defended or attacked, but not considered neutrally.
Consider a Bayesian outlook on this problem.
There are two possible outcomes: At least one deity exists(D). No deities exist(N).
Let us consider the natural evidence (Let’s call this E) before us.
P(D ∨ N) = 1. P(D ∨ N | E) = 1. P(D|E) + P(N|E) = 1. P(D|E) = 1 - P(N|E).
Although the calculation of the prior probability of God existing is rather strange, and seems to reek of bias, I still argue that this is a better system of analysis than just the classical H0 and Ha, because it effectively compares the probabilities of D and N with no bias inherent in the brain’s perception of the system.
Example such as these, I believe, show the flaws that result from faulty interpretations of the classical system. If instead we introduced a Bayesian perspective – the faulty interpretation would vanish.
This is a case for the expanded introduction of Bayesian probability theory. Even if it cannot be applied correctly to every problem, and even if it is apparently more complicated than the standard method taught in statistics class (I disagree here), it teaches people to analyze situations from a more objective perspective.
And if we can avoid Truth-seekers going awry due to simple biases such as those mentioned above, won’t we be that much closer to finding Truth?
Modeling sleep patterns
My sleep is unpredictable. Not in a technical sense, but a colloquial one. To be literal, I have no idea how to predict my sleep. I just as often sleep through the day as I do through the night. My sleep itself, as far as a sleep study can tell, is normal. I can vaguely say, 60% confidence, if I'm likely to fall asleep in a given 3-4 hour period, and occasionally I will be fairly sure, 80% confidence, 6-10 hours beforehand, of a 1-2 hour period. I can similarly predict the length of my sleep (which is relatively normal--generally distributed 7, 8, 9.5, 13 hours at .1, .4, .6, .9).
My sleep is seriously disturbed. Without understanding the process behind my sleep, without being able to predict it days beforehand and understand the variables behind it, I find it impossible to wake up at a consistent time every day (+/- 8 hours), despite years of trying, which makes it extremely hard to hold down a job, or do dozens of other normal things. There could be a profession that I could make my sleep work with, but I'm still searching for it.
So I ask you readers: Is there some sort of pattern detecting thing, whose name perhaps includes something like "markov" or "kolmogorov" or "bayesian", that could automatically take a time series data and predict the next values based on an unknown, complex model?
So, I could like enter the times I go to sleep and wake up, and when I have caffeine or I exercise, and maybe other things, and it would puzzle out how my sleep works and forecast my next few sleep cycles?
To have an accurate tool like that would transform my life.
"Hidden Markov models" comes to mind, but at first glance I don't see how a sleep model would count as a Markov process, given that you have to factor in sleep debt, time of day (because of sunlight), and perhaps other variables. But then I know nothing about HMMs.
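Before reaching for HMMs, a much simpler baseline (my own sketch, not a recommendation from the post) is to tally how often you were asleep during each hour of the day and use those frequencies as a crude forecast:

```python
from collections import defaultdict

# Crude hour-of-day sleep model: estimate P(asleep | hour) from a log of
# (hour, asleep?) observations. A real model would add sleep debt,
# caffeine, exercise, etc. as extra features.
def fit(log):
    asleep = defaultdict(int)
    total = defaultdict(int)
    for hour, was_asleep in log:
        total[hour] += 1
        asleep[hour] += was_asleep
    return {h: asleep[h] / total[h] for h in total}

# Hypothetical log: mostly asleep at 3 AM, mostly awake at 3 PM.
log = [(3, 1)] * 9 + [(3, 0)] * 1 + [(15, 1)] * 2 + [(15, 0)] * 8
model = fit(log)
print(model[3])   # -> 0.9
print(model[15])  # -> 0.2
```

If even this baseline carries no signal for your data, that itself is informative: it suggests the driving variables are not clock-time ones, which is the kind of thing a fancier model would need to know anyway.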
Also, this is my first post. Is this the sort of thing that goes better in LessWrong or Less Wrong Discussion?
Visualizing Bayesian Inference [link]
Galton Visualizing Bayesian Inference (article @ CHANCE)
Excerpt:
What does Bayes Theorem look like? I do not mean what does the formula P(A|B) = P(B|A)P(A)/P(B) look like; these days, every statistician knows that. I mean, how can we visualize the cognitive content of the theorem? What picture can we appeal to with the hope that any person curious about the theorem may look at it, and, after a bit of study say, “Why, that is clear - I can indeed see what is happening!”
Francis Galton could produce just such a picture; in fact, he built and operated a machine in 1877 that performs that calculation. But, despite having published the picture in Nature and the Proceedings of the Royal Institution of Great Britain, he never referred to it again—and no reader seems to have appreciated what it could accomplish until recently.
Schematics for the machine and its algorithm can be found at the link. This is a really cool design, and maybe it can aid Eliezer's and others' efforts to help people understand Bayes' Theorem.
[Draft] Holy Bayesian Multiverse, Batman!
I couldn't find the math for the quantum suicide and immortality thought experiment, so I'm placing it here for posterity. If one actually ran the experiment, Bayes' theorem would tell us how to update our belief in the multi-world interpretation (MWI) of quantum mechanics. I conclude by arguing that we don't need to run the experiment.
Prereqs: Understand the purpose of Bayes Theorem, possess at least rudimentary knowledge of the competing quantum worldviews, and have a nostalgic appreciation for Adam West.
The Fiendish Setup:
Suppose that, after catching Batman snooping in the shadows of his evil lair, Joker ties the caped crusader into a quantum, negative binomial death machine that, every ten seconds, measures the spin value of a fresh proton. Fifty percent of the time, the result will trigger a Bat-killing Rube Goldberg machine. The other 50 percent of the time, the quantum death machine will play a suspenseful stock sound effect and search for a new proton.
Starting point for calculating inferential distance?
One of the shiniest ideas I picked up from LW is inferential distance. I say "shiny" because the term, so far as I'm aware, has no clear mathematical or pragmatic definition and no substantive use in peer-reviewed science, but it was novel to me and appeared to make a lot of stuff about the world suddenly make sense. In my head it is marked as "super neat... but possibly a convenient falsehood". I ran across something yesterday that struck me as beautifully succinct and helpful toward resolving the epistemic status of the concept of "inferential distance".
Bayesian Doomsday Argument
First, if you don't already know it, Frequentist Doomsday Argument:
There's some number of total humans. There's a 95% chance that you come after the first 5% of them, which puts the total at less than 20 times the number who have lived so far. There's been about 60 to 120 billion people so far, so there's a 95% chance that the total will be less than 1.2 to 2.4 trillion.
I've modified it to be Bayesian.
First, find the priors:
Do you think it's possible that the total number of sentients that have ever lived or will ever live is less than a googolplex? I'm not asking if you're certain, or even if you think it's likely. Is it more likely than one in infinity? I think it is too. This means that the prior must be normalizable.
If we take P(T=n) ∝ 1/n, where T is the total number of people, it can't be normalized, as 1/1 + 1/2 + 1/3 + ... is an infinite sum. If it decreases faster, it can at least be normalized. As such, we can use 1/n as an upper limit.
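A quick numeric illustration (mine) of why 1/n fails to normalize while anything that falls off faster succeeds:

```python
# Partial sums: sum of 1/n diverges (it grows like log n), while
# sum of 1/n^2 converges (to pi^2/6 ≈ 1.6449).
harmonic = sum(1 / n for n in range(1, 10**6 + 1))
squares = sum(1 / n**2 for n in range(1, 10**6 + 1))
print(harmonic)  # ~14.39: still climbing, with no finite limit
print(squares)   # ~1.6449: essentially converged
```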
Of course, that's just the limit of the upper tail, so maybe that's not a very good argument. Here's another one:
We're not so much dealing with lives as life-years. A year is a pretty arbitrary measurement, so we'd expect the distribution to stay pretty much the same for the majority of its range if we used, say, days instead. Scale-invariance of that kind requires the 1/n distribution.
After that,
T = total number of people
U = the number you are (your birth rank)
P(T=n) ∝ 1/n
P(U=m|T=n) = 1/n (for m ≤ n)
P(T=n|U=m) = P(U=m|T=n) * P(T=n) / P(U=m) ∝ (1/n^2) / P(U=m)
P(T>n|U=m) = ∫_n^∞ (1/t^2) dt / P(U=m) = (1/n) / P(U=m)
And to normalize, use the fact that T can be no smaller than m:
P(T>m|U=m) = 1 = (1/m) / P(U=m)
so 1/P(U=m) = m, and therefore
P(T>n|U=m) = m/n
So, the probability of there being a total of 1 trillion people total if there's been 100 billion so far is 1/10.
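This closed form is easy to sanity-check numerically; here is my sketch, using a truncated discrete version of the 1/n^2 posterior:

```python
# Discrete check of P(T>n | U=m) = m/n: the posterior over T is
# proportional to 1/t^2 for t >= m. Truncate the infinite sum far out
# and compare the tail mass beyond n to the whole.
m, n = 100, 1000    # you are person number m; ask for P(total > n)
CUTOFF = 10**7      # truncation point for the infinite sum

total = sum(1 / t**2 for t in range(m, CUTOFF))
tail = sum(1 / t**2 for t in range(n + 1, CUTOFF))
print(tail / total)  # close to m/n = 0.1
```

The discrete sum lands within a percent of the continuous answer; the small gap comes from approximating a sum by an integral, not from the argument itself.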
There's still a few issues with this. It assumes P(U=m|T=n) ∝ 1/n. This seems like it makes sense. If there's a million people, there's a one-in-a-million chance of being the 268,547th. But if there's also a trillion sentient animals, the chance of being the nth person won't change that much between a million and a billion people. There's a few ways I can amend this.
First: a = number of sentient animals. P(U=m|T=n) ∝ 1/(a+n). This would make the end result P(T>n|U=m) = (m+a)/(n+a).
Second: Just replace every mention of people with sentients.
Third: Take this as a prediction of the number of sentients who aren't humans who have lived so far.
The first would work well if we can find the number of sentient animals without knowing how many humans there will be. Assuming we don't take the time to terraform every planet we come across, this should work okay.
The second would work well if we did terraform every planet we came across.
The third seems a bit weird. It gives a smaller answer than the other two. It gives a smaller answer than what you'd expect for animals alone. It does this because it combines it with a Doomsday Argument against animals being sentient. You can work that out separately. Just say T is the total number of humans, and U is the total number of animals. Unfortunately, you have to know the total number of humans to work out how many animals are sentient, and vice versa. As such, the combined argument may be more useful. It won't tell you how many of the denizens of the planets we colonise will be animals, but I don't think it's actually possible to tell that.
One more thing, you have more information. You have a lifetime of evidence, some of which can be used in these predictions. The lifetime of humanity isn't obvious. We might make it to the heat death of the universe, or we might just kill each other off in a nuclear or biological war in a few decades. We also might be annihilated by a paperclipper somewhere in between. As such, I don't think the evidence that way is very strong.
The evidence for animals is stronger. Emotions aren't exclusive to intelligence, and it doesn't seem animals would have to be that intelligent to be sentient. Even so, how sure can you really be? This is much more subjective than the doomsday part, and the evidence against their sentience is staggering. I think so anyway; how many animals are there at different levels of intelligence?
Also, there's the priors for total human population so far. I've read estimates vary between 60 and 120 billion. I don't think a factor of two really matters too much for this discussion.
So, what can we use for these priors?
Another issue is that this is for all of space and time, not just Earth.
Consider that you're the mth person (or sentient) from the lineage of a given planet. l(m) is the number of planets with a lineage of at least m people. N is the total number of people ever, n is the number on the average planet, and p is the number of planets.
l(m)/N
=l(m)/(n*p)
=(l(m)/p)/n
l(m)/p is the portion of planets that made it this far. This increases with n, so this weakens my argument, but only to a limited extent. I'm not sure what that extent is, though. Instinct says that l(m)/p is 50% when m=n, but the mean is not the median. I'd expect a left skew, which would make l(m)/p much lower than that. Even so, if you placed it at 0.01%, this would mean that it's a thousand times less likely at that value. The argument still takes the estimate down orders of magnitude from what you'd otherwise think, so that correction is not really that significant.
Also, a back-of-the-envelope calculation:
Assume, against all odds, there are a trillion times as many sentient animals as humans, and we happen to be the humans. Also, assume humans only increase their own numbers, and they're at the top percentile for the populations you'd expect. Also, assume 100 billion humans so far.
n = 1,000,000,000,000 * 100,000,000,000 * 100
n = 10^12 * 10^11 * 10^2
n = 10^25
Here's more what I'd expect:
Humanity eventually puts up a satellite to collect solar energy. Once they do one, they might as well do another, until they have a Dyson swarm. Assume 1% efficiency. Also, assume humans still use their whole bodies instead of being a brain in a vat. Finally, assume they get fed with 0.1% efficiency. And assume an 80-year lifetime.
n = solar luminosity * 1% / power of a human * 0.1% * lifetime of Sun / lifetime of human
n = 4 * 10^26 Watts * 0.01 / 100 Watts * 0.001 * 5,000,000,000 years / 80 years
n = 2.5 * 10^27
By the way, the value I used for power of a human is after the inefficiencies of digesting.
Even with assumptions that extreme, we couldn't use this planet to its full potential. Granted, that requires mining pretty much the whole planet, but with a Dyson sphere you can do that in a week, or two years with the efficiency I gave.
It actually works out to about 150 tons of Earth per person. How much do you need to get the elements to make a person?
Incidentally, I rewrote the article, so don't be surprised if some of the comments don't make sense.