Priors as Mathematical Objects

Eliezer Yudkowsky

What exactly is a "prior", as a mathematical object? Suppose you're looking at an urn filled with red and white balls. When you draw the very first ball, you haven't yet had a chance to gather much evidence, so you start out with a rather vague and fuzzy expectation of what might happen - you might say "fifty/fifty, even odds" for the chance of getting a red or white ball. But you're ready to revise that estimate for future balls as soon as you've drawn a few samples. So then this initial probability estimate, 0.5, is not repeat not a "prior".

An introduction to Bayes's Rule for confused students might refer to the population frequency of breast cancer as the "prior probability of breast cancer", and the revised probability after a mammography as the "posterior probability". But in the scriptures of Deep Bayesianism, such as Probability Theory: The Logic of Science, one finds a quite different concept - that of prior information, which includes e.g. our beliefs about the sensitivity and specificity of mammography exams. Our belief about the population frequency of breast cancer is only one small element of our prior information.

In my earlier post on inductive bias, I discussed three possible beliefs we might have about an urn of red and white balls, which will be sampled without replacement:

Case 1: The urn contains 5 red balls and 5 white balls;
Case 2: A random number was generated between 0 and 1, and each ball was selected to be red (or white) at this probability;
Case 3: A monkey threw balls into the urn, each with a 50% chance of being red or white.

In each case, if you ask me - before I draw any balls - to estimate my marginal probability that the fourth ball drawn will be red, I will respond "50%". And yet, once I begin observing balls drawn from the urn, I reason from the evidence in three different ways:

Case 1: Each red ball drawn makes it less likely that future balls will be red, because I believe there are fewer red balls left in the urn.
Case 2: Each red ball drawn makes it more plausible that future balls will be red, because I will reason that the random number was probably higher, and that the urn is hence more likely to contain mostly red balls.
Case 3: Observing a red or white ball has no effect on my future estimates, because each ball was independently selected to be red or white at a fixed, known probability.

Suppose I write a Python program to reproduce my reasoning in each of these scenarios. The program will take in a record of balls observed so far, and output an estimate of the probability that the next ball drawn will be red. It turns out that the only necessary information is the count of red balls seen and white balls seen, which we will respectively call R and W. So each program accepts inputs R and W, and outputs the probability that the next ball drawn is red:

Case 1: return (5 - R)/(10 - R - W) # Number of red balls remaining / total balls remaining
Case 2: return (R + 1)/(R + W + 2) # Laplace's Law of Succession
Case 3: return 0.5
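As a minimal sketch, the three one-liners above can be written out as runnable Python. The function names case1, case2 and case3 are my own labels, and exact fractions are used instead of floats so the outputs match the arithmetic in the text. Note that case1 only makes sense while fewer than ten balls have been drawn.

```python
from fractions import Fraction

def case1(R, W):
    """Urn known to hold 5 red and 5 white balls, sampled without replacement.
    Only meaningful while R <= 5, W <= 5 and R + W < 10."""
    return Fraction(5 - R, 10 - R - W)  # red balls remaining / total balls remaining

def case2(R, W):
    """Laplace's Law of Succession."""
    return Fraction(R + 1, R + W + 2)

def case3(R, W):
    """Each ball was independently red with a fixed, known probability of 1/2."""
    return Fraction(1, 2)

# Before any draws all three agree; after one red ball they diverge.
for name, program in [("case 1", case1), ("case 2", case2), ("case 3", case3)]:
    print(name, program(0, 0), program(1, 0))
```

All three programs return 1/2 for (R, W) = (0, 0), which is exactly why the bare number 0.5 cannot tell them apart.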

These programs are correct so far as they go. But unfortunately, probability theory does not operate on Python programs. Probability theory is an algebra of uncertainty, a calculus of credibility, and Python programs are not allowed in the formulas. It is like trying to add 3 to a toaster oven.

To use these programs in the probability calculus, we must figure out how to convert a Python program into a more convenient mathematical object - say, a probability distribution.

Suppose I want to know the combined probability that the sequence observed will be RWWRR, according to program 2 above. Program 2 does not have a direct faculty for returning the joint or combined probability of a sequence, but it is easy to extract anyway. First, I ask what probability program 2 assigns to observing R, given that no balls have been observed. Program 2 replies "1/2". Then I ask the probability that the next ball is R, given that one red ball has been observed; program 2 replies "2/3". The second ball is actually white, so the joint probability so far is 1/2 * 1/3 = 1/6. Next I ask for the probability that the third ball is red, given that the previous observation is RW; this is summarized as "one red and one white ball", and the answer is 1/2. The third ball is white, so the joint probability for RWW is 1/12. For the fourth ball, given the previous observation RWW, the probability of redness is 2/5, and the joint probability goes to 1/30. We can write this as p(RWWR|RWW) = 2/5, which means that if the sequence so far is RWW, the probability assigned by program 2 to the sequence continuing with R and forming RWWR equals 2/5. And then p(RWWRR|RWWR) = 1/2, and the combined probability is 1/60.
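The chain of multiplications above can be sketched in a few lines. The helper name joint_probability is hypothetical, and the sketch assumes the predictor only needs the counts R and W, as in the programs above.

```python
from fractions import Fraction

def laplace_red(R, W):
    """Program 2: probability that the next ball is red."""
    return Fraction(R + 1, R + W + 2)

def joint_probability(predict_red, sequence):
    """Multiply the conditional probabilities of each ball given the balls before it."""
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        p_red = predict_red(R, W)
        if ball == "R":
            joint *= p_red
            R += 1
        else:
            joint *= 1 - p_red
            W += 1
    return joint

# Running joint probability for each prefix of RWWRR.
for end in range(1, 6):
    prefix = "RWWRR"[:end]
    print(prefix, joint_probability(laplace_red, prefix))
```

The printed values should follow the same steps as the paragraph: 1/2, 1/6, 1/12, 1/30, 1/60.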

We can do this with every possible sequence of ten balls, and end up with a table of 1024 entries. This table of 1024 entries constitutes a probability distribution over sequences of observations of length 10, and it says everything the Python program had to say (about 10 or fewer observations, anyway). Suppose I have only this probability table, and I want to know the probability that the third ball is red, given that the first two balls drawn were white. I need only sum over the probability of all entries beginning with WWR, and divide by the probability of all entries beginning with WW.
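Here is one way the table and the lookup might be sketched, using program 2 as the example. The names build_table and prefix_probability are illustrative, not anything from the original post.

```python
from fractions import Fraction
from itertools import product

def laplace_red(R, W):
    return Fraction(R + 1, R + W + 2)

def joint_probability(predict_red, sequence):
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        p_red = predict_red(R, W)
        if ball == "R":
            joint *= p_red
            R += 1
        else:
            joint *= 1 - p_red
            W += 1
    return joint

def build_table(predict_red, n=10):
    """Map every length-n sequence of 'R'/'W' to its joint probability."""
    return {"".join(seq): joint_probability(predict_red, seq)
            for seq in product("RW", repeat=n)}

def prefix_probability(table, prefix):
    """Sum the probability of all table entries beginning with the prefix."""
    return sum(p for seq, p in table.items() if seq.startswith(prefix))

table = build_table(laplace_red)
third_red_given_ww = prefix_probability(table, "WWR") / prefix_probability(table, "WW")
print(len(table), sum(table.values()), third_red_given_ww)
```

The conditional probability recovered from the table agrees with what program 2 returns directly for R = 0, W = 2, namely 1/4.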

We have thus transformed a program that computes the probability of future events given past experiences, into a probability distribution over sequences of observations.

You wouldn't want to do this in real life, because the Python program is ever so much more compact than a table with 1024 entries. The point is not that we can turn an efficient and compact computer program into a bigger and less efficient giant lookup table; the point is that we can view an inductive learner as a mathematical object, a distribution over sequences, which readily fits into standard probability calculus. We can take a computer program that reasons from experience and think about it using probability theory.

Why might this be convenient? Say that I'm not sure which of these three scenarios best describes the urn - I think it's about equally likely that each of the three cases holds true. How should I reason from my actual observations of the urn? If you think about the problem from the perspective of constructing a computer program that imitates my inferences, it looks complicated - we have to juggle the relative probabilities of each hypothesis, and also the probabilities within each hypothesis. If you think about it from the perspective of probability theory, the obvious thing to do is to add up all three distributions with weightings of 1/3 apiece, yielding a new distribution (which is in fact correct). Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.
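A rough sketch of that last step, assuming equal weights of 1/3 and at most nine balls observed so far: the mixture's probability for a sequence is the weighted sum of the three joint probabilities, and the prediction for the next ball is a ratio of two such sums. The names joint_probability and mixture_predict_red are hypothetical.

```python
from fractions import Fraction

PROGRAMS = [
    lambda R, W: Fraction(5 - R, 10 - R - W),  # case 1: 5 red, 5 white
    lambda R, W: Fraction(R + 1, R + W + 2),   # case 2: Laplace's rule
    lambda R, W: Fraction(1, 2),               # case 3: fixed 50%
]

def joint_probability(predict_red, sequence):
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        if joint == 0:  # sequence already impossible under this program
            break
        p_red = predict_red(R, W)
        joint *= p_red if ball == "R" else 1 - p_red
        if ball == "R":
            R += 1
        else:
            W += 1
    return joint

def mixture_probability(sequence):
    """Probability of the sequence under the equal-weight mixture."""
    weight = Fraction(1, len(PROGRAMS))
    return sum(weight * joint_probability(p, sequence) for p in PROGRAMS)

def mixture_predict_red(observed):
    """P(next ball is red | observed) under the mixture; observed is a string like 'RWW'."""
    return mixture_probability(observed + "R") / mixture_probability(observed)

print(mixture_predict_red(""), mixture_predict_red("R"))
```

After a single red ball the three hypotheses are still equally weighted (each predicted it with probability 1/2), so the mixture's answer is just the average of 4/9, 2/3 and 1/2.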

So that is what a prior really is - a mathematical object that represents all of your starting information plus the way you learn from experience.

I'm confused when you say that the prior represents all your starting information plus the way you learn from experience. Isn't the way you learn from experience fixed, in this framework? Given that you are using Bayesian methods, so that the idea of a prior is well defined, then doesn't that already tell how you will learn from experience?

Hal, with a poor prior, "Bayesian updating" can lead to learning in the wrong direction or to no learning at all. Bayesian updating guarantees a certain kind of consistency, but not correctness. (If you have five city maps that agree with each other, they might still disagree with the city.) You might think of Bayesian updating as a kind of lower level of organization - like a computer chip that runs programs, or the laws of physics that run the computer chip - underneath the activity of learning. If you start with a maxentropy prior that assigns equal probability to every sequence of observations, and carry out strict Bayesian updating, you'll still never learn anything; your marginal probabilities will never change as a result of the Bayesian updates. Conversely, if you somehow had a good prior but no Bayesian engine to update it, you would stay frozen in time and no learning would take place. To learn you need a good prior and an updating engine. Taking a picture requires a camera, light - and also time.
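To illustrate the maxentropy case, here is a small sketch that assigns equal probability to every sequence of six balls (the length 6 is an arbitrary choice of mine) and then conditions on various observations in the strict Bayesian way. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

N = 6  # sequence length, chosen arbitrarily for the illustration
uniform_table = {"".join(seq): Fraction(1, 2 ** N)
                 for seq in product("RW", repeat=N)}

def prefix_probability(table, prefix):
    return sum(p for seq, p in table.items() if seq.startswith(prefix))

def predict_red(table, observed):
    """Strict Bayesian conditioning: P(observed + 'R') / P(observed)."""
    return prefix_probability(table, observed + "R") / prefix_probability(table, observed)

for observed in ["", "R", "RRR", "RRRRR", "WRWRW"]:
    print(repr(observed), predict_red(uniform_table, observed))
```

Every line prints 1/2: the updating engine is running, but this prior gives it nothing to learn with.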

This probably deserves its own post.

Another thing I don't fully understand is the process of "updating" a prior. I've seen different flavors of Bayesian reasoning described. In some, we start with a prior, get some information and update the probabilities. This new probability distribution now serves as our prior for interpreting the next incoming piece of information, which then causes us to further update the prior. In other interpretations, the priors never change; they are always considered the initial probability distribution. We then use those prior probabilities plus our sequence of observations since then to make new interpretations and predictions. I gather that these can be considered mathematically identical, but do you think one or the other is a more useful or helpful way to think of it?

In this example, you start off with uncertainty about which process put in the balls, so we give 1/3 probability to each. But then as we observe balls coming out, we can update this prior. Once we see 6 red balls for example, we can completely eliminate Case 1 which put in 5 red and 5 white. We can think of our prior as our information about the ball-filling process plus the current state of the urn, and this can be updated after each ball is drawn.
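As a sketch of that update, assuming the 1/3 weights from the post and no more than ten observed balls, the posterior weight of each case is its prior weight times the probability it assigned to the observed sequence, renormalized. The names below are illustrative only.

```python
from fractions import Fraction

PROGRAMS = {
    "case 1": lambda R, W: Fraction(5 - R, 10 - R - W),
    "case 2": lambda R, W: Fraction(R + 1, R + W + 2),
    "case 3": lambda R, W: Fraction(1, 2),
}

def joint_probability(predict_red, sequence):
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        if joint == 0:  # already ruled out; stop before the formula leaves its valid range
            break
        p_red = predict_red(R, W)
        joint *= p_red if ball == "R" else 1 - p_red
        if ball == "R":
            R += 1
        else:
            W += 1
    return joint

def posterior_over_cases(observed):
    """Posterior probability of each case given the observed sequence."""
    prior_weight = Fraction(1, len(PROGRAMS))
    weighted = {name: prior_weight * joint_probability(program, observed)
                for name, program in PROGRAMS.items()}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}

print(posterior_over_cases("RRRRRR"))
```

With six reds in a row, case 1 drops to exactly zero and the remaining weight is split between cases 2 and 3 in proportion to 1/7 and 1/64.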

Hal,

You are being a bad boy. In his earlier discussion Eliezer made it clear that he did not approve of this terminology of "updating priors." One has posterior probability distributions. The prior is what one starts with. However, Eliezer has also been a bit confusing with his occasional use of such language as a "prior learning." I repeat, agents learn, not priors, although in his view of the post-human computerized future, maybe it will be computerized priors that do the learning.

The only way one is going to get "wrong learning" at least somewhat asymptotically is if the dimensionality is high and the support is disconnected. Eliezer is right that if one starts off with a prior that is far enough off, one might well have "wrong learning," at least for a while. But, unless the conditions I just listed hold, eventually the learning will move in the right direction and head towards the correct answer, or probability distribution; at least that is what Bayes' Theorem asserts.

OTOH, the reference to "deep Bayesianism" raises another issue, that of fundamental subjectivism. There is this deep divide among Bayesians between the ones that are ultimately classical frequentists but who argue that Bayesian methods are a superior way of getting to the true objective distribution, and the deep subjectivist Bayesians. For the latter, there are no ultimately "true" probability distributions. We are always estimating something derived out of our subjective priors as updated by more recent information, wherever those priors came from.

Also, saying a prior should be the known probability distribution, say of cancer victims, assumes that this probability is somehow known. The prior is always subject to how much information the assumer of a prior has when they begin their process of estimation.

Eliezer may not approve of it, but almost all of the literature uses the phrase "updating a prior" to mean exactly the type of sequential learning from evidence that Eliezer discusses. I prefer to think of it as 'updating a prior'. Bayes' theorem tells you that data is an operator on the space of probability distributions, converting prior information into posterior information. I think it's helpful to think of that process as 'updating' so that my prior actually changes to something new before the next piece of information comes my way.

Eliezer,

Just to be clear . . . going back to your first paragraph, that 0.5 is a prior probability for the outcome of one draw from the urn (that is, for the random variable that equals 1 if the ball is red and 0 if the ball is white). But, as you point out, 0.5 is not a prior probability for the series of ten draws. What you're calling a "prior" would typically be called a "model" by statisticians. Bayesians traditionally divide a model into likelihood, prior, and hyperprior, but as you implicitly point out, the dividing line between these is not clear: ultimately, they're all part of the big model.

Barkley, I think you may be regarding likelihood distributions as fixed properties held in common by all agents, whereas I am regarding them as variables folded into the prior - if you have a probability distribution over sequences of observables, it implicitly includes beliefs about parameters and likelihoods. Where agents disagree about prior likelihood functions, not just prior parameter probabilities, their beliefs may trivially fail to converge.

Andrew's point may be particularly relevant here - it may indeed be that statisticians call what I am talking about a "model". (Although in some cases, like the Laplace's Law of Succession inductor, I think they might call it a "model class"?) Jaynes, however, would have called it our "prior information" and he would have written "the probability of A, given that we observe B" as p(A|B,I) where I stands for all our prior beliefs including parameter distributions and likelihood distributions. While we may often want to discriminate between different models and model classes, it makes no sense to talk about discriminating between "prior informations" - your prior information is everything you start out with.

Eliezer, I am very interested in the Bayesian approach to reasoning you've outlined on this site, it's one of the more elegant ideas I've ever run into.

I am a bit confused, though, about to what extent you are using math directly when assessing truth claims. If I asked you for example "what probability do you assign to the proposition 'global warming is anthropogenic' ?" (say), would you tell me a number?

Or is this mostly about conceptually understanding that P(effect|~cause) needs to be taken into account?

If it's a number, what's your heuristic for getting there (i.e., deciding on a prior probability & all the other probabilities)?

If there's a post that goes into that much detail, I haven't seen it yet, though your explanations of Bayes theorem generally are brilliant.

My reason for writing this is not to correct Eliezer. Rather, I want to expand on his distinction between prior information and prior probability. Pages 87-89 of Probability Theory: the Logic of Science by E. T. Jaynes (2004 reprint with corrections, ISBN 0 521 59271 2) are dense with important definitions and principles. The quotes below are from there, unless otherwise indicated.

Jaynes writes the fundamental law of inference as

  P(H|DX) = P(H|X) P(D|HX) / P(D|X)         (4.3)

Which the reader may be more used to seeing as

 P(H|D) = P(H) P(D|H) / P(D)

Where

 H = some hypothesis to be tested
 D = the data under immediate consideration
 X = all other information known

X is the misleadingly-named ‘prior information’, which represents all the information available other than the specific data D that we are considering at the moment. “This includes, at the very least, all its past experiences, from the time it left the factory to the time it received its current problem.” --Jaynes p.87, referring to a hypothetical problem-solving robot. It seems to me that in practice, X ends up being a representation of a subset of all prior experience, attempting to discard only what is irrelevant to the problem. In real human practice, that representation may be wrong and may need to be corrected.

“ ... to our robot, there is no such thing as an ‘absolute’ probability; all probabilities are necessarily conditional on X at the least.” “Any probability P(A|X) which is conditional on X alone is called a prior probability. But we caution that ‘prior’ ... does not necessarily mean ‘earlier in time’ ... the distinction is purely a logical one; any information beyond the immediate data D of the current problem is by definition ‘prior information’.”

“Indeed, the separation of the totality of the evidence into two components called ‘data’ and ‘prior information’ is an arbitrary choice made by us, only for our convenience in organizing a chain of inferences.” Please note his use of the word ‘evidence’.

Sampling theory, which is the basis of many treatments of probability, “ ... did not need to take any particular note of the prior information X, because all probabilities were conditional on H, and so we could suppose implicitly that the general verbal prior information defining the problem was included in H. This is the habit of notation that we have slipped into, which has obscured the unified nature of all inference.”

“From the start, it has seemed clear how one determines numerical values of sampling probabilities¹ [e.g. P(D|H) ], but not what determines prior probabilities [AKA ‘priors’ e.g. P(H|X)]. In the present work we shall see that this is only an artifact of the unsymmetrical way of formulating problems, which left them ill-posed. One could see clearly how to assign sampling probabilities because the hypothesis H was stated very specifically; had the prior information X been specified equally well, it would have been equally clear how to assign prior probabilities.”

Jaynes never gives up on that X notation (though the letter may differ); he never drops it for convenience.

“When we look at these problems on a sufficiently fundamental level and realize how careful one must be to specify prior information before we have a well-posed problem, it becomes clear that ... exactly the same principles are needed to assign either sampling probabilities or prior probabilities ...” That is, P(H|X) should be calculated. Keep your copy of Kendall and Stuart handy.

I think priors should not be cheaply set from an opinion, whim, or wish. “ ... it would be a big mistake to think of X as standing for some hidden major premise, or some universally valid proposition about Nature.”

The prior information has impact beyond setting prior probabilities (priors). It informs the formulation of the hypotheses, of the model, and of “alternative hypotheses” that come to mind when the data seem to be showing something really strange. For example, data that seems to strongly support psychokinesis may cause a skeptic to bring up a hypothesis of fraud, whereas a career psychic researcher may not do so. (see Jaynes pp.122-125)

I say, be alert for misinformation, biases, and wishful thinking in your X. Discard everything that is not evidence.

I’m pretty sure the free version of Probability Theory: The Logic of Science is offline. You can preview the book here: http://books.google.com/books?id=tTN4HuUNXjgC&printsec=frontcover&dq=Probability+Theory:+The+Logic+of+Science&cd=1#v=onepage&q&f=false .

Also see the Unofficial Errata and Commentary for E. T. Jaynes’s Probability Theory: The Logic of Science

SEE ALSO

FOOTNOTES

¹ There are massive compendiums of methods for sampling distributions, such as:
- Feller (An Introduction to Probability Theory and its Applications, Vol 1, J. Wiley & Sons, New York, 3rd edn 1968 and Vol 2, J. Wiley & Sons, New York, 2nd edn 1971)
- Kendall and Stuart (The Advanced Theory of Statistics: Volume 1, Distribution Theory, McMillan, New York 1977)

Be familiar with what is in them.

Edited 05/05/2010 to put in the actual references.

Edited 05/19/2010 to put in SEE ALSO

Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.

Can someone please provide a hint how?

Here's some Python code to calculate a prior distribution from a rule for assigning probability to the next observation.

A "rule" is represented as a function that takes as a first argument the next observation (like "R") and as a second argument all previous observations (a string like "RRWR"). I included some example rules at the end.

EDIT: oh man, what happened to my line spacing? my indents? jeez.

EDIT2: here's a dropbox link: https://www.dropbox.com/s/16n01acrauf8h7g/prior_producer.py

from functools import reduce

def prod(sequence):
    '''Product equivalent of python's "sum"; the empty product is 1.'''
    return reduce(lambda a, b: a*b, sequence, 1)

def sequence_prob(rule, sequence):
    '''Probability of a sequence like "RRWR" using the given rule for
    computing the probability of the next observation.

    To put it another way: computes the joint probability mass function.'''
    return prod([rule(sequence[i], sequence[:i]) \
                 for i in range(len(sequence))])

def number2sequence(number, length):
    '''Convert a number like 5 into a sequence like WWRWR.

    The sequence corresponds to the binary digit representation of the 
    number: 5 --> 00101 --> WWRWR

    This is convenient for listing all sequences of a given length.'''
    binary_representation = bin(number)[2:]
    seq_end = binary_representation.replace('1', 'R').replace('0', 'W')

    if len(seq_end) > length:
        raise ValueError('no sequence of length {} with number {}'
                         .format(length, number))

    # Now add W's to the beginning to make it the right length -
    # like adding 0's to the beginning of a binary number
    return 'W' * (length - len(seq_end)) + seq_end

def prior(rule, n):
    '''Generate a joint probability distribution from the given rule over
    all sequences of length n. Doesn't feed the rule any background
    knowledge, so it's a prior distribution.'''
    sequences = [number2sequence(i, n) for i in range(2**n)]
    return [(seq, sequence_prob(rule, seq)) for seq in sequences]

And here are some examples of functions that can be used as the "rule" arguments.

def laplaces_rule(next, past):
    R = past.count('R')
    W = past.count('W')
    if R + W != len(past):
        raise ValueError('knowledge is not just of red and white balls')
    red_prob = (R + 1)/(R + W + 2)
    if next == 'R':
        return red_prob
    elif next == 'W':
        return 1 - red_prob
    else:
        raise ValueError('can only predict whether next will be red or white')


def antilaplaces_rule(next, past):
    return 1 - laplaces_rule(next, past)

So just to be clear. There are two things, the prior probability, which is the value P(H|I), and the background information which is 'I'. So P(H|D,I_1) is different from P(H|D,I_2) because they are updates using the same data and the same hypothesis, but with different partial background information; they are both, however, posterior probabilities. And the prior P(H|I_1) may be equal to P(H|I_2) even if I_1 and I_2 are radically different and produce updates in opposite directions given the same data. P(H|I) is still called the prior probability, but it is something very different from the background information, which is essentially just I.

Is this right? Let me be more specific.

Let's say my prior information is case1, then P( second ball is R| first ball is R & case1) = 4/9

If my prior information was case2, then P( second ball is R| first ball is R & case2) = 2/3 [by the rule of succession]

and P( first ball is R| case1) = 50% = P( first ball is R|case2)

This is why different prior information can make you learn in different directions, even if two prior informations produce the same prior probability?

Please let me know if I am making any sort of mistake. Or if I got it right, either way.

No really, I really want help. Please help me understand if I am confused, and settle my anxiety if I am not confused.

You got it right. The three different cases correspond to different joint distributions over sequences of outcomes. Prior information that one of the cases obtains amounts to picking one of these distributions (of course, one can also have weighted combinations of these distributions if there is uncertainty about which case obtains). It turns out that in this example, if you add together the probabilities of all the sequences that have a red ball in the second position, you will get 0.5 for each of the three distributions. So equal prior probabilities. But even though the terms sum to 0.5 in all three cases, the individual terms will not be the same. For instance, prior information of case 1 would assign a different probability to RRRRR (1/252, about 0.004) than prior information of case 2 (1/6, about 0.167) or case 3 (1/32, about 0.031).

So the prior information is a joint distribution over sequences of outcomes, while the prior probability of the hypothesis is (in this example at least) a marginal distribution calculated from this joint distribution. Since multiple joint distributions can give you the same marginal distribution for some random variable, different prior information can correspond to the same prior probability.

When you restrict attention to those sequences that have a red ball in the first position, and now add together the (appropriately renormalized) joint probabilities of sequences with a red ball in the second position, you don't get the same number with all three distributions. This corresponds to the fact that the three distributions are associated with different learning rules.
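The marginal and conditional sums described here can be checked with a short sketch. It builds the joint distribution over sequences of five balls for each case (the length 5 and all helper names are my own choices) and then sums over the relevant entries.

```python
from fractions import Fraction
from itertools import product

PROGRAMS = {
    "case 1": lambda R, W: Fraction(5 - R, 10 - R - W),
    "case 2": lambda R, W: Fraction(R + 1, R + W + 2),
    "case 3": lambda R, W: Fraction(1, 2),
}

def joint_probability(predict_red, sequence):
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        p_red = predict_red(R, W)
        joint *= p_red if ball == "R" else 1 - p_red
        if ball == "R":
            R += 1
        else:
            W += 1
    return joint

def build_table(predict_red, n=5):
    return {"".join(seq): joint_probability(predict_red, seq)
            for seq in product("RW", repeat=n)}

def probability(table, condition):
    """Sum the joint probabilities of all sequences satisfying the condition."""
    return sum(p for seq, p in table.items() if condition(seq))

def second_red_marginal(table):
    return probability(table, lambda s: s[1] == "R")

def second_red_given_first_red(table):
    return (probability(table, lambda s: s.startswith("RR"))
            / probability(table, lambda s: s[0] == "R"))

for name, program in PROGRAMS.items():
    table = build_table(program)
    print(name, second_red_marginal(table), second_red_given_first_red(table))
```

The marginal column is 1/2 for every case, while the conditional column differs: 4/9, 2/3 and 1/2.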

One can update one's beliefs about one's existing beliefs and the ways in which one learns from experience too – click.

Under standard assumptions about the drawing process, you only need 10 numbers, not 1024: P(the urn initially contained ten white balls), P(the urn initially contained nine white balls and one red one), P(the urn initially contained eight white balls and two red ones), and so on through P(one white ball and nine red ones). (P(ten red balls) equals 1 minus everything else.) P(RWRWWRWRWW) is then P(4R, 6W) divided by the appropriate binomial coefficient.
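A sketch of this compression, assuming the distribution is exchangeable (the probability of a sequence depends only on how many reds it contains) and using Laplace's rule as the example. The function names are hypothetical; math.comb needs Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def laplace_red(R, W):
    return Fraction(R + 1, R + W + 2)

def joint_probability(predict_red, sequence):
    joint = Fraction(1)
    R = W = 0
    for ball in sequence:
        p_red = predict_red(R, W)
        joint *= p_red if ball == "R" else 1 - p_red
        if ball == "R":
            R += 1
        else:
            W += 1
    return joint

def count_distribution(predict_red, n=10):
    """P(k reds among n balls) for k = 0..n, assuming exchangeability."""
    return [comb(n, k) * joint_probability(predict_red, "R" * k + "W" * (n - k))
            for k in range(n + 1)]

def sequence_probability_from_counts(counts, sequence):
    """Recover a sequence's probability from the count distribution alone."""
    k = sequence.count("R")
    return counts[k] / comb(len(sequence), k)

counts = count_distribution(laplace_red)
print(counts)
print(sequence_probability_from_counts(counts, "RWRWWRWRWW"))
```

The list has eleven entries, but they sum to 1, so only ten are free. For Laplace's rule each count comes out equally likely at 1/11.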

So then this initial probability estimate, 0.5, is not repeat not a "prior".

This really confuses me. Considering the Universe in your example, which consists only of the urn with the balls, wouldn't one of the prior hypotheses (e.g. case 2) be a prior and have all the necessary information to compute the lookup table?

In other words aren't the three following equivalent in the urn-with-balls universe?

Hypothesis 2 + Bayesian updating
Python program 2
The lookup table generated from program 2 + Procedure for calculating conditional probability (e.g. if you want to know the probability that the third ball is red, given that the first two balls drawn were white.)

Unless I am misunderstanding you, yes, that's precisely the point.

I don't understand why you are confused, though. None of these are, after all, numbers in (0,1), which would not contain any information as to how you would go about doing your updates given more evidence.

So then this initial probability estimate, 0.5, is not repeat not a "prior".

1:1 odds seems like it would be a default null prior, especially because one round of Bayes' Rule updates it immediately to whatever your first likelihood ratio is, kind of like the other mathematical identities. If your priors represent "all the information you already know", then it seems like you (or someone) must have gotten there through a series of Bayesian inferences, but that series would have to start somewhere, right? If (in the real universe, not the ball & urn universe) priors aren't determined by some chain of Bayesian inference, but instead by some degree of educated guesses / intuition / dead reckoning, wouldn't that make the whole process subject to a "garbage in, garbage out" fallacy(?).

For a use case: A, low internal resolution rounded my posterior probability to 0 or 1, and now new evidence is not updating my estimations anymore, or B, I think some garbage crawled into my priors, but I'm not sure where. In either case, I want to take my observations, and rebuild my chain of inferences from the ground up, to figure out where I should be. So... where is the ground? If 1:1 odds is not the null prior, not the Bayesian Identity, then what is?

