You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

The Sleeping Beauty problem and transformation invariances

1 aspera 23 August 2015 08:57PM

I recently read this blog post by Allen Downey in response to a reddit post in response to Julia Galef's video about the Sleeping Beauty problem. Downey's resolution boils down to a conjecture that optimal bets on lotteries should be based on one's expected state of prior information just before the bet's resolution, as opposed to one's state of prior information at the time the bet is made.

I suspect that these two distributions are always identical. In fact, I think I remember reading in one of Jaynes' papers about a requirement that any prior be invariant under the acquisition of new information. That is to say, the prior should be the weighted average of possible posteriors, where the weights are the likelihood that each posterior would be acheived after some measurement. But now I can't find this reference anywhere, and I'm starting to doubt that I understood it correctly when I read it.

So I have two questions:

1) Is there such a thing as this invariance requirement? Does anyone have a reference? It seems intuitive that the prior should be equivalent to the weighted average of posteriors, since it must contain all of our prior knowledge about a system. What is this property actually called?

2) If it exists, is it a corollary that our prior distribution must remain unchanged unless we acquire new information?

Error margins

5 [deleted] 21 March 2015 11:05AM

Why don't probabilities come with error margins, or other means of describing uncertainty in their assessments?

If I evaluate a prior probability P(new glacial period starting within the next 100 years) to, say, 0.1, shouldn't I then also communicate how certain I feel about that judgement?
A scientist might make the same estimate but be more sure about it's accuracy than I.

In our everyday judgements we often use such package deals:
A: where's Jamie?
B: I think he went to the club house, but you know Jamie - he could be anywhere.

High P, high uncertainty

A: Where's Susie? Do you think she ran astray after that hefty argument?
B: no I'm certain she would *never* do that. She must have gone to a friends place.

High P, low uncertainty.

To like, or not to like?

2 PhilGoetz 14 November 2013 02:26AM

Do you like Shakespeare?

I've been reading the Paris Review interviews with famous authors of the 20th century. Famous authors don't always like other famous authors. Hemingway, Faulkner, Joyce, Fitzgerald — for all of them, you could find some famous author who found them unreadable. (Especially Joyce and Faulkner.)

Except Shakespeare. Everyone loved Shakespeare. In fact, those who mentioned Shakespeare sometimes said he was the best author who has ever lived.

How likely is this?

continue reading »

Freakonomics Study Investigates Decision-Making and Estimated Prior Probabilities

3 telms 30 July 2013 04:08AM

The Freakonomics web site is currently conducting online research that appears, to this properly hypothesis-blinded participant, to be investigating decision-making and estimated prior probability of success. You can participate yourself at  http://www.freakonomics.com/experiments/.

The study asks the participant to choose a yes/no decision that they would be willing to commit to making on the basis of a random coin toss. (Well, actually, the random decay of an atomic nucleus, but they use coin flip graphics.) In my case, the only decision I was willing to make on such a random basis is something with very low risks: namely, the decision whether or not to quit twisting my hair. I accepted the obligation to change my behavior based on a coin toss, and the coin toss says I gotta change.

Breaking a habit of such long standing will be difficult. Past behavior is the best predictor of future behavior, and all that, so when they asked how LIKELY I thought it would be that my hair-twisting habit would stick despite my best efforts to get rid of it, I estimated 90%. Yet I also claimed that I WILL PROBABLY (not certainly, but probably) conquer the habit.

Yes, I recognize the dissonance between these two statements. It intrigues me. Is it perhaps the intent of the experiment to create explicit, conscious, cognitive dissonance like this in some participants, and see what difference it makes to outcomes?

They could easily have phrased the odds question in the inverse form. They COULD have asked how likely I thought it was that I would SUCCEED in achieving my goal. That would align neatly with my statement of commitment and yield no dissonance. I could make the usual biased assumptions that strength of willpower is the same as odds of success, and over-estimate those success odds accordingly.

I don't actually know that the study cares about this, but this is what I would care about if I were the researchers.

The Freakonomics people will be following up over time by email. They're also checking on me through a friend, so there is every possibility that they expect to see an interaction between social involvement in the decision's outcome and the presence of cognitive dissonance, which is believed to drive SOCIAL behavior more strongly than it drives personal decisions kept to oneself.

I'm posting this to increase my social commitment, of course. I also posted on Facebook. It's terrible to have a psychologically trained participant make assumptions about your research project and leverage those assumptions to the max for imaginary ends. But that's life in social science. :)

The Joys of Conjugate Priors

41 TCB 21 May 2011 02:41AM

(Warning: this post is a bit technical.)

Suppose you are a Bayesian reasoning agent.  While going about your daily activities, you observe an event of type .  Because you're a good Bayesian, you have some internal parameter  which represents your belief that  will occur.

Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive.  Your observation of  is a datapoint, and thus you'll want to modify .  But how much should this datapoint influence ?  Well, that will depend on how sure you are of  in the first place.  If you calculated  based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of  shouldn't have much impact.  But if your estimate of  is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of .

Of course, when you reestimate , you'll also have to reestimate how confident you are in its value.  Or, to put it a different way, you'll want to compute a new probability distribution over possible values of .  This new distribution will be , and it can be computed using Bayes' rule:



Here, since  is a parameter used to specify the distribution from which  is drawn, it can be assumed that computing  is straightforward.   is your old distribution over , which you already have; it says how accurate you think different settings of the parameters are, and allows you to compute your confidence in any given value of .  So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.

But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions.  If this is your goal, then once you've computed the distribution , you can pick the value of  that maximizes it.  This will be your new parameter, and because you have the formula , you'll know exactly how confident you are in this parameter.

In practice, picking the value of  which maximizes  is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems.  For simple enough distributions, you can use the EM algorithm, which is guarranteed to converge to a local optimum.  But for more complicated distributions, even this method is intractable, and approximate algorithms must be used.  Because of this concern, it's important to keep the distributions  and  simple.  Choosing the distribution  is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.

It is assumed that the type of model is chosen before deciding on the form of the distribution .  So how do you choose a good distribution for ?  Notice that every time you see a new datapoint, you'll have to do the computation in the equation above.  Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together.  If these distributions are chosen poorly,  could get quite messy very quickly.

If you're a smart Bayesian agent, then, you'll pick  to be a conjugate prior to the distribution .  The distribution  is conjugate to  if multiplying these two distributions together and normalizing results in another distribution of the same form as .

Let's consider a concrete example: flipping a biased coin.  Suppose you use the bernoulli distribution to model your coin.  Then it has a parameter  which represents the probability of gettings heads.  Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails.  Then the distribution of the outcome  of the coin flip looks like this:



It turns out that the conjugate prior for the bernoulli distribution is something called the beta distribution.  It has two parameters,  and , which we call hyperparameters because they are parameters for a distribution over our parameters.  (Eek!)

The beta distribution looks like this:



Since  represents the probability of getting heads, it can take on any value between 0 and 1, and thus this function is normalized properly.

Suppose you observe a single coin flip  and want to update your beliefs regarding .  Since the denominator of the beta function in the equation above is just a normalizing constant, you can ignore it for the moment while computing , as long as you promise to normalize after completing the computation:



Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the bernoulli distribution.  Super cool, right?

If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for  looks remarkably similar to the non-factorial part of the binomial distribution.  This suggests a form for the normalization constant:



The beta and binomial distributions are almost identical.  The biggest difference between them is that the beta distribution is a function of , with  and  as prespecified parameters, while the binomial distribution is a function of , with  and  as prespecified parameters.  It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.

Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials.  Recall that the gamma function is just a generalization of the factorial to the reals; thus, the beta distribution allows  and  to be any positive real number, while the binomial distribution is only defined for integers.  As a final note on the beta distribution, the -1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s.  For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf.  It gives an actual derivation which shows that the first equation for  is equivalent to the second equation for , which is nice if you don't find the argument by analogy to the binomial distribution convincing.

So, what is the philosophical significance of the conjugate prior?  Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to?  No; there is deep philosophical significance to the form of the beta distribution.

Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically.  If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly.  This intuition is captured by the form of the conjugate prior.   and  can be viewed as keeping track of how many heads and tails you've seen, respectively.  So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior.  The beta distribution captures the difference between claiming that the coin has 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.

Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be.  Then you can choose values for  and  that represent your prior understanding of the coin.  Higher values of  indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation.   and  will act like "imaginary data"; when you update your distribution over  after observing a coin flip , it will be like you already saw  heads and  tails before that coin flip.
 
If you want to express that you have no prior knowledge about the system, you can do so by setting  and to 1.  This will turn the beta distribution into a uniform distribution.  You can also use the beta distribution to do add-N smoothing, by setting  and  to both be N+1.  Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting  to noise in the actual data.

In conclusion, the beta distribution, which is a conjugate prior to the bernoulli and binomial distributions, is super awesome.  It makes it possible to do Bayesian reasoning in a computationally efficient manner, as well as having the philosophically satisfying interpretation of representing real or imaginary prior data.  Other conjugate priors, such as the dirichlet prior for the multinomial distribution, are similarly cool.