You Be the Jury: Survey on a Current Event
As many of you probably know, in an Italian court early last weekend, two young students, Amanda Knox and Raffaele Sollecito, were convicted of killing another young student, Meredith Kercher, in a horrific way in November of 2007. (A third person, Rudy Guede, was convicted earlier.)
If you aren't familiar with the case, don't go reading about it just yet. Hang on for just a moment.
If you are familiar, that's fine too. This post is addressed to readers of all levels of acquaintance with the story.
What everyone should know right away is that the verdict has been extremely controversial. Strong feelings have emerged, even involving national tensions (Knox is American, Sollecito Italian, and Kercher British, and the crime and trial took place in Italy). The circumstances of the crime involve sex. In short, the potential for serious rationality failures in coming to an opinion on a case like this is enormous.
Now, as it happens, I myself have an opinion. A rather strong one, in fact. Strong enough that I caught myself thinking that this case -- given all the controversy surrounding it -- might serve as a decent litmus test in judging the rationality skills of other people. Like religion, or evolution -- except less clichéd (and cached) and more down-and-dirty.
Of course, thoughts like that can be dangerous, as I quickly recognized. The danger of in-group affective spirals looms large. So before writing up that Less Wrong post adding my-opinion-on-the-guilt-or-innocence-of-Amanda-Knox-and-Raffaele-Sollecito to the List of Things Every Rational Person Must Believe, I decided it might be useful to find out what conclusion(s) other aspiring rationalists would (or have) come to (without knowing my opinion).
So that's what this post is: a survey/experiment, with fairly specific yet flexible instructions (which differ slightly depending on how much you know about the case already).
Calibration for continuous quantities
Related to: Calibration fail, Test Your Calibration!
Around here, calibration is mostly approached on a discrete basis: for example, the Technical Explanation of Technical Explanations talks only about discrete distributions, and the commonly linked tests and surveys are either explicitly discrete or offer only coarsely binned probability assessments. For continuous distributions (or "smooth" distributions over discrete quantities like dates of historical events, dollar amounts on the order of hundreds of thousands, populations of countries, or any actual measurement of a continuous quantity), we can apply a finer-grained assessment of calibration.
The difficulty in assessing calibration for continuous quantities is that our distributions can have very dissimilar shapes, so there doesn't seem to be a common basis for comparing one to another. As an example, I'll give some subjective (i.e., pulled from my nether regions) distributions for the populations of two countries, Canada and Botswana. I live in Canada, so I have years of dimly remembered geography classes in elementary school and high school to inform my guess. In the case of Botswana, I have only my impressions of the nation from Alexander McCall Smith's excellent No. 1 Ladies' Detective Agency series and my general knowledge of Africa.
For Canada's population, I'll set my distribution to be a normal distribution centered at 32 million with a standard deviation of 2 million. For Botswana's population, my initial gut feeling is that it is a nation of about 2 million people. I'll put 50% of my probability mass between 1 and 2 million, and the other 50% of my probability mass between 2 million and 10 million. Because I think that values closer to 2 million are more plausible than values at the extremes, I'll make each chunk of 50% mass a right-angle triangular distribution. Here are plots of the probability densities:
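The two densities just described can also be written down explicitly. Here is a minimal Python sketch (my own illustration, not part of the original post; populations are in millions, so the densities are per million people). The numerical integration is only a sanity check that each density integrates to 1 and that Botswana's mass splits 50/50 at 2 million:

```python
import math

def canada_density(x):
    """Normal density with mean 32 and sd 2 (populations in millions)."""
    return math.exp(-((x - 32.0) ** 2) / (2 * 2.0 ** 2)) / (2.0 * math.sqrt(2 * math.pi))

def botswana_density(x):
    """Two right-angle triangles: 50% of the mass on [1, 2], 50% on [2, 10].
    Each triangle has area 0.5, so their peak heights are 1.0 and 0.125."""
    if 1.0 <= x <= 2.0:
        return (x - 1.0) * 1.0            # rises linearly to height 1.0 at x = 2
    if 2.0 < x <= 10.0:
        return (10.0 - x) * 0.125 / 8.0   # falls linearly from 0.125 at x = 2 to 0
    return 0.0

def integrate(f, lo, hi, n=10000):
    """Trapezoidal rule -- crude, but fine for a sanity check."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Each density should integrate to ~1, with Botswana's mass split 50/50.
print(round(integrate(canada_density, 20, 44), 3))
print(round(integrate(botswana_density, 1, 2), 3))
print(round(integrate(botswana_density, 2, 10), 3))
```

Note the Botswana density is deliberately discontinuous at 2 million: each 50% chunk is its own right-angle triangle, so the density jumps from 1.0 down to 0.125 there.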
Test Your Calibration!
In my journeys across the land, I have, to date, encountered four sets of probability calibration tests. (If you just want to make bets on your predictions, you can use Intrade or another prediction market, but these generally don't record calibration data, only which of your bets paid out.) If anyone knows of other tests, please do mention them in the comments, and I'll add them to this post. To avoid spoilers, please do not post what you guessed for the calibration questions, or what the answers are.
The first, to boast shamelessly, is my own, at http://www.acceleratingfuture.com/tom/?p=129. My tests use fairly standard trivia questions (samples: "George Washington actually fathered how many children?", "Who was Woody Allen's first wife?", "What was Paul Revere's occupation?"), with an emphasis on history and pop culture. The quizzes are scored automatically, and you choose whether to assign a probability of 96%, 90%, 75%, 50%, or 25% to your answer. There are five quizzes with fifty questions each: Quiz #1, Quiz #2, Quiz #3, Quiz #4 and Quiz #5.
PredictionBook.com - Track your calibration
Our hosts at Tricycle Developments have created PredictionBook.com, which lets you make predictions and then track your calibration - see whether things to which you assigned a 70% probability happen 7 times out of 10.
The major challenges with a tool like this are (a) coming up with good short-term predictions to track, and (b) maintaining your will to keep on tracking yourself even if the results are discouraging, as they probably will be.
I think the main motivation to actually use it would be rationalists challenging each other to put a prediction on the record and track the results - I'm going to try to remember to do this the next time Michael Vassar says "X%" and I assign a different probability. (Vassar would have won quite a few points for his superior predictions of Singularity Summit 2009 attendance - I was pessimistic, Vassar was accurate.)
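The bookkeeping behind such a tool is simple. Here is a rough Python sketch of the kind of calibration check it performs (my own illustration, not PredictionBook's actual code; the track record below is made up):

```python
from collections import defaultdict

def calibration_report(predictions):
    """predictions: list of (claimed_probability, came_true) pairs.
    Groups predictions by claimed probability and compares each claimed
    probability with the observed frequency in that group."""
    buckets = defaultdict(list)
    for prob, outcome in predictions:
        buckets[prob].append(outcome)
    report = {}
    for prob, outcomes in sorted(buckets.items()):
        observed = sum(outcomes) / len(outcomes)
        report[prob] = (len(outcomes), observed)
    return report

# Hypothetical track record: well calibrated at 70%, overconfident at 90%.
record = [(0.7, True)] * 7 + [(0.7, False)] * 3 + \
         [(0.9, True)] * 6 + [(0.9, False)] * 4
for prob, (n, observed) in calibration_report(record).items():
    print(f"claimed {prob:.0%}: {n} predictions, {observed:.0%} came true")
```

A well-calibrated record would show the observed frequency matching the claimed probability in every bucket; the gap in the 90% bucket here is the kind of discouraging result the previous paragraph warns about.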
Wits and Wagers
Wits and Wagers is apparently a board game in which players compete to be well-calibrated with respect to their trivia knowledge. I haven't played it.
Has anyone else here played it? If so, what was your experience? Would it be good rationalist/Bayesian training?
Are calibration and rational decisions mutually exclusive? (Part two)
In my previous post, I alluded to a result that could potentially convince a frequentist to favor Bayesian posterior distributions over confidence intervals. It’s called the complete class theorem, due to a statistician named Abraham Wald. Wald developed the structure of frequentist decision theory and characterized the class of decision rules that have a certain optimality property.
Frequentist decision theory reduces the decision process to its basic constituents, i.e., data, actions, true states, and incurred losses. It connects them using mathematical functions that characterize their dependencies, i.e., the true state determines the probability distribution of the data, the decision rule maps data to a particular action, and the chosen action and true states together determine the incurred loss. To evaluate potential decision rules, frequentist decision theory uses the risk function, which is defined as the expected loss of a decision rule with respect to the data distribution. The risk function therefore maps (decision rule, true state)-pairs to the average loss under a hypothetical infinite replication of the decision problem.
Since the true state is not known, decision rules must be evaluated over all possible true states. A decision rule is said to be “dominated” if there is another decision rule whose risk is never worse for any possible true state and is better for at least one true state. A decision rule which is not dominated is deemed “admissible”. (This is the optimality property alluded to above.) The punch line is that under some weak conditions, the complete class of admissible decision rules is precisely the class of rules which minimize a Bayesian posterior expected loss.
(This result sparked interest in the Bayesian approach among statisticians in the 1950s. This interest eventually led to the axiomatic decision theory that characterizes rational agents as obeying certain fundamental constraints and proves that they act as if they had a prior distribution and a loss function.)
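A toy example (my own, not drawn from Wald) makes the dominance relation concrete: two possible true states, one binary observation, and 0-1 loss. All the particular numbers are illustrative:

```python
from itertools import product

# Hypothetical toy problem: one binary observation x, two possible true states.
p_x1 = {"theta0": 0.2, "theta1": 0.8}   # P(x = 1 | state)

def loss(action, state):
    """0-1 loss: action a0 is correct under theta0, a1 under theta1."""
    return 0.0 if action[-1] == state[-1] else 1.0

# A decision rule maps each possible observation (0 or 1) to an action.
rules = [{0: a0, 1: a1} for a0, a1 in product(["a0", "a1"], repeat=2)]

def risk(rule, state):
    """Expected loss over the data distribution, given the true state."""
    p1 = p_x1[state]
    return (1 - p1) * loss(rule[0], state) + p1 * loss(rule[1], state)

def dominated(rule, candidates):
    """A rule is dominated if some other rule is never worse and sometimes better."""
    states = ("theta0", "theta1")
    for other in candidates:
        if other == rule:
            continue
        if all(risk(other, s) <= risk(rule, s) for s in states) and \
           any(risk(other, s) < risk(rule, s) for s in states):
            return True
    return False

for rule in rules:
    risks = tuple(risk(rule, s) for s in ("theta0", "theta1"))
    print(rule, risks, "dominated" if dominated(rule, rules) else "admissible")
```

Only the "contrarian" rule (act against what the data suggests) is dominated, namely by the rule that follows the data. Note that even the two rules that ignore the data are admissible: each is the Bayes rule for a prior concentrated entirely on one state, in line with the complete class theorem.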
Taken together, the calibration results of the previous post and the complete class theorem suggest (to me, anyway) that irrespective of one's philosophical views on frequentism versus Bayesianism, perfect calibration is not possible in full generality for a rational decision-making agent.
Are calibration and rational decisions mutually exclusive? (Part one)
I'm planning a two-part sequence with the aim of throwing open the question in the title to the LW commentariat. In this part I’ll briefly go over the concept of calibration of probability distributions and point out a discrepancy between calibration and Bayesian updating.
It's a tenet of rationality that we should seek to be well-calibrated. That is, suppose that we are called on to give interval estimates for a large number of quantities; we give each interval an associated epistemic probability. We declare ourselves well-calibrated if the relative frequency with which the quantities fall within our specified intervals matches our claimed probability. (The Technical Explanation of Technical Explanations discusses calibration in more detail, although it mostly discusses discrete estimands, while here I'm thinking about continuous estimands.)
Frequentists also produce interval estimates, at least when "random" data is available. A frequentist "confidence interval" is really a function from the data and a user-specified confidence level (a number from 0 to 1) to an interval. The confidence interval procedure is "valid" if in a hypothetical infinite sequence of replications of the experiment, the relative frequency with which the realized intervals contain the estimand is equal to the confidence level. (Less strictly, we may require "greater than or equal" rather than "equal".) The similarity between valid confidence coverage and well-calibrated epistemic probability intervals is evident.
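Coverage validity is easy to check by simulation. Here is a sketch (my own, with illustrative numbers) for the textbook case of a 90% interval for a normal mean when the standard deviation is known:

```python
import math
import random

random.seed(0)

MU, SIGMA, N = 5.0, 2.0, 25    # true mean, known sd, sample size (all illustrative)
Z90 = 1.6449                   # standard normal quantile for a two-sided 90% interval
REPS = 20000

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    half_width = Z90 * SIGMA / math.sqrt(N)
    if xbar - half_width <= MU <= xbar + half_width:
        covered += 1

# The relative frequency of coverage should be close to the 0.90 confidence level.
print(f"coverage over {REPS} replications: {covered / REPS:.3f}")
```

The loop is exactly the "hypothetical infinite sequence of replications" made finite: the interval is recomputed from fresh data each time, and the fixed true mean either falls inside it or not.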
This similarity suggests an approach for specifying non-informative prior distributions, i.e., we require that such priors yield posterior intervals that are also valid confidence intervals in a frequentist sense. This "matching prior" program does not succeed in full generality. There are a few special cases of data distributions where a matching prior exists, but by and large, posterior intervals can at best produce only asymptotically valid confidence coverage. Furthermore, according to my understanding of the material, if your model of the data-generating process contains more than one scalar parameter, you have to pick one "interest parameter" and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone. For approximate matching priors with the highest order of accuracy, a different choice of interest parameter usually implies a different prior.
The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general. I have good justifications, I think, for using the Bayesian updating procedure, even if it means the resulting posterior intervals are not as well-calibrated as frequentist confidence intervals. (And I mean good confidence intervals, not the obviously pathological ones.) But my justifications are grounded in an epistemic view of probability, and no committed frequentist would find them as compelling as I do. However, there is an argument for Bayesian posteriors over confidence intervals that even a frequentist would have to credit. That will be the focus of the second part.
Replaying History
One of my favorite fiction genres is alternative history. The basic idea of alternative history is to write a story set in an alternate universe where history played out differently. Popular alternate histories include those where the Nazis win World War II, the USSR wins the Cold War, and the Confederate States of America win the American Civil War. But most of the writing in this genre has a serious flaw: the author starts out by saying "wouldn't it be cool to write a story where X had happened instead of Y" and then works backwards to concoct historical events that will lead to the desired outcome. No matter how good the story is, the history is often bad because at every stage the author went looking for a reason for things to go his way.
Unsatisfied with most alternate histories, I like to play a historical "what if" game. Rather than asking the question at the conclusion, though (like "what if the Nazis had won the war"), I ask it at an earlier moment, ideally one where chance played an important role. What if Napoleon had been convinced not to invade Russia? What if the Continental Army had not successfully retreated from New York? What if Viking settlements in Newfoundland had not collapsed? These are as opposed to "What if Napoleon had never been defeated?", "What if the Colonies had lost the American Revolutionary War?", and "What if Vikings had developed a thriving civilization in the Americas?". I find that replaying history in this way is a fun use of my analytical skills and, more importantly, a good test of my rationality.
One of the most difficult things about constructing an alternative history is staying focused on the facts and likely outcomes. It's easy to say "I'd really like to see a world where X happened" and then silently or overtly bias your thinking until you find a way to achieve the desired outcome. Learning to avoid this takes discipline, especially in a domain like alternate history where there's no way to check if your reasoning turned out to be correct. But unlike imagining the future, imagining an alternate history has the real course of events to measure up against, so it provides a good training ground for futurists who don't want to wait 20 or 30 years to get feedback on their thinking.
Given all this, I have two suggestions. First, a good way to teach history and rational thinking at the same time would be to present historical data up to a set point, ask students to reason out what they think happened next, and then reveal what actually happened, using the feedback to calibrate and improve their historical reasoning (which will hopefully provide some benefit in other domains). Second, a good way to build experience applying the skills of rationality is to publicly present and critique alternate histories.
In that vein, if there appears to be sufficient interest, I'll start posting a periodic article here dedicated to the discussion of a particular alternative history. The discussion will take place in the comments: people can propose outcomes, then others can revise, critique, and propose other outcomes, continuing the cycle until we either hit a brick wall (not enough information, a question that would not have changed history, etc.) or come to a consensus.
What do you all think of this idea?