Hrm, I'd have to say go with whichever is simpler (choose your favorite reasonable method of measuring the complexity of a hypothesis) for the usual reasons. (Fewer bits to describe it means less stuff that has to be "just so", etc etc... Of course, modify this a bit if one of the hypotheses has a significantly different prior than the other due to previously learned info, but...) But yeah, the less complex one that works is more likely to be closer to the underlying dynamic.
If you're handed the two hypotheses as black boxes, so that you can't a...
Is it cheating to say that it depends hugely on the content of the theories, and their prior probabilities?
The theories screen off the theorists, so if we knew the theories then we could (given enough cleverness) decide based on the theories themselves what our belief should be.
But before we even look at the theories, you ask me which theory I expect to be correct. I expect the one which was written earlier to be correct. This is not because it matters which theory came first, irrespective of their content; it is because I have different beliefs about what each of the two theories might look like.
The first theorist had less data to work with, and so had less da...
I rather like the 3rd answer on his blog (Doug D's). A slight elaboration on that -- one virtue of a scientific theory is its generality, and prediction is a better way of determining generality than explanation -- demanding predictive power from a theory excludes ad hoc theories of the sort Doug D mentioned, that do nothing more than re-state the data. This reasoning, note, does not require any math. :-)
The first guy has demonstrated prediction, the second only hindsight. We assume the first theory is right - but of course, we do the next experiment, and then we'll know.
Assuming both of them can produce values (are formulated in such a way that they can produce a new value with just the past values + the environment):
The second theory has the risk of being more descriptive than predictive. It has more potential of being fit to the input data, including all its noise, and of being a (maybe complex) enumeration of its values.
The first one has at least proven it could be used to predict, while the second one can only produce a new value.
I would thus give more credit to the first theory. At least it has won against ten coin flips without omniscience.
What exactly is meant here by 'believe'? I can imagine various interpretations.
a. Which do we believe to be 'a true capturing of an underlying reality'?
b. Which do we believe to be 'useful'?
c. Which do we prefer, which seems more plausible?
a. Neither. Real scientists don't believe in theories, they just test them. Engineers believe in theories :-)
b. Utility depends on what you're trying to do. If you're an economist, then a beautifully complicated post-hoc explanation of 20 experiments may get your next grant more easily than a si...
Both theories fit 20 data points. That some of those are predictions is irrelevant, except for the inferences about theory simplicity that result. Since likelihoods are the same, those priors are also the posteriors.
My state of belief is then represented by a certain probability that each theory is true. If forced to pick one out of the two, I would examine the penalties and payoffs of being correct and wrong, ala Pascal's wager.
"We do ten experiments. A scientist observes the results, constructs a theory consistent with them"
Huh? How did the scientist know what to observe without already having a theory? Theories arise as explanations for problems, explanations which yield predictions. When the first ten experiments were conducted, our scientist would therefore be testing predictions arising from an explanation to a problem. He wouldn't just be conducting any old set of experiments.
Similarly the second scientist's theory would be a different explanation of the problem situation, on...
One theory has a track record of prediction, and what is being asked for is a prediction, so at first glance I would choose that one. But the explanation based-one is built on more data.
But it is neither prediction nor explanation that makes things happen in the real world, but causality. So I would look in to the two theories and pick the one that looks to have identified a real cause instead of simply identifying a statistical pattern in the data.
Whichever is simpler - assuming we don't know anything about the scientists' abilities or track record.
Having two different scientists seems to pointlessly confound the example with extraneous variables.
I don't think the second theory is any less "predictive" than the first. It could have been proposed at the same time or before the first, but it wasn't. Why should the predictive ability of a theory vary depending on the point in time in which it was created? David Friedman seems to prefer the first because it demonstrates more ability on the part of the scientist who created it (i.e., he got it after only 10 tries).
Unless we are given any more information on the problem, I think I agree with David.
These theories are evidence about the true distribution of the data, so I construct a new theory based on them. I could then predict the next data point using my new theory, and if I have to play this game, go back and choose whichever of the original theories gives the same prediction, based only on the prediction for this particular next data point, independently of whether the selected theory as a whole is deemed better.
Having more data is strictly better. But I could expect that there is a good chance that a particular scientist will make an error (worse than me now, ...
Here's my answer, prior to reading any of the comments here, or on Friedman's blog, or Friedman's own commentary immediately following his statement of the puzzle. So, it may have already been given and/or shot down.
We should believe the first theory. My argument is this. I'll call the first theory T1 and the second theory T2. I'll also assume that both theories made their predictions with certainty. That is, T1 and T2 gave 100% probability to all the predictions that the story attributed to them.
First, it should be noted that the two theories should ...
(And, of course, the first theory could be improved using the next 10 data points by Bayes' rule, which will give a candidate for being the second theory. This new theory can even disagree with the first about which value of a particular data point is most likely.)
Knowing how the theories and experiments were chosen would make this a more sensible problem. Having that information would affect our expectations about the theories - as others have noted, there are a lot of theories one could form in an ad hoc manner, but the question is which of them was selected.
The first theory was selected using the first ten experiments and it seems to have survived the second set of experiments. If the second set of experiments was independent of the first set and of each other, this is quite unlikely, so this is strong evidence that the first theory is the c...
I would go with the first one in general. The first one has proved itself on some test data, while all the second one has done is to fit a model on given data. There is always the risk that the second theory has overfitted a model with no worthwhile generalization accuracy. Even if the second theory is simpler than the first the fact that the first theory has been proved right on unseen data makes it a slam dunk winner. Of course further experiments may cause us to update our beliefs, particularly if theory 2 is proving just as accurate.
There are an infinite number of models that can predict 10 variables, or 20 for that matter. The only plausible way for scientist A to have picked, out of the infinitely many possible models, one that kept predicting correctly is to bring prior knowledge to the table about the nature of that model and the data. This is also true for the second scientist, but only slightly less so.
Therefore, scientist A has demonstrated a higher probability of having valuable prior knowledge.
I don't think there is much more to this than that. If the two scientists have equal knowledge there is no reason the second m...
Tyrrell's argument seems to me to hit the nail on the head. (Although I would have liked to see that formalization -- it seems to me that while T1 will be preferred, the preference may be extremely slight, depending. No, I'm too lazy to do it myself :-))
Formalizing Vijay's answer here:
The short answer is that you should put more of your probability mass on T1's prediction because experts vary, and an expert's past performance is at least somewhat predictive of his future performance.
We need to assume that all else is symmetrical: you had equal priors over the results of the next experiment before you heard the scientists' theories; the scientists were of equal apparent caliber; P( the first twenty experimental results | T1 ) = P( the first twenty experimental results | T2); neither theorist influenced the...
Scientist 2's theory is more susceptible to over-fitting of the data; we have no reason to believe it's particularly generalizable. His theory could, in essence, simply be restating the known results and then giving a more or less random prediction for the next one. Let's make it 100,000 trials rather than 20 (and say that Scientist 1 has based his yet-to-be-falsified theory on the first 50,000 trials), and stipulate that Scientist 2 is a neural network -- then the answer seems clear.
I wrote in my last comment that "T2 is more likely to be flawed than is T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes's theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right."
Benja Fallenstein asked for a formalization of this claim. So here goes :).
Define a method to be a map that takes in a batch of evidence and returns a theory. We have two assumptions
ASSUMPTION 1: The theory produced b...
Throughout these replies there is a belief that theory 1 is 'correct through skill'. With that in mind it is hard to come to any other conclusion than 'scientist 1 is better'.
Without knowing more about the experiments, we can't determine if theory 1's 10 good predictions were simply 'good luck' or accident.
If your theory is that the next 10 humans you meet will have the same number of arms as they have legs, for example...
There's also potential for survivorship bias here. If the first scientist's results had been 5 correct, 5 wrong, we wouldn't be having t...
I'd use the only tool we have to sort theories: Occam's razor.
This is what many do by assuming the second is “over-fitted”; I believe a good scientist would search the literature before stating a theory, and so would know about the first one; as he would also appreciate elegance, I'd expect him to come up with a simpler theory — but, as you pointed out, some time in an economics lab could easily prove me wrong, although I'm assuming the daunting ...
Peter, your point that we have different beliefs about the theories prior to looking at them is helpful. AFAICT theories don't screen off theorists, though. My belief that the college baseball team will score at least one point in every game ("theory A"), including the next one ("experiment 21"), may reasonably be increased by a local baseball expert telling me so and by evidence about his expertise. This holds even if I independently know something about baseball.
As to the effect of "number of parameters" on the theories' ...
Tyrrell, right, thanks. :) Your formalization makes clear that P1/P2 = p(M(B1) predicts B2 | M flawed) / p(M(B1) predicts B2), which is a stronger result than I thought. Argh, I wish I were able to see this sort of thing immediately.
One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observation, whereas in Assumption 1, B ranges over all possible observations. :)
Anna, right, I think we need some sort of "other things being equal" proviso to Tyrrell's solution. If experiments 11..20 were chosen by scient...
"One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observation, whereas in Assumption 1, B ranges over all possible observations. :)"
Actually, I implicitly was thinking of the "B" variables as ranging over actual observations (past, present, and future) in both assumptions. But you're right: I definitely should have made that explicit.
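For reference, here is the Bayes step behind the P1/P2 expression quoted above, in my own shorthand (F = "the method that produced the theory is flawed", D = "the theory built from batch B1 correctly predicts batch B2"):

p(F | D) = p(F) * p(D | F) / p(D)

The first theory has been observed to satisfy D, so its probability of being flawed is p(F | D); the second theory fit B2 by construction, so on this argument its probability of being flawed stays at the prior p(F). Dividing gives P1/P2 = p(D | F) / p(D), which is below 1 exactly when flawed methods are less likely than methods in general to predict the second batch correctly.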
We know that the first researcher is able to successfully predict the results of experiments. We don't know that about the second researcher. Therefore I would bet on the first researcher's prediction (but only assuming other things being equal).
Then we'll do the experiment and know for sure.
Benja --
I disagree with Tyrrell (see below), but I can give a version of Tyrrell's "trivial" formalization:
We want to show that:
Averaging over all theories T, P(T makes correct predictions | T passes 10 tests) > P(T makes correct predictions)
By Bayes' rule,
P(T makes correct predictions | T passes 10 tests) = P(T makes correct predictions) * P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests)
So our conclusion is equivalent to:
Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) / P(T passes 10...
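A toy simulation of that averaging argument; every number below (the share of "sound" theories, the per-test pass rates) is invented purely to illustrate the direction of the inequality:

import random

random.seed(0)
N = 100000           # population of candidate theories (illustrative)
P_SOUND = 0.2        # assumed fraction whose prediction for experiment 21 is right
PASS_SOUND = 0.95    # chance a sound theory fits any one held-out result
PASS_FLAWED = 0.5    # chance a flawed theory fits one by luck

passed = sound_and_passed = sound_total = 0
for _ in range(N):
    sound = random.random() < P_SOUND
    p = PASS_SOUND if sound else PASS_FLAWED
    if sound:
        sound_total += 1
    if all(random.random() < p for _ in range(10)):   # "passes 10 tests"
        passed += 1
        sound_and_passed += sound

print("P(T makes correct predictions)                   ~", sound_total / N)
print("P(T makes correct predictions | passes 10 tests) ~", sound_and_passed / passed)

With these made-up numbers the conditional probability comes out close to 1 against a base rate of 0.2, which is just the likelihood-ratio point above in numerical form.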
Hi, Anna. I definitely agree with you that two equally-good theories could agree on the results of experiments 1--20 and then disagree about the results of experiment 21. But I don't think that they could both be best-possible theories, at least not if you fix a "good" criterion for evaluating theories with respect to given data.
What I was thinking when I claimed that in my original comment was the following:
Suppose that T1 says "result 21 will be X" and theory T2 says "result 21 will be Y".
Then I claim that there is another...
Among the many excellent, and some inspiring, contributions to OvercomingBias, this simple post, together with its comments, is by far the most impactful for me. It's scary in almost the same way as the way the general public approaches selection of their elected representatives and leaders.
Tyrrell, um. If "the ball will be visible" is a better theory, then "we will observe some experimental result" would be an even better theory?
Solomonoff induction, the induction method based on Kolmogorov complexity, requires the theory (program) to output the precise experimental results of all experiments so far, and in the future. So your T3 would not be a single program; rather, it would be a set of programs, each encoding specifically one experimental outcome consistent with "the ball is visible." (Which gets rid of the problem that "we will observe some experimental result" is the best possible theory :))
Here is my answer without looking at the comments or indeed even at the post linked to. I'm working solely from Eliezer's post.
Both theories are supported equally well by the results of the experiments, so the experiments have no bearing on which theory we should prefer. (We can see this by switching theory A with theory B: the experimental results will not change.) Applying bayescraft, then, we should prefer whichever theory was a priori more plausible. If we could actually look at the contents of the theory we could make a judgement straight from that...
I've seen too many cases of overfitting data to trust the second theory. Trust the validated one more.
The question would be more interesting if we said that the original theory accounted for only some of the new data.
If you know a lot about the space of possible theories and "possible" experimental outcomes, you could try to compute which theory to trust, using (surprise) Bayes' law. If it were the case that the first theory applied to only 9 of the 10 new cases, you might find parameters such that you should trust the new theory more.
In the gi...
Benja, I have never studied Solomonoff induction formally. God help me, but I've only read about it on the Internet. It definitely was what I was thinking of as a candidate for evaluating theories given evidence. But since I don't really know it in a rigorous way, it might not be suitable for what I wanted in that hand-wavy part of my argument.
However, I don't think I made quite so bad a mistake as highly-ranking the "we will observe some experimental result" theory. At least I didn't make that mistake in my own mind ;). What I actually wrot...
Upon first reading, I honestly thought this post was either a joke or a semantic trick (e.g., assuming the scientists were themselves perfect Bayesians which would require some "There are blue-eyed people" reasoning).
Because theories that can make accurate forecasts are a small fraction of theories that can make accurate hindcasts, the Bayesian weight has to be on the first guy.
In my mind, I see this visually as the first guy projecting a surface that contains the first 10 observations into the future and it intersecting with the actual future. ...
Both theories are equally good. Both are correct. There is no way to choose one, except to make another experiment and see which theory - if any (still might be both well or both broken) - will prevail.
That the first theory is right seems obvious and not the least bit counterintuitive. Therefore, based on what I know about the psychology of this blog, I predict that it is false and the second one is true.
We have two theories that explain all the available data - and this is Overcoming Bias - so how come only a tiny number of people have mentioned the possibility of using Occam's razor? Surely that must be part of any sensible response.
I don't think you've given enough information to make a reasonable choice. If the results of all 20 experiments are consistent with both theories but the second theory would not have been made without the data from the second set of experiments, then it stands to reason that the second theory makes more precise predictions.
If the theories are equally complex and the second makes more precise predictions, then it appears to be a better theory. If the second theory contains a bunch of ad hoc parameters to improve the fit, then it's likely a worse theory.
But ...
Hi Tyrrell,
Let T1_21 and T2_21 be the two theories' predictions for the twenty-first experiment.
As you note, if all else is equal, our prior beliefs about P(T1_21) and P(T2_21) -- the odds we would've accepted on bets before hearing T1's and T2's predictions -- are relevant to the probability we should assign after hearing T1's and T2's predictions. It takes more evidence to justify a high-precision or otherwise low-prior-probability prediction. (Of course, by the same token, high-precision and otherwise low-prior predictions are often more useful.)
The pr...
We believe the first (T1).
Why: correctly predicted outcomes update its probability of being correct (Bayes).
The additional information available to the second theory is redundant, since it was correctly predicted by T1.
A few thoughts.
I would like the one that:
0) Doesn't violate any useful rules of thumb, e.g. conservation of energy, or the rule against transmitting information faster than the speed of light in a vacuum.
1) Gives more precise predictions. Being consistent with a theory isn't hard if the theory gives a large range of uncertainty (e.g. if one theory is ...).
2) Doesn't have any infinities in its range.
If all these are equal, I would prefer them equally. Otherwise I would have to think that something was special about the time they were suggested, and be money pumped.
For exam...
As a machine-learning problem, it would be straightforward: The second learning algorithm (scientist) did it wrong. He's supposed to train on half the data and test on the other half. Instead he trained on all of it and skipped validation. We'd also be able to measure how relatively complex the theories were, but the problem statement doesn't give us that information.
As a human learning problem, it's foggier. The second guy could still have honestly validated his theory against the data, or not. And it's not straightforward to show that one human-rea...
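A rough sketch of that train/validate framing; the data-generating process (a noisy linear trend) and the deliberately over-flexible degree-9 fit are stand-ins I made up for the two scientists:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.05, size=20)   # hypothetical noisy process

# Scientist 1: fit on the first 10 results, then validate on the held-out 10.
theory1 = np.polyfit(x[:10], y[:10], deg=1)
holdout_error = np.mean((np.polyval(theory1, x[10:]) - y[10:]) ** 2)

# Scientist 2: fit a far more flexible model to all 20 results, with no validation.
theory2 = np.polyfit(x, y, deg=9)

x21 = 1.05   # the "21st experiment"
print("scientist 1 holdout error:", holdout_error)
print("predictions for experiment 21:", np.polyval(theory1, x21), np.polyval(theory2, x21))

Nothing forces the flexible fit to extrapolate badly; the point is only that scientist 1's procedure is the one that has produced a measured out-of-sample error at all.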
We should take into account the costs to a scientist of being wrong. Assume that the first scientist would pay a high price if the second ten data points didn't support his theory. In this case he would only propose the theory if he was confident it was correct. This confidence might come from his intuitive understanding of the theory and so wouldn't be captured by us if we just observed the 20 data points.
In contrast, if there will be no more data the second scientist knows his theory will never be proved wrong.
So reviewing the other comments now I see that I am essentially in agreement with M@ (on David's blog) who posted prior to Eli. Therefore, Eli disagrees with that. Count me curious.
Peter de Blanc got it right, IMHO. I don't agree with any of the answers that involve inference about the theorists themselves; they each did only one thing, so it is not the case that you can take one thing they did as evidence for the nature of some other thing they did.
Peter de Blanc is right: Theories screen off the theorists. It doesn't matter what data they had, or what process they used to come up with the theory. At the end of the data, you've got twenty data points, and two theories, and you can use your priors in the domain (along with things like Occam's Razor) to compute the likelihoods of the two theories.
But that's not the puzzle. The puzzle doesn't give us the two theories. Hence, strictly speaking, there is no correct answer.
That said, we can start guessing likelihoods for what answer we would come up wi...
The short answer is, "it depends." For all we can tell from the statement of the problem, the second "theory" could be "I prayed for divine revelation of the answers and got these 20." Or it could be special relativity in 1905. So I don't think this "puzzle" poses a real question.
Actually I'd like to take back my last comment. To the extent that predictions 11-20 and 21-30 are generated by different independent "parts" of the theory, then the quality of the former part is evidence about the quality of the latter part via the theorist's competence.
Of course you can make an inference about the evidenced skill of the scientists. Scientist 1 was capable of picking, out of the large set of models that covered the first 10 variables, one of the considerably smaller set of models that also covered the second 10. He did that by reference to principles and knowledge he brought to the table about the nature of inference and the problem domain. The second scientist has not shown any of this capability. I think our prior expectation for the skill of the scientists would be irrelevant, assuming that the prior was at leas...
Ceteris paribus, I'd choose the second theory since the process that generated it had strictly more information. Assume that the scientists would've generated the same theory given the same data, and the data in question are coin flips. The first scientist sees a random looking series of 10 coin flips with 5 heads and 5 tails and hypothesizes that they are generated by the random flips of a fair coin. We collect 10 more data points, and again we get 5 heads and 5 tails, the maximum likelihood result given the first theory. Now the second scientist sees the...
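To put a toy number on "strictly more information": with a Beta(1,1) prior over the coin's bias (my choice of prior, not part of the example), the posterior after 20 balanced flips has the same mean as after 10, just a narrower spread:

def beta_posterior(heads, tails, a=1.0, b=1.0):
    # Posterior over the coin's bias is Beta(a + heads, b + tails).
    a2, b2 = a + heads, b + tails
    mean = a2 / (a2 + b2)
    var = a2 * b2 / ((a2 + b2) ** 2 * (a2 + b2 + 1))
    return mean, var

print(beta_posterior(5, 5))    # after the first 10 flips: mean 0.5, variance ~0.019
print(beta_posterior(10, 10))  # after all 20 flips: mean 0.5, variance ~0.011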
Experience alone leads me to pick Theory #2. In what I do I'm constantly battling academic experts peddling Theory #1. Typically they have looked at say 10 epidemiological studies and concluded that the theory "A causes B" is consistent with the data and thus true. A thousand lawsuits against the maker of "A" are then launched on behalf of those who suffer from "B".
Eventually, and almost invariably with admittedly a few notable exceptions, the molecular people then come along and more convincingly theorize that "C causes...
I wrote:
To the extent that predictions 11-20 and 21 are generated by different independent "parts" of the theory, the quality of the former part is evidence about the quality of the latter part via the theorist's competence.
...however, this is much less true of cases like Newton or GR where you can't change a small part of the theory without changing all the predictions, than it is of cases like "evolution theory is true and by the way general relativity is also true", which is really two theories, or cases like "Newton is true on ...
"relative to the early theory that was put forward" should have read "relative to a random early theory, given that it was consistent with the evidence".
Let's suppose, purely for the sake of argument of course, that the scientists are superrational.
The first scientist chose the most probable theory given the 10 experiments. If the predictions are 100% certain then it will still be the most probable after 10 more successful experiments. So, since the second scientist chose a different theory, there is uncertainty and the other theory assigned an even higher probability to these outcomes.
In reality people are bad at assessing priors (hindsight bias), leading to overfitting. But these scientists are assumed t...
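In symbols, the first half of that argument (under its assumptions that the first scientist picked the maximum-posterior theory after 10 experiments and gave certain predictions for 11-20): for any theory T, p(T | all 20 results) is proportional to p(T | first 10) * p(results 11-20 | T). The second factor is at most 1, and it equals 1 for the first theory, so the first theory keeps the top spot -- unless, as the comment concludes, its predictions were not actually certain and the second theory assigned those outcomes a higher probability.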
The first theorist had multiple theories to choose from that would have been consistent with the first 10 data points - some of them better than others. Later evidence indicates that he chose well, that he apparently has some kind of skill in choosing good theories. No such evidence is available regarding the skill of the second theorist.
My approach: (using Bayes' Theorem explicitly)
A: first theory
B: second theory
D: data accumulated between the 10th and 20th trials
We're interested in the ratio P(A|D)/P(B|D).
By Bayes' Theorem:
P(A|D) = P(D|A)P(A)/P(D)
P(B|D) = P(D|B)P(B)/P(D)
Then
P(A|D)/P(B|D) = P(D|A)P(A)/(P(D|B)P(B)).
If each theory predicts the data observed with equal likelihood (that is, under neither theory is the data more likely to be observed), then P(D|A) = P(D|B) so we can simplify,
P(A|D)/P(B|D) = P(A)/P(B) >> 1
given that presumably theory A was a much more plausible pri...
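A toy instantiation of that ratio, with every number invented for illustration:

# Assumed priors and likelihoods, not given by the puzzle.
p_A, p_B = 0.8, 0.2                # prior probabilities of the two theories
p_D_given_A = p_D_given_B = 1.0    # both fit trials 11-20 equally well
posterior_ratio = (p_D_given_A * p_A) / (p_D_given_B * p_B)
print(posterior_ratio)             # 4.0: the data leave the prior ratio untouched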
If you're handed the two hypotheses as black boxes, so that you can't actually see inside them and work out which is more complex, then go with the first one.
...unless you are attending a magic show. Fortunately, it is not common for scientists to be asked to choose between hypotheses without even knowing what they are.
Suppose the scientists S_10 and S_20 are fitting curves f(i) to noisy observations y(i) at points i = 0...20. Suppose there are two families of models, a polynomial g(i;a) and a trigonometric h(i;ω,φ):
g(i) <- sum(a[k]*i^k, k=0..infinity)
h(i) <- cos(ω*i + φ)
The angular frequency ω is predetermined. The phase φ is random:
φ ~ Flat(), equivalently φ ~ Uniform(0, 2*π)
The coefficients a[k] are independently normally distributed with moments matched to the marginal moments of the coefficients in the Taylor expansion of h(i):
a[k] ~ Normal(mean=0, stdde...
Two points I'd like to comment on.
Re: The second scientist had more information
I don't think this is relevant if-- as I understood from the description-- the first scientist's theory predicted experiments 11..20 with high accuracy. In this scenario, I don't think the first scientist should have learned anything that would make them reject their previous view. This seems like an important point. (I think I understood this from Tyrrell's comment.)
Re: Theories screen off theorists
I agree-- we should pick the simpler theory-- if we're able to judge them for sim...
The way science is currently done, experimental data that the formulator of the hypothesis did not know about is much stronger evidence for a hypothesis than experimental data he did know about.
A hypothesis formulated by a perfect Bayesian reasoner would not have that property, but hypotheses from human scientists do, and I know of no cost-effective way to stop human scientists from generating the effect. Part of the reason human scientists do it is because the originator of a hypothesis is too optimistic about the hypothesis (and this optimism stems in ...
Whoever (E or Friedman) chose the title, "Prediction vs. Explanation", was probably thinking along the same lines.
If I were to travel to the North Pole and live there through the months of January and February with no prior knowledge of the area, then I would almost certainly believe (one could even say Theorize) that it is constantly night time at the North Pole. I could move back to The United States, and may never know that my theory is wrong. If I had, however, stayed through March and maybe into April, I would then know that the Sun does eventually rise. From this extra information, I could postulate a new theory that would likely be more correct.
"The Sun ri...
Remember the first comment way back in the thread? Psy-Kosh? I'm pretty much with him.
We assume that both hypotheses are equally precise - that they have equally pointed likelihood functions in the vicinity of the data so far.
If you know what's inside the boxes, and it's directly comparable via Occam's Razor, then Occam's Razor should probably take over.
The main caveat on this point is that counting symbols in an equation doesn't always get you the true prior probability of something, and the scientist's ability to predict the next ten symbols from the f...
The question is whether the likelihood that the 21st experiment will validate the best theory constructed from 20 data points and invalidate the best theory constructed from 10 data points, when that theory also fits the other ten, is greater than the likelihood scientist B is just being dumb.
The likelihood of the former is very hard to calculate, but it's definitely less than 1/11, in other words, over 91% of the time the first theory will still be, if not the best possible theory, good enough to predict the results of one more experiment. The likelihood ...
Part of the problem here is that the situation presented is an extremely unusual one. Unless scientist B's theory is deliberately idiotic, experiment 21 has to strike at a point of contention between two theories which otherwise agree, and it has to be the only experiment out of 21 which does so. On top of that, both scientists have to pick one of these theories, and they have to pick different ones. Even if those theories are the only ones which make any sense, and they're equally likely from the available data, your chance of ending up in the situation t...
If instead of ten experiments per set, there were only 3, who here would pick theory B instead?
Since both theories satisfy all 20 experiments, for all intents and purposes of experimentation the theories are both equally valid or equally invalid.
Imagine the ten experiments produced the following numbers as results: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The first scientist's hypothesis is this function: if n <= 20 then n else 5 (where n is the number of the experiment)
10 more experiments are done and of course it predicts the answers perfectly. Scientist two comes up with his hypothesis: n. That's it, just the value of n is the value that will be measured by the experiment.
Now, would you really trust the first hypothesis because it happened to have been made before the next te...
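Spelled out in code (with the numbers from the example above), both rules agree on every experiment run so far and differ only at the 21st:

def hypothesis_1(n):
    # Scientist 1's contrived rule: matches every observed result, then diverges.
    return n if n <= 20 else 5

def hypothesis_2(n):
    # Scientist 2's rule: the result simply equals the experiment number.
    return n

assert all(hypothesis_1(n) == hypothesis_2(n) == n for n in range(1, 21))
print(hypothesis_1(21), hypothesis_2(21))   # 5 vs 21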
David D. Friedman asks:
One of the commenters links to Overcoming Bias, but as of 11PM on Sep 28th, David's blog's time, no one has given the exact answer that I would have given. It's interesting that a question so basic has received so many answers.