As Unknown pointed out, using Kolmogorov complexity is a better bet if winning means finding the truth quickly, instead of just eventually.
The rules of this game, plus the assumption that Nature is maximally adversarial, seem specifically designed so that this particular version of Ockham comes out optimal. It doesn't really seem to provide much insight into what one should do in more general or typical settings.
The question that Kelly is trying to answer is "Why, or in what sense, does Occam's razor work?". Yes, the answer is "It works in that it is worst-case-efficient." He doesn't assume a simplicity-biased prior (which some might characterize as a circular justification of Occam's razor).
I worry that the particular example that I'm presenting here (where the Occam strategy steps in a vaguely linear fashion) is coloring your thinking about Kelly's work generally. His work has little or nothing to do with stepping in a vaguely linear fashion, and everything to do with believing the simplest theory compatible with the evidence. (Which, based on your other writings, I think you would endorse.)
Please investigate further; maybe read one of Kelly's papers. I would hate for my poor writing skills to mislead you.
Nice post! Couple quibbles/questions.
Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.
It's not a challenge unless you think that each of the weak classifiers should be powerful enough to do the classification correctly. Occam = "Do not multiply entities beyond need", not "Do not multiply entities".
I second the OP's recommendation of Zendo. It could've been named "Induction: The Game", it's a great medium for practicing (and teaching!) some basic science and rationality skills, and it's also a heck of a lot of fun. Plus, you can use the neat little Zendo pyramids (also called Icehouse pieces) for various other games.
Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.
I'm not sure this is true. Overfitting means that a classifier achieves good empirical performance (on the known data), but bad generalization performance (on new, previously unseen data). The weak classifiers used in Adaboost et al. don't really suffer from overfitting; they just have bad empirical performance (thi...
Kolmogorov complexity also depends on the description language used to create strings. If the language has a recursive feature, for example, this would assign lower complexity to members of the Fibonacci sequence.
So, part of the task of Science is to induce what Nature's description language is.
Kolmogorov complexity also depends on the description language used to create strings.
The whole point of the KC is that it doesn't depend on the specific programming language (or Turing machine) selected except up to an additive constant. If your language contains a recursive feature, it will assign lower complexity to Fibonacci numbers than my non-recursive language does, but the difference is at most the codelength required for me to simulate your machine on my machine.
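In symbols, this is the standard invariance theorem (my restatement, not part of the comment above): for any two universal machines U and V there is a constant c_{U,V}, independent of the string x, such that

$$K_U(x) \le K_V(x) + c_{U,V},$$

where c_{U,V} is roughly the length of a program for U that simulates V.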
The game of Science vs. Nature is more complicated than that, and it's the interesting structure that allows scientists to make predictions that are better than "what we've seen so far is everything there is." In particular, the interesting thing in both chemistry and particle physics is that scientists were able to find regularities in the data (the Periodic Table is one example) that led them to predict missing particles. Once they knew what properties to look for, they usually were able to find them. When a theory predicts particles that a...
What I love about this post is that, at heart, it challenges the very ability to be rational; it hints at the possibility that what is most effective (and how we ultimately function) is as statistical learning machines whose understanding and predictive ability lie in structures very different from the ones we use to communicate. Our language and formal argument structures define a space of understanding that is more a reflection of our communication technology than the true nature of things. In this way, Occam's razor and Kolmogorov complexity reflect a mo...
Successful analysis of uncertainty without using a prior implies some kind of hidden Bayes-structure.
Consider hypothesis 1 and hypothesis 2. Hypothesis 2 can pretend to be hypothesis 1 but not vice versa. What does this mean?
p(D1|H2)>0
p(D2|H1)=0
Observations about D2 are clearly irrelevant to our present inference problem; instead, what matters is this:
p(D1|H1)=1
p(D1|H2)<1
p(D1|H1)>p(D1|H2)
that is, D1 causes a Bayesian agent to update in favor of hypothesis 1, because hypothesis 1 fits the evidence better.
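In odds form (a standard restatement, not part of the comment above):

$$\frac{p(H_1 \mid D_1)}{p(H_2 \mid D_1)} = \frac{p(D_1 \mid H_1)}{p(D_1 \mid H_2)} \cdot \frac{p(H_1)}{p(H_2)} > \frac{p(H_1)}{p(H_2)},$$

since p(D1|H1) = 1 > p(D1|H2): the likelihood ratio exceeds 1, so the posterior odds favor hypothesis 1 more than the prior odds did.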
(this is only exactly true if D1 and D2 are...
Is your additive example supposed to be the minimum number of 1s needed to represent n as a product or sum of 1s? It looks like that but if so, 6 should be ((1+1)(1+1+1)) and 11 should be (1+(1+1)(1+1+1+1+1)). 12 is still an "inversion."
ETA A few other numbers seem to be off as a result of the mistake with 6. 30 and 42 also have more compact representations.
If there are a very large number of fundamental particles, it is likely that there is some reason for the number, rather than being a merely random number, and if so, then you would likely reach the truth quicker by using some version of Kolmogorov complexity, rather than counting the particles one by one as you discover them.
This is circular. You assume that the universe is likely to have low K-complexity and conclude that a K-complexity razor works well. Kelly's work requires no such assumption; this is why I think it's valuable.
There is a game studied in the philosophy of science and in Probably Approximately Correct (PAC) machine learning. It's a cousin of the Looney Labs game "Zendo" (http://en.wikipedia.org/wiki/Zendo_(game)), but less fun to play with your friends. (By the way, playing this kind of game is excellent practice at avoiding confirmation bias.) The game has two asymmetric players. One player plays Nature, and the other plays Science. First Nature makes up a law, a specific Grand Unified Theory, and then Science tries to guess it. Nature provides some information about the law, and then Science can change its guess, if it wants to. Science wins if it converges to the law that Nature made up.
Kevin T. Kelly is a philosopher of science, and studies (among other things) the justification within the philosophy of science for William of Occam's razor: "entities should not be multiplied beyond necessity". The way that he does this is by proving theorems about the Nature/Science game with all of the details elaborated.
Why should you care? Firstly, his justification is different from the overfitting justification that sometimes shows up in Bayesian literature. Roughly speaking, the overfitting justification characterizes our use of Occam's razor as pragmatic - we use Occam's razor to do science because we get good generalization performance from it. If we found something else (e.g. boosting and bagging, or some future technique) that proposed oddly complex hypotheses, but achieved good generalization performance, we would switch away from Occam's razor.
An aside regarding boosting and bagging: These are ensemble machine learning techniques. Suppose you had a technique that created decision tree classifiers, such as C4.5 (http://en.wikipedia.org/wiki/C4.5_algorithm), or even decision stumps (http://en.wikipedia.org/wiki/Decision_stump). Adaboost (http://en.wikipedia.org/wiki/Adaboost) would start by weighting all of the examples identically and invoking your technique to find an initial classifier. Then it would reweight the examples, prioritizing the ones that the first iteration got wrong, and invoke your technique on the reweighted input. Eventually, Adaboost outputs an ensemble of decision trees (or decision stumps), and taking the majority opinion of the ensemble might well be more effective (generalize better beyond training) than the original classifier. Bagging is similar (http://en.wikipedia.org/wiki/Bootstrap_aggregating).
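For concreteness, here is a minimal from-scratch sketch of the reweighting loop described above (my own illustration, not code from Kelly or from any particular library), assuming binary labels coded as +1/-1 and decision stumps as the weak classifiers:

```python
import numpy as np

def train_stump(X, y, w):
    """Find the decision stump (feature, threshold, polarity) with lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thresh) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thresh, polarity), err
    return best, best_err

def stump_predict(stump, X):
    j, thresh, polarity = stump
    return np.where(polarity * (X[:, j] - thresh) >= 0, 1, -1)

def adaboost(X, y, rounds=20):
    """Discrete AdaBoost: y must be in {-1, +1}. Returns a weighted ensemble of stumps."""
    w = np.full(len(y), 1.0 / len(y))      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump, err = train_stump(X, y, w)
        err = max(err, 1e-12)               # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)      # upweight the examples this stump got wrong
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def ensemble_predict(ensemble, X):
    """Weighted-majority vote of the ensemble."""
    votes = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(votes)
```

The final classifier is a weighted vote over many stumps - a "weirdly complex" hypothesis built from very simple parts - yet it typically generalizes better than any single stump.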
Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.
Secondly, his alternative justification bears on ordering hypotheses "by simplicity", providing an alternative to (approximations of) Kolmogorov complexity as a foundation of science.
Let's take a philosopher's view of physics - from a distance, a philosopher listens to the particle physicists and hears: "There are three fundamental particles that make up all matter!", "We've discovered another, higher-energy particle!", "Another, even higher-energy particle!". There is no reason the philosopher can see why this should stop at any particular number of particles. What should the philosopher believe at any given moment?
This is a form of the Nature vs. Science game where Nature's "Grand Unified Theory" is known to be a nonnegative number (of particles), and at each round Nature can reveal a new fundamental particle (by name) or remain silent. What is the philosopher of science's strategy? Occam's razor suggests that we prefer simpler hypotheses, but if there were 254 known particles, how would we decide whether to claim that there are 254, 255, or 256 particles? Note that in many encodings, 255 and 256 are quite round numbers and therefore have short descriptions (low Kolmogorov complexity).
An aside regarding "round numbers": Here are some "shortest expressions" of some numbers, according to one possible grammar of expressions. (Kolmogorov complexity is unique up to an additive constant, but since we never actually use Kolmogorov complexity, that isn't particularly helpful.)
1 == (1)
2 == (1+1)
3 == (1+1+1)
4 == (1+1+1+1)
5 == (1+1+1+1+1)
6 == (1+1+1+1+1+1)
7 == (1+1+1+1+1+1+1)
8 == ((1+1)*(1+1+1+1))
9 == ((1+1+1)*(1+1+1))
10 == ((1+1)*(1+1+1+1+1))
11 == (1+1+1+1+1+1+1+1+1+1+1)
12 == ((1+1+1)*(1+1+1+1))
13 == ((1)+((1+1+1)*(1+1+1+1)))
14 == ((1+1)*(1+1+1+1+1+1+1))
15 == ((1+1+1)*(1+1+1+1+1))
16 == ((1+1+1+1)*(1+1+1+1))
17 == ((1)+((1+1+1+1)*(1+1+1+1)))
18 == ((1+1+1)*(1+1+1+1+1+1))
19 == ((1)+((1+1+1)*(1+1+1+1+1+1)))
20 == ((1+1+1+1)*(1+1+1+1+1))
21 == ((1+1+1)*(1+1+1+1+1+1+1))
22 == ((1)+((1+1+1)*(1+1+1+1+1+1+1)))
24 == ((1+1+1+1)*(1+1+1+1+1+1))
25 == ((1+1+1+1+1)*(1+1+1+1+1))
26 == ((1)+((1+1+1+1+1)*(1+1+1+1+1)))
27 == ((1+1+1)*((1+1+1)*(1+1+1)))
28 == ((1+1+1+1)*(1+1+1+1+1+1+1))
30 == ((1+1+1+1+1)*(1+1+1+1+1+1))
32 == ((1+1)*((1+1+1+1)*(1+1+1+1)))
35 == ((1+1+1+1+1)*(1+1+1+1+1+1+1))
36 == ((1+1+1)*((1+1+1)*(1+1+1+1)))
40 == ((1+1)*((1+1+1+1)*(1+1+1+1+1)))
42 == ((1+1+1+1+1+1)*(1+1+1+1+1+1+1))
45 == ((1+1+1)*((1+1+1)*(1+1+1+1+1)))
As you can see, there are several inversions, where a larger number (in magnitude) is "simpler" in that it has a shorter shortest expression (in this grammar). The first inversion occurs at 12, which is simpler than 11. In this grammar, squares, powers of two, and smooth numbers generally (http://en.wikipedia.org/wiki/Smooth_number) are considered simple. Even though "the 256th prime" sounds like a short description of a number, this grammar isn't flexible enough to capture that concept, and so it does not consider primes simple. I believe this illustrates that picking a particular approximate Kolmogorov-complexity-like concept amounts to justifying Occam's razor by aesthetics. Humans can argue about aesthetics, and be convinced by arguments somewhat like logic, but ultimately it is a matter of taste.
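For the curious, a table like the one above can be reproduced mechanically. Here is a short dynamic-programming sketch (my own reconstruction; the post's exact grammar and tie-breaking may differ), which builds, for each n, the shortest expression made of a parenthesized run of 1s or a parenthesized sum or product of two smaller expressions, measured by character count:

```python
def shortest_expressions(limit):
    """Shortest (fewest characters) expression for each n in 1..limit, in a toy
    grammar: a parenthesized run of 1s, or a parenthesized sum/product of two
    smaller expressions."""
    best = {1: "(1)"}
    for n in range(2, limit + 1):
        candidates = ["(" + "+".join(["1"] * n) + ")"]       # e.g. (1+1+1)
        for a in range(1, n // 2 + 1):                        # sums, e.g. ((1)+((1+1+1)*(1+1+1+1)))
            candidates.append("(" + best[a] + "+" + best[n - a] + ")")
        for a in range(2, int(n ** 0.5) + 1):                 # products, e.g. ((1+1+1)*(1+1+1+1))
            if n % a == 0:
                candidates.append("(" + best[a] + "*" + best[n // a] + ")")
        best[n] = min(candidates, key=len)
    return best

exprs = shortest_expressions(45)
for n in (11, 12, 13):
    print(n, "==", exprs[n])   # 11 comes out longer than 12: the first "inversion"
```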
In contrast, Kelly's idea of "simplicity" is related to Popper's falsifiability. In this sense of simplicity, a complex theory is one that can (for a while) camouflage itself as another, simpler theory, but a simple theory cannot pretend to be complex. So if Nature really had 256 particles, it could refuse to reveal the last two for a while (maybe they're hidden in "very high energy regimes"), and the 256-particle universe would exactly match the givens for a 254-particle universe. However, the 254-particle universe cannot reveal new particles; it has already shown everything that it has.
Remember, we're not talking about elaborate data analysis, where there could be "mirages" or "aliasing" - patterns in the data that look initially like a new particle, but later reveal themselves to be explicable using known particles. We're talking about a particular form of the Nature vs. Science game where each round Nature either reveals (by name) a new particle or remains silent. This illustrates that Kelly's simplicity is relative to the possible observables. In this scenario, where Nature identifies new particles by name, the hypothesis that we have seen all of the particles and there will be no new ones is always the simplest, the "most falsifiable". With more realistic observables, the question of which consistent hypothesis is simplest becomes trickier.
A more realistic example that Kelly uses (following Kuhn) is the Copernican hypothesis that the earth and other planets circle the sun. In what sense is it simpler than the geocentric hypothesis? From a casual modern perspective, both hypotheses might seem symmetric and of similar complexity. The crucial effect is that Ptolemy's model parameters (velocities, diameters) have to be carefully adjusted to create a "coincidence" - apparent retrograde motion that always coincides with solar conjunction (for Mercury and Venus) and solar opposition (for the other planets). (Note: Retrograde means moving in the unusual direction - east to west against the background stars, opposite the usual eastward drift. Conjunction means two entities near each other in the sky. Opposition means the entities are at opposite points of the celestial sphere.) The Copernican model "predicts" that coincidence; not in the sense that the creation of the model preceded knowledge of the effect, but in the sense that any tiny deviation from exact coincidence discovered in the future would be evidence against the Copernican model. In this sense, the Copernican model is more falsifiable; simpler.
The Ockham Efficiency Theorems explain in what sense this version of Occam's razor is strictly better than other strategies for the Science vs. Nature game. If what we care about is the number of public mind changes (saying "I was mistaken, actually there are X particles." would count as a mind change for any X), and the timing of the mind changes, then Occam's razor is the best strategy for the Science vs. Nature game. The Occam strategy for the number-of-particles game will achieve exactly as many mind changes as there are particles. A scientist who deviates from Occam's razor allows Nature to extract a mind change from the (hasty) scientist "for free".
The way this works in the particles game is simple. To extract a mind change from a hasty scientist who jumps to predicting 12 particles when they've only seen 11, or 256 particles when they've only seen 254, Nature can simply continuously refuse to reveal new particles. If the scientist doesn't ever switch back down to the known number of particles, then they're a nonconvergent scientist - they lose the game. If the scientist, confronted with a long run of "no new particles found" does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible number of mind changes (that is, equal to the number of particles), given such an adversarial Nature.
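To make the bookkeeping concrete, here is a toy simulation of that argument (my own illustration, assuming the simple reveal-or-stay-silent game described above; it is not Kelly's formal framework):

```python
def play(observations, scientist):
    """observations: sequence of 0/1 flags, 1 = Nature reveals a new particle this round.
    scientist: maps (particles seen so far, consecutive silent rounds) to a guess."""
    seen, silence = 0, 0
    guesses = [scientist(seen, silence)]   # initial guess before any data
    mind_changes = 0
    for revealed in observations:
        if revealed:
            seen, silence = seen + 1, 0
        else:
            silence += 1
        guess = scientist(seen, silence)
        if guess != guesses[-1]:
            mind_changes += 1
        guesses.append(guess)
    return mind_changes

def occam(seen, silence):
    # Believe the simplest theory compatible with the evidence: exactly the particles seen.
    return seen

def hasty(seen, silence):
    # Jump from 11 observed particles to a "rounder" 12, then retreat after a long silence.
    if seen >= 11 and silence < 10:
        return 12
    return seen

# Nature reveals 11 particles, then stays silent forever (adversarially).
history = [1] * 11 + [0] * 50
print(play(history, occam))   # 11 mind changes: one per particle
print(play(history, hasty))   # 12 mind changes: Nature extracted one extra retraction for free
```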
The "Ockham Efficiency Theorems" refer to the worked-out details of more elaborate Science vs. Nature games - where Nature chooses a polynomial GUT, for example.
This entire scenario does generalize to noisy observations as well (learning Pearlian causal graphs), though I don't understand this aspect fully. If I understand correctly, the scientist guesses a probability distribution over the possible worlds, and you count "mind changes" as changes of that probability distribution, so adjusting a 0.9 probability to 0.8 would be counted as a fraction of a mind change. Anyway, read the actual papers; they're well-written and convincing.
www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm
This post benefited greatly from encouragement and critique by cousin_it.