I've been trying to understand the uses and limitations of Solomonoff induction. Following the principle that in order to fully understand something you should explain it to others, here's a try. I prefer to write such things in the form of a dialogue, as that better reflects the thought process.
This is not a very in-depth technical article - for example, cousin_it and Wei Dai have obviously spent more time pondering SI. (I'm not a long-time LW reader, but I have skimmed through the existing LW and wiki articles on related topics before posting this.)
Alice. Hi, I'm interested in the question of why and when I should prefer simpler hypotheses. I've heard about Occam's razor and I've read about Solomonoff induction and the Universal Prior. Now I'm looking for a philosophical justification of the math. I'd like to have something akin to de Finetti's justification for probability theory - "if you don't believe in the axioms, you're going to be Dutch booked!".
Bob. You're welcome. Do you have any problems with the formulations?
A. I'm trying to understand how to connect the informal concept of Occam's razor with the mathematical formula of the Universal Prior in a meaningful way. Informally, a hypothesis is something that explains the data. Occam's razor tells us to prefer simpler hypotheses.
B. Well, yes.
A. But the Universal Prior seems to say something even stronger: that all shorter hypotheses are more likely! Clearly, that's not the case: if F and G are hypotheses, then "F OR G" is a hypothesis as well, and it is not less likely than either F or G individually! What's more, there is a whole class of language-and-domain pairs where longer hypotheses have the same average probability. For one particular example, consider propositional logic with AND, OR, and NOT operators. Because of the symmetry between conjunction and disjunction, a well-formed 10-element formula is as likely to be a tautology (i.e., always true) as a 100-element formula!
B. You're confused. In this formalism, you should interpret all hypotheses as programs for a Universal Turing Machine. The machine reads a hypothesis from one of its tapes. It then executes the hypothesis and outputs the result of the computation to another tape. The result is the data - the actual string whose prefix we're observing and whose remaining part we're trying to predict. The hypothesis is the input string - the string we're not observing. In order to predict future output, our best option is to guess which input program the machine is using. Some programs can be ruled out because they don't correspond to the observed output, but an infinite number of possible programs always remains. The Universal Prior says that shorter programs are more likely. So a hypothesis is just a bunch of Turing Machine instructions. It makes sense to speak about the "joint hypothesis of F and G" - it's a program that does both what F does and what G does. On the other hand, it makes no sense to speak about "F OR G" in the way you were doing it before.
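To make that concrete, here is the standard form of the prior - a sketch following the usual presentation (as in the Scholarpedia article cited below), where U is a universal monotone machine and ℓ(p) is the length of program p in bits:

```latex
% Universal Prior: the weight of a string x is the total weight of all programs p
% on which the universal machine U produces output beginning with x.
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
% Prediction: the probability of the next symbol given the observed prefix.
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
```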
A. I see, right! Actually, I'm not a mathematician or an AI designer; I just want to reason about the physical universe. Let's fix it as the domain of our conversation. It seems to me that hypotheses in some sense correspond to the laws of physics. The output corresponds to the physical universe itself - everything that we can observe by using our senses. But what does the Turing Machine correspond to?
B. You're looking at it from the wrong angle. The justification of Occam's razor is an epistemic, not an ontological issue. It's not essential whether there is or isn't a specific Turing Machine "out there" computing our universe; it's essential that our universe is behaving as if it was computable! Let's quote Scholarpedia: "It is clear that, in a world with computable processes, patterns which result from simple processes are relatively likely, while patterns that can only be produced by very complex processes are relatively unlikely."
A. But when we're calculating the complexity of hypotheses, we're doing it with respect to a particular model of computation!
B. The Church-Turing thesis says that the choice of a particular machine is not very important - any Universal Turing Machine can simulate any other model of computation.
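More precisely, the relevant fact is the invariance theorem - a sketch in standard notation, where K_U(x) is the length of the shortest program for x on machine U:

```latex
% Invariance theorem: for any two universal machines U and V there is a
% constant c_{UV}, independent of x, such that
K_U(x) \;\le\; K_V(x) + c_{UV} \qquad \text{for all strings } x.
% So the induced priors 2^{-K_U(x)} and 2^{-K_V(x)} differ by at most the
% constant factor 2^{c_{UV}}: the choice of machine shifts, but does not
% destroy, the preference for simpler strings.
```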
A. But who says it's a Universal Turing Machine? It may be anything, really! For example, who says that the universe doesn't work as a quasi-Turing machine that outputs a huge number of '1's with 99% probability every time after it outputs a single '0'? The Universal Prior relies on the relation between the inputs and outputs of the machine; if this relation is changed, the probabilities are going to be wrong.
On the other hand, the physical model may be strictly more powerful than a Universal Turing machine - it may be a hypercomputer! If it is, the universe is uncomputable.
B. Well, Occam's razor says that the model should be the simplest possible... Ah, I see, appealing to the razor here is a kind of circular reasoning.
At the moment I have another line of thought: consider what happens when we throw in the Simulation Hypothesis. The probability of being in a simulation depends on the computational power available to the simulated universes. If that power is significantly smaller than in the parent universe, the recursive chain of simulations is shorter; if it is closer to the parent's power, the chain is longer. Therefore, the probability of being in a simulation is proportional to the length of the chain. This implies that if we're living in a simulation, then our universe is almost certainly not significantly less powerful than our "parent" universe. Either both our universe and the alien universe are computable, or they are both hypercomputable (in the same sense of the word). Since it seems that we cannot create hypercomputers in our universe, it's reasonable that the aliens cannot do that either. So this is evidence that our universe is computable.
A. Still there is a small probability that uncomputable oracles are present in the physical world, and we have just failed to recognize them. Perhaps we could learn about them in the future and harness their power somehow.
B. Sure - but the point is that we have yet to see any evidence for them. And there's evidence against them - the fact that our computer simulations match reality very well, as long as we have the computing power they require! We've been looking for strange messages in the sky, and we haven't found them. We've been looking for messages in our cells, and we haven't found them.
A. In the worst case we can still justify Occam's razor for the physical universe by induction on empirical experience, right?
B. Hmm... I thought for a while and now I've got a better justification! See, even if the universe itself is uncomputable, there's still a MYRIAD of processes in it that ARE COMPUTABLE. We know that gravity and electromagnetism do not behave in random ways; they are at least approximately computable. Molecular dynamics is approximately computable. The cell is approximately computable. The nervous system is approximately computable. Evolution should be approximately computable; we can compute some of its basic elements astonishingly well. When Mendel was determining how heredity works by doing his pea-hybridization experiments, he was investigating a discrete, beautifully computable process.
Nature fights against randomness. Pure uncontrolled randomness does not allow complex systems to emerge. Even if the space-time of physics itself is completely continuous and uncomputable (which is far from given), Nature at higher levels favors computable and discrete processes. In a sense, Nature has an infinite number of monkeys - but these monkeys are not sitting at a typewriter and writing the whole damn thing. Instead, they are sitting at a computer and writing input programs. As Seth Lloyd says (source):
Quantum mechanics supplies the universe with “monkeys” in the form of random quantum fluctuations, such as those that seeded the locations of galaxies. The computer into which they type is the universe itself. From a simple initial state, obeying simple physical laws, the universe has systematically processed and amplified the bits of information embodied in those quantum fluctuations. The result of this information processing is the diverse, information-packed universe we see around us: programmed by quanta, physics gave rise first to chemistry and then to life; programmed by mutation and recombination, life gave rise to Shakespeare; programmed by experience and imagination, Shakespeare gave rise to Hamlet. You might say that the difference between a monkey at a typewriter and a monkey at a computer is all the difference in the world.
A. Fine. This really has convinced me that Occam's razor is in some sense justified. I agree that the majority of the processes in the universe are computable or computationally approximable. I agree that for these processes, the Universal Prior follows. On the other hand, Occam's razor isn't justified for the ultimate laws of physics, because our universe might be uncomputable.
B. Well, there might be no such thing as "laws" in the ontological sense. Look, it's probable that there were an infinite number of inflationary bubbles expanding in the primordial Multiverse. Random regularities generated in a particular subset of them lead to the appearance of laws. In this case the Universal Prior is exactly what is needed to sort out which sets of physical laws are probable and which are improbable. I think that the existence of this simple physical model is evidence for a simple computational model, and therefore evidence against hypercomputation and the kind of probabilistic, quasi-Turing machines you mentioned before.
A. Maybe. However, I still doubt that Occam's razor is always the best option to use in practice, even if it is theoretically justified.
Solomonoff induction guarantees good results when we are able to observe the prefix of the output data completely and precisely. Instead, real-world reasoning requires working with data that has measurement errors, random noise, missing samples... and so on. Furthermore, we often need not so much a single precise hypothesis but rather something broader - a type or a class of hypotheses - what machine learning people call "a model". Indeed, it turns out that machine learning researchers are working precisely on these problems. They are not terribly excited about Occam's razor. Let me quote from [Domingos 1999]:
"First [version of Occam's] razor: Given two models with the same generalization error, the simpler one should be
preferred because simplicity is desirable in itself.
[..]
Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.
We believe that it is important to distinguish clearly between these two versions of Occam's razor. The first one is largely uncontroversial, while the second one, taken literally, is false.
Several theoretical arguments and pieces of empirical evidence have been advanced to support it, but each of these is reviewed below and found wanting."
B. So what do you suggest?
A. It seems to me that Occam's razor surely is a good option when we need to explain data from controlled scientific experiments. It's the best option if we know how to computationally simulate the process and are able to isolate it from the environment. If we don't know the mechanics of the process, we cannot even approximately calculate the computational complexity of the data. If we cannot isolate the system from the environment (and the system doesn't have good enough built-in error-correction facilities), then the results of the simulations won't reproduce real-world data. (For example, we can simulate the heredity of peas. We can get the probability distribution of recessive and dominant traits across pea hybrids in silico, and then get nearly the same results in vivo. On the other hand, there's little hope that we'll ever be able to simulate the unfolding of long-term evolution and replicate its past results in our computers. Even though we do understand the mathematical basis of evolution, simulating the environmental impact is beyond our reach.)
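(Here is a minimal sketch of what I mean by the pea simulation - assuming the textbook monohybrid cross with a single gene and complete dominance, not Mendel's full experimental setup:)

```python
import random

def monohybrid_cross(n_offspring=10000):
    """Cross two Aa heterozygotes; the dominant allele 'A' masks the recessive 'a'."""
    dominant = 0
    for _ in range(n_offspring):
        # Each parent contributes one allele, chosen at random.
        genotype = (random.choice("Aa"), random.choice("Aa"))
        if "A" in genotype:
            dominant += 1
    return dominant / n_offspring

# The classic expectation is a 3:1 ratio of dominant to recessive phenotypes,
# so the simulated fraction should come out close to 0.75.
print(monohybrid_cross())
```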
B. We also should not forget that there are other motivations behind the razor. When evaluating a theory, its universal a priori probability is not the only thing that counts: theories with lower cognitive complexity are preferable because they are easier to reason about; theories that lead to algorithms with lower time complexity are preferable because they save our processing power. The universal prior is uncomputable, after all; it can only be approximated. So it's OK to trade off marginal gains in probability - the apparent gain may be an error anyway. It's perfectly rational to examine and trade off Kolmogorov complexity against other kinds of complexity.
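(To show what "approximating" the prior can look like, here is a toy sketch. The interpreter below is a made-up, deliberately non-universal machine, and the length bound is arbitrary - the point is only the enumerate-and-weight structure, with each program weighted by 2 to the minus its length.)

```python
from itertools import product

def toy_machine(program, max_out=1000):
    """Toy interpreter (NOT a universal machine): read a bit string as
    <unary count of leading 1s> '0' <pattern>, and output the pattern
    repeated (count + 1) times, truncated to max_out characters."""
    if "0" not in program:
        return None  # treated as non-halting / invalid in this toy model
    count = program.index("0")
    pattern = program[count + 1:]
    if not pattern:
        return None
    return (pattern * (count + 1))[:max_out]

def approx_prior(prefix, max_len=16):
    """Approximate the prior weight of outputs starting with `prefix` by
    enumerating all programs up to max_len bits and summing 2^(-length)
    over those whose output begins with the prefix."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            out = toy_machine("".join(bits))
            if out is not None and out.startswith(prefix):
                total += 2.0 ** -length
    return total

# A periodic prefix can be produced by short programs, so it collects more
# weight than an irregular prefix of the same length:
print(approx_prior("0101010101"))  # relatively high
print(approx_prior("0110100110"))  # lower
```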
A. So the bottom line is that the set of situations in which Occam's razor (in the informal sense) is justified is neither a superset nor a subset of the set of situations in which Solomonoff induction is justified. Besides, the question of the nature of the universe remains open - as always...
References:
[Domingos 1999] P. Domingos (1999), "The Role of Occam's Razor in Knowledge Discovery", Data Mining and Knowledge Discovery, Springer. [PDF]
[Scholarpedia 2007] Marcus Hutter et al. (2007), "Algorithmic probability", Scholarpedia, 2(8):2572. [link]
[EDIT note: This is completely different from what I originally wrote in response to lavalamp's question, because originally I completely misunderstood it. Sorry.]
You can't "count every possible program equally". (What probability will you give each possible program? If it's positive then your total probability will be infinite. If it's zero then your total probability will be zero. You can do a lot of probability-like things on a space with infinite total measure, in which case you could give every program equal weight, but that's not generally what one does.)
Leaving that aside: Your argument suggests that whatever probabilities we give to programs, the resulting probabilities on outcomes will end up favouring outcomes that can be produced by simpler programs. That's true. But a universal prior is doing something stronger than that: it gives (in a particular way) higher probabilities to simpler programs as well. So outcomes generated by simpler programs will (so to speak) be doubly favoured: once because those simple programs have higher probabilities, and once because there are "more" programs that generate those outcomes.
In fact, any probability assignment with finite total probability (in particular, any with total probability 1, which is of course what we usually require) must "in the limit" give small probabilities to long programs. But a universal prior is much more specific, and says how program length corresponds to probability.
According to the Scholarpedia article, it's not required that the probabilities add up to 1:
∑_x m(x) < 1
It's simpler to define the Universal Prior in such a way that non-halting programs are "not counted". So the sum is not over all programs, just the halting ones.
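For reference, the reason the sum stays below 1 is roughly this (assuming the programs are self-delimiting / prefix-free, as in the standard construction):

```latex
% Kraft's inequality for a prefix-free set of (halting) programs:
\sum_{p \,:\, U(p) \text{ halts}} 2^{-\ell(p)} \;\le\; 1
% m(x) collects weight only from halting programs, so summing over all
% outputs x cannot exceed 1 - and it is strictly below 1 as soon as some
% program never halts:
\sum_x m(x) \;=\; \sum_{p \,:\, U(p) \text{ halts}} 2^{-\ell(p)} \;<\; 1
```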