As I understand it, Solomonoff induction works by a procedure loosely stated as follows: we start with a universal prior based on Kolmogorov complexity (a formalized Occam's razor), then update by Bayesian inference every time we see new data.
More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.
To do this, we use the following 'algorithm':
1. List all computable sequences in increasing order of the Kolmogorov complexity of the programs that generate them, obtaining a list A_1, A_2, A_3, ...
2. Start with the prior P(A_k) = 1/2^k.
3. With this prior, we expect to see a string b of observations with probability P(b) = ∑_k P(A_k)·[b < A_k], where [b < A_k] = 1 if A_k starts with the string b, and [b < A_k] = 0 otherwise.
4. After we see a string b of observations, we update our beliefs using Bayes' theorem: P(A_k | b) = P(b | A_k) P(A_k) / P(b) = [b < A_k] P(A_k) / P(b), and similarly, writing bc for b followed by the string c, P(bc | b) = P(b | bc) P(bc) / P(b) = P(bc) / P(b). (A toy code sketch of this update follows the list.)
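To make the bookkeeping concrete, here is a minimal sketch in Python of the prior and the update above, truncated to a handful of hand-picked computable sequences (the real construction enumerates all of them, which is exactly what makes it uncomputable). The hypotheses and names are illustrative only, not part of Solomonoff's actual construction.

```python
# Toy sketch of the update above, truncated to a few hand-picked
# "computable sequences". Hypotheses and names are illustrative only.

def make_hypotheses():
    # A_k is represented as a function from index n to a bit.
    return [
        lambda n: 0,                        # A_1: 000000...
        lambda n: 1,                        # A_2: 111111...
        lambda n: n % 2,                    # A_3: 010101...
        lambda n: 1 if n % 3 == 0 else 0,   # A_4: 100100...
    ]

def prior(k):
    # P(A_k) = 1/2^k for k = 1, 2, 3, ...
    return 2.0 ** (-k)

def starts_with(hypothesis, b):
    # [b < A_k]: does A_k begin with the observed string b?
    return all(hypothesis(n) == bit for n, bit in enumerate(b))

def predictive(b, hypotheses):
    # P(b) = sum_k P(A_k) [b < A_k]
    return sum(prior(k + 1) for k, h in enumerate(hypotheses) if starts_with(h, b))

def predict_next_one(b, hypotheses):
    # P(b1 | b) = P(b1) / P(b): probability that the next bit is 1.
    pb = predictive(b, hypotheses)
    return predictive(b + [1], hypotheses) / pb if pb > 0 else None

hyps = make_hypotheses()
observed = [0, 1, 0, 1, 0]                # an initial segment consistent only with A_3
print(predict_next_one(observed, hyps))   # -> 1.0, since only A_3 survives
```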
I'm putting the word 'algorithm' in quotes, because there is no list of all computable sequences where [b < A_k] is a computable function of k and b, let alone one where the sequences are listed in increasing order of Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
If we follow this 'algorithm', we get a nice theorem:
For all ε > 0 there is some N such that if b is an initial segment of X of length at least N, then P(bc | b) > 1-ε if bc is an initial segment of X, and P(bc | b) < ε otherwise.
So in the limit, we can correctly extrapolate from what we've seen with high confidence.
Proof: Let K = {k : X = A_k}. Let p_0 = ∑_{k∈K} 1/2^k. Let M ≥ -log_2(ε p_0), so that 2^{-M} ≤ ε p_0. Let b be a sufficiently long initial segment of X such that b is not an initial segment of any A_m with m ≤ M and m not in K. Then if bc is an initial segment of X,
P(bc) = ∑_k (1/2^k)·[bc < A_k] ≥ p_0,
but on the other hand
P(b) = ∑_k (1/2^k)·[b < A_k] ≤ p_0 + ∑_{k>M} 1/2^k = p_0 + 2^{-M} ≤ (1+ε) p_0.
Therefore,
P(bc | b) = P(bc)/P(b) ≥ p_0 / ((1+ε) p_0) = 1/(1+ε) > 1-ε. If instead bc is not an initial segment of X, then by the choice of b no A_k with k ≤ M starts with bc, so P(bc) ≤ ∑_{k>M} 1/2^k = 2^{-M} ≤ ε p_0 ≤ ε P(b), giving P(bc | b) ≤ ε.
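To see the theorem numerically, here is a toy sketch, again truncating to a finite, hand-picked hypothesis family, so it is only an illustration: the true sequence is deliberately given the smallest prior weight, and the posterior on it still climbs to 1 once the finitely many competing hypotheses have been contradicted by the data.

```python
# Numerical illustration of the theorem with a finite hypothesis family:
# B_k = k zeros followed by all ones (k = 1..20), and B_21 = all zeros.
# The true sequence X is all zeros, given the smallest prior weight.

def hypotheses(m=20):
    hyps = [(lambda i, k=k: 0 if i < k else 1) for k in range(1, m + 1)]
    hyps.append(lambda i: 0)  # B_{m+1}: the true sequence, all zeros
    return hyps

def weight(k):
    return 2.0 ** (-k)  # prior P(B_k) = 1/2^k

def consistent(h, b):
    return all(h(i) == bit for i, bit in enumerate(b))

def posterior_on_truth(n, hyps):
    # P(true sequence | b) after observing b = the first n zeros of X.
    b = [0] * n
    pb = sum(weight(k + 1) for k, h in enumerate(hyps) if consistent(h, b))
    return weight(len(hyps)) / pb  # the true sequence is the last hypothesis

hyps = hypotheses()
for n in (1, 5, 10, 15, 20, 21):
    print(n, posterior_on_truth(n, hyps))
# The posterior on the true (all-zeros) sequence climbs to 1 once the
# finitely many "wrong" hypotheses with shorter zero-prefixes are ruled
# out, even though its prior, 2^-21, is the smallest of all.
```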
What if the universe isn't actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is plucked at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. Repeat the same procedure as above, but instead of listing all computable sequences, list all computable probability distributions on the set of binary sequences and do a similar sort of Bayesian inference. You again get a theorem telling you that, as you make observations, your distribution (conditioned on what you've seen so far) converges to the true distribution (conditioned on what you've seen so far).
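As a toy illustration of this random-universe variant (not the actual construction, which would enumerate all computable distributions), here is a sketch that mixes over a few Bernoulli coins with prior weights roughly 1/2^k; the mixture's predicted probability for the next bit converges to the true bias.

```python
# Toy version of the "random universe" variant: mix over a few computable
# probability distributions -- here, i.i.d. Bernoulli(p) coins for a
# handful of biases p (purely illustrative).
import random

biases = [0.1, 0.25, 0.5, 0.75, 0.9]                        # the "computable distributions"
weights = [2.0 ** (-(k + 1)) for k in range(len(biases))]   # prior weights ~ 1/2^k

def likelihood(p, bits):
    out = 1.0
    for b in bits:
        out *= p if b == 1 else 1 - p
    return out

def predictive_of_one(bits):
    # Posterior predictive P(next bit = 1 | bits) under the mixture.
    num = sum(w * likelihood(p, bits) * p for w, p in zip(weights, biases))
    den = sum(w * likelihood(p, bits) for w, p in zip(weights, biases))
    return num / den

random.seed(0)
true_p = 0.75
data = [1 if random.random() < true_p else 0 for _ in range(500)]
for n in (0, 10, 100, 500):
    print(n, predictive_of_one(data[:n]))
# The prediction for the next bit converges toward the true bias 0.75 as
# observations accumulate, whatever the prior weights were.
```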
This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What's the advantage of Occam's razor? After all, we could skip steps 1 and 2 completely, start with an arbitrary list of all computable sequences and an arbitrary prior supported on the set of computable sequences, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a convex combination of every computable probability distribution, but again, we don't need Kolmogorov complexity at all.)
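Spelling that out: run the same proof with arbitrary positive weights w_k in place of 1/2^k. Set p_0 = ∑_{k∈K} w_k and pick M large enough that the tail ∑_{k>M} w_k ≤ ε p_0, which is possible because ∑_k w_k converges. The rest of the argument goes through verbatim: all it uses is that every A_k equal to X gets positive weight, and that the total weight of the hypotheses beyond any finite point can be made arbitrarily small.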
Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don't get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam's razor, if the part of Solomonoff induction that corresponds to Occam's razor is not necessary for Solomonoff induction to work? Occam's razor seems to work well in practice, but I don't see how it's in any way necessary (or even useful!) for the math.
Er... isn't this like saying "To my knowledge, there is no theoretical justification of biblical inerrancy. For a Christian though, biblical inerrancy has the nice property of cutting off the probability of any algorithm that incorporates atheism"?
Occam's razor had better have an independent reason to be sensible, other than reaffirming a previously held (on what basis?) belief in atheism.
Not exactly. There is no way to discount biblical evidence a priori. We can, however, massively discount biblical evidence by deriving empirical predictions from it and finding that they do not match reality. For example, the Earth is vastly older than the computed 6,000 years.
The point of a prior distribution is to incorporate any principle or knowledge we have at hand. If we have no prior knowledge, that is, no evidence, no observations, a literal tabula rasa, the only guides we have in designing a prior are principles. That is the point I tried to make: Before we...