As I understand it, Solomonoff induction works by a procedure loosely stated as follows: we start with a Kolmogorov-complexity universal prior (a formalized Occam's razor), then update using Bayesian inference any time we see new data.
More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.
To do this, we use the following 'algorithm':
- List all computable sequences in increasing order of the complexity of the algorithm used to generate them, to get a list A_1, A_2, A_3, ...
- We start with the prior P(A_k) = 1/2^k.
- With this prior, we predict to see a string b of observations with probability P(b) = ∑_k P(A_k)[b < A_k], where [b < A_k] = 1 if A_k starts with the string b, and [b < A_k] = 0 otherwise.
- After we see a string b of observations, we update our beliefs using Bayes' theorem: P(A_k|b) = P(b|A_k)P(A_k)/P(b) = [b < A_k]P(A_k)/P(b), and similarly P(bc|b) = P(b|bc)P(bc)/P(b) = P(bc)/P(b), since P(b|bc) = 1 whenever bc extends b.
I'm putting the word 'algorithm' in quotes, because there is no list of all computable sequences where [b < A_k] is a computable function of k and b, let alone one where the sequences are listed in increasing Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
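To make the bookkeeping concrete, here is a minimal Python sketch of the same prior-and-update scheme, run over a small hand-picked family of computable sequences standing in for the oracle-supplied list. The particular hypotheses and helper names (hypotheses, prob, predict_next) are my own illustration, not part of Solomonoff's construction.

```python
from fractions import Fraction

# A small stand-in for the oracle-supplied list: each A_k is given as a
# function n -> nth bit, and A_k gets prior weight 1/2^k.  (Illustrative
# choices only; the real list of all computable sequences is not computable.)
hypotheses = {
    1: lambda n: 0,                        # A_1 = 000000...
    2: lambda n: 1,                        # A_2 = 111111...
    3: lambda n: n % 2,                    # A_3 = 010101...
    4: lambda n: 1 if n % 3 == 0 else 0,   # A_4 = 100100...
}
prior = {k: Fraction(1, 2**k) for k in hypotheses}

def starts_with(k, b):
    """[b < A_k]: does A_k start with the observed string b?"""
    return all(hypotheses[k](n) == bit for n, bit in enumerate(b))

def prob(b):
    """P(b) = sum_k P(A_k) [b < A_k]."""
    return sum(prior[k] for k in hypotheses if starts_with(k, b))

def predict_next(b):
    """P(b0|b) and P(b1|b), computed as P(bc|b) = P(bc)/P(b)."""
    pb = prob(b)
    return {c: prob(b + [c]) / pb for c in (0, 1)}

# After observing a single 1, only A_2 and A_4 survive, so the next-bit
# prediction is a weighted vote between them:
print(predict_next([1]))   # {0: Fraction(1, 5), 1: Fraction(4, 5)}
```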
If we follow this 'algorithm', we get a nice theorem:
For every ε > 0 there is some N such that if b is an initial segment of X of length at least N, then P(bc|b) > 1-ε whenever bc is an initial segment of X, and P(bc|b) < ε otherwise.
So in the limit, we can correctly extrapolate from what we've seen with high confidence.
Proof: Let K = {k : X = A_k}. Let p_0 = ∑_{k in K} 1/2^k. Let M ≥ -log_2(ε p_0). Let b be a sufficiently long initial segment of X such that b is not an initial segment of any A_m, where m ≤ M and m is not in K. Then if bc is an initial segment of X,
P(bc) = ∑_k 1/2^k [bc < A_k] ≥ p_0,
but on the other hand
P(b) = ∑_k 1/2^k [b < A_k] ≤ p_0 + ∑_{k>M} 1/2^k = p_0 + 2^{-M} ≤ (1+ε)p_0.
Therefore,
P(bc|b) = P(bc)/P(b) ≥ p_0/((1+ε)p_0) = 1/(1+ε) > 1-ε.
On the other hand, if bc is not an initial segment of X, let bc' be the initial segment of X of the same length as bc; then P(bc) ≤ P(b) - P(bc') ≤ (1+ε)p_0 - p_0 = εp_0 ≤ εP(b), so P(bc|b) ≤ ε.
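Continuing the toy sketch above (and reusing its definitions), you can watch this happen numerically: take the true universe X to be one of the listed hypotheses and print the conditional probability of the correct next bit as the observed prefix grows. With only four hypotheses the posterior collapses after a couple of bits; the theorem is the analogous limit statement over the full list of computable sequences.

```python
# Take X = A_4 = 100100... as the true universe and track
# P(correct next bit | first n bits).  Wrong hypotheses are eliminated
# as soon as they disagree with an observed bit, and the prediction
# locks onto X.
X = hypotheses[4]
for n in range(1, 7):
    b = [X(i) for i in range(n)]          # first n bits of X
    p_correct = predict_next(b)[X(n)]     # P(true (n+1)th bit | b)
    print(n, float(p_correct))            # 0.2 at n=1, then 1.0 from n=2 on
```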
What if the universe isn't actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is plucked at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. If you repeat the same procedure as above, but list all computable probability distributions on the set of binary sequences instead of all computable sequences and do a similar sort of Bayesian inference, you again get a theorem telling you that as you make observations, your distribution (conditioned on what you've seen so far) converges to the true distribution (conditioned on what you've seen so far).
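For a concrete picture of that variant, here is a minimal sketch assuming the 'computable distributions' are just a few i.i.d. coins (each of which is computable) mixed with arbitrary positive prior weights; the mixture's predictive probability for the next bit converges to the true bias.

```python
import random

# A tiny mixture over computable distributions: i.i.d. coins with a few
# possible biases.  The prior weights are arbitrary positive numbers;
# no appeal to Kolmogorov complexity is made here.
biases  = [0.25, 0.5, 0.75]
weights = [0.2, 0.3, 0.5]

def likelihood(q, b):
    """P(observed string b | i.i.d. coin with bias q)."""
    p = 1.0
    for bit in b:
        p *= q if bit == 1 else 1.0 - q
    return p

def predictive(b):
    """Mixture probability that the next bit is 1, given observations b."""
    post = [w * likelihood(q, b) for w, q in zip(weights, biases)]
    total = sum(post)
    return sum(p * q for p, q in zip(post, biases)) / total

random.seed(0)
true_bias = 0.75
b = [1 if random.random() < true_bias else 0 for _ in range(200)]
print(predictive(b))   # close to 0.75: the posterior concentrates on the true coin
```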
This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What's the advantage of Occam's razor? After all, we could skip the first two steps completely, start with an arbitrary list of all computable sequences and an arbitrary prior supported on the set of computable sequences, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a convex combination of every computable probability distribution, but again, we don't need Kolmogorov complexity at all.)
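To illustrate, here is the earlier toy sketch rerun (reusing its definitions) with an 'anti-Occam' prior that gives the most complex hypothesis the largest weight; once b is long enough that only hypotheses agreeing with X survive the conditioning, the prediction comes out the same.

```python
# Same toy model, but with prior weights that *increase* with complexity.
anti_occam = {1: Fraction(1, 15), 2: Fraction(2, 15),
              3: Fraction(4, 15), 4: Fraction(8, 15)}

def prob_w(b, w):
    return sum(w[k] for k in hypotheses if starts_with(k, b))

b = [hypotheses[4](i) for i in range(6)]     # first 6 bits of X = A_4
for c in (0, 1):
    print(c, prob_w(b + [c], anti_occam) / prob_w(b, anti_occam))
# prints: 0 0 and 1 1, i.e. the true next bit (1) still gets conditional probability 1
```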
Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don't get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam's razor, if the part of Solomonoff induction that corresponds to Occam's razor is not necessary for Solomonoff induction to work? Occam's razor seems to work well in practice, but I don't see how it's in any way necessary (or even useful!) for the math.
Not a problem. Suppose you flip a quantum coin ten times. If you record the output, the K-complexity is about ten bits, so the prior probability of getting that exact output is roughly 1/1024. This is just what you'd get if you assumed it was random.
Basically, K-complexity treats the very laws of physics as random. Any randomness on top of that works the same way as it would if it were part of the laws themselves.
The problem is that it seems that things which are definable but not computable should get more weight than random chance would give them. For example, the K-complexity of a halting oracle is infinite, but it can be defined in finite space. Would the probability of the fine-structure constant being a halting oracle be infinitesimal?
One reason to use K-complexity is that so far, it's worked far better than anything else. As far as we know, we can fit the laws of physics on a note card, yet the universe contains well over 10^80 particles, and don't get me started on the amount of computing power necessary to run it.
But you can't fit a description of the current state of those particles on a note card, which you would need in order to actually make predictions.