As I understand it, Solomonoff induction works by a procedure loosely stated as follows: start with a universal prior based on Kolmogorov complexity (a formalized Occam's razor), then update by Bayesian inference every time we see new data.
More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.
To do this, we use the following 'algorithm':
- List all computable sequences in increasing order of the length of the shortest algorithm that generates them (their Kolmogorov complexity), to get a list A_1, A_2, A_3, ...
- We start with the prior P(A_k) = 1/2^k.
- With this prior, the probability we assign to seeing a string b of observations is P(b) = ∑_k P(A_k)·[b < A_k], where [b < A_k] = 1 if A_k starts with the string b, and [b < A_k] = 0 otherwise.
- After we see a string b of observations, we update our beliefs using Bayes' theorem: P(A_k | b) = P(b | A_k)P(A_k)/P(b) = [b < A_k]·P(A_k)/P(b), and similarly, for any extension bc of b (that is, b followed by a further string c), P(bc | b) = P(b | bc)P(bc)/P(b) = P(bc)/P(b).
I'm putting the word 'algorithm' in quotes, because there is no list of all computable sequences for which [b < A_k] is a computable function of k and b, let alone one where the sequences are listed in order of increasing Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
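To make the update rule concrete, here is a minimal Python sketch of the same Bayesian update, with a small hand-picked list of generators standing in for the uncomputable enumeration (the particular sequences, the function names, and the truncated 1/2^k prior are all just illustrative choices of mine):

```python
# Toy stand-in for the 'algorithm' above: a finite, hand-picked list of
# generators replaces the (uncomputable) list of all computable sequences.
from fractions import Fraction

def const_zero(n):   return [0] * n                                   # 000000...
def alternating(n):  return [i % 2 for i in range(n)]                 # 010101...
def period_four(n):  return [[0, 1, 0, 0][i % 4] for i in range(n)]   # 01000100...
def const_one(n):    return [1] * n                                   # 111111...

hypotheses = [const_zero, alternating, period_four, const_one]
prior = [Fraction(1, 2 ** (k + 1)) for k in range(len(hypotheses))]   # P(A_k) = 1/2^k

def prob_prefix(b):
    """P(b) = sum_k P(A_k) * [b < A_k]."""
    return sum(p for p, h in zip(prior, hypotheses) if h(len(b)) == list(b))

def predict_next(b):
    """P(b0 | b) and P(b1 | b), via P(bc | b) = P(bc) / P(b)."""
    pb = prob_prefix(b)
    return {c: prob_prefix(list(b) + [c]) / pb for c in (0, 1)}

print(predict_next([0, 1, 0]))   # {0: 1/3, 1: 2/3}: alternating outweighs period_four
```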
If we follow this 'algorithm', we get a nice theorem:
For all ε > 0 there is some N such that if b is an initial segment of X of length at least N, then P(bc | b) > 1 − ε if bc is an initial segment of X, and P(bc | b) < ε otherwise.
So in the limit, we can correctly extrapolate from what we've seen with high confidence.
Proof: Let K = {k : A_k = X}. Let p_0 = ∑_{k ∈ K} 1/2^k. Let M ≥ -log_2(ε p_0). Let b be a sufficiently long initial segment of X such that b is not an initial segment of any A_m with m ≤ M and m ∉ K. Then if bc is an initial segment of X,
P(bc) = ∑_k 1/2^k·[bc < A_k] ≥ p_0,
but on the other hand
P(b) = ∑_k 1/2^k·[b < A_k] ≤ p_0 + ∑_{k > M} 1/2^k = p_0 + 2^(-M) ≤ (1 + ε)p_0.
Therefore,
P(bc | b) = P(bc)/P(b) ≥ p_0/((1 + ε)p_0) = 1/(1 + ε) > 1 − ε.
The other case follows from this one: if bc extends b but is not an initial segment of X, let bc' be the initial segment of X of the same length as bc; the events of starting with bc and of starting with bc' are disjoint, so P(bc | b) ≤ 1 − P(bc' | b) < ε.
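Here is a toy numerical check of the theorem (again my own contrived setup, nothing canonical): take X to be the all-zeros sequence and let it compete against 'impostor' hypotheses that agree with X up to some position m and then output ones forever. As the observed prefix grows, the impostors get ruled out and the predictive probability of the true continuation climbs to 1:

```python
# Toy check of the theorem: X = 000..., competing against 'impostors' A_m
# that output zeros up to position m and ones afterwards.
from fractions import Fraction

def impostor(m):
    return lambda n: [0 if i < m else 1 for i in range(n)]

hypotheses = [lambda n: [0] * n] + [impostor(m) for m in range(1, 16)]
prior = [Fraction(1, 2 ** (k + 1)) for k in range(len(hypotheses))]

def prob_prefix(b):
    return sum(p for p, h in zip(prior, hypotheses) if h(len(b)) == list(b))

for n in (1, 4, 8, 12):
    b = [0] * n
    print(n, float(prob_prefix(b + [0]) / prob_prefix(b)))
# the printed probabilities climb toward 1 (roughly 0.75, 0.94, 0.996, 0.9998)
```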
What if the universe isn't actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is drawn at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. Repeat the same procedure as above, but instead of listing all computable sequences, list all computable probability distributions on the set of binary sequences and do a similar sort of Bayesian inference; you again get a theorem saying that as you make observations, your distribution conditioned on what you've seen so far converges to the true distribution conditioned on what you've seen so far.
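As a sketch of what this looks like, here is the same kind of update with a small family of Bernoulli(p) laws standing in for 'all computable probability distributions' (the candidate values of p and the uniform prior over them are just my illustrative choices; the real construction mixes over every computable distribution):

```python
# Toy version of the stochastic setting: a few Bernoulli(p) laws stand in
# for 'all computable probability distributions on binary sequences'.
import random
from fractions import Fraction

ps = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]   # candidate laws
prior = [Fraction(1, 3)] * 3                            # arbitrary full-support prior

def likelihood(p, bits):
    """P(bits | i.i.d. Bernoulli(p))."""
    out = Fraction(1)
    for b in bits:
        out *= p if b == 1 else 1 - p
    return out

def posterior(bits):
    w = [q * likelihood(p, bits) for q, p in zip(prior, ps)]
    total = sum(w)
    return [x / total for x in w]

random.seed(0)
bits = [1 if random.random() < 0.75 else 0 for _ in range(200)]   # 'true' law: p = 3/4

post = posterior(bits)
print([round(float(x), 6) for x in post])            # mass piles onto p = 3/4
print(float(sum(x * p for x, p in zip(post, ps))))   # predictive P(next bit = 1) ~ 0.75
```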
This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What's the advantage of Occam's razor? After all, we could skip the first two steps completely, start with an arbitrary list of all computable sequences and an arbitrary prior that assigns positive probability to each of them, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a convex combination, with positive weights, of every computable probability distribution, but again, we don't need Kolmogorov complexity at all.)
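To illustrate that the ordering and the exact prior don't matter, here is the impostor demo again with the enumeration shuffled and a non-complexity-based prior (weights proportional to 1/(k+1)^2, an arbitrary choice of mine); the predictive probability of the true continuation still converges to 1, and only the constant N in the theorem changes:

```python
# Same toy demo, but with the enumeration shuffled and a non-complexity-based
# prior (weights proportional to 1/(k+1)^2). Only the constant N changes.
import random
from fractions import Fraction

def impostor(m):
    return lambda n: [0 if i < m else 1 for i in range(n)]

hypotheses = [lambda n: [0] * n] + [impostor(m) for m in range(1, 16)]
random.seed(1)
random.shuffle(hypotheses)                     # order no longer tracks simplicity

weights = [Fraction(1, (k + 1) ** 2) for k in range(len(hypotheses))]
prior = [w / sum(weights) for w in weights]    # arbitrary prior, positive everywhere

def prob_prefix(b):
    return sum(p for p, h in zip(prior, hypotheses) if h(len(b)) == list(b))

for n in (1, 4, 8, 12, 16):
    b = [0] * n
    print(n, float(prob_prefix(b + [0]) / prob_prefix(b)))
# still converges to 1; it is exactly 1 by n = 16, once every impostor is ruled out
```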
Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don't get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam's razor, if the part of Solomonoff induction that corresponds to Occam's razor is not necessary for Solomonoff induction to work? Occam's razor seems to work well in practice, but I don't see how it's in any way necessary (or even useful!) for the math.
I know it's not a problem. I explained exactly how to modify Solomonoff induction to handle universes that are generated randomly according to some computable law, as opposed to being generated deterministically according to an algorithm.
Maybe it is, maybe it isn't. Maybe your definition of Kolmogorov complexity is such that the Kolmogorov complexity of every string is at least 3^^^3, because you defined Kolmogorov complexity using a really dumb universal machine. Maybe it's 2, because you programmed a universal machine that is really good at compressing the string 0101110010, because you like it, and that happens to be what you flipped. If you flip a coin a very large number of times, it's very likely that the Kolmogorov complexity of the output is close to the number of flips, but it could be much smaller due to chance.
False, because you don't actually know the complexity was 10.
No, it's not doing that at all. Using a complexity-weighted prior is assuming that the universe follows some random computable law, which is more likely to be simple than complex. I suppose this is true to the extent that any probability distribution on a countable set assigns weights that vanish in the limit (along any enumeration of that set!), but I see no reason to bring Kolmogorov complexity into it.
I have no idea how to make sense of this question. Are infinitely many bits of fine structure constant even physically meaningful? If beyond a certain point of precision, the fine structure constant can't be measured even in principle because it has no detectable physical effects on accessible time/space scales, it makes no sense to even have an answer to the question "is the fine structure constant a halting oracle?" let alone the probability of such an event.
I'll grant you that Occam's razor works well in science. That wasn't the question. The question is what is the advantage, if any, of starting with a Kolmogorov prior for Solomonoff induction, as opposed to any other arbitrary prior. You haven't answered my question.