As I understand it, Solomonoff induction works by a procedure loosely stated as follows: we start with a Kolmogorov-complexity universal prior (a formalized Occam's razor), then update by Bayesian inference every time we see new data.
More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.
To do this, we use the following 'algorithm':
- List all computable sequences in increasing order of the complexity of the algorithm used to generate them, to get a list A_1, A_2, A_3, ...
- We start with the prior P(A_k) = 1/2^k.
- With this prior, the probability of observing a string b is P(b) = ∑_k P(A_k) [b < A_k], where [b < A_k] = 1 if A_k starts with the string b, and [b < A_k] = 0 otherwise.
- After we see a string b of observations, we update our beliefs using Bayes' theorem: P(A_k | b) = P(b | A_k) P(A_k) / P(b) = [b < A_k] P(A_k) / P(b), and similarly P(bc | b) = P(b | bc) P(bc) / P(b) = P(bc) / P(b).
I'm putting the word 'algorithm' in quotes, because there is no list of all computable sequences for which [b < A_k] is a computable function of k and b, let alone one where the sequences are listed in increasing order of Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
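To make the update rule concrete, here is a minimal Python sketch over a small, hand-picked family of computable sequences. The three generators and their names are illustrative stand-ins of my own choosing; a finite family is obviously a drastic substitute for the full (uncomputable) enumeration.

```python
# Toy sketch of the 'algorithm' over a finite, hand-picked family of
# computable sequences; the real construction enumerates all of them.

def all_zeros(n):
    return 0

def alternating(n):           # 0, 1, 0, 1, ...
    return n % 2

def thue_morse(n):            # parity of the number of 1 bits in n
    return bin(n).count("1") % 2

generators = [all_zeros, alternating, thue_morse]           # stand-ins for A_1, A_2, A_3
prior = [2.0 ** -(k + 1) for k in range(len(generators))]   # P(A_k) = 1/2^k

def P(b):
    """P(b): total prior weight of the A_k that start with the prefix b."""
    return sum(p for p, g in zip(prior, generators)
               if all(g(i) == bit for i, bit in enumerate(b)))

def predict_next_one(b):
    """Posterior probability that the next bit is 1: P(b + [1]) / P(b)."""
    return P(b + [1]) / P(b)

observed = [thue_morse(i) for i in range(4)]   # [0, 1, 1, 0]
print(predict_next_one(observed))              # 1.0: only thue_morse is consistent with the data
```

With only three hypotheses the posterior collapses exactly rather than merely converging; that is the price of the finite stand-in, not a feature of the real construction.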
If we follow this 'algorithm', we get a nice theorem:
For all ε > 0 there is some N such that if b is an initial segment of X of length at least N, then for every finite string c: P(bc|b) > 1-ε if bc is an initial segment of X, and P(bc|b) < ε otherwise.
So in the limit, we can correctly extrapolate from what we've seen with high confidence.
Proof: Let K = {k : X = A_k}, and let p_0 = ∑_{k ∈ K} 1/2^k. Choose M ≥ -log_2(ε p_0), so that 2^{-M} ≤ ε p_0. Let b be an initial segment of X long enough that b is not an initial segment of any A_m with m ≤ M and m ∉ K (such a b exists because each of those finitely many A_m differs from X somewhere). Then if bc is an initial segment of X,
P(bc) = ∑_k 1/2^k [bc < A_k] ≥ p_0,
but on the other hand
P(b) = ∑_k 1/2^k [b < A_k] ≤ p_0 + ∑_{k > M} 1/2^k = p_0 + 2^{-M} ≤ (1+ε) p_0.
Therefore,
P(bc|b) = P(bc)/P(b) ≥ p_0 / ((1+ε) p_0) = 1/(1+ε) > 1-ε.
If instead bc is not an initial segment of X, then every A_k starting with bc has k > M and k ∉ K, so P(bc) ≤ 2^{-M} ≤ ε p_0 ≤ ε P(b), and hence P(bc|b) ≤ ε.
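In the toy sketch above (again, purely an illustration, since the finite family makes things collapse exactly rather than in the limit), the convergence is easy to watch: take X to be the Thue-Morse sequence and track the conditional probability of the true next bit as the prefix grows.

```python
# Continuing the toy sketch: X is the Thue-Morse sequence, and we track
# P(true next bit | prefix) as the prefix length grows.
X = [thue_morse(i) for i in range(12)]
for n in range(1, 10):
    b = X[:n]
    print(n, round(P(b + [X[n]]) / P(b), 3))
# Once the prefix rules out the other two generators (length 3: 0, 1, 1),
# the probability of the true next bit is exactly 1; with infinitely many
# hypotheses it would only approach 1, which is what the theorem asserts.
```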
What if the universe isn't actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is plucked at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. Repeat the same procedure as above, but instead of listing all computable sequences, list all computable probability distributions on the set of binary sequences and do a similar sort of Bayesian inference; you again get a theorem telling you that, as you make observations, your predictive distribution (conditioned on what you've seen so far) converges to the true distribution (conditioned on what you've seen so far).
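Here is the same kind of toy sketch for the stochastic version, with a few i.i.d. Bernoulli coins (again my own illustrative stand-ins) playing the role of "all computable probability distributions": the posterior weight concentrates on the distribution that actually generated the data.

```python
import random

# Toy stand-ins for "all computable probability distributions": i.i.d.
# Bernoulli coins with a few fixed biases and prior weights 1/2^k.
biases = [0.5, 0.25, 0.9]
prior_w = [2.0 ** -(k + 1) for k in range(len(biases))]

def likelihood(bias, bits):
    """P(bits | Bernoulli(bias)) for an i.i.d. coin."""
    p = 1.0
    for bit in bits:
        p *= bias if bit == 1 else 1.0 - bias
    return p

random.seed(0)
true_bias = 0.25
data = [1 if random.random() < true_bias else 0 for _ in range(200)]

for n in (0, 10, 50, 200):
    weights = [w * likelihood(q, data[:n]) for w, q in zip(prior_w, biases)]
    total = sum(weights)
    print(n, [round(w / total, 3) for w in weights])
# The posterior concentrates on the coin that generated the data, so the
# predictive probability of the next bit converges to the true conditional
# distribution, as the theorem for the stochastic case asserts.
```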
This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What's the advantage of Occam's razor? After all, we could skip the first two steps completely, start with an arbitrary list of all computable sequences and an arbitrary prior supported on the set of computable sequences, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a convex combination of every computable probability distribution, but again, we don't need Kolmogorov complexity at all.)
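To make that point concrete in the toy model from before (reusing the three generators defined there): shuffle the list and replace the 1/2^k weights with any positive weights that have nothing to do with complexity, and the same convergence shows up.

```python
# Same toy experiment, but with the generators in an arbitrary order and
# prior weights chosen with no regard to complexity.
shuffled = [thue_morse, all_zeros, alternating]
arbitrary_prior = [0.10, 0.85, 0.05]

def P_arbitrary(b):
    return sum(p for p, g in zip(arbitrary_prior, shuffled)
               if all(g(i) == bit for i, bit in enumerate(b)))

b = [thue_morse(i) for i in range(4)]
print(P_arbitrary(b + [thue_morse(4)]) / P_arbitrary(b))   # still 1.0
```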
Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don't get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam's razor, if the part of Solomonoff induction that corresponds to Occam's razor is not necessary for Solomonoff induction to work? Occam's razor seems to work well in practice, but I don't see how it's in any way necessary (or even useful!) for the math.
This kind of depends on what you mean by “big difference” and “complicated enough”.
For example, I can see humanity not yet having detected that spontaneous fission isn't random but is instead decided by a really good pseudo-random number generator.
One might also imagine a ridiculously complicated program that makes elementary particles appear to behave randomly, except that when a huge number of such particles are “assembled” in a strange loop of sufficient complexity to be considered “a human-level self-aware entity”, the pseudo-random micro-behavior leads to powerful effects at the macro-scale. Pretty much anything about what brains do that we don’t yet understand in detail could be hypothesised to arise from this. (I.e., the gods jiggle the dice to make humans, e.g., risk-averse. Or send hurricanes towards groups of people who aren’t capitalist enough, whatever.)
Of course, such hypotheses are very complex and we’d usually say that it’s unreasonable to contemplate them. But using Occam’s razor would be circular in this particular thread.
There are infinitely many possible realities that look simple to program but aren't, but most complicated realities don't look nearly this simple. The Copenhagen interpretation looks the same as Many Worlds, but that's because it was discovered while trying to figure out this particular reality, not because most possible realities look like that.