jockocampbell comments on An Intuitive Explanation of Solomonoff Induction - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (210)
What a wonderful post!
I find it intellectually exhilarating as I have not been introduced to Solomonoff before and his work may be very informative for my studies. I have come at inference from quite a different direction and I am hopeful that an appreciation of Solomonoff will broaden my scope.
One thing that puzzles me is the assertion that:
First my understanding of the principle of maximum entropy suggests that prior probabilities are constrained only by evidence and not by the length of the hypothesis test algorithm. In fact Jaynes argues that 'Ockham's' razor is already built into Bayesian inference.
Second given that the probability is reduced by half with every bit of added algorithm length wouldn't that imply that algorithms' having 1 bit were the most likely and have a probability of 1/2. In fact I doubt if any algorithm at all is describable with 1 bit. Some comments as well as the body of the article suggest that the real accomplishment of Solomonoff's approach is to provide the set of all possible algorithms/hypothesis and that the probabilities assigned to each are not part of a probability distribution but rather are for the purposes of ranking. Why do they need to be ranked? Why not assign them all probability 1/N where N = 2^(n+1) - 2, the number of algorithms having length up to and including length n.
Clearly I am missing something important.
Could it be that ranking them by length is for the purpose of determining the sequence in which the possible hypothesis should be evaluated? When ranking hypothesis by length and then evaluating them against the evidence in sequence from shorter to longer our search will stop at the shortest possible algorithm, which by Occam's razor is the preferred algorithm.
Keep in mind that the Solomonoff induction is defined up to the choice of a prefix-free universal Turing machine. There exist some of these UTMs that can represent a valid program with 1 bit, but this isn't particularly relevant:
Each UTM leads to a different Solomonoff prior. Solomonoff induction is universal only in an asymptotic sense: as you accumulate more and more bits of evidence, the posterior distributions converge with high probability.
Probability estimates dominated by short programs are biased by the choice of the UTM. As more evidence accumulates, the likelihood of a short program being consistent with all of it decreases (intuitively, by the pigeonhole principle), and the probability estimate will be dominated by longer and longer programs, "washing out" the initial bias.
In order to normalize properly, you have to divide the raw Solomonoff measure by the Chaitin's omega of that particular UTM, that is, the probability that a randomly generated program halts.