I am confused about Solomonoff induction and some other related concepts. I would appreciate some help from anyone who understands this.
What I understand so far is:
- I understand Bayesian statistics and the problem of choosing the priors
- I understand that Kolmogorov complexity can be used to measure the complexity of a hypothesis
- I think I understand that Solomonoff induction consists of evaluating all the possible hypotheses that could generate some observed data, and weighting them according to their complexity
For example, the sequence HTHTT could be generated by different programs:
- A program that flips a fair coin
- A program that flips a coin with p=0.4
- A program that generates "exactly" the sequence HTHTT
- ... etc. Infinitely many such programs could be written here
I do understand that a program that flips a fair coin is simpler than a program that generates the exact output
What I don't understand is:
How exactly do you weigh complexity against likelihood?
In the previous example, we could say that the first program has complexity X bits while the third program has complexity Y bits, with Y > X (but only by a small difference, because we don't have that many observations). Here is where things get a bit murky for me. The "predictive" power of the first program to explain that specific output, being probabilistic, is 1/2^5, while for the third program the likelihood is 1. To prefer the first hypothesis we have to assign a prior probability at least 2^5 times higher to the first program than to the third one. Where does this 2^5 factor come from?
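Just to spell out the arithmetic I'm doing (in Python, since coin flips are easy to write down):

```python
# Likelihood of the observed sequence HTHTT under two of the programs above
p_fair_coin = (1 / 2) ** 5   # a fair coin assigns 1/2 to each of the 5 flips
p_hardcoded = 1.0            # the program that outputs exactly HTHTT

# The hardcoded program fits this particular data 2^5 = 32 times better,
# so the fair-coin program only wins if its prior is at least 2^5 times higher.
print(p_hardcoded / p_fair_coin)   # 32.0
```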
I know that it is a theoretical thing but if you can use coin flips in the explanation, that helps a lot. If you identify other places where I am confused but mistakenly thought I wasn't, please point that out.
No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned across all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn't need to use exactly this way of defining the probabilities, and I think any reasonable distribution over programs will do (at least, after enough observations).
But, while I think another distribution over programs should work, this thing with the prefix-free language is the standard way of doing it, and there are reasons it is nice.
The analogy for a normal programming language would be if no Python script were a prefix of any other Python script (which isn't true of actual Python scripts, but could be made true if they were required to end with some "end of program" string).
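To make both points concrete, here is a toy check (the bit strings are made up, not any standard encoding). For a prefix-free set, the weights 2^(-length) sum to at most 1 (this is the Kraft inequality), which is what lets them serve as an unnormalized prior:

```python
def is_prefix_free(programs):
    """True if no string in the collection is a proper prefix of another."""
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

# A made-up prefix-free set of binary "programs".
programs = ["00", "010", "011", "10", "110", "111"]

print(is_prefix_free(programs))             # True
print(sum(2 ** -len(p) for p in programs))  # 1.0 -- never exceeds 1 for a prefix-free set

# Adding a program that extends an existing one breaks prefix-freeness:
print(is_prefix_free(programs + ["0100"]))  # False: "010" is a prefix of "0100"
```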
There will be many different programs which produce the exact same output when run, and will all be considered when doing Solomonoff induction.
This may be pedantic of me, but I wouldn't call the lengths of the programs the Kolmogorov complexity of the programs. The lengths of the programs are (upper bounds on) the Kolmogorov complexity of the outputs of the programs. The Kolmogorov complexity of a program g would be the length of the shortest program which outputs the program g, not the length of g.
When you say that program C has 4 bits, is that just a value you picked, or are you obtaining that from somewhere?
Also, for a prefix-free programming language, you can't have 2^5 valid programs of length 5, and 2^6 programs of length 6, because if all possible binary strings of length 5 were valid programs, then no string of length 6 would be a valid program.
This is probably getting away from the core points though
(You could have the programming language be such that, e.g., 00XXXXX outputs the bits XXXXX, and 01XXXXXX outputs the bits XXXXXX, and other programs start with a 1, and any other program one might want to encode is somehow encoded using some such scheme.)
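As a sketch, that toy language could look like the following decoder (my own made-up helper; I'm encoding H as 1 and T as 0 so it matches the example program strings below):

```python
def run(program: str) -> str:
    """Toy decoder: 00XXXXX outputs the 5 bits XXXXX, 01XXXXXX outputs the
    6 bits XXXXXX, and everything starting with 1 is reserved for whatever
    scheme encodes the remaining programs (e.g. "flip 5 fair coins")."""
    if program.startswith("00") and len(program) == 7:
        return program[2:]
    if program.startswith("01") and len(program) == 8:
        return program[2:]
    if program.startswith("1"):
        raise NotImplementedError("reserved for some other encoding scheme")
    raise ValueError("not a valid program in this toy language")

# With H=1 and T=0, the program hardcoding HTHHT is 00 + 10110 = 0010110 (7 bits):
print(run("0010110"))   # -> "10110", i.e. HTHHT
```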
Yeah, the (non-normalized) prior for each will be:
- 2^(-(length of a program which directly encodes a 5-bit string to output)) for the programs which directly encode some 5-bit string and output it,
- 2^(-(length of a program which directly encodes a 6-bit string to output)) for the programs which directly encode some 6-bit string and output it, and
- (say) 2^(-4) for program C.
And those likelihoods you gave are all correct for those models.
So, then, the posteriors (prior to normalization) would be:
- 2^(-(length of a program which directly encodes a 5-bit string to output)) (let's say this is 2^(-7)) for the program that essentially is print("HTHHT"),
- 2^(-(length of a program which directly encodes a 6-bit string to output)) (let's say this is 2^(-8)) for the programs that essentially are print("HTHHTH") and print("HTHHTT") respectively, and
- 2^(-4) * 2^(-5) = 2^(-9) for model C.
If we want to restrict to these 4 programs, then, adding these up, we get 2^(-7) + 2^(-8) + 2^(-8) + 2^(-9) = 2^(-6) + 2^(-9) = 9 * 2^(-9), and dividing each by that total, we get the following (also spelled out in code after the list):
(4/9) chance for the program that hardcodes HTHHT (say, 0010110)
(2/9) chance for the program that hardcodes HTHHTH (say, 01101101)
(2/9) chance for the program that hardcodes HTHHTT (say, 01101100)
(1/9) chance for the program that produces a random 5-bit string (say, 1000)
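The same arithmetic in code (a minimal sketch using the illustrative program lengths from above, and restricted to just these four programs rather than the full sum over all programs):

```python
# Unnormalized posterior = prior * likelihood, with prior = 2^(-program length).
models = {
    "hardcode HTHHT":  (7, 1.0),      # 7-bit program whose output is exactly the data
    "hardcode HTHHTH": (8, 1.0),      # 8-bit program whose output starts with the data
    "hardcode HTHHTT": (8, 1.0),      # 8-bit program whose output starts with the data
    "random 5 bits":   (4, 2 ** -5),  # 4-bit program C; chance 1/2^5 of this exact string
}

unnormalized = {name: 2 ** -length * lik for name, (length, lik) in models.items()}
total = sum(unnormalized.values())    # 9 * 2^(-9)

for name, weight in unnormalized.items():
    print(name, weight / total)       # 4/9, 2/9, 2/9, 1/9 respectively
```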
So, in this situation, where we've restricted to these programs, the posterior probability distribution for "what is the next bit" would be
(4/9)+(1/9)=(5/9) chance that "there is no next bit" (this case might usually be disregarded/discarded, idk.)
(2/9) chance that the next bit is H
(2/9) chance that the next bit is T
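And the corresponding next-bit prediction, again restricted to these four programs (the posterior weights are the ones just computed):

```python
# Posterior weights from above, and what each program says about the bit after HTHHT
# (None means the program's output ends after 5 bits, so there is no next bit).
posterior = {
    "hardcode HTHHT":  (4 / 9, None),
    "hardcode HTHHTH": (2 / 9, "H"),
    "hardcode HTHHTT": (2 / 9, "T"),
    "random 5 bits":   (1 / 9, None),
}

prediction = {}
for weight, next_bit in posterior.values():
    outcome = next_bit if next_bit is not None else "no next bit"
    prediction[outcome] = prediction.get(outcome, 0.0) + weight

print(prediction)   # {'no next bit': 5/9, 'H': 2/9, 'T': 2/9}, i.e. about 0.556, 0.222, 0.222
```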