Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is (1/2)^(length of the program).
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infinite sequence of bits, the probability that the sequence has a particular program as a prefix is then (1/2)^(length of the program), because there's a factor of (1/2) for each bit of the program, and if the first (length of the program) bits match the program, it doesn't matter what the bits after them are.
(Ah, I suppose you don't strictly need to talk about infinite sequences of bits, and you could just talk about randomly picking the value of the next bit, stopping if it ever results in a valid program in the programming language..., not sure which makes it easier to think about.)
If you want this to be an actual prior, you can normalize it by dividing by (the sum over all programs g, of (1/2)^(length of g)).
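As a toy sketch of why that normalization works (the four "programs" below are made-up bit strings, not a real language), the prefix-free property is exactly what makes the (1/2)^length weights sum to at most 1 (Kraft's inequality):

```python
# Toy prefix-free "language": four made-up programs, no one a prefix of another.
programs = ["0", "10", "110", "111"]

# Unnormalized prior weight for each program: (1/2)^length.
weights = {g: 0.5 ** len(g) for g in programs}

# Prefix-freeness guarantees the weights sum to at most 1 (Kraft's inequality),
# so dividing by the sum gives a proper probability distribution.
total = sum(weights.values())
prior = {g: w / total for g, w in weights.items()}
```

Here the four weights are 1/2, 1/4, 1/8, 1/8, which happen to sum to exactly 1; for a general prefix-free language the sum can be strictly less than 1, which is why the normalization step is needed.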
The usual way of defining Solomonoff induction, I believe, has the programs being deterministic, but I've read that allowing them to use randomness gives equivalent results, and may make some things conceptually easier.
So, I'll make some educated guesses about how to incorporate the programs having random behavior.
Let G be the random variable for "which program gets selected", and g be used to refer to any potential particular value for that variable. (I avoided using p for program because I want to use it for probability. The choice of the letter g was arbitrary.)
Let O be the random variable for the output observed, and let o be used for any particular value for that variable.
And, given a program g, the idea of P(O=o|G=g) makes sense, and P(G=g) is proportional to (1/2)^(length of g) (missing a normalization constant, but this will be the same across all g).
And, P(O=o) will also be a normalization constant that is the same across all g.
And so, if you can compute the values P(O=o|G=g) (though the program g may take too long to run in practice), you can compute values proportional to P(G=g|O=o).
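A minimal sketch of that computation, with two made-up hypotheses (the bit-lengths 12 and 7 are invented for illustration); each "program" is represented by its length and a function giving P(O=o|G=g) directly, rather than by actually running code:

```python
# Two made-up hypotheses: (length in bits, function computing P(O=o | G=g)).
hypotheses = {
    "prints_HTHTT": (12, lambda o: 1.0 if o == "HTHTT" else 0.0),  # deterministic
    "fair_coin":    (7,  lambda o: 0.5 ** len(o)),                 # random flips
}

o = "HTHTT"

# Unnormalized posterior: P(G=g | O=o) proportional to (1/2)^len(g) * P(O=o | G=g).
unnorm = {g: 0.5 ** length * lik(o) for g, (length, lik) in hypotheses.items()}
posterior = {g: u / sum(unnorm.values()) for g, u in unnorm.items()}
```

With these particular lengths the two hypotheses tie (posterior 1/2 each): the deterministic program pays 5 extra bits of description length, while the coin program pays a factor of (1/2)^5 in likelihood.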
Does that help any?
(apologies if this should be a "comment" rather than an "answer". Hoping it suffices.)
The part I understood is that you weight the programs based on their length in bits: the longer the program, the less weight it has. This makes total sense.
I am not sure that I understand the prefix thing, and I think that's relevant. For example, it is not clear to me whether, once I consider a program that outputs 0101, I should simply ignore other programs that output the same thing plus one bit (e.g. 01010).
What I also still find fuzzy (though at least now I can put my finger on it) is the part where Solomonoff induction is extended to deal with randomness.
Let me see if ...
The concept of Kolmogorov Sufficient Statistic might be the missing piece. (cf. *Elements of Information Theory*, section 14.12)
We want the shortest program that describes a sequence of bits. A particularly interpretable type of such program is: "the sequence is in the set X generated by program p, and among those it is the n'th element".
Example "the sequence is in the set of sequences of length 1000 with 104 ones, generated by (insert program here), of which it is the n~10^144'th element".
We therefore define f(String, n) to be the size of the smallest set containing String which is generated by a program of length n. (Or alternatively where a program of length n can test for membership of the set)
If you plot the logarithm of f(String, n) against n, you will often see bands where the line has slope -1, corresponding to using each extra bit to hardcode one more bit of the index. In that regime the longer programs aren't describing any more structure than the program where the slope started being -1. We call such a program a Kolmogorov minimal statistic.
The relevance is that for a biased coin with each flip independent the Kolmogorov minimal statistic is the bias. And it is often more natural to think about the Kolmogorov minimal statistics.
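As a quick sanity check on the numbers in the example above (this computation is mine, not part of the original comment): the index part of that two-part description alone needs about log2 C(1000, 104) ≈ 477 bits, which matches an index of n ~ 10^144:

```python
import math

# Size of the set in the example: binary strings of length 1000 with exactly 104 ones.
set_size = math.comb(1000, 104)

# Bits needed for the "index into the set" part of the two-part description.
index_bits = math.log2(set_size)  # roughly 477 bits, so set_size is on the order of 10^144
```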
> I do understand that a program that flips a fair coin is simpler than a program that generates the exact output
So this is not quite right. Hypotheses in "vanilla" Solomonoff induction are deterministic. When there's noise, a better intuition might be that there are lots of hypotheses that correspond to different settings of some "noise variables," and there's no clear pattern telling you which of these many different hypotheses is better. The outcome of the noise just looks, to Solomonoff induction, like a surprising fact it didn't know before.
Heavily biased coins are simple to predict - i.e. it doesn't take many bits (on average) to compress the sequences generated by them. It's a fair coin that takes lots of bits to write the sequence for, because each flip is a maximally surprising new fact.
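That compression claim can be made quantitative: the expected number of bits per flip is the Shannon entropy H(p) = -p·log2(p) - (1-p)·log2(1-p). A quick sketch (mine, not part of the original comment):

```python
import math

def bits_per_flip(p):
    """Expected bits per flip to encode a coin with P(heads) = p (Shannon entropy)."""
    if p in (0.0, 1.0):
        return 0.0  # fully deterministic: nothing left to encode
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

`bits_per_flip(0.5)` is exactly 1 bit (each flip is maximally surprising), while `bits_per_flip(0.9)` is about 0.47 bits: the biased coin's sequences compress to less than half the length.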
> To prefer the first hypothesis we have to assign a prior probability at least 2^5 times higher to the first program than to the second one. Where does this 2^5 factor come from?
Yup, exactly right. That's just the structure of the Solomonoff prior - if a pattern takes 5 more bits to specify, then it's a factor of 2^5 less likely (before evidence).
I think this is pointing to what I don't understand: how do you account for hypotheses that explain data generated randomly? How do you compare a hypothesis which is a random number generator with some parameters against a hypothesis which has some deterministic component?
Is there a way to understand this without reading the original paper (which will probably take me quite long)?
When you understood this, how was your personal process that took you from knowing about probabilities and likelihood to understanding Solomonoff induction? Did you have to read the original sources or you found some good explanations somewhere?
I also don't get whether this is a calculation that you can do in a single step or a continuous thing. In other words, does Solomonoff induction work only if we assume that we keep observing new data?
Sorry for the stupid questions, as you can see, I am confused.
> how do you account for hypotheses that explain data generated randomly?
"Randomness," in this way of thinking, isn't a property of hypotheses. There is no one hypothesis that is "the random hypothesis." Randomness is just what happens when the outcome depends sensitively on things you don't know.
> When you understood this, how was your personal process that took you from knowing about probabilities and likelihood to understanding Solomonoff induction? Did you have to read the original sources or you found some good explanations somewhere?
I mean, there are explanations of Solomonoff induction on this site that are fine, but for actually getting a deep understanding you've probably gotta do stuff like read *An Introduction to Kolmogorov Complexity and Its Applications* by Li and Vitanyi.
I think that re-reading your answer again made something click. So thanks for that.
The observed data is not **random**, because randomness is not a property of the data itself.
The hypotheses that we want to evaluate are not random either, because we are analysing Turing machines that generate those data deterministically.
If the data is HTHTHT, we do not test a Python script that is doing:

```python
import random

random.choices(["H", "T"], k=6)
```
What we test instead is something more like:

```python
["H"] + ["T"] + ["H"] + ["T"] + ["H"] + ["T"]
```
And:

```python
["H", "T"] * 3
```
In this case, the last script is simpler and, for that reason, receives a higher prior.
If we apply this in a Bayesian setting, the likelihood of all these hypotheses is necessarily 1, so the posterior probability just becomes the prior (divided by some factor), which decreases exponentially with the length of the program. This makes sense because it is in agreement with Occam's razor.
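A crude sketch of that comparison, using the length of the Python source string as a stand-in for program length in bits (a proxy only; real Solomonoff induction counts bits of input to a universal machine):

```python
# Two deterministic "programs" that generate the same data HTHTHT, written as
# Python expressions; the length of the source string is used here as a crude,
# hypothetical proxy for program length in bits.
explicit = '["H"] + ["T"] + ["H"] + ["T"] + ["H"] + ["T"]'
compressed = '["H", "T"] * 3'

same_output = eval(explicit) == eval(compressed)  # both yield the same 6 flips
shorter = len(compressed) < len(explicit)         # but one description is shorter
```

Both likelihoods are 1, so the posterior comparison reduces to the prior, i.e. to description length: the shorter program wins.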
The thing I still struggle to see is how I connect this framework with probabilistic hypotheses that I want to test, such as "the data was generated by a fair coin". One possibility that I see (but I am not sure it is the correct thing) is testing all the possible strings generated by an algorithm like this:
```python
import random

i = 0
while True:
    random.seed(i)
    flips = random.choices(["H", "T"], k=6)
    # ... compare flips against the observed data ...
    i += 1
```
The likelihood of strings like HHTHTH is 0, so we remove them, and we are left only with the algorithms that are consistent with the data.
Not totally sure of the last part
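That seed-enumeration idea can be sketched directly (the 10,000-seed cutoff is an arbitrary choice for illustration): each seed defines a deterministic program, and the seeds whose output differs from the data have likelihood 0 and drop out:

```python
import random

observed = ["H", "T", "H", "T", "H", "T"]

# Each seed i defines a *deterministic* program: "seed the PRNG with i, then
# emit 6 flips". Enumerating seeds turns the fair-coin idea into a family of
# deterministic hypotheses.
consistent = []
for i in range(10_000):
    random.seed(i)
    if random.choices(["H", "T"], k=6) == observed:
        consistent.append(i)

# The survivors are exactly the programs consistent with the data
# (about 1 in 2^6 = 64 of the seeds, on average).
```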
Not sure if this helps, but:
A program that generates "exactly" the sequence HTHTT
Alongside this program there is a program that generates HHHHH, and HHHHT, and HHHTT, etc. - 2^5 such programs, and before seeing evidence HTHTT is just one of those, not standing out in any way. (But I don't know how specifically Solomonoff induction accounts for it, if that is your question.)
Yes, this is something I can see easily, but I am not sure how Solomonoff induction accounts for that
I am confused about Solomonoff induction and some other related concepts. I'd appreciate some help from anyone who understands this.
What I understand so far is:
For example, the sequence HTHTT could be generated by different programs:
I do understand that a program that flips a fair coin is simpler than a program that generates the exact output
What I don't understand is:
How exactly do you weight complexity against likelihood?
In the previous example, we could say that the first program has complexity X bits while the third program has complexity Y, with Y > X (but only by a small difference, because we don't have that many observations). Here is where things get a bit murky for me. The "predictive" power of the first program to explain that specific output, being probabilistic, is 1/2^5, while for the second program the likelihood is 1. To prefer the first hypothesis we have to assign a prior probability at least 2^5 times higher to the first program than to the second one. Where does this 2^5 factor come from?
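One way to see where the factor comes from is to write out the posterior odds with made-up program lengths X and Y: the coin program's unnormalized posterior is 2^-X · (1/2)^5 and the exact program's is 2^-Y · 1, so the coin wins exactly when Y - X > 5. A sketch:

```python
def posterior_odds(x_bits, y_bits, n_flips=5):
    """Odds of "fair coin, x_bits long" vs "prints the exact sequence, y_bits long".

    Prior: 2^-length. Likelihood of the observed n-flip sequence: (1/2)^n for
    the coin program, 1 for the deterministic program. (x_bits and y_bits are
    hypothetical lengths, chosen for illustration.)
    """
    coin = 2.0 ** (-x_bits) * 0.5 ** n_flips
    exact = 2.0 ** (-y_bits)
    return coin / exact
```

For example, `posterior_odds(10, 14)` is 0.5 (the exact program wins), while `posterior_odds(10, 16)` is 2.0 (the coin wins): the coin hypothesis is preferred exactly when it is more than n_flips bits shorter, which is where the 2^5 factor comes from.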
I know that it is a theoretical thing but if you can use coin flips in the explanation, that helps a lot. If you identify other places where I am confused but mistakenly thought I wasn't, please point that out.