What counts as a hypothesis for Solomonoff induction? The general impression I’ve gotten in various places is “a hypothesis can be anything (that you could write down)”. But I don’t think that’s quite it. E.g. evidence can be written down but is treated separately. I think a hypothesis is more like a computer program that outputs predictions about what evidence will or will not be observed.
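To make that concrete, here’s a minimal sketch (my own illustration in Python; the names and setup are invented, not anything standard from the Solomonoff literature) of a hypothesis as a program that outputs a prediction stream, with evidence kept separate as the thing hypotheses get checked against:

```python
# A hedged sketch: a "hypothesis" as a program that outputs a stream
# of predicted observations. All names here are my own illustration.
from itertools import count, islice

def squares_hypothesis():
    """A hypothesis predicting the observation stream 0, 1, 4, 9, ..."""
    return (n * n for n in count())

# Evidence is treated separately: it's what hypotheses are checked against.
def consistent_with(hypothesis, observed):
    predicted = list(islice(hypothesis(), len(observed)))
    return predicted == observed

print(consistent_with(squares_hypothesis, [0, 1, 4, 9]))  # True
print(consistent_with(squares_hypothesis, [2, 3, 5, 7]))  # False
```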
If X and Y are hypotheses, then is “X and Y” a hypothesis? “not X”? “X or Y”? If not, why not, and where can I read a clear explanation of the rules and exclusions for Solomonoff hypotheses?
If using logic operators with hypotheses does yield other hypotheses, then I’m curious about a potential problem. When hypotheses are related, we can consider what their probabilities should be in more than one way. The results should always be consistent.
For example, suppose you have no evidence yet, and suppose X and Y are independent. Then you can calculate P(X or Y) in terms of P(X) and P(Y). You can also calculate the probability of all three directly from their lengths (that’s the Solomonoff prior). These should always match, but I don’t think they do.
The non-normalized prior probability of X is 1/2^len(X), where len(X) is the length of X in bits.
So you get:
P(X or Y) = 1/2^len(X) + 1/2^len(Y) - 1/2^(len(X)+len(Y))
and we also know:
P(X or Y) = 1/2^len(X or Y)
Since the left-hand sides are the same, the right-hand sides must be equal, by simple substitution:
1/2^len(X or Y) = 1/2^len(X) + 1/2^len(Y) - 1/2^(len(X)+len(Y))
Which has to hold for any X and Y.
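One way to see the problem concretely: given len(X) and len(Y), the equation pins down what len(X or Y) would have to be. A quick check (my own Python sketch; the lengths are arbitrary stand-ins):

```python
import math

def prior(length_bits):
    """Non-normalized Solomonoff-style prior: 1 / 2^length."""
    return 2.0 ** -length_bits

len_x, len_y = 20, 30  # arbitrary stand-in bit lengths

# P(X or Y) computed from the probability rules:
p_x_or_y = prior(len_x) + prior(len_y) - prior(len_x + len_y)

# The length that "X or Y" would need for 1/2^len to match that value:
required_len = -math.log2(p_x_or_y)
print(required_len)  # ~19.9986: shorter than len(X) itself
```

So “X or Y” would have to be slightly shorter than X alone, which seems impossible if the program for “X or Y” literally contains the program for X.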
We can select X and Y to be the same length and to minimize compression gains when they’re both present, so len(X or Y) should be approximately 2len(X). (I’m assuming a basis, or choice of X and Y, such that “or” is very cheap relative to X and Y, hence I approximate its cost as zero.) Then we have:
1/2^(2len(X)) = 1/2^len(X) + 1/2^len(X) - 1/2^(2len(X))
which simplifies to:
1/2^(2len(X)) = 1/2^len(X)
Which is false (since len(X) isn’t 0). And using a different approximation of len(X or Y), like 1.5len(X), 2.5len(X), or even len(X), wouldn’t make the math work.
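A brute check of that claim (again my own Python, with arbitrary lengths):

```python
# The equation 1/2^(c*len(X)) = 2/2^len(X) - 1/2^(2len(X)) would need
# some constant c to work at every length L. None does.
def lhs(c, L):
    return 2.0 ** -(c * L)

def rhs(L):
    return 2.0 * 2.0 ** -L - 2.0 ** -(2 * L)

for c in (1.0, 1.5, 2.0, 2.5):
    ratios = [lhs(c, L) / rhs(L) for L in (8, 16, 32)]
    print(c, [round(r, 6) for r in ratios])
    # a working c would give ratios of 1.0 at every L; none comes close
```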
So Solomonoff induction is inconsistent. So I assume there’s something I don’t know. What? (My best guess so far, mentioned above, is limits on what counts as a hypothesis.)
Also, here’s a quick intuitive explanation of what’s going on with the math: X is both shorter and less probable than “X or Y”. Think about what you’re doing when you craft a hypothesis. You can add bits (length) to a hypothesis to exclude stuff. In that case, more bits means lower prior probability, and that makes sense, because the hypothesis is compatible with fewer things from the set of all logically possible things. But you can also add bits to a hypothesis to add alternatives: it could be this or that or a third thing. That makes the hypothesis longer but more likely rather than less likely.

Speaking more generally, the Solomonoff prior probabilities are assigned according to length with no regard for consistency amongst themselves, so it’s unsurprising that they’re inconsistent unless the hypotheses are limited in such a way that they have no significant relationships with each other that would have to be consistent. That sounds hard to achieve, and I haven’t seen any rules specified for achieving it. (Note that there are other ways to find relationships between hypotheses besides the one I used above, e.g. looking for subsets.)
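As a toy illustration of those two ways of spending bits (my own example; the sets are arbitrary stand-ins):

```python
# Toy illustration: extra bits can widen a hypothesis instead of
# narrowing it. The sets here are arbitrary stand-ins.
primes_under_10 = {2, 3, 5, 7}
squares_under_10 = {0, 1, 4, 9}

def hyp_x(first_obs):       # "the first observation is prime"
    return first_obs in primes_under_10

def hyp_x_or_y(first_obs):  # longer to write down, yet a weaker claim
    return first_obs in primes_under_10 | squares_under_10

# "X or Y" admits strictly more outcomes than X, so any coherent prior
# must give it at least as much probability, despite its greater length.
print(sum(hyp_x(n) for n in range(10)))       # 4
print(sum(hyp_x_or_y(n) for n in range(10)))  # 8
```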
"That is, you could say something like "It's the list of all primes OR the list of all squares. Compressed data: first number is zero""
Just to clarify here (because it took me a couple of seconds): you only need the first number of the compressed data because that’s sufficient to distinguish whether you have the list of primes or the list of squares (the squares start with 0; the primes start with 2). But as Pongo said, you could describe that same list in a much more compressed way by skipping the irrelevant half of the OR statement.
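Here’s a hedged sketch of the quoted decoding scheme (my own Python rendering; the function names are invented):

```python
from itertools import count, islice

def squares():
    return (n * n for n in count())  # 0, 1, 4, 9, 16, ...

def primes():
    def gen():
        found = []
        for n in count(2):
            if all(n % p for p in found):
                found.append(n)
                yield n
    return gen()  # 2, 3, 5, 7, 11, ...

def decode(first_number):
    """The squares start with 0 and the primes start with 2, so one
    number of 'compressed data' is enough to pick a branch of the OR."""
    return squares() if first_number == 0 else primes()

print(list(islice(decode(0), 5)))  # [0, 1, 4, 9, 16]
print(list(islice(decode(2), 5)))  # [2, 3, 5, 7, 11]
```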