So, I've been hearing a lot about the awesomeness of Solomonoff induction, at least as a theoretical framework. However, my admittedly limited understanding of Solomonoff induction suggests that it would form an epically bad hypothesis if given a random string. So my question is: if I misunderstood, how does it deal with randomness? And if I understood correctly, isn't this a rather large gap?
Edit: Thanks for all the comments! My new understanding is that Solomonoff induction can recognize that it is dealing with a random string (because it finds itself giving equal weight to programs that output a 1 or a 0 for the next bit), but it is nevertheless designed to keep looking for a pattern forever. While this is a horrible use of resources, an SI is a theoretical framework with infinite resources, so that criticism is moot. Overall this seems acceptable, though if you want to actually implement an SI you'll need to tell it when to give up. Furthermore, an SI will not include randomness as a distinct function in its hypothesis, which could lead to improper weighting of priors, but it will still have good predictive power -- and considering that Solomonoff induction was only meant for computable functions, this is a pretty good result.
A truly, perfectly random string isn't really interesting, because nothing predicts it better than chance. So I'm going to talk about biased randomness.
I'm also going to assume we're working with the "outputs a single prediction" variant of Solomonoff Inductors, instead of the ones that output probability distributions (where the answer is pretty clearly "yes, they deal with it"). Oh, and I should note that Solomonoff Inductors are hilariously intractable, and unfortunately Cartesian in their workings.
With that aside...
Suppose you generate a random string by rolling a die and recording "Off" when it comes up 1, and "On" when it comes up 2, 3, 4, 5, or 6. How well will a Solomonoff Inductor predict this sequence?
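To make the setup concrete, here's a minimal sketch of that source in Python (the function name and seed are just illustrative):

```python
import random

# The biased source from above: roll a fair die, record "Off" on a 1 and
# "On" on anything else, so P(On) = 5/6 and P(Off) = 1/6.
def roll_sequence(n, seed=0):
    rng = random.Random(seed)
    return ["Off" if rng.randint(1, 6) == 1 else "On" for _ in range(n)]

print(roll_sequence(20))
```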
The thing to notice is that although the input is not algorithmic, it is compressible. Because "On" is so much more likely than "Off", we can save space by using shorter encodings for strings like "On On On" than for "Off Off Off". In fact, by using Arithmetic Coding, we can re-encode our sequences to use only about 65% as many On/Off characters on average, since the Shannon entropy of this source is about 0.65 bits per symbol.
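That 65% figure is just the entropy of the source; a quick check:

```python
from math import log2

# Shannon entropy of a Bernoulli(5/6) source, in bits per symbol. An ideal
# arithmetic coder approaches this rate, which is where "65% as many
# characters" comes from.
p_on = 5 / 6
h = -(p_on * log2(p_on) + (1 - p_on) * log2(1 - p_on))
print(f"{h:.3f} bits/symbol")  # ~0.650
```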
When you do Solomonoff Induction, the next prediction is dominated by the shortest programs that haven't been eliminated. What will these programs look like in the case of our biased sequence? They will be a compressed encoding of the sequence, alongside a decoder. The encoding schemes that best match the input will dominate the predictions, and these will be the ones that model it as a biased random sequence and assume the right probability (like a well-tuned arithmetic coder).
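A real inductor enumerates all programs, which we obviously can't do, but here's a toy stand-in under a big simplifying assumption: each "program" is just a Bernoulli model with some bias theta, and all get equal prior weight. Even this crude mixture shows the best-matching models coming to dominate the prediction:

```python
import random

# Toy mixture: each "program" is a Bernoulli model with bias theta (a crude
# stand-in for the 2^-length prior over actual programs).
thetas = [i / 20 for i in range(1, 20)]        # candidate biases 0.05..0.95
weights = [1.0 / len(thetas)] * len(thetas)    # uniform prior (simplification)

rng = random.Random(0)
for _ in range(500):
    bit = 1 if rng.random() < 5 / 6 else 0     # the true source: P(1) = 5/6
    # Bayes rule: each model is reweighted by its likelihood for this bit.
    weights = [w * (t if bit else 1 - t) for w, t in zip(weights, thetas)]
    total = sum(weights)
    weights = [w / total for w in weights]

# The posterior-weighted prediction for the next bit converges toward 5/6.
print(sum(w * t for w, t in zip(weights, thetas)))  # ~0.83
```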
What do you get if you feed an arithmetic decoder perfectly random bits whenever it needs more input, and look at its next output? You get a biased random value. In our case it will be "On" about 5/6 of the time, and "Off" about 1/6 of the time.
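Here's a bare-bones sketch of that (a real arithmetic decoder manages its bit stream more carefully; this version just narrows an interval around a uniform random value and crudely resamples when precision runs out):

```python
import random

# Arithmetic-decoder sketch tuned for P(On) = 5/6. The "code value" stands
# in for an infinite stream of perfectly random input bits.
def decode_symbols(n_symbols, rng):
    low, high = 0.0, 1.0
    value = rng.random()
    out = []
    for _ in range(n_symbols):
        split = low + (high - low) * (5 / 6)   # first 5/6 of interval = "On"
        if value < split:
            out.append("On")
            high = split
        else:
            out.append("Off")
            low = split
        if high - low < 1e-9:                  # out of float precision:
            low, high = 0.0, 1.0               # restart with fresh random bits
            value = rng.random()
    return out

sample = decode_symbols(60000, random.Random(0))
print(sample.count("On") / len(sample))        # ~0.833, i.e. about 5/6
```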
So, when the Solomonoff Inductor looks at its programs' next outputs to make a prediction... a whole bunch of the shortest surviving programs are outputting "On" 5/6 of the time and "Off" 1/6 of the time, which pushes the prediction toward the right probabilities. Instead of a single program determining the prediction, we get a huge group of programs working together to push it in the right direction.
There will still be skew in the results. It's not hard to come up with program encodings that strongly favor "Off", for example. Nevertheless, there is at least some built-in functionality for dealing with randomness.
What do you mean? If you feed SI a stream of i.i.d. random bits which are 1 with probability 5/6 and 0 with probability 1/6, then SI's beliefs about the next bit will converge to the true distribution pretty fast, no matter what encoding of programs is used. The easiest way to see that is by noticing that the true distribution is computable (which just means the probability of any bit string is computable from that bit string), so it can't beat SI's log score by more than an additive constant in the long run (there's a general result saying that).
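For reference, the general result being alluded to is (I believe) the dominance property of the Solomonoff prior M over computable measures, which in symbols looks like:

```latex
% For every computable measure \mu there is a constant c_\mu (roughly the
% length of the shortest program computing \mu) such that, for every finite
% bit string x,
M(x) \ge 2^{-c_\mu}\,\mu(x),
% and hence, taking logs,
-\log_2 M(x) \le -\log_2 \mu(x) + c_\mu .
% The gap is a fixed constant, so the per-bit difference in log score
% vanishes as the sequence grows.
```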