
cousin_it comments on More and Less than Solomonoff Induction - Less Wrong Discussion

4 Post author: Manfred 21 May 2014 04:55AM




Comment author: cousin_it 21 May 2014 02:19:58PM *  1 point [-]

Solomonoff induction can handle that.

Comment author: private_messaging 21 May 2014 05:46:02PM 0 points [-]

Yeah, though the shortest hypothesis that works keeps growing in length, which is really bad if you want to compute something similar in practice.

Comment author: cousin_it 21 May 2014 08:27:25PM *  1 point [-]

That depends on what you mean by "shortest hypothesis". You can view Solomonoff induction as a mixture of all deterministic programs or, equivalently, a mixture of all computable distributions. If a "hypothesis" is a deterministic program, it will keep growing in length when given random bits. On the other hand, if a "hypothesis" is a computable distribution, its length might well stay bounded and not very large.
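A toy Python sketch of the distinction (my own illustration, with made-up function names, not part of the thread): for data from a fair coin, a deterministic-program "hypothesis" must literally contain the observed bits, so its length grows with the data, while the distribution "hypothesis" P(S) = 2^-|S| is a single fixed-size program.

```python
def coin_distribution(s: str) -> float:
    """Computable-distribution hypothesis: assigns probability 2^-|s|
    to the bit string s. Its description length is constant."""
    return 2.0 ** -len(s)

def deterministic_hypothesis(observed: str) -> str:
    """The shortest deterministic program reproducing random-looking data
    is essentially 'print the data', so its length grows with the data."""
    return f"print({observed!r})"

data = "0110101100"
assert coin_distribution(data) == 2.0 ** -10
assert len(deterministic_hypothesis(data)) > len(data)  # grows with the data
```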

Comment author: private_messaging 22 May 2014 04:01:00AM *  -1 points [-]

Well, it'd be an utter mess in which you can't separate the "data" bits (the ones that are transformed into the final probability distribution, if we assume they were random) from the "code" bits (the ones that do the transformation). Indeed, given that the universe itself can run computations, all "data" bits are also code.

Comment author: cousin_it 22 May 2014 02:40:09PM *  1 point [-]

If a "hypothesis" is a computable distribution, it doesn't need to include the random bits. (A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.)

Comment author: private_messaging 23 May 2014 05:53:06AM *  1 point [-]

> A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.

It's got to be a little more than that, for the probabilities of different bit strings to add up to 1. The obvious way to do that without having to sum all the "probabilities" and renormalize is to map an infinitely long, uniformly distributed bit string onto the strings whose probabilities you're defining.
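A minimal sketch of that mapping (a toy decoder of my own, not from the comment): the program consumes uniform random bits until the first 1 and outputs the count of zeros, so output n gets probability 2^-(n+1), and normalization comes for free from the measure on the random bits.

```python
from fractions import Fraction

def decode(bits):
    """Toy 'code' part: consume uniform random bits until the first 1,
    output the number of zeros seen. This maps an infinite uniform bit
    string to an integer, inducing a geometric distribution."""
    n = 0
    for b in bits:
        if b == 1:
            return n
        n += 1

assert decode([0, 0, 1, 0]) == 2

# Output n has probability 2^-(n+1): the measure of uniform bit strings
# starting with n zeros and then a one. These sum toward 1 automatically,
# with no explicit renormalization step.
probs = [Fraction(1, 2 ** (n + 1)) for n in range(30)]
assert sum(probs) == 1 - Fraction(1, 2 ** 30)
```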

In any case, you're straying well away from what S.I. is defined to be.

Comment author: cousin_it 23 May 2014 10:29:10AM *  0 points [-]

I guess I left out too many details. Let's say a "hypothesis" is a program that accepts a bit string S and prints a sequence of non-negative rational numbers. For each S, that sequence has to be non-decreasing and converge to a limit P(S). The sum of P(S) over all S has to be between 0 (exclusive) and 1 (inclusive). Such an object is called a "lower-semicomputable semimeasure". SI is a universal lower-semicomputable semimeasure, which means it dominates any other up to a multiplicative constant. See page 8 of this paper for more on which classes of (semi-)measures have or don't have a universal element.
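The definition above can be sketched concretely (a toy target measure of my own choosing, for illustration only): a "hypothesis" prints non-decreasing rational lower bounds converging to P(S), and summing P(S) over all bit strings stays within (0, 1].

```python
from fractions import Fraction

def P(s: str) -> Fraction:
    """Toy target: P(s) = 2^-(2|s|+1). Summed over all bit strings this
    gives sum_n 2^n * 2^-(2n+1) = 1, so it is in fact a measure."""
    return Fraction(1, 2 ** (2 * len(s) + 1))

def approximations(s: str, steps: int):
    """The 'hypothesis' program: emit non-decreasing rational lower bounds
    converging to P(s), as in the definition of lower semicomputability."""
    return [P(s) * (1 - Fraction(1, 2 ** k)) for k in range(1, steps + 1)]

seq = approximations("0110", 20)
assert all(a <= b for a, b in zip(seq, seq[1:]))   # non-decreasing
assert seq[-1] < P("0110")                         # lower bounds from below
assert P("0110") - seq[-1] < Fraction(1, 10 ** 5)  # converging to the limit
```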

Comment author: private_messaging 23 May 2014 12:18:40PM *  1 point [-]

Yes, SI is such a measure, but the individual programs within SI as usually defined are not. You aren't feeding the input data into the programs in SI; you match program outputs to the input data.

Comment author: cousin_it 25 May 2014 02:16:17PM *  1 point [-]

I think you can give an equivalent definition of SI in terms of programs that consume input data and print rational numbers. Or at least, I don't see any serious obstacles to doing so. It will also be expensive to approximate, and you'll need to renormalize as you go (and do other things, like converting all sequences of rationals to non-decreasing sequences), but at least you won't have the problem that "hypotheses" grow without bound.

Comment author: private_messaging 26 May 2014 12:47:40PM *  1 point [-]

That might be useful in some contexts, yes. But you'll still have both the program that is longer but gives probability 1 to the observations so far, and a program that is shorter but gives a much, much smaller probability (on the order of 2^-L, where L is the length difference).
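A quick arithmetic check of that trade-off (the lengths K and L are illustrative numbers of my own): under the usual 2^-length prior, the two programs end up with exactly the same total weight, since 2^-(K+L) · 1 = 2^-K · 2^-L.

```python
from fractions import Fraction

K, L = 50, 30  # illustrative: shorter program length K, length difference L

# Longer program (length K+L): assigns probability 1 to the data so far.
longer = Fraction(1, 2 ** (K + L)) * 1            # prior * likelihood
# Shorter program (length K): assigns probability 2^-L to the same data.
shorter = Fraction(1, 2 ** K) * Fraction(1, 2 ** L)

assert longer == shorter  # the prior penalty and the likelihood penalty cancel
```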

I think the hypotheses in physics correspond not to individual programs in SI, but to common prefixes. For example, QM would describe a program that converts subsequent random bits into predicted observations with the correct probability distribution. You could treat a prefix as a probability distribution over outputs: the distribution that results from prepending that prefix to random bits. The bonus is that very simple operations on the random bits correspond to high-level mathematical functions that are otherwise difficult to write in any low-level TM-like representation.
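One way to picture a prefix as a distribution over outputs (a hypothetical stand-in of my own, far simpler than anything physics-like): fix a prefix that applies some simple operation to the random bits that follow it, and read off the induced output distribution by sampling random suffixes.

```python
import random

def prefix_program(random_bits):
    """Hypothetical stand-in for a fixed prefix: turns a stream of uniform
    random bits into an 'observation' (here, the XOR of three bits)."""
    return random_bits[0] ^ random_bits[1] ^ random_bits[2]

random.seed(0)
samples = [prefix_program([random.getrandbits(1) for _ in range(3)])
           for _ in range(100_000)]

# XOR of fair bits is fair, so this prefix induces the 50/50 distribution
# over outputs; the empirical frequency should sit close to 1/2.
freq1 = sum(samples) / len(samples)
assert abs(freq1 - 0.5) < 0.01
```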