Approximating Solomonoff Induction

Houshalter

29th May 2015

Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely each one is to have produced the data. It then weights each program's predictions by that probability, with shorter programs receiving more prior weight.
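To make the weighting concrete, here is a minimal sketch under heavy assumptions. The "programming language" is a toy one I made up for illustration: a program is just a bit pattern that gets repeated forever. Real Solomonoff Induction ranges over all programs for a universal machine, which is exactly what makes it uncomputable. The names `programs`, `run`, and `predict_next`, the length cap `max_len`, and the unnormalized 2^-length prior are all hypothetical choices.

```python
from itertools import product

def programs(max_len):
    """Yield every 'program' in the toy language: a bit pattern to repeat forever."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def run(program, length):
    """Output of a toy program: its pattern repeated out to `length` symbols."""
    reps = length // len(program) + 1
    return (program * reps)[:length]

def predict_next(data, max_len=8):
    """Probability that the next bit is '1', mixing all consistent programs."""
    weight = {"0": 0.0, "1": 0.0}
    for p in programs(max_len):
        out = run(p, len(data) + 1)
        if out[:len(data)] == data:            # likelihood 1 if consistent, else 0
            weight[out[-1]] += 2.0 ** -len(p)  # assumed prior: shorter = heavier
    total = weight["0"] + weight["1"]
    if total == 0.0:                           # no program explains the data
        return 0.5
    return weight["1"] / total

p_one = predict_next("010101")
```

Here `p_one` comes out well below 0.5, because the shortest consistent program, "01", predicts a 0 next and dominates the mixture. Longer patterns that predict a 1 still get a small vote, which is the point: no consistent hypothesis is thrown away.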

Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem: we don't care about the space of all possible datasets, only about the ones that are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works must, to some extent, be an approximation of Solomonoff Induction.

But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you add restrictions that remove the obvious problems, like infinite loops and non-halting behavior, the space of possibilities is just too huge to reasonably search through. And it's discrete: you can't just flip a few bits in a program and expect to find another, similar program.

We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Induction: it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.

Logic circuits have some nicer properties than arbitrary computer programs, but they are still discrete and hard to do inference on. To fix this we can make continuous versions of logic circuits, in other words go back to analog. An analog circuit can compute all the same functions, but it works with real-valued states instead of binary ones.

Instead of flipping between discrete states, we can slightly strengthen or weaken the connections between gates, and the behavior will only change slightly. This is very nice, because we have algorithms like MCMC that can efficiently approximate true Bayesian inference on continuous parameters.
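As a rough illustration of MCMC on a continuous parameter, here is a sketch of a random-walk Metropolis sampler for a single connection strength `w`. Everything specific is an assumption made for the example: the data points, the linear model `y = w * x` with Gaussian noise, the wide Gaussian prior, and the names `log_posterior` and `metropolis`.

```python
import math
import random

def log_posterior(w, xs, ys, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior of w: Gaussian prior plus Gaussian likelihood."""
    log_prior = -0.5 * (w / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - w * x) / noise_sd) ** 2 for x, y in zip(xs, ys))
    return log_prior + log_lik

def metropolis(xs, ys, steps=20000, step_size=0.3, seed=0):
    """Random-walk Metropolis over w; returns samples after a 10% burn-in."""
    rng = random.Random(seed)
    w = 0.0
    current = log_posterior(w, xs, ys)
    samples = []
    for _ in range(steps):
        proposal = w + rng.gauss(0.0, step_size)   # nudge the connection slightly
        candidate = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            w, current = proposal, candidate
        samples.append(w)
    return samples[steps // 10:]

xs = [1.0, 2.0, 3.0, 4.0]          # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8]          # made-up outputs, roughly y = 2x
samples = metropolis(xs, ys)
post_mean = sum(samples) / len(samples)
post_sd = math.sqrt(sum((s - post_mean) ** 2 for s in samples) / len(samples))
```

The samples give a whole distribution over `w` rather than one best value, so `post_sd` says how uncertain the connection strength still is. This only works because small nudges to `w` produce small changes in the posterior, which is the property discrete programs lack.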

And we are no longer restricted to boolean gates; we can use any function that takes real numbers, like a function that takes the sum of all of its inputs, or one that squishes a real number between 0 and 1.
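A small sketch of what such a continuous gate might look like: a weighted sum of the inputs, squished between 0 and 1. The particular weights and biases below are hand-picked assumptions, chosen so that the gates imitate AND, OR, and NOT on binary inputs. The names `squash`, `gate`, `soft_and`, `soft_or`, and `soft_not` are hypothetical.

```python
import math

def squash(z):
    """Squish any real number into the range (0, 1) (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(inputs, weights, bias):
    """A continuous gate: weighted sum of the inputs, then squash."""
    return squash(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Hand-picked (assumed) connection strengths that mimic boolean gates.
def soft_and(a, b):
    return gate([a, b], [10.0, 10.0], -15.0)

def soft_or(a, b):
    return gate([a, b], [10.0, 10.0], -5.0)

def soft_not(a):
    return gate([a], [-10.0], 5.0)

# A slight change to a connection gives a slight change in the output.
before = gate([1.0, 1.0], [10.0, 10.0], -15.0)
after = gate([1.0, 1.0], [10.1, 10.0], -15.0)
```

On inputs of exactly 0 or 1, each gate lands very close to the boolean answer, but unlike a boolean gate it also accepts anything in between, and `after` differs from `before` only slightly.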

We can also look at how much a slight change to the input of a gate changes its output. Then we can go to all the gates that feed into it from the previous time step, see how much changing each of their inputs changes their outputs, and therefore how much it changes the output of the first gate.

And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit, to find out how much a slight change to each connection will change the final output. This is called the gradient, and with it we can do gradient descent: change each parameter slightly in the direction that moves the output toward what we want.
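The chaining can be sketched for the smallest possible circuit: one gate feeding a second gate. This is only an illustration under assumptions of my choosing: a squared-error loss, a single training example, a learning rate of 1.0, and the hypothetical names `forward`, `loss`, `gradients`, and `train`.

```python
import math

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = squash(w1 * x)   # first gate
    y = squash(w2 * h)   # second gate, fed by the first
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def gradients(w1, w2, x, target):
    """Chain rule, working backwards from the output to each connection."""
    h, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - target)           # how the loss changes with the output
    dL_dz2 = dL_dy * y * (1.0 - y)       # back through the second squash
    dL_dw2 = dL_dz2 * h                  # effect of the second connection
    dL_dh = dL_dz2 * w2                  # passed back to the first gate's output
    dL_dz1 = dL_dh * h * (1.0 - h)       # back through the first squash
    dL_dw1 = dL_dz1 * x                  # effect of the first connection
    return dL_dw1, dL_dw2

def train(w1, w2, x, target, rate=1.0, steps=2000):
    """Gradient descent: nudge each weight against its gradient."""
    for _ in range(steps):
        g1, g2 = gradients(w1, w2, x, target)
        w1 -= rate * g1
        w2 -= rate * g2
    return w1, w2

x, target = 1.0, 0.9
start_loss = loss(0.5, 0.5, x, target)
w1, w2 = train(0.5, 0.5, x, target)
end_loss = loss(w1, w2, x, target)
```

After training, `end_loss` is far smaller than `start_loss`. A real circuit has many gates and many examples, but the bookkeeping is the same: each gate passes back how much its inputs mattered.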

This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit the functions we want, like predicting the price of a stock given its past history, or recognizing a number in an image.

But this isn't quite Solomonoff Induction, since we are finding the single best model instead of weighing the space of all possible models. This is important because each model is essentially a hypothesis, and there can be multiple hypotheses which fit the data equally well yet predict different things.

There are many tricks we can do to approximate this. For example, you can randomly turn off each gate with 50% probability and then optimize the whole circuit to cope with this (the technique known as dropout). For some reason this somewhat approximates the results of true Bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate Bayesian inference that way.
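Here is a sketch of the first trick, used at prediction time: average the circuit's output over many random on/off masks, so that each mask acts like a slightly different hypothesis. The weights are made-up stand-ins for an already-trained circuit, the training step itself is left out, and the names `predict` and `dropout_predictions` are hypothetical. I also skip the rescaling of surviving gates that practical implementations usually apply.

```python
import math
import random

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, hidden_w, out_w, mask):
    """One-hidden-layer circuit; mask[i] == 0.0 switches hidden gate i off."""
    hidden = [m * squash(w * x) for w, m in zip(hidden_w, mask)]
    return squash(sum(w * h for w, h in zip(out_w, hidden)))

def dropout_predictions(x, hidden_w, out_w, samples=1000, keep=0.5, seed=0):
    """Predict once per random mask; each mask is one sampled sub-circuit."""
    rng = random.Random(seed)
    preds = []
    for _ in range(samples):
        mask = [1.0 if rng.random() < keep else 0.0 for _ in hidden_w]
        preds.append(predict(x, hidden_w, out_w, mask))
    return preds

hidden_w = [1.5, -2.0, 0.7, 3.0]   # hypothetical "trained" weights
out_w = [2.0, -1.0, 0.5, 1.0]      # hypothetical "trained" weights
preds = dropout_predictions(1.0, hidden_w, out_w)
mean = sum(preds) / len(preds)
spread = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
```

The `mean` plays the role of the averaged prediction, and `spread` gives a crude sense of how much the sampled sub-circuits disagree with each other.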

Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.

It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data; it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace its reward signal with an artificial one (e.g. drugs). It might not fully comprehend that it's just a computer, and that it exists inside the world it is observing.

Interestingly, humans also have these problems to some degree.

Reposted from my blog here.

45 Comments

anon85 (11y)

People here seem to really like Solomonoff induction, but I don't think it's all that relevant to learning in practice due to computational complexity.

Solomonoff induction is not computable. Trying to "approximate" it, by coming up with hypotheses similar to it, is probably also not computable.

If you replace Solomonoff induction with induction over programs that halt quickly or induction over circuits, it becomes computable, but it is still NP-hard. Again, approximating this is probably also NP-hard, depending on your definition of approximation.

Next, if you replace boolean circuits with neural nets, it is still hard to find the best neural net to fit the data. MCMC and gradient descent only find local optima. I mean, the fact that neural nets didn't give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.

It's not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don't do that at all.

jacob_cannell (11y)

MCMC and gradient descent only find local optima.

Not strictly true. Simply by overparameterizing the ANN, you can emulate parallel gradient descent (global optimization). Stochastic techniques, such as dropout, and tricks such as momentum can also help tunnel through local optima. In addition, the typical optimization hurdles tend to be saddle points rather than local optima.

Modern SGD mechanisms are powerful global optimizers.

I mean, the fact that neural nets didn't give us strong AI back in the 70s demonstrates that they are not doing anything

... (read more)

Richard_Kennaway (11y)

Can you expand on the relationship between the two? As far as I can see, Bayesian learning starts from a prior distribution, and PAC learning (for PAC-learnable concepts) works independently of the distribution. That is, they're solving average-case vs. worst-case or adversarial problems, which is a subject Eliezer has discussed here. Further discussion, mostly not by Eliezer, here. Those discussions don't mention PAC though. PAC-Bayesian hybrids appear to also be a thing, but all I know about that is the Google search I just did.

Houshalter (11y)

It's true that it's only possible to find local optima, but that's true with any algorithm. Unless P=NP, perfect mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn't a huge problem. As to why we can have decent machine learning and not AGI, I don't know. For one thing, approximating SI isn't sufficient; you also need to act on the models you find. But even so, I'm mostly trying to say that it is a promising research direction, and better than other stuff. I'm not saying it's a solved problem, or that there are no further advances to be made. [...] Everything approximates Bayesian inference; it's just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches Bayesian inference. And that's how NNs are typically trained. But it's nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.
