[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better. EDIT: As it turns out, this is largely the case, under the subjects of the justifications for MML prediction and Kolmogorov-simple PAC-learning.]

I: Introduction

I am fascinated by methods of thinking that work for well-understood reasons - that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.

To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it's guaranteed to work in the limit of large computing power, which ain't nothing.
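Here's a minimal sketch of that naive scheme (the hypothesis objects, with a run method and a description_length, are an assumed interface for illustration, not any existing library):

```python
def predict_next_bit(data, hypotheses, max_steps=10_000):
    """Naive resource-limited induction: keep only the hypotheses that
    exactly reproduce the data within a step budget, then average their
    next-bit predictions weighted by 2^-(description length)."""
    votes = {0: 0.0, 1: 0.0}
    for h in hypotheses:
        output = h.run(max_steps)          # assumed to return None on timeout
        if output is None or len(output) <= len(data):
            continue                       # no usable prediction
        if output[:len(data)] != data:
            continue                       # not a perfect retrodictor
        votes[output[len(data)]] += 2.0 ** -h.description_length
    total = sum(votes.values())
    return None if total == 0 else {b: v / total for b, v in votes.items()}
```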

Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today's largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?

Bupkis happens, that's what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.

We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?

The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines - and so again our predictor runs out of time before becoming useful.

In practice, the methods we've found to work don't look very much like Solomonoff induction. Successful methods don't take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.

II: Rambling

Suppose that our AI builds a lot of models of the world, including lossy models. How should it decide which models are best to use for predicting the world? Ideally we'd like to trade off the accuracy of the model - measured as the expected utility of acting on predictions of the accuracy we expect - against the cost of the time and energy used to make the prediction.

Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.

There are various techniques to estimate resource usage, but how does one estimate accuracy?

Here was my first thought: if you know how much information you're losing (e.g. by binning data), then for discrete distributions this pins down the Shannon entropy of the ideal value (the one Solomonoff prediction would give) conditional on the predicted value. The bits you discard are the conditional entropy, and so they cap how sharp your probability distribution over the ideal value is allowed to be.
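To make that concrete with a toy example (basic information theory, nothing specific to this post): bin a uniformly distributed byte into 16 cells and the entropy of the true value given the binned value is exactly the 4 bits you threw away.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(X | Y) in bits for an empirical list of (x, y) samples."""
    joint = Counter(pairs)
    y_marg = Counter(y for _, y in pairs)
    n = len(pairs)
    h = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x_given_y = count / y_marg[y]
        h -= p_xy * math.log2(p_x_given_y)
    return h

# Bin byte-valued data into 16 cells: we throw away log2(256/16) = 4 bits,
# so H(true value | binned value) is 4 bits for uniform data.
data = list(range(256)) * 10
pairs = [(x, x // 16) for x in data]
print(conditional_entropy(pairs))   # -> 4.0
```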

But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn't very helpful. The problem is highlighted by hurricane prediction. If hurricanes degraded gracefully as we threw away information, weather models would deviate from reality only by small, high-entropy (noise-like) errors. Instead, hurricanes can change course dramatically given small differences in initial conditions.

The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data - a subject that workers in AI have already been hard at work on. On the one hand, it's a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.

To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.

The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can't grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we're left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn't require perfection.

III: An Example

Here's a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.

The first piece of the puzzle is that a model that gets a proportion P of training bits wrong can be converted into a perfectly precise model that Solomonoff induction will accept, just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. The extra data required to rectify the approximate model is then at most N/L·log(P·L)+N·P·log(L): each of the N/L chunks spends about log(P·L) bits saying how many of its bits are wrong, and each of the roughly N·P wrong bits costs about log(L) bits to locate within its chunk. Under this encoding, the hypothesis where the model is right about the next bit is simpler than the one where it's wrong, because when the model is right you don't have to spend ~log(L) bits correcting it.
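With some made-up numbers (a rough sketch, assuming the chunked-prefix encoding above):

```python
import math

def correction_overhead(N, L, P):
    """Upper bound on the extra bits needed to patch a model that gets a
    fraction P of its N output bits wrong, using chunks of length L:
    (N/L)*log2(P*L) for the per-chunk error counts plus N*P*log2(L) for
    the positions of the wrong bits within their chunks."""
    return (N / L) * math.log2(P * L) + N * P * math.log2(L)

# A model that is 99% accurate on a million training bits:
N, P = 1_000_000, 0.01
for L in (100, 1_000, 10_000):
    print(L, round(correction_overhead(N, L, P)))
# Solomonoff induction still promotes the model as long as its own
# description is shorter than N minus this overhead.
```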

In this way, Solomonoff induction natively cares about some approximate models' predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it's simpler than N-N/L·log(P·L)-N·P·log(L).

The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering out difficult-to-predict parts (noise). If this processing increases the accuracy of models, we can use it to build high-accuracy models of functions of the training data, and then use those models to predict the processed future observations as above.

These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing - the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.
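Here's a toy version of that interpretation (nothing here is meant as the "right" scheme, just an illustration): smooth noisy data, fit a simple curve to the smoothed series, and use the curve to predict the smoothed signal one step ahead.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100.0)
signal = 0.001 * t**2 + 0.05 * t                 # underlying trend
data = signal + rng.normal(0, 0.5, size=t.size)  # noisy observations

# Throw information away: smooth the data with a trailing moving average.
window = 7
smoothed = np.convolve(data, np.ones(window) / window, mode="valid")
t_smooth = t[window - 1:]

# A simple model retrodicts the smoothed function accurately...
coeffs = np.polyfit(t_smooth, smoothed, deg=2)

# ...so it's a reasonable predictor of the *smoothed* future, even though
# it says nothing about the unpredictable high-frequency noise.
print(np.polyval(coeffs, t_smooth[-1] + 1))
```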

There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven't explored how models get chosen; we've made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It's a useful habit to keep in mind what actually works for humans - as someone put it to me recently, "humans can make models they understand that work for reasons they understand."

Furthermore, this doesn't seem to capture reductionism well. If our agent learns some laws of physics and is then faced with a big, complicated situation that it needs a simplified model to make predictions about, it should still in some sense "believe in the laws of physics," and not believe that the complicated situation violates the laws of physics, even if its current best model is independent of physics.

IV: Logical Uncertainty

It may be possible to relate this back to logical uncertainty - where by "this" I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a "logical uncertainty method" in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine's outputs.

Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the laws of physics, then ask it to predict the machine. We'd like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes - like in the simple case of "if you flip this coin in this exact way, it has a 50% shot at landing heads" - doesn't that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.
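To illustrate the distinction (toy numbers only): the full prediction has to mix each model's own output distribution with the weights expressing how uncertain we are between the models.

```python
def mix_models(models):
    """Combine per-model output distributions, weighted by how much
    credence we give each model (weights sum to 1)."""
    combined = {}
    for weight, dist in models:
        for outcome, p in dist.items():
            combined[outcome] = combined.get(outcome, 0.0) + weight * p
    return combined

# Within-model uncertainty: the folk-physics model says 50/50 heads/tails.
# Between-model uncertainty: maybe the coin is biased, and we give that 20%.
print(mix_models([
    (0.8, {"heads": 0.5, "tails": 0.5}),
    (0.2, {"heads": 0.9, "tails": 0.1}),
]))
# -> {'heads': 0.58, 'tails': 0.42}
```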

Again, we're back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?

Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can't predict a large prime number exactly, but we can predict that its last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can't find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.
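The prime example is easy to check empirically (plain Python, nothing assumed beyond the standard library): primes above 5 end in 1, 3, 7, or 9, and those endings occur about equally often, so a uniform distribution over them is a sensible prediction for a prime we can't compute.

```python
from collections import Counter

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [i for i, is_p in enumerate(sieve) if is_p]

# Last digits of primes between 10^5 and 10^6: each of 1, 3, 7, 9 accounts
# for roughly a quarter, so {1, 3, 7, 9} uniformly is a reasonable guess
# about a prime far too large to actually compute.
last_digits = Counter(p % 10 for p in primes_up_to(1_000_000) if p > 100_000)
total = sum(last_digits.values())
print({d: round(c / total, 3) for d, c in sorted(last_digits.items())})
```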

Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we've searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it's that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way - it just sets a few degrees of freedom. But is it the right way to reason?

For one thing, this seems like it's following Gaifman's proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be "too strong" in some sense (it assigns probability 0 to some true statements - though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I'm sorry, ma'am. Your daughter has... the Gaifman condition.)

This inductive view of logical uncertainty lacks the consistent nature of many other approaches - if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.

V: Questions

Thanks for reading this post. In conclusion, here are some of my questions:

What's the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?

Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions - and not merely in the limit of lots of computing time?

Are humans doing the right thing by making models they understand that work for reasons they understand? What's up with that reductionism problem anyhow?

Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman's condition if it finds patterns in things like P!=NP?

10 comments

What you are looking for already exists - look up unsupervised generative models, and in particular temporal autoencoders that attempt to predict their future input sequences. Generative models handle model uncertainty and approximation uncertainty in the way you are looking for. The universal prior concept maps directly to the concept of regularization in machine learning, and the general principle is that your model complexity (in bits) must be less than your data complexity (ie, your model should compress your data).
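(A minimal sketch of that compression criterion, with made-up numbers: a model is only worth keeping if its description cost is repaid by the bits it saves when encoding the data.)

```python
import math

def two_part_code_length(model_bits, residual_probs):
    """Two-part / MDL-style score: bits to describe the model plus bits to
    encode the data under the model (-log2 likelihood of each datum)."""
    data_bits = -sum(math.log2(p) for p in residual_probs)
    return model_bits + data_bits

# A model "compresses" the data when its total code length beats the raw
# encoding of the data (here, 1 bit per binary observation).
raw_bits = 1_000
simple_model = two_part_code_length(50, [0.9] * 1_000)     # small model, 90% confident
bloated_model = two_part_code_length(2_000, [0.99] * 1_000)  # its description cost isn't repaid
print(raw_bits, round(simple_model), round(bloated_model))
```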

In practice however there is a problem with the whole "predict all of my observations" idea: doing this tends to waste lots of computation on modelling features that have little specific task utility.

Supervised learning (where applicable) is more efficient and generally preferred because it can focus the limited model capacity & computation on learning task relevant features.

Also, you should check out Jürgen Schmidhuber's AMA on Reddit, where he explains the connections between ideal/impractical inference algorithms and their more practical/scalable approximations, such as RNNs.

Thanks for the recommendations! Do you know of any good resources for work exploring optimality and generality proofs for practical unsupervised learners?

In practice however there is a problem with the whole "predict all of my observations" idea: doing this tends to waste lots of computation on modelling features that have little specific task utility.

Right, this is why I think it's relevant that Solomonoff induction will still do a good job of predicting any computable function of your data - like a noise-filtering function. This gives you an easy proof that certain algorithms will succeed modulo noise.

Do you know of any good resources for work exploring optimality and generality proofs for practical unsupervised learners?

Yes. There were a couple of DL theory papers in the last year or so that generated some excitement on r/machinelearning:

"Provable bounds for learning some deep representations."

"On the computational efficiency of training neural networks."

I'm not a huge fan of hard theory and optimality proofs, but even I found the first paper in particular somewhat insightful.

Machine learning was somewhat obsessed with formal theory (provable bounds, convex learning, etc) in the last decade, and I believe it held back progress.

The problem is that to prove any hard results, you typically have to simplify the model down to the point at which it loses all relevance to solving real world problems.

The recent advances in deep learning come in part from the scruffies/experimental researchers saying "screw hard theory" and just forging ahead, and also in part from advances in informal theoretical understanding (what one may call wisdom rather than math). The work of Bengio and Hinton in particular is rich with that kind of wisdom.

In particular see this comment from Hinton, and a related comment here from Bengio.

The recent advances in deep learning come in part from the scruffies/experimental researchers saying "screw hard theory" and just forging ahead

Yeah I get that vibe. But ultimately I'm looking into this because I want to understand the top-down picture, not because of specific applications (Well, integrating logical uncertainty with induction would be nice, but if that works it's more of a theoretical milestone than an application). Research on neural networks is actually more specific than what I want to read right now, but it looks like that's what's available :P (barring a literature search I should do before reading more than a few NN papers)

One of the issues is that if you try to prove generalization bounds for anything that has a universal representation property (e.g. neural networks), you always get terrible bounds that don't reflect empirical performance on real-world learning tasks.

I think this is because there are probably some learning tasks which are intrinsically intractable (this is certainly the case if one-way functions exist), therefore any bound that quantifies over all learning tasks will be at least exponential.

But ultimately I'm looking into this because I want to understand the top-down picture, not because of specific applications

For that, I really really recommend "Representation Learning". Quote from the abstract:

"This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning."

It really covers the big picture view of the entirety of machine learning, and from multiple viewpoints (statistical, geometric, optimization, etc).

Well, integrating logical uncertainty with induction would be nice, but if that works it's more of a theoretical milestone than an application)

That's done or being done. Look into "probabilistic programming".

That's done or being done. Look into "probabilistic programming".

This doesn't look like what I want - have you read my take on logical uncertainty? I still stand by the problem statement (though I'm more uncertain now about minimizing Shannon vs. Kolmogorov information), even if I no longer quite approve of the proposed solution.

EDIT: If I were to point to what I no longer approve of, it would be that while in the linked posts I focus on what's mandatory for limited reasoners, I now think it should be possible to look at what's actually a good idea for limited reasoners.

Probabilistic programming seems to assign distributions to programs that involve random variables, but logical uncertainty assigns distributions to programs without random variables :P

This doesn't look like what I want - have you read my take on logical uncertainty?

A little - yes approximation is key for practical performance. This is why neural nets and related models work so well - they allow one to efficiently search over model/function/program space for the best approximate models that fit memory/compute constraints.

Probabilistic programming seems to assign distributions to programs that involve random variables, but logical uncertainty assigns distributions to programs without random variables :P

Code and variables are interchangeable, so of course prob programming can model distributions over programs. For example, I could create a VM or interpreter with a big array of 'variables' that define opcodes. Graphical models and networks also encode programs as variables. There is no hard distinction between data/code/variables.

Hm, I think we're talking past each other.

To give an example of what I mean, if a probabilistic programming language really implemented logical uncertainty, you could write a deterministic program that computes the last digit of the googolth prime number (possible outputs 0 through 9), and then use some getDistribution method on this deterministic program, and it would return the distribution {0.25 chance of 1, 0.25 chance of 3, 0.25 chance of 7, 0.25 chance of 9} (unless your computer actually has enough power to calculate the googolth prime).
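(Purely for illustration - nth_prime below is a stand-in I'm defining naively, and getDistribution stays hypothetical:)

```python
def nth_prime(n):
    """Naive nth-prime search; hopeless for n = 10**100, which is the point."""
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate

def last_digit_of_googolth_prime():
    # Deterministic - no random variables anywhere - just intractable.
    return nth_prime(10 ** 100) % 10

# What a logical-uncertainty-aware getDistribution would ideally report
# without running the program to completion:
#   getDistribution(last_digit_of_googolth_prime)
#   -> {1: 0.25, 3: 0.25, 7: 0.25, 9: 0.25}
```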

I think your example is - at least theoretically - still within the reach of a probabilistic logic program, although the focus is entirely on approximation rather than conditioning (there are no observations). Of course, I doubt any existing inference engine would be able to help much with problems of that type, but the idea is still general enough to cover estimating intractable functions from the logic defining the solution.