You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

Computable Universal Prior

0 potato 11 December 2015 09:54AM

Suppose instead of using 2^-K(H) we just use 2^-length(H), does this do something obviously stupid? 

Here's what I'm proposing:

Take a programing language with two characters. Assign each program a prior of 2^-length(program). If the program outputs some string, then P(string | program) = 1, else it equals 0. I figure there must be some reason people don't do this already, or else there's a bunch of people doing it. I'd be real happy to find out about either. 

Clearly, it isn't a probability distribution, but we can still use it, no? 

 

 

Approximating Solomonoff Induction

6 Houshalter 29 May 2015 12:23PM

Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely they are to have produced the data. Then it weights them by their probability.

Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem; that we don't care about the space of all possible datasets, but ones which are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.

But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you make restrictions to remove all the obvious problems like infinite loops/non-halting behavior. The space of possibilities is just too huge to reasonably search through. And it's discrete - you can't just flip a few bits in a program and find another similar program.

We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Inducion; that it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.

Logic circuits have some nicer properties compared to arbitrary computer programs, but they still are discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. It's capable of doing all the same functions, but also working with real valued states instead of binary.

Instead of flipping between discrete states, we can slightly increase connections between circuits, and it will only slightly change the behavior. This is very nice, because we have algorithms like MCMC that can efficiently approximate true bayesian inference on continuous parameters.

And we are no longer restricted to boolean gates, we can use any function that takes real numbers. Like a function that takes a sum of all of it's inputs, or one that squishes a real number between 0 and 1.

We can also look at how much changing the input of a circuit slightly, changes the output. Then we can go to all the circuits that connect to it in the previous time step. And we can see how much changing each of their input changes their output, and therefore the output of the first logic gate.

And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit. Finding out how much a slight change to each connection will change the final output. This is called the gradient, and we can then do gradient descent. Basically change each parameter slightly in the direction that increases the output the way we want.

This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.

But this isn't quite Solomonoff Induction. Since we are finding the best single model, instead of testing the space of all possible models. This is important because essentially each model is like a hypothesis. There can be multiple hypotheses which also fit the data yet predict different things.

There are many tricks we can do to approximate this. For example, if you randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with this. For some reason this somewhat approximates the results of true bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate bayesian inference that way.

Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.

It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data, it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace it's reward signal with an artificial one (e.g. drugs.) It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.

Interestingly, humans also have these problem to some degree.

Reposted from my blog here.

How do you notice when you are ignorant of necessary alternative hypotheses?

16 [deleted] 24 June 2014 06:12PM

So I just wound up in a debate with someone over on Reddit about the value of conventional academic philosophy.  He linked me to a book review, in which both the review and the book are absolutely godawful.  That is, the author (and the reviewer following him) start with ontological monism (the universe only contains a single kind of Stuff: mass-energy), adds in the experience of consciousness, reasons deftly that emergence is a load of crap... and then arrives to the conclusion of panpsychism.

WAIT HOLD ON, DON'T FLAME YET!

Of course panpsychism is bunk.  I would be embarrassed to be caught upholding it, given the evidence I currently have, but what I want to talk about is the logic being followed.

1) The universe is a unified, consistent whole.  Good!

2) The universe contains the experience/existence of consciousness.  Easily observable.

3) If consciousness exists, something in the universe must cause or give rise to consciousness.  Good reasoning!

4) "Emergence" is a non-explanation, so that can't be it.  Good!

5) Therefore, whatever stuff the unified universe is made of must be giving rise to consciousness in a nonemergent way.

6) Therefore, the stuff must be innately "mindy".

What went wrong in steps (5) and (6)?  The man was actually reasoning more-or-less correctly!  Given the universe he lived in, and the impossibility of emergence, he reallocated his probability mass to the remaining answer.  When he had eliminated the impossible, whatever remained, however low its prior, must be true.

The problem was, he eliminated the impossible, but left open a huge vast space of possible hypotheses that he didn't know about (but which we do): the most common of these is the computational theory of mind and consciousness, which says that we are made of cognitive algorithms.  A Solomonoff Inducer can just go on to the next length of bit-strings describing Turing machines, but we can't.

Now, I can spot the flaw in the reasoning here.  What frightens me is: what if I'm presented with some similar argument, and I can't spot the flaw?  What if, instead, I just neatly and stupidly reallocate my belief to what seems to me to be the only available alternative, while failing to go out and look for alternatives I don't already know about?  Notably, it seems like expected evidence is conserved, but expecting to locate new hypotheses means I should be reducing my certainty about all currently-available hypotheses now to have some for dividing between the new possibilities.

If you can notice when you're confused, how do you notice when you're ignorant?

More and Less than Solomonoff Induction

4 Manfred 21 May 2014 04:55AM

I've been thinking about how to put induction with limited resources on a firmer foundation. This may just be retracing the steps of others, but that's okay with me. Mostly I just want to talk about these thoughts.

After a few points of introduction.

What's Solomonoff induction?

Suppose we're given some starting data, and asked to predict the future. Solomonoff induction predicts the future by combining the predictions of all programs that (1) output the starting data up until now, and (2) aren't the continuation of another such program. The predictions are combined according to a weighting that decreases exponentially as the length of the program increases.

Why is it a good idea?

The simplest answer is that we have a frequentist guarantee that if the "true program" generating our input has some length N (that is, if the observable universe is a big but finite-sized computer), then our predictions will only be wrong a limited number of times, and after that we'll predict the correctly every time.

A more bayesian answer would start with the information that our observations can be generated by some finite-sized program, and then derive that something like Solomonoff induction has to represent our true prior over generating programs - as the length gets bigger, our probability is required to go to zero at infinity, and an exponential is the maximum-entropy such curve. This is not a complete answer, but it at least makes the missing pieces more apparent.

Why won't it work with limited resources

The trouble with using Solomonoff induction in real life is that to pick out which programs output our data so far, we need to run every program - and if the program doesn't ever halt, we need to use a halting oracle to stop it or else we'll take infinite time.

Limited resources require us to only pick from a class of programs that is guaranteed to not run over the limit.

If we have limited time and no halting oracle, we can't check every program. Instead, we are only allowed to check programs drawn from a class of programs that we can check the output of in limited time. The simplest example would be to just check programs but not report the result if we go over some time limit, in which case the class we're picking from is "programs that don't go over the time limit."

This is an application of a general principle - when we impose resource constraints on a real agent, we want to be able to cash them out as properties of an abstract description of what the agent is doing. In this case, we can cash out limited time for our real agent as an inability for our abstract description to have any contributions from long programs.

This isn't really a Bayesian restriction - we don't actually know that the true program is within our search space. This also weakens our frequentist guarantee.

The problem of overfitting - real models allow for incorrect retrodictions.

If we just take Solomonoff induction and put restrictions on it, our predictions will still only come from hypotheses that exactly reproduce our starting data. This is a problem.

For example, if we have a ton of data, like we do, then finding even one program that exactly replicates our data is too hard, and our induction is useless.

We as humans have two main ways of dealing with this. The first is that we ignore most of our data and only try to predict the important parts. The second is that we don't require our models to perfectly retrodict our observations so far (retrodiction- it's like prediction, for the past!).

In fact. having imperfect retrodictions is such a good idea that we can have a problem called "overfitting."  Isn't that kinda odd? That it's better to use a simple model that is actually 100% known to be wrong, than to use a very complicated model that has gotten everything right so far.

This behavior makes more sense if we imagine that our data contains contributions both from the thing we're trying to model, and from stuff we want to ignore, in a way that's hard to separate.

What makes real models better than chance? The example of coarse-graining.

Is it even possible to know ahead of time that an imperfect model will make good predictions? It turns out the answer is yes.

The very simplest example is of just predicting the next bit based on which bit has been more common so far. If every pattern were equally likely this wouldn't work, but if we require the probability of a pattern to decrease with its complexity, we're more likely to see repetitions.

A more general example: an imperfect model is always better than chance if it is a likely coarse-graining of the true program. Coarse-graining is when you can use a simple program to predict a complicated one by only worrying about some of the properties of the complicated program. In the picture at right, a simple cellular automaton can predict whether a more complicated one will have the same or different densities of white and black in a region.

When a coarse-graining is exact, the coarse-grained properties follow exact rules all the time. Like in the picture at right, where the coarse-grained pattern "same different same" always evolves into "different different different," even though there are multiple states of the more complicated program that count as "different" and "same."

When a coarse-graining is inexact, only most of the states of the long program follow the coarse-grained rules of the short program. but it turns out that, given some ignorance about the exact rules or situation, this is also sufficient to predict the future better than chance.

Of course, when doing induction, we don't actually know the true program. Instead what happens is that we just find some simple program that fits our data reasonably well (according to some measurement of fit), and we go "well, what happened before is more likely to happen again, so this rule will help us predict the future."

Presumably we combine predictions according to some weighting that include both the length of the program and its goodness of fit.

Machine learning - probably approximately correct

Since this is a machine learning problem, there are already solutions to similar problems, one of which is probably approximately correct learning. The basic idea is that if you have some uniformly-sampled training data, and a hypothesis space you can completely search, then you can give some probabilistic guarantees about how good the hypothesis is that best fits the training data. A "hypothesis," here, is a classification of members of data-space into different categories.

The more training data you have, the closer (where "close" can be measured as a chi-squared-like error) the best-fitting hypothesis is to the actually-best hypothesis. If your hypothesis space doesn't contain the true hypothesis, then that's okay - you can still guarantee that the best-fitting hypothesis gets close to the best hypothesis in your hypothesis space. The probability that a far-away hypothesis would masquerade as a hypothesis within the small "successful" region gets smaller and smaller as the training data increases.

There is a straightforward extension to cases where your data contains contributions from both "things we want to model" and "things we want to ignore." Rather than finding the hypothesis that fits the training data best, we want to find the hypothesis for the "stuff we want to model" that has the highest probability of producing our observed data, after we've added some noise, drawn from some "noise distribution" that encapsulates the basic information about the stuff we want to ignore.

There are certainly some issues comparing this to Solomonoff induction, like the fact that our training data is randomly sampled rather than a time series, but I do like this paradigm a bit better than looking for a guarantee that we'll find the one true answer in finite time.

Bayes' theorem

If there's an easy way to combine machine learning with Solomonoff induction, it's Bayes' theorem. The machine learning was focused on driving down P(training data | chance), and picking a hypothesis with a high P(training data | hypothesis, noise model). We might want to Use Bayes' rule to say something like P(hypothesis | training data, noise model) = P(hypothesis | complexity prior) * P(training data | hypothesis, noise model) / P(training data | noise model).

 

Anyhow - what are the obvious things I've missed? :)

Solomonoff induction on a random string

0 christopherj 08 April 2014 05:30AM

So, I've been hearing a lot about the awesomeness of Solomonoff induction, at least as a theoretical framework. However, my admittedly limited understanding of Solomonoff induction suggests that it would form an epicly bad hypothesis if given a random string. So my question is, if I misunderstood, how does it deal with randomness? And if I understood correctly, isn't this a rather large gap?

 

Edit: Thanks for all the comments! My new understanding is that Solomonoff induction is able to understand that it is dealing with a random string (because it finds itself giving equal weight to programs that output a 1 or a 0 for the next bit), but despite that is designed to keep looking for a pattern forever. While this is a horrible use of resources, the SI is a theoretical framework that has infinite resources, so that's a meaningless criticism. Overall this seems acceptable, though if you want to actually implement a SI you'll need to instruct it on giving up. Furthermore, the SI will not include randomness as a distinct function in its hypothesis, which could lead to improper weighting of priors, but will still have good predictive power  -- and considering that Solomonoff induction was only meant for computable functions, this is a pretty good result.

Modifying Universal Intelligence Measure

2 Alex_Altair 18 September 2012 11:44PM

In 2007, Legg and Hutter wrote a paper using the AIXI model to define a measure of intelligence. It's pretty great, but I can think of some directions of improvement.

  • Reinforcement learning. I think this term and formalism are historically from much simpler agent models which actually depended on being reinforced to learn. In its present form (Hutter 2005 section 4.1) it seems arbitrarily general, but it still feels kinda gross to me. Can we formalize AIXI and the intelligence measure in terms of utility functions, instead? And perhaps prove them equivalent?
  • Choice of Horizon. AIXI discounts the future by requiring that total future reward is bounded, and therefore so does the intelligence measure. This seems to me like a constraint that does not reflect reality, and possibly an infinitely important one. How could we remove this requirement? (Much discussion on the "Choice of the Horizon" in Hutter 2005 section 5.7).
  • Unknown utility function. When we reformulate it in terms of utility functions, let's make sure we can measure its intelligence/optimization power without having to know its utility function. Perhaps by using an average of utility functions weighted by their K-complexity.
  • AI orientation. Finally, and least importantly, it tests agents across all possible programs, even those which are known to be inconsistent with our universe. This might okay if your agent is a playing arbitrary games on a computer, but if you are trying to determine how powerful an agent will be in this universe, you probably want to replace the Solomonoff prior with the posterior resulting from updating the Solomonoff prior with data from our universe.

Any thought or research on this by others? I imagine lots of discussion has occurred over these topics; any referencing would be appreciated.

Combining causality with algorithmic information theory

15 [deleted] 09 June 2012 01:31AM

Warning: maths.

Causal inference using the algorithmic Markov condition (Janzing and Schölkopf, 2008) replaces conditional independences between random variables, which define the structure of causal graphs, with algorithmic conditional independences between bit strings.

Conditional probabilities between variables become conditional complexities between strings, i.e. K(x|y) is the length of the shortest program that can generate the string x from y. Similarly, algorithmic mutual information I(x:y) is the amount of information that can be omitted in defining a string y given a shortest compressor for string x, I(x:y) = K(y) - K(y|x*). K(x,y) is the complexity of the concatenation of two strings x and y. These lead naturally to a definition of algorithmic conditional independence as I(x:y|z) = K(x|z) + K(y|z) - K(x,y|z) = 0 , where equality is defined up to the standard additive constant.

Then a lot of sexy, confusing proofs happen. When the dust settles, it looks like if you take some strings describing observations, interpret them as nodes in a graph, and "factor" so that a certain algorithmic Markov condition holds (every node string should be algorithmically independent of its non-descendant node strings given the optimal compressor of its parents' node strings), then every node can be computed by an O(1) program run on a Turing machine, with the node's parents and a noise term as input (with each node's noise string being jointly independent of the others). 

Notably, this means that if we make two observations which were "generated from their parents by the same complex rule", then we can "postulate another causal link between the nodes that explains the similarity of mechanisms". They say "complex rule" because the mutual algorithmic information between simple information strings, like some digits of pi, will be swallowed up by additive constants. Which all seems very close to rediscovering TDT.

There's more to the paper, but that's the tasty bit, so the summary ends here.

A Philosophical Treatise of Universal Induction (Link)

11 XiXiDu 10 June 2011 06:53PM

Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands of years. This problem is relevant to a wide range of fields and is integral to the philosophy of science. It has been tackled by many great minds ranging from philosophers to scientists to mathematicians, and more recently computer scientists. In this article we argue the case for Solomonoff Induction, a formal inductive framework which combines algorithmic information theory with the Bayesian framework. Although it achieves excellent theoretical results and is based on solid philosophical foundations, the requisite technical knowledge necessary for understanding this framework has caused it to remain largely unknown and unappreciated in the wider scientific community. The main contribution of this article is to convey Solomonoff induction and its related concepts in a generally accessible form with the aim of bridging this current technical gap. In the process we examine the major historical contributions that have led to the formulation of Solomonoff Induction as well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff induction addresses many issues that have plagued other inductive systems, such as the black ravens paradox and the confirmation problem, and compare this approach with other recent approaches.

Link: mdpi.com/1099-4300/13/6/1076/

Download PDF Full-Text: mdpi.com/1099-4300/13/6/1076/pdf

Authors: Samuel Rathmanner and Marcus Hutter

Published: 3 June 2011

Via: vetta.org/2011/06/treatise-on-universal-induction/