private_messaging comments on More and Less than Solomonoff Induction - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (32)
Speaking of
as far as we know, there's no finite length "program" that would generate the input you get from, say, looking at a quantum random number generator, or for that matter, at any chaotic process where the butterfly effect amplifies quantum noise up to macroscale (e.g. weather).
Yep, Solomonoff induction makes the assumption that your observations are the output of some turing machine. The question is, does a Solomonoff inductor converge to the correct strategy anyhow? One can design bets that would punish it, given a random number generator, so no, not really.
However, allowing us to model "things we don't understand" as a random distribution, and then look for the highest-probability hypothesis over the part of the world we're trying to model, does solve the problem.
No. You need an hypercomputable environment to consistently beat Solomonoff induction.
Suppose I was generating bits with a fair coin, and for each flip I offered a Solomonoff inductor the following bet: +1 points if heads, -1.000001 points if tails.
My suspicion is that the Solomonoff inductor would take the bet an infinite number of times if I kept flipping the coin.
Rather than predicting the Solomonoff inductor and then tricking it, it may suffice to have an infinitely complex string of bits with a known ratio of 1s and 0s. Of course, maybe we've been too hasty in thinking that nature can provide fair coins, and the Solomonoff inductor will eventually start predicting quantum mechanics. That's certainly what it thinks will happen.
My suspicion is that you don't know what you are talking about.
Sorry about the frankness, but your mistakes denote some serious confusion about the subject.
A Solomonoff inductor has no concept of betting and certainly can't decide whether to take a bet.
There are a few fair ways to model it as a decision agent: you can take the per-bit posterior probability as the output and reward it according to a strictly proper scoring rule, or take the predicted bits as the output and reward it by the number of correct predictions.
In these cases, if the data string is generated by a computable probability distribution (e.g. a fair coin), then the Solomonoff inductor payoff rate will asymptotically (and quickly) converge to or above the expected payoff rate of the best computable stochastic agent.
Before delving any further into this topic, I suggest you study the literature. The book An Introduction to Kolmogorov Complexity and Its Applications by Ming Li and Paul Vitanyi provides an excellent presentation of Kolmogorov complexity and Solomonoff induction, which also covers several resource-bounded variants with practical applications.
Assume that we do this the obvious way.
If you like, we can also talk about the number of times that the probability estimate differs from 0.5 by some epsilon.
(The obvious way is "if EV(bet)>0, consider the bet taken.")
Thanks for the recommendation!
What shows up in the most obvious place - pg 328 of the book - is that the mean squared error decreases faster than 1/n. And in fact the proof method isn't strong enough to show that the absolute error decreases faster than 1/n (because the derivative of the absolute error jumps discontinuously).
There's still the possibility that we can "cheat" by allowing our agent to take bets with positive expected value. Taking bets only cares about the absolute error.
But still, maybe there are stronger bounds out there (maybe even later in the same book) that would apply to the absolute error. And I'm definitely not disputing that I'm ignorant :)
I don't think its really obvious. The (retroactively) obvious way to extend SI into a decision agent, is AIXI, which has its own optimality theorems.
The absolute error is just the square root of the squared error. If the squared error decreases faster than 1/n, then the absolute error decreases faster than 1/sqrt(n).
The point being that if your expected absolute error decreases like 1/n or slower, you make an infinite number of wrong bets.
You keep framing this in terms of bets. I don't think there is any point in continuing this discussion.
How are you converting the inductor into a betting decision?
What you should get from the inductor - after sufficient input - is very close to equal probability for head and tails (probability arising from the weighted sum of surviving programs).
edit: namely, you'll have (among the craziness of that program soup) a family of programs that relay subsequent bits from the program tape to the output. So you have the family of programs that begin with "print 11010111001..." , and within those, there's equal number of those that print zero after 11010111001 and those that print 1 .
There's also various craziness which offsets the probabilities - e.g. a family of programs that do that but start outputting zeroes (or ones) after some very large number of bits. Those aren't eliminated until they fail to predict the future.
edit2: though, yeah, if you keep decreasing your epsilon here, it'll accept the bet now and then. If you're utilizing the prior history of bets it made (e.g. to adjust that small punishment down until it's willing to bet) , you effectively have an environment with hypercomputation.
The obvious way - if EV(bet)>0, take bet.
We could also talk about "the number of times the probability assignment is off by epsilon."
The point is, this mental picture you're painting - universe has a "true program", Solomonoff inductor will look at the world, figure that program out, and be correct every time ever after - is, as far as we know, almost certainly not true. (Except perhaps trivially, in form of universe eventually expanding so far that the input is a never-ending stream of zeroes). edit: also, no, it doesn't make that assumption.
So how would you propose we deal with modelling something which seemingly contains such random number generators?
Solomonoff induction can handle that.
Yeah, albeit the shortest hypothesis that works keeps growing in length, which is really really bad for being able to compute something similar in practice.
That depends on what you mean by "shortest hypothesis". You can view Solomonoff induction as a mixture of all deterministic programs or, equivalently, a mixture of all computable distributions. If a "hypothesis" is a deterministic program, it will keep growing in length when given random bits. On the other hand, if a "hypothesis" is a computable distribution, its length might well stay bounded and not very large.
Well, it'd be an utter mess where you can't separate the "data" bits (ones that are transformed into final probability distribution if we assume they were random) from the "code" bits (that do the transformation). Indeed given that the universe itself can run computations, all "data" bits are also code.
If a "hypothesis" is a computable distribution, it doesn't need to include the random bits. (A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.)
Got to be a little more than that, for the probabilities of different bit strings to add up to 1. The obvious way to do that without having to sum all "probabilities" and renormalize, is to map an infinitely long uniformly distributed bit string into the strings for which you're defining probabilities.
In any case you're straying well away from what S.I. is defined to be.
I guess I left out too many details. Let's say a "hypothesis" is a program that accepts a bit string S and prints a sequence of non-negative rational numbers. For each S, that sequence has to be non-decreasing and converge to a limit P(S). The sum of P(S) over all S has to be between 0 (exclusive) and 1 (inclusive). Such an object is called a "lower-semicomputable semimeasure". SI is a universal lower-semicomputable semimeasure, which means it dominates any other up to a multiplicative constant. See page 8 of this paper for more on which classes of (semi-)measures have or don't have a universal element.
Yes, SI is such a measure, but the individual programs within SI as usually defined are not. You aren't feeding the input data into programs in SI, you match program outputs to the input data.
Did you deliberately ignore the qualification "if the observable universe is a big but finite-sized computer"?
I'm pointing out that as far as we know, it's not so.
Also, "then our predictions will only be wrong a limited number of times" is trivially true if the observation string is finite, but " and after that we'll predict the correctly every time." is as far as we know, false (we run out of input string before we gather enough data, even if everything is finite).
We know that the observable universe has a finite size. We can predict arbitrarily (but not infinitely) into the future and with an arbitrary (but not infinite) resolution with a finite model.
A true random number generator is equivalent to a lookup table built into the laws of physics, so long as you only pick a random number finitely many times.
In this case, that frequentist prediction you just quoted is worthless. Our predictions will be wrong a limited number of times only because we're dealing with a model that extends a limited distance into the future, and random numbers will only occur a limited number of times.
However, it works fine from a bayesian perspective. Once you figure out the main part of the program and just start building up the random number table, it will act the same as if you just told it the results would be random and to give probability distributions rather than just sets of possibilities.
The observable universe being a finite-sized deterministic classical computer is a stronger assumption than the universe being something which only produces observable strings according to computable probability distributions (which is a way to formalize the Church–Turing thesis).
Solomonoff induction only requires the latter assumption.