
cousin_it comments on More and Less than Solomonoff Induction - Less Wrong Discussion

4 Post author: Manfred 21 May 2014 04:55AM




Comment author: cousin_it 21 May 2014 02:19:58PM *  1 point [-]

Solomonoff induction can handle that.

Comment author: private_messaging 21 May 2014 05:46:02PM 0 points [-]

Yeah, though the shortest hypothesis that works keeps growing in length, which is really bad if you want to compute something similar in practice.

Comment author: cousin_it 21 May 2014 08:27:25PM *  1 point [-]

That depends on what you mean by "shortest hypothesis". You can view Solomonoff induction as a mixture of all deterministic programs or, equivalently, a mixture of all computable distributions. If a "hypothesis" is a deterministic program, it will keep growing in length when given random bits. On the other hand, if a "hypothesis" is a computable distribution, its length might well stay bounded and not very large.
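A toy Python sketch of the distinction (my own illustration, with made-up function names, not part of the thread): for data from a fair coin, a deterministic-program "hypothesis" must literally contain the observed bits, so its length grows with the data, while the distribution "hypothesis" P(S) = 2^-|S| is a single fixed-size program.

```python
def coin_distribution(s: str) -> float:
    """Computable-distribution hypothesis: assigns probability 2^-|s|
    to the bit string s. Its description length is constant."""
    return 2.0 ** -len(s)

def deterministic_hypothesis(observed: str) -> str:
    """The shortest deterministic program reproducing random-looking data
    is essentially 'print the data', so its length grows with the data."""
    return f"print({observed!r})"

data = "0110101100"
assert coin_distribution(data) == 2.0 ** -10
assert len(deterministic_hypothesis(data)) > len(data)  # grows with the data
```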

Comment author: private_messaging 22 May 2014 04:01:00AM *  -1 points [-]

Well, it'd be an utter mess in which you can't separate the "data" bits (the ones that are transformed into the final probability distribution, if we assume they were random) from the "code" bits (the ones that do the transformation). Indeed, given that the universe itself can run computations, all "data" bits are also code.

Comment author: cousin_it 22 May 2014 02:40:09PM *  1 point [-]

If a "hypothesis" is a computable distribution, it doesn't need to include the random bits. (A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.)

Comment author: private_messaging 23 May 2014 05:53:06AM *  1 point [-]

> A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.

It's got to be a little more than that, for the probabilities of different bit strings to add up to 1. The obvious way to do that without having to sum all the "probabilities" and renormalize is to map an infinitely long, uniformly distributed bit string onto the strings whose probabilities you're defining.
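A minimal sketch of that mapping (a toy decoder of my own, not from the comment): the program consumes uniform random bits until the first 1 and outputs the count of zeros, so output n gets probability 2^-(n+1), and normalization comes for free from the measure on the random bits.

```python
from fractions import Fraction

def decode(bits):
    """Toy 'code' part: consume uniform random bits until the first 1,
    output the number of zeros seen. This maps an infinite uniform bit
    string to an integer, inducing a geometric distribution."""
    n = 0
    for b in bits:
        if b == 1:
            return n
        n += 1

assert decode([0, 0, 1, 0]) == 2

# Output n has probability 2^-(n+1): the measure of uniform bit strings
# starting with n zeros and then a one. These sum toward 1 automatically,
# with no explicit renormalization step.
probs = [Fraction(1, 2 ** (n + 1)) for n in range(30)]
assert sum(probs) == 1 - Fraction(1, 2 ** 30)
```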

In any case, you're straying well away from what S.I. is defined to be.

Comment author: cousin_it 23 May 2014 10:29:10AM *  0 points [-]

I guess I left out too many details. Let's say a "hypothesis" is a program that accepts a bit string S and prints a sequence of non-negative rational numbers. For each S, that sequence has to be non-decreasing and converge to a limit P(S). The sum of P(S) over all S has to be between 0 (exclusive) and 1 (inclusive). Such an object is called a "lower-semicomputable semimeasure". SI is a universal lower-semicomputable semimeasure, which means it dominates any other up to a multiplicative constant. See page 8 of this paper for more on which classes of (semi-)measures have or don't have a universal element.
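The definition above can be sketched concretely (a toy target measure of my own choosing, for illustration only): a "hypothesis" prints non-decreasing rational lower bounds converging to P(S), and summing P(S) over all bit strings stays within (0, 1].

```python
from fractions import Fraction

def P(s: str) -> Fraction:
    """Toy target: P(s) = 2^-(2|s|+1). Summed over all bit strings this
    gives sum_n 2^n * 2^-(2n+1) = 1, so it is in fact a measure."""
    return Fraction(1, 2 ** (2 * len(s) + 1))

def approximations(s: str, steps: int):
    """The 'hypothesis' program: emit non-decreasing rational lower bounds
    converging to P(s), as in the definition of lower semicomputability."""
    return [P(s) * (1 - Fraction(1, 2 ** k)) for k in range(1, steps + 1)]

seq = approximations("0110", 20)
assert all(a <= b for a, b in zip(seq, seq[1:]))   # non-decreasing
assert seq[-1] < P("0110")                         # lower bounds from below
assert P("0110") - seq[-1] < Fraction(1, 10 ** 5)  # converging to the limit
```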

Comment author: private_messaging 23 May 2014 12:18:40PM *  1 point [-]

Yes, SI is such a measure, but the individual programs within SI as usually defined are not. You aren't feeding the input data into the programs in SI; you match program outputs to the input data.

Comment author: cousin_it 25 May 2014 02:16:17PM *  1 point [-]

I think you can give an equivalent definition of SI in terms of programs that consume input data and print rational numbers. Or at least, I don't see any serious obstacles to doing so. It will also be expensive to approximate, and you'll need to renormalize as you go (and do other things, like converting all sequences of rationals to non-decreasing sequences), but at least you won't have the problem that "hypotheses" grow without bound.

Comment author: private_messaging 26 May 2014 12:47:40PM *  1 point [-]

That might be useful in some contexts, yes. But you'll still have both the program that is longer but gives probability 1 to the observations so far, and a program that is shorter but gives a much, much smaller probability (on the order of 2^-L, where L is the length difference).
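A quick arithmetic check of that trade-off (the lengths K and L are illustrative numbers of my own): under the usual 2^-length prior, the two programs end up with exactly the same total weight, since 2^-(K+L) · 1 = 2^-K · 2^-L.

```python
from fractions import Fraction

K, L = 50, 30  # illustrative: shorter program length K, length difference L

# Longer program (length K+L): assigns probability 1 to the data so far.
longer = Fraction(1, 2 ** (K + L)) * 1            # prior * likelihood
# Shorter program (length K): assigns probability 2^-L to the same data.
shorter = Fraction(1, 2 ** K) * Fraction(1, 2 ** L)

assert longer == shorter  # the prior penalty and the likelihood penalty cancel
```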

I think the hypotheses in physics correspond not to individual programs in SI, but to common prefixes. For example, QM would describe a program that converts subsequent random bits into predicted observations with the correct probability distribution. You could treat a prefix as a probability distribution over outputs: the distribution that results from prepending that prefix to random bits. The bonus is that very simple operations on the random bits correspond to high-level mathematical functions that are otherwise difficult to write in any low-level TM-like representation.
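One way to picture a prefix as a distribution over outputs (a hypothetical stand-in of my own, far simpler than anything physics-like): fix a prefix that applies some simple operation to the random bits that follow it, and read off the induced output distribution by sampling random suffixes.

```python
import random

def prefix_program(random_bits):
    """Hypothetical stand-in for a fixed prefix: turns a stream of uniform
    random bits into an 'observation' (here, the XOR of three bits)."""
    return random_bits[0] ^ random_bits[1] ^ random_bits[2]

random.seed(0)
samples = [prefix_program([random.getrandbits(1) for _ in range(3)])
           for _ in range(100_000)]

# XOR of fair bits is fair, so this prefix induces the 50/50 distribution
# over outputs; the empirical frequency should sit close to 1/2.
freq1 = sum(samples) / len(samples)
assert abs(freq1 - 0.5) < 0.01
```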