Stephen_Jordan — LessWrong

LESSWRONG
LW

Replying toPascal's Mugging: Tiny Probabilities of Vast Utilities

Pascal's Mugging: Tiny Probabilities of Vast Utilities

It seems to me that the cancellation is an artifact of the particular example, and that it would be easy to come up with an example in which the cancellation does not occur. For example, maybe you have previous experience with the mugger. He has mugged you before about minor things and sometimes you have paid him and sometimes not. In all cases he has been true to his word. This would seem to tip the probabilities at least slightly in favor of him being truthful about his current much larger threat.

Replying toA Priori

Stephen_Jordan18y

A Priori

William,

By considering models in the first place, one is already using Occam's razor. With no preference for simplicity in the priors at all, one would start with uniform priors for all possible data sequences, not finite-parameter models of data sequences. If you formalize models as being programs for Turing machines which have a separate tape for inputting the program, and your prior is a uniform distribution over possible inputs on that tape, you exactly recover the 2^-k Occam's razor law, where k is the number of program bits that the Turing machine reads during its execution (i.e. the Kolmogorov complexity).

Interestingly, the degree to which one can outperform this distribution is provably bounded.... (read more)

Replying toOccam's Razor

Stephen_Jordan18y

Occam's Razor

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

Replying toOccam's Razor

Stephen_Jordan18y

Occam's Razor

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.