# jacob_cannell comments on The Generalized Anti-Pascal Principle: Utility Convergence of Infinitesimal Probabilities - Less Wrong

-3 18 December 2011 11:47PM

Comment author: 20 December 2011 09:49:18PM *  1 point [-]

its computed utilities will then be random samples from the utility function over the space of all programs, and should then converge to the mean of the utility function by the central limit theorem.

Well the mean of the utility function is just the expected utility.

There are a number of utility terms in the AIXI equation. The utility function is evaluated for every hypothesis/program/universe, forward-evaluated over all future action paths, giving one best utility for just that universe. The total expected utility is then the sum over all valid universes, weighted by their complexity penalty.
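
As a rough illustration of that weighted sum, here is a minimal sketch and not AIXI itself. It assumes a made-up toy "universe" (`echo_run`, which just replays the program's own bits), a made-up utility (`last_bit_utility`), and a finite cutoff `max_len`; it also skips the maximisation over future action paths entirely. All of those names are hypothetical and exist only for this example.

```python
from itertools import product


def echo_run(program: str, n_bits: int) -> str:
    """Hypothetical toy universe: output the program's own bits, cyclically."""
    return "".join(program[i % len(program)] for i in range(n_bits))


def last_bit_utility(program: str) -> float:
    """Hypothetical bounded utility, symmetric around zero."""
    return 1.0 if program[-1] == "1" else -1.0


def toy_expected_utility(observations: str, max_len: int, run, utility) -> float:
    """Sum utility over all bitstring programs of length 1..max_len that
    reproduce the observations exactly, each weighted by 2**-length."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            program = "".join(bits)
            # Binary filter: one wrong bit and the program is ignored.
            if run(program, len(observations)) == observations:
                total += 2.0 ** -length * utility(program)
    return total


print(toy_expected_utility("", 3, echo_run, last_bit_utility))
print(toy_expected_utility("1", 3, echo_run, last_bit_utility))
```

With an empty observation history nothing is filtered, so in this toy the positive and negative utilities at each length cancel; adding an observation breaks that symmetry for the short programs.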

By 'mean of the utility function', I meant the mean of the utility function over all possible universes rather than just valid universes. The validity constraint forces the expected utility to diverge from the mean of the utility function - it must for the agent to make any useful decisions!

So the total expected utility is not normally the mean utility, but it reduces to it in the case where the observation filter is removed.

The way I'm approaching this is to ask whether most of the expected utility comes from high probability events or low probability ones

My entire post concerns the subset of universes with probabilities approaching 1/infinity, corresponding to programs with length going to infinity. The high-probability scenarios (shorter-program universes) don't matter in mugger scenarios: we categorically assume they all have boring, extremely low utilities (the mugger is joking/lying/crazy).

Your observations have some probability P(T|N) to retain a hypothesis of length N. I don't see why this would depend that strongly on the value of N.

In AIXI-models, hypothesis acceptance is not probabilistic, it is completely binary: a universe program either perfectly fits the observation history or it does not. If even 1 bit is off, the program is ignored.

It's unfortunate that I started using N for program length in my prior post; that was a mistake. L was the term for program length in the EU equation. L (program length) matters because of the Solomonoff prior complexity penalty: 2^-L.

How did you get the number 2^-(L - length(O)) as a limit on the amount of the hypothesis space that is filtered (and do you mean retained by the filter or removed by the filter when you say 'filtered')?

This simply comes from the fact that an observation history O can at most filter out only a fraction of the space of programs that are longer than it.

For example, start with an empty observation history O: {}. Clearly, this filters nothing. The space of valid programs of length L, for any L, is simply all possible programs of length L, which is expected to be a set of around 2^L in size. The sum over all programs for L going to infinity is thus the space of everything, the full Tegmark. In this case, the expected utility is simply the mean of the utility function over the full Tegmark.

Now consider O:{1}. We have cut out exactly half of the program space. O:{11} cuts out 3/4 of the Tegmark, and in general an observation history with length(O) filters the universe space down to 2^-length(O) of its previous size, removing a fraction 1 - 2^-length(O) of the possible universes - but there are an infinite number of total universes.
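
The halving claim can be checked by brute force under one strong simplifying assumption: that a program's first output bits are just its first source bits, so "matches O" means "starts with O". That assumption is mine, made only for this sketch, and `surviving_fraction` is a hypothetical helper name.

```python
from itertools import product


def surviving_fraction(observation: str, length: int) -> float:
    """Fraction of all bitstrings of the given length that start with the
    observation string (assumes length >= len(observation))."""
    programs = ["".join(bits) for bits in product("01", repeat=length)]
    survivors = [p for p in programs if p.startswith(observation)]
    return len(survivors) / len(programs)


for obs in ["", "1", "11", "110"]:
    print(repr(obs), surviving_fraction(obs, 8))
```

Under that assumption each extra observed bit halves the surviving set, independent of the program length chosen.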

Now, let's say we are ONLY interested in the contribution of universes of a certain prior likelihood (corresponding to a certain program length). These are the subsets of the Tegmark with programs P where length(P) = L for some L. Each such subset is a FINITE, enumerable set.

Then for JUST the subset of universes with length(P)=L, there are 2^L universes in this set. For an observation history O with length(O) > L, it is not guaranteed that there are any valid programs that match the observation history. It could be 1, could be 0.

However, for length(P) > length(O) + C, for some small C, valid programs are absolutely guaranteed. Specifically, for some constant C there are programs which simply directly encode random strings which happen to align with O. This set of programs corresponds to 'chaos'.

Now consider the limit behavior as complexity goes to infinity. For any fixed observation history with length(O), as length(P) goes to infinity, the chaos set grows at the maximum possible rate, with 2^length(P), and dominates (because the chaos programs just fill extra length with any random bits).

In particular, for observation set O and the subset of universes with length(P)=L, there are expected to be roughly 2^-(length(O)+C) * 2^L observationally valid chaos universes. This simplifies to 2^(L-length(O)-C) valid chaos universes.
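
To make the threshold concrete, here is a small sketch that just evaluates 2^(L - length(O) - C) for a few program lengths. The numbers length(O) = 10 and C = 5 are arbitrary values picked for illustration, and `expected_chaos_count` is a hypothetical helper name.

```python
def expected_chaos_count(program_len: int, obs_len: int, overhead: int) -> float:
    """Expected number of observation-matching chaos programs of a given
    length: 2**(L - length(O) - C)."""
    return 2.0 ** (program_len - obs_len - overhead)


OBS_LEN = 10   # assumed length(O), illustration only
OVERHEAD = 5   # assumed constant C, illustration only

for program_len in (10, 15, 20, 30):
    count = expected_chaos_count(program_len, OBS_LEN, OVERHEAD)
    print(program_len, count)
```

Below L = length(O) + C the expected count is less than one; above it, the count doubles with every extra bit of program length.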

So when length(O)+C > L, there are unlikely to be any valid chaos universes. The expected utility over this subset, EU[L], will then be averaged over a small number of universes, possibly even 1 (if there are any at all that match O), or none. But as L grows larger than length(O)+C, the chaos universes suddenly appear (guaranteed) and their number grows exponentially with L, and the expected utility over that exponentially growing set quickly converges to the mean of the utility function (because the chaos universes are random).

Assuming a utility function with positive/negative bounds normalized around zero, the convergence should be to zero.
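
A quick simulation of that convergence, as a sketch only: it models each chaos universe's utility as an independent uniform draw from [-1, 1], which is a modelling assumption for this example and not something derived from AIXI. `mean_chaos_utility` is a hypothetical name.

```python
import random


def mean_chaos_utility(count: int, rng: random.Random) -> float:
    """Average utility over `count` chaos universes, each assumed to have an
    independent utility drawn uniformly from [-1, 1]."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(count)) / count


rng = random.Random(0)  # fixed seed so runs are repeatable
for extra_bits in (0, 4, 8, 12, 16):
    # 2**extra_bits plays the role of 2^(L - length(O) - C).
    print(extra_bits, mean_chaos_utility(2 ** extra_bits, rng))
```

With one universe the average is anywhere in the range; with 2^16 of them the standard error is already a fraction of a percent of the bound.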

Comment author: 20 December 2011 11:51:28PM 1 point [-]

By 'mean of the utility function', I meant the mean of the utility function over all possible universes rather than just valid universes. The validity constraint forces the expected utility to diverge from the mean of the utility function - it must for the agent to make any useful decisions!

Okay. In that case there are two reasons that mugger hypotheses are still important. First, the unupdated expected utility is not necessarily anywhere near the naive tail-less expected utility. Second, while the central limit theorem shows that updating based on observations is unlikely to produce a shift in the utility of the tails that is large relative to the bounds on the utility function, the shift will still be large relative to the actual utility.

The way I'm approaching this is to ask whether most of the expected utility comes from high probability events or low probability ones

My entire post concerns the subset of universes with probabilities approaching 1/infinity, corresponding to programs with length going to infinity. The high-probability scenarios (shorter-program universes) don't matter in mugger scenarios: we categorically assume they all have boring, extremely low utilities (the mugger is joking/lying/crazy).

The utility of the likely scenarios is essential here. If we don't take into account the utility of $5, we have no obvious reason not to pay the mugger. What is important is the ratio between the utility differences of the various actions due to the likely hypotheses and those due to the high-utility hypotheses.

Your observations have some probability P(T|N) to retain a hypothesis of length N. I don't see why this would depend that strongly on the value of N.

In AIXI-models, hypothesis acceptance is not probabilistic, it is completely binary: a universe program either perfectly fits the observation history or it does not. If even 1 bit is off, the program is ignored.

That is a probability (well really a frequency) taken over all hypotheses of length N (or L if you prefer).

It's unfortunate that I started using N for program length in my prior post; that was a mistake. L was the term for program length in the EU equation. L (program length) matters because of the Solomonoff prior complexity penalty: 2^-L.

The space of valid programs of length L, for any L, is simply all possible programs of length L, which is expected to be a set of around 2^L in size.

Well, an O(1) factor less, since otherwise our prior measure would diverge, but you don't have to write it explicitly; when working with Kolmogorov complexity, you expect everything to be within a constant factor.

Now consider O:{1}. We have cut out exactly half of the program space. O:{11} cuts out 3/4 of the Tegmark, and in general an observation history with length(O) filters the universe space down to 2^-length(O) of its previous size, removing a fraction 1 - 2^-length(O) of the possible universes - but there are an infinite number of total universes.

No, not quite. Observations are not perfectly informative. If someone wanted to optimally communicate their observations, they would use such a system, but a real observation will not be perfectly optimized to rule out half the hypothesis space. We are reading bits from the output of the program, not its source code!
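
A toy machine makes the distinction visible. In this sketch the "output" is not the source: output bit i is the AND of source bits i and i+1. That machine is invented purely for illustration (`toy_output` and `output_match_fraction` are hypothetical names) and says nothing about what a universal machine actually does.

```python
from itertools import product


def toy_output(program: str) -> str:
    """Hypothetical machine: output bit i is the AND of source bits i, i+1."""
    return "".join(
        "1" if a == "1" and b == "1" else "0"
        for a, b in zip(program, program[1:])
    )


def output_match_fraction(observation: str, length: int) -> float:
    """Fraction of length-`length` programs whose output starts with the
    observation string."""
    programs = ["".join(bits) for bits in product("01", repeat=length)]
    matches = [p for p in programs if toy_output(p).startswith(observation)]
    return len(matches) / len(programs)


for obs in ["0", "1", "11"]:
    print(repr(obs), output_match_fraction(obs, 4))
```

On this machine, observing a single "1" leaves a quarter of the programs and observing a single "0" leaves three quarters, so one observed bit does not cut the space exactly in half.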

However, for length(P) > length(O) + C, for some small C, valid programs are absolutely guaranteed. Specifically, for some constant C there are programs which simply directly encode random strings which happen to align with O. This set of programs corresponds to 'chaos'.

I don't think this set behaves how you think it behaves. A fraction 1 - 2^-length(O) of this set will be ruled out, but there are more programs, with more structure than "print this string", that don't get falsified, since they actually have enough structure to reproduce our observation (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility.

Looking at your conclusions, you can actually replace l(O) with K(O) and everything qualitatively survives.

Comment author: 21 December 2011 02:55:55AM *  0 points [-]

The utility of the likely scenarios is essential here. If we don't take into account the utility of $5, we have no obvious reason not to pay the mugger.

No, not necessarily. It could be an arbitrarily small cost: the mugger could say "just look at me for a nanosecond", and this tiny action of almost no cost could still not be worthwhile.

If AIXI cannot find a program P that matches the full observation history O and generates a future we would describe as (mugger really does have matrix powers and causes massive negative reward), under the constraint that length(P) < length(O), then AIXI's expected utility decision for the mugger futures goes to zero. The constraint length(P) < length(O) is a likelihood bound.

AIXI essentially stops considering theories beyond some upper improbability (much longer than its observation history).

but a real observation will not be perfectly optimized to rule out half the hypothesis space.

For AIXI, each observation rules out exactly half of the hypothesis space, because its hypothesis space is the entirety of everything.

there are more programs, with more structure than "print this string", that don't get falsified, since they actually have enough structure to reproduce our observation (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility

No - this is a contradiction. The programs of K(O) bits are the first valid universes, and by the definition/mapping of the mugger problem to AIXI-logic, those correspond to the mundane worlds where the mugger is [joking,lying,crazy]. If the program is valid and it is K(O) bits, then the leftover bits can't matter - as you said yourself they are unobservable! And any unobservable bits are thus unavailable to the utility function.

Moreover, they are necessarily just repeats: if the program is K(O) bits, then it has appeared far earlier than length(O) in the ensemble, and is some mundane low-utility universe.