All of tgbrooks's Comments + Replies

I'm intrigued by these examples but I'm not sure it translates. It sounds like you are interpreting "difference of size of file in bits between reference and optimized versions" as the thing the KL divergence is measuring, but I don't think that's true. I'm assuming here that the reference is where the first step does nothing and outputs the input file unchanged (effectively just case 1).  Let's explicitly assume that the input file is a randomly chosen English word.

Suppose a fourth case where our "optimizer" outputs the file "0" regardless of input. …
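For concreteness, this fourth case can be sketched numerically (a hypothetical toy model: four equally likely input words, a pass-through reference first step, and the KL divergence taken between output distributions; the direction of the divergence is my assumption):

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; inf if p has mass where q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

# Toy model: the input file is one of four equally likely "words",
# and the reference first step passes its input through unchanged.
words = ["cat", "dog", "fish", "bird"]
reference = {w: 0.25 for w in words}

# Fourth case: an "optimizer" that always emits the same file.
constant = {"cat": 1.0}

print(kl_bits(constant, reference))  # 2.0 bits: log2(1/0.25)
print(kl_bits(reference, constant))  # inf: reference has mass on outputs
                                     # the constant optimizer never emits
```

Whichever direction one takes the divergence, the constant optimizer ends up maximally far from the pass-through reference rather than close to it.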

ACrackedPot
It isn't the thing that the KL divergence is measuring; it is an analogy for it. The KL divergence measures an amount of informational entropy; strictly speaking, zipping a file has no effect in those terms. However, we can carry those examples more or less intact into informational-entropy terms; the third gets a little weird in the doing, however. So, having an intuition for what the ZIP file does, the equivalent "examples":

Example 1: KLE(reference optimizer output stage, ineffective optimizer output) is 0; KLE(reference final stage, ineffective optimizer final stage) is also 0. Not appropriate as a way of thinking about this, but helpful to frame the next two examples.

Example 2: KLE(reference optimizer output stage, antieffective optimizer output) is -N; KLE(reference final stage, antieffective optimizer final stage) is 0. Supposing an antieffective optimizer and an optimizer can both exist such that a later optimizer optimizes away the inefficiency introduced by the antieffective one, we may observe a violation of the conclusion that the KL distance at step N+1 must be less than or equal to the KL distance at step N, relative to their reference cases.

Example 3: The central conceit of this example is harder to translate; the basic idea is that one optimization can make a future optimization more efficient. Critically, the future optimization can vary in effectiveness depending on its input. Suppose, for a moment, a state-reduction algorithm on a set of n states, along the lines of the opening example:

- For n < 64, each state n maps to the state ceiling(n / 4) [2 bits of optimization]
- For n ≥ 64, each state n maps to the state ceiling(n / 2) [1 bit of optimization]

Then, for a probability distribution such that the number of states is greater than 64 but less than 128, a precursor optimization which halves the number of
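The state-reduction map in Example 3 can be played with directly (a minimal sketch; the uniform distribution over 100 states and the precursor halving step are my assumptions, chosen to fit the 64-to-128 range described above):

```python
import math
from collections import Counter

def step(n):
    # Example 3's map: small states compress by 2 bits, large by 1 bit.
    return math.ceil(n / 4) if n < 64 else math.ceil(n / 2)

def entropy_bits(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

states = range(1, 101)  # between 64 and 128 states, per the example
before = Counter(states)                  # uniform over the 100 states
direct = Counter(step(s) for s in states) # apply the map once
halved_first = Counter(step(math.ceil(s / 2)) for s in states)  # precursor halving

print(entropy_bits(before))        # log2(100), about 6.64 bits
print(entropy_bits(direct))        # about 5.04 bits
print(entropy_bits(halved_first))  # about 3.68 bits: the precursor puts
                                   # every state into the 2-bit regime
```

Applied directly, the map removes about 1.6 bits; after the precursor halving step, it removes nearly 2 bits, which is the "one optimization makes a future optimization more efficient" effect.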

My problem is that A is defined as the output of the optimizer, M0 is defined as A, so P(A|ref) is central to the entire inequality. However, what is the output of an optimizer if we are without the optimizer? The given examples (Daniel's and John's) both gloss over the question of P(A|ref) and implicitly treat it as uniform over the possible choices the optimizer could have made. In the box-with-slots examples, what happens if there is no optimizer? I don't know.

In the MMO example, what is the output without a player-optimizer? I don't think it's a random …

kave
Good point! It seems like it would be nice in Daniel's example for P(A|ref) to be the action distribution of an "instinctual" or "non-optimising" player. I don't know how to recover that. You could imagine something like an n-gram model of player inputs across the MMO.
tgbrooks

I think this implicitly assumes that P(A|ref) is uniformly distributed over all 10,000 options. In a video game, I'd think the "reference" is to always output 0s, since the player isn't interacting. Then the KL divergence could be arbitrarily large. But it's not really clear in general how to interpret the reference distribution; perhaps someone can clarify?
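The dependence on the reference can be made concrete (a hypothetical sketch: 10,000 possible inputs, a deterministic player, and two candidate reference distributions; the direction of the divergence is my assumption):

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits over a discrete action space."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

K = 10_000                       # number of possible inputs
player = {42: 1.0}               # deterministic "optimizer": always input 42

# Uniform reference: the player sits exactly log2(K) bits from reference.
uniform_ref = {a: 1 / K for a in range(K)}
print(kl_bits(player, uniform_ref))   # log2(10000), about 13.29 bits

# "Idle player" reference: almost all mass on the null input 0.
# The divergence grows without bound as eps shrinks.
eps = 1e-9
idle_ref = {a: 1 - eps if a == 0 else eps / (K - 1) for a in range(K)}
print(kl_bits(player, idle_ref))      # about 43 bits; unbounded as eps -> 0
```

Under the uniform reference the divergence is capped at log2(10,000) bits, while under the idle-player reference it can be made as large as one likes, which is the "arbitrarily large" point above.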

kave
I agree on the "reference" distribution in Daniel's example. I think it generally means "the distribution over the random variables that would obtain without the optimiser". What exactly that distribution is / where it comes from I think is out-of-scope for John's (current) work, and I think is kind of the same question as where the probabilities come from in statistical mechanics.