My problem is that A is defined as the output of the optimizer, M0 is defined as A, so P(A|ref) is central to the entire inequality. However, what is the output of an optimizer if we are without the optimizer? The given examples (Daniel's and John's) both gloss over the question of P(A|ref) and implicitly treat it as uniform over the possible choices the optimizer could have made. In the box-with-slots examples, what happens if there is no optimizer? I don't know.
In the MMO example, what is the output without a player-optimizer? I don't think it's a randomly chosen string of 10,000 bit inputs. No MMO I've ever played chooses random actions if you walk away from it. Yet Daniel's interpretation assumes that that's the distribution. Anything else, the player choosing the least likely reference outcome can beat the bounds in Daniel's answer. I.e. his example makes it clear that bits-of-optimization applied by the player does not correspond to bits-of-input, unless the reference is a randomly chosen string of inputs. And in that case, the bound feels trivial and uninsightful. If every possible action I can choose has a p chance of happening without me, then the output that I choose will have had a chance of p by definition. And the distribution of outcomes I selected will then always have had at least a p chance of having been selected without me (plus some chance that it happened through other possible output choices). No math needed to make me believe that!
None of this applies to the equation itself. It works for any choice of P(A|ref). But I think that changes the interpretations given (such as Daniels) and without it I'm not sure I that this builds intuition for anything in the way that I think it's trying to do. Is "uniformly choose an output" really a useful reference? I don't think it is useful for intuition. And in the useful references I choose (constant output), the bound becomes trivial (infinite KL divergence). So what is a useful choice?
I think this assumes implicitly that P(A|ref) is uniformly distributed over all the 10,000 options. In a video game I‘d think more that the ”reference” is always to output 0s since the player isn’t interacting. Then The KL divergence could be arbitrarily large. But it’s not really clear in general how to interpret the reference distribution, perhaps someone can clarify?
I'm intrigued by these examples but I'm not sure it translates. It sounds like you are interpreting "difference of size of file in bits between reference and optimized versions" as the thing the KL divergence is measuring, but I don't think that's true. I'm assuming here that the reference is where the first step does nothing and outputs the input file unchanged (effectively just case 1). Let's explicitly assume that the input file is a randomly chosen English word.
Suppose a fourth case where our "optimizer" outputs the file "0" regardless of input. The end result is a tiny zip file. Under the "reference" condition, the original file is zipped and is still a few bytes, so we have reduced the file size by a few bytes at most. However, the KL divergence is infinite! After all "0" is not an English word and so it's zip never appears in the output distribution of the reference but occurs with probability 1 under our optimizer. So the KL divergence is not at all equal to the number of bits of filesize reduced. Obviously this example is rather contrived, but it suffices to show that we can't directly translate intuition about filesizes to intuition about bits-of-optimization as measured by KL divergence.
Were you going for a different intuition with these examples?