My problem is that A is defined as the output of the optimizer and M0 is defined as A, so P(A|ref) is central to the entire inequality. But what is the output of an optimizer when there is no optimizer? The given examples (Daniel's and John's) both gloss over the question of P(A|ref) and implicitly treat it as uniform over the possible choices the optimizer could have made. In the box-with-slots examples, what happens if there is no optimizer? I don't know.
In the MMO example, what is the output without a player-optimizer? I don't think it's a random...
I think this implicitly assumes that P(A|ref) is uniform over all 10,000 options. In a video game, I'd expect the "reference" to be to always output 0s, since the player isn't interacting. Then the KL divergence could be arbitrarily large. But it's not really clear in general how to interpret the reference distribution; perhaps someone can clarify?
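To make the worry concrete, here is a rough sketch of how much the choice of reference matters. Everything in it is an assumption for illustration: I take the divergence in the direction D(P(A|opt) || P(A|ref)), model the optimizer as deterministically picking one hypothetical option (`chosen`), and model the "idle player" reference as putting mass `1 - eps` on option 0 with `eps` spread evenly over the rest. None of these names or numbers come from the original examples apart from the 10,000 options.

```python
import math

def kl_bits(p, q):
    """D(p || q) in bits; infinite if q has zero mass somewhere p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

N = 10_000      # number of options, as in the example
chosen = 1234   # hypothetical option the optimizer always picks

# Optimized distribution: all mass on the chosen option.
optimized = [0.0] * N
optimized[chosen] = 1.0

# Reference 1: uniform over all options.
uniform_ref = [1.0 / N] * N

def idle_ref(eps):
    """Reference 2 (assumed): almost always option 0, eps spread over the rest."""
    q = [eps / (N - 1)] * N
    q[0] = 1.0 - eps
    return q

kl_uniform = kl_bits(optimized, uniform_ref)  # equals log2(N)
kl_idle = [kl_bits(optimized, idle_ref(eps)) for eps in (1e-3, 1e-6, 1e-9)]
kl_point_mass = kl_bits(optimized, idle_ref(0.0))  # reference never picks `chosen`

print(kl_uniform, kl_idle, kl_point_mass)
```

Under the uniform reference the answer is just log2(10,000) bits. Under the idle reference it grows without bound as `eps` shrinks, and is infinite at `eps = 0`, so the measured "optimization" depends almost entirely on what you decide the reference is.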
I'm intrigued by these examples, but I'm not sure they translate. It sounds like you are interpreting "the difference in file size, in bits, between the reference and optimized versions" as the thing the KL divergence is measuring, but I don't think that's true. I'm assuming here that the reference is the case where the first step does nothing and outputs the input file unchanged (effectively just case 1). Let's explicitly assume that the input file is a randomly chosen English word.
Suppose a fourth case where our "optimizer" outputs the file "0" regardless of input. ...