This is a tangent, but might be important.
or a non-EUM with legible goals (like a risk-averse money-maximizer)
Our prototypical examples of risk-averse money-maximizers are EUMs. In particular, the Kelly bettor is probably the most central example: it maximizes expected log wealth (i.e. log future money). The concavity of the logarithm makes it risk averse: a Kelly bettor will always take a sure payoff over an uncertain outcome with the same expected payoff.
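To make the risk aversion concrete (my numbers, just for illustration): a sure $1.00 versus a fair coinflip paying $0.50 or $1.50 has the same expected payoff, but the log-utility (Kelly) agent strictly prefers the sure thing.

```python
import math

# Toy illustration (my own numbers, not from the thread): a sure $1.00 versus a
# fair coinflip paying $0.50 or $1.50. Both have expected payoff $1.00.
expected_log_sure = math.log(1.00)                               # = 0.0
expected_log_bet = 0.5 * math.log(0.50) + 0.5 * math.log(1.50)   # ≈ -0.144

# Concavity of log => the Kelly bettor takes the sure payoff.
assert expected_log_sure > expected_log_bet
```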
I bring this up mainly because the wording makes it sound like you're under the impression that being an EUM is inconsistent with being a risk-averse money-maximizer, in which case you probably have an incorrect understanding of the level of generality of (nontrivial) expected utility maximization, and should probably update toward EUMs being a better model of real agents more often than you previously thought.
Sounds like you've correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won't work, though.
Hand-wavily: the problem is that, if we take the latent to be a deterministic function of X, then P[Λ|X] has lots of zeros in it - not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.
I'd recommend looking for a stochastic function of X instead. Unfortunately that does mean that low entropy of Λ given X has to be proven.
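A toy version of the blow-up, with made-up numbers: KL(P||Q) is infinite as soon as Q puts exactly zero mass somewhere P has support, while an approximate zero just gives a large but finite value.

```python
import numpy as np

P = np.array([0.5, 0.5])
Q_true_zero   = np.array([1.0, 0.0])        # a hard zero, like a deterministic conditional
Q_approx_zero = np.array([1.0 - 1e-9, 1e-9])

def kl(p, q):
    # Terms with p == 0 contribute nothing; q == 0 where p > 0 gives infinity.
    return sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl(P, Q_true_zero))    # inf (numpy warns about divide-by-zero)
print(kl(P, Q_approx_zero))  # large but finite, roughly 9.7
```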
Some details mildly off, but I think you've got the big picture basically right.
Alternatively, if I didn’t tell you the label, you could estimate it from either X_1 or X_2 equally well. This is the other two diagrams.
Minor clarification here: the other two diagrams say not only that I can estimate the label equally well from either X_1 or X_2, but that I can estimate the label (approximately) equally well from X_1, X_2, or the pair (X_1, X_2).
I think that the full label will be an approximate stochastic natural latent.
I'd have to run the numbers to check that 200 flips is enough to give a high-confidence estimate of the label (in which case 400 flips from the pair of variables will also put high confidence on the same value with high probability), but I think yes.
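For what it's worth, here's a quick simulation of the kind of check I have in mind (the uniform prior over the bias and treating the label as the bias itself are my own stand-in assumptions, not anything you specified):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in setup (assumption): the bias theta is drawn uniformly, X1 and X2 are
# each 200 flips of that coin, and the "label" is theta (or a discretization of it).
n_trials, n_flips = 100_000, 200
theta = rng.uniform(0, 1, n_trials)
est1 = rng.binomial(n_flips, theta) / n_flips      # estimate from X1 alone
est2 = rng.binomial(n_flips, theta) / n_flips      # estimate from X2 alone
est_pair = (est1 + est2) / 2                       # pooled estimate from all 400 flips

print(np.mean(np.abs(est1 - theta)))               # roughly 0.022: typical error from 200 flips
print(np.mean(np.abs(est_pair - theta)))           # roughly 0.016: the pair pins down nearly the same value
print(np.mean((est1 > 0.5) == (theta > 0.5)))      # roughly 0.97: the "first bit" from X1 alone
```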
But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent because, with reasonably high certainty, you can guess the first bit of Λ from X_1 or X_2.
Not quite; I added some emphasis. The first bit will (approximately) satisfy the two redundancy conditions, i.e. the first bit is (approximately) independent of X_1 given X_2 and independent of X_2 given X_1, and indeed will be an approximately deterministic function of X. But it won't (approximately) satisfy the mediation condition, i.e. X_1 independent of X_2 given the latent; the two sets of flips will not be (approximately) independent given only the first bit. (At least not to nearly as good an approximation as the original label.)
That said, the rest of your qualitative reasoning is correct. As we throw out more low-order bits, the mediation condition becomes less well approximated, the redundancy conditions become better approximated, and the entropy of the coarse-grained latent given X falls.
So to build a proof along these lines, one would need to show that a bit-cutoff can be chosen such that bit_cutoff(Λ) still mediates (to an approximation roughly ε-ish), while making the entropy of bit_cutoff(Λ) low given X.
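Here's a rough numerical sketch of that trade-off (the 8-bit discretization of the bias, the uniform prior, and 200 flips per variable are all stand-in assumptions of mine): for each cutoff k it computes the mediation error I(X_1; X_2 | bit_cutoff(Λ)) and the determinism error H(bit_cutoff(Λ) | X_1, X_2).

```python
import numpy as np
from scipy.stats import binom

# Stand-in setup (my assumptions): bias theta uniform on a 256-point grid,
# Lambda = the 8-bit binary expansion of theta, X1 and X2 = head-counts from
# 200 flips each. bit_cutoff keeps only the top k bits of Lambda.
n_flips, n_bits = 200, 8
thetas = (np.arange(2 ** n_bits) + 0.5) / 2 ** n_bits
counts = np.arange(n_flips + 1)
px_given_theta = binom.pmf(counts[None, :], n_flips, thetas[:, None])  # shape (256, 201)
p_theta = np.full(len(thetas), 1.0 / len(thetas))

def H(p):
    """Entropy in bits of an array of probabilities (zero entries contribute nothing)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

for k in range(1, n_bits + 1):
    lam = np.arange(2 ** n_bits) >> (n_bits - k)   # bit_cutoff(Lambda) for each theta
    # Joint distribution P[lam, x1, x2], summing thetas within each lam-bucket.
    joint = np.zeros((2 ** k, n_flips + 1, n_flips + 1))
    for i in range(len(thetas)):
        joint[lam[i]] += p_theta[i] * np.outer(px_given_theta[i], px_given_theta[i])
    p_lam = joint.sum(axis=(1, 2))
    # Mediation error: I(X1; X2 | bit_cutoff(Lambda))
    cmi = sum(
        p_lam[l] * (H(joint[l].sum(axis=1) / p_lam[l])
                    + H(joint[l].sum(axis=0) / p_lam[l])
                    - H(joint[l] / p_lam[l]))
        for l in range(2 ** k)
    )
    # Determinism error: H(bit_cutoff(Lambda) | X1, X2)
    h_lam_given_x = H(joint) - H(joint.sum(axis=0))
    print(f"bits kept {k}: I(X1;X2|lam) = {cmi:.3f}, H(lam|X1,X2) = {h_lam_given_x:.3f}")
```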
I do think this is a good angle of attack on the problem, and it's one of the main angles I'd try.
If a latent satisfies the three natural latent conditions within ε, we can always find a (potentially much bigger) ε′ such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with 'almost the same' ε. Does this sound right?
Yes. Indeed, if we allow ε large enough (possibly scaling with system size/entropy) then there's always a deterministic natural latent regardless; the whole thing becomes trivial.
Oh, you're right. Man, I was really not paying attention before bed last night! Apologies, you deserve somewhat less tired-brain responses than that.
A natural latent is, by definition, a latent which satisfies two properties. The first is that the observables are IID conditional on the latent, i.e. the common construction you're talking about. That property by itself doesn't buy us much of interest, for our purposes, but in combination with the other property required for a natural latent, it buys quite a lot.
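For reference, here is my shorthand for the two properties in the two-observable case, using the same mediation/redundancy language as above (this is a paraphrase, not a quote from the posts):

```latex
% Mediation: the observables are independent (here, IID) given the latent
X_1 \perp X_2 \mid \Lambda
% Redundancy: the latent can be estimated from either observable alone
\Lambda \perp X_1 \mid X_2 \qquad \text{and} \qquad \Lambda \perp X_2 \mid X_1
```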
Probably not going to have a discussion on the topic right now, but out of honest curiosity: did you read the bill?
Time spent running.
Note that the "thing which is compressed" is the optimization target, which is not the whole system. So in your example, it probably wouldn't be the code itself which is compressed, but rather the runtime of the program (holding the output constant), or maybe the runtimes of a bunch of programs on the machine, or something along those lines.
Fixed. Good catch, thanks!
Note that the same mistake, but with convexity in the other direction, also shows up in the OP:
An EUM can totally prefer a probabilistic mixture of two options to either option individually; this happens whenever utility is convex with respect to resources (e.g. money). For instance, suppose an agent's utility is u(money) = money^2. I offer this agent a $1 bet on a fair coinflip at even odds, i.e. it gets $0 if it loses and $2 if it wins. The agent takes the bet: the bet offers u = 0.5*0^2 + 0.5*2^2 = 2, while the baseline offers a certainty of $1 which has u = 1.0*1^2 = 1.
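Same numbers as above, just as a runnable sanity check:

```python
# Reproducing the numbers from the example above: convex utility u(m) = m^2.
u = lambda money: money ** 2

expected_u_bet = 0.5 * u(0) + 0.5 * u(2)   # = 2.0
u_sure_thing = u(1)                        # = 1.0
assert expected_u_bet > u_sure_thing       # the EUM prefers the probabilistic mixture
```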