Dalcy

Nothing is "mere." I, too, can see the stars on a desert night, and feel them. But do I see less or more? The vastness of the heavens stretches my imagination - stuck on this carousel, my little eye can catch one-million-year-old light. A vast pattern - of which I am a part - perhaps my stuff was belched from some forgotten star, as one is belching there. Or see them with the greater eye of Palomar, rushing all apart from some common starting point when they were perhaps all together. What is the pattern, or the meaning, or the why? It does not do harm to the mystery to know a little about it.

                          - Richard P. Feynman on The Relation of Physics to Other Sciences

Wikitag Contributions

No wikitag contributions to display.

Comments

X explains Z% of the variance in Y
Dalcy · 18d

Conventionally, Var[Y|X] is a random variable, just as E[Y|X] is a random variable. To be fair, the conventions are somewhat inconsistent, given that (as you said) H(Y|X) is a number.
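
A minimal numerical sketch of the distinction (the joint distribution below is made up purely for illustration): E[Y|X] and Var[Y|X] come out as one value per realization of X, i.e. random variables, while H(Y|X) has the average over X built in, so it is a single number.

```python
import numpy as np

# Made-up joint distribution p(x, y) with X in {0, 1} and Y in {0, 1, 2}.
p = np.array([[0.20, 0.10, 0.10],   # p(X=0, Y=y)
              [0.05, 0.25, 0.30]])  # p(X=1, Y=y)
ys = np.array([0.0, 1.0, 2.0])

p_x = p.sum(axis=1)               # marginal p(x)
p_y_given_x = p / p_x[:, None]    # conditional p(y|x), one row per value of x

# E[Y|X] and Var[Y|X]: one value per realization of X, i.e. random variables.
E_Y_given_x = p_y_given_x @ ys
Var_Y_given_x = p_y_given_x @ ys**2 - E_Y_given_x**2
print("E[Y|X=x]:  ", E_Y_given_x)
print("Var[Y|X=x]:", Var_Y_given_x)

# H(Y|X): the average over X is part of the definition, so it is one number (in bits).
H_Y_given_X = -(p * np.log2(p_y_given_x)).sum()
print("H(Y|X):    ", H_Y_given_X)
```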

satchlj's Shortform
Dalcy · 2mo

Previous discussion, comment by johnswentworth:

Relevant slogan: Goodhart is about generalization, not approximation.

[...]

In all the standard real-world examples of Goodhart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.

Alexander Gietelink Oldenziel's Shortform
Dalcy · 2mo

Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers and technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I can simply ask the LLM for a self-contained exposition. Traditional means of self-study like [search engines / Wikipedia / StackExchange] are very often no match for this task, mostly in terms of time spent and wasted effort; simply having someone I can directly ask my highly specific (and often dumb) questions or confusions, and receive equally specific responses, is just really useful.

Dalcy's Shortform
Dalcy · 3mo

Non-Shannon-type Inequalities

The first new qualitative thing in Information Theory when you move from two variables to three is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but the triple mutual information I(X;Y;Z) can be negative.
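
A standard worked example of this (the XOR construction): take X and Y to be independent fair coins and Z = X⊕Y. Then every pairwise mutual information is 0, but I(X;Y|Z) = 1 bit, so under the convention I(X;Y;Z) = I(X;Y) − I(X;Y|Z) the triple mutual information is −1 bit. A quick numerical check:

```python
import itertools
import numpy as np

# Joint distribution of (X, Y, Z): X, Y independent fair bits, Z = X XOR Y.
p = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def H(*idx):
    """Joint entropy (in bits) of the coordinates selected by idx."""
    marg = {}
    for xyz, pr in p.items():
        key = tuple(xyz[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(pr * np.log2(pr) for pr in marg.values() if pr > 0)

I_XY = H(0) + H(1) - H(0, 1)                           # I(X;Y)   = 0
I_XY_given_Z = H(0, 2) + H(1, 2) - H(0, 1, 2) - H(2)   # I(X;Y|Z) = 1
print("I(X;Y)   =", I_XY)
print("I(X;Y|Z) =", I_XY_given_Z)
print("I(X;Y;Z) =", I_XY - I_XY_given_Z)               # = -1
```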

This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.

A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.

  • Given n random variables X_1, …, X_n and α, β, γ ⊆ [n], from now on we write I(α;β∣γ), where each index set stands for the corresponding joint variable (e.g. α = {1,2} stands for (X_1, X_2)).

Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.

The question, then, is whether Shannon-type Inequalities capture all valid information inequalities in n variables. It turns out: yes for n=2, (approximately) yes for n=3, and no for n≥4.

Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:

I(A;B)≤2I(A;B∣C)+I(A;C∣B)+I(B;C∣A)+I(A;B∣D)+I(C;D)

Explanation of the math, for anyone curious.

  • Given n random variables and α, β, γ ⊆ [n], it turns out that the family of inequalities I(α;β∣γ) ≥ 0 is equivalent to the following conditions on joint entropy: H(α∪β)+H(α∩β) ≤ H(α)+H(β) (submodularity), H(α) ≤ H(β) whenever α ⊆ β (monotonicity), and H(∅) = 0.

    This lets us rewrite inequalities involving conditional mutual information purely in terms of joint entropies.

    Let Γ*_n then be a subset of ℝ^(2^n), each element corresponding to the values of the joint entropy assigned to each subset of some random variables X_1, …, X_n. For example, an element of Γ*_2 would be (H(∅), H(X_1), H(X_2), H(X_1,X_2)) ∈ ℝ^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by different random variables (X'_1, X'_2).

    Now let Γ_n be the set of elements of ℝ^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ_2 would be (h_∅, h_1, h_2, h_12) ∈ ℝ^(2^2) satisfying, e.g., h_1 ≤ h_12 (monotonicity). This is also a convex cone, so its elements really do correspond to "nonnegative linear combinations" of Shannon-type inequalities.

    Then, the claim that "nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures" would correspond to the claim that Γ_n = Γ*_n for all n.

    The content of the papers linked above is to show that:

    • Γ_2 = Γ*_2
    • Γ_3 ≠ Γ*_3, but Γ_3 = cl(Γ*_3), the closure[1] of Γ*_3
    • Γ_4 ≠ Γ*_4 and Γ_4 ≠ cl(Γ*_4), and likewise for all n ≥ 4.
     
  1. ^

    This implies that, while there exists a 2^3-tuple satisfying the Shannon-type inequalities that cannot be realized by any random variables X_1, X_2, X_3, there does exist a sequence of random variables (X_1^(k), X_2^(k), X_3^(k)), k = 1, 2, …, whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
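
For anyone who wants to poke at this numerically, here is a minimal sketch (the helper names are my own; nothing here is specific to the linked papers): sample a random joint distribution over n = 4 variables, compute its 2^4-tuple of joint entropies (a point of Γ*_4), and check that it satisfies the Shannon/polymatroid conditions above as well as the Zhang-Yeung inequality.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                                       # four binary random variables
p = rng.random((2,) * n)
p /= p.sum()                                # a random joint distribution

def H(subset):
    """Joint entropy (in bits) of the variables indexed by `subset`."""
    if not subset:
        return 0.0
    axes = tuple(i for i in range(n) if i not in subset)
    marg = p.sum(axis=axes).ravel()
    marg = marg[marg > 0]
    return float(-(marg * np.log2(marg)).sum())

def I(a, b, c=frozenset()):
    """Conditional mutual information I(a;b|c) for index sets a, b, c."""
    a, b, c = set(a), set(b), set(c)
    return H(a | c) + H(b | c) - H(a | b | c) - H(c)

# The 2^n-tuple of joint entropies: one coordinate per subset of {0,...,n-1}.
subsets = [set(s) for r in range(n + 1)
           for s in itertools.combinations(range(n), r)]
entropy_vector = {frozenset(s): H(s) for s in subsets}
print("coordinates:", len(entropy_vector))       # 2^4 = 16

# Shannon / polymatroid conditions: monotonicity and submodularity.
mono = all(H(a) <= H(b) + 1e-9 for a in subsets for b in subsets if a <= b)
submod = all(H(a | b) + H(a & b) <= H(a) + H(b) + 1e-9
             for a in subsets for b in subsets)
print("monotonicity:", mono, " submodularity:", submod)

# Zhang-Yeung inequality with A, B, C, D = the four variables (holds for any distribution).
A, B, C, D = {0}, {1}, {2}, {3}
lhs = I(A, B)
rhs = 2 * I(A, B, C) + I(A, C, B) + I(B, C, A) + I(A, B, D) + I(C, D)
print("Zhang-Yeung slack:", rhs - lhs)           # should be >= 0
```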

quetzal_rainbow's Shortform
Dalcy · 3mo

Relevant: Alignment as a Bottleneck to Usefulness of GPT-3

between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Dalcy · 3mo

By the way, Gemini 2.5 Pro and o3-mini-high are good at tic-tac-toe. I was surprised, because the last time I tested this on o1-preview, it did quite terribly.

The Hessian rank bounds the learning coefficient
Dalcy · 3mo

Where in the literature can I find the proof of the lower bound?

bgold's Shortform
Dalcy · 4mo

Previous discussion, comment by A.H.:

Sorry to be a party pooper, but I find the story of Jason Padgett (the guy who 'banged his head and became a math genius') completely unconvincing. From the video that you cite, here is the 'evidence' that he is a 'math genius':

  • He tells us, with no context, 'the inner boundary of pi is f(x)=x sin(pi/x)'. Ok!
  • He makes 'math inspired' drawings (some of which admittedly are pretty cool but they're not exactly original) and sells them on his website
  • He claims that a physicist (who is not named or interviewed) saw him drawing in the mall, and, on the basis of this, suggested that he study physics.
  • He went to 'school' and studied math and physics. He says he started with basic algebra and calculus and apparently 'aced all the classes', but doesn't tell us what level he reached. Graduate? Post-graduate?
  • He was 'doing integrals with triangles instead of integrals with rectangles'
  • He tells us 'every shape in the universe is a fractal'
  • Some fMRI scans were done on his brain, which found 'he had conscious access to parts of the brain we don't normally have access to'.

[Intuitive self-models] 1. Preliminaries
Dalcy · 4mo

I wrote "your brain can wind up settling on either of [the two generative models]", not both at once.

Ah, that makes sense. So the picture I should have is: whatever local algorithm the brain runs oscillates over time between multiple local MAP solutions that correspond to qualitatively different high-level information (e.g., clockwise vs. counterclockwise). Concretely, something like the metastable states of a Hopfield network, or the update steps of predictive coding (literally gradient updates to find the MAP solution for perception!) oscillating between multiple local minima?
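
As a toy illustration of that "multiple attractors under the same local dynamics" picture (a made-up sketch, not anything specific to the post or to the brain), a minimal Hopfield network with two stored patterns: the same asynchronous update rule settles into one attractor or the other depending on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stored patterns, standing in for two qualitatively different interpretations.
p1 = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
p2 = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])
W = np.outer(p1, p1) + np.outer(p2, p2)      # Hebbian weights
np.fill_diagonal(W, 0)

def settle(state, steps=100):
    """Asynchronous sign updates; typically converges to a stable fixed point."""
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Same dynamics, different random initializations -> different attractors.
for _ in range(5):
    init = rng.choice([-1, 1], size=8)
    final = settle(init)
    if abs(final @ p1) == 8:
        label = "attractor ~ p1"
    elif abs(final @ p2) == 8:
        label = "attractor ~ p2"
    else:
        label = "other fixed point"
    print(init, "->", final, f"({label})")
```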

[Intuitive self-models] 1. Preliminaries
Dalcy · 4mo

Curious about the claim regarding bistable perception as the brain "settling" differently on two distinct but roughly equally plausible generative model parameters behind an observation. In standard statistical terms, should I think of it as: two parameters having similarly high Bayesian posterior probability, but the brain not explicitly representing this posterior, instead using something like local hill climbing to find a local MAP solution—bistable perception corresponding to the two different solutions this process converges to?

If correct, to what extent should I interpret the brain as finding a single solution (MLE/MAP) versus representing a superposition or distribution over multiple solutions (fully Bayesian)? Specifically, in which context should I interpret the phrase "the brain settling on two different generative models"?

Posts

1 · Dalcy's Shortform · 3y · 128 comments
29 · Towards building blocks of ontologies · 5mo · 0 comments
45 · What's the Right Way to think about Information Theoretic quantities in Neural Networks? [Q] · 6mo · 13 comments
28 · Proof Explained for "Robust Agents Learn Causal World Model" · 7mo · 0 comments
67 · An Illustrated Summary of "Robust Agents Learn Causal World Model" [Ω] · 7mo · 2 comments
26 · Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic? [Q] · 11mo · 6 comments
38 · But Where do the Variables of my Causal Model come from? [Ω] · 11mo · 1 comment
27 · Why do Minimal Bayes Nets often correspond to Causal Models of Reality? [Q] · 1y · 1 comment
41 · When Are Results from Computational Complexity Not Too Coarse? · 1y · 8 comments
26 · Epistemic Motif of Abstract-Concrete Cycles & Domain Expansion · 2y · 2 comments
9 · Least-problematic Resource for learning RL? [Q] · 2y · 7 comments