I'm not an expert on this topic, but my impression is that linear regression is useful when you are trying to fit a function from input to output (e.g. imagine you have the alleles at various loci as your inputs and you want to predict some phenotype as your output; that's the type of problem well-suited to high-dimensional linear regression). Principal component analysis, by contrast, is mainly used as a dimensionality reduction technique (so using PCA for the case of two dimensions, as I did in this post, is a bit overkill).

The Geometry of Linear Regression versus PCA

criticalpoints2mo20

Thank you! That's very kind.

I got curious and asked Claude to explain the difference between regressing X-onto-Y and Y-onto-X and it did a really good job---which I found somewhat distressing. Is my blog post even providing any value when an LLM can reproduce 80-90% of the insight in literally a 1000th of the time?

But maybe there's still value in writing up the blog post because it's non-trivial to know what the right questions are to ask. I wrote this blog post because I knew that (a) understanding the difference between the two regression lines was important and (b) it was actually straightforward to explain the difference if you used the right framing. So perhaps there's still utility in having good taste in what questions are worth answering. At the very least, I personally benefited from writing up the post since it forced me to shore up my understanding.
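To make the difference between the two regression lines concrete, here is a minimal sketch in plain Python. The function name `three_slopes` and the toy data are my own illustrative choices, not anything from the post. It computes the slope from regressing Y onto X, the slope from regressing X onto Y (re-expressed as dy/dx so that all three lines live in the same plane), and the slope of the first principal axis.

```python
import math

def three_slopes(xs, ys):
    """Return (slope of Y-onto-X, slope of first PCA axis, slope of X-onto-Y).

    All three slopes are expressed as dy/dx in the (x, y) plane.
    Assumes the sample covariance is nonzero.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n                    # variance of x
    syy = sum((y - my) ** 2 for y in ys) / n                    # variance of y
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n  # covariance

    # Regressing Y onto X minimizes vertical residuals.
    y_on_x = sxy / sxx
    # Regressing X onto Y minimizes horizontal residuals: x = (sxy / syy) * y,
    # which is a line of slope syy / sxy when drawn in the (x, y) plane.
    x_on_y = syy / sxy
    # The first principal axis minimizes perpendicular residuals; its angle
    # is the standard orientation formula for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pca = math.tan(theta)
    return y_on_x, pca, x_on_y

# Hypothetical toy data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
y_on_x, pca, x_on_y = three_slopes(xs, ys)
print(y_on_x, pca, x_on_y)
```

For positively correlated data the three slopes come out ordered, with the PCA axis sandwiched between the two regression lines. The three lines coincide only when the points are perfectly collinear.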

Markov's Inequality Explained

criticalpoints4mo10

Yes, that's an important clarification. Markov's inequality is tight on the space of all non-negative random variables (the inequality becomes an equality with the two-point distribution shown in the final state of the proof). But it's not constructed to be tight with respect to a generic distribution.
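Here is a small sketch of that tightness claim, using exact rational arithmetic. The helper names (`two_point`, `expectation`, `tail`) and the specific numbers are illustrative assumptions on my part. A two-point distribution on {0, a} with mean μ puts mass μ/a on a, so P(X ≥ a) equals the Markov bound E[X]/a exactly, while a generic distribution with the same mean typically sits strictly below the bound.

```python
from fractions import Fraction

def two_point(mean, a):
    """X = a with probability mean/a, else X = 0 (requires 0 <= mean <= a, a > 0)."""
    p = Fraction(mean) / Fraction(a)
    return {Fraction(0): 1 - p, Fraction(a): p}

def expectation(dist):
    """E[X] for a finite distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def tail(dist, a):
    """P(X >= a) for a finite distribution given as {value: probability}."""
    return sum(p for x, p in dist.items() if x >= a)

a = 4

# Extremal case: all the mass sits at 0 or exactly at the threshold a.
extremal = two_point(mean=2, a=a)
print(tail(extremal, a), expectation(extremal) / a)  # the two numbers agree

# Generic case: uniform on {0, 1, 2, 3, 4}, which also has mean 2.
uniform = {Fraction(k): Fraction(1, 5) for k in range(5)}
print(tail(uniform, a), expectation(uniform) / a)    # tail is strictly smaller
```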

I'm pretty new to these sorts of tail-bound proofs that you see a lot in e.g. high-dimensional probability theory. But in general, understanding under what circumstances a bound is tight has been one of the best ways to intuitively understand how a given bound works.

The Ap Distribution

criticalpoints4mo10

For the first part about "$A_{p}$ being a formal maneuver"--I don't disagree with the comment as stated nor with what Jaynes did in a technical sense. But I'm trying to imbue the proposition with a "physical interpretation" when I identify it with an infinite collection of evidences. There is a subtlety with my original statement that I didn't expand on, but I've been thinking about ever since I read the post: "infinitude" is probably best understood as a relative term. Maybe the simplest way to think about this is that, as I understand it, if you condition on two $A_{p}$ distributions at the same time, you get a "do not compute"--not zero, but "do not compute". So the $A_{p}$ proposition only seems to make sense with respect to some subset $E$ of all possible propositions. I interpret this subset as being those of "finite" evidence while the $A_{p}$'s (and other propositions) somehow stand outside of this finite evidence class. There is also the matter that, in day-to-day life, it doesn't really seem possible to encounter what to me seems like a "single" piece of evidence that has the dramatic effect of rendering our beliefs "deterministically indeterministic". Can we really learn something that tells us that there is no more to learn?
Yes, I suspect that there is a typo there, though I'm a bit too lazy to reference the original text to check. It should be that the probability density over $A_{p}$ is normalized, and their expectation is the probability of $A$ .
This idea of compressing all the information in $E$ that is relevant to $A$ into the object $p (A_{p} | E)$ is interesting and indeed, it's perhaps a better articulation of what I find interesting about the $A_{p}$ distribution than what is conveyed in the main body of the original post. One thread that I want(ed) to tug at a little further is that the $A_{p}$ distribution seems to lend itself well to the first steps towards something of a dynamical model of probability theory: when you encounter a piece of evidence $E$, its first-order effect is to change your probability of $A$, but its second and n-th order effects are to affect your distribution of what future evidence you expect to encounter and how to "interpret" those pieces of evidence--where by "interpret" I mean in what way encountering that piece of evidence shifts your probability of $A$. This dynamical theory of probability would have folk theorems like "the variance in your $A_{p}$ distribution must monotonically decrease over time". These are shower thoughts.
And it's also interesting perhaps on a more applied/agentic sense in that we often casually talk about "updating" our beliefs, but what does that actually look like in practice? Empirically, we see that we can have evidence in our head that we fail to process (lack of logical omniscience). Maybe something like the $A_{p}$ distribution could be helpful for understanding this even better.
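The normalization and expectation properties from the second point above can be checked numerically. This is only a sketch under a loud assumption: I take the density $p (A_{p} | E)$ to be a Beta(3, 2) density purely for illustration (nothing in Jaynes or the post fixes that choice), and `beta_pdf` and `integrate` are hypothetical helper names.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density, standing in (by assumption) for p(A_p | E)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def integrate(f, n=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

a, b = 3.0, 2.0  # illustrative shape parameters, not taken from the post

# Normalization: the density over A_p integrates to one.
total = integrate(lambda p: beta_pdf(p, a, b))

# Expectation: P(A | E) is the mean of the A_p density.
prob_A = integrate(lambda p: p * beta_pdf(p, a, b))

print(total, prob_A)  # approximately 1 and a / (a + b)
```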

The Ap Distribution

criticalpoints5mo10

Thanks for the reference. You and the other commenter both seem to be saying the same thing: that there isn't much of a use case for the $A_{p}$ distribution, as Bayesian statisticians have other frameworks for thinking about these sorts of problems. It seems important that I acquaint myself with the basic tools of Bayesian statistics to better contextualize Jaynes' contribution.

Six (and a half) intuitions for KL divergence

criticalpoints5mo10

This intuition--that the KL is a metric-squared--is indeed important for understanding the KL divergence. It's a property that all divergences have in common. Divergences can be thought of as generalizations of the Euclidean metric where you replace the quadratic--which is in some sense the Platonic convex function--with a convex function of your choice.
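One way to make "replace the quadratic with a convex function of your choice" precise is the Bregman divergence $D_{\phi}(p, q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q \rangle$. I'm assuming here that this is the construction being gestured at (f-divergences are a different family that also contains the KL). In the sketch below, with hypothetical helper names, choosing $\phi$ to be the quadratic recovers half the squared Euclidean distance, and choosing $\phi$ to be the negative entropy recovers the KL divergence on probability vectors.

```python
import math

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator 1: the quadratic, phi(x) = 0.5 * ||x||^2.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def half_sq_norm_grad(x):
    return list(x)

# Generator 2: negative entropy, phi(x) = sum_i x_i log x_i (entries must be > 0).
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(bregman(half_sq_norm, half_sq_norm_grad, p, q))  # 0.5 * ||p - q||^2
print(bregman(neg_entropy, neg_entropy_grad, p, q), kl(p, q))  # these agree
```

The agreement with the KL divergence relies on both vectors summing to one. For unnormalized positive vectors the negative-entropy Bregman divergence picks up an extra term equal to the difference of the totals.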

This intuition is also important for understanding Talagrand's T2 inequality, which says that, under certain conditions such as strong log-concavity of the reference measure q, the squared Wasserstein-2 distance between the two probability measures p and q (the analogue of the squared Euclidean metric, lifted to the space of probability measures) can be upper bounded by a constant multiple of their KL divergence.
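As a sanity check of that inequality in the simplest case I can think of: take the reference measure q to be the standard normal in one dimension, which is strongly log-concave with constant 1, so the bound reads $W_{2}^{2}(p, q) \le 2 \, \mathrm{KL}(p \| q)$. For $p = N(m, s^{2})$ both sides have closed forms, and the inequality reduces to $\ln s \le s - 1$. The function names below are my own; the restriction to one-dimensional Gaussians is an assumption made only to keep the sketch short.

```python
import math

def w2_squared(m, s):
    """Squared Wasserstein-2 distance between N(m, s^2) and N(0, 1)."""
    return m ** 2 + (s - 1.0) ** 2

def kl_to_std_normal(m, s):
    """KL( N(m, s^2) || N(0, 1) ), with s > 0 the standard deviation."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)

# Scan a small grid of means and standard deviations (illustrative values).
for m in (-2.0, 0.0, 0.5, 3.0):
    for s in (0.1, 0.5, 1.0, 2.0, 5.0):
        lhs = w2_squared(m, s)
        rhs = 2.0 * kl_to_std_normal(m, s)
        print(f"m={m:5.1f} s={s:4.1f}  W2^2={lhs:8.4f}  2*KL={rhs:8.4f}")
```

The two sides coincide when s = 1, i.e. when p is a pure translation of q, which is a nice illustration of where the bound is tight.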

The Ap Distribution

criticalpoints9mo20

Thanks for the feedback.

What you are showing with the coin is a hierarchical model over multiple coin flips, and doesn't need new probability concepts. Let $F_{i}$ be the flips. All you need in life is the distribution $P (F_{1}, F_{2}, \dots)$. You can decide to restrict yourself to distributions of the form $\int_{0}^{1} d p_{c o i n} P (F, G | p_{c o i n}) p (p_{c o i n})$. In practice, you start out thinking about $p_{c o i n}$ as a variable atop all the $F_{i}$ in a graph, and then think in terms of $P (F, G | p_{c o i n})$ and $p (p_{c o i n})$ separately, because that's more intuitive. This is the standard way of doing things. All you do with $A_{p}$ is the same, there's no point at which you do something different in practice, even if you ascribed additional properties to $A_{p}$ in words.

This isn't emphasized by Jaynes (though I believe it's mentioned at the very end of the chapter), but the $A_{p}$ distribution isn't new as a formal idea in probability theory. It's based on De Finetti's representation theorem. The theorem concerns exchangeable sequences of random variables.

A sequence of random variables $\{X_{i}\}$ is exchangeable if the joint distribution of any finite subsequence is invariant under permutations. A sequence of coin flips is the canonical example. Note that exchangeability does not imply independence! If I have a perfectly biased coin (one that always lands on the same side) where I don't know which way it is biased, then all the random variables are perfectly dependent on each other (they must all take the same value).

De Finetti's representation theorem says that any infinite exchangeable sequence of random variables can be represented as an integral (a mixture) over independent and identically distributed sequences (i.e. for coin flips, sequences of independent flips with a fixed bias, whose head counts are binomially distributed). Or in other words, the extent to which random variables in the sequence are dependent on each other is solely due to their mutual relationship to the latent variable (the hidden bias of the coin).

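$P (X_{1} = x_{1}, \dots, X_{n} = x_{n}) = \int_{0}^{1} \theta^{k} (1 - \theta)^{n - k} d F (\theta)$

Here $k = \sum_{i} x_{i}$ is the number of heads among the $n$ flips. (With a binomial coefficient $\binom{n}{k}$ in front, the same integral instead gives the probability of seeing $k$ heads in any order.)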
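Here is a small exact check of both claims (exchangeable but not independent, and the mixture representation), as a sketch under a simplifying assumption: the mixing distribution $F$ puts mass 1/2 on each of two biases, 1/5 and 4/5. That choice, and the names `PRIOR` and `seq_prob`, are mine and purely illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Assumed mixing distribution F: the coin's bias is 1/5 or 4/5, equally likely.
PRIOR = {Fraction(1, 5): Fraction(1, 2), Fraction(4, 5): Fraction(1, 2)}

def seq_prob(xs, prior=PRIOR):
    """Probability of one specific 0/1 sequence under the mixture.

    Conditional on the bias theta the flips are i.i.d., so the sequence has
    probability theta^k * (1 - theta)^(n - k); we then average over the prior.
    """
    n, k = len(xs), sum(xs)
    return sum(w * theta ** k * (1 - theta) ** (n - k) for theta, w in prior.items())

# Exchangeability: every reordering of a sequence has the same probability.
base = (1, 1, 0, 0, 1)
probs = {seq_prob(perm) for perm in permutations(base)}
print(probs)  # a single value

# Not independent: the joint probability of two heads exceeds the product
# of the marginals, because the first head is evidence about the bias.
p_head = seq_prob((1,))
p_two_heads = seq_prob((1, 1))
print(p_head, p_two_heads, p_head * p_head)
```

Seeing one head makes the high-bias coin more plausible, which is exactly the dependence that flows through the latent variable.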

You are correct that all relevant information is contained in the joint distribution $P (F_{1}, F_{2}, \dots)$. And while I have no deep familiarity with Bayesian hierarchical modeling, I believe your claim that the decomposition $\int_{0}^{1} d p_{c o i n} P (F, G | p_{c o i n}) p (p_{c o i n})$ is standard in Bayesian modeling.

But I think the point is that the $A_{p}$ distribution is a useful conceptual tool when considering distributions governed by a time-invariant generating process. A lot of real-world processes don't fit that description, but many do fit that description.

A concept like "the probability of me assigning a certain probability" makes sense but I don't think Jaynes actually did anything like that for real. Here on lesswrong I guess @abramdemski knows about stuff like that.

Yes, this is correct. The part about "the probability of assigning a probability" and the part about interpreting the proposition $A_{p}$ as a shorthand for an infinite collection of evidences are my own interpretations of what the $A_{p}$ distribution "really" means. Specifically, the part about the "probability that you will assign the probability in the infinite future" is loosely inspired by the idea of Cauchy surfaces from e.g. general relativity (or any physical theory that has a causal structure built in). In general relativity, the idea is that if you have boundary conditions specified on a Cauchy surface, then you can time-evolve to solve for the distribution of matter and energy for all time. In something like quantum field theory, a principled choice for the Cauchy surface would be the infinite past (this conceptual idea shows up when understanding the vacuum in QFT). But I think in probability theory, it's more useful conceptually to take your Cauchy surface of probabilities to be what you expect them to be in the "infinite future". This is how I make sense of the $A_{p}$ distribution.

And now that you mention it, this blog post was totally inspired by reading the first couple chapters of "Logical Inductors" (though the inspiration wasn't conscious on my part).

--PS: I think Jaynes was great in his way of approaching the meaning and intuition of statistics, but the book is bad as a statistics textbook. It's literally the half-complete posthumous publication of a rambling contrarian physicist, and it shows. So I would not trust any specific statistical thing he does. Taking the general vibe and ideas is good, but when you ask about a specific thing "why is nobody doing this?" it's most likely because it's outdated or wrong.

Not a statistician, so I will defer to your expertise that the book is bad as a statistics book (never thought of it as a statistics book to be honest). I think the strongest parts of this book are when he derives statistical mechanics from the maximum entropy principle and when he generalizes the principle of indifference to consider more general group invariances/symmetries. As far as I'm aware, my opinion on which of Jaynes' ideas are his best ideas matches the consensus.

I suspect the reason why I like the $A_{p}$ distribution is that I come from a physics background, so his reformulation of standard ideas in Bayesian modeling makes some amount of sense to me even if it comes across as weird and crankish to statisticians.