criticalpoints

The Geometry of Linear Regression versus PCA

Thank you! That's very kind.

I got curious and asked Claude to explain the difference between regressing X-onto-Y and Y-onto-X and it did a really good job---which I found somewhat distressing. Is my blog post even providing any value when an LLM can reproduce 80-90% of the insight in literally a 1000th of the time?

But maybe there's still value in writing up the blog post because it's non-trivial to know what the right questions are to ask. I wrote this blog post because I knew that (a) understanding the difference between the two regression lines was important and (b) it was actually straightforward to explain the difference if you used the right framing. So perhaps there's still utility in having good taste in what questions are worth answering. At the very least, I personally benefited from writing up the post since it forced me to shore up my understanding.

The Geometry of Linear Regression versus PCA

In statistics, there are two common ways to "find the best linear approximation to data": linear regression and principal component analysis. However, they are quite different---having distinct assumptions, use cases, and geometric properties. I remained subtly confused about the difference between them until last year. Although what I'm about to explain is standard knowledge in statistics, and I've even found well-written blog posts on this exact subject, it still seems worthwhile to examine, in detail, how linear regression and principal component analysis differ.
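To make the contrast concrete, here is a minimal sketch in Python (NumPy) that fits all three lines to the same synthetic data. The data-generating process, the seed, and the variable names are illustrative assumptions, not taken from the post.

```python
import numpy as np

# Synthetic, positively correlated data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# 2x2 sample covariance matrix of (x, y).
cov = np.cov(x, y)

# Regressing Y onto X minimizes vertical errors: slope = cov(x, y) / var(x).
slope_y_on_x = cov[0, 1] / cov[0, 0]

# Regressing X onto Y minimizes horizontal errors. Fitting x = b * y gives
# b = cov(x, y) / var(y); drawn in the (x, y) plane that line has slope 1 / b.
slope_x_on_y = cov[1, 1] / cov[0, 1]

# PCA minimizes perpendicular errors: the first principal component is the
# eigenvector of the covariance matrix with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]
slope_pca = v[1] / v[0]

print(slope_y_on_x, slope_pca, slope_x_on_y)
```

For positively correlated data like this, the PCA slope should land between the two regression slopes, which is one way to see that the three procedures really do produce three different lines.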

The brief summary of this post is that the different lines result from the different directions in which we minimize error:

When we regress $Y$ onto $X$, we minimize vertical errors relative to the

... (read 1577 more words →)

Replying to Markov's Inequality Explained

Yes, that's an important clarification. Markov's inequality is tight on the space of all non-negative random variables (the inequality becomes an equality with the two-point distribution shown in the final state of the proof). But it's not constructed to be tight with respect to a generic distribution.

I'm pretty new to these sorts of tail-bound proofs that you see a lot in, e.g., high-dimensional probability theory. But in general, understanding under what circumstances a bound is tight has been one of the best ways to intuitively understand how a given bound works.
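The tightness claim can be checked with exact arithmetic. The sketch below assumes the two-point distribution in question puts mass $E[X]/a$ on $a$ and the rest on $0$, and uses the $P(X \ge a) \le E[X]/a$ form of the inequality; the helper names and the uniform comparison distribution are hypothetical choices for illustration.

```python
from fractions import Fraction

def expected_value(dist):
    """Mean of a finite discrete distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

def tail_prob(dist, a):
    """P(X >= a) for a finite discrete distribution."""
    return sum(p for v, p in dist.items() if v >= a)

def markov_bound(dist, a):
    """Markov's upper bound E[X] / a on P(X >= a)."""
    return expected_value(dist) / a

a = Fraction(4)
mu = Fraction(1)

# Two-point distribution with mean mu: all the mass sits at 0 or exactly at a.
two_point = {Fraction(0): 1 - mu / a, a: mu / a}

# A "generic" distribution with the same mean: uniform on {0, 1, 2}.
uniform = {Fraction(k): Fraction(1, 3) for k in range(3)}

tight_gap = markov_bound(two_point, a) - tail_prob(two_point, a)
loose_gap = markov_bound(uniform, a) - tail_prob(uniform, a)

print(tight_gap)  # 0: the bound is attained
print(loose_gap)  # 1/4: the bound is far from the true tail probability of 0
```

Both distributions have mean 1, so Markov gives the same bound of 1/4 for each; only the two-point distribution actually attains it.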

Markov's Inequality Explained

In my experience, the proofs that you see in probability theory are much shorter than the longer, more involved proofs that you might see in other areas of math (e.g. analytic number theory). But that doesn't mean that good technique isn't important. In probability theory, there is a set of tools that are useful across a broad variety of situations and you need to be able to recognize when it's the appropriate time to use each tool in your toolkit.

One of the most useful tools to have is Markov's inequality. What Markov's inequality says is that, given a non-negative random variable $X$ and a positive real number $a$, the probability that $X$ is greater than $a$ can... (read 806 more words →)

Replying to The Ap Distribution

For the first part about " $A_{p}$ being a formal maneuver"--I don't disagree with the comment as stated nor with what Jaynes did in a technical sense. But I'm trying to imbue the proposition with a "physical interpretation" when I identify it with an infinite collection of evidences. There is a subtlety with my original statement that I didn't expand on, but I've been thinking about ever since I read the post: "infinitude" is probably best understood as a relative term. Maybe the simplest way to think about this is that, as I understand it, if you condition on two $A_{p}$ distributions at the same time, you get a "do not compute"--not zero,

... (read more)

Replying to The Ap Distribution

Thanks for the reference. You and the other commenter both seem to be saying the same thing: that there isn't much of a use case for the $A_{p}$ distribution, as Bayesian statisticians have other frameworks for thinking about these sorts of problems. It seems important that I acquaint myself with the basic tools of Bayesian statistics to better contextualize Jaynes' contribution.

Replying to Six (and a half) intuitions for KL divergence

This intuition--that the KL is a metric-squared--is indeed important for understanding the KL divergence. It's a property that all divergences have in common. Divergences can be thought of as generalizations of the Euclidean metric where you replace the quadratic--which is in some sense the Platonic convex function--with a convex function of your choice.
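One way to make this concrete, assuming "divergence" here is read as a Bregman divergence $D_{\phi}(p, q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q \rangle$, is the sketch below. Choosing the quadratic for $\phi$ recovers half the squared Euclidean distance, and choosing negative entropy recovers the KL divergence. The function names and the example vectors are illustrative assumptions.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence: the gap between phi(p) and its linearization at q."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# The quadratic: phi(x) = 0.5 * ||x||^2, with gradient x.
half_sq = lambda x: 0.5 * np.dot(x, x)
half_sq_grad = lambda x: x

# Negative entropy: phi(x) = sum x_i log x_i, with gradient log x_i + 1.
neg_entropy = lambda x: np.sum(x * np.log(x))
neg_entropy_grad = lambda x: np.log(x) + 1.0

# Two probability vectors with strictly positive entries (illustrative).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

euclid = bregman(half_sq, half_sq_grad, p, q)
kl = bregman(neg_entropy, neg_entropy_grad, p, q)

# Reference values computed directly from the usual formulas.
half_sq_dist = 0.5 * np.sum((p - q) ** 2)
kl_direct = np.sum(p * np.log(p / q))

print(euclid, half_sq_dist)
print(kl, kl_direct)
```

The same `bregman` function produces both quantities; only the convex function changes.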

This intuition is also important for understanding Talagrand's T2 inequality, which says that, under certain conditions like strong log-concavity of the reference measure q, the squared Wasserstein-2 distance (the analogue of the squared Euclidean metric, lifted to the space of probability measures) between the two probability measures p and q can be upper bounded by a constant multiple of their KL divergence.
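A small sanity check of this, under the assumption that the reference measure q is the standard normal in one dimension (which is 1-strongly log-concave, so the inequality reads $W_2^2(p, q) \le 2\,\mathrm{KL}(p \| q)$): for Gaussian p both sides have closed forms, so they can be compared directly. The function names and test cases are hypothetical.

```python
import math

def w2_squared(m, s):
    """Squared Wasserstein-2 distance between N(m, s^2) and N(0, 1)."""
    return m ** 2 + (s - 1.0) ** 2

def kl_to_std_normal(m, s):
    """KL( N(m, s^2) || N(0, 1) )."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)

# (mean, standard deviation) pairs for p; chosen arbitrarily for illustration.
cases = [(0.0, 1.0), (1.5, 1.0), (0.0, 0.3), (-2.0, 2.5)]

for m, s in cases:
    print(m, s, w2_squared(m, s), 2.0 * kl_to_std_normal(m, s))

# Small tolerance because pure shifts (s = 1) attain equality.
t2_holds = all(
    w2_squared(m, s) <= 2.0 * kl_to_std_normal(m, s) + 1e-12 for m, s in cases
)
```

In this Gaussian case the gap works out to $2(s - 1 - \ln s)$, which is non-negative and vanishes exactly when p is a pure translation of q.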

Chess As The Model Game

Other than Pokemon, most of my YouTube consumption consists of chess analysis videos. This might surprise people because I don't play chess very often these days. And when I do play chess, I'm not that good.

(My peak chess.com Elo rating was 1050 back in 2018---and surely, I've only gotten worse since then now that I am out of practice.)

I haven't been a chess enthusiast my whole life though. Besides a brief childhood dalliance, my love of chess was something that I discovered as a young adult. Like a lot of tiger parents, my mom made sure that I was exposed to chess at an early age---just in case I was the... (read 2296 more words →)

Replying to The Ap Distribution

Thanks for the feedback.

What you are showing with the coin is a hierarchical model over multiple coin flips, and doesn't need new probability concepts. Let $F_{i}$ be the flips. All you need in life is the distribution $P(F_{1}, F_{2}, \dots)$. You can decide to restrict yourself to distributions of the form $\int_{0}^{1} dp_{\mathrm{coin}}\, P(F, G \mid p_{\mathrm{coin}})\, p(p_{\mathrm{coin}})$. In practice, you start out thinking about $p_{\mathrm{coin}}$ as a variable atop all the $F_{i}$ in a graph, and then think in terms of $P(F, G \mid p_{\mathrm{coin}})$ and $p(p_{\mathrm{coin}})$ separately, because that's more intuitive. This is the standard way of doing things. All you do with $A_{p}$ is the same, there's no point at which you do something different in practice, even if you

... (read 696 more words →)
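The hierarchical model described in the quoted passage can be sketched in a few lines. This assumes a uniform prior on the coin's bias (the passage does not specify one) and flips that are independent given the bias, so the integral $\int_{0}^{1} p^{h} (1 - p)^{t}\, dp = \frac{h!\, t!}{(h + t + 1)!}$ can be evaluated exactly. The function name is hypothetical.

```python
from fractions import Fraction
from math import factorial

def seq_prob(flips):
    """Marginal probability of a flip sequence such as "HHT".

    Integrates p^h * (1 - p)^t over a uniform prior on the bias p in [0, 1],
    which equals h! * t! / (h + t + 1)!.
    """
    h = flips.count("H")
    t = flips.count("T")
    return Fraction(factorial(h) * factorial(t), factorial(h + t + 1))

p_h = seq_prob("H")      # marginal probability that one flip is heads
p_hh = seq_prob("HH")    # joint probability that two flips are both heads
cond = p_hh / p_h        # P(second flip heads | first flip heads)

print(p_h, p_hh, cond)   # 1/2 1/3 2/3
```

The flips are independent only conditionally on the bias: marginally, seeing one head raises the probability of the next from 1/2 to 2/3, and all of that structure lives in the joint distribution over flips.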