All of criticalpoints's Comments + Replies

Thank you!

I'm not an expert on this topic, but my impression is that linear regression is useful when you are trying to fit a function from input to output (e.g., imagine you have the alleles at various loci as your inputs and you want to predict some phenotype as your output; that's the type of problem well suited to high-dimensional linear regression). Principal component analysis, by contrast, is mainly used as a dimensionality-reduction technique (so using PCA in the two-dimensional case, as I did in this post, is a bit of overkill).
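To make the contrast concrete, here's a minimal sketch (a toy example of my own, with made-up data) showing that the least-squares slope and the first principal component of the same 2-D data generally point in different directions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.8, size=500)

# Regression of y on x minimizes vertical errors; np.polyfit returns
# [slope, intercept] for a degree-1 fit.
slope_reg = np.polyfit(x, y, 1)[0]

# PCA: the first principal component is the direction of maximal variance,
# i.e. the top eigenvector of the 2x2 covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(np.stack([x, y])))
pc1 = eigvecs[:, np.argmax(eigvals)]
slope_pca = pc1[1] / pc1[0]

print(slope_reg, slope_pca)  # different slopes: fitting a function != summarizing variance
```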

Thank you! That's very kind.

I got curious and asked Claude to explain the difference between regressing X-onto-Y and Y-onto-X, and it did a really good job, which I found somewhat distressing. Is my blog post even providing any value when an LLM can reproduce 80-90% of the insight in literally a thousandth of the time?

But maybe there's still value in writing up the blog post because it's non-trivial to know what the right questions are to ask. I wrote this blog post because I knew that (a) understanding the difference between the two regression lines was impor... (read more)
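(For reference, here is the formula version of the distinction, in notation of my choosing: with correlation $r$ and standard deviations $s_X, s_Y$,

```latex
\text{slope of the } Y\text{-on-}X \text{ line: } \hat\beta_{Y\mid X} = r\,\frac{s_Y}{s_X},
\qquad
\text{slope of the } X\text{-on-}Y \text{ line, drawn in the } (x,y) \text{ plane: } \frac{1}{\hat\beta_{X\mid Y}} = \frac{1}{r}\,\frac{s_Y}{s_X}.
```

The two lines coincide only when $|r| = 1$; otherwise their slopes differ by a factor of $r^2$.)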

XelaP
Certainly, you have pictures! Pictures are great!

Yes, that's an important clarification. Markov's inequality is tight over the space of all non-negative random variables (the inequality becomes an equality for the two-point distribution shown in the final state of the proof), but it's not constructed to be tight for a generic distribution.
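Concretely (in my notation, for mean $\mu$ and threshold $a \ge \mu > 0$), the extremal two-point distribution is:

```latex
P(X = a) = \frac{\mu}{a}, \qquad P(X = 0) = 1 - \frac{\mu}{a}
\;\;\Longrightarrow\;\;
\mathbb{E}[X] = \mu
\quad\text{and}\quad
P(X \ge a) = \frac{\mu}{a} = \frac{\mathbb{E}[X]}{a}.
```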

I'm pretty new to the sorts of tail-bound proofs you see a lot of in, e.g., high-dimensional probability theory. But in general, understanding the circumstances under which a bound is tight has been one of the best ways for me to build intuition for how a given bound works.

  1. For the first part, about "$A_p$ being a formal maneuver": I don't disagree with the comment as stated, nor with what Jaynes did in a technical sense. But I'm trying to imbue the proposition with a "physical interpretation" when I identify it with an infinite collection of evidences. There is a subtlety in my original statement that I didn't expand on, but I've been thinking about it ever since I read the post: "infinitude" is probably best understood as a relative term. Maybe the simplest way to think about this is that, as I understand it, if you condition o

... (read more)

Thanks for the reference. You and the other commentator both seem to be saying the same thing: that there isn't much of a use case for the $A_p$ distribution, since Bayesian statisticians have other frameworks for thinking about these sorts of problems. It seems important that I acquaint myself with the basic tools of Bayesian statistics to better contextualize Jaynes' contribution.

transhumanist_atom_understander
Sort of. I think the distribution of Θ is the $A_p$ distribution, since it satisfies that formula; Θ=p is $A_p$. It's just that Jaynes prefers an exposition modeled on propositional logic, whereas a standard probability textbook begins with the definition of "random variables" like Θ; but this seems to me just a notational difference, since an equation like Θ=p is, after all, a proposition from the perspective of propositional logic. So I would rather say that Bayesian statisticians are in fact using it, and I was just explaining why you don't find any exposition of it under that name. I don't think there's a real conceptual difference. Jaynes, of course, would object to the word "random" in "random variable," but it's just a word; in my post I call it an "unknown quantity" and mathematically define it the usual way.
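Writing the correspondence out (my notation, assuming Θ has a density $f$):

```latex
\underbrace{P(A \mid A_p) = p}_{\text{Jaynes's } A_p}
\;\longleftrightarrow\;
\underbrace{P(A \mid \Theta = p) = p}_{\text{standard notation}},
\qquad
P(A) = \int_0^1 p \, f(p)\, dp .
```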

This intuition, that the KL divergence is a metric squared, is indeed important for understanding it. It's a property that all divergences have in common: divergences can be thought of as generalizations of the squared Euclidean distance, where you replace the quadratic, which is in some sense the Platonic convex function, with a convex function of your choice.
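As a quick numerical sanity check (my own toy example, using Bernoulli distributions): the KL between nearby distributions shrinks quadratically, with the Fisher information as the curvature.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
for eps in [0.1, 0.01, 0.001]:
    kl = kl_bernoulli(p, p + eps)
    quad = eps**2 / (2 * p * (1 - p))  # (1/2) * Fisher information * eps^2
    print(f"eps={eps:g}  KL={kl:.3e}  quadratic approx={quad:.3e}")
# The ratio KL / quad tends to 1 as eps -> 0: locally, KL behaves like a squared distance.
```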

This intuition is also important for understanding Talagrand's $T_2$ inequality, which says that, under certain conditions like strong log-concavity of the reference measure q, the Wasserstein-2 distance (which... (read more)

Thanks for the feedback.

What you are showing with the coin is a hierarchical model over multiple coin flips, and doesn't need new probability concepts. Let $F, G$ be the flips. All you need in life is the distribution $P(F, G)$. You can decide to restrict yourself to distributions of the form $\int_0^1 dp_{\text{coin}}\, P(F, G \mid p_{\text{coin}})\, p(p_{\text{coin}})$. In practice, you start out thinking about $p_{\text{coin}}$ as a variable atop all the flips in a graph, and then think in terms of $P(F, G \mid p_{\text{coin}})$ and $p(p_{\text{coin}})$ separately, because that's more intuitive. This is the standard way of doing things. All you do

... (read more)
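A minimal sketch of the hierarchical model being described, with a uniform prior on $p_{\text{coin}}$ chosen purely for illustration: the flips are conditionally independent given $p_{\text{coin}}$, but the marginal distribution couples them.

```python
import numpy as np

# Grid approximation to the prior density p(p_coin): uniform on [0, 1].
p_grid = np.linspace(0.0, 1.0, 10001)
dp = p_grid[1] - p_grid[0]
prior = np.ones_like(p_grid)  # p(p_coin) = 1

# Marginal P(F=heads, G=heads) = \int_0^1 p^2 p(p_coin) dp = 1/3 ...
p_both = np.sum(p_grid**2 * prior) * dp
# ... versus P(F=heads) * P(G=heads) = (1/2)^2 = 1/4.
p_single = np.sum(p_grid * prior) * dp

print(p_both, p_single**2)  # ~0.333 vs ~0.25: exchangeable, but not independent
```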
rotatingpaguro
I still don't understand your "infinite limit" idea. If in your post I drop the following paragraph: the rest is standard hierarchical modeling. So even if your words here are suggestive, I don't understand how to actually connect the idea to calculations/concrete things, even at a vague, indicative level. So I guess I'm not actually understanding it.

For example, you could show me a conceptual example where you do something with this that is not standard probabilistic modeling. Or maybe it's all standard but you get to a solution faster. Or anything where applying the idea produces something different; then I would see how it works.

----------------------------------------

(Note: I don't know if you noticed, but De Finetti's theorem applies to properly infinite sequences only, not finite ones; people forget this. It is not relevant to the discussion, though.)
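(To illustrate that parenthetical with the standard counterexample, in notation of my choosing: sampling without replacement from an urn with one red and one black ball gives an exchangeable pair that cannot be a mixture of i.i.d. flips.

```latex
% Urn with one red and one black ball, drawn without replacement:
P(X_1 = \text{red}) = \tfrac{1}{2}, \qquad P(X_1 = X_2 = \text{red}) = 0.
% But any mixture of i.i.d. Bernoulli(p) draws would force
P(X_1 = X_2 = \text{red}) = \mathbb{E}[p^2] \ \ge\ \left(\mathbb{E}[p]\right)^2 = \tfrac{1}{4}.
```
)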