Confounded No Longer: Insights from 'All of Statistics'

TurnTrout

Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a bandaid.

Larry Wasserman

Foreword

For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.

This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.

I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.

All of Statistics

1: Introduction

2: Probability

In which sample spaces are formalized.

3: Random Variables

In which random variables are detailed and a multitude of distributions are introduced.

Conjugate Variables

Consider that a random variable $X$ is a function $X : Ω \to \mathbb{R}$. For random variables $X, Y$, we can then produce conjugate random variables $XY, X + Y$, with

(XY)(ω) = X(ω) Y(ω), \qquad (X + Y)(ω) = X(ω) + Y(ω).

4: Expectation

Evidence Preservation

E (E (Y | X)) = E (Y)

is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).

Marginal Variance

V (Y) = E V (Y | X) + V E (Y | X)

Why does marginal variance have two terms? Shouldn't the expected conditional variance be sufficient?

This literally plagued my dreams.

Proof (of the variance; I cannot prove it plagued my dreams):

\begin{aligned}
V(Y) &= E(Y - E(Y))^{2} \\
&= E((Y - E(Y | X)) + (E(Y | X) - E(Y)))^{2} \\
&= E(Y - E(Y | X))^{2} + E(2 (Y - E(Y | X)) (E(Y | X) - E(Y))) + E(E(Y | X) - E(Y))^{2} \\
&= E V(Y | X) + 2 E((Y - E(Y | X)) (E(Y | X) - E(Y))) + V E(Y | X) \\
&= E V(Y | X) + 2 E(Y E(Y | X) - Y E(Y) - E(Y | X)^{2} + E(Y | X) E(Y)) + V E(Y | X) \\
&= E V(Y | X) + 2 (E(Y E(Y | X)) - E(Y E(Y)) - E(E(Y | X)^{2}) + E(E(Y | X) E(Y))) + V E(Y | X) \\
&= \underbrace{E V(Y | X)}_{\text{sample variance}} + \underbrace{V E(Y | X)}_{\text{model variance}}.
\end{aligned}

The middle term vanishes after repeated applications of conservation of expected evidence: $E(Y E(Y | X)) = E(E(Y | X)^{2})$ and $E(Y E(Y)) = E(E(Y | X) E(Y)) = E(Y)^{2}$, so the four expectations cancel in pairs. Another way to read the two remaining terms is as the sum of the expected sample variance and the variance of the expectation.
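To convince myself the identity is not a notational trick, it helps to check it exactly on a tiny example. The sketch below uses a made-up joint distribution over $(X, Y)$ (the numbers are purely illustrative assumptions) and exact rational arithmetic, then computes $V(Y)$, $E V(Y | X)$ and $V E(Y | X)$ directly from their definitions.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 0): F(1, 4), (0, 2): F(1, 4),
    (1, 1): F(1, 8), (1, 5): F(3, 8),
}

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def p_x(x):
    """Marginal probability P(X = x)."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def cond_mean(x):
    """E(Y | X = x)."""
    return sum(p * y for (xx, y), p in joint.items() if xx == x) / p_x(x)

def cond_var(x):
    """V(Y | X = x)."""
    m = cond_mean(x)
    return sum(p * (y - m) ** 2 for (xx, y), p in joint.items() if xx == x) / p_x(x)

mean_y = expect(lambda x, y: y)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
expected_cond_var = expect(lambda x, y: cond_var(x))                        # E V(Y | X)
var_cond_mean = expect(lambda x, y: (cond_mean(x) - mean_y) ** 2)           # V E(Y | X)

print(var_y, expected_cond_var, var_cond_mean)
```

On this toy distribution neither term alone accounts for $V(Y)$: the conditional means differ between the two values of $X$, and that spread is exactly what the second term picks up.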

Bessel's Correction

When calculating variance from observations $X_{1}, \dots, X_{n}$, you might think to write

S_{n}^{2} = \frac{1}{n} \sum_{i=1}^{n} (X_{i} - \bar{X}_{n})^{2},

where $\bar{X}_{n}$ is the sample mean. However, this systematically underestimates the population variance: the sample mean is itself fitted to the observations, so squared deviations from it are on average smaller than squared deviations from the true mean. The corrected sample variance is thus

S_{n}^{2} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_{i} - \bar{X}_{n})^{2}.

See Wikipedia.
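A small exact check makes the correction concrete. The sketch below assumes a fair six-sided die as the population and a sample size of 3 (both are my illustrative choices), enumerates every equally likely sample, and averages the two estimators with exact fractions instead of simulating.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)   # a fair six-sided die (illustrative population)
n = 3                 # sample size (illustrative)

pop_mean = Fraction(sum(faces), 6)
pop_var = sum((Fraction(x) - pop_mean) ** 2 for x in faces) / 6

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

samples = list(product(faces, repeat=n))   # all 6**n equally likely samples
avg_ss = sum(sum_sq_dev(s) for s in samples) / len(samples)

expected_naive = avg_ss / n              # expectation of the 1/n estimator
expected_corrected = avg_ss / (n - 1)    # expectation of the 1/(n - 1) estimator

print(pop_var, expected_naive, expected_corrected)
```

The naive estimator comes out low by exactly the factor $(n - 1)/n$, which is what dividing by $n - 1$ undoes.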

5: Inequalities

6: Convergence

In which the author provides instrumentally-useful convergence results; namely, the law of large numbers and the central limit theorem.

Equality of Continuous Variables

For independent continuous random variables $X, Y$, we have $P(X = Y) = 0$, which is surprising. In fact, for $x_{i} \sim X, y_{i} \sim Y$, $P(x_{i} = y_{i}) = 0$ as well!

The continuity is the culprit. Since the cumulative distribution functions $F_{X}, F_{Y}$ are continuous, the probability mass allotted to any single point is 0. Read more here.

Types of Convergence

Let $X_{1}, X_{2}, \dots$ be a sequence of random variables, and let $X$ be another random variable. Let $F_{n}$ denote the CDF of $X_{n}$ , and let $F$ denote the CDF of $X$ .

In Probability

$X_{n}$ converges to $X$ in probability, written $X_{n} \xrightarrow{p} X$, if, for every $ϵ > 0$, $P(|X_{n} - X| > ϵ) \to 0$ as $n \to \infty$.

Random variables are functions $Y : Ω \to \mathbb{R}$, assigning a number to each possible outcome in the sample space $Ω$. Considering this fact, a sequence of random variables converges in probability to $X$ when the assigned values are "far apart" (more than $ϵ$ apart) with probability 0 in the limit.

See here.

In Distribution

$X_{n}$ converges to $X$ in distribution, written $X_{n} ⇝ X$, if $\lim_{n \to \infty} F_{n}(t) = F(t)$ at all $t$ for which $F$ is continuous.

Fairly straightforward.

A similar $^{1}$ geometric intuition:

Note: the continuity requirement is important. Imagine we distribute points uniformly on $(0, \frac{1}{n})$; we see that $X_{n} ⇝ 0$. However, $F_{n}(0) = 0$ for every $n$, while $F(0) = 1$. Thus CDF convergence fails at $x = 0$, which is exactly the point where $F$ is discontinuous.
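The uniform example is easy to poke at numerically. This is a minimal sketch; the function names `F_n` and `F` are mine, standing for the CDF of Uniform$(0, \frac{1}{n})$ and the CDF of the point mass at 0.

```python
def F_n(t, n):
    """CDF of Uniform(0, 1/n) evaluated at t."""
    if t <= 0:
        return 0.0
    if t >= 1 / n:
        return 1.0
    return t * n

def F(t):
    """CDF of the point mass at 0."""
    return 1.0 if t >= 0 else 0.0

for t in (-0.01, 0.0, 0.01):
    # At t = 0 the first list stays at 0.0 forever while F(0) = 1.0.
    print(t, [F_n(t, n) for n in (1, 10, 100, 1000)], F(t))
```

Away from 0 the values of `F_n` eventually agree with `F`; only the discontinuity point refuses to cooperate, and the definition excuses it.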

In Quadratic Mean

$X_{n}$ converges to $X$ in quadratic mean, written $X_{n} \xrightarrow{qm} X$, if $E((X_{n} - X)^{2}) \to 0$ as $n \to \infty$.

The expected squared difference approaches 0; in contrast to convergence in probability, dealing with expectation means that values of $X_{n}$ highly deviant with respect to $X$ come into play. For example, if $X_{n} \xrightarrow{p} X$ but the extremal values of $X_{n}$ increase in squared distance more quickly than they decrease in probability, $X_{n}$ will not converge to $X$ in quadratic mean.
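Here is one concrete sequence with that behaviour, chosen by me as an illustration: let $X = 0$ and let $X_{n}$ equal $n$ with probability $\frac{1}{n}$ and 0 otherwise. The sketch computes both quantities exactly; the helper names are hypothetical.

```python
from fractions import Fraction

def prob_far(n, eps):
    """P(|X_n - 0| > eps), where X_n = n with probability 1/n and 0 otherwise."""
    return Fraction(1, n) if n > eps else Fraction(0)

def mean_square(n):
    """E((X_n - 0)^2) = n^2 * (1/n) + 0^2 * (1 - 1/n)."""
    return Fraction(n * n, n)

for n in (10, 100, 1000):
    print(n, prob_far(n, eps=Fraction(1, 2)), mean_square(n))
```

The probability of being far from 0 shrinks like $\frac{1}{n}$, but the squared distance grows like $n^{2}$, so the expected squared difference grows like $n$ and quadratic-mean convergence fails.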

7: Models, Statistical Inference and Learning

In which the attentive reader notices the chapter's tautological title - "statistical inference" and "learning" are taken to mean the same thing. Estimators are introduced, along with the definition of bias, consistency, and mean squared error.

8: Estimating the CDF and Statistical Functionals

In which the empirical distribution function and plug-in estimators set the stage for...

9: The Bootstrap

In which we learn to better approximate statistics via simulation.
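For a feel of what "via simulation" means here, this is a minimal sketch of a bootstrap standard error, assuming made-up data and the median as the statistic of interest. The function name `bootstrap_se` and its parameters are my own, not the book's.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_resamples=1000, seed=0):
    """Bootstrap estimate of the standard error of `statistic` on `data`."""
    rng = random.Random(seed)
    replicates = [
        statistic(rng.choices(data, k=len(data)))   # resample with replacement
        for _ in range(n_resamples)
    ]
    # Spread of the statistic across resamples approximates its standard error.
    return statistics.pstdev(replicates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]   # made-up observations
se_median = bootstrap_se(data, statistics.median)
print(se_median)
```

The fixed seed only keeps the sketch reproducible; in practice you would also want to check that the answer is stable as `n_resamples` grows.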

10: Parametric Inference

In which we explore those models residing in finite-dimensional parameter space.

Fisher Information

The score function captures how the log-likelihood $ℓ$ changes with respect to $θ$ :

s(X; θ) = \frac{\partial \log f(X; θ)}{\partial θ}.

Informally, this is the sensitivity of $ℓ$ to the parameter $θ$ . The derivative of the score captures the curvature of $ℓ$ with respect to $θ$ ; essentially, this represents how much information $X$ provides about $θ$ . The Fisher information is then the expected knowledge gain:

I(θ) = -E\left[\frac{\partial^{2}}{\partial θ^{2}} \log f(X; θ) \,\middle|\, θ\right].
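To see the definition do something, a Bernoulli($p$) coin is about the simplest case I can think of (my choice of example). The sketch plugs the second derivative of $\log f(x; p) = x \log p + (1 - x) \log(1 - p)$ into the formula and takes the expectation over $x \in \{0, 1\}$ exactly.

```python
from fractions import Fraction

def second_derivative_log_f(x, p):
    """d^2/dp^2 of log f(x; p) for f(x; p) = p^x (1 - p)^(1 - x), x in {0, 1}."""
    return -Fraction(x) / p**2 - Fraction(1 - x) / (1 - p) ** 2

def fisher_information(p):
    """I(p) = -E[d^2/dp^2 log f(X; p)], with X ~ Bernoulli(p)."""
    expectation = (
        p * second_derivative_log_f(1, p)
        + (1 - p) * second_derivative_log_f(0, p)
    )
    return -expectation

for p in (Fraction(1, 2), Fraction(1, 4), Fraction(1, 100)):
    print(p, fisher_information(p), 1 / (p * (1 - p)))
```

The two printed columns agree, matching the closed form $I(p) = \frac{1}{p(1 - p)}$: a single flip is least informative about $p$ near $\frac{1}{2}$ and increasingly informative as $p$ approaches 0 or 1.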

Factorization Theorem

A statistic $T$ is sufficient $\leftrightarrow$ there are functions $g (t, θ)$ and $h (x)$ such that $f (x^{n}; θ) = g (t (x^{n}), θ) h (x^{n})$ .

A statistic is sufficient if and only if we can reexpress the probability density function using just that statistic.
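A standard worked instance, which I am adding as an illustration: for $x_{1}, \dots, x_{n}$ drawn iid from Bernoulli($p$), the joint mass function depends on the data only through the number of successes.

```latex
f(x^{n}; p) = \prod_{i=1}^{n} p^{x_{i}} (1 - p)^{1 - x_{i}} = p^{t} (1 - p)^{n - t},
\qquad t = t(x^{n}) = \sum_{i=1}^{n} x_{i}.
```

Taking $g(t, p) = p^{t} (1 - p)^{n - t}$ and $h(x^{n}) = 1$ gives the required factorization, so the success count is sufficient for $p$: once you know it, the order of the flips tells you nothing more.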

11: Hypothesis Testing and p-values

In which we make testable predictions and step towards traditional rationality. Trigger warning: frequentism.

Frequently Confused

Brian Fantana: They've done studies, you know. 60% of the time it works, every time.
Ron Burgundy: That doesn't make sense.

Anchorman

Confidence intervals ("in 60% of experiments just like this, the interval we compute will contain the true value") and credible intervals ("given this data, we believe the true value lies within this interval with 60% probability") are different things.

Frequentists define "confidence interval" to mean "theoretically, if we ran this experiment lots of times, the intervals we constructed would contain the true parameter 60% of the time". Without understanding this nuance, some results seem counterintuitive:

In the example [Jaynes] gives, there is enough information in the sample to be certain that the true value of the parameter lies nowhere in a properly constructed 90% confidence interval!

[Size Joke Here]

In hypothesis testing, we're trying to discriminate between two sets of possible worlds - formally, we're partitioning our hypothesis space $Θ$ into $Θ_{0}$ (the null hypothesis) and $Θ_{1}$ (the alternative hypothesis). Let's consider all of the things which can happen, all of the outcomes we can observe - this is the sample space $Ω$ .

A test $φ : Ω \to \{0, 1\}$ might take a sample and say "you're in $Θ_{0}$" (for example). We can divvy up $Ω$ into the acceptance region $A$ (in which we accept the null hypothesis) and rejection region $R$.

The power of a test $φ$ is the function $β : Θ \to [0, 1]$ that tells us the probability of rejecting the null hypothesis given some parameter: $β_{φ} (θ) = Pr (X \in R | θ)$ . Basically, we have $β_{φ} (θ)$ probability of rejecting the null hypothesis given that reality is actually parametrized by $θ$ .

We want to avoid rejecting the null hypothesis when $θ \in Θ_{0}$; therefore, we define some level of significance $α$ for which $β_{φ}(θ) \leq α$ for all $θ \in Θ_{0}$. This means that, whenever the null hypothesis is true, we avoid Type I errors at least $100 \times (1 - α)\%$ of the time. The maximum probability that we commit a Type I error is the size of the test $φ$: $α_{φ} = \sup_{θ \in Θ_{0}} β_{φ}(θ)$.
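A toy test makes power and size less abstract. The setup below is entirely my assumption: $X \sim \text{Binomial}(20, θ)$, null hypothesis $Θ_{0} = [0, 0.5]$, and rejection region $R = \{15, \dots, 20\}$. The size is approximated by taking the maximum of the power function over a grid on $Θ_{0}$.

```python
from math import comb

def power(theta, n=20, k=15):
    """beta(theta) = P(X in R | theta) for X ~ Binomial(n, theta), R = {k, ..., n}."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x)
        for x in range(k, n + 1)
    )

# Grid over the null set Theta_0 = [0, 0.5] (illustrative choice of null).
null_grid = [t / 100 for t in range(0, 51)]
size = max(power(theta) for theta in null_grid)

print(size)         # largest Type I error probability over the grid
print(power(0.9))   # power against one particular alternative
```

For this test the worst case over the null sits on its boundary, $θ = 0.5$: the world in $Θ_{0}$ that looks most like the alternative is the one that sets the size.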

The p-value Alignment Problem

Getting your understanding of p-values to align with how p-values actually work (whatever that means) can require an impressive amount of mental gymnastics. Let's see if we can do better.

You're running an experiment in which you hypothesize that all dogs spontaneously combust when you whistle just so. You divide the hypothesis space into $Θ_{\text{dogs don't spontaneously combust}}$ and $Θ_{\text{dogs do spontaneously combust}}$ ($Θ_{0}$ and $Θ_{1}$ for short); that is, sets of worlds in which your conjecture is false (null) and true (alternative). Each $θ$ is a way-the-world-could-be. By the definition of p-values, you may only reject the null hypothesis if all worlds $θ \in Θ_{0}$ agree that the observation is unlikely.

The p-value is the probability (under the null hypothesis) of observing a value of the test statistic as or more extreme than what was actually observed.
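One way to make "all null worlds must agree" concrete is to compute the tail probability in each null world and report the largest. Everything specific below is a hypothetical of mine: the null worlds are per-whistle combustion probabilities, the test statistic is the number of dogs that combust, and the experiment involves 50 dogs with 3 combustions.

```python
from math import comb

def tail_prob(theta, n, observed):
    """P(T >= observed | theta) for T ~ Binomial(n, theta)."""
    return sum(
        comb(n, t) * theta**t * (1 - theta) ** (n - t)
        for t in range(observed, n + 1)
    )

null_worlds = [0.0, 0.001, 0.01]   # hypothetical combustion probabilities under the null
n_dogs, n_combusted = 50, 3        # hypothetical experiment

# The p-value is set by the null world that finds the data least surprising.
p_value = max(tail_prob(theta, n_dogs, n_combusted) for theta in null_worlds)
print(p_value)
```

A small `p_value` here means even the most accommodating null world considers three combustions unlikely, which is the only situation in which rejection is licensed.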

Imagine if you could only Bayes update towards a set of worlds when all the other world models agree that the observation is unlikely under their models.

12: Bayesian Inference

In which we return to the familiar.

Jeffreys' Prior

We often desire that our priors be noninformative, since finding a reasonable subjective prior isn't always feasible. One might think to use a uniform prior $f (θ) = c$ ; however, this doesn't quite hold up.

Say I have a uniform prior $f(θ) = 1$ on $(0, 1)$ for the money in your bank account (each $θ$ being a dollar amount). What if I want to know my prior for the square of the amount of money in your bank account ($ϕ = θ^{2}$)? Then by the change of variable equation for PDFs, we have $f_{Φ}(ϕ) = \frac{1}{2 \sqrt{ϕ}}$, which is no longer uniform. We then desire that our prior be transformation invariant - under a noninformative prior, I should be ignorant about both the value of your balance and the squared value of your balance.

Jeffreys' prior satisfies this desideratum - define

f (θ) \propto \sqrt{I (θ)},

where $I (θ)$ is the Fisher information (discussed in the Ch. 10 summary):

I(θ) = \underbrace{-E\left[\frac{\partial^{2}}{\partial θ^{2}} \log f(X; θ) \,\middle|\, θ\right]}_{\text{expected information } X \text{ carries about } θ}.

Jeffreys' prior isn't totally noninformative - it encodes the information that we expect the prior to be transformation invariant, but that is rather weak information.
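As a worked instance (my choice of example), take a single Bernoulli($p$) observation, whose Fisher information is $I(p) = \frac{1}{p(1 - p)}$. Plugging this into the definition gives:

```latex
I(p) = \frac{1}{p (1 - p)}
\quad \Longrightarrow \quad
f(p) \propto \sqrt{I(p)} = p^{-1/2} (1 - p)^{-1/2}.
```

This is the kernel of a Beta$(\frac{1}{2}, \frac{1}{2})$ density. Note that it is not flat: it piles up near 0 and 1, where a single observation is most informative about $p$.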

13: Statistical Decision Theory

In which decision theory is defined as the theory of comparing statistical procedures.

14: Linear Regression

In which the pieces start to line up.

The Bias-Variance Tradeoff

Image credit: Scott Fortmann-Roe

As more covariates are added to a model, the bias decreases while the variance increases. Let's say you call 30 friends and ask them whether they agree with the Copenhagen interpretation of quantum mechanics, or with many-worlds. Say that you build a model with 5 covariates (such as age, sex, race, political leaning, and education level). This has decreased bias compared to a model which uses only education level, since descriptive power increases with the number of covariates. However, you increase variance in the sense that any given friend is more likely to be differently classified every time you run the experiment with slightly different data sets.

If you're familiar with brain surgery (machine learning), we can use it to learn how to apply bandaids. Think of adding more covariates as sliding towards overfitting.

Degrees of Confusion

There are numerous explanations for what degrees of freedom actually are. Some say it's the number of independent parameters required by a model, and others explain it as the number of parameters which are free to vary. Is there a better framing?

Consider $X_{1}, \dots, X_{n} \overset{iid}{\sim} N(0, 1)$, and let $\bar{X}_{n}$ be the sample mean. Then the residuals vector $(X_{1} - \bar{X}_{n}, \dots, X_{n} - \bar{X}_{n})$ has $n - 1$ degrees of freedom. Why is this the case, and what does this mean?

Say we learn the values of $X_{1}, \dots, X_{n - 1}$ . Then conditional on our already knowing the sample mean, there is only one value that $X_{n}$ can take:

X_{n} = n \bar{X}_{n} - \sum_{i=1}^{n-1} X_{i}.

$X_{n}$ is totally determined by the first $n - 1$ values (this is related to Bessel's correction).
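The constraint is easy to verify by hand or in a few lines. This sketch uses arbitrary made-up observations and exact fractions; it recovers the last observation from the mean and the other $n - 1$ values, and shows the equivalent statement for residuals.

```python
from fractions import Fraction

xs = [Fraction(v) for v in (3, -1, 4, 1, 5)]   # arbitrary illustrative observations
n = len(xs)
mean = sum(xs) / n
residuals = [x - mean for x in xs]

# Residuals always sum to zero, so the last one is fixed by the first n - 1.
last_residual = -sum(residuals[:-1])

# Knowing the mean and the first n - 1 observations pins down the last one.
last_x = n * mean - sum(xs[:-1])

print(sum(residuals), last_residual == residuals[-1], last_x == xs[-1])
```

So the residual vector lives in an $(n - 1)$-dimensional subspace: one direction has been used up by estimating the mean.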

Let's ask a similar question - how many bits of information do we need to specify our model? Statistics isn't acclimated to thinking in terms of bits, so "independent real-valued parameters" is the unit used instead. If you have more parameters, you need to gather more bits to have the same confidence that your explanation (model) fits the data you have observed. This is an implicit Occamian prior: amongst models which fit the data equally well, the one with the fewest degrees of freedom is preferred.

I'd like to thank TheMajor for letting me steal their wonderful explanation.

15: Multivariate Models

16: Inference about Independence

17: Undirected Graphs and Conditional Independence

In which (very) elementary graph theory and the pairwise and global Markov conditions are introduced.

18: Log-Linear Models

19: Causal Inference

Simpson's Paradox

Sometimes you have two groups which individually exhibit a positive trend, but have a negative trend when combined.

Imagine it is 2019, and Shrek 5 has just come out. $^{2}$ Being an internet phenomenon, the movie is initially extremely popular with younger demographics, but has middling performance with middle-aged people. Consider concessions sales at a single theater: the younger group buys, on average, 1.8 large popcorns per person, while the older group only averages .7 larges. If $\frac{2}{3}$ of the initial viewership at the theater is younger, then we have a weighted average of $\frac{2}{3} \cdot \frac{18}{10} + \frac{1}{3} \cdot \frac{7}{10} = 1.4\overline{3}$ larges.

The older group actually likes the movie, and recommends it to their friends. The demographic decomposition is now fifty-fifty. During the second week, everyone is a bit hungrier and buys .1 more large popcorns per viewing on average. Then both groups are buying more popcorn, but the weighted average decreased: $\frac{1}{2} \cdot \frac{19}{10} + \frac{1}{2} \cdot \frac{8}{10} = 1.35$ larges.

Obviously, the demographic split shifted the average. However, pretend you're the manager for the concessions stand. You monitor average per-person purchases and erroneously conclude that something you did made people less likely to buy, even though both groups are buying more popcorn.

If you don't control for confounders (in this case, demographics), the statistic of per-person purchases is not reliable for drawing conclusions.
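The popcorn arithmetic can be replayed exactly. This sketch uses only the numbers from the example; the function name `overall_average` is mine.

```python
from fractions import Fraction

def overall_average(shares, per_person):
    """Weighted average of per-person purchases across demographic groups."""
    return sum(shares[g] * per_person[g] for g in shares)

week1 = overall_average(
    {"younger": Fraction(2, 3), "older": Fraction(1, 3)},
    {"younger": Fraction(18, 10), "older": Fraction(7, 10)},
)
week2 = overall_average(
    {"younger": Fraction(1, 2), "older": Fraction(1, 2)},
    {"younger": Fraction(19, 10), "older": Fraction(8, 10)},
)

print(float(week1), float(week2))
```

Both per-group averages went up by 0.1, yet `week2 < week1`, because the mix shifted toward the group that buys less. Comparing within each group, rather than across the pooled audience, removes the illusion.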

20: Directed Graphs

In which passive and active conditioning are built up to by exploring the capacities of directed acyclic graphs for representing independence relations.

21: Nonparametric Curve Estimation

22: Smoothing Using Orthogonal Functions

The top plot is the true density for the Bart Simpson distribution.

23: Classification

24: Stochastic Processes

In which we learn processes for dealing with sequences of dependent random variables.

25: Simulation Methods

Final Verdict

This text is very cleanly written and has reasonable exercises. Ideally, I would have gone through my calculus books first, but it wasn't a big deal. The main downside is that I couldn't find an answer key, but thanks to the generous help of my friends on Facebook and in the MIRIx Discord, it worked out.

I skimmed Ch. 21, as it seemed to be more about implementation than deep conceptual material. I intend to revisit Ch. 22 after reading Tao's Analysis I, which is next on my list.

This book took me less than two weeks at a few hours of studying per day.

Forwards

Tips

I quickly realized that learning the basics of the R programming language is essential for getting a large portion of the value this text can offer.

Depth

Although I have fewer things to say on a meta level, I definitely got a lot out of this book. The most rewarding parts were when I noticed my confusion and really dove in to figure out what was going on - in particular, my forays into random variables, confidence intervals, p-values, and convergence types.

Red

I definitely haven't arrived at full-fledged statistical sophistication, but I progressed so rapidly that I regularly thought "what caveman asked that lol" when encountering questions I had asked just days earlier.

This is another data point for a realization I've had over the last month: I'm so red, but I've been living like a white-blue. What does that even mean, and how is it relevant?

From Duncan's excellent fake framework, How the "Magic: The Gathering" Color Wheel Explains Humanity:

The most salient dichotomy present here, in my opinion, is that of red and white:

Red and white disagree on questions of structure and commitment. Red is episodic, suspicious of rules and order because they constrain one’s ability to grow and change and freely choose. White is more diachronic, interested in finding the small compromises and sacrifices that will allow people to build trust and cooperate reliably.

White personalities often regard themselves as a continuous person, evolving in a somewhat orderly fashion. Red, on the other hand, feels disconnected from their past selves. After a certain amount of time, past-you feels like a different person who made choices that now seem ridiculous, if not alien. How old is your current iteration? Mine is three months, but what shocked me about this book was that I felt an intellectual disconnect with the me who existed four days prior.

Zooming out from All of Statistics, I think it's telling that I achieved fairly tectonic change $^{3}$ by learning to align my emotions with my reflectively-coherent desires, to clear away emotional debris, and to channel my passion into discrete tasks. I was living as if I were a white, but it's now clear I'm a blue-red who exhibits white traits mostly in pursuit of peace of mind.

I no longer ask "how can I study most effectively?", but rather, "what does it feel like to be me right now, and how can I bring that into alignment with what I want to do?".

Red seeks freedom, and it tries to achieve that freedom through action... For a red agent, victory feels fiery, beautiful, magnificent, and fierce — it’s the climax of a dance or a brawl or a love affair, the feeling of cresting a summit or having successfully ridden a wave. It’s feeling alive.

If you are interested in working with me or others on the task of learning MIRI-relevant math, if you have a burning desire to knock the alignment problem down a peg - I would be more than happy to work with you. Messaging me may also have the pleasant side effect of your receiving an invitation to the MIRIx Discord server.

$^{1}$ Although any shape in the sequence implied by the image does indeed have strictly different area than the circle it approximates (in contrast to $F_{n}$ and $F$ ), the analogy may still be helpful.

$^{2}$ Please don't wirehead thinking about this.

$^{3}$ I'm aware that this section isn't very implementable. I may write more on my post-CFAR experience in the near future.