Meta-note: This is listed as a 254 minute read, which seems like either a bug or a flaw in the algorithm. This isn't quick, but it isn't that long.
Unless this is all completely new to you and you actually try to stop and understand it, in which case it's a really long (but very worthwhile) read. But I don't see that as what that number's designed to mean?
(Great explanations/post, by the way!)
I tried diving into All of Statistics but I found it to be way too concise. I didn't get past Chapter 3. In particular Chapter 3 felt like a list of distributions and some arbitrary properties. It felt like I wasn't really getting an intuition for what these distributions represent or why these properties are interesting. In the end I dropped the book because I felt like I wasn't really learning anything.
My negative experience with this book is likely a result of me having no previous experience with statistics.
I think that that's reasonable - there really aren't many intuitions offered for the distributions. I got more of that from the Bayesian Statistics course I took concurrently and from reading Wikipedia pages. A lot of the rest seemed well-motivated, though!
Do you happen to remember which Bayesian Statistics course you took? I'm interested in following the same steps.
Foreword
For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.
This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.
I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.
All of Statistics
1: Introduction
2: Probability
In which sample spaces are formalized.
3: Random Variables
In which random variables are detailed and a multitude of distributions are introduced.
Conjugate Variables
Consider that a random variable X is a function X:Ω→R. For random variables X,Y, we can then produce conjugate random variables XY,X+Y, with (XY)(ω)=X(ω)Y(ω) and (X+Y)(ω)=X(ω)+Y(ω).
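To make the "random variables are just functions" view concrete, here is a minimal sketch using two fair dice as the sample space. The dice, the helper names `add`, `mul`, and `expectation` are all assumptions made for illustration; nothing here comes from the book.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))
prob = {w: Fraction(1, 36) for w in omega}

# Random variables are plain functions from outcomes to numbers.
def X(w):
    return w[0]

def Y(w):
    return w[1]

# Sums and products of random variables are defined pointwise.
def add(U, V):
    return lambda w: U(w) + V(w)

def mul(U, V):
    return lambda w: U(w) * V(w)

def expectation(U):
    return sum(prob[w] * U(w) for w in omega)

e_sum = expectation(add(X, Y))
e_prod = expectation(mul(X, Y))
print(e_sum, e_prod)
```

The new objects `add(X, Y)` and `mul(X, Y)` are themselves functions on Ω, which is why they are random variables in their own right.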
4: Expectation
Evidence Preservation
The rule of iterated expectations, E[E[Y|X]]=E[Y], is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).
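As a sanity check on the rule of iterated expectations, E[E[Y|X]]=E[Y], here is a small exact computation. The joint distribution below is made up purely for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X=x, Y=y), chosen only for illustration.
joint = {
    (0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
    (1, 1): Fraction(1, 4), (1, 5): Fraction(1, 4),
}

def marginal_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_exp_y(x):
    # E[Y | X = x]
    return sum(p * y for (xi, y), p in joint.items() if xi == x) / marginal_x(x)

xs = {x for x, _ in joint}
e_y = sum(p * y for (_, y), p in joint.items())
e_of_cond = sum(marginal_x(x) * cond_exp_y(x) for x in xs)
print(e_y, e_of_cond)
```

Averaging the conditional expectations, weighted by how likely each value of X is, gives back the unconditional expectation: on average, learning X cannot be expected to move your estimate of Y.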
Marginal Variance
V(Y)=E[V(Y|X)]+V(E[Y|X]). This literally plagued my dreams.
Proof (of the variance; I cannot prove it plagued my dreams):
The middle term is eliminated as the expectations cancel out after repeated applications of conservation of expected evidence. Another way to look at the two remaining terms is the sum of the expected conditional variance and the variance of the conditional expectation.
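The equations of the proof did not survive here, so the following is a standard derivation of the law of total variance, offered as a sketch that may differ in detail from the original. The notation (𝔼 for expectation, 𝕍 for variance) is an assumption.

```latex
\begin{align*}
\mathbb{V}(Y)
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] \\
  &= \mathbb{E}\big[\big((Y - \mathbb{E}[Y \mid X]) + (\mathbb{E}[Y \mid X] - \mathbb{E}[Y])\big)^2\big] \\
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]
     + 2\,\mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])\big]
     + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2\big] \\
  &= \mathbb{E}\big[\mathbb{V}(Y \mid X)\big] + 0 + \mathbb{V}\big(\mathbb{E}[Y \mid X]\big).
\end{align*}
```

The cross term vanishes because, conditional on X, the second factor is a constant and the first factor has conditional expectation zero; taking the outer expectation then gives zero.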
Bessel's Correction
When calculating variance from observations X1,…,Xn, you might think to write S²n=(1/n)Σi(Xi−X̄n)²,
where X̄n is the sample mean. However, this systematically underestimates the population variance: the sample mean is computed from the same observations, so it sits closer to them, on average, than the true mean does. The corrected sample variance is thus S²n=(1/(n−1))Σi(Xi−X̄n)².
See Wikipedia.
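A quick simulation makes the bias visible. This is a sketch under made-up settings (standard normal data, samples of size 5); it leans on `statistics.pvariance`, which divides by n, and `statistics.variance`, which divides by n−1.

```python
import random
import statistics

random.seed(0)
n, trials = 5, 10000          # small samples make the bias easy to see
biased, corrected = [], []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]   # true variance is 1
    biased.append(statistics.pvariance(xs))       # divides by n
    corrected.append(statistics.variance(xs))     # divides by n - 1

mean_biased = statistics.fmean(biased)
mean_corrected = statistics.fmean(corrected)
print(mean_biased, mean_corrected)
```

With n=5 the uncorrected estimator should average roughly (n−1)/n=0.8 of the true variance, while the corrected one should average close to 1.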
5: Inequalities
6: Convergence
In which the author provides instrumentally-useful convergence results; namely, the law of large numbers and the central limit theorem.
Equality of Continuous Variables
For independent continuous random variables X,Y, we have P(X=Y)=0, which is surprising. In fact, for xi∼X,yi∼Y, P(xi=yi)=0 as well!
The continuity is the culprit. Since the cumulative distribution functions FX,FY are continuous, the probability mass allotted to any given point is 0. Read more here.
Types of Convergence
In Probability
Random variables are functions Y:Ω→R, assigning a number to each possible outcome in the sample space Ω. Considering this fact, a sequence of random variables Xn converges in probability to X when their assigned values are "far apart" (|Xn−X| greater than ϵ) with probability approaching 0 in the limit.
See here.
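One familiar instance is the weak law of large numbers, where the sample mean converges in probability to the true mean. The sketch below estimates the probability of being "far apart" by Monte Carlo; the Uniform(0,1) data, ϵ=0.1, and the function name `prob_far` are all choices made for illustration.

```python
import random

random.seed(1)
eps, trials = 0.1, 2000

def prob_far(n):
    # Monte Carlo estimate of P(|mean of n Uniform(0,1) draws - 0.5| > eps).
    far = 0
    for _ in range(trials):
        mean = sum(random.random() for _ in range(n)) / n
        far += abs(mean - 0.5) > eps
    return far / trials

estimates = {n: prob_far(n) for n in (5, 50, 500)}
print(estimates)
```

The estimated probability should shrink toward 0 as n grows, which is exactly what convergence in probability to the constant 0.5 asks for.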
In Distribution
Fairly straightforward.
A similar¹ geometric intuition:
Note: the continuity requirement is important. Imagine we distribute points uniformly on (0,1/n); we see that Xn⇝0. However, Fn(x) is 0 whenever x≤0, for every n, but F(0)=1 for the point mass at 0. Thus CDF convergence does not occur at x=0.
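The counterexample can be checked directly by writing down both CDFs. This is a small sketch; the function names `F_n` and `F` simply mirror the notation above.

```python
def F_n(x, n):
    # CDF of Uniform(0, 1/n).
    if x <= 0:
        return 0.0
    return min(n * x, 1.0)

def F(x):
    # CDF of the point mass at 0.
    return 1.0 if x >= 0 else 0.0

for x in (-0.5, 0.0, 0.01):
    print(x, [F_n(x, n) for n in (1, 10, 100, 1000)], F(x))
```

At every x other than 0 the values of Fn eventually agree with F, while at x=0 they stay at 0 forever. Since 0 is the one point where F is discontinuous, the definition of convergence in distribution is allowed to ignore it.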
In Quadratic Mean
The expected squared difference E[(Xn−X)²] approaches 0; in contrast to convergence in probability, dealing with expectation means that values of Xn highly deviant with respect to X come into play. For example, if Xn→X in probability but the extremal values of Xn increase in squared distance more quickly than they decrease in probability, Xn will not converge to X in quadratic mean.
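A standard example of this failure (not taken from the book's text here, so treat it as an assumed illustration) is Xn=n with probability 1/n and Xn=0 otherwise, with limit X=0. The sketch computes both quantities exactly.

```python
from fractions import Fraction

def prob_far(n, eps=Fraction(1, 2)):
    # P(|X_n - 0| > eps): only the outcome X_n = n can be far from 0.
    return Fraction(1, n) if n > eps else Fraction(0)

def second_moment(n):
    # E[(X_n - 0)^2] = n^2 * (1/n) + 0^2 * (1 - 1/n) = n
    p = Fraction(1, n)
    return n**2 * p + 0**2 * (1 - p)

for n in (1, 10, 100, 1000):
    print(n, prob_far(n), second_moment(n))
```

The probability of being far from 0 shrinks like 1/n, so Xn converges to 0 in probability, but the expected squared distance grows like n, so there is no convergence in quadratic mean.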
7: Models, Statistical Inference and Learning
In which the attentive reader notices the chapter's tautological title - "statistical inference" and "learning" are taken to mean the same thing. Estimators are introduced, along with the definition of bias, consistency, and mean squared error.
8: Estimating the CDF and Statistical Functionals
In which the empirical distribution function and plug-in estimators set the stage for...
9: The Bootstrap
In which we learn to better approximate statistics via simulation.
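For a flavor of the idea, here is a minimal sketch of a nonparametric bootstrap estimate of the standard error of the median. The exponential data, the number of resamples, and the helper name `bootstrap_se` are assumptions for illustration, not the book's code (which, as far as I recall, leans on R).

```python
import random
import statistics

random.seed(2)
data = [random.expovariate(1.0) for _ in range(100)]  # made-up sample

def bootstrap_se(sample, stat, n_boot=1000):
    # Resample with replacement from the data, recompute the statistic
    # each time, and report the spread of those recomputed values.
    replicates = [
        stat(random.choices(sample, k=len(sample))) for _ in range(n_boot)
    ]
    return statistics.stdev(replicates)

se_median = bootstrap_se(data, statistics.median)
print(se_median)
```

The appeal is that nothing here needed a formula for the sampling distribution of the median; the empirical distribution stands in for the unknown true one.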
10: Parametric Inference
In which we explore those models residing in finite-dimensional parameter space.
Fisher Information
The score function captures how the log-likelihood ℓ changes with respect to θ: s(X;θ)=∂ℓ(θ)/∂θ.
Informally, this is the sensitivity of ℓ to the parameter θ. The derivative of the score captures the curvature of ℓ with respect to θ; essentially, this represents how much information X provides about θ. The Fisher information is then the expected knowledge gain: I(θ)=−E[∂²ℓ(θ)/∂θ²].
Further reading.
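To see these definitions agree in a concrete case, here is a sketch for a single Bernoulli(p) observation, where the Fisher information is known to be 1/(p(1−p)). It assumes the usual definitions (the score is the derivative of the log-likelihood; the Fisher information is the variance of the score, equivalently minus the expected derivative of the score). The function names and the value p=0.3 are made up.

```python
def score(x, p):
    # d/dp of log f(x; p), where log f = x*log(p) + (1-x)*log(1-p)
    return x / p - (1 - x) / (1 - p)

def score_derivative(x, p):
    # Second derivative of the log-likelihood with respect to p.
    return -x / p**2 - (1 - x) / (1 - p)**2

def expect(g, p):
    # Expectation of g(X, p) over X ~ Bernoulli(p).
    return (1 - p) * g(0, p) + p * g(1, p)

p = 0.3
mean_score = expect(score, p)
var_score = expect(lambda x, q: score(x, q) ** 2, p) - mean_score ** 2
neg_curvature = -expect(score_derivative, p)
closed_form = 1 / (p * (1 - p))
print(mean_score, var_score, neg_curvature, closed_form)
```

The score averages to zero, and both routes to the Fisher information land on the closed form. Information is largest when p is near 0 or 1, where a single observation is most telling about small changes in p.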
Factorization Theorem
A statistic is sufficient if and only if we can reexpress the probability density function using just that statistic.
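In symbols, the usual statement is that T is sufficient exactly when f(x;θ)=g(T(x),θ)·h(x) for some functions g and h. A sketch with Bernoulli data, where T is the number of successes and h(x)=1; the samples and function names are made up for illustration.

```python
from math import prod

def likelihood(xs, p):
    # Joint pmf of i.i.d. Bernoulli(p) observations.
    return prod(p if x == 1 else 1 - p for x in xs)

def via_statistic(t, n, p):
    # g(t, p) with t = sum(xs); here h(x) = 1.
    return p**t * (1 - p)**(n - t)

a = [1, 1, 0, 0, 1]
b = [0, 1, 1, 1, 0]   # a different sample with the same sum

for p in (0.2, 0.5, 0.9):
    print(p, likelihood(a, p), likelihood(b, p), via_statistic(sum(a), len(a), p))
```

Two samples with the same sum are indistinguishable as far as p is concerned, which is the sense in which the sum carries everything the data has to say about p.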
11: Hypothesis Testing and p-values
In which we make testable predictions and step towards traditional rationality. Trigger warning: frequentism.
Frequently Confused
Confidence intervals ("if we ran many experiments just like this and built an interval from each, about 60% of those intervals would contain the true value") and credible intervals ("we believe, with 60% probability, that the true value lies within this interval") are different things.
Frequentists define "confidence interval" to mean "theoretically, if we ran this experiment Lots of times, the intervals we computed would cover the true parameter 60% of the time". Without understanding this nuance, some results seem counterintuitive.
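The frequentist reading is a statement about the procedure over many repetitions, which is easy to simulate. A sketch assuming normal data with known standard deviation; the true mean, sample size, and 60% level are made-up values.

```python
import random
from statistics import NormalDist, fmean

random.seed(3)
true_mean, sigma, n, level = 10.0, 2.0, 25, 0.60   # made-up values
z = NormalDist().inv_cdf(0.5 + level / 2)           # about 0.84 for 60%
half_width = z * sigma / n ** 0.5

trials, covered = 10000, 0
for _ in range(trials):
    xbar = fmean(random.gauss(true_mean, sigma) for _ in range(n))
    # Each repetition builds its own interval around its own sample mean.
    covered += (xbar - half_width) <= true_mean <= (xbar + half_width)

coverage = covered / trials
print(coverage)
```

The true mean never moves; the intervals do. Roughly 60% of them should cover it, and no single interval comes with a probability statement about the parameter.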
[Size Joke Here]
In hypothesis testing, we're trying to discriminate between two sets of possible worlds - formally, we're partitioning our hypothesis space Θ into Θ0 (the null hypothesis) and Θ1 (the alternative hypothesis). Let's consider all of the things which can happen, all of the outcomes we can observe - this is the sample space Ω.
A test φ:Ω→{0,1} might take a sample and say "you're in Θ0" (for example). We can divvy up Ω into the acceptance region A (in which we accept the null hypothesis) and rejection region R.
The power of a test φ is the function β:Θ→[0,1] that tells us the probability of rejecting the null hypothesis given some parameter: βφ(θ)=Pr(X∈R|θ). Basically, we have βφ(θ) probability of rejecting the null hypothesis given that reality is actually parametrized by θ.
We want to avoid rejecting the null hypothesis when θ∈Θ0; therefore, we define some level of significance α for which βφ(θ)≤α for all θ∈Θ0. This means we're avoiding Type I errors at least 100×(1−α)% of the time. The maximum probability that we commit a Type I error is the size of the test φ: αφ=sup{βφ(θ):θ∈Θ0}.
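Here is a sketch of a power function for one concrete test: data are N(θ,1), Θ0 is θ≤0, and we reject when the sample mean exceeds a cutoff c. The setup, sample size, and names (`power`, `null_grid`) are assumptions chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()
n, alpha = 25, 0.05                            # made-up sample size and level
c = std_normal.inv_cdf(1 - alpha) / n ** 0.5   # reject when the sample mean exceeds c

def power(theta):
    # P(sample mean > c | theta), with sample mean ~ N(theta, 1/n)
    return 1 - NormalDist(theta, 1 / n ** 0.5).cdf(c)

null_grid = [-1.0, -0.5, -0.1, 0.0]   # a few points in Theta_0 = {theta <= 0}
size_on_grid = max(power(t) for t in null_grid)
print(size_on_grid, power(0.5))
```

Because the power function is increasing in θ, the supremum over Θ0 is attained at the boundary θ=0, so the cutoff was chosen to make the size exactly α there.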
The p-value Alignment Problem
Getting your understanding of p-values to align with how p-values actually work (whatever that means) can require an impressive amount of mental gymnastics. Let's see if we can do better.
You're running an experiment in which you hypothesize that all dogs spontaneously combust when you whistle just so. You divide the hypothesis space into Θ_{dogs don't spontaneously combust} and Θ_{dogs do spontaneously combust} (Θ0 and Θ1 for short); that is, sets of worlds in which your conjecture is false (null) and true (alternative). Each θ is a way-the-world-could-be. By the definition of p-values, you may only reject the null hypothesis if all worlds θ∈Θ0 agree that the observation is unlikely.
Imagine if you could only Bayes update towards a set of worlds when all the other world models agree that the observation is unlikely under their models.
12: Bayesian Inference
In which we return to the familiar.
Jeffreys' Prior
We often desire that our priors be noninformative, since finding a reasonable subjective prior isn't always feasible. One might think to use a uniform prior f(θ)=c; however, this doesn't quite hold up.
Say I have a uniform prior f(θ)=1 for the money in your bank account (each θ being a dollar amount). What if I want to know my prior for the square of the amount of money in your bank account (ϕ=θ²)? Then by the change of variable equation for PDFs, we have fΦ(ϕ)=1/(2√ϕ), which is no longer uniform. We then desire that our prior be transformation invariant - under a noninformative prior, I should be ignorant about both the value of your balance and the squared value of your balance.
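The loss of uniformity is easy to check by simulation. A sketch assuming θ is uniform on (0,1), which is what f(θ)=1 implies; the threshold 0.25 is an arbitrary choice.

```python
import random

random.seed(4)
thetas = [random.random() for _ in range(100000)]   # theta ~ Uniform(0, 1)
phis = [t * t for t in thetas]                       # phi = theta^2

# If phi were also uniform on (0, 1), about 25% of draws would fall below 0.25.
# Under the density 1 / (2 * sqrt(phi)), the CDF is sqrt(phi), so we expect 50%.
frac_below = sum(ph <= 0.25 for ph in phis) / len(phis)
print(frac_below)
```

Being "ignorant" about θ in the uniform sense turns out to mean believing that small values of θ² are quite likely, which is the problem a transformation-invariant prior is meant to avoid.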
Jeffreys' prior satisfies this desideratum - define f(θ)∝√I(θ),
where I(θ) is the Fisher information (discussed in the Ch. 10 summary).
Jeffreys' prior isn't totally noninformative - it encodes the information that we expect the prior to be transformation invariant, but that is rather weak information.
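For reference, here is the standard one-parameter argument for why a prior proportional to the square root of the Fisher information is transformation invariant. The subscripts on I, marking which parametrization is in use, are notation introduced here.

```latex
\begin{align*}
f(\theta) &\propto \sqrt{I_\theta(\theta)}, \\
I_\phi(\phi) &= I_\theta(\theta)\left(\frac{d\theta}{d\phi}\right)^{2}
  \quad\text{for a smooth one-to-one reparametrization } \phi = h(\theta), \\
\sqrt{I_\phi(\phi)} &= \sqrt{I_\theta(\theta)}\,\left|\frac{d\theta}{d\phi}\right|.
\end{align*}
```

The last line has the same form as the change of variable rule for densities, so building the prior from the Fisher information in ϕ-coordinates gives the same answer as transforming the θ-prior.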
13: Statistical Decision Theory
In which decision theory is defined as the theory of comparing statistical procedures.
14: Linear Regression
In which the pieces start to line up.
The Bias-Variance Tradeoff
As more covariates are added to a model, the bias decreases while the variance increases. Let's say you call 30 friends and ask them whether they agree with the Copenhagen interpretation of quantum mechanics, or with many-worlds. Say that you build a model with 5 covariates (such as age, sex, race, political leaning, and education level). This has decreased bias compared to a model which uses only education level, since descriptive power increases with the number of covariates. However, you increase variance in the sense that any given friend is more likely to be differently classified every time you run the experiment with slightly different data sets.
If you're familiar with brain surgery (machine learning), we can use it to learn how to apply bandaids. Think of adding more covariates as sliding towards overfitting.
Read more.
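Here is a toy simulation of the tradeoff. It stands in for "adding covariates" by comparing a one-parameter model (a single global mean) with a ten-parameter model (one mean per x value), predicting y at x=9. The data-generating process y=x+noise and every name below are assumptions made up for the sketch.

```python
import random
import statistics

random.seed(5)
xs = list(range(10))               # one covariate taking values 0..9
reps, trials, target = 3, 2000, 9  # 3 noisy observations per x value

simple_preds, flexible_preds = [], []
for _ in range(trials):
    # Assumed true relationship: y = x + N(0, 2^2) noise.
    data = {x: [x + random.gauss(0, 2) for _ in range(reps)] for x in xs}
    all_y = [y for ys in data.values() for y in ys]
    simple_preds.append(statistics.fmean(all_y))           # 1 parameter: global mean
    flexible_preds.append(statistics.fmean(data[target]))  # 10 parameters: one mean per x

def bias_sq(preds):
    # Squared gap between the average prediction and the true value at x = 9.
    return (statistics.fmean(preds) - target) ** 2

def variance(preds):
    # Spread of the prediction across independently redrawn data sets.
    return statistics.pvariance(preds)

print(bias_sq(simple_preds), variance(simple_preds))
print(bias_sq(flexible_preds), variance(flexible_preds))
```

The simple model is badly biased at x=9 but barely changes between data sets; the flexible model is nearly unbiased but jumps around, since each of its parameters is fit from only three points.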
Degrees of Confusion
There are numerous explanations for what degrees of freedom actually are. Some say it's the number of independent parameters required by a model, and others explain it as the number of parameters which are free to vary. Is there a better framing?
Consider X1,…,Xn ∼ N(0,1) i.i.d., and let X̄n be the sample mean. Then the residuals vector (X1−X̄n,…,Xn−X̄n) has n−1 degrees of freedom. Why is this the case, and what does this mean?
Say we learn the values of X1,…,X_{n−1}. Then conditional on our already knowing the sample mean, there is only one value that Xn can take: Xn = n·X̄n − (X1+…+X_{n−1}).
Xn is totally determined by the sample mean and the first n−1 values (this is related to Bessel's correction).
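The constraint is easy to verify numerically. A small sketch with made-up data of size 6:

```python
import random

random.seed(6)
n = 6
xs = [random.gauss(0, 1) for _ in range(n)]
xbar = sum(xs) / n
residuals = [x - xbar for x in xs]

# The residuals always sum to zero, so the last one is fixed by the others.
last_from_others = -sum(residuals[:-1])

# Equivalently, X_n is recoverable from the mean and the first n - 1 values.
x_n_recovered = n * xbar - sum(xs[:-1])
print(residuals[-1], last_from_others)
print(xs[-1], x_n_recovered)
```

Only n−1 of the residuals are free to vary; the sum-to-zero constraint uses up the remaining one.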
Let's ask a similar question - how many bits of information do we need to specify our model? Statistics isn't acclimated to thinking in terms of bits, so "independent real-valued parameters" is the unit used instead. If you have more parameters, you need to gather more bits to have the same confidence that your explanation (model) fits the data you have observed. This is an implicit Occamian prior: amongst models which fit the data equally well, the one with the fewest degrees of freedom is preferred.
I'd like to thank TheMajor for letting me steal their wonderful explanation.
15: Multivariate Models
16: Inference about Independence
17: Undirected Graphs and Conditional Independence
In which (very) elementary graph theory and the pairwise and global Markov conditions are introduced.
18: Log-Linear Models
19: Causal Inference
Simpson's Paradox
Sometimes you have two groups which individually exhibit a positive trend, but have a negative trend when combined.
Imagine it is 2019, and Shrek 5 has just come out.² Being an internet phenomenon, the movie is initially extremely popular with younger demographics, but has middling performance with middle-aged people. Consider concessions sales at a single theater: the younger group buys, on average, 1.8 large popcorns per person, while the older group only averages .7 larges. If 2/3 of the initial viewership at the theater is younger, then we have a weighted average of (2/3)⋅(18/10)+(1/3)⋅(7/10)≈1.433 larges.
The older group actually likes the movie, and recommends it to their friends. The demographic decomposition is now fifty-fifty. During the second week, everyone is a bit hungrier and buys .1 more large popcorns per viewing on average. Then both groups are buying more popcorn, but the weighted average decreased: (1/2)⋅(19/10)+(1/2)⋅(8/10)=1.35 larges.
Obviously, the demographic split shifted the average. However, pretend you're the manager for the concessions stand. You monitor average per-person purchases and erroneously conclude that something you did made people less likely to buy, even though both groups are buying more popcorn.
If you don't control for confounders (in this case, demographics), the statistic of per-person purchases is not reliable for drawing conclusions.
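The popcorn arithmetic can be reproduced exactly with fractions. The helper name `overall` is made up; the numbers are the ones from the example.

```python
from fractions import Fraction as Fr

def overall(groups):
    # groups: list of (share of viewers, popcorns per person)
    return sum(share * rate for share, rate in groups)

week1 = overall([(Fr(2, 3), Fr(18, 10)), (Fr(1, 3), Fr(7, 10))])
week2 = overall([(Fr(1, 2), Fr(19, 10)), (Fr(1, 2), Fr(8, 10))])
print(float(week1), float(week2))
```

Both per-group rates went up by 0.1, yet the overall rate fell from 43/30 to 27/20, purely because the mix of viewers shifted toward the group that buys less.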
20: Directed Graphs
In which passive and active conditioning are built up to by exploring the capacities of directed acyclic graphs for representing independence relations.
21: Nonparametric Curve Estimation
22: Smoothing Using Orthogonal Functions
23: Classification
24: Stochastic Processes
In which we learn processes for dealing with sequences of dependent random variables.
25: Simulation Methods
Final Verdict
This text is very cleanly written and has reasonable exercises. Ideally, I would have gone through my calculus books first, but it wasn't a big deal. The main downside is that I couldn't find an answer key, but thanks to the generous help of my friends on Facebook and in the MIRIx Discord, it worked out.
I skimmed Ch. 21, as it seemed to be more about implementation than deep conceptual material. I intend to revisit Ch. 22 after reading Tao's Analysis I, which is next on my list.
This book took me less than two weeks at a few hours of studying per day.
Forwards
Tips
I quickly realized that learning the basics of the R programming language is essential for getting a large portion of the value this text can offer.
Depth
Although I have fewer things to say on a meta level, I definitely got a lot out of this book. The most rewarding parts were when I noticed my confusion and really dove in to figure out what was going on - in particular, my forays into random variables, confidence intervals, p-values, and convergence types.
Red
I definitely haven't arrived at full-fledged statistical sophistication, but I progressed so rapidly that I regularly thought "what caveman asked that lol" when encountering questions I had asked just days earlier.
This is another data point for a realization I've had over the last month: I'm so red, but I've been living like a white-blue. What does that even mean, and how is it relevant?
From Duncan's excellent fake framework, How the "Magic: The Gathering" Color Wheel Explains Humanity:
The most salient dichotomy present here, in my opinion, is that of red and white:
White personalities often regard themselves as a continuous person, evolving in a somewhat orderly fashion. Red, on the other hand, feels disconnected from their past selves. After a certain amount of time, past-you feels like a different person who made choices that now seem ridiculous, if not alien. How old is your current iteration? Mine is three months, but what shocked me about this book was that I felt an intellectual disconnect with the me who existed four days prior.
Zooming out from All of Statistics, I think it's telling that I achieved fairly tectonic change³ by learning to align my emotions with my reflectively-coherent desires, to clear away emotional debris, and to channel my passion into discrete tasks. I was living as if I were a white, but it's now clear I'm a blue-red who exhibits white traits mostly in pursuit of peace of mind.
I no longer ask "how can I study most effectively?", but rather, "what does it feel like to be me right now, and how can I bring that into alignment with what I want to do?".
If you are interested in working with me or others on the task of learning MIRI-relevant math, if you have a burning desire to knock the alignment problem down a peg - I would be more than happy to work with you. Messaging me may also have the pleasant side effect of your receiving an invitation to the MIRIx Discord server.
¹ Although any shape in the sequence implied by the image does indeed have strictly different area than the circle it approximates (in contrast to Fn and F), the analogy may still be helpful.
² Please don't wirehead thinking about this.
³ I'm aware that this section isn't very implementable. I may write more on my post-CFAR experience in the near future.