(This is the second of two posts on the textbook All of Statistics. Click here for post I.)
4. Fundamentals of Statistical Inference
Probability theory is about using distributions to derive information about outcomes. Given the assumption that X∼Poisson(3), we can compute probabilities of outcomes of the form Pr(X=k). Statistics is about the opposite: using outcomes to derive information about distributions.
The book exclusively considers the setting where we observe independently sampled data points x1,...,xn∼X where the distribution of X is unknown. It is often convenient to talk about the RV that generated the k-th sample point, the logical notation for which is Xk. The book uses upper-case letters for both (e.g., "we observe data X1,...,Xn"), not differentiating between the RV that generates the k-th point (i.e., Xk) and the k-th point itself (xk).
On the highest level, one can divide all inference problems into two categories: parametric inference and non-parametric inference. In parametric inference, we assume that we know the family of distributions that X belongs to (Binomial, Poisson, Normal, etc.). In this case, the problem reduces to inferring the parameters that characterize distributions in that family. For example, if we observe data on traffic accidents, we may assume that X∼Poisson(λ), and all that's left to estimate is λ. Conversely, in the context of non-parametric inference, one does not assume that X belongs to any one family and thus has to estimate the distribution directly.
5. Bayesian Parametric Inference
5.1. The concept
Let Θ be the parameter we wish to learn about. Since we don't know its value, we treat it as a RV that ranges over several possible values. Furthermore, let x=(x1,...,xn) be our observed data, and let X=(X1,...,Xn) be the RVs that generate this data.
Bayes' theorem is an equation relating PrB(A) to PrA(B). (With PY(X):=P(X|Y).) In our case, we are interested in terms of the form P[X=x](Θ=θ), and we know how to compute terms of the form P[Θ=θ](X=x) because once we fix a value of Θ, the distribution is known and we can compute the required probabilities. Thus, Bayes' theorem gives us a way to reduce a problem of statistical inference to a problem of probability theory.
Concretely, we have
$$P_{[X=x]}(\Theta=\theta) = \frac{P_{[\Theta=\theta]}(X=x)\,P(\Theta=\theta)}{P(X=x)}$$
Our prior on θ will virtually always be continuous. Let h be the pdf of this prior. The Xi may be continuous or discrete. Let f be their pdf or pmf. Our formula becomes
$$h_{[X=x]}(\theta) = \frac{f_{[\Theta=\theta]}(x)\,h(\theta)}{f(x)}$$
Note that the denominator does not depend on θ: once we observe the data, how likely it was to appear has no impact on how we weigh different values of Θ against each other. Thus, we have
$$h_{[X=x]}(\theta) \propto h(\theta)\, f_{[\Theta=\theta]}(x)$$
Suppose we begin by computing the right side. If f is a pmf, the result is an incorrectly scaled pdf (since we multiply a density with a product of masses, which is just a scalar). If f is a pdf, the result will be the product of a density with a product of other densities. If you think of a density as a mass multiplied by an infinitesimal, then it's a term with an infinitesimal to the power of n+1. However, one can ignore this fact entirely, compute the integral of the [non-density-function obtained by evaluating the right side above], then scale that non-density-function by the inverse of that integral, and the result will be the correct density function, h[X=x] (I really need to study nonstandard analysis at some point). Since computing f(x) is usually harder than computing this integral, this is the standard approach.
Using the fact that the xi are independently sampled, we can further rewrite the formula as
$$h_{[X=x]}(\theta) \propto h(\theta) \prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i)$$
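As a concrete illustration of the recipe from the previous paragraphs (compute the unnormalized right side, then rescale by its integral), here is a minimal numerical sketch. The Poisson likelihood, the Exponential(1) prior, and the data are illustrative choices of mine, not taken from the book.

```python
import numpy as np
from scipy import stats

# Illustrative setup (my own choice, not from the book): counts assumed to come
# from Poisson(lam), with an Exponential(1) prior density h(lam) = exp(-lam).
data = np.array([2, 4, 3, 5, 3])

lam = np.linspace(0.01, 15, 2000)                        # grid over the parameter
prior = stats.expon.pdf(lam)                             # h(lam)
likelihood = np.prod(stats.poisson.pmf(data[:, None], lam), axis=0)

unnormalized = prior * likelihood                        # the right-hand side above
posterior = unnormalized / np.trapz(unnormalized, lam)   # rescale by its integral
print("posterior mean of lam:", np.trapz(lam * posterior, lam))
```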
The only open question is how to choose the prior distribution on θ. The philosophically correct thing to do is to use one's prior beliefs about θ, but this has the issue of making methods depend on the person carrying them out. There are various methods suggested in the literature that construct analytical priors, but there is no consensus as to which one is best.
5.2. A discrete Example
Let x1,...,xn∼X with X∼Bernoulli(Θ). Note that here, Θ is a probability, so any reasonable prior should assign zero density to values outside of [0,1]. For now, assume our prior on Θ is uniform, i.e., h(θ)=1 for θ∈[0,1].
Define $k := \sum_{i=1}^{n} x_i$. We have
$$h_{[X=x]}(\theta) \propto h(\theta)\prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i) = \theta^k (1-\theta)^{n-k}$$
Usually, we would now have to compute the integral $\int_0^1 \theta^k(1-\theta)^{n-k}\,d\theta$. However, in this case, our posterior has a Beta distribution, $h_{[X=x]} \sim \mathrm{Beta}(k+1,\, n-k+1)$. This means we already know that the integral equals $\frac{\Gamma(k+1)\,\Gamma(n-k+1)}{\Gamma(n+2)}$, which means that our posterior pdf is
$$h_{[X=x]}(\theta) = \frac{\Gamma(n+2)}{\Gamma(k+1)\,\Gamma(n-k+1)}\,\theta^k (1-\theta)^{n-k}$$
Here is a plot of h (black) and h[X=x] (red) for n=10 and k=7:
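The posterior in this plot can also be computed numerically; here is a small sketch (the grid-normalization check is mine, not from the book):

```python
import numpy as np
from scipy import stats

n, k = 10, 7
theta = np.linspace(0, 1, 2001)

posterior = stats.beta.pdf(theta, k + 1, n - k + 1)    # Beta(k+1, n-k+1)

# The same posterior, obtained by normalizing the unscaled right-hand side numerically:
unnormalized = theta**k * (1 - theta)**(n - k)         # uniform prior h(theta) = 1 is implicit
numeric = unnormalized / np.trapz(unnormalized, theta)

print(np.max(np.abs(posterior - numeric)))             # small (grid/quadrature error only)
```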
5.3. A continuous Example
Let X∼Uniform(0,θ), and let x1,x2,x3 be samples of X. Assume we have prior $h(\theta) = \frac{1}{1+\theta^2}$ for $\theta \in \mathbb{R}^+$. Let $y := \max\{x_1, x_2, x_3\}$ and suppose that y=1. Then,
$$h_{[X=x]}(\theta) \propto h(\theta)\prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i) = \frac{1}{1+\theta^2}\prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i) = \begin{cases}\frac{1}{1+\theta^2}\,\theta^{-3} & \theta \ge 1\\ 0 & \theta < 1\end{cases}$$
Therefore,
$$\int_{-\infty}^{\infty} \text{not-}h_{[X=x]}(\theta)\,d\theta = \int_{1}^{\infty}\frac{\theta^{-3}}{1+\theta^2}\,d\theta = \int_{1}^{\infty}\frac{1}{\theta^3+\theta^5}\,d\theta = \frac{1-\ln(2)}{2},$$
which means we need to scale by $\frac{2}{1-\ln(2)}$, and our posterior pdf is
$$h_{[X=x]}(\theta) = \frac{2}{(1-\ln(2))\,(\theta^3+\theta^5)}$$
Here is a plot of h (black) and h[X=x] (red):
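A quick numerical sanity check of this normalization constant (a sketch using scipy's quadrature; not from the book):

```python
import numpy as np
from scipy import integrate

# Unnormalized posterior from above: prior 1/(1+theta^2) times likelihood theta^(-3),
# supported on theta >= 1 (it is zero below the observed maximum y = 1).
unnorm = lambda t: t**-3 / (1 + t**2)

Z, _ = integrate.quad(unnorm, 1, np.inf)
print(Z, (1 - np.log(2)) / 2)                      # both approximately 0.1534

posterior = lambda t: 2 / ((1 - np.log(2)) * (t**3 + t**5))
total, _ = integrate.quad(posterior, 1, np.inf)
print(total)                                        # approximately 1.0
```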
5.4. Applications
The book mentions three distinct use cases for statistical inference:
Point-estimation: provide a best guess for a specific object, like the parameter of interest.
Confidence intervals: give an interval such that we can say with a given certainty that the parameter is in this interval. This is often reported as 'the margin of error'. If a pollster reports that 27% of people like something and the margin of error is 3 points, this probably means that θ∈[0.24,0.3] with 95% confidence. There is nothing special about 0.95, but it's the convention.
Hypothesis testing, which we'll cover in chapter 7.
To obtain a point estimate, one may take the mean, median, or mode of the posterior distribution. To obtain a confidence interval, suppose we want certainty 0.9. Then, we need only find numbers a,c such that
$$\int_a^c h_{[X=x]}(\theta)\,d\theta = 0.9$$
If we want, we can always choose a,c such that our point estimate ^θ of Θ is exactly in the middle of the interval [a,c].
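For the Beta posterior from section 5.2, an equal-tailed interval with 90% mass can be read off the quantile function. This is a sketch of one possible choice of a and c, not the only one:

```python
from scipy import stats

n, k = 10, 7
posterior = stats.beta(k + 1, n - k + 1)         # the posterior from section 5.2

# Equal-tailed 90% interval: cut off 5% of posterior mass on each side.
a, c = posterior.ppf(0.05), posterior.ppf(0.95)
print(a, c)
print(posterior.cdf(c) - posterior.cdf(a))        # 0.9 by construction
```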
6. Frequentist Methods
6.1. The concept
The prevalent school of thought in statistics holds that the unknown parameter θ is not a RV but a constant. Frequentists therefore reject Bayesian methods, since those make probability statements about θ. Instead, Frequentist methods deal in terms of what are called estimators. An estimator is an arbitrary function of the data that is meant to estimate a certain quantity (such as a parameter of interest). Then, there are rigorous analytical methods to evaluate estimators.
There are a bunch of commonly used estimators that are applicable to many problems. However, for any specific problem, one may always construct a nonstandard estimator.
In this chapter, we use Θ to denote a space of possible values for the unknown parameter, and we denote the parameter itself as θ. (We don't need a symbol for the variable in the pdf of θ because it doesn't have a pdf.) Also, we can no longer condition on θ having a certain value, so we write fθ instead of f[Θ=θ], and think of f as denoting a family of probability densities with fθ being one of them, rather than as one density that we can update on the event Θ=θ.
6.2. Evaluations
Suppose we observe samples x1,...,xn from some distribution parametrized by θ. Let f be an estimator of θ. Since f is a function of the data and we can view each data point xk as generated by a random variable Xk, we can view f itself as a random variable that is some function of the Xk. For example, if we wish to estimate the mean of the distribution, then $f = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the obvious choice. This means we can compute
$$E\big((f-\theta)^2\big)$$
which is called the Mean Squared Error. Note that computing the Mean Squared Error will yield a term involving θ (and typically n as well).
If E(f)=θ, the Mean Squared Error equals V(f). In this case,
$$\operatorname{bias}(f) := E(f) - \theta$$
is 0, and we call the estimator unbiased.
At first glance, being unbiased sounds like a good thing. It took me a while to understand exactly why, from a Bayesian perspective, bias is desirable:
The obvious Bayesian point estimate for θ is the mean of the posterior distribution.
For this estimate to be unbiased, it would have to be the case that, if we condition on Θ=5, in expectation (that's expectation over the generated data given that Θ=5) the posterior distribution of Θ centers around 5.
However, a proper Bayesian update (even on data that is typical for the case that Θ=5) only moves the distribution some portion of the way toward the evidence, not the entire way.
Thus, the posterior distribution's mean will be closer to 5 than that of the prior distribution, but it won't be all the way there.
In other words, Bayes estimators are systematically biased toward the prior distribution, whereas Frequentist methods do not include a prior and hence can be unbiased. A recurring theme is that Frequentist methods tend to look like (or even be identical to) Bayesian methods with flat priors.
6.3. An Example
Consider again the case of a continuous RV X∼Uniform(0,θ) with θ unknown, where we have samples x1,...,xn of X. In this case, any function f:Rn→R+ that maps the data to an estimated value of θ is a valid estimator.
Here are two estimators that make intuitive sense:
$f_1:(x_1,...,x_n)\mapsto \max\{x_i : 1\le i\le n\}$ (if n is large, then the largest sample should be close to θ)
$f_2:(x_1,...,x_n)\mapsto \frac{2}{n}\sum_{i=1}^{n} x_i$ (since E(X)=θ/2, twice the sample mean should be close to θ)
In this case, the first estimator is biased (f1 is, in fact, guaranteed to be smaller than θ), whereas the second estimator is unbiased. Which estimator is better?
At first, this seems completely unclear. You could now do two things. One is to compute the Mean Squared Error (this can be done exactly). The other is to take a look at what the Bayesian estimator was doing. In particular, we can see that the posterior distribution is solely determined by the maximum of the xi. This suggests that f1 is better, which is indeed true. (It can be scaled by $\frac{n+1}{n}$ to make it unbiased, but even the biased version is superior if n is large.)
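A small simulation (my own sketch; θ=5 and n=20 are arbitrary) makes the comparison concrete. In runs like this, both versions of the maximum-based estimator come out with a clearly smaller MSE than twice the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 5.0, 20, 100_000

x = rng.uniform(0, theta, size=(trials, n))
f1 = x.max(axis=1)                      # sample maximum
f1_corrected = (n + 1) / n * f1         # bias-corrected maximum
f2 = 2 * x.mean(axis=1)                 # twice the sample mean

for name, est in [("max", f1), ("(n+1)/n * max", f1_corrected), ("2 * mean", f2)]:
    print(f"{name:>15}  bias={est.mean() - theta:+.4f}  MSE={np.mean((est - theta)**2):.4f}")
```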
6.4. Commonly used Estimators
6.4.1. Maximum Likelihood Estimator
The Maximum Likelihood Estimator is the function outputting the value for θ such that the probability mass or density for the observed data is maximized, i.e.,
$$\operatorname{argmax}_{\theta\in\Theta}\ \prod_{i=1}^{n} f_\theta(x_i)$$
Thus, the Maximum Likelihood estimator outputs the mode of the posterior distribution obtained by doing a Bayesian Update with a flat prior. In the example from 6.3, the Maximum Likelihood estimator is just f1.
In the case where θ denotes a probability, we have Θ=[0,1] and a flat prior is possible. In the case where $\Theta=\mathbb{R}^+$, a flat prior is mathematically impossible in standard analysis because $\int_0^\infty c\,d\theta$ does not converge. It also "implies" that $\Pr[\theta < 10^{1000}] \approx 0$ since that interval is infinitely smaller than $\mathbb{R}^+$ and our prior is uniform.
6.4.2. Empirical Distribution Function
In non-parametric inference, the empirical distribution function is an estimate for the distribution that generated the data. It simply puts probability mass $\frac{1}{n}$ on every data point xi. Let fEDF be this function. It's pretty easy to see that, for any probability mass function f, we have
$$\prod_{i=1}^{n} f(x_i) \le \prod_{i=1}^{n} f_{\mathrm{EDF}}(x_i)$$
Thus, the empirical distribution function is the maximum likelihood estimator applied to the probability distribution rather than to a parameter.
6.4.3. Method of Moments
The method of moments is a method for parametric inference that is not a special case of Bayesian Inference. Recall that a statistical functional provides a single number that encodes some property of a RV. Suppose that we can estimate this number for an observed data set. Then, we can estimate the value of a parameter θ in two steps:
Estimate the statistical functional for the data set
Choose θ such that the statistical functional of fθ equals that number
If we have several parameters to estimate, just do the same with several statistical functionals. This leads to a system with k equations and k unknowns.
The statistical functionals we use are the moments, hence the name. Recall that the k-th moment is defined as
$$\mu_k := \int_{-\infty}^{\infty} x^k f(x)\,dx$$
We estimate the k-th moment of a data set (x1,...,xn) in the obvious way:
$$\hat\mu_k := \frac{1}{n}\sum_{i=1}^{n} x_i^k$$
Since the value of the μk's depends on the pdf f, which depends on our parameters, computing the μk yields a term that includes those parameters, whereas computing the ^μk just yields numbers. Thus, if we have two parameters, then
$$\mu_1 = \hat\mu_1 \qquad \mu_2 = \hat\mu_2$$
is actually a system of two equations in two unknowns, the unknowns being our two parameters.
The method of moments is not super accurate, but has the advantage of being computationally cheap.
In the example from 6.3, $\mu_1 = E(X) = \frac{\theta}{2}$ and $\hat\mu_1 = \frac{1}{n}\sum_{i=1}^{n} x_i$. Thus, the method of moments estimator is just f2.
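For a case with two parameters, here is a sketch using a Normal distribution with unknown mean and standard deviation (my own example, not the book's); the two moment equations are solved directly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)   # pretend mu and sigma are unknown

m1 = np.mean(x)          # first sample moment
m2 = np.mean(x**2)       # second sample moment

# For N(mu, sigma^2): mu_1 = mu and mu_2 = mu^2 + sigma^2, so solve the 2x2 system:
mu_hat = m1
sigma_hat = np.sqrt(m2 - m1**2)
print(mu_hat, sigma_hat)   # close to 3 and 2
```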
7. Hypothesis Testing and p-Values
Suppose we want to learn about some parameter θ. In this case, think of θ as measuring some effect we care about, say the difference in [frequency of trait X among people with trait Y] and [frequency of trait X among all people]. Let Θ be the parameter space of θ, and let Θ0⊊Θ be the subset in which "there is no effect". The claim that there is no effect, i.e., that θ∈Θ0 is called the null hypothesis, and the goal of a test is to refute the null hypothesis.
Let x1,...,xn∼X be our data. Let S be a function from the data to R. Formally, this is the same as an estimator, but here we call it the test statistic. S should be such that it yields a high number if there is an effect and a low number if there is not (or vice versa).
Once we evaluate S, we get some number $x\in\mathbb{R}$ that is somewhere on the less-effect-y ↔ more-effect-y axis. Let $R\subsetneq\mathbb{R}$ be the subset of results that look at least as effect-y as x. Here's a visualization:
We define the p-value of this test as the maximum chance that the result landed in R even though the null hypothesis was true, i.e.,
$$\text{p-value} := \sup_{\theta\in\Theta_0} \Pr(S\in R)$$
And that's it; that's a p-value. This is all well-defined since, once a θ is fixed, S is just a RV, and since R is a subset of the form $\{z\in\mathbb{R} : z \ge z_0\}$ the condition $S\in R$ can be rewritten as $S\ge z_0$.
In the ideal case, a p-value of 0.002 means "under the assumption that the effect we are testing for doesn't exist (as defined by our choice of Θ0⊊Θ), there was at most a 0.2% chance for the data to show as much of an effect or more as we have observed". Note that this is not a probability statement about θ. (Frequentists don't want to make probability statements about θ since it's a constant.) The next chapter will work through an example of how to use p-values (which is also an example of where they 'fail', and it will illustrate why saying 'it's not a probability statement about θ' is more than a technicality).
In classical Hypothesis Testing, one defines the region R ahead of time, evaluates S on the data, and reports a binary 'null-hypothesis rejected' if S∈R and 'null-hypothesis not rejected' if S∉R. This is much worse since now the choice of R is arbitrary, and one can always find two points $y,z\in\mathbb{R}$ such that z suggests a 0.00000000000000001% stronger effect than y, yet z∈R and y∉R. Compared to this, reporting a p-value is amazingly informative.
8. Failures
8.1. ... of p-values
Suppose we have a not-necessarily-fair coin with unknown probability p of getting heads. We want to refute the null hypothesis 'the coin is fair' defined by the one-point parameter subspace Θ0:={0.5}⊊[0,1].
Here are two possible experiments to test this hypothesis.
8.1.1. EXPERIMENT ONE
Experiment one is: "flip the coin twelve times". Our sample statistic is the RV X that counts the number of heads. Obviously, X∼Binom(12,p). We obtain the result X=3. The region R (of all outcomes that are at least as extreme as the observed one) is {0,1,2,3}. The p-value is thus
$$\sup_{\theta\in\Theta_0}\Pr(X_\theta\in R) = \Pr(X_{0.5}\le 3) = \sum_{k=0}^{3}\binom{12}{k}\,0.5^{12} \approx 0.073$$
8.1.2. EXPERIMENT TWO
Experiment two is: "flip the coin until it lands on heads three times". Our sample statistic is the RV Y that counts the number of tosses that came up tails. We have Y∼NegativeBinomial(3,p). (The negative binomial distribution hasn't made it into the book or into my previous post; it's a generalization of the Geometric distribution that counts the failures before the r-th success rather than before the first.) We obtain the result Y=9 (i.e., nine tails, three heads). The region R (of all outcomes that are at least as extreme as the observed one) is {9,10,11,12,13,14,15,...}. The p-value is thus
$$\sup_{\theta\in\Theta_0}\Pr(Y_\theta\in R) = \Pr(Y_{0.5}\ge 9) = \sum_{k=9}^{\infty}\binom{k+3-1}{k}\,0.5^{3}\,0.5^{k} \approx 0.0327$$
8.1.3. THE FAILURE
The first experiment yielded a p-value of 0.073, whereas the second one yielded a p-value of 0.0327. To see why this is an issue, consider a Bayesian update. Let h be an arbitrary distribution over Θ (now uppercase since we treat it as a RV). In experiment one, we observed twelve coin flips with 3 heads and 9 tails, hence
$$h_{[X=x]}(\theta) \propto h(\theta)\prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i) = h(\theta)\,\theta^3(1-\theta)^9$$
In experiment two, we likewise observed twelve coin flips with 3 heads and 9 tails, hence
$$h_{[X=x]}(\theta) \propto h(\theta)\prod_{i=1}^{n} f_{[\Theta=\theta]}(x_i) = h(\theta)\,\theta^3(1-\theta)^9$$
Thus, since both experiments have observed the same data, the posterior distribution is exactly the same in both cases (provided that both use the same prior). Nonetheless, the experiments have yielded different p-values -- and what's worse, the expected p-value is also different! Given a fixed θ, the second experiment systematically produces lower p-values, both in terms of their mean and in terms of their median.
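Both tail probabilities can be verified in a couple of lines (a sketch using scipy's binomial and negative binomial distributions):

```python
from scipy import stats

# Experiment one: 12 flips, 3 heads; one-sided binomial tail under p = 0.5.
p1 = stats.binom.cdf(3, 12, 0.5)
# Experiment two: tails counted until the 3rd head; observed 9 tails.
p2 = stats.nbinom.sf(8, 3, 0.5)        # P(Y >= 9) = 1 - P(Y <= 8)
print(round(p1, 4), round(p2, 4))      # approximately 0.073 and 0.0327
```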
Despite having heard lots of people rail against p-values in the past, I was pretty shocked by this result. I had previously thought that p-values are bad because they provide bad incentives and give misleading impressions. I didn't know that they have an in-principle arbitrary component.
I don't think I understand why p-values are lower in the second experiment, though. What properties of the distribution are responsible? If anyone knows, please explain.
8.2. ... of Bayesian inference
Bayesian Inference relies on Bayes' theorem, which is a theorem. Can it really fail? We shall see. The book mentions two cases meant to showcase weaknesses of Bayesian inference.
The first one is very silly. It comes down to the following experiment where we have a googol biased coins: "randomize a number k between 1 and $10^{100}$, report the result of a flip by the k-th biased coin; repeat N times". Then, the book observes that, for any realistic N, using Bayesian inference on the $10^{100}$ unknown probabilities of the biased coins will fail since most of them are never observed. Of course, this has zero to do with Bayesian methods and everything to do with what we choose to update; if one estimates the total probability of getting heads, Bayesian inference does wonderfully.
The second one is much more interesting. Let $g:\mathbb{R}^+\to\mathbb{R}^+$ be a continuous, integrable function such that $\int_0^\infty g(x)\,dx =: c \in \mathbb{R}$. In general, g is not a pdf, but $f := \frac{1}{c}g$ is.
Suppose we know g and want to estimate c. One way to obtain c is by computing the integral, but this may be difficult. Instead, suppose we are given points x1,...,xn that are independently sampled according to f.
Let h be any prior on c. We have
$$h_{[X=x]}(c) \propto h(c)\prod_{i=1}^{n} f_{[C=c]}(x_i) = h(c)\prod_{i=1}^{n} \frac{1}{c}\,g(x_i)$$
This expression with the prior excluded is larger the smaller c is. Thus, this experiment will update h toward lower values regardless of the observed data points x1,...,xn.
... wait, what? What happened?
Here is what I think happened. The function g fully determines the value of c. Since we know g, one cannot model c as a random variable. In fact Pr(C=c) must be 1 since 1=1⟹C=c, hence Pr(C=c)≥Pr(1=1)=1. If anything C is a logically uncertain variable à la Logical Induction. Since probability theory cannot handle logical uncertainty, the method fails.
However, it is interesting that there is an estimator that does a fine job of approximating c. This estimator is $\hat{c}(x_1,\dots,x_n) := \frac{1}{n}\sum_{i=1}^{n}\frac{g(x_i)}{\hat f(x_i)}$, where the $\hat f(x_i)$ are estimates of the density f at the observed data points $(x_1,...,x_n)$. (The book doesn't explain how this is done.) Thus, the concept of 'estimator' appears to include methods that deal with logical uncertainty.
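Here is a rough sketch of that estimator on a toy g of my own choosing, with a kernel density estimate standing in for f̂ (the book does not say how f̂ is obtained, so the KDE is purely illustrative):

```python
import numpy as np
from scipy import stats

# Toy instance (my own choice, not the book's): g(x) = x * exp(-x) on R+,
# whose integral is c = 1, so f = g / c is the Gamma(2, 1) density.
g = lambda x: x * np.exp(-x)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=20_000)   # i.i.d. samples from f

f_hat = stats.gaussian_kde(x)        # crude nonparametric estimate of f
c_hat = np.mean(g(x) / f_hat(x))     # the estimator described above
print(c_hat)                          # typically within a few percent of c = 1
```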
Alas, it appears that the second example highlights a legitimate failure of Bayesian inference.
9. Statistical Decision Theory
So far, we've evaluated predictors by computing their Mean Squared Error. One can also do this for Bayesian methods. In the discrete example (5.2.), the posterior distribution $h_{[X=x]}$ has mean $\frac{k+1}{n+2}$ (no computations are needed to derive this result since the mean of a Beta distribution is known), so $f:(x_1,...,x_n)\mapsto \frac{1+\sum_{i=1}^{n}x_i}{n+2}$ is just another estimator that happens to have been derived with Bayesian methods.
However, recall that computing the Mean Squared Error yields a term involving the true parameter θ. This can be considered a function of θ, and thus, one can imagine two estimators whose error functions cross, such that each of them yields the lower error for some values of θ. In such a case, it's not immediately obvious which one is preferable. Statistical Decision Theory is about making this decision.
Let f be an estimator. As mentioned, its mean squared error is a function of θ. If we want to compare it to the mean squared error of other estimators, we need a one-point summary of this function. Here are two ways of doing this:
$$r_{\max}(f) := \sup_{\theta\in\Theta} E\big((f-\theta)^2\big)$$
$$r_{\mathrm{Bayes}}(f,h) := \int E\big((f-\theta)^2\big)\,h(\theta)\,d\theta$$
I.e., we can either optimize for the worst case and minimize the maximum error across all possible values of θ, or we can weigh the errors depending on which θ is more probable. Of course, this again requires a prior h on θ, hence why it's called the Bayes risk.
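As a concrete illustration (my own example, with n=10): for the Bernoulli problem, the exact risk functions of the MLE k/n and of the posterior-mean estimator (∑xi+1)/(n+2) from section 5.2 can be written down and summarized both ways.

```python
import numpy as np

n = 10
p = np.linspace(0, 1, 501)

mse_mle = p * (1 - p) / n                                      # risk of k/n
mse_bayes = (n * p * (1 - p) + (1 - 2 * p)**2) / (n + 2)**2    # risk of (k+1)/(n+2)

print("max risk:  ", mse_mle.max(), mse_bayes.max())
print("Bayes risk:", np.trapz(mse_mle, p), np.trapz(mse_bayes, p))   # flat prior h = 1
```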
One can prove that:
If h is a proper prior, the estimator f that minimizes rBayes(f,h) is the function that maps the data to the mean of the posterior distribution, i.e., the estimator given by $f(x) := \int \theta\, h_{[X=x]}(\theta)\,d\theta$.
If h is a proper prior and one replaces the Mean Squared Error with the absolute loss, i.e., E(|f−θ|) instead of E((f−θ)2), then it's the median of the posterior distribution. (This was quite surprising to me. I always thought that, if one has values 1,2,3,4,90, then, surely, 20 (the mean) is a better summary of this data than 3 (the median). However, if one defines 'better' in terms of 'lower mean absolute distance to one of the elements', then 3 is better! This quickly becomes obvious if you think about it; once you're at the median element, any step to either side increases the distance to more than half of all points and decreases the distance to fewer than half, by the same amount.)
If h is a proper prior and one replaces the Mean Squared Error with the binary error (1 if θ is hit exactly and 0 otherwise), then it's the mode of the posterior distribution -- that is, provided the distribution of θ is discrete; if it's continuous, the loss is 1 with probability 1 regardless of the estimator.
Thus, Bayesian methods are optimal to minimize the Bayes risk.
Suppose a Bayes estimator f for prior h yields an error that does not depend on θ, i.e., an error function that is a constant. In this case, the one-point summary is just that constant, which means that rBayes(f,h)=rMAX(f). Thus, one can also view minimization of rMAX as a special case of applying Bayesian methods. In this case, h is called the least favorable prior.
10. Other Concepts
10.1. Score and Fisher Information
Consider a parametric inference problem. Let θ be the parameter and f the density function. This time, we consider f to be a function of two variables θ and x.
The score is defined as the derivative of the log-density with respect to θ, i.e.,
$$s := \frac{\partial \log f}{\partial\theta}$$
The score is again a function of two variables. However, if we fix a θ∈Θ, then the function sθ defined by sθ(x):=s(θ,x) is a RV. For any point x, sθ(x) measures how increasing θ affects the density that f assigns to x.
Suppose that E(sθ)>0. That would mean that increasing θ increases the overall density of f. However, this is impossible: fθ+ϵ is still a valid pdf, so the overall density cannot change. Thus, the expectation of the score is always zero. This can also be proven formally.
On the other hand, consider the variance of the score, V(sθ). We can imagine several cases:
If f does not depend on θ at all, then the score is zero everywhere. In this case, its variance is also zero.
If f can take drastically different forms depending on the value of θ, then the score should be high in some places and low in others. In this case, the variance is high.
Thus, the variance of the score is a measure of how much the parameter impacts the distribution. Equivalently (I believe, I don't know information theory), it's a measure of how much information [samples of the distribution] reveal about θ. Note that evaluating this variance for a specific θ yields a number; if θ is left as a parameter, it yields a term that includes θ. The variance of the score is also called the Fisher Information, and it's denoted as a function I (of θ). Wikipedia lists the Fisher Information for many important distributions; for example, if X∼Bernoulli(p), then the Fisher Information is $\frac{1}{p(1-p)}$.
One can prove that
$$I(\theta) = V(s_\theta) = E\!\left(-\frac{\partial s}{\partial\theta}\right) = -E\!\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)$$
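A quick numerical check of these identities for the Bernoulli case (the score below is just the derivative of the log-pmf; p = 0.3 is an arbitrary choice of mine):

```python
import numpy as np

# Bernoulli(p): log f(x; p) = x*log(p) + (1-x)*log(1-p),
# so the score is s(x) = x/p - (1-x)/(1-p).
p = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)
print(score.mean())                     # close to 0
print(score.var(), 1 / (p * (1 - p)))   # both close to 1/(p(1-p)) ≈ 4.76
```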
If $\hat\theta$ is the maximum likelihood estimator of θ, then the Fisher Information plays a role similar to the variance of $\hat\theta$, and one can use this to prove theorems about the limit distribution of $\hat\theta$. Furthermore, in the context of Bayesian Inference, there is something called the Jeffreys prior, which says that the prior on a parameter should be proportional to the square root of its Fisher Information. This does not seem to make sense to me, but perhaps I don't understand it.
10.2. Bootstrapping
Which is a fairly complicated method for estimating the variance of statistics.
10.3. The Jackknife
Same as above.
10.4. Simulation
Which is a Monte Carlo method to estimate things that could in theory be computed but may be hard to compute, such as the marginal distribution of θ1 when the distribution of a vector (θ1,θ2,...,θn) is known.
10.5. Multiparameter Models for Parametric Inference
In which (who'd have guessed) parametric inference with multiple parameters is studied.
10.6. Various tests
As in 'specific ways to do hypothesis testing'. There are also Bayesian methods here.
10.7. Admissibility
Which is a property of estimators. An estimator is inadmissible if there is another estimator that performs at least as well for all θ, and better for at least one θ. It's admissible if it's not inadmissible. Bayes' estimators with proper priors are always admissible.
10.7.1. Stein's Paradox
Suppose we sample x1∼N(θ1,1),...,xn∼N(θn,1) and want to estimate the θi. Stein's paradox says that the estimator f:(x1,...,xn)↦(x1,...,xn) is inadmissible if n>2 (and admissible for n=2 and n=1).
10.8. Plug-in Estimators
The empirical distribution function gives a general way to estimate the value of any statistical functional on an unknown distribution, provided we have access to sample points (x1,...,xn). This way is, of course, to simply compute the statistical functional of the empirical distribution function derived from (x1,...,xn).
In the case of the mean, this leads to the obvious estimate $\hat\mu := \frac{1}{n}\sum_{i=1}^{n} x_i$ that is used in the method of moments. For the variance, it leads to the estimate $\hat\sigma^2 := \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2$. Interestingly, this estimate is biased. The unbiased estimate is called the sample variance and it's defined by $S^2 := \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\hat\mu)^2$. The reason for this is that any term of the form $(x_i-\hat\mu)^2$ is, in expectation, slightly lower than it should be because xi has contributed to the estimate $\hat\mu$, thus causing $\hat\mu$ to be closer to xi in expectation than the true mean is. For example, suppose we have just two data points x1 and x2. If they are both below the true mean, the estimate $\hat\mu$ will be too small, so the terms $(x_1-\hat\mu)^2$ and $(x_2-\hat\mu)^2$ will be too small as well. If they are both above the true mean, the estimate $\hat\mu$ will be too large, and the terms $(x_1-\hat\mu)^2$ and $(x_2-\hat\mu)^2$ will again be too small. Thus, regardless of the direction in which the noise skews the data points, the plug-in estimate for the variance will underestimate the true variance, making it biased. If one does the computation, it turns out that this bias amounts, in expectation, to precisely the factor $\frac{n}{n-1}$: the plug-in estimate has expectation $\frac{n-1}{n}\sigma^2$, so multiplying by $\frac{n}{n-1}$ removes the bias.
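A short simulation (my own sketch; n = 5 and the Normal data are arbitrary choices) shows the factor directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 4.0

x = rng.normal(0, np.sqrt(true_var), size=(trials, n))
plug_in = x.var(axis=1, ddof=0)       # (1/n) * sum (x_i - mean)^2
sample_var = x.var(axis=1, ddof=1)    # (1/(n-1)) * sum (x_i - mean)^2

print(plug_in.mean())                  # close to (n-1)/n * true_var = 3.2
print(sample_var.mean())               # close to true_var = 4.0
```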