You know how people make public health decisions about food fortification, and medical decisions about taking supplements, based on things like the Recommended Dietary Allowance? Well, there's an article in Nutrients titled "A Statistical Error in the Estimation of the Recommended Dietary Allowance for Vitamin D." This paper says the following about the data used to establish the US recommended dietary allowance (RDA) for vitamin D:

The correct interpretation of the lower prediction limit is that 97.5% of study averages are predicted to have values exceeding this limit. This is essentially different from the IOM’s conclusion that 97.5% of individuals will have values exceeding the lower prediction limit.

The whole point of looking at averages is that individuals vary a lot due to a bunch of random stuff, but if you take an average over a lot of individuals, most of that noise cancels out, so the average varies hardly at all. How much variation there is from individual to individual determines the population variance. How much variation you'd expect in your average from sample to sample, purely due to statistical noise, determines the variance of the sample mean (its square root is the standard error of the mean).
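
Here's a quick simulation to make that concrete, with made-up numbers that have nothing to do with the vitamin D data: individual values spread out with the full population standard deviation, while averages of 50 of them spread out roughly √50 times less.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_sd, n = 100.0, 20.0, 50  # hypothetical population; samples of 50 people

# Draw 10,000 samples of size n and record each sample's average.
sample_means = rng.normal(true_mean, true_sd, size=(10_000, n)).mean(axis=1)

print("SD of individuals:     ", true_sd)
print("SD of sample means:    ", round(sample_means.std(), 2))    # empirically ~ true_sd / sqrt(n)
print("theoretical std. error:", round(true_sd / np.sqrt(n), 2))  # 20 / sqrt(50) ≈ 2.83
```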

Frequentist confidence intervals are generally expressing how big the ordinary range of variation is for your average. For instance, 90% of the time, your sample average will be no farther from the "true" average than it is from the boundaries of your 90% confidence interval; equivalently, 90% of intervals constructed this way will contain the true average. This is relevant for answering questions like, "does this trend look a lot bigger than you'd expect from random chance?" The whole point of looking at large samples is that the errors have a chance to cancel out, leaving very little random variation in the mean relative to the variation in the population. This allows us to be confident that even fairly small differences in the mean are unlikely to be due to random noise.

The error here was taking the statistical properties of the mean and assuming that they applied to the population. In particular, the IOM looked at the dose-response curve for vitamin D and came up with a distribution for the average response to a given vitamin D dosage. Based on their data, if you ran another study like theirs on new data, then 97.5% of the time that study's average participant would get enough vitamin D from 600 IU.
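
A toy simulation shows how big the gap between those two claims can be. The numbers below are invented for illustration, not the IOM's actual dose-response data: the level that 97.5% of study averages exceed sits just below the true mean, while the level that 97.5% of individuals exceed sits far below it.

```python
import numpy as np

rng = np.random.default_rng(1)
mean_response, individual_sd, study_size = 70.0, 25.0, 100  # invented numbers, not the IOM's data

# Spread of individual responses vs. spread of study averages (studies of 100 people each).
individuals = rng.normal(mean_response, individual_sd, size=100_000)
study_means = rng.normal(mean_response, individual_sd, size=(10_000, study_size)).mean(axis=1)

print("level exceeded by 97.5% of study averages:", round(np.percentile(study_means, 2.5), 1))  # just below 70
print("level exceeded by 97.5% of individuals:   ", round(np.percentile(individuals, 2.5), 1))  # around 70 - 1.96*25 ≈ 21
```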

They concluded from this that 97.5% of people get enough vitamin D from 600 IU.

This is not an arcane detail. This is confusing the attributes of a population with the attributes of an average. This is bad. This is real, real bad. In any sane world, this is mathematical statistics 101 stuff. I can imagine that someone who's merely heard the phrase "margin of error" a lot doesn't understand this stuff, but anyone who actually has to use the term should understand it.

Political polling is a simple example. Let's say that a poll shows 48% of Americans voting for the Republican and 52% for the Democrat, with a 5% margin of error. This means that 95% of polls like this one are expected to have an average within 5 percentage points of the true average. It does not mean that 95% of individual Americans have somewhere between a 43% and 53% chance of voting for the Republican. Most of them are almost certainly decided on one candidate or the other. The average does not behave the same as the population. That's how fundamental this error is – it's like saying that every individual voter is undecided because the population as a whole is split.
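
For reference, the usual back-of-the-envelope formula for a simple random sample's margin of error is 1.96·√(p(1−p)/n); it describes how far the poll's average can wander, not anything about individual voters. The numbers below are just a hypothetical poll for illustration:

```python
import math

p, n = 0.52, 385  # hypothetical poll: 52% support from roughly 385 respondents

moe = 1.96 * math.sqrt(p * (1 - p) / n)  # 1.96 standard errors of the poll's average
print(f"margin of error ≈ {moe:.1%}")    # ≈ 5.0%: a statement about the poll, not about any voter
```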

Remember the famous joke about how the average family has two and a half kids? It's a joke because no one actually has two and a half kids. That's how fundamental this error is – it's like saying that there are people who have an extra half child hopping around. And this error caused actual harm:

The public health and clinical implications of the miscalculated RDA for vitamin D are serious. With the current recommendation of 600 IU, bone health objectives and disease and injury prevention targets will not be met. This became apparent in two studies conducted in Canada where, because of the Northern latitude, cutaneous vitamin D synthesis is limited and where diets contribute an estimated 232 IU of vitamin D per day. One study estimated that despite Vitamin D supplementation with 400 IU or more (including dietary intake that is a total intake of 632 IU or more) 10% of participants had values of less than 50 nmol/L. The second study reported serum 25(OH)D levels of less than 50 nmol/L for 15% of participants who reported supplementation with vitamin D. If the RDA had been adequate, these percentages should not have exceeded 2.5%. Herewith these studies show that the current public health target is not being met.

Actual people probably got hurt because of this. Some likely died.

This is also an example of scientific journals serving their intended purpose of pointing out errors, but it should never have gotten this far. This is a "send a coal-burning engine under the control of a drunk engineer into the Taggart tunnel when the ventilation and signals are broken" level of negligence. I think of the people using numbers as the reliable ones, but that's not actually enough – you have to think with them, you have to be trying to get the right answer, you have to understand what the numbers mean.

I can imagine making this mistake in school, when it's low stakes. I can imagine making this mistake on my blog. I can imagine making this mistake at work if I'm far behind on sleep and on a very tight deadline. But if I were setting public health policy? If I were setting the official RDA? I'd try to make sure I was right. And I'd ask the best quantitative thinkers I know to check my numbers.

The article was published in 2014, and as far as I can tell, as of the publication of this blog post, the RDA is unchanged.

(Cross-posted from my personal blog.)

7 comments

Ouch. I've made this mistake myself - confusing the confidence interval of the mean with the prediction interval. It's an easy mistake to make when you have very little data, because then both the prediction interval and the mean's confidence interval will be wide. Much harder to misinterpret when you have decent amounts of data, because the prediction interval typically won't shrink much while the mean's CI will become so narrow you can't misinterpret it as being about the population.
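
A quick way to see this, using the textbook known-variance formulas rather than any particular dataset: the mean's CI shrinks like 1/√n, while the prediction interval for a single new observation barely budges.

```python
import numpy as np

sd = 10.0  # hypothetical population SD, treated as known for simplicity
for n in (5, 50, 5000):
    ci_half_width = 1.96 * sd / np.sqrt(n)           # 95% CI for the mean: shrinks like 1/sqrt(n)
    pi_half_width = 1.96 * sd * np.sqrt(1 + 1 / n)   # 95% prediction interval for one new value: barely shrinks
    print(f"n={n:>4}  mean CI ±{ci_half_width:5.2f}   prediction interval ±{pi_half_width:5.2f}")
```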

Besides this mistake, the original IOM study seems iffy to me. Look at their graph -- how in the world do they draw 95% prediction limits in such a way that 19 out of their 32 sample averages (about 60%!) are outside of their 95% interval?

I've looked into the study, but it doesn't point to a specific location in the IOM's document, which is huge. My tentative explanation is that the studies that fell outside the interval have very small sample sizes, but I cannot say for certain.

This may seem pedantic, but given that this post is on the importance of precision:

"Some likely died."

Should be

"Likely, some died".

Also, I think you should more clearly distinguish between the two means, such as saying "sample average" rather than "your average". Or use x̄ and μ.

The whole concept of a confidence interval is rather problematic, because on the one hand it's one of the most common statistical measures presented to the public, but on the other hand it's one of the most difficult concepts to understand.

What makes the concept of a CI so hard to explain is that pretty much every time the public is presented with it, they are presented with one particular confidence interval and then given the 95%. But the 95% is not a property of that particular confidence interval; it's a property of the process that generated it. The public understands "95% confidence interval" as being an interval that has a 95% chance of containing the true mean, but actually a 95% confidence interval is an interval generated by a process, where the process has a 95% chance of generating a confidence interval that contains the true mean.
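
If it helps, here's a small simulation of that "property of the process" reading (toy normal data, with 1.96 as an approximate 95% multiplier): build an interval from each of many samples and count how often the interval contains the true mean.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean, sd, n, trials = 0.0, 1.0, 30, 10_000

hits = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sd, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)  # approximate 95% half-width for the mean
    hits += sample.mean() - half_width <= true_mean <= sample.mean() + half_width

print(f"fraction of intervals containing the true mean: {hits / trials:.1%}")  # close to 95%
```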

the concept of a CI so hard to explain is that pretty much every time the public is presented with it

Not just the public, but scientists and medical professionals have trouble with it.

People tend to interpret frequentist statistics as if they were their Bayesian equivalents, e.g. they interpret confidence intervals as credible intervals.

I don't get this (and I don't get Benquo's OP either; I don't really know any statistics, only some basic probability theory).

"the process has a 95% chance of generating a confidence interval that contains the true mean". I understand this to mean that if I run the process 100 times, 95 times the resulting CI contains the true mean. Therefore, if I look at random CI amongst those 100 there is a 95% chance that the CI contains the true mean.

[A]ctually a 95% confidence interval is an interval generated by a process, where the process has a 95% chance of generating a confidence interval that contains the true mean.

Is it incorrect for a Bayesian to gloss this as follows?

Given (only) that this CI was generated by process X with input 0.95, this CI has a 95% chance of containing the true mean.

I could imagine a frequentist being uncomfortable with talk of the "chance" that the true mean (a certain fixed number) is between two other fixed numbers. "The true mean either is or is not in the CI. There's no chance about it." But is there a deeper reason why a Bayesian would also object to that formulation?