Eliezer: glad you followed up on Robin's suggestion to do this. The examples in my earlier post on "Useful Bias", then, involve resorting to systematic error to deal with problems arising from processes with unsystematic errors. Similarly, the examples other commenters gave, such as alcoholics forswearing all drinks or people setting their watches five minutes early, would also be cases of using systematic error to avoid problems caused by unsystematic errors.
I think this same basic formula is behind the argument for majoritarianism: the squared error of the crowd's consensus (the average), plus the variance within the crowd, equals the expected squared error for a random person in the crowd. Therefore the crowd consensus view has a lower expected squared error than the average squared error of individuals in the crowd. Hence a random participant will do better to substitute the crowd consensus for his own estimate.
I'm reading a book, "The Difference" by Scott E. Page, which discusses how and when crowds do well; he calls this result the Diversity Prediction Theorem: given a crowd of predictive models, Collective Error = Average Individual Error - Prediction Diversity.
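A quick numeric check of that identity, using made-up predictions and a made-up true value (everything in this sketch is hypothetical, just to show the arithmetic):

```python
# Diversity Prediction Theorem, checked on a toy crowd (hypothetical numbers).
predictions = [190.0, 170.0, 210.0, 180.0]   # four individual predictions
truth = 200.0                                # the quantity being predicted

crowd_mean = sum(predictions) / len(predictions)
collective_error = (crowd_mean - truth) ** 2
avg_individual_error = sum((p - truth) ** 2 for p in predictions) / len(predictions)
prediction_diversity = sum((p - crowd_mean) ** 2 for p in predictions) / len(predictions)

# Collective Error = Average Individual Error - Prediction Diversity
print(collective_error)                               # 156.25
print(avg_individual_error - prediction_diversity)    # 156.25
```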
I think this is called "inconsistency" rather than bias. Inconsistency means that no matter how much data you have, and however you average it, the result still has directional error.
This is an easy mistake to make. You're thinking about one large sample, from which you derive one good estimate.
When the article says
Statistical bias is error you cannot correct by repeating the experiment many times and averaging together the results.
it means getting many samples and, from each of those samples, deriving an estimate in isolation.
You: 1 sample of n data points, n->infinity. 1 estimate.
Yudkowsky: n samples of d data points, n->infinity. n estimates.
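A sketch of the difference between the two regimes, using the maximum-likelihood variance estimator (which divides by n and is therefore biased low) purely as an illustrative choice; the numbers are made up:

```python
import random

random.seed(0)

def var_mle(xs):
    """Maximum-likelihood variance estimate: divides by len(xs), so it is biased low."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# "You": one sample of n data points, n large -> one estimate, close to the truth (1.0).
one_big_sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]
print(var_mle(one_big_sample))            # ~1.0; the 1/n bias is negligible here

# "Yudkowsky": many samples of d = 5 points each -> many estimates, then averaged.
# The average converges to E[estimator] = (d - 1) / d = 0.8, not to the true value 1.0.
estimates = [var_mle([random.gauss(0.0, 1.0) for _ in range(5)]) for _ in range(100_000)]
print(sum(estimates) / len(estimates))    # ~0.8; averaging does not remove the bias
```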
Just to follow up on alex_zag_al's sibling comment: you can have consistent estimators that are biased for any finite sample size but are asymptotically unbiased, i.e., the bias shrinks to zero as the sample size increases without bound.
(As alex_zag_al notes, EY's explanation of bias is correct. It means that in some situations "do an analysis on all the data" is not equivalent to "do the analysis on disjoint subsets of the data and average the results" -- the former may have a smaller bias than the latter.)
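A sketch of that asymptotic behavior with the 1/n variance estimator, a standard textbook example of an estimator that is biased for every finite n (its expectation is (n - 1)/n times the true variance) but consistent:

```python
import random

random.seed(1)

def var_mle(xs):
    # 1/n variance estimate: expectation is (n - 1) / n times the true variance.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Average the estimator over many repetitions to expose its bias at each sample size n.
for n in (2, 5, 20, 100):
    reps = 20_000
    avg_estimate = sum(
        var_mle([random.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
    ) / reps
    print(n, round(avg_estimate, 3))   # ~0.5, ~0.8, ~0.95, ~0.99 -> bias shrinks toward 0
```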
(Part one in a series on "statistical bias", "inductive bias", and "cognitive bias".)
"Bias" as used in the field of statistics refers to directional error in an estimator. Statistical bias is error you cannot correct by repeating the experiment many times and averaging together the results.
The famous bias-variance decomposition states that the expected squared error equals the squared directional error (the bias, squared) plus the expected squared random error (the variance). The law of large numbers says that by repeating the experiment many times and averaging the results, you can reduce the variance, but not the bias.
An experiment has some randomness in it, so if you repeat the experiment many times, you may get slightly different data each time; and if you run a statistical estimator over the data, you may get a slightly different estimate each time. In classical statistics, we regard the true value of the parameter as a constant, and the experimental estimate as a random variable. The bias is the systematic, or average, difference between these two values; the variance is the leftover probabilistic component.
Let's say you have a repeatable experiment intended to estimate, for example, the height of the Emperor of China. In fact, the Emperor's height is 200 cm. Suppose that every single American believes, without variation, that the Emperor's height is 180 cm. Then if you poll a random American and ask "How tall is the Emperor of China?", the answer is always "180 cm", the error is always -20 cm, and the squared error is always 400 (I shall omit the units on squared errors). But now suppose that Americans have normally distributed beliefs about the Emperor's height, with mean belief 180 cm and standard deviation 10 cm. You conduct two independent repetitions of the poll; one American says "190 cm" and the other says "170 cm", with errors of -10 cm and -30 cm respectively, and squared errors of 100 and 900. The average error is -20 cm, as before, but the average squared error is (100 + 900) / 2 = 500. So even though the average (directional) error didn't change as a result of adding noise to the experiment, the average squared error went up.
Although in one case the random perturbation of the answer happened to lead the American in the correct direction - the one who answered 190 cm, which is closer to the true value of 200 cm - the other American was led further from the answer, replying 170 cm. Since these are equal and opposite deviations, the average answer did not change. But since the square grows faster than linearly, the larger error corresponds to a still larger squared error, and the average squared error went up.
Furthermore, the new average squared error of 500 is exactly the square of the directional error (-20 cm) plus the variance (the square of the 10 cm standard deviation): 400 + 100 = 500.
In the long run, the above result is universal and exact: if the true value is a constant X and the estimator is Y, then E[(X - Y)^2] = (X - E[Y])^2 + E[(E[Y] - Y)^2]. Expected squared error = squared bias + variance of the estimator. This is the bias-variance decomposition.
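A quick Monte Carlo check of that identity, using the poll numbers above (true height 200 cm, beliefs normal with mean 180 cm and standard deviation 10 cm); this is just a sketch to confirm the arithmetic:

```python
import random

random.seed(0)
X = 200.0                              # true value: the Emperor's height in cm
MEAN_BELIEF, SD_BELIEF = 180.0, 10.0

# Each "experiment" polls one American; Y is that American's answer.
ys = [random.gauss(MEAN_BELIEF, SD_BELIEF) for _ in range(1_000_000)]

expected_sq_error = sum((X - y) ** 2 for y in ys) / len(ys)
mean_y = sum(ys) / len(ys)
bias_sq = (X - mean_y) ** 2
variance = sum((y - mean_y) ** 2 for y in ys) / len(ys)

# E[(X - Y)^2] = (X - E[Y])^2 + E[(E[Y] - Y)^2]  ->  approximately 500 = 400 + 100
print(round(expected_sq_error), round(bias_sq), round(variance))
```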
If we averaged together the two Americans above, we would get an average estimate of 180 cm, with a squared error of 400, which is less than the average squared error of the two experiments taken individually, but still erroneous.
If the true value is constant X and the estimator is Y, then by averaging many estimates together we converge toward the expected value of Y, E[Y], by the law of large numbers, and if we subtract this from X, we are left with a squared error of (X - E[Y])^2, which is the bias term of the bias-variance decomposition. If your estimator is all over the map and highly sensitive to noise in the experiment, then by repeating the experiment many times you can get the expected value of your estimator, and so you are left with only the systematic error of that estimator, and not the random noise in the estimator that varies from experiment to experiment. That's what the law of large numbers is good for.
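To illustrate with the same poll (a sketch, same hypothetical numbers): the average of n answers has expected squared error 400 + 100/n, so averaging drives the error down to the 400 of the bias term, and no further.

```python
import random

random.seed(0)
X, MEAN_BELIEF, SD_BELIEF = 200.0, 180.0, 10.0

for n in (1, 10, 100, 1_000):
    trials = 2_000
    # Mean squared error of the average of n poll answers, estimated over many trials.
    mse = sum(
        (X - sum(random.gauss(MEAN_BELIEF, SD_BELIEF) for _ in range(n)) / n) ** 2
        for _ in range(trials)
    ) / trials
    print(n, round(mse))   # ~500, ~410, ~401, ~400: the 100/n variance term vanishes, the bias stays
```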