I sometimes wonder just how useful probability and statistics are. There is the theoretical argument that Bayesian probability is the fundamental method of correct reasoning, and that logical reasoning is just the limit as p=0 or 1 (although that never seems to be applied at the meta-level: what is the probability that Bayes' Theorem is true?), but today I want to consider the practice.
Casinos, lotteries, and quantum mechanics: no problem. The information required for deterministic measurement is simply not available, by adversarial design in the first two cases, and by we know not what in the third.
Insurance: by definition, this only works when it's impossible to predict the catastrophes insured against. No-one will offer insurance against a risk that will happen, and no-one will buy it for a risk that won't.
Randomised controlled trials are the gold standard of medical testing; but over on OB Robin Hanson points out from time to time that the marginal dollar of medical spending has little effectiveness. And we don't actually know how a lot of treatments work.
Quality control: test a random sample from your production run and judge the whole batch from the results. Fine -- it may be too expensive to test every widget, or impossible if the test is destructive. But wherever someone is doing statistical quality control of how accurately you're filling jam jars with the weight of jam it says on the label, someone else will be thinking about how to weigh every single one, and how to make the filling process more accurate. (And someone else will be trying to get the labelling regulations amended to let you sell the occasional 15-ounce pound of jam.)
But when you can make real measurements, that's the way to go. Here is a technical illustration.
Prof. Sagredo has assigned a problem to his two students Simplicio and Salviati: "X is difficult to measure accurately. Predict it in some other way."
Simplicio collects some experimental data consisting of a great many pairs (X,Y) and with high confidence finds a correlation of 0.6 between X and Y. So given the value y of Y, his best prediction for the value of X is 0.6y. [Edit: that formula is mistaken. The regression line for Y against X is Y = bcX/a, assuming the means have been normalised to zero, where a and b are the standard deviations of X and Y respectively. For the Y=X+D1 model below, bc/a is equal to 1.]
Salviati instead tries to measure X, and finds a variable Z which is experimentally found to have a good chance of lying close to X. Let us suppose that the standard deviation of Z-X is 10% that of X.
How do these two approaches compare?
A correlation of 0.6 is generally considered pretty high in psychology and social science, especially if it's established with p=0.001 to be above, say, 0.5. So Simplicio is quite pleased with himself.
A measurement whose range of error is 10% of the range of the thing measured is about as bad as it could be and still be called a measurement. (One might argue that any sort of entanglement whatever is a measurement, but one would be wrong.) It's a rubber tape measure. By that standard, Salviati is doing rather badly.
In effect, Simplicio is trying to predict someone's weight from their height, while Salviati is putting them on a (rather poor) weighing machine (and both, presumably, are putting their subjects on a very expensive and accurate weighing machine to obtain their true weights).
So we are comparing a good correlation with a bad measurement. How do they stack up? Let us suppose that the underlying reality is that Y = X + D1 and Z = X + D2, where X, D1, and D2 are normally distributed and uncorrelated (and causally unrelated, which is a stronger condition). I'm choosing the normal distribution because it's easy to calculate exact numbers, but I don't believe the conclusions would be substantially different for other distributions.
For convenience, assume the variables are normalised to all have mean zero, and let X, D1, and D2 have standard deviations 1, d1, and d2 respectively.
Z-X is D2, so d2 = 0.1. The correlation between Z and X is c(X,Z) = cov(X,Z)/(sd(X)sd(Z)) = 1/sqrt(1+d2^2) = 0.995.
The correlation between X and Y is c(X,Y) = 1/sqrt(1+d1^2) = 0.6, so d1 = 1.333.
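As a quick check of these two numbers, here is a minimal Python sketch. It assumes only the model stated above (sd(X) = 1, noise uncorrelated with X); the function names are made up for this illustration.

```python
import math

def corr_from_noise_sd(d):
    """Correlation between X and X + D when sd(X) = 1, sd(D) = d,
    and X and D are uncorrelated: cov = 1, sd(X + D) = sqrt(1 + d^2)."""
    return 1.0 / math.sqrt(1.0 + d * d)

def noise_sd_from_corr(c):
    """Inverse of corr_from_noise_sd: the noise sd that yields correlation c."""
    return math.sqrt(1.0 / (c * c) - 1.0)

d2 = 0.1                          # Salviati: sd(Z - X) is 10% of sd(X)
c_xz = corr_from_noise_sd(d2)     # correlation between X and Z
d1 = noise_sd_from_corr(0.6)      # Simplicio: noise sd implied by c(X,Y) = 0.6

print(f"c(X,Z) = {c_xz:.3f}, d1 = {d1:.3f}")
```

The same two helpers work for any other pair of noise level and correlation under this additive model.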
We immediately see something suspicious here. Even a terrible measurement yields a sky-high correlation. Or put the other way round, if you're bothering to measure correlations, your data are rubbish. Even this "good" correlation gives a signal to noise ratio of less than 1. But let us proceed to calculate the mutual informations. How much do Y and Z tell you about X, separately or together?
For the bivariate normal distribution, the mutual information between variables A and B with correlation c is lg(I), where lg is the binary logarithm and I = sd(A)/sd(A|B). (The denominator here -- the standard deviation of A conditional on the value of B -- happens to be independent of the particular value of B for this distribution.) This works out to 1/sqrt(1-c^2). So the mutual information is -lg(sqrt(1-c^2)).
| | corr. | mut. inf. (bits) |
|---|---|---|
| Simplicio | 0.6 | 0.3219 |
| Salviati | 0.995 | 3.3291 |
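The two mutual-information figures can be reproduced with a few lines of Python. This is only a sketch of the bivariate-normal formula -lg(sqrt(1-c^2)); the function name is invented for the example.

```python
import math

def mutual_information_bits(c):
    """Mutual information, in bits, between two jointly normal variables
    with correlation c: -lg(sqrt(1 - c^2)) = -0.5 * lg(1 - c^2)."""
    return -0.5 * math.log2(1.0 - c * c)

c_simplicio = 0.6
c_salviati = 1.0 / math.sqrt(1.0 + 0.1 ** 2)   # unrounded value of 0.995

mi_simplicio = mutual_information_bits(c_simplicio)
mi_salviati = mutual_information_bits(c_salviati)

print(f"Simplicio: {mi_simplicio:.4f} bits")
print(f"Salviati:  {mi_salviati:.4f} bits")
```

Note that the Salviati figure uses the unrounded correlation 1/sqrt(1.01); plugging in the rounded 0.995 gives a slightly different answer, because the formula is very sensitive to c near 1.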
What can you do with one third of a bit? If Simplicio tries to predict just the sign of X from the sign of Y, he will be right only 70% of the time (i.e. cos^-1(-c(X,Y))/π). Salviati will be right 96.8% of the time. Salviati's estimate will even be in the right decile 89% of the time, while on that task Simplicio can hardly do better than chance. So even a good correlation is useless as a measurement.
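The sign-prediction figures can be checked both from the formula and by brute force. The sketch below assumes the additive normal model used throughout; the function names, sample size, and seed are arbitrary choices for the illustration.

```python
import math
import random

def sign_agreement(c):
    """P(sign(A) == sign(B)) for zero-mean jointly normal A and B
    with correlation c, i.e. arccos(-c) / pi."""
    return math.acos(-c) / math.pi

def simulate_sign_agreement(noise_sd, n=100_000, seed=0):
    """Monte Carlo estimate of the same probability for X ~ N(0, 1)
    and an observation X + D, with D ~ N(0, noise_sd^2) independent of X."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        obs = x + rng.gauss(0.0, noise_sd)
        hits += (x > 0) == (obs > 0)
    return hits / n

p_simplicio = sign_agreement(0.6)
p_salviati = sign_agreement(1.0 / math.sqrt(1.01))

print(f"Simplicio: formula {p_simplicio:.3f}, simulated {simulate_sign_agreement(4 / 3):.3f}")
print(f"Salviati:  formula {p_salviati:.3f}, simulated {simulate_sign_agreement(0.1):.3f}")
```

The simulated values should agree with the formula to within ordinary sampling error.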
Simplicio and Salviati show their results to Prof. Sagredo. Simplicio can't figure out how Salviati did so much better without taking measurements on thousands of samples. Salviati seemed to just think about the problem and come up with a contraption out of nowhere that did the job, without doing a single statistical test. "But at least," says Simplicio, "you can't throw away my 0.3219, it all adds up!" Sagredo points out that it literally does not add up. The information gained about X from Y and Z together is not 0.3219+3.3291 = 3.6510 bits. The correct result is found from the standard deviation of X conditional on both Y and Z, which is sqrt(1/(1 + 1/d1^2 + 1/d2^2)). The information gained is then lg(sqrt(1 + 1/d1^2 + 1/d2^2)) = 0.5*lg(101.5625) = 3.3331. The extra information over knowing just Z is only 0.0040 = 1/250 of a bit, because nearly all of Simplicio's information is already included in Salviati's.
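Sagredo's arithmetic can be sketched in Python as follows. The helper name is hypothetical, and the calculation assumes, as in the model above, that sd(X) = 1 and that the noise terms are independent of X and of each other, so that precisions add.

```python
import math

def info_bits_from_noise_sds(*noise_sds):
    """Bits of information about X ~ N(0, 1) gained from independent noisy
    observations X + D_i with sd(D_i) = noise_sds[i].
    The conditional variance of X is 1 / (1 + sum(1 / d_i^2)),
    so the information is 0.5 * lg(1 + sum(1 / d_i^2))."""
    precision = 1.0 + sum(1.0 / (d * d) for d in noise_sds)
    return 0.5 * math.log2(precision)

d1 = 4.0 / 3.0   # Simplicio's noise sd (correlation 0.6)
d2 = 0.1         # Salviati's noise sd

info_y = info_bits_from_noise_sds(d1)
info_z = info_bits_from_noise_sds(d2)
info_both = info_bits_from_noise_sds(d1, d2)

print(f"Y alone:   {info_y:.4f} bits")
print(f"Z alone:   {info_z:.4f} bits")
print(f"Y and Z:   {info_both:.4f} bits")
print(f"naive sum: {info_y + info_z:.4f} bits")
print(f"gain of Y over Z alone: {info_both - info_z:.4f} bits")
```

With a single observation the function reduces to the earlier -lg(sqrt(1-c^2)) formula, since 1 + 1/d^2 = 1/(1-c^2) under this model.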
Sagredo tells Simplicio to go away and come up with some real data.
Good. The experiment is, however, very good evidence for the hypothesis that R.S. Marken is a crank, and explains the quote from his farewell speech that didn't make sense to me before:
The basic problem is that, generically, if your model uses more free parameters than data points, then it is mathematically trivial that you can get an exact fit to your data set, regardless of what the data are: thus you've provided exactly zero Bayesian evidence that your model fits this particular phenomenon.
(This is precisely the case in the paper you pointed me to. Marken asserts that his model successfully predicts the overall and relative error rates with high precision; but if these rates had been replaced with arbitrary numbers before being fed to him, he would have come up with different experimental values of the parameters, and claimed that his model exactly predicted the new error rates! This is known around here as an example of a fake explanation.)
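The general point about free parameters can be illustrated with a toy sketch. This is not Marken's model: it uses a polynomial with as many coefficients as data points (Lagrange interpolation), and the numbers are invented purely for the illustration.

```python
from fractions import Fraction

def lagrange_fit(xs, ys):
    """Return a polynomial function p with p(xs[i]) == ys[i] for every i.
    The polynomial has len(xs) free coefficients, one per data point,
    so an exact fit exists whatever the ys are (given distinct xs)."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]

    def p(x):
        x = Fraction(x)
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return p

xs = [1, 2, 3, 4]                 # hypothetical experimental conditions
observed = [3, 1, 4, 1]           # hypothetical "real" results
arbitrary = [-7, 22, 0, 5]        # numbers swapped in at random

for ys in (observed, arbitrary):
    p = lagrange_fit(xs, ys)
    print(ys, "->", [int(p(x)) for x in xs])
```

Both data sets are reproduced exactly, so the quality of the fit says nothing about which one came from the phenomenon.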
The fact that Marken was repeatedly told this, interpreted it to mean that others were jealous of his precision, and continued to produce experimental "results" of the same sort along with bold claims of their predictive power, makes him a crank.
Anyhow...
The point I keep stressing is that, if cognitive-domain PCT is precise enough to do treatment with, then it can't be bereft of experimental consequences; and no matter how appealing certain aspects of it might be intuitively, a lack of experimental support after 35 years looks pretty damning. If every cognitive circuit is so complicated that you can't make an observable prediction (about an individual in varying circumstances, or different people in the same circumstances, etc) without assuming more parameters than data points... then PCT doesn't actually teach you anything about cognition, any more than the physicists who ascribed fire and respiration to phlogiston actually learned anything from their theory.
You've pointed me to one experiment, which turned out to be the work of a crank; I've accordingly lowered the probability that PCT is valid in the cognitive domain, not because the existence of a crank proves anything against their hypothesis, but because that was the most salient experimental result that you could point to!
I'm still quite able to revise my probability estimate upwards if presented with a legitimate experimental result, but at the moment PCT is down in the "don't waste your time and risk your rationality" bin of fringe theories.
I can be a pretty cranky fellow but I think there might be better evidence of that than the model fitting effort you refer to. The "experiment" that you find to be poor evidence for PCT comes from a paper published in the journal Ergonomics that describes a control theory model that can be used as a framework for understanding the causes of error in skilled ...