A discussion I had in the reddit comments on that Slate post made me invent this fake argument:
A: People who drink water inevitably end up dead. Therefore drinking water causes death.
B: No, that is correlation, not causation.
C: No, it is not correlation. To calculate correlation you divide the covariance of the two variables by the variance of each of the variables. In this case there is no variance in either variable, so you're dividing by zero, so correlation is not even defined.
I think it's an improvement to go from saying "there is obviously something wrong with A's argument" to actually being able to point out the divide-by-zero in the equation.
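To make the divide-by-zero concrete, here's a quick numpy sketch (illustrative only, with a made-up five-person sample where everyone drinks water and everyone dies):

```python
import numpy as np

# Everyone in the sample drinks water (1) and ends up dead (1):
drinks_water = np.array([1, 1, 1, 1, 1])
ends_up_dead = np.array([1, 1, 1, 1, 1])

# Pearson correlation = cov(X, Y) / (std(X) * std(Y)).
# Both standard deviations are zero, so this is 0/0: undefined.
with np.errstate(invalid="ignore"):
    r = np.corrcoef(drinks_water, ends_up_dead)[0, 1]

print(r)  # nan -- correlation is simply not defined here
```

numpy signals the undefined case by returning NaN rather than raising, which is exactly C's point: there is no correlation to speak of, defined or otherwise.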
One would only say "there is no variance in either variable", as opposed to "there are no people who avoid drinking water, nor people who don't end up dead, to compare to", in order to score points at the expense of the audience's vocabulary.
Let's not encourage this.
Disagree. Our target audience - humans - rarely if ever thinks of 'correlation' in terms of its mathematical definition and I suspect would be put off by an attempt to do so.
This is entirely true - as a mere human, my interest plummeted at "covariance", and I'd still like to think I'm SOMEWHAT equipped to handle correlation/causation. Just not numerically. So, as a roughly average human, I say your suspicions are correct.
Pet peeve:
The saying should be: "statistical dependence does not imply causality." Correlation is a particular measure of a linear relationship. A lack of correlation can happily coexist with statistical dependence if the variables are related in a complicated non-linear way. This "correlation" business needlessly emphasizes linear models (prevalent in statistics at the time of Pearson et al.). See also: http://en.wikipedia.org/wiki/Correlation_and_dependence
Also, this is true: "lack of statistical dependence does not imply lack of causality" (due to effect cancellation).
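The first point is easy to demonstrate with a toy example (a sketch, not a real dataset): take Y fully determined by X via a symmetric non-linear relationship, and the Pearson correlation comes out exactly zero.

```python
import numpy as np

# X is symmetric around zero; Y is completely determined by X,
# yet the linear (Pearson) correlation is exactly zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0 -- uncorrelated, but maximally dependent
```

So "A and B are uncorrelated" is much weaker than "A and B are statistically independent", which is why the dependence phrasing is the right one.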
I don't have much time to think on this right now, but perhaps an Anti-Godwin's law could be useful? Something along the lines of "just because your opponent made a simplistic analogy to Nazism, it does not follow that their overall argument is wrong".
Content-related suggestion. Comics are great tools for people too lazy/busy to read long articles, so here's XKCD's take
At the risk of being the village idiot today: how do people get this wrong? Point to an example or three?
I go to some lengths to avoid innumerate discussion online, but it still happens in real life with reasonable frequency. The flavours I seem to encounter most:
1) an all-purpose attempt to refute any statistical finding, even if said finding is not showing correlation, or proposing causation
2) dogged adherence to the belief that the direction of a causal relationship is completely impossible to establish
3) the most perverse, that establishing an association between two variables is evidence against a causal relationship
Key to Memetic Value:
Make sure the landing page is simple, to the point, with no scrolling necessary to get the entire message in a matter of only a few moments, and without clutter. Perhaps include a simple, clear diagram -- but that's not necessary, as long as you have a simple, brief textual explanation that dominates the page. Include a small number of obvious links to other pages on your site for additional information if you want to go into greater detail. If you want to include links to off-site resources, they should probably be collected on a s...
Awesome idea.
As far as I understand it, if variables A and B are correlated, then we can be pretty damn sure that either:
a) A causes B,
b) B causes A, or
c) there's a third variable affecting both A and B.
(Am I right about this or is this an oversimplification?)
A good way to grab attention might be to deny a commonly believed fact in a way that promises intelligent elaboration. So the website could start with a huge 'Correlation does not imply causation' banner and then go like 'well, actually, it kind of does'. And then explain how going from not knowing anything a...
if variables A and B are correlated, then we can be pretty damn sure that either: a) A causes B b) B causes A c) there's a third variable affecting both A and B.
There is in fact a d): A and not-B can both cause some condition C that defines our sample.
Example: Sexy people are more likely to be hired as actors. Good actors are also more likely to be hired as actors. So if we look at "people who are actors," then we'll get people who are sexy but can't really act, people who are sexy and can act, and people who can act and aren't really sexy. If sexiness and acting ability are independent, these three groups will be roughly equal in size.
Thus if we look at actors in general in our simple model, 2/3 of them will be sexy and 2/3 of them will be good actors. But of the ones who are sexy, only 1/2 will be good actors. So being sexy is correlated with being a bad actor! Not because sexiness rots your brain (a), or because acting well makes you ugly (b), and not because acting classes cause both good acting and ugliness, or diet pills cause both beauty and bad acting (c). Instead, it's just because how we picked actors made sexiness and acting ability "compete for the same niche."
Similar examples would be sports and academics in college, different sorts of skills in people promoted in the workplace, UI design versus functionality in popular programs, and so on and so on.
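This kind of selection effect is easy to reproduce in simulation. Here's a rough sketch (invented hiring rule: you get hired if you're sexy or a good actor, with both traits independent coin flips):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Sexiness and acting ability are independent in the population...
sexy = rng.random(n) < 0.5
good_actor = rng.random(n) < 0.5

# ...but you get hired if you have at least one of the two traits.
hired = sexy | good_actor

# In the whole population: essentially zero correlation.
r_all = np.corrcoef(sexy, good_actor)[0, 1]

# Among hired actors only: a spurious negative correlation appears.
r_hired = np.corrcoef(sexy[hired], good_actor[hired])[0, 1]

print(round(r_all, 3), round(r_hired, 3))
```

With these numbers the correlation among the hired group works out to about -0.5, even though nothing causal connects the two traits; conditioning on the shared effect manufactures the association.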
If you are familiar with d-separation (http://en.wikipedia.org/wiki/D-separation#d-separation), we have:
if A is dependent on B, and there's some unobserved C involved, then:
(1) A <- C -> B, or
(2) A -> C -> B, or
(3) A <- C <- B
(this is Reichenbach's common cause principle: http://plato.stanford.edu/entries/physics-Rpcc/)
or
(4) A -> C <- B
if C or its effect attains a particular (not necessarily recorded) value. Statisticians know this as Berkson's bias, which is a form of selection bias. In AI, this is known as "explaining away." Manfred's excellent example falls into category (4), with C observed to equal "hired as actor."
Beware: d-separation applies to causal graphical models, and Bayesian networks (which are statistical and not causal models). The meaning of arrows is different in these two kinds of models. This is actually a fairly subtle issue.
There's also e): A causes B within our sample, but A does not cause B generally, or in the sense that we care about.
For example, suppose a teacher gives out a gold star whenever a pupil does a good piece of work, and this causes the pupil to work harder. Suppose also that this effect is greatest on mediocre pupils and least on the best pupils - but the best pupils get most of the gold stars, naturally.
Now suppose an educational researcher observes the class, and notes the correlation between receiving a gold star, and increased effort. This is genuine causation. He then concludes that the teacher should give out more gold stars, regardless of whether the pupil does a good piece of work or not, and focus the stars on mediocre pupils. This change made, the gold stars no longer cause increased effort. The causation disappears! Changing the way the teacher hands out the gold stars changes the relationship between gold stars and effort. So although there was genuine causation in the original sample, there is no general causation, or causation in the sense we care about; we can't treat the gold stars as an exogenous variable.
See also the Lucas Critique.
Lots of smart people argue about dumb things for irrational reasons on 4chan and reddit...
That might be a good place to dump links to succinct and engaging explanations.
(An idea I had while responding to this quotes thread)
"Correlation does not imply causation" is bandied around inexpertly and inappropriately all over the internet. Lots of us hate this.
But get this: the phrase, and the most obvious follow-up phrases like "what does imply causation?" are not high-competition search terms. Up until about an hour ago, the domain name correlationdoesnotimplycausation.com was not taken. I have just bought it.
There is a correlation-does-not-imply-causation shaped space on the internet, and it's ours for the taking. I would like to fill this space with a small collection of relevant educational resources explaining what is meant by the term, why it's important, why it's often used inappropriately, and the circumstances under which one may legitimately infer causation.
At the moment the Wikipedia page is trying to do this, but it's not really optimised for the task. It also doesn't carry the undercurrent of "no, seriously, lots of smart people get this wrong; let's make sure you're not one of them", and I think it should.
The purpose of this post is two-fold:
Firstly, it lets me say "hey dudes, I've just had this idea. Does anyone have any suggestions (pragmatic/technical, content-related, pointing out why it's a terrible idea, etc.), or alternatively, would anyone like to help?"
Secondly, it raises the question of what other corners of the internet are ripe for the planting of sanity waterline-raising resources. Are there any other similar concepts that people commonly get wrong, but which don't have much of a guiding explanatory web presence? Could we put together a simple web platform for carrying out this task in lots of different places? The LW readership seems ideally placed to collectively do this sort of work.