In a recent comment, I suggested that correlations between seemingly unrelated periodic time series share a common cause: time. However, the math disagrees... and suggests a surprising alternative.

Imagine that we took measurements from a thermometer on my window and a ridiculously large tuning fork over several years. The first set of data is temperature T over time t, so it looks like a list of data points [(t0, T0), (t1, T1), ...]. The second set of data is mechanical strain e in the tuning fork over time, so it looks like a list of data points [(t0, e0), (t1, e1), ...]. We line up the temperature and strain data according to time, yielding [(T0, e0), (T1, e1), ...] and find a significant correlation between the two, since they happen to have similar periodicity.

Recalling Judea Pearl, we suggest that there is almost certainly some causal relationship between the temperature outside the window and the strain in the ridiculously large tuning fork. Common sense suggests that neither causes the other, so perhaps they have some common cause? The only other variable in the problem is time, so perhaps time is the common cause. This sort of makes sense, since changes in time intuitively seem to cause the changes in temperature and strain.

Let's check that intuition with some math. First, imagine that we ignore the time data. Now we just have a bunch of temperature data points [T0, T1, ...] and strain data points [e0, e1, ...]. In fact, in order to truly ignore time data, we cannot even order the points according to time! But that means that we no longer have any way to line up the points T0 with e0, T1 with e1, etc. Without any way to match up temperature points to corresponding strain points, the temperature and strain data are randomly ordered, and the correlation disappears!
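To make this concrete, here is a small simulation sketch (the annual period, amplitudes, and noise levels are my own illustrative assumptions, not measurements):

    # Sketch: two periodic series paired up by time are correlated;
    # destroy the correspondence (shuffle one series) and the sample
    # correlation collapses.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0.0, 3 * 365.0)                # daily samples over ~3 years
    T = 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, t.size)  # temperature
    e = 5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, t.size)   # strain

    print(np.corrcoef(T, e)[0, 1])                    # aligned by time: near 1
    print(np.corrcoef(T, rng.permutation(e))[0, 1])   # pairing lost: near 0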

We have just performed a d-separation. When time t was known (i.e., controlled for), the variables T and e were correlated. But when t was unknown, the variables were uncorrelated. Now, let's wave our hands a little and equate correlation with dependence. If time were a common cause of temperature and strain, then we should see that T and e are correlated without knowledge of time, but the correlation disappears when controlling for time. However, we see exactly the opposite structure: controlling for t induces the correlation. This pattern is called a "collider", and it implies that time is a common effect of temperature and strain. Rather than time causing the oscillations in our time series, the oscillations in our time series cause time.

Whoa. Now that the math has given us the answer, let's step back and try to make sense of it. Imagine that everything in the universe stopped moving for some time, and then went back to moving exactly as before. How could we measure how much time passed while the universe was stopped? We couldn't. For all practical purposes, if nothing changes, then time has stopped. Time, then, is an effect of motion, not vice versa. This is an old idea from philosophy/physics (I think I originally read it in one of Stephen Hawking's books). We've just rederived it.

But we may still wonder: what caused the correlation between temperature and strain? A common effect cannot cause a correlation, so where did it come from? The answer is that there was never any correlation between temperature and strain to begin with. Given just the temperature and strain data, with no information about time (e.g. no ordering or correspondence between points), there was no correlation. The correlation was induced by controlling for time. So the correlation is only logical; there is no physical cause relating the two, at least within our model.

RCCP (http://plato.stanford.edu/entries/physics-Rpcc/) isn't actually true (quantum mechanics violates it, for example).


Correlation is just a measure of a linear relationship. RCCP is (should be) stated in terms of dependence, not correlation.


If I think a correlation implies some sort of causal relationship somewhere, and I can bring things into a correlation by means of an affine transformation (as I can with any two lines), I have a problem. This is the problem: http://www.smbc-comics.com/?id=3129#comic


Outside view of insight: if there isn't a big inference gap, either someone has thought of it before, or it's stupid.

Actually, there is a theorem which states that two variables X and Y are independent if and only if f(X) and g(Y) are uncorrelated for all (bounded, measurable) functions f, g. That means that you cannot bring things into correlation by any independent transformations whatsoever exactly when the variables are independent. Specifically regarding affine transformations, you would have to add some multiple of X to Y in order to get a correlation, which is obviously silly.
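A quick numerical illustration of the dependence-without-correlation direction of that theorem (the choice of f and g here is mine, for illustration):

    # Sketch: X and Y = X^2 are dependent but uncorrelated; the
    # transformation f(x) = x^2, g(y) = y exposes the dependence.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, 100_000)
    y = x ** 2

    print(np.corrcoef(x, y)[0, 1])        # near 0: no linear relationship
    print(np.corrcoef(x ** 2, y)[0, 1])   # exactly 1: f(X) equals Y here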

Thanks for the link to RCCP, I hadn't seen the history of d-sep before. You should definitely check out the causality section of Highly Advanced Epistemology 101 for Beginners and possibly Judea Pearl's book Causality, which contain a more up-to-date discussion and address many of your concerns much better than I could.

I like your outside view of insight. That was part of the reason I pointed out that this is not a new insight, and has been found many times by other people before.

Edit after reading pragmatist's comment: Knowing your background, I'll add some technical meat.

First, regarding correlation versus dependence, consider any functions f(T) and g(e). The exact same argument made in the post still applies: without time, there is no ordering of the points, so we cannot establish any correlation. Since there is no correlation for any functions f, g, the variables are independent. The argument could be made that the correlation is undefined rather than zero, but if we take a Bayesian approach then we should probably be summing over all permutations (since there is no reason to prefer any particular permutation). Intuitively, that seems like it ought to go to zero given enough data, but I'm not sure if it's identically zero for smaller data sets.
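For what it's worth, the permutation average can be checked exhaustively on a toy data set (a sketch; the numbers are arbitrary), and it comes out zero up to rounding even at this small size:

    # Sketch: average the sample correlation over all n! pairings of a
    # tiny data set. The average is zero up to floating-point error, even
    # though individual pairings can be strongly correlated.
    import itertools
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 1.0, 5.0, 3.0, 4.0])

    corrs = [np.corrcoef(x, np.take(y, p))[0, 1]
             for p in itertools.permutations(range(len(y)))]
    print(np.mean(corrs))   # ~0: permutation-averaged correlation vanishes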

Regarding quantum mechanics, RCCP works under the MWI, which everyone seems to love around here (the world-branch becomes a hidden common cause). But setting that argument aside, we can happily restrict ourselves to non-chaotic macroscopic situations.

You should definitely check out the causality section of Highly Advanced Epistemology 101 for Beginners and possibly Judea Pearl's book Causality, which contain a more up-to-date discussion and address many of your concerns much better than I could.

Just a heads up, in case Ilya considers it indelicate to mention this himself: He's an expert in this area, and definitely familiar with Judea Pearl's work (Pearl was his Ph.D. supervisor).

Really? I'm impressed. I guess he assumed that I was unfamiliar with it (not an unreasonable assumption).


Suppose instead of measuring temperature and strain at various times and lining them up later, you attach a thermostrainometer to the tuning fork and take a bunch of measurements at various points (a thermostrainometer, of course, is a device which measures temperature and strain and outputs a pair (T, e)).

You've forgotten that time exists, so these measurements all get shoved in a box and shuffled, and they come out in a random order. But you notice a curious thing about these measurements - you can separate each one into two parts, and those parts seem to be correlated! How did that happen?

What the thermometer and tuning fork really have in common in this example is a person (or something) looking at a clock and then recording T_n and e_n. So the example already is a slightly more complex thermostrainometer. It's interesting how much we take accurate measurements of time for granted; in the good old days astronomers had to invent precise definitions and mechanisms for measuring time in order to correlate the motions of the heavenly bodies with pseudo-periodic observations.

We don't actually write down (t_n, e_n) and (t_n, T_n); we write down (clock-step_x, e_x), etc. Even if we're using an atomic clock, we're really just counting the number of times a sine wave generator has cycled since we started it, not some nebulous substance called "time".

I was hoping someone would bring this up! This is why I was careful to specify that the temperature was taken outside my window, and the strain was measured in a tuning fork in some unspecified location. In that situation, time really is the only correspondence between the points.

But your example brings up a much more general (and much more interesting) problem of identifying points. I'll illustrate with another example. Suppose we measure a bunch of physiological variables in mice. We get a bunch of tuples mapping mice to the relevant variables, and we find lots of correlations. But then we lose our mouse IDs! Suddenly we have no idea which mouse each measurement came from. As before, everything gets scrambled and correlations disappear. We conclude that the measurements cause the mouse, or more accurately, the measurements cause the ID of the mouse.

In the mouse example, notice that giving the mice actual names or ID numbers wasn't really necessary. We could just identify each mouse by its tuple of measurements. The identity of the mouse is mathematically just a mapping to match up the data points from different sensors.

Going back to your aptly-named thermostrainometer, we see a similar situation. Time is no longer the variable used to identify data points with each other. Instead, T and e points are associated through both space and time, and the whole mapping is conveniently handled inside the sensor itself and given to us in a convenient tuple structure. But the sensor itself still needs to associate the T and e values somehow, which is where space and time come in.

First, imagine that we ignore the time data. Now we just have a bunch of temperature data points [T0, T1, ...] and strain data points [e0, e1, ...]. In fact, in order to truly ignore time data, we cannot even order the points according to time! But that means that we no longer have any way to line up the points T0 with e0, T1 with e1, etc. Without any way to match up temperature points to corresponding strain points, the temperature and strain data are randomly ordered, and the correlation disappears!

That is not how d-separation works. If you control for time, the temperature and the strain on the tuning fork should be uncorrelated. We would expect the temperature to roughly follow a 24-hour cycle, plus some random noise. If the tuning fork has a 24-hour period, then we would expect the same thing to be true of the strain. But it would be very strange if the random noise in the temperature were correlated with the random noise in the mechanical strain (e.g. if, once we already knew it was 3 pm, reading the thermometer on the window tells you something about the strain on the tuning fork). That could plausibly happen if the material that the tuning fork is made of has different properties at different temperatures, changing the strain, but I think you meant to imply that neither one of these causes the other, so I'll ignore that for now.

In a Bayes net, not conditioning on a variable doesn't mean that you stop lining up the data into samples with a value for each variable and declare that everything is uncorrelated with everything else; it just means that you keep the data lined up in samples and look for correlations without paying attention to the variable you are not conditioning on. In this case, the temperature and mechanical strain should be very highly correlated if you do not condition on time, because time will still be there as a lurking variable.
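To make this concrete, here is an illustrative sketch (the model and numbers are assumptions for illustration, not from the post): regress each series on the shared time-driven cycle and correlate the residuals.

    # Sketch: unconditionally T and e are highly correlated (time is a
    # lurking common cause); conditioning on time, i.e. looking at the
    # residuals after removing what time explains, kills the correlation.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0.0, 1000.0)
    cycle = np.sin(2 * np.pi * t / 24)            # shared 24-hour driver
    T = 10 * cycle + rng.normal(0, 1, t.size)
    e = 5 * cycle + rng.normal(0, 1, t.size)

    print(np.corrcoef(T, e)[0, 1])                # high: not conditioning on time

    X = np.column_stack([np.ones_like(t), cycle]) # regress out the time cycle
    rT = T - X @ np.linalg.lstsq(X, T, rcond=None)[0]
    re = e - X @ np.linalg.lstsq(X, e, rcond=None)[0]
    print(np.corrcoef(rT, re)[0, 1])              # near 0: independent given time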

Yup, you're right. That's the right way to handle it, and it yields time as the common cause of temperature and strain, as we'd expect.

Now that I'm knee-deep in it, I do think this crazy concept of separating sets of values of variables from the mappings between the sets has something to it. It isn't necessary for the example in the main post, but I think the example I gave with mice in one of the other comments still applies. The mapping between points is legitimately a variable unto itself, so it seems like it should be possible to handle it like other variables. It might even be useful to do so, since the mapping is nonparametric.

Anyway, thanks for giving a proper analysis of the problem.


I'm not sure what you see in it. For the mouse thing, it seems to suggest that the correlation between the variables causes the identity of mouse, which is, like the time thing, exactly the wrong way round. You say it "isn't necessary for the example in the main post" but it's more than unnecessary - it gives an answer that's completely backwards.

That's exactly why it's interesting.

There are undoubtedly multiple, specific, identifiable fallacies at work in this post. I hope someone makes the effort to identify them...

Name one.

Well, if he could he wouldn't be wishing for someone else (Ilya?) to do so.

You're quite right, and I should not have been so confrontational. That said, the criticism was not constructive.

I definitely appreciate any expert feedback, or even any concrete criticism.

The whole argument is like a daydream. Let's imagine that we have time series for two things that are correlated for no reason except that we assume this. Then let's throw away the time ordering information so we have two unordered sets of data. Then let's say that contrasting the uncorrelatability of the unordered sets with the original correlation of time series is a d-separation.

I hardly have even basic knowledge of these techniques of causal inference, but even I can see that you are doing crazy stuff. Think of causal inference analysis as a machine where either it produces an output, indicating a causal connection, or it does nothing, indicating no causal connection. The part where you throw away the time ordering is like smashing the machine; and then you treat the unresponsiveness of the resulting pile of parts, as if it were a null response from an intact machine.

There's also something dodgy going on with your need to assume two time series that are correlated. If you start with two time series which, by hypothesis, are not correlated, then even this flawed argument isn't possible - you can't even get your alleged separation, because there's no correlation, either with or without time. Your formal demonstration that time is caused by motion seems to require consideration of two time series that are correlated by coincidence, which would be a weird and stringent requirement for an argument purportedly demonstrating something about the nature of time in general.

The best diagnosis of the argument I can presently make is that it came about as follows: You were already sympathetic, or potentially sympathetic, to the idea that time is caused by motion. Then you were sort of musing about the formalism of causal analysis in a fashion increasingly detached from the usual context of its use, and eventually ran across a "does not compute" condition, but you interpreted this implosion of the formalism as a message from the formalism, and built it up into a formal demonstration of the metaphysical proposition that time is caused by motion.

If I take a step even further back, I can see this as another example of metaphysics returning in a disorderly way, through the gaps in a formalism which has replaced metaphysics with mathematics. In pre-scientific philosophy, people reasoned using natural language about time, space, causality, reality, truth, meaning, and so forth. In the 20th century, there was an attempt to reduce everything to measurement and computation. Reasoning was replaced with the symbol systems of formal logic, objective physical reality was replaced with observables in quantum mechanics, the study of mind was replaced with the study of behavior and of brains - there might be half a dozen core examples.

In statistics they abandoned causality for correlation. Pearl's mini-revolution was to reintroduce the concept of causality, but he only got to do it because he found a formal criterion for it. His theory has therefore strengthened, in a small way, the illusion of successful reduction - people can apply Pearl's procedure and perform causal analysis, without worrying about why anything causes anything else, or about what causality really is.

But people have a tendency to rediscover the issues and problems that the old informal philosophy tackled, and then they try to address them with the intellectual resources that their culture provides. Thus the wavefunction tools of Copenhagen positivism get turned into ontological realities by Everett, and the universe of mathematical concepts becomes the ultimate reality in Tegmark's neoplatonic theory... and odd manipulations of causal analysis formalism become Wentworth's argument for a particular metaphysics of time.

I definitely don't want to say that every such reinvention of metaphysics from within a formal discourse is mystical or pathological. The interaction between the modern formalisms and the old issues is a very complicated and diverse process. But in general, to me the process looks healthiest when the formalism is grounded in some old-fashioned informal intuitions - where people can explain the concepts of their formalized physics, logic, etc., in a way that grounds in very simple experiences, thoughts, and understandings. And the problem that modern thought has created is that it denies the validity, possibility, or existence of many of these informal intuitions.

Modern people have these elaborate rule-based systems available to them, systems for representing or thinking about certain aspects of reality in a very sophisticated way, but they are cut off from the history of informal thought which motivated the formalisms. As a result, when they try to think about reality at a primordial level, they have to improvise as if there had never been such a thing as systematic metaphysical thought, while at the same time having available to them these modern intellectual power tools, which bear in their design traces of the abstract issues that motivated their construction. The result is a cargo cult of formalism in which the constructs of modern rigor are stacked up in imitation of philosophical reasoning.

I can talk in generalities like this for a long time, it seems. But I'm not yet at a stage where I can go into the details and say, your formalism assumes this, which is why you can't use it to do that. Which is why I hoped someone else would work out that part.

Actually, I wasn't thinking about metaphysics at all. I was trying to demonstrate rigorously that time is the common cause of the observed correlation (which AlexMennen did correctly in another comment). While trying to do this, I realized that even after removing the values of time from the samples, there was still information about time embedded in the ordering of points, so I was trying to not use that information. The rest just fell out of the analysis. I wasn't sympathetic to any particular metaphysics, and I wasn't thinking about making metaphysical statements at all. I was thinking about how to incorporate this bizarre case into algorithms for learning causal networks.

After thinking about it, I realized that time is a really bad example. We're not really interested in the numerical values of time, we're interested in the association of points in the same sample. In this case the association happens to contain time information, which is why it's so damn confusing. A clearer example would be a population of mice, where we sample each variable once from each mouse. Then the association is given by mapping each point to the mouse it came from. In that case, it's much more clear that the association contains information in itself separate from all the numerical values.

Reading this has left me confused, like I just read a textbook too quickly or something.

Am I having trouble understanding this because it's saying something complicated, or because it's saying something that doesn't quite make sense? It seems more like the former than the latter, but I can't be sure.

Assuming you're comfortable with d-separation and causality, I suspect the confusion is mostly my poor writing, since I did not spend any time editing this or adding diagrams (which it desperately needs). Anyway, if you draw out the causal diagram, we have the three named variables (t, T, and e). t is the only source of ordering and correspondence between T and e, from which everything else follows. It may take some thinking to believe that the information conveyed by t is exactly the ordering of the data. I'm still trying to come up with a less hand-wavy way to explain that part.

If you want to brush up on d-separation and causality, check out the causality section of Highly Advanced Epistemology 101 for Beginners.

Ah, yeah, it was mostly my not remembering the technical stuff...

In fact, in order to truly ignore time data, we cannot even order the points according to time! But that means that we no longer have any way to line up the points T0 with e0, T1 with e1, etc.

What? This makes no sense.

I guess you haven't seen this stated explicitly, but the framework of causal networks makes an iid assumption. The idea is that the causal network represents some process that occurs a lot, and we can watch it occur until we get a reasonably good understanding of the joint distribution of variables. Part of this is that it is the same process occurring, so there is no time dependence built into the framework.

For some purposes, we can model time by simply including it as an observed variable, which you do in this post. However, the different measurements of each variable are associated because they come from the same sample of the (iid) causal process, whether or not we are conditioning on time. The way you are trying to condition on time isn't correct, and the correlation does exist in both cases. (Really, we care about dependence rather than correlation, but it doesn't make a difference here.)

I do think that this is a useful general direction of analysis. If the question is meaningful at all, then the answer is probably that given by Armok_GoB in the original thread, but it would be useful to clarify what exactly the question means. There is probably a lot of work to be done before we really understand such things, but I would advise you to better understand the ideas behind causal networks before trying to contribute.

Causal networks do not make an iid assumption. Consider one of the simplest examples, in which we examine experimental data. Some of the variables are chosen by the experimenter. They can be chosen any way the experimenter pleases, so long as they vary. The process is the same, but that does not imply iid observations. It just means that time dependence must enter through the variables. As you say, it is not built in to the framework.

The problem is to reduce the phrase "the different measurements of each variable are associated because they come from the same sample of the causal process." What is a sample? How do we know two numbers (or other strings) came from the same sample? Since the association contains information separate from the values themselves, how can we incorporate that information into the framework explicitly? How can we handle uncertainty in the association apart from uncertainty in the values of the variables?

Causal networks do not make an iid assumption.

Yeah, I guess that's way too strong; there are a lot of alternative assumptions also that justify using them.

What is a sample? How do we know two numbers (or other strings) came from the same sample?

I think we just have to assume this problem solved. Whenever we use causal networks in practice, we know what a sample is. You can try to weaken this and see if you still get anything useful, but this is very different from 'conditioning on time' as you present in the post.

Since the association contains information separate from the values themselves, how can we incorporate that information into the framework explicitly?

Bayes' theorem? If we have a strong enough prior and enough information to reverse-engineer the association reasonably well, then we might be able to learn something. If you're running a clinical trial and you recorded which drugs were given out, but not to which patients, then you need other information, such as a prior about which side-effects they cause and measurements of side-effects that are associated with specific patients. Otherwise you just don't have the data necessary to construct the model.

Exactly! We want to incorporate the association information using Bayes' theorem. If you have zero information about the mapping, then your knowledge is invariant under permutations of the data sets (e.g., swapping T0 with T1). That implies that your prior over the associations is uniform over the possible permutations (note that a permutation uniquely specifies an association and vice versa). So, when calculating the correlation, you have to average over all permutations, and the correlation turns out to be identically zero for all possible data. No association means no correlation.
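A short sketch of why the permutation average is exactly zero (notation mine): with data x_1, ..., x_n and y_1, ..., y_n and a uniformly random permutation pi, each y_pi(i) is uniform over the y's, so

    E_pi[ (1/n) sum_i x_i * y_pi(i) ] = (1/n) sum_i x_i * ybar = xbar * ybar,

which means the permutation-averaged sample covariance, (1/n) sum_i x_i y_pi(i) - xbar * ybar, vanishes identically for any data whatsoever. The sample standard deviations are permutation-invariant, so the averaged correlation is exactly zero as well, for data sets of any size.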

So in the zero information case, we get this weird behavior that isn't what we expect. If the zero information case doesn't work, then we can't expect to get correct answers with only partial information about the associations. We can expect similar strangeness when trying to deal with partial information based on priors about side-effects caused by our hypothetical drug.

If we don't have enough information to construct the model, then our analysis should yield inconclusive results, not weird or backward results. So the problem is to figure out the right way to handle association information.

Yes, but this is a completely different matter than your original post. Obviously this is how we should handle this weird state of information that you're constructing, but it doesn't have the causal interpretation you give it. You are doing something, but it isn't causal analysis. Also, in the scenario you describe, you have the association information, so you should be using it.

From your recent comments, it sounds like you're trying to talk about approximating causal inference when you don't have completely reliable information about how the data points for individual variables are sorted into samples, which I guess could make an interesting problem, though this intention was not apparent in your original post. Obviously if two variables are correlated, the observed correlation will be stronger the better information you have about how the data points are sorted into samples, though this is not a d-separation and should not be treated as a d-separation. If you have perfect information about how the variables t and T line up, and perfect information about how the variables t and e line up, then pretending that you don't have any information about how the variables T and e line up doesn't seem like an operation that it makes any sense to do. If you really don't have any information about how T and e line up, then this will remain true no matter what you condition on. If you want to talk about what changes when you gain information about how T and e line up, that's an entirely different thing.

Yeah, when I wrote the post I hadn't thought through it enough to put it in clear terms. I hadn't even realized that there was an open question in here. Bouncing back and forth with commenters helped a lot. (Thank you!)

At this point it's pretty clear that we can't treat the associations as variables in the same way that we usually treat variables in a causal net (as I did in the original post). As you say, we can't treat it as d-separation in the usual way. But we still need some way to integrate this information when trying to learn causal nets. So I guess the interesting question is how. I may write another discussion post with a better example and a clean formulation of the question.

May I suggest some standard books on learning causal structure from data? (Causation, Prediction and Search, for instance).

Structure learning is a huge area, lots of low (and not so low) hanging fruit has been picked.


The other thing to keep in mind about learning structure from data is that it (often) relies on faithfulness. That is, typically (A d-separated from B given C) in a graph implies (A independent of B given C) in the distribution Markov relative to the graph. But the converse is not necessarily true. If the converse is true, faithfulness holds. Lots of distributions out there are not faithful. That is, it may be by pure chance that the price of beans in China and the traffic patterns in LA are perfectly correlated. This does not allow us to conclude anything causally interesting.
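A minimal numerical illustration of an unfaithful distribution (the setup and coefficients are my own, chosen by hand so that the two causal paths cancel):

    # Sketch: A -> B directly, and A -> C -> B through a mediator, with
    # coefficients tuned so direct (+1) and indirect (2 * -0.5 = -1)
    # effects cancel. A causes B, yet corr(A, B) ~ 0: independence in the
    # data does not license the conclusion "no causal connection".
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    A = rng.normal(0, 1, n)
    C = 2.0 * A + rng.normal(0, 1, n)
    B = 1.0 * A - 0.5 * C + rng.normal(0, 1, n)

    print(np.corrcoef(A, B)[0, 1])   # ~0 despite A being a cause of B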

Know any other good books on the subject? I've had trouble finding good books in the area. I'd especially appreciate something Bayesian. I've never even seen anyone do the math for Bayesian structure learning with multivariate normals.

Why do you care if the method is Bayesian or not?

Greg Cooper's paper is one classic reference on Bayesian methods:

http://www.inf.ufrgs.br/~alvares/CMP259DCBD/Bayes.pdf

Imagine that we took measurements from a thermometer on my window and a ridiculously large tuning fork over several years. The first set of data is temperature T over time t, so it looks like a list of data points [(t0, T0), (t1, T1), ...]. The second set of data is mechanical strain e in the tuning fork over time, so it looks like a list of data points [(t0, e0), (t1, e1), ...]. We line up the temperature and strain data according to time, yielding [(T0, e0), (T1, e1), ...] and find a significant correlation between the two, since they happen to have similar periodicity.

Note that unless their frequencies are exactly the same, over a long enough time period (long compared to the reciprocal of the difference between the frequencies) the correlation will be zero.
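A quick check of that claim (a sketch; the frequencies are arbitrary):

    # Sketch: two sinusoids with slightly different frequencies. Over a
    # window short relative to 1/|f1 - f2| they look strongly correlated;
    # over a long window the phases drift through all alignments and the
    # correlation washes out.
    import numpy as np

    t = np.linspace(0.0, 1000.0, 200_000)
    f1, f2 = 1.00, 1.01                  # beat period 1/|f1 - f2| = 100
    a = np.sin(2 * np.pi * f1 * t)
    b = np.sin(2 * np.pi * f2 * t)

    short = t < 5.0                      # much shorter than the beat period
    print(np.corrcoef(a[short], b[short])[0, 1])  # near 1
    print(np.corrcoef(a, b)[0, 1])                # near 0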

(EDIT: OTOH, any two things that vary monotonically with time will correlate. 1 2 3 (I think I've seen a few more.))

As for the rest of the post, see the Timeless Physics post by EY and the references therein.