Cyan comments on Why (and why not) Bayesian Updating? - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (26)
You're heading towards redefining correlation to mean causal connection.
When people actually do causal analysis (example, example, example) they perform specific calculations to detect various relationships among the variables. There are many different calculations they may do, which is the point of the first of those references, but we are not talking -- at least, I'm not -- about uncomputable Kolmogorov-based concepts (and even by that standard, the first data file, were it to use a QRNG, contains no mutual information). The moral that I was drawing was a practical one: certain very simple relationships between physical variables can generate time series containing no mutual information detectable by any of these methods. This suggests a substantial limitation of their practical applicability.
Specifics, please. Given the actual dynamical process generating those data (which is that B is the derivative of A, and A is a smoothly varying random variable), show me a mathematical definition of the mutual information between A and B, and a method of calculating it.
Mutual information is the difference between the marginal entropy and the conditional entropy (eq 4 of this): I(X,Y) = H(X) - H(X|Y)
Suppose X is a deterministic function of Y (e.g., Y is a function sampled from a stochastic process and X is its derivative). Then P(X|Y) is a degenerate distribution and the conditional entropy H(X|Y) is 0. Hence Y is maximally informative about X.
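As a rough illustration of that definition, here is a minimal sketch for the discrete case only. The toy distributions and the helper names (entropy, mutual_information) are assumptions made up for this example, and continuous variables such as a function and its derivative bring subtleties that it does not address.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X,Y) = H(X) - H(X|Y) for a joint dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    # H(X|Y) = -sum over (x, y) of p(x, y) * log2(p(x, y) / p(y))
    h_x_given_y = -sum(
        p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0
    )
    return entropy(px) - h_x_given_y

# Toy case 1: Y uniform on {0, 1, 2, 3} and X = Y mod 2, a deterministic function of Y.
deterministic = {(y % 2, y): 0.25 for y in range(4)}
# Toy case 2: X and Y are independent fair bits.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

mi_det = mutual_information(deterministic)  # H(X|Y) = 0, so I(X,Y) = H(X)
mi_ind = mutual_information(independent)    # H(X|Y) = H(X), so I(X,Y) = 0
print(mi_det, mi_ind)
```

In the deterministic case the conditional entropy vanishes and the mutual information equals the full entropy of X, which is the sense of "maximally informative" used above.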
I think the words Richard used in his question denoted the mutual information between the functions A and B, but I think he meant to ask about the mutual information between two time series datasets sampled from A and B over the same interval.
And my point was that this is an irrelevant comparison. When you look at the data sets, you want to know if they are mutually informative (if learning one can tell you about the other). A linear statistical correlation -- which Kennaway showed is absent -- is one way that the datasets can be mutually informative, but it is not the only way.
If you know the ordered, timewise development of each variable, you have extra information to use. If you discard this knowledge of the time ordering, and are left with just simultaneous pairs (pairs of the form [A(t0), B(t0)]) then yes, as Kennaway points out, you're hosed. So?
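The contrast between simultaneous pairs and time-ordered data can be sketched numerically. This is not Kennaway's data: the signal A below is a hypothetical smooth function (a sum of sinusoids with made-up frequencies and seeded random phases), B is its exact derivative, and pearson is a hand-written helper.

```python
import math
import random

# Hypothetical smooth signal: fixed frequencies, seeded random phases.
FREQS = [0.5, 0.9, 1.3, 1.7, 2.1]
rng = random.Random(0)
PHASES = [rng.uniform(0.0, 2.0 * math.pi) for _ in FREQS]

def A(t):
    return sum(math.sin(w * t + p) for w, p in zip(FREQS, PHASES))

def B(t):
    """Exact time derivative of A."""
    return sum(w * math.cos(w * t + p) for w, p in zip(FREQS, PHASES))

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

dt = 0.05
times = [i * dt for i in range(20000)]
a = [A(t) for t in times]
b = [B(t) for t in times]

# Simultaneous pairs [A(t), B(t)] with the time ordering ignored.
corr_pairs = pearson(a, b)

# Using the time ordering: a central difference of neighbouring A samples estimates B.
b_est = [(a[i + 1] - a[i - 1]) / (2 * dt) for i in range(1, len(a) - 1)]
corr_ordered = pearson(b_est, b[1:-1])

print(f"simultaneous pairs: {corr_pairs:.4f}")
print(f"central difference: {corr_ordered:.4f}")
```

Under these assumptions the simultaneous pairs should show a correlation close to zero, while the estimate that uses neighbouring samples should track B closely when dt is small.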
One could ask both questions, but as Cyan points out, if you know the function A of this example exactly, then you also know B exactly. What do you know about B, though, when you know A only approximately, for example, by sampling a time series? As the sample time increases beyond the autocorrelation time of A, the amount of information you get about B converges to zero, in the sense that, given all of both series up to A(t) and B(t-1), the distribution of B(t) is almost identical to its unconditional distribution.
I'm sure there is a general technical definition, BTW, even though I haven't seen it. This is not a rhetorical question.
My whole argument rests on a weaker reed than I first appreciated, because the definition of mutual information I linked is for univariate random variables. When I searched for a definition of mutual information for stochastic processes, all I could really find was various people writing that it was a generalization of mutual information for random variables in "the natural way". But the point you bring up is actually a step in the direction of a stronger argument, not a weaker one. Sampling the function to get a time series makes a vector-valued random variable out of a stochastic process, and numerical differentiation on that random vector is still deterministic. My argument then follows from the definition of multivariate mutual information.
This is not correct. Given the vector of all values of A sampled at intervals dt, the derivative of that vector -- that is, the time series for B -- is not determined by the vector itself, only by the complete trajectory of A. The longer dt is, the less the vector tells you about B.
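A rough way to see the dependence on dt is to check how well a naive central difference of A, taken at spacing dt, tracks the true derivative. The signal is again a hypothetical sum of sinusoids with made-up frequencies, and the sketch tests only this one simple estimator, so it says nothing about what the best possible reconstruction from the samples could achieve.

```python
import math
import random

# Hypothetical smooth signal: fixed frequencies, seeded random phases.
FREQS = [0.5, 0.9, 1.3, 1.7, 2.1]
rng = random.Random(0)
PHASES = [rng.uniform(0.0, 2.0 * math.pi) for _ in FREQS]

def A(t):
    return sum(math.sin(w * t + p) for w, p in zip(FREQS, PHASES))

def B(t):
    """Exact time derivative of A."""
    return sum(w * math.cos(w * t + p) for w, p in zip(FREQS, PHASES))

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def central_difference_quality(dt, n=20000, step=0.05):
    """Correlation between (A(t+dt) - A(t-dt)) / (2 dt) and B(t) over many t."""
    times = [i * step for i in range(n)]
    est = [(A(t + dt) - A(t - dt)) / (2 * dt) for t in times]
    true = [B(t) for t in times]
    return pearson(est, true)

quality = {dt: central_difference_quality(dt) for dt in (0.05, 0.5, 1.5, 3.0)}
for dt, r in quality.items():
    print(f"dt = {dt}: correlation with true B = {r:.3f}")
```

For this particular signal the estimate is nearly perfect at small dt and degrades badly once dt is comparable to the time scale on which A varies.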
True. I was also assuming that