Eliezer_Yudkowsky comments on Why (and why not) Bayesian Updating? - Less Wrong

17 Post author: Wei_Dai 16 November 2009 09:27PM


Comments (26)


Comment author: Eliezer_Yudkowsky 16 November 2009 09:43:59PM 16 points [-]

We could use a decision procedure where our beliefs about something can only get weaker, and never any stronger, no matter what evidence we see, and that would be equivalent. Since that seems to be computationally cheaper (by avoiding the division operation), why do our beliefs not actually work like that?

1) Because over a lifetime of this, we would rapidly fall off the precision that can be directly represented by analogue neurons. This takes floating-point math with large exponents.

2) Because it is convenient to cache some mental quantities around the standard probability - e.g., having emotional strength go as the absolute probability avoids the need to do lots of comparisons each time.

3) Because evolution isn't that clever, of course!
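To make point 1 concrete, here is a minimal sketch of division-free updating. The two coin hypotheses and their likelihoods are made-up numbers, not anything from the post. Multiplying weights by likelihoods preserves the ratios between hypotheses, but the raw weights shrink until they underflow, unless you carry a large exponent range (done here by working in log space).

```python
import math

def update_unnormalized(weights, likelihoods):
    """Multiply each weight by P(evidence | hypothesis); never divide."""
    return [w * l for w, l in zip(weights, likelihoods)]

def normalize(weights):
    """The division step that ordinary Bayesian updating performs."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical hypotheses: a fair coin vs. a coin with P(heads) = 0.25.
likelihood_heads = [0.5, 0.25]

# After 10 heads the ratios are intact, so one normalization recovers the posterior.
weights = [0.5, 0.5]
for _ in range(10):
    weights = update_unnormalized(weights, likelihood_heads)
posterior = normalize(weights)

# After 2000 heads the raw weights have underflowed and the ratio is lost.
raw = [0.5, 0.5]
for _ in range(2000):
    raw = update_unnormalized(raw, likelihood_heads)

# Log-space weights keep the information, at the cost of wider-range arithmetic.
log_weights = [math.log(0.5), math.log(0.5)]
for _ in range(2000):
    log_weights = [lw + math.log(l) for lw, l in zip(log_weights, likelihood_heads)]
log_odds = log_weights[0] - log_weights[1]
```

Here `posterior[0]` comes out at 1024/1025, `raw` ends up as all zeros, and `log_odds` is still 2000 * ln 2.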

Comment author: Wei_Dai 17 November 2009 01:11:29AM *  12 points [-]

4) With Bayesian updating, you only have to update the beliefs that are correlated with the evidence you observe. Beliefs that are independent of the evidence can stay constant. Trying to "save" on the division means you have to update just about every belief upon every new observation, which ends up being much more costly.
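A small sketch of point 4, with made-up probabilities. If belief A is independent of evidence E, dividing by P(E) hands back the number you already had stored, so nothing needs to be rewritten. Without the division, the stored value for A has to be overwritten with the smaller joint mass.

```python
# Hypothetical numbers: belief A is independent of evidence E.
p_a = 0.3
p_e = 0.2
p_a_and_e = p_a * p_e  # independence: P(A and E) = P(A) * P(E)

# Bayesian updating: P(A | E) = P(A and E) / P(E), which equals the stored P(A).
posterior_a = p_a_and_e / p_e

# Division-free scheme: the stored value for A becomes P(A and E),
# so even this independent belief must be touched on every observation.
unnormalized_a = p_a_and_e
```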

Oh well, it was fun trying to imagine being a mind whose beliefs only get weaker with time. :)

Comment author: Tyrrell_McAllister 17 November 2009 01:50:33AM *  1 point [-]

With Bayesian updating, you only have to update the beliefs that are correlated with the evidence you observe.

What does "correlated" mean when talking about alternative kinds of updating?

ETA: And how would you know whether the belief and the evidence are correlated without performing the updating to check?

Comment author: Wei_Dai 17 November 2009 10:37:47AM 1 point [-]

A and B are correlated if P(A ∩ B) != P(A) * P(B).

The idea is that you'd represent the prior using a data structure which allows you to easily determine which beliefs are correlated with a given piece of evidence. I'm not an expert here, but I think this is what Bayesian networks are all about.
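As a sketch of that definition only (not of a Bayesian network), the check below reads P(A), P(B) and P(A ∩ B) off an explicit joint table. The function name `is_correlated`, the tolerance, and both example tables are assumptions made for illustration.

```python
def is_correlated(joint, tol=1e-12):
    """joint[(a, b)] = P(A=a, B=b) for booleans a, b.

    Returns True when P(A and B) differs from P(A) * P(B).
    """
    p_a = joint[(True, True)] + joint[(True, False)]
    p_b = joint[(True, True)] + joint[(False, True)]
    return abs(joint[(True, True)] - p_a * p_b) > tol

# Hypothetical table where A and B are independent: P(A) = 0.5, P(B) = 0.4.
independent = {(True, True): 0.2, (True, False): 0.3,
               (False, True): 0.2, (False, False): 0.3}

# Hypothetical table where A and B tend to agree.
dependent = {(True, True): 0.4, (True, False): 0.1,
             (False, True): 0.1, (False, False): 0.4}
```

An explicit joint table grows exponentially with the number of beliefs, which is presumably why a structured representation is wanted in practice.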

Comment author: SilasBarta 17 November 2009 06:18:23PM 2 points [-]

A and B are correlated if P(A ∩ B) != P(A) * P(B).

Careful, though. That's the definition of when there's mutual information, and the term "correlated" can also be used to mean a "linear statistical correlation", which is not the same thing.

And before you roll your eyes, note that this entire LW article is based on equivocating between the two meanings! (See my comment.)
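A standard toy case where the two meanings come apart, sketched under the assumption that X is uniform on {-1, 0, 1} and Y = X². The covariance is zero, so there is no linear correlation, yet the mutual information is positive.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) in bits for joint = {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def covariance(joint):
    """E[XY] - E[X]E[Y] for joint = {(x, y): probability}."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey

# Assumed example: X uniform on {-1, 0, 1}, Y = X**2.
joint = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

cov = covariance(joint)
mi = mutual_information(joint)
```

Here `cov` is 0 while `mi` is about 0.918 bits, which is the full entropy of Y.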

Comment author: RichardKennaway 19 November 2009 11:34:48PM *  0 points [-]

No. The data in the scatter-plot in that article contains no mutual information between the variables A and B, not merely zero product-moment correlation. I linked there to the data that are plotted; anyone is welcome to have a go at finding mutual information in them.

I challenge anyone to analyse these data and demonstrate substantial mutual information between A and B. If the data are insufficient for your favorite method of analysis, I can generate arbitrarily large quantities of them, and if I were using a quantum RNG instead of a PRNG, there would be absolutely no way to determine any connection between the two variables.

Despite that, there is one. It only shows up when the process from which these data are taken is sampled on a sufficiently short timescale, as in the other data file I linked to in that post.

Comment author: RobinZ 19 November 2009 11:46:38PM 1 point [-]

Correct me if I'm wrong, but would the actual measure of the connection between A and B be more accurately summarized as K(A + B) < K(A) + K(B), then?

Comment author: SilasBarta 20 November 2009 04:06:02PM *  0 points [-]

I believe that's an equivalent way to express "H(X) - H(X|Y) > 0" and "P(A ∩ B) != P(A) * P(B)". Or at least, any one of the three can be derived from any of the others.

Note that the Kullback-Leibler divergence (a generalization of entropy) between X and Y is equivalent to the number of extra bits required to code data sampled from X when your compression algorithm is optimized for Y, which shows how these all relate.
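The "extra bits" reading can be checked numerically: the KL divergence equals the cross-entropy (average code length when coding X with a code built for Y) minus the entropy of X. The two distributions below are made-up examples, and the sketch assumes both are given as aligned lists of probabilities.

```python
import math

def entropy(p):
    """Optimal average code length in bits for data drawn from p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length in bits when data from p is coded with a code optimized for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three symbols.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

extra_bits = cross_entropy(p, q) - entropy(p)
```

For these numbers the entropy is 1.5 bits, the cross-entropy is 1.75 bits, and both `extra_bits` and `kl_divergence(p, q)` are 0.25 bits.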

Comment author: SilasBarta 19 November 2009 11:40:44PM *  1 point [-]

If you separate out the variables into simultaneous pairs, then yes, you've destroyed the mutual information.

But if someone is allowed to look at the timewise development of each variable, they would see the mutual information, which necessarily results from one causing the other! If A causes B, then, by knowing A, you require less data to describe B (than if you did not know or could not reference A). That's the very definition of mutual information.

You can't just say that because the simultaneous pairs are uncorrelated, there is no mutual information between the variables. You showed as much when you later demonstrated that the simultaneous pairs between a function and its derivative are uncorrelated. But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)
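A deterministic stand-in for the function-and-derivative case, assuming A(t) = sin(t) and B(t) = cos(t) sampled over one full period (this is not Kennaway's stochastic process or his data). The simultaneous pairs have essentially zero linear correlation, even though each value of A pins B down to one of two values.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Assumed example: A(t) = sin(t) and its derivative B(t) = cos(t), one full period.
n = 1000
ts = [2 * math.pi * k / n for k in range(n)]
a = [math.sin(t) for t in ts]
b = [math.cos(t) for t in ts]

r = pearson(a, b)

# Every pair satisfies A**2 + B**2 = 1, so knowing A leaves only a sign ambiguity in B.
max_identity_error = max(abs(x * x + y * y - 1.0) for x, y in zip(a, b))
```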

Comment author: RichardKennaway 20 November 2009 12:02:38AM *  2 points [-]

You're heading towards redefining correlation to mean causal connection.

When people actually do causal analysis (example, example, example) they perform specific calculations to detect various relationships among the variables. There are many different calculations they may do, which is the point of the first of those references, but we are not talking -- at least, I'm not -- about uncomputable Kolmogorov-based concepts (and even by that standard, the first data file, were it to use a QRNG, contains no mutual information). The moral that I was drawing was a practical one: certain very simple relationships between physical variables can generate time series containing no mutual information detectable by any of these methods. This suggests a substantial limitation of their practical applicability.

But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)

Specifics, please. Given the actual dynamical process generating those data (which is that B is the derivative of A, and A is a smoothly varying random variable), show me a mathematical definition of the mutual information between A and B, and a method of calculating it.

Comment author: SilasBarta 20 November 2009 03:56:49PM *  1 point [-]

You're heading towards redefining correlation to mean causal connection.

Nope. I'm pointing out that "correlated" can mean "there exists a linear statistical correlation" or "there exists mutual information" -- but whichever you use, you need to be consistent. And at no point did I say it meant causal connection -- I just noted that that's one way mutual information can develop.

The moral that I was drawing was a practical one: certain very simple relationships between physical variables can generate time series containing no mutual information detectable by any of these methods. This suggests a substantial limitation of their practical applicability.

What you showed is that there is more than one way for two variables to be mutually informative, and if you limit yourself to a linear statistical regression on the simultaneous pairs, you might not find the mutual information. So what? If you know more than just the unordered simultaneous pairs, use that knowledge!

But who denies that learning a function tells you something about its derivative? (which would mean there's mutual information between the two...)

Specifics, please.

Sure. Let's use your point about derivatives. I tell you sin(x) = 4/5. Have I told you something about cos(x)? (And no, it doesn't matter that the cosine can have two values; you've still learned something.)

I tell you f(x) = sin(x) + cos(x). Have I told you something about f'(x)?
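Spelled out, the arithmetic behind both questions is just the Pythagorean identity and term-by-term differentiation:

```latex
\cos^2 x = 1 - \sin^2 x = 1 - \tfrac{16}{25} = \tfrac{9}{25}
\quad\Longrightarrow\quad \cos x = \pm\tfrac{3}{5},
\qquad
f(x) = \sin x + \cos x
\quad\Longrightarrow\quad f'(x) = \cos x - \sin x .
```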

Comment author: RichardKennaway 30 November 2009 01:20:16PM 0 points [-]

Sure. Let's use your point about derivatives. I tell you sin(x) = 4/5. Have I told you something about cos(x)?

Yes.

I tell you f(x) = sin(x) + cos(x). Have I told you something about f'(x)?

Yes.

But in real experiments, you're not given the underlying function, only observations of some of its values.

So, I tell you a time series for an unknown function f.

What have I told you about f'? What further information would you need to make a numerical calculation of the amount of information you now have about f'?

In the data file I originally linked to, there is not merely no linear relationship, but virtually no relationship whatsoever, discoverable by any means whatever, between the two columns, which tabulate f and f' for a certain stochastic function f. Mutual information, even in Kolmogorov heaven, is not present.

Comment author: Cyan 20 November 2009 12:29:14AM 1 point [-]

Mutual information is the difference of marginal and conditional entropy (eq 4 of this): I(X,Y) = H(X) - H(X|Y)

Suppose X is a deterministic function of Y (e.g., Y is a function sampled from a stochastic process and X is its derivative). Then P(X|Y) is a degenerate distribution and the conditional entropy H(X|Y) is 0. Hence Y is maximally informative about X.
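A discrete sketch of that argument, assuming Y is uniform on {0, 1, 2, 3} and X = Y mod 2 (a made-up stand-in for "X is a deterministic function of Y"). The conditional entropy is computed through the chain rule H(X|Y) = H(X,Y) - H(Y).

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Assumed example: Y uniform on {0, 1, 2, 3}, X = Y mod 2.
joint = {(y % 2, y): 0.25 for y in range(4)}  # keys are (x, y)

# Marginal distributions of X and Y.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

h_x = entropy(px)
h_x_given_y = entropy(joint) - entropy(py)  # chain rule
mi = h_x - h_x_given_y                      # I(X,Y) = H(X) - H(X|Y)
```

Here H(X|Y) is 0 and the mutual information equals H(X), which is 1 bit: the largest value it can take for this X.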

Comment author: Steve_Rayhawk 20 November 2009 02:20:57AM 1 point [-]

I think the words Richard used in his question denoted the mutual information between the functions A and B, but I think he meant to ask about the mutual information between two time series datasets sampled from A and B over the same interval.

Comment author: Cyan 20 November 2009 03:36:35AM 1 point [-]

If A causes B...

No need to bring up causality. It's enough that knowledge of A specifies B too.

Comment author: SilasBarta 20 November 2009 04:57:26PM 1 point [-]

Yes, that's correct. I only mentioned causality to make my comment relevant to the context Kennaway brought up.

Comment author: MichaelBishop 17 November 2009 07:41:00PM 0 points [-]

Considering all the different combinations of things you might condition on, the task does not sound trivial.