Previously: Book Club introductory post - First update and Chapter 1 summary

Discussion on Chapter 1 has wound down, so we move on to Chapter 2 (I have updated the previous post with a summary of Chapter 1, with links to the discussion as appropriate). But first, a few announcements.

How to participate

This is both for people who have previously registered interest, as well as newcomers. This spreadsheet is our best attempt at coordinating 80+ Less Wrong readers interested in participating in "earnest study of the great literature in our area of interest".

If you are still participating, please let the group know - all you have to do is fill in the "Active (Chapter)" column. Write in an "X" if you are checked out, or the number of the chapter you are currently reading. This will let us measure attrition, as well as adapt the pace if necessary. If you would like to join, please add yourself to the spreadsheet. If you would like to participate in live chat about the material, please indicate your time zone and preferred meeting time. As always, your feedback on the process itself is more than welcome.

Refer to the previous post for more details on how to participate and meeting schedules.

Chapter 2: The Quantitative Rules

In this chapter Jaynes carefully introduces and justifies the elementary laws of plausibility, from which all later results are derived.

(Disclosure: I wasn't able to follow all the math in this chapter but I didn't let it deter me; the applications in later chapters are more accessible. We'll take things slow, and draw on such expertise as has been offered by more advanced members of the group. At worst this chapter can be enjoyed on a purely literary basis.)

Sections: The Product Rule - The Sum Rule. Exercises: 2.1 and 2.2

Chapter 2 works out the consequences of the qualitative desiderata introduced at the end of Chapter 1.

The first step is to consider the evaluation of the plausibility (AB|C) from the possibly relevant inputs (B|C), (A|C), (A|BC) and (B|AC). Considerations of symmetry and the desideratum of consistency lead to a functional equation known as the "associativity equation", F(F(x,y),z)=F(x,F(y,z)), characterizing the function F such that (AB|C)=F[(B|C),(A|BC)]. The derivation that follows requires some calculus: differentiating and then integrating back yields the form of the product rule:

w(AB|C)=w(A|BC)w(B|C)=w(B|AC)w(A|C)
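
As a quick sanity check of my own (not from the book), the ordinary product F(x,y) = xy satisfies the associativity equation, which is what consistency with (AB)C = A(BC) demands of any candidate F:

# Numerically verify F(F(x,y),z) = F(x,F(y,z)) for F(x,y) = x*y (illustration only)
import random

def F(x, y):
    return x * y

for _ in range(1000):
    x, y, z = (random.random() for _ in range(3))
    assert abs(F(F(x, y), z) - F(x, F(y, z))) < 1e-12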

Having obtained this, the next step is to establish how (A|B) is related to (not-A|B). The functional equation in this case is

x*S(S(y)/x)=y*S(S(x)/y)

and the derivation, after some more calculus, leads to S(x)=(1-x^m)^(1/m). But the value of m is irrelevant, and so we end up with the two following rules:

p(AB|C)=p(A|BC)p(B|C)=p(B|AC)p(A|C)

p(not-A|B)+p(A|B)=1
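
As another sketch of my own (not part of Jaynes' derivation), here is a numeric check that S(x)=(1-x^m)^(1/m) satisfies the functional equation on its domain for any m > 0, and that regraduating plausibilities by p = w^m turns the negation rule into p + (1-p) = 1:

# Check x*S(S(y)/x) = y*S(S(x)/y) for S(x) = (1 - x**m)**(1/m), plus the regraduation p = w**m
import random

def S(x, m):
    return (1.0 - x**m) ** (1.0 / m)

for _ in range(1000):
    m = random.uniform(0.5, 3.0)
    x, y = random.random(), random.random()
    if x**m + y**m < 1.0 + 1e-6:
        continue                                     # outside the domain where the equation applies
    lhs = x * S(S(y, m) / x, m)
    rhs = y * S(S(x, m) / y, m)
    assert abs(lhs - rhs) < 1e-6
    w = random.random()
    assert abs(S(w, m)**m - (1.0 - w**m)) < 1e-12    # i.e. p(not-A|B) = 1 - p(A|B)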

The exercises provide a first opportunity to explore how these two rules yield a great many other ways of assessing probabilities of more complex propositions, for instance p(C|A+B), based on the elementary probabilities.
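
For a flavour of what that looks like (my own illustration, not the book's worked solution): the product rule gives p(C|A+B) = p(C(A+B))/p(A+B), and the generalized sum rule expands both numerator and denominator into probabilities of conjunctions. A brute-force check against a random joint distribution over A, B, C confirms the resulting identity:

# Verify p(C|A+B) = [p(AC) + p(BC) - p(ABC)] / [p(A) + p(B) - p(AB)]
# against a random joint distribution (background information left implicit)
import itertools, random

weights = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

def p(event):
    # probability that the predicate event(a, b, c) holds
    return sum(pr for (a, b, c), pr in joint.items() if event(a, b, c))

lhs = p(lambda a, b, c: c and (a or b)) / p(lambda a, b, c: a or b)
rhs = ((p(lambda a, b, c: a and c) + p(lambda a, b, c: b and c) - p(lambda a, b, c: a and b and c))
       / (p(lambda a, b, c: a) + p(lambda a, b, c: b) - p(lambda a, b, c: a and b)))
assert abs(lhs - rhs) < 1e-12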

Sections: Qualitative Properties - Numerical Values - Notation and Finite Sets Policy - Comments. Exercises: 2.3

Jaynes next turns back to the relation between "plausible reasoning" and deductive logic, showing the latter to be a limiting case of the former. The weaker syllogisms shown in Chapter 1 correspond to inequalities that can be derived from the product rule, and the direction of these inequalities starts to point toward likelihood ratios.

The product and sum rules allow us to consider the particular case where we have a finite set of mutually exclusive and exhaustive propositions, and background information which is symmetrical with respect to each such proposition: it says the same about any one of them as it says about any other. Considering two such situations, where the propositions are the same but the labels we give them are different, Jaynes shows that, given our starting desiderata, we cannot do other than to assign the same probabilities to propositions which we are unable to distinguish otherwise than by their labels.

This is the principle of indifference; its significance is that even though what we have derived so far is an infinity of functions p(x) generated by the parameter m, the desiderata entirely "pin down" the numerical values in this particular situation.

So far in this chapter we have been using p(x) as a function relating the plausibilities of propositions, such that p(x) is an arbitrary monotonic function of the plausibility x. At this point Jaynes suggests that we "turn this around" and say that x is a function of p. These values of p, probabilities, become the primary mathematical objects, while the plausibilities "have faded entirely out of the picture. We will just have no further use for them".

The principle of indifference now allows us to start computing numerical values for "urn probabilities", which will be the main topic of the next chapter.

Exercise 2.3 is notable for providing a formal treatment of the conjunction fallacy.

Chapter 2 ends with a cautionary note on the topic of justifying results on infinite sets based only on a "well-behaved" process of passing to the limit of a sequence of finite cases. The Comments section addresses the "subjective" vs. "objective" distinction.

Comments

I would like to share some interesting discussion on a hidden assumption used in Cox's Theorem (this is the result which states that what falls out of the desiderata is a probability measure).

First, some criticism of Cox's Theorem -- a paper by Joseph Y. Halpern published in the Journal of AI Research. Here he points out an assumption which is necessary to arrive at the associative functional equation:

F(x, F(y,z)) = F(F(x,y), z) for all x,y,z

This is (2.13) in PT:TLoS

Because this equation was derived by using the associativity of the conjunction operation A(BC) = (AB)C, there are restrictions on what values the plausibilities x, y, and z can take. If these restrictions were stringent enough that x,y and z could only take on finitely many values or if they were to miss an entire interval of values, then the proof would fall apart. There needs to be an additional assumption that the values they can take form a dense subset. Halpern argues that this assumption is unnatural and unreasonable since it disallows "notions of belief with only finitely many gradations." For example, many AI projects have only finitely many propositions that are considered.

K. S. Van Horn's article on Cox's Theorem addresses this criticism directly and powerfully starting on page 9. He argues that the theory that is being proposed should be universal and so having holes in the set of plausibilities should be unacceptable.

Anyhow, I found it interesting if only because it makes explicit a hidden assumption in the proof.

cata

After 2.9 in the text: "Furthermore, the function F(x, y) must be continuous; for otherwise an arbitrarily small increase in one of the plausibilities on the right-hand side of (2-1) could result in a large increase in AB|C."

Is there some particular reason that's an unacceptable outcome, or is it just generally undesirable?

(I suppose we might be in trouble later if it weren't necessarily continuous, since it wouldn't be necessarily differentiable (although he waves this off in a footnote with reference to some other papers and proofs) so this seems like an important statement.)

Jaynes mentions a "convenient" continuity assumption following 1.28, and uses it following 1.37. As you point out, the comments following 2.13 seem to indicate something of why Jaynes believes this assumption to be only convenient but not necessary. The comments of talyo just below suggest that Jaynes was wrong - we need something approximating continuity.

But continuity is not difficult to justify. We need only assume that we can flip a coin an arbitrarily large number of times and recall the results. Hmmm. Recall an arbitrarily large (unbounded) quantity of information? Maybe it is not so easy to justify...

During the proof of the product rule, Jaynes used, without proof, the lemma that if G(x,y)G(y,z) is independent of y, then G can be written as rH(x)/H(y). This is easy to believe, but it is quite an important step, so it's a shame he skipped it.

Below is a proof of this lemma (credit goes to a friend; I found a similar but more cumbersome proof):

We know G(x,y)G(y,z)=f(x,z) for some function f. Setting z=1 gives G(x,y)=f(x,1)/G(y,1), while setting y=z=1 gives f(x,1)=G(x,1)G(1,1). Substituting the second into the first gives G(x,y)=G(1,1)G(x,1)/G(y,1), which has the desired form.
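
A concrete illustration of the lemma (my own example, not part of the proof): take G(x,y) = 2x/y. Then G(x,y)G(y,z) = 4x/z is independent of y, and the construction above, G(x,y) = G(1,1)G(x,1)/G(y,1), recovers G exactly, with r = G(1,1) = 2 and H(t) = G(t,1):

import random

def G(x, y):
    return 2.0 * x / y

for _ in range(1000):
    x, y, z = (random.uniform(0.1, 10.0) for _ in range(3))
    assert abs(G(x, y) * G(y, z) - 4.0 * x / z) < 1e-8          # the product is independent of y
    assert abs(G(x, y) - G(1, 1) * G(x, 1) / G(y, 1)) < 1e-8    # the claimed form r*H(x)/H(y)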

I've re-written 2.6.1 'Subjective' vs. 'objective' in my own words. My words are very different from Jaynes' words which raises the question of whether I've got the substance of the Bayesian position correct.

2.6.1 is one of the least mathematical passages of Chapter 2. Is it important? I've selected it for special attention because it seems to lead to quarrels, and I believe I have diagnosed the problem with online discussions: everyone does a two-way split even though a three-way split is required.

cata

From your text: "If we think that probabilities are transcendal [sic] the principle of indifference offers us a free lunch. We get knowledge about the real world, merely by being ignorant of it. That is absurd."

Can you clarify what you mean by this?

I don't find your words very different from Jaynes' words. He makes it clear that his standard of "objectivity" is encapsulated in the consistency and completeness desiderata 3b and 3c, which ensure that two people reasoning independently from the same background information come to the same conclusions.

The difference is that you frame people's differences in background information as "different viewpoints" on the same situation, which I find a little confusing; logically, they're simply different situations that need to be reasoned out independently. There's no reason to be surprised when they yield different results.

Jaynes certainly agrees that probability is "situational" in your categorization. It can't be "individual", since that violates 3c consistency, unless you're considering people's mental states to be part of their background information (in which case it's just "situational.") "Transcendental" is just the special case of "situational" where everyone has perfect information.

There is a tradition of seeing probabilities as inherent physical properties of randomisation devices. That is what I was trying to get at with "transcendental". Probability goes beyond what we know we don't know to reach what is truly uncertain. In this tradition people say that when we toss a coin the probability of heads is 1/2 and this is a property of the coin.

I've sneaked a peek at section 10.3 How to cheat at coin and die tossing. I think that the Bayesian analysis is that when we toss a coin we exploit the fact that the coin is small compared to the fumble and tremble of human fingers which creates a situation of incomplete information.

What did I mean about the principle of indifference offering us a free lunch? Imagine that you are helping a friend move and you find an electronic die at the back of a drawer. "Oh, that old thing. There was a fault in the electronics. One number came up twice as often as the others." Your friend hasn't said which number; he might even have forgotten. According to the principle of indifference, the probabilities are 1/6 for the first roll. (The probabilities for the second roll are different, because the first roll hints weakly at which number comes up more often than the others.) If we insist on interpreting probabilities as physical properties of randomisation devices, then the principle of indifference seems to be whipping out a soldering iron and mending the defective circuitry, at least for the first roll.
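
(A quick sketch of that calculation, with my own reading of "twice as often" as 2/7 for the favoured face and 1/7 for each of the others: the first roll comes out at exactly 1/6 per face, while observing the first roll nudges the second roll's probabilities away from uniform.)

from fractions import Fraction

faces = range(1, 7)
prior = {i: Fraction(1, 6) for i in faces}       # hypothesis: face i is the frequent one

def like(roll, i):
    # probability of seeing `roll` if face i is the one that comes up twice as often
    return Fraction(2, 7) if roll == i else Fraction(1, 7)

# first roll: indifference over the six hypotheses gives 1/6 for every face
first = {k: sum(prior[i] * like(k, i) for i in faces) for k in faces}
assert all(pr == Fraction(1, 6) for pr in first.values())

# observe a first roll of 3, update the hypotheses, and predict the second roll
posterior = {i: prior[i] * like(3, i) for i in faces}
norm = sum(posterior.values())
posterior = {i: w / norm for i, w in posterior.items()}
second = {k: sum(posterior[i] * like(k, i) for i in faces) for k in faces}
print(second[3], second[1])                      # 9/49 vs 8/49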

The difference is that you frame people's differences in background information as "different viewpoints" on the same situation, which I find a little confusing; logically, they're simply different situations that need to be reasoned out independently.

I think that there is a valid distinction which your observation ignores. When there is a car crash, the witnesses are in different places, one riding in a vehicle, one walking on the pavement. We want to check dates and times because we want to distinguish between "different viewpoints" on the same car crash and "different viewpoints" of two car crashes (perhaps at the same dangerous junction). If we have different viewpoints on the same situation, there are consistency constraints due to there being only a single underlying situation. These constraints are absent when they are simply different situations.

Going back to my piece, the humour of the playing cards rests on Superstish having a hunch, that the card is more likely to be black than red, about the deck that Prankster fixed. Some people say that Superstish is entitled to his hunch. I'm trying to say that he isn't and also to explain the Bayesian position, which is that he isn't entitled to his hunch. It is important to my text that there is a single situation. If as author I were to spread the action over two days, with Prankster fixing the deck on day two and Superstish insisting on his hunch on day one, very few people would feel that Superstish was entitled to his hunch. I would be attacking a straw man.

What do y'all think of what Jaynes says in the Comments about Gödel's Theorem and Venn Diagrams?

His remarks on GT I found thoroughly confusing. The section on Venn diagrams doesn't seem to be entirely about diagrams as a method of illustration, but it's hard to tell exactly what it's about.

I don't feel a desperate need to deeply understand those bits, but it might make the summary of the chapter more useful if it addressed those questions.

What do y'all think about Morendil's claim not to be a native English speaker? ;-)

(Because that might be misunderstood: I don't get how non-native speakers get accustomed to using so many idioms and colloquialisms.)

Roko, ciphergoth, Leonhart and Alexandros can now testify to my accent. (I've sometimes been told I sound like I'm Russian.)

Indeed we can :-)

Okay, but where do you come up with terms like "y'all" and "spoilsport"?

I read a lot.

I gave the Chapter to my friend who'd studied Gödel's Theorems for a term for his Masters, and he said the bit on Gödel was highly dubious, though he didn't go into detail. But they don't seem essential at all; I suspect Jaynes was engaging in a misconceived defence of Bayesianism's compatibility with Gödel's results against a misconceived attack.

Book Club Query

I haven't seen a lot of discussion on the last half of Chapter 2. I for one had some trouble with exercise 2.3 so I was expecting a little more.

Does this mean that people need more time? That it's time to move on to Chapter 3? That participation is eroding? Something else?

cata

I did read the rest of chapter 2. I solved the first part of 2.3 without difficulty (proving the inequalities), but I was surprised to work for half an hour on the second part without solving it; I intend to come back to it with a clear head.

I also found the second part of exercise 2.3 surprisingly difficult; it took much longer than I would have expected (especially now that I've figured it out and see how simple the solution is).

Exercise 2.3. Limits on Probability Values. As soon as we have the numerical values a = P(A|C) and b = P(B|C), the product and sum rules place some limits on the possible numerical values for their conjunction and disjunction. Supposing that a ≤ b, show that the probability of the conjunction cannot exceed that of the least probable proposition: 0 ≤ P(AB|C) ≤ a, and the probability of the disjunction cannot be less than that of the most probable proposition: b ≤ P(A+B|C) ≤ 1. Then show that, if a + b > 1, there is a stronger inequality for the conjunction; and if a + b < 1, there is a stronger one for the disjunction.

I'll post my solutions for the second half (proving the stronger inequalities for the conjunction and disjunction).

1.) if a+b>1, we can prove a tighter lower bound on P(AB|C) as follows:

By the generalized sum rule (2-48), we have

P(AB|C)=a+b-P(A+B|C)

and since P(x)≤1 for any x, we have

a+b-1≤a+b-P(A+B|C)=P(AB|C)

and since 1<a+b, we know 0<a+b-1, so this is a tighter bound.

The intuitive explanation for this inequality is that if a+b>1, there must be some minimum amount of "overlap" of states which satisfy both A and B. This minimum is given by a-P(notB), because the largest possible proportion of states which can satisfy A without satisfying B is P(notB), so their difference is the minimum overlap.

2.) If a+b<1, we can prove a tighter upper bound on P(A+B|C) as follows:

This is basically the same as the previous one, so I'll just one-line it:

P(A+B|C)=a+b-P(AB|C)≤a+b<1
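
A brute-force check of both tightened bounds (my own addition, sampling random joint distributions over A and B):

import random

for _ in range(10000):
    w = [random.random() for _ in range(4)]        # weights for AB, A~B, ~AB, ~A~B
    t = sum(w)
    p_ab, p_a_only, p_b_only, _ = (x / t for x in w)
    a = p_ab + p_a_only                            # P(A|C)
    b = p_ab + p_b_only                            # P(B|C)
    p_or = p_ab + p_a_only + p_b_only              # P(A+B|C)
    assert max(0.0, a + b - 1.0) - 1e-12 <= p_ab <= min(a, b) + 1e-12
    assert max(a, b) - 1e-12 <= p_or <= min(1.0, a + b) + 1e-12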

(yes, I realize this post is more than a year old. Hopefully others working through the book later will find my response useful)

OK, in that case I'll wait a bit to give folks time to gnaw on this bone.

Can Equation 2.13 be viewed as a kind of communication?

AB = BA?

Communication? I think you mean commutation.

The two big algebraic properties are

associative: (x+y)+z = x+(y+z)

commutative: x+y = y+x

The prefix function notation used in Eq 2.13 uglifies the associative law so badly that one's eyes can barely focus on it, but I still think Jaynes makes the right choice. Introducing a new infix operator here would feel like sleight of hand.

The associative law is usually seen as more basic than the commutative law. Matrix multiplication is associative but not commutative. Composition of maps is associative but not commutative. The theory of commutative groups is much simpler than the theory of groups in general; for example a subgroup of a commutative group is always normal.

Aha, that makes perfect sense, thank you :) I had a feeling there was something going on there to that effect, but I could not put my finger on it till your reply. thanks.

Book Club Update

Summary of the first part of Chapter 2 posted, and reading for the week (apologies for the late update). Questions for the remainder:

  • What role is "background information" starting to play in the exposition so far of probability theory?
  • At what point does Chapter 2 finally arrive at a formal definition of "probability"? What do you think of this passage from a purely literary standpoint?
  • What justifies the "finite sets policy"?

In some sense, the definition of probability is spread out over the whole chapter until the word is first used; as far as I can tell, the definition of probability he gives is "the unique real number associated with a hypothesis such that the product and negation rules can be applied without an additional transformation, and without breaking consistency with the desiderata."

I was actually thinking about how to give a concise and intuitive version of the Bayesian definition of probability earlier today, and find this very disconcerting. It would feel awfully wrong to echo everything bad I've heard about frequentism while relying on the measure theoretic definition of probability for intuition.

Anyone having trouble with exercises 2.1 or 2.2 and need a hint?

If not, we'll be moving on to the rest of Chapter 2 soon.

One question to start this off:

  • To derive the Product Rule we are invited to consider (AB|C) as a functional relation of other plausibilities; one candidate is ruled out with the aid of a scenario involving blue eyes and brown eyes. Can you think of similar examples ruling out other candidate relations?

(More discussion questions always welcome !)

I have lots of questions about the math itself. I'll limit myself to one to start with. Can someone with more math than I have confirm that we get (2-7 (*)) by application of the chain rule of calculus?

(*) ETA - that's 2.16 in the printed edition

[anonymous]

Quick note: Jaynes cites Tribus (1969) as ruling out all but two possibilities. Actually, Tribus (1969) leaves ruling out most possibilities as an exercise for the reader.

I did not go through the 9 remaining cases, but I did think about one...

Suppose (AB|C) = F[(A|BC) , (B|AC)]. Compare A=B=C with (A = B) AND (C -> ~A).

Re 2-7: Yep, chain rule gets it done. By the way, took me a few minutes to realize that your citation "2-7" refers to a line in the pdf manuscript of the text. The numbering is different in the hardcopy version. In particular, it uses periods (e.g. equation 2.7) instead of dashes (e.g. equation 2-7), so as long as we're all consistent with that, I don't suppose there will be much confusion.

Could we standardize on using the whole-book-as-one-PDF version, at least for the purposes of referencing equations?

ETA: So far I've benefited from checking the relevant parts of Kevin Van Horn's unofficial errata pages before (and often while) reading a particular section.

OK, thanks.

I'm able to follow a fair bit of what's going on here; the hard portions for me are when Jaynes gets some result without saying which rule or operation justifies it - I suppose it's obvious to someone familiar with calculus, but when you lack these background assumptions it can be very hard to infer what rules are being used, so I can't even find out how I might plug the gaps in my knowledge. (Definitely "deadly unk-unk" territory for me.)

(Of course "follow" isn't the same thing at all as "would be able to get similar results on a different but related problem". I grok the notion of a functional equation, and I can verify intermediate steps using a symbolic math package, but Jaynes' overall strategy is obscure to me. Is this a common pattern, taking the derivative of a functional equation then integrating back?)

The next bit where I lose track is 2.22. What's going on here, is this a total derivative?

Yeah. A total derivative. The way I think about it is the dv thing there (jargon: a differential 1-form) eats a tangent vector in the y-z plane. It spits out the rate of change of the function in the direction of the vector (scaled appropriately with the magnitude of the vector). It does this by looking at the rate of change in the y-direction (the dy stuff) and in the z-direction (the dz stuff) and adding those together (since after taking derivatives, things get nice and linear).
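
A tiny symbolic illustration of what that adds up to, with a concrete v(y,z) of my own choosing:

import sympy as sp

y, z, dy, dz = sp.symbols('y z dy dz')
v = y * sp.log(z)                                  # any concrete v(y, z) will do
dv = sp.diff(v, y) * dy + sp.diff(v, z) * dz       # the total derivative / differential
print(dv)                                          # dy*log(z) + dz*y/z (term order may vary)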

I'm not too familiar with the functional equation business either. I'm currently trying to figure out what the heck is happening on the bottom half of page 32. Figuring out the top half took me a really long while (esp. 2.50).

I'm convinced that the inequality in eqn 2.52 shouldn't be there. In particular, when you stick in the solution S(x) = 1 - x, it's false. I can't figure out if anything below it depends on that because I don't understand much below it.

Soki

I could not figure out why alpha > 0 either, and it seems wrong to me too. But this does not look like a problem.

We know that J is an increasing function because of 2-49. So in 2-53, alpha and log(x/S(x)) must have the same sign, since the remaining term on the right-hand side tends toward 0 as q tends toward +infinity.

Then b is positive, and I think that is all that matters.

However, if alpha = 0, b is not defined. But if alpha = 0, then log(x/S(x)) = 0 as a consequence of 2-53, so x/S(x) = 1. There is only one x that gives us this, since S is strictly decreasing. And by continuity we can still get 2-56.

Lovely. Thanks.

I'm totally stuck on getting 2.50 from 2.48, would appreciate a hint.

K. S. Van Horn gives a few lines describing the derivation in his PT:TLoS errata. I don't understand why he does step 4 there -- it seems to me to be irrelevant. The two main facts which are needed are step 2-3 and step 5, the sum of a geometric series and the Taylor series expansion around y = S(x). Hopefully that is a good hint.

A nitpick with his errata: "1/(1-z) = 1 + z + O(z^2) for all z" is wrong, since the interval of convergence for the RHS is (-1,1). This is not important to the problem, since the z here will be z = exp(-q), which is less than 1 because q is positive.

Soki

It is not very important, but since you mentioned it :

The interval of convergence of the Taylor series of 1/(1-z) at z=0 is indeed (-1,1).

But "1/(1-z) = 1 + z + O(z^2) for all z" does not make sense to me.

1/(1-z) = 1 + z + O(z^2) means that there is an M such that |1/(1-z) - (1 + z)| is no greater than M*z^2 for every z close enough to 0. It is about the behavior of 1/(1-z) - (1 + z) as z tends toward 0, not when z belongs to (-1,1).

Is there anything more to getting 2.53 than just rearranging things around? I'm not sure I really understand where we get the left-hand side from.

Hopefully that is a good hint.

Indeed, thanks!

Cyan

Suppose (AB|C) = F[(A|BC) , (B|AC)]. Compare A=B=C with (A = B) AND (C -> ~A).

Not sure what you're getting at. To rule out (AB|C) = F[(A|BC) , (B|AC)], set A = B and let A's plausibility given C be arbitrary. Let T represent the (fixed) plausibility of a tautology. Then we have

(A|BC) = (B|AC) = T (because A = B)
(AB|C) = F(T, T) = constant

But (AB|C) is arbitrary by hypothesis, so (AB|C) = F[(A|BC) , (B|AC)] is not useful.

ETA: Credit where it's due: page 13, point 4 of Kevin S. Van Horn's guide to Cox's theorem (warning: pdf).

Yeah. My solution is basically the same as yours. Setting A=B=C makes F(T,T) = T. But setting A=B AND C -> ~A makes F(T,T) = F (warning: unfortunate notation collision here).

Cyan

Given C -> ~A, ({any proposition} | AC) is undefined. That's why I couldn't follow your argument all the way.

Ah OK. You're right. I guess I was taking the 'extension of logic' thing a little too far there. I had it in my head that ({any prop} | {any contradiction}) = T since contradictions imply anything. Thanks.

Cyan

That's legit so far as it goes -- it's just that every proposition is also false at the same time, since every proposition's negation is also true, and the whole enterprise goes to shit. There's no point in trying to extend logic to uncertain propositions when you can prove anything.