Comment author: Manfred 10 February 2014 11:40:59PM 0 points [-]

It seems to me you are neglecting the proposition "A->B"

Do you know what truth tables are? The statement "A->B" can be represented on a truth table: A and B is possible, not-A and B is possible, not-A and not-B is possible, but A and not-B is impossible.

A->B and the four statements about the truth table are interchangeable, even though when I talk about the truth table I never need to use the "->" symbol. They contain the same content because A->B says that A and not-B is impossible, and saying that A and not-B is impossible says that A->B. For example, "it raining but not being wet outside is impossible."
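This row-by-row equivalence can be checked mechanically; a minimal sketch (the variable names are mine, for illustration only):

```python
from itertools import product

# Check, for every row of the truth table, that "A -> B" (material
# implication) imposes exactly the constraint "A and not-B is impossible".
for a, b in product([True, False], repeat=2):
    implies = (not a) or b            # classical reading of A -> B
    row_allowed = not (a and not b)   # "the row A, not-B is ruled out"
    assert implies == row_allowed
```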

In the language of probability, saying that P(B|A)=1 means that A and not-B is impossible, while leaving the other possibilities able to vary freely. The product rule says P(A and not-B) = P(A) * P(not-B | A). What's P(not-B | A) if P(B | A)=1? It's zero, because it's the negation of our assumption.
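A toy numerical version of that product-rule step (the value of P(A) is an arbitrary illustrative choice; only P(B|A)=1 matters):

```python
p_A = 0.3                            # arbitrary prior for A, for illustration
p_B_given_A = 1.0                    # the premise: P(B|A) = 1
p_notB_given_A = 1.0 - p_B_given_A   # zero, the negation of the premise
p_A_and_notB = p_A * p_notB_given_A  # product rule
assert p_A_and_notB == 0.0           # "A and not-B" gets probability zero
```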

Writing out things in classical logic doesn't just mean putting P() around the same symbols. It means making things behave the same way.

Comment author: Kurros 11 February 2014 12:15:53AM *  -1 points [-]

Ok sure, so you can go through my reasoning leaving out the implication symbol, but retaining the dependence on the proof "p", and it all works out the same. The point is only that the robot doesn't know that A->B, therefore it doesn't set P(B|A)=1 either.

You had "Suppose our robot knows that P(wet outside | raining) = 1. And it observes that it's raining, so P(rain)=1. But it's having trouble figuring out whether it's wet outside within its time limit, so it just gives up and says P(wet outside)=0.5. Has it violated the product rule? Yes. P(wet outside) >= P(wet outside and raining) = P(wet outside | rain) * P(rain) = 1."

But you say it is doing P(wet outside)=0.5 as an approximation. This isn't true though, because it knows that it is raining, so it is setting P(wet outside|rain) = 0.5, which was the crux of my calculation anyway. Therefore when it calculates P(wet outside and raining) = P(wet outside | rain) * P(rain) it gets the answer 0.5, not 1, so it is still being consistent.

Comment author: Manfred 10 February 2014 09:52:45PM *  1 point [-]

"If this somehow violates Savage or Cox's theorems I'd like to know why"

Well, Cox's theorem has as a requirement that when your axioms are completely certain, you assign probability 1 to all classical consequences of those axioms. Assigning probability 0.5 to any of those consequences thus violates Cox's theorem. But this is kind of unsatisfying, so: where do we violate the product rule?

Suppose our robot knows that P(wet outside | raining) = 1. And it observes that it's raining, so P(rain)=1. But it's having trouble figuring out whether it's wet outside within its time limit, so it just gives up and says P(wet outside)=0.5. Has it violated the product rule? Yes. P(wet outside) >= P(wet outside and raining) = P(wet outside | rain) * P(rain) = 1.
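A toy check of that violation, using the numbers above:

```python
p_rain = 1.0
p_wet_given_rain = 1.0
p_wet_and_rain = p_wet_given_rain * p_rain  # product rule: equals 1.0
p_wet_guess = 0.5                           # the robot's give-up answer

# A conjunction can never be more probable than either conjunct, so we
# would need P(wet) >= P(wet and rain) = 1; the guess of 0.5 fails that.
assert p_wet_guess < p_wet_and_rain
```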

If we accept that the axioms have probability 1, we can deduce the consequences with certainty using the product rule. If at any point we stop deducing the consequences with certainty, this means we have stopped using the product rule.

Comment author: Kurros 10 February 2014 10:37:57PM *  -1 points [-]

Hmm this does not feel the same as what I am suggesting.

Let me map my scenario onto yours:

A = "raining"

B = "wet outside"

A->B = "It will be wet outside if it is raining"

The robot does not know P("wet outside" | "raining") = 1. It only knows P("wet outside" | "raining", "raining->wet outside") = 1. It observes that it is raining, so we'll condition everything on "raining", taking it as true.

We need some priors. Let P("wet outside") = 0.5. We also need a prior for "raining->wet outside", let that be 0.5 as well. From this it follows that

P("wet outside" | "raining")
= P("wet outside" | "raining", "raining->wet outside") P("raining->wet outside" | "raining") + P("wet outside" | "raining", not "raining->wet outside") P(not "raining->wet outside" | "raining")
= P("raining->wet outside" | "raining")
= P("raining->wet outside")
= 0.5

according to our priors [the first and second equalities are the same as in my first post; the third equality follows since whether or not it is "raining" is not relevant for figuring out whether "raining->wet outside"].

So the product rule is not violated.

P("wet outside") >= P("wet outside" and "raining") = P("wet outside" | "raining") P("raining") = 0.5

Where the inequality is actually an equality because our prior was P("wet outside") = 0.5. Once the proof p that "raining->wet outside" is obtained, we can update this to

P("wet outside" | p) >= P("wet outside" and "raining" | p) = P("wet outside" | "raining", p) P("raining" | p) = 1

But there is still no product rule violation because

P("wet outside" | p) = P("wet outside" | "raining", p) P("raining" | p) + P("wet outside" | not "raining", p) P(not "raining" | p) = P("wet outside" | "raining", p) P("raining" | p) = 1.
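The whole calculation can be sketched numerically, writing I for "raining->wet outside" and making the assumptions explicit: P(I) = 0.5 as the prior, P("wet outside" | "raining", I) = 1, P("wet outside" | "raining", not I) = 0 (since not-(A->B) is the same as A and not-B), and "raining" irrelevant to I:

```python
# I stands for the proposition "raining->wet outside".
p_I = 0.5                      # prior for the implication
p_I_given_rain = p_I           # "raining" is irrelevant to I
p_wet_given_rain_I = 1.0       # P("wet outside" | "raining", I)
p_wet_given_rain_notI = 0.0    # P("wet outside" | "raining", not I)

# Law of total probability, as in the chain of equalities above:
p_wet_given_rain = (p_wet_given_rain_I * p_I_given_rain
                    + p_wet_given_rain_notI * (1.0 - p_I_given_rain))
assert p_wet_given_rain == 0.5

# Product rule with P("raining") = 1: no violation, since
# P("wet outside" and "raining") = 0.5 <= P("wet outside") = 0.5.
p_wet_and_rain = p_wet_given_rain * 1.0
assert p_wet_and_rain == 0.5
```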

In a nutshell: you need three pieces of information to apply this classical chain of reasoning; A, B, and A->B. All three of these propositions should have priors. Then everything seems fine to me. It seems to me you are neglecting the proposition "A->B", or rather assuming its truth value to be known, when we are explicitly saying that the robot does not know this.

edit: I just realised that I was lucky that my first inequality worked out; I assumed I was free to choose any prior for P("wet outside"), but it turns out I am not: my priors for "raining" and "raining->wet outside" determine the corresponding prior for "wet outside", in order to be compatible with the product rule. I just happened to choose the correct one by accident.

In response to Logic as Probability
Comment author: Kurros 09 February 2014 11:52:35PM *  0 points [-]

"But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule - never mind that it takes 4^^^3 steps - and if you deviate from the right distribution that means you violate the product rule at some point."

This does not seem right to me. I feel like you are sneakily trying to condition all of the robot's probabilities on mathematical proofs that it does not have a priori. E.g. consider A, A->B, therefore B. To learn that P(A->B)=1, the robot has to do a big calculation to obtain the proof. After this, it can conclude that P(B|A,A->B)=1. But before it has the proof, it should still have some P(B|A)!=1.

Sure, it seems tempting to call the probabilities you would have after obtaining all the proofs of everything the "true" probabilities, but to me it doesn't seem any different from the claim that "after I roll my dice an infinity of times, I will know the 'true' probability of rolling a 1". I should still have some beliefs about a one being rolled before I have observed vast numbers of rolls.

In other words I suggest that proof of mathematical relationships should be treated exactly the same as any other data/evidence.

edit: in fact surely one has to consider this so that the robot can incorporate the cost of computing the proof into its loss function, in order to decide whether it should bother doing it at all. Knowing the answer for certain may still not be worth the time it takes (not to mention that even after computing the proof the robot may still not have total confidence in it; if it is a really long proof, the probability that cosmic rays have caused bit-flips that mess up the logic may become significant). If the robot knows it cannot ever get the answer with sufficient confidence within the given time constraints, it must choose an action which accounts for this, and the logic it uses should be just the same as how it knows when to stop rolling dice.

edit2: I realised I was a little sloppy above; let me make it clearer here:

The robot knows P(B|A,A->B)=1 a priori. But it does not know that "A->B" is true a priori. It therefore calculates

P(B|A) = P(B|A,A->B) P(A->B|A) + P(B|A,not A->B) P(not A->B|A) = P(A->B|A)

After it obtains proof that "A->B", call this p, we have P(A->B|A,p) = 1, so

P(B|A,p) = P(B|A,A->B,p) P(A->B|A,p) + P(B|A,not A->B,p) P(not A->B|A,p)

collapses to

P(B|A,p) = P(B|A,A->B,p) = P(B|A,A->B) = 1

But I don't think it is reasonable to skip straight to this final statement, unless the cost of obtaining p is negligible.
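A minimal sketch of this collapse, with the (classical) fact that P(B | A, not A->B) = 0, since not-(A->B) is the same as A and not-B:

```python
# P(B|A) as a function of the robot's current belief in A->B given A,
# using the decomposition above:
#   P(B|A) = P(B|A,A->B) P(A->B|A) + P(B|A,not A->B) P(not A->B|A)
def p_B_given_A(p_implies_given_A):
    return 1.0 * p_implies_given_A + 0.0 * (1.0 - p_implies_given_A)

# Before the proof p: P(A->B|A) = 0.5 under the uniform prior.
assert p_B_given_A(0.5) == 0.5
# After conditioning on p: P(A->B|A,p) = 1, and the sum collapses to 1.
assert p_B_given_A(1.0) == 1.0
```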

edit3: If this somehow violates Savage or Cox's theorems I'd like to know why :).

Comment author: cousin_it 08 February 2014 11:50:23AM *  1 point [-]

Yeah, that sounds right. You could say that a "true" number is a model parameter that fits the observed data well.

Comment author: Kurros 09 February 2014 11:30:04PM 1 point [-]

Perhaps, though, you could argue it differently. I have been trying to understand so-called "operational" subjective statistical methods recently (as advocated by Frank Lad and his friends), and he insists on only calling a thing a [meaningful, I guess] "quantity" when there is some well-defined operational procedure for measuring what it is. For him "measuring" does not rely on a model; he is referring to reading numbers off some device or other, I think. I don't quite understand him yet, since it seems to me that the numbers reported by devices all rely on some model or other to define them, but maybe one can argue their way out of this...

Comment author: Cyan 07 February 2014 01:50:00PM *  0 points [-]

I can pass along a recommendation I have received: Operational Subjective Statistical Methods by Frank Lad. I haven't read the book myself, so I can't actually vouch for it, but it was described to me as "excellent". I don't know if it is actively prediction-centered, but it should at least be compatible with that philosophy.

Comment author: Kurros 08 February 2014 12:06:35PM 1 point [-]

Thanks, this seems interesting. It is pretty radical; he is very insistent on the idea that for all 'quantities' about which we want to reason there must be some operational procedure we can follow in order to find out what they are. I don't know what this means for the ontological status of physical principles, models, etc., but I can at least see the naive appeal... it makes it hard to understand why a model could ever have the power to predict new things we have never seen before, though, like Higgs bosons...

Comment author: Kurros 08 February 2014 05:00:59AM 0 points [-]

"An example of a "true number" is mass. We can measure the mass of a person or a car, and we use these values in engineering all the time. An example of a "fake number" is utility. I've never seen a concrete utility value used anywhere, though I always hear about nice mathematical laws that it must obey."

It is interesting that you choose mass as your prototypical "true" number. You say we can "measure" the mass of a person or car. This is true in the sense that we have a complex physical model of reality, and at one of the most superficial levels of this model (Newtonian mechanics) there exist some abstract numbers which characterise the motions of "objects" in response to "forces". So "measuring" mass seems only to mean that we collect some data, fit this Newtonian model to that data, and extract relatively precise values for this parameter we call "mass".
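A toy illustration of that sense of "measuring": fitting the parameter m in F = m a to (acceleration, force) data by least squares. All the numbers here are invented for illustration, not real measurements:

```python
# Made-up (acceleration, force) pairs, roughly consistent with m = 2.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

# Least-squares estimate for the one-parameter model F = m * a:
#   m_hat = sum(a*F) / sum(a*a)
m_hat = sum(a * F for a, F in data) / sum(a * a for a, F in data)
assert abs(m_hat - 2.0) < 0.05   # "measured" mass, extracted from the fit
```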

Most of your examples of "fake" numbers seem to me to be definable in exactly analogous terms. Your main gripe seems to be that different people try to use the same word to describe parameters in different models, or perhaps that there do not even exist mathematical models for some of them; do you agree? To use a fun phrase I saw recently, the problem is that we are wasting time with "linguistic warfare" when we should be busy building better models?

Comment author: VipulNaik 07 February 2014 12:02:26AM *  0 points [-]

I understand this, though I hadn't thought of it with such clear terminology. I think the point Jonah was making was that in many cases, people are talking about propensities/frequencies when they refer to probabilities. So it's not so much that Jonah or I are confusing epistemic probabilities with propensities/frequencies, it's that many people use the term "probability" to refer to the latter. With language used this way, the probability distribution for this model parameter can be called the "probability distribution of the probability estimate." If you reserve the term probability exclusive to epistemic probability (degree of belief) then this would constitute an abuse of language.

Comment author: Kurros 07 February 2014 02:02:07AM 0 points [-]

Sure, I don't want to suggest we only use the word 'probability' for epistemic probabilities (although the world might be a better place if we did...), only that if we use the word to mean different sorts of probabilities in the same sentence, or even whole body of text, without explicit clarification, then it is just asking for confusion.

Comment author: Cyan 06 February 2014 08:56:38PM *  2 points [-]

I'd guess that in Geisser-style predictive inference, the meaning or reality or what-have-you of G is to be found in the way it encodes the dependence (or maybe, compresses the description) of the joint multivariate predictive distribution. But like I say, that's not my school of thought -- I'm happy to admit the possibility of physical model parameters -- so I really am just guessing.

Comment author: Kurros 07 February 2014 01:00:16AM *  2 points [-]

Hmm, do you know of any good material to learn more about this? I am actually extremely sympathetic to any attempt to rid model parameters of physical meaning; I mean, in an abstract sense I am happy to have degrees of belief about them, but when it comes to elucidating priors I find it extremely difficult to argue about what it is sensible to believe a priori about parameters, particularly given parameterisation dependence problems.

I am a particle physicist, and a particular problem I have is that parameters in particle physics are not constant; they vary with renormalisation scale (roughly, the energy of the scattering process). So if I want to argue about what it is a priori reasonable to believe about (say) the mass of the Higgs boson, it matters a very great deal what energy scale I choose to define my prior for the parameters at. If I choose (naively) a flat prior over low-energy values for the Higgs mass, it implies I believe some really special and weird things about the high-scale Higgs mass parameter values (they have to be fine-tuned to the bejesus); while if I believe something more "flat" about the high-scale parameters, it in turn implies something extremely informative about the low-scale values, namely that the Higgs mass should be really heavy (in the Standard Model; this is essentially the hierarchy problem, translated into Bayesian words).

Anyway, if I can more directly reason about the physically observable things and detach from the abstract parameters, it might help clarify how one should think about this mess...

Comment author: Cyan 05 February 2014 10:00:55AM *  3 points [-]

Yup, I'm referring to de Finetti's theorem. Thing is, de Finetti himself would have denied that there is such a thing as a parameter -- he was all about only assigning probabilities to observable, bet-on-able things. That's why he developed his representation theorem. From his perspective, p arises as a distinct mathematical entity merely as a result of the representation provided by exchangeability. The meaning of p is to be found in the predictive distribution; to describe p as a bias parameter is to reify a concept which has no place in de Finetti's Bayesian approach.

Now, I'm not a de-Finetti-style subjective Bayesian. For me, it's enough to note that the math is the same whether one conceives of p as stochastic model parameter or as the degree of plausibility of any single outcome. That's why I say it's not either/or.

Comment author: Kurros 06 February 2014 12:16:56AM 1 point [-]

Hmm, interesting. I will go and learn more deeply what de Finetti was getting at. It is a little confusing... in this simple case, fine, p can be defined in a straightforward way in terms of the predictive distribution, but in more complicated cases this quickly becomes extremely difficult or impossible. For one thing, a single model with a single set of parameters may describe the outcomes of vastly different experiments. E.g. consider Newtonian gravity. Strictly, the Newtonian gravity part of the model has to be coupled to various other models to describe the specific details of each setup, but in all cases there is a parameter G for the universal gravitation constant. G impacts the predictive distributions for all such experiments, so it is pretty hard to see how it could be defined in terms of them, at least in a concrete sense.

Comment author: Cyan 05 February 2014 03:52:39AM 1 point [-]

"This is just a parameter of a stochastic model, not a degree of belief."

This is not exactly correct. It's true that in general there's a sharp distinction to be made between model parameters (which govern/summarize/encode properties of the entire stochastic process) and degrees of belief for various outcomes, but that distinction becomes very blurry in the current context.

What's going on here is that the probability distribution for the observable outcomes is infinitely exchangeable. Infinite exchangeability gives rise to a certain representation for the predictive distribution under which the prior expected limiting frequency is mathematically equal to the marginal prior probability for any single outcome. So under exchangeability, it's not an either/or -- it's a both/and.
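A quick Monte Carlo sketch of this both/and reading, assuming (for illustration) a Beta(2, 3) prior over the limiting frequency p: the marginal probability of any single outcome comes out equal to the prior expected frequency.

```python
import random

random.seed(0)
a, b = 2.0, 3.0             # assumed Beta(a, b) prior over the limiting frequency
prior_mean = a / (a + b)    # E[p] = 0.4, the prior expected limiting frequency

# Draw p, then a single exchangeable outcome; the marginal probability
# that the first outcome is 1 should match the prior expected frequency.
n = 200_000
hits = 0
for _ in range(n):
    p = random.betavariate(a, b)
    if random.random() < p:
        hits += 1
assert abs(hits / n - prior_mean) < 0.01
```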

Comment author: Kurros 05 February 2014 04:12:04AM 1 point [-]

Are you referring to de Finetti's theorem? I can't say I understand your point. Does it relate to the edit I made shortly before your post? I.e. given a stochastic model with some parameters, you then have degrees of belief about certain outcomes, some of which may seem almost the same thing as the parameters themselves? I still maintain that the two are quite different: parameters characterise probability distributions, and just in certain cases happen to coincide with conditional degrees of belief. In this 'beliefs about beliefs' context, though, it is the parameters we have degrees of belief about; we do not have degrees of belief about the conditional degrees of belief with which said parameters may happen to coincide.
