But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule - never mind that it takes 4^^^3 steps - and if you deviate from the right distribution that means you violate the product rule at some point.
This does not seem right to me. I feel like you are sneakily trying to condition all of the robots probabilities on mathematical proofs that it does not have a-priori. E.g. consider A, A->B, therefore B. To learn that P(A->B)=1, the robot has to do a big calculation to obtain the proof. After this, it can conclude that P(B|A,A->B)=1. But before it has the proof, it should still have some P(B|A)!=1.
Sure, it seems tempting to call the probabilities you would have after obtaining all the proofs of everything the "true" probabilties, but to me it doesn't actually seem different to the claim that "after I roll my dice an infinity of times, I will know the 'true' probability of rolling a 1". I should still have some beliefs about a one being rolled before I have observed vast numbers of rolls.
In other words I suggest that proof of mathematical relationships should be treated exactly the same as any other data/evidence.
edit: in fact surely one has to consider this so that the robot can incorporate the cost of computing the proof into its loss function, in order to decide if it should bother doing it or not. Knowing the answer for certain may still not be worth the time it takes (not to mention that even after computing the proof the robot may still not have total confidence in it; if it is a really long proof, the probability that cosmic rays have caused lots of bit-flips to mess up the logic may become significant. If the robot knows it cannot ever get the answer with sufficient confidence within the given time constraints, it must choose an action which accounts for this. And the logic it uses should be just the same as how it knows when to stop rolling die).
edit2: I realised I was a little sloppy above; let me make it clearer here:
The robot knows P(B|A,A->B)=1 apriori. But it does not know "A->B" is true apriori. It therefore calculates
P(B|A) = P(B|A,A->B) P(A->B|A) + P(B|A,not A->B) P(not A->B|A) = P(A->B|A)
After it obtains proof that "A->B", call this p, we have P(A->B|A,p) = 1, so
P(B|A,p) = P(B|A,A->B,p) P(A->B|A,p) + P(B|A,not A->B,p) P(not A->B|A,p)
collapses to
P(B|A,p) = P(B|A,A->B,p) = P(B|A,A->B) = 1
But I don't think it is reasonable to skip straight to this final statement, unless the cost of obtaining p is negligible.
edit3: If this somehow violates Savage or Cox's theorems I'd like to know why :).
If this somehow violates Savage or Cox's theorems I'd like to know why
Well, Cox's theorem has as a requirement that when your axioms are completely certain, you assign probability 1 to all classical consequences of those axioms. Assigning probability 0.5 to any of those consequences thus violates Cox's theorem. But this is kind of unsatisfying, so: where do we violate the product rule?
Suppose our robot knows that P(wet outside | raining) = 1. And it observes that it's raining, so P(rain)=1. But it's having trouble figuring out whether it's wet outside ...
Followup To: Putting in the Numbers
Before talking about logical uncertainty, our final topic is the relationship between probabilistic logic and classical logic. A robot running on probabilistic logic stores probabilities of events, e.g. that the grass is wet outside, P(wet), and then if they collect new evidence they update that probability to P(wet|evidence). Classical logic robots, on the other hand, deduce the truth of statements from axioms and observations. Maybe our robot starts out not being able to deduce whether the grass is wet, but then they observe that it is raining, and so they use an axiom about rain causing wetness to deduce that "the grass is wet" is true.
Classical logic relies on complete certainty in its axioms and observations, and makes completely certain deductions. This is unrealistic when applied to rain, but we're going to apply this to (first order, for starters) math later, which a better fit for classical logic.
The general pattern of the deduction "It's raining, and when it rains the grass is wet, therefore the grass is wet" was modus ponens: if 'U implies R' is true, and U is true, then R must be true. There is also modus tollens: if 'U implies R' is true, and R is false, then U has to be false too. Third, there is the law of non-contradiction: "It's simultaneously raining and not-raining outside" is always false.
We can imagine a robot that does classical logic as if it were writing in a notebook. Axioms are entered in the notebook at the start. Then our robot starts writing down statements that can be deduced by modus ponens or modus tollens. Eventually, the notebook is filled with statements deducible from the axioms. Modus tollens and modus ponens can be thought of as consistency conditions that apply to the contents of the notebook.
Doing math is one important application of our classical-logic robot. The robot can read from its notebook "If variable A is a number, A=A+0" and "SS0 is a number," and then write down "SS0=SS0+0."
Note that this requires the robot to interpret variable A differently than symbol SS0. This is one of many upgrades we can make to the basic robot so that it can interpret math more easily. We also want to program in special responses to symbols like 'and', so that if A and B are in the notebook our robot will write 'A and B', and if 'A and B' is in the notebook it will add in A and B. In this light, modus ponens is just the robot having a programmed response to the 'implies' symbol.
Certainty about our axioms is what lets us use classical logic, but you can represent complete certainty in probabilistic logic too, by the probabilities 1 and 0. These two methods of reasoning shouldn't contradict each other - if a classical logic robot can deduce that it's raining out, a probabilistic logic robot with the same information should assign P(rain)=1.
If it's raining out, then my grass is wet. In the language of probabilities, this is P(wet|rain)=1. If I look outside and see rain, P(rain)=1, and then the product rule says that P(wet and rain) = P(rain)·P(wet|rain), and that's equal to 1, so my grass must be wet too. Hey, that's modus ponens!
The rules of probability can also behave like modus tollens (if P(B)=0, and P(B|A)=1, P(A)=0) and the law of the excluded middle (P(A|not-A)=0). Thus, when we're completely certain, probabilistic logic and classical logic give the same answers.
There's a very short way to prove this, which is that one of Cox's desiderata for how probabilities must behave was "when you're completely certain, your plausibilities should satisfy the rules of classical logic."
In Foundations of Probability, I alluded to the idea that we should be able to apply probabilities to math. Dutch book arguments work because our robot must act as if it had probabilities in order to avoid losing money. Savage's theorem applies because the results of our robot's actions might depend on mathematical results. Cox's theorem applies because beliefs about math behave like other beliefs.
This is completely correct. Math follows the rules of probability, and thus can be described with probabilities, because classical logic is the same as probabilistic logic when you're certain.
We can even use this correspondence to figure out what numbers the probabilities take on:
1 for every statement that follows from the axioms, 0 for their negations.
This raises an issue: what about betting on the last digit of the 3^^^3'th prime? We dragged probability into this mess because it was supposed to help our robot stop trying to prove the answer and just bet as if P(last digit is 1)=1/4. But it turns out that there is one true probability distribution over mathematical statements, given the axioms. The right distribution is obtained by straightforward application of the product rule - never mind that it takes 4^^^3 steps - and if you deviate from the right distribution that means you violate the product rule at some point.
This is why logical uncertainty is different. Even though our robot doesn't have enough resources to find the right answer, using logical uncertainty violates Savage's theorem and Cox's theorem. If we want our robot to act as if it has some "logical probability," it's going to need a stranger sort of foundation.
Part of the sequence Logical Uncertainty
Previous Post: Putting in the Numbers
Next post: Approaching Logical Uncertainty