Approaching Logical Probability

Manfred

If we design a robot that acts as if it's uncertain about mathematical statements, that violates some desiderata for probability. But realistic robots cannot prove all theorems; they have to be uncertain about hard math problems.

In the name of practicality, we want a foundation for decision-making that captures what it means to make a good decision, even with limited resources. "Good" means that even though our real-world robot can't make decisions well enough to satisfy Savage's theorem, we want to approximate that ideal, not throw it out. Although I don't have the one best answer to give you, in this post we'll take some steps forward.

The objects we call probabilities are specified by desiderata that tell us how they behave. Any uncertainty about math problems violates those desiderata, but we still want to be able to assign logical probabilities that behave a lot like probabilities. The basic principles - not making up or ignoring information, not throwing away money or being inconsistent - should be deviated from as little as possible even when computing power is scarce. We want to develop a foundation for logical probabilities by starting from the rules governing ordinary probability, and then minimally restricting the application of those rules.

As we do this, it's important to keep track of what changes we make and why. Sometimes people just define logical probabilities, without worrying about desiderata. This is fine, when it works, and is often patchable if it doesn't have the right properties. But if you use it for something important and and get a surprise failure, it's really bad. My hope here is to construct logical probabilities that have the good properties, while keeping handwaving and mysterious assumptions to a minimum.

The perils of handwaving are more dire than they appear, and they are at their most dire in the hardest and most confusing reaches of physics. After better approaches fail, many theorists resort to just making up approximations and then trying to justify them. Doing this is known colloquially as a "1/ego expansion." Simply put, it doesn't work; there are too many vital little details. It's why even condensed matter theorists tell you not to trust condensed matter theorists about high temperature superconductivity.

We must abandon regular probabilities because our robot has limited time, but other parts of the decision-making process can also go over the time limit. If the robot's resources are limited, expected utility maximization breaks down at many points: there might be too many strategies to search through, too many outcomes to foresee, there might be probabilities that are too hard to find, and the utility of the outcomes might be too complicated.

The logical probabilities considered in this sequence will help approximate hard math problems, but they don't seem to help much when there are too many outcomes to consider, or if you want to make the best use of limited computational resources. They are only a part of the full solution.

Time for a desideratum: we want our robot to only assign a logical probability of 1 or 0 to a statement after it's actually checked the proof of that statement.

We can think of this as limiting what statements our robot is allowed to be certain about - only statements with short proofs can be found by our agent. However, this desideratum is not just about proof length, because a real robot won't check every checkable proof - it will spend time generating proofs, maybe trying to prove some specific statement, and will end up only checking some subset of short proofs.

Logical probabilities, unlike probabilities, are not determined just by the starting information. If our real robot only verifies some small collection of proofs, the robot's logical probabilities depend heavily upon what proof-steps were checked. One proof-step is just one truth-preserving step by our robot, like one application of modus ponens - it's a little proof one step long. The import is that they're the atomic unit of proofs, and once all the steps of a proof are checked, the proof is checked.

If we condition on which proof-steps get checked, does that determine the logical probabilities?

For any statement our robot is going to prove or disprove, we can use the checked proof steps to find whether it's logical probability 1 or 0. This gives the same answer as a real robot that checks steps according to some process and then returns 1 or 0 if it manages to prove or disprove the statement we give it. We just have to take the steps that the real robot ends up checking, and say that those are the proved steps for our abstract mathematical definition.

There's a problem, though. We haven't changed the old axioms, so they're still only satisfied if we get the right answer for everything. Meanwhile our new desideratum says we can't get the right answer for everything - we've made our axioms internally contradictory. In order to talk about the logical probabilities of unproven statements, we'll need to weaken the original axioms so that they no longer require certainty about everything. We'll explore ways to do this next time. Then we can assign numbers to statements in the usual way, by using our weakened axioms to find constraints, then maximizing entropy subject to those constraints.

Part of the sequence Logical Uncertainty

Previous Post: Logic as Probability

Next post: Solutions and Open Problems

For what it's worth, I now feel that representing any kind of uncertainty as probabilities might be a mistake. As I discussed here, in AMD-like situations indexical uncertainty is partly controlled by the agent, and to that extent it can't be represented as probabilities (though I know you disagree). But it's even easier to see that logical uncertainty is also partly controlled by the agent, because the agent's decision-making is controlling logical facts. (Self-fulfilling spurious proofs are related to that.) So it seems that any correct definition of probabilities must depend on what the agent is going to do with these probabilities. I guess we need some other abstraction instead of probabilities, but I have no clue what it might be.

I agree that if the outcome depends on our strategy or programming, then P(outcome | do(action)) has no proper place in our agent's decision-making. Savages theorem requires us to use probabilities for the things that determine the outcome; if our action does not determine the outcome, its probability isn't given by Savage's theorem.

I think you'd agree that there are more-abstract objects like P(outcome | strategy) that do obey Savage's theorem in these problems (I'd claim always, you'd maybe just say usually?). In the absent-minded driver problem, we can work the problem and get the right answer by using P(outcome | strategy). The strategy used determines the outcome, and so Savage's theorem works on it.

And I do think that simultaneously, we can use Cox's theorem to show that the absent-minded driver has some probability P(state | information). It's just not integrated with decision-making in the usual way - we want to obey Savage's theorem for that.

Also, Wei Dai made a comment similar to yours here.

Hmm. Following Sniffnoy's post, It seems to me that a randomized strategy determines a lottery over outcomes, even when the state of the world is fixed. So Savage's theorem will give you an arbitrary utility function over such lotteries, which cannot be easily converted to a utility function over outcomes... Have you worked through the application of Savage's theorem to things like P(outcome|strategy) in detail? I don't understand yet how it all works out.

a randomized strategy determines a lottery over outcomes, even when the state of the world is fixed

So, like, my options are to either to eat a cookie or not (utility 1 or 0). And if I want to randomize I can roll a die and only eat the cookie if I get an odd number. But then the expected utility of the strategy is pretty straightforward - 1/2 of a cookie.

Have you worked through the application of Savage's theorem to things like P(outcome|strategy) in detail? I don't understand yet how it all works out.

Not exhaustively, but the basic idea is that you're changing the objects in "event-space." If you want to look into it, I found this paper useful.

So, like, my options are to either to eat a cookie or not (utility 1 or 0)

If you already have probabilities and utilities, why do you need Savage's theorem? I thought it was used to prove that a reasonable decision-making agent must have utilities over outcomes in the first place. My point was that applying it to things like P(outcome|strategy) might lead to problems on that path.

Ah, I see - is the idea "if we haven't derived probabilities yet, how can we use probabilistic strategies?"

If we use some non-black-box random process, like rolling a die, then I think the problem resolves itself, since we don't have to use probabilities to specify a die, we can just have a symmetry in our information about the sides of the die, or some knowledge of past rolls, etc. Under this picture, the "mixed" in mixed strategy would be externalized to the random process, and it would have the same format as a pure strategy.

Hmm, no, I was trying to make a different point. Okay, let's back up a little. Can you spell out what you think are the assumptions and conclusions of Savage's theorem with your proposed changes? I have some vague idea of what you might say, and I suspect that the conclusions don't follow from the assumptions because the proof stops working, but by now we seem to misunderstand each other so much that I have to be sure.

I am proposing no changes. My claim is that even though we use english words like "event-space" or "actions" when describing Savage's theorem, the things that actually have the relevant properties in the AMD problem are the strategies.

Cribbing from the paper I linked, the key property of "actions" is that they are functions from the set of "states of the world" (also somewhat mutable) to the set of consequences (the things I have a utility function over). If the state is "I'm at the first intersection" and I take the action (no quotes, actual action) of "go straight," that does return a consequence.

How do you represent the strategy "always turn right" as a function from states to consequences? What does it return if the state is "I'm at the second intersection", which is impossible if the agent uses that strategy?

Well, if we're changing what objects are the "actions" in the proof, we're probably also changing which objects are the "states." You only need a strategy once, you don't need a new strategy for each intersection.

If we have a strategy like "go straight with probability p," a sufficient "state" is just the starting position and a description of the game.

Hmm, I'm not sure on what grounds we can actually rule out using the individual intersections as states, though, even though that leads to the wrong answer. Maybe they violate axiom 3, which requires the existence of "constant actions."

Sorry for deleting my comment. I'm still trying to figure out where this approach leads. So now you're saying that "I'm at the first intersection" isn't actually a "state" and shouldn't get a probability?

Right. To quote myself:

P(outcome | do(action)) has no proper place in our agent's decision-making. Savages theorem requires us to use probabilities for the things that determine the outcome; if our action does not determine the outcome, its probability isn't given by Savage's theorem.

And I do think that simultaneously, we can use Cox's theorem to show that the absent-minded driver has some probability P(state | information). It's just not integrated with decision-making in the usual way - we want to obey Savage's theorem for that.

So we'll have a probability due to Cox's theorem. But for decision-making, we won't ever actually need that probability, because it's not a probability of one of the objects Savage's theorem cares about.

Yes, the key property of actions is that they are functions from the set of states to the set of consequences. Strategies do not have that property, because they can be randomized. If you convert randomized strategies to deterministic ones by "externalizing" random processes into black boxes in the world, Savage's theorem will only give you some probability distribution over the black boxes, not necessarily the probability distribution that you intended. If you "hardcode" the probabilities of the black boxes into the inputs of Savage's theorem, you might as well hardcode other things like utilities, and I don't see the point.

Maybe I'm just thick, but I'm not at all convinced by your claim that probabilistic reasoning about potential mathematical theorems violates any desiderata.

I re-read the post you linked to in the first line, but am still not satisfied. Could you be a bit more specific? Which desideratum? And how violated?

Perhaps it will help you explain, if I describe how I see things.

Since mathematical symbols are nothing more than (for example) marks on a bit of paper, there is no sense in which strings of such symbols have any independent truth (beyond the fact that the ink really is arranged on the paper in that way). To talk about truth, we must refer to some physical mechanism that implements some analogy for the mathematical operations. Thus, a question about the plausibility of some putative theorem is really a question about the behaviour of some such mechanism. To do plausible inference under complete certainty (which, as you say must overlap with classical logic) is simply to do the calculus, having seen the output of the mechanism. To assign a probability having not seen the output of the mechanism seems to me to be just another bog-standard problem of inference under uncertainty about the state of some physical entity.

Have I missed an important point?

Cox's theorem literally has as a desideratum that the results should be identical to classical logic when you're completely certain about the axioms. This is what's violated.

I'll try illustrating this with an example.

Suppose we have a calculator that wants to add 298+587. If it can only take small steps, it just has to start with 298, and then do 298+1=299 (keeping track that this is step 1), 299+1=300 (step 2), 300+1=301 (that's 3), etc, until it reaches 884+1=885 (step 587), at which point the calculator put's "885" on the screen as its last step. And so our calculator, when you input 298+587, outputs 885.

In order to add two big numbers, our calculator only ever had to add 1 to things - the stored number and the step count - which is pretty cool.

The output follow inexorably follows from the rules of this calculator. When the first input was 298 and the step count is 587, the stored number is 885, and when the step count is equal to the second input, the calculator puts the stored number on the screen. We cannot get a different answer without using a calculator that follows different rules.

Now suppose we build a calculator that does the same thing but with different labels. We ask it for P(298+587=885 | standard axioms). So it start with something it knows - P(298=298)=1. Then it moves to P(298+1=299)=1. Then P(298+2=300)=1. Eventually it reaches P(298+587=885)=1, so it outputs "1." This is just a different label on our normal calculator. Every step it increments two numbers, one on the left and one on the fight, and then it stops when the incremented number on the left reaches 587. It's the same process.

This is the translation that obey's Cox's theorem. It's the same process as classical logic, because when your'e certain about axioms probabilistic logic and classical logic are equivalent.

Logical uncertainty is like saying that P(298+587=885 | standard axioms) = 0.5. This violates Cox's theorem because it doesn't agree with our "re-labeled calculator." But why is that bad / weird?

Well, suppose that even when we're logically uncertain, we can agree that P(298=298)=1, and that P(298+1=299)=1. So why can't we just do the calculator process, and keep adding 1 to both sides until we reach the desired answer? There must be some step where the calculator process breaks down, where our logically uncertain agent accepts that , e.g., P(298+100=398)=1, but thinks P(298+101=399)=0.5

In classical probability, this is not allowed. If you think that P(298+100=398)=1, and that adding 1 to both sides preserves probability (one of the axioms that defines addition), then it follows that P(298+101=391)=1. Anything else is... illogical.

Thanks for taking the time to elaborate.

I don't recall that desideratum in Jaynes' derivations. I think it is not needed. Why should it be needed? Certainty about axioms is a million miles from certainty about all their consequences, as seems to be the exact point of your series.

Help me out, what am I not understanding?

In Jaynes, this is sort of hidden in desideratum 2, "correspondence with common sense."

The key part is that if two statements are logically equivalent, Cox's theorem is required to assign them the same probability. Since the axioms of arithmetic and "the axioms of arithmetic, and also 298+587=885," are logically equivalent, they should be assigned the same probability.

I'm not sure how to help you well beyond that, my pedagogy is weak here.

I'm not sure that's what Jaynes meant by correspondence with common sense. To me, it's more reminiscent of his consistency requirements, but I don't think it is identical to any of them.

Certainly, it is desirable that logically equivalent statements receive the same probability assignment, but I'm not aware that the derivation of Cox's theorems collapses without this assumption.

Jaynes says, "the robot always represents equivalent states of knowledge by equivalent plausibility assignments." The problem, of course, is knowing that 2 statements are equivalent - if we don't know this, we should be allowed to make different probability assignments. Equivalence and known equivalence are, to me, not the same, and Jaynes' prescriptions seem to refer to the latter. I may know that x = 298 + 587, but not know that x = 885, so I would not be not violating probability theory if I adopted different degrees of belief for these statements.

Note that Jaynes used this consistency requirement to derive such principles as the Bernoulli urn rule, which is very much about symmetry of knowledge, and not about logical equivalence of states.

It's definitely not clear, I'll admit. And you're right, it is also a sort of consistency requirement.

Fortunately, I can direct you to section 5 of a more explicit derivation here.

Thanks, I'll take a look at the article.

If you don't mind, when you say "definitely not clear," do you mean that you are not certain about this point, or that you are confident, but it's complicated to explain?

i mean that I'm not very sure where that correspondence comes up in Jaynes, but Jaynes is being less explicit than other derivations, which I am more confident about.

Is there any more straightforward way to see the problem? I argued with you about this for a while and I think you convinced me, but it is still a little foggy. If there is a consistency problem, surely this means that we must be vulnerable to Dutch books doesn't it? I.e. they would not seem to be Dutch books to us, with our limited resources, but a superior intelligence would know that they were and would use them to con us out of utility. Do you know of some argument like this?

If there is a consistency problem, surely this means that we must be vulnerable to Dutch books doesn't it? I.e. they would not seem to be Dutch books to us, with our limited resources, but a superior intelligence would know that they were and would use them to con us out of utility.

Yes, this is right. Also, http://www.spaceandgames.com/?p=27 :)

If I know all the digits of pi and you think they're evenly distributed past a certain point, I can take your money.

In order to resist this, you need to have a hypothesis for "Manfred will pick the right number" - which, fortunately, is very doable, because the complexity of this hypothesis is only about the complexity of a program that computes the digits of pi.

But nonetheless, until you figure this out, that's the dutch book.

Lol that is a nice story in that link, but it isn't a Dutch book. The bet in it isn't set up to measure subjective probability either, so I don't really see what the lesson in it is for logical probability.

Say that instead of the digits of pi, we were betting on the contents of some boxes. For concreteness let there be three boxes, one of which contains a prize. Say also that you have looked inside the boxes and know exactly where the prize is. For me, I have some subjective probability P( X_i | I_mine ) that the prize is inside box i. For you, all your subjective probabilities are either zero or one, since you know perfectly well where the prize is. However, if my beliefs about where the prize is follow the probability calculus correctly, you still cannot Dutch book me, even though you know where the prize is and I don't.

So, how is the scenario about the digits of pi different to this? Do you have some example of an actual Dutch book that I would accept if I were to allow logical uncertainty?

edit:

Ok well I thought of what seems to be a typical Dutch book scenario, but it has made me yet more confused about what is special about the logical uncertainty case. So, let me present two scenarios, and I wonder if you can tell me what the difference is:

Consider two propositions, A and B. Let it be the case that A->B. However, say that we do not realise this, and say we assign the following probabilities to A and B:

P(A) = 0.5

P(B) = 0.5

P(B|A) = P(B)

P(A & B) = 0.25

indicating that we think A and B are independent. Based on these probabilities, we should accept the following arrangement of bets:

Sell bet for $0.50 that A is false, payoff $1 if correct

Sell bet for $0.25 that A & B are both true, payoff $1 if correct

The expected amount we must pay out is 0.5$1 + 0.25$1 = $0.75, which is how much we are selling the bets for, so everything seems fair to us.

Someone who understands that A->B will happily buy these bets from us, since they know that "not A" and "A & B" are actually equivalent to "not A" and "A", i.e. he knows P(not A) + P(A & B) = 1, so he wins $1 from us no matter what is the case, making a profit of $0.25. So that seems to show that we are being incoherent if we don't know that A->B.

But now consider the following scenario; instead of having the logical relation that A->B, say that our opponent just has some extra empirical information D that we do not, so that for him P(B|A,D) = 1. For him, then, he would still say that

so that we, who do not know D, could still be screwed by the same kind of trade as in the first example. But then, this is sort of obviously possible, since having more information than your opponent should give you a betting advantage. But both situations seem equivalently bad for us, so why are we being incoherent in the first example, but not in the second? Or am I still missing something?

a randomized strategy determines a lottery over outcomes, even when the state of the world is fixed

Have you worked through the application of Savage's theorem to things like P(outcome|strategy) in detail? I don't understand yet how it all works out.

Not exhaustively, but the basic idea is that you... (read more)

13

Approaching Logical Probability

13

13