If we want to reason with probability theory, we seem to be stuck as soon as we try to reason about mathematics itself.

You can skip this paragraph and the next if you're familiar with the problem. But if you're not, here's an illustration. Suppose your friend has some pennies that she would like to arrange into a rectangle (with at least two rows and two columns), which of course is impossible if the number of pennies is prime. Let's call the number of pennies N. Your friend would like to use probability theory to decide whether it's worth trying: if the probability of Prime(N) is 50% or more, she won't bother trying to make the rectangle. You might imagine that if she counts them and finds that there's an odd number, this is evidence of Prime(N); if she furthermore notices that the digits don't sum to a multiple of three, this is further evidence of Prime(N). In general, each test of compositeness that she knows should, if it fails to show compositeness, raise the probability of Prime(N).

 

But what happens instead is this. Suppose you both count them, and find that N=53. Being a LessWrong reader, you of course recognize from recently posted articles that N=53 implies Prime(N), though she does not. But this means that P(N=53) <= P(Prime(N)). If you're quite sure of N=53—that is, P(N=53) is near 1—then P(Prime(N)) is also near 1. There's no way for her to get a gradient of uncertainty from simple tests of compositeness. The probability is just some number near 1.
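Spelled out, the step that forces this (under the usual rules, which make the reasoner logically omniscient) is just the law of total probability: since N=53 implies Prime(N), we have P(Prime(N) | N=53) = 1, so

$$P(\mathrm{Prime}(N)) = P(\mathrm{Prime}(N) \mid N{=}53)\,P(N{=}53) + P(\mathrm{Prime}(N) \mid N{\neq}53)\,P(N{\neq}53) \;\ge\; P(N{=}53).$$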

 

In general, conditional on the axioms, mathematical theorems have probability 1 if they're true, and 0 if they're false. Deriving these probabilities is exactly as difficult as deriving the theorems themselves.

 

A way of assigning actual probabilities to theorems occurred to me today. I usually see this problem discussed by folks that want to develop formal models of AI, and I don't know if this'll be helpful at all for that. But it's something I can imagine myself using when I want to reason about a mathematical conjecture in a basically reasonable way.

 

The basic idea is just, don't condition on the axioms. Condition on a few relevant things implied by the axioms. Then, maximize entropy subject to those constraints.
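Stated slightly more formally (this is just the generic maximum-entropy recipe, nothing special to this post): among all distributions p consistent with the handful of facts you've chosen to condition on, pick the one that maximizes

$$H(p) = -\sum_x p(x)\,\log p(x), \quad \text{subject to } \sum_x p(x) = 1 \text{ and the chosen facts, encoded as constraints on } p.$$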

 

Like, say you're trying to figure out whether a number N is prime. Don't condition on any properties of numbers, any facts about addition or multiplication, etc.; just treat Prime(N) as a predicate. Then condition on a few facts about Prime(N). I'd start with the prime number theorem and a theorem about the speed of its convergence, which I imagine would lead to a prior distribution P(Prime(N)|N=n, F1) = 1/log(n) (where F1 is Fact1, the fact that the prime number theorem is true). Now, let's say we want to be able to update on noticing that a number is odd, so let's add Fact2, which is that all prime numbers are odd (setting aside 2) and half of all numbers are odd. This allows us to use Bayes' theorem:

 

$$P(\mathrm{Prime}(N) \mid \mathrm{Odd}(N), N{=}n, F1, F2) \;=\; \frac{P(\mathrm{Odd}(N) \mid \mathrm{Prime}(N), N{=}n, F1, F2)\; P(\mathrm{Prime}(N) \mid N{=}n, F1, F2)}{P(\mathrm{Odd}(N) \mid N{=}n, F1, F2)} \;=\; \frac{1 \cdot \frac{1}{\log(n)}}{\frac{1}{2}} \;=\; \frac{2}{\log(n)},$$

 

which gives us the intuitive conclusion that observing that a number is odd doubles the probability that it's prime.
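Here's a minimal numerical sketch of that two-fact model (my own illustration; the choice of natural log, the function names, and the blunt "all primes are odd" simplification for n > 2 are all assumptions):

```python
import math

def prior_prime(n):
    """Fact1 (prime number theorem): P(Prime(N) | N=n, F1) ~ 1/log(n)."""
    return 1 / math.log(n)

def posterior_prime_given_odd(n):
    """Bayes with Fact2: P(Odd|Prime, N=n) = 1 and P(Odd|N=n) = 1/2,
    so noticing oddness multiplies the prior by 1/(1/2) = 2."""
    return prior_prime(n) * 1 / 0.5

n = 53
print(prior_prime(n))                # ~0.25
print(posterior_prime_given_odd(n))  # ~0.50, i.e. doubled
```

(Amusingly, for the N=53 pennies from the opening, noticing oddness alone already lands the robot right at the friend's 50% threshold.)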

 

It's a weird thing to do, leaving out part of your knowledge. Here's how I think of it, though, so that it's less weird.

 

Suppose that the way you do statistics is by building Jaynesian robots. I'm referring here to an image used over and over again in Jaynes's book Probability Theory: The Logic of Science. The image is that we're building robots that assign plausibilities to statements, and we try to construct these robots in such a way that these plausibilities are useful to us. That is, we try to make robots whose conclusions we'll actually believe, because otherwise, why bother?

 

One of his "desiderata", his principles of construction, is that the robot gives equal plausibility assignments to logically equivalent statements, which gets us into all this trouble when we try to build a robot we can ask about theorems. But I'm keeping this desideratum; I'm building fully Jaynesian robots.

 

All I'm doing is hiding some of my knowledge from the robot. And this actually makes sense to me, because sometimes I'd rather have an ignorant Jaynesian robot than a fully informed one.

 

This is because speed is always important, and sometimes an ignorant Jaynesian robot is faster. Like, let's say my friend and I are building robots to compete in a Number Munchers-like game, where you're given lots of very large numbers and have to eat only the primes. And let's make the levels timed, too. If my friend builds a Jaynesian robot that knows the fundamental axioms of number theory, it's going to have to thoroughly test the primeness of each number before it eats it, and it's going to run out of time. But let's say I carefully program mine with enough facts to make it effective, but not so many facts that it's slow. I'll win.
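Here's a toy sketch of the sort of "ignorant but fast" robot I mean (everything here is hypothetical: the facts it's been fed, the eating threshold, the independence shortcut behind the updates; it is emphatically not a real primality test):

```python
import math

KNOWN_FACTS = [2, 3, 5, 7, 11]   # the only divisibility facts this robot was given

def prob_prime(n):
    """Prime-number-theorem prior, updated on each cheap divisibility test,
    treating the tests as independent (a modelling shortcut, not a theorem)."""
    p = 1 / math.log(n)
    for q in KNOWN_FACTS:
        if n != q and n % q == 0:
            return 0.0           # a witnessed factor settles the question
        p *= q / (q - 1)         # passed the test: Bayes factor q/(q-1)
    return min(p, 1.0)

def should_eat(n, threshold=0.5):
    """Eat the number only if the cheap probability estimate clears the bar."""
    return prob_prime(n) > threshold

print(prob_prime(442537), should_eat(442537))   # ~0.37, False
```

The point isn't that this robot is right very often; it's that it answers in a handful of arithmetic operations, while the fully informed robot is still factoring.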

 

This doesn't really apply to modeling AI, does it? I mean, it's a way to assign probabilities, but not the best one. Humans often do way better using, I don't know, analogy or whatever they do. So why would a self-modifying AI continue using this, after it finds a better way?

 

But it's very simple, and it does seem to work. Except for how I totally handwaved turning the prime number theorem into a prior. There could be all sorts of subtleties in trying to get from "prime number theorem and some theorem on speed of convergence, and maximizing entropy" to "P(Prime(N)|N=n) = 1/log(n)". I at least know that it's not rigorously provable exactly as stated, since 1/log(n) is above 1 for small n (for n = 2, 1/log(2) ≈ 1.44).

 

And, I honestly don't know how to do any better than this, and still use probability theory. I have that Haim Gaifman paper, "Reasoning with Limited Resources and Assigning Probabilities to Arithmetical Statements", in which he describes an entirely different way of conceptualizing the problem, along with a much better probabilistic test of primeness. It's long though, and I haven't finished it yet.

Comments (16)

Looking at math from a naive point of view, we see that initially all statements have not only a probability but a frequency. I just have to look at my children:

  • My 2.5-year-old recently went from recognizing single digits (mixing up 6 and 9 of course; who thought of using rotation-symmetric digits?!) to counting correctly (more or less) up to 14, but cannot apply this to determine cardinalities larger than 3 or 4.

  • The 5-year-old can count up to 100 (though he often skips numbers) to determine cardinality, and can add up to 20 but not subtract that far. He will put forward random guesses when asked (by his brothers) about 10+10+10 (e.g. "43").

  • The 7.5-year-old adds arbitrarily long numbers, but often gets the place values wrong (adding or missing 0s).

  • The 10-year-old can extract and solve simple additive equations (a+b=20, a+4=b) without trying numbers (as his younger brother still would), using intuition instead ("a and b are about (a+b)/2, one of them 4/2 less and the other 4/2 more"), but he is far from solving general systems of equations. He can find solutions only where he has an intuition.

Note that none of these examples result from a lack of rigor or from forgetting rules. And they are also not completely random (whatever that means). Each example names concepts that are in the process of being learned and where frequent 'errors' occur, not from noise but from insufficient detection and acquisition of the pattern. If I wanted, I could assign probabilities to propositions like "N is the cardinality of set S" for each of the boys.

They show a gradual grasp of the concepts. I'd model probabilities of statements of logic or arithmetic by assigning non-zero probabilities to the axioms (actually, the axioms do not necessarily get the highest probabilities - at least not for my children). Rules of inference also get probabilities; e.g., none of my children understands modus ponens yet. They might, but it would initially have a low probability, and you have to apply it very often to reach any sensible results. And probabilities are not simply propagated without limit to other statements via the rules. On the other hand, simple arithmetic surely has $P("a+b"="a"+"b")>0.99$ for $a, b < 100$, which also indicates that the probability depends on the size of the arguments (compare with the "53 is prime" example).

The example of my 10-year-old boy is actually not far from Haim Gaifman's modelling: the axioms (of arithmetic) are mastered (P(A)~1) and he can apply them, but there are sentences which are too complex to handle, such that only those that can be handled (via intermediate sentences captured by intuition) can be applied to solve problems reliably. The problem is that there is almost no grey. Either he will be able to solve a problem or he will not. He will most likely not guess on such tasks. If he reaches a wrong result, that is not due to a probabilistic failure to combine complex intermediate results, nor to accumulated processing error (increasing rapidly with boredom), but to mis-judging which rule to apply (thus showing that he is still learning at the process level).
This leads me to think that Gaifman's modelling may be interesting from a theoretical point of view but not suitable for modelling the limits of human reasoning (not that that would be his goal).

Hey, nice!

You might also like my post, here.

oh, excellent, I was hoping people would link me to other posts. I've searched LW on the topic before, but never seen this.

Yeah - there's a predecessor post by Benja Fallenstein that's good but relatively impenetrable, and a more recent post by I believe Abram Demski that comes at the problem from a different angle (A sampling distribution over algorithms).

EDIT: aha, found it

For those wanting to read the Haim Gaifman paper: It took me about one hour to read it.

Notable quotes:

Rationality, let us agree, implies that if I know that something is deductively implied by what I believe, then I should believe it, or revise my other beliefs.

Now the fact that mathematical activity is truth-revealing should not blind us to the a priori nature, or to the necessity, of mathematical truth; to its being constitutive of our conceptual organization, without which thought loses its coherence.

[The] deductive shortsightedness that makes [Mersenne's] error possible isolates it to a large extent from the main body of other beliefs. The holding of it does not prevent one from developing other sound arithmetical beliefs about prime numbers. On the mathematical background of Mersenne's time, believing that $2^{67}-1$ is prime was like having a false belief about some far away island. This is not due merely to the number's size, but to the scarcity of known deductive chains between that belief and others. When the known territories of the deductive web expand, the contradiction between that belief and others emerges; at that stage revision takes place.

You can skip this paragraph and the next if you're familiar with the problem. But if you're not, here's an illustration. Suppose your friend has some pennies that she would like to arrange into a rectangle (with at least two rows and two columns), which of course is impossible if the number of pennies is prime. Let's call the number of pennies N. Your friend would like to use probability theory to decide whether it's worth trying: if the probability of Prime(N) is 50% or more, she won't bother trying to make the rectangle. You might imagine that if she counts them and finds that there's an odd number, this is evidence of Prime(N); if she furthermore notices that the digits don't sum to a multiple of three, this is further evidence of Prime(N). In general, each test of compositeness that she knows should, if it fails to show compositeness, raise the probability of Prime(N).

But what happens instead is this. Suppose you both count them, and find that N=53. Being a LessWrong reader, you of course recognize from recently posted articles that N=53 implies Prime(N), though she does not. But this means that P(N=53) <= P(Prime(N)). If you're quite sure of N=53—that is, P(N=53) is near 1—then P(Prime(N)) is also near 1. There's no way for her to get a gradient of uncertainty from simple tests of compositeness. The probability is just some number near 1.

I don't understand why this is a problem. You and your friend have different states of knowledge, so you assign different probabilities.

the problem is that, in probability theory as usually formalized and discussed, we assign the same probabilities, though we shouldn't... do you see?

EDIT: and it's a problem because she can't calculate her probability without proving whether or not it's prime.

One of his "desiderata", his principles of construction, is that the robot gives equal plausibility assignments to logically equivalent statements

I don't see this desideratum. The consistency requirement is that if there are multiple ways of calculating something, then all of them yield the same result. A few minutes of thought didn't lead to any way of leveraging a probability other than 0 or 1 for Prime(53) into two different results.

If I try to do anything with P(Prime(53)|PA), I get stuff like P(PA|Prime(53)), and I don't have any idea how to interpret that. Since PA is a set of axioms, it doesn't really have a truth value that we can do probability with. Technically speaking, Prime(53) means that the PA axioms imply that 53 has exactly two factors. Since the axioms are in the predicate, any mechanism that forces P(Prime(53)) to be one must do so for all priors.

One final thing: isn't it wrong to assign a probability of zero to Prime(4), i.e. to the statement that PA implies 4 has exactly two factors, since PA could be inconsistent and imply everything?

I now think you're right that logical uncertainty doesn't violate any of Jaynes's desiderata. Which means I should probably try to follow them more closely, if they don't create problems like I thought they would.

An Aspiring Rationalist's Ramble has a post asserting the same thing, that nothing in the desiderata implies logical omniscience.

What you're referring to are called 'features' in the machine intelligence community. A 'feature' is usually an easily computable property of an input data point that provides some information as to which class the input data point belongs to. Models are then built by specifying a probability distribution over features. An optimal feature is one that gives the maximal amount of information after all previous features have been taken into account. For instance, if a feature takes one bit to describe, it is optimal if, once all other features have been taken into account, it provides one bit of information (or, equivalently, partitions the probability distribution exactly in half). If a feature is not optimal, there are various measures for determining how close to optimal it is and whether it should be used or not.

Ordinary reasoning in mathematics is just a special case of Bayesian reasoning, as has been pointed out numerous times in the sequences. There has been a lot of work on optimal feature selection and how to derive good features (for example, using the Bayesian Information Criterion or BIC). It might be useful to extend your idea to incorporate those developments.
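As a concrete sketch, here's one way to measure how informative the cheap "is odd" feature is about primality over a range of numbers (a toy illustration of the feature framing; the range and the empirical estimates are arbitrary choices):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def sieve(limit):
    """Sieve of Eratosthenes: primality flags for 0..limit-1."""
    flags = [True] * limit
    flags[0] = flags[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = [False] * len(flags[i * i::i])
    return flags

LIMIT = 100_000
is_prime = sieve(LIMIT)

h_class = entropy(Counter(is_prime[2:]))                               # H(prime?)
h_feat = entropy(Counter(n % 2 == 1 for n in range(2, LIMIT)))         # H(odd?)
h_joint = entropy(Counter((n % 2 == 1, is_prime[n]) for n in range(2, LIMIT)))
print("I(odd; prime) =", h_class + h_feat - h_joint, "bits")           # mutual information
```

By the same bookkeeping you could score the digit-sum test, trial division by small primes, and so on, and keep only the features whose information is worth their cost.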

It may be relevant to look at how mathematicians use heuristics to actually make conjectures that seem plausible. The heuristic for there being infinitely many Mersenne primes seems to be of a flavor very similar to what you are doing here.

Unsurprisingly, Terry Tao has an excellent post about this sort of thing.

It's interesting that he talks about "structure", and how the heuristics work best when there's the least structure. I guess he'd describe what I'm doing as including limited structure?

He's also brought attention to a problem with my proposal. I didn't think you could ever end up certain of a falsehood. But in a probabilistic model of numbers, he proved that there are infinitely many solutions to x^3 + y^3 = z^3. A robot with that probabilistic model would be certain that Fermat's Last Theorem is false.

Well, maybe it would. I wish I had the time today to learn to analyze this whole idea more precisely, instead of just conjecturing left and right!

So you mention that one advantage of this method is that it can reason about things faster - because it uses less computational power.

One of his "desiderata", his principles of construction, is that the robot gives equal plausibility assignments to logically equivalent statements, which gets us into all this trouble when we try to build a robot we can ask about theorems. But I'm keeping this desideratum; I'm building fully Jaynesian robots.

Unless I'm misunderstanding something, I don't think you'll be able to keep this with limited computational resources. All (ETA: syntactically) true statements will be assigned the same probability, which makes this kind of probability useless. And by hiding logical information, as you have, you are already breaking this.

I think a better way to think about this is as an update on new computations. (For simplicity, assume that N>2.) You start with the prior that P(Prime(N)|N=n) = 1/log(n). Then you compute (i.e. prove) from the definition of "prime" that all primes are odd, and express this as P(Odd(N)|Prime(N), N=n)=1. Similarly, compute that P(Odd(N)|N=n)=1/2. Then you can compute the Bayesian update. Any new logical information has to come from a computation of some kind.

So if you are given a large number, say 442537, you have a prior probability of it being prime of about 1/log(442537) ≈ 7.7% (using the natural log). You then compute that it is odd, and update on this information, giving you a new probability of about 15%, etc...

Does this seem to clarify the concept?

Yeah, this makes sense.

Does this apply to Boolean logic too, in addition to deductions using the axioms of the system?

The archetype I've got in my head is Jaynes's derivation of P(A or B) = P(A) + P(B) - P(A and B).

The first step is that P(A or B) = P(~(~A and ~B)). So if we're updating on computations rather than facts, what we really have is only P(A or B|C) = P(A|C) + P(B|C) - P(A and B|C), where C is the computation that (A or B) is the same as ~(~A and ~B).
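Spelling out the derivation I have in mind, with each step an application of negation or the product rule:

\begin{align*}
P(A \lor B \mid C) &= 1 - P(\lnot A \land \lnot B \mid C) \\
&= 1 - P(\lnot A \mid C)\,P(\lnot B \mid \lnot A, C) \\
&= 1 - P(\lnot A \mid C)\,\bigl[1 - P(B \mid \lnot A, C)\bigr] \\
&= P(A \mid C) + P(\lnot A \land B \mid C) \\
&= P(A \mid C) + P(B \mid C)\,\bigl[1 - P(A \mid B, C)\bigr] \\
&= P(A \mid C) + P(B \mid C) - P(A \land B \mid C).
\end{align*}

Each line uses only negation or the product rule, which is the kind of step I'd want to count as a "computation" here.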

Does that make sense, or is that a different kind of thing?

Unless I'm misunderstanding something, I don't think you'll be able to keep this with limited computational resources. All true statements will be assigned the same probability, which makes this kind of probability useless. And by hiding logical information, as you have, you are already breaking this.

No. It sounds like you imagine that the robot knows the axioms of number theory? It doesn't.

The idea is that you've got some system you're interested in, but the robot's knowledge underdetermines that system. From the info you've fed it, it can't prove all the true things about that system. So, there are things true about the system that it doesn't assign probability 1 to. One way of thinking about it is that the robot doesn't know precisely what system you want it to think about. I mean, I've left out all facts about ordering, for example; how's the robot to know we're even talking about numbers?

EDIT: However, I'm realizing now that it still has to know Boolean logic, which means it assigns probabilities of 0 or 1 to the answers to 3-SAT problems, which are NP-complete. So, yeah, it still has useless 0/1 probabilities that you can't calculate in reasonable time.

Re: updating on computations, I have to give that more thought. It was easy to respond in order to clarify my existing thoughts, and I didn't want to keep you waiting unnecessarily, but it'll take me more time to adapt them.