A Limited But Better Than Nothing Way To Assign Probabilities to Statements of Logic, Arithmetic, etc.

alex_zag_al

If we want to reason with probability theory, we seem to be stuck if we want to reason about mathematics.

You can skip this pararaph and the next if you're familiar with the problem. But if you're not, here's an illustration. Suppose your friend has some pennies that she would like to arrange into a rectangle, which of course is impossible if the number of pennies is prime. Let's call the number of pennies N. Your friend would like to use probability theory to guess whether it's worth trying; if there's a 50% chance that Prime(N), she won't bother trying to make the rectangle. You might imagine that if she counts them and finds that there's an odd number, this is evidence of Prime(N); if she furthermore notices that the digits don't sum to a multiple of three, this is further evidence of Prime(N). In general, each test of compositeness that she knows should, if it fails, raise the probability of Prime(N).

But what happens instead is this. Suppose you both count them, and find that N=53. Being a LessWrong reader, you of course recognize from recently posted articles that N=53 implies Prime(N), though she does not. But this means that P(N=53) <= P(Prime(N)). If you're quite sure of N=53—that is, P(N=53) is near 1—then P(Prime(N)) is also near 1. There's no way for her to get a gradient of uncertainty from simple tests of compositeness. The probability is just some number near 1.

In general, conditional on the axioms, mathematical theorems have probability 1 if they're true, and 0 if they're false. Deriving these probabilities is exactly as difficult as deriving the theorems themselves.

A way of assigning actual probabilities to theorems occurred to me today. I usually see this problem discussed by folks that want to develop formal models of AI, and I don't know if this'll be helpful at all for that. But it's something I can imagine myself using when I want to reason about a mathematical conjecture in a basically reasonable way.

The basic idea is just, don't condition on the axioms. Condition on a few relevant things implied by the axioms. Then, maximize entropy subject to those constraints.

Like, if you're trying to figure out whether a number N is prime. Don't condition on any properties of numbers, any facts about addition or multiplication etc., just treat Prime(N) as a predicate. Then condition on a few facts about Prime(N). I'd start with the prime number theorem and a theorem about the speed of its convergence, which I imagine would lead to a prior distribution P(Prime(N)|N=n, F1) = 1/log(n) (where F1 is Fact1, the fact that the prime number theorem is true). Now, let's say we want to be able to update on noticing a number is odd, so lets add Fact2, which is that all prime numbers are odd and half of all numbers are odd, which allows us to use Bayes' theorem:

$bayesthm$

which gives us the intuitive conclusion that observing that a number is odd doubles the probability that it's prime.

It's a weird thing to do, leaving out part of your knowledge. Here's how I think of it, though, so that it's less weird.

Suppose that the way you do statistics is by building Jaynesian robots. I'm referring here to an image used over and over again in Jaynes's book Probability Theory: The Logic of Science. The image is that we're building robots that assign plausibilities to statements, and we try to construct these robots in such a way that these plausibilities are useful to us. That is, we try to make robots whose conclusions we'll actually believe, because otherwise, why bother?

One of his "desiderata", his principles of construction, is that the robot gives equal plausibility assignments to logically equivalent statements, which gets us into all this trouble when we try to build a robot we can ask about theorems. But I'm keeping this desideratum; I'm building fully Jaynesian robots.

All I'm doing is hiding some of my knowledge from the robot. And this actually makes sense to me, because sometimes I'd rather have an ignorant Jaynesian robot than a fully informed one.

This is because speed is always important, and sometimes an ignorant Jaynesian robot is faster. Like, let's say my friend and I are building robots to compete in a Number Munchers-like game, where you're given lots of very large numbers and have to eat only the primes. And let's make the levels timed, too. If my friend builds a Jaynesian robot that knows the fundamental axioms of number theory, it's going to have to thoroughly test the primeness of each number before it eats it, and it's going to run out of time. But let's say I carefully program mine with enough facts to make it effective, but not so many facts that it's slow. I'll win.

This doesn't really apply to modeling AI, does it? I mean, it's a way to assign probabilities, but not the best one. Humans often do way better using, I don't know, analogy or whatever they do. So why would a self-modifying AI continue using this, after it finds a better way?

But it's very simple, and it does seem to work. Except for how I totally handwaved turning the prime number theorem into a prior. There could be all sorts of subtleties trying to get from "prime number theorem and some theorem on speed of convergence, and maximizing entropy" to "P(Prime(N)|N=n) = 1/log(n)". I at least know that it's not rigorously provable exactly as stated, since 1/log(n) is above 1 for small n.

And, I honestly don't know how to do any better than this, and still use probability theory. I have that Haim Gaifman paper, "Reasoning with Limited Resources and Assigning Probabilities to Arithmetical Statements", in which he describes an entirely different way of conceptualizing the problem, along with a much better probabilistic test of primeness. It's long though, and I haven't finished it yet.

If we want to reason with probability theory, we seem to be stuck if we want to reason about mathematics.

The basic idea is just, don't condition on the axioms. Condition on a few relevant things implied by the axioms. Then, maximize entropy subject to those constraints.

$bayesthm$

which gives us the intuitive conclusion that observing that a number is odd doubles the probability that it's prime.

It's a weird thing to do, leaving out part of your knowledge. Here's how I think of it, though, so that it's less weird.

All I'm doing is hiding some of my knowledge from the robot. And this actually makes sense to me, because sometimes I'd rather have an ignorant Jaynesian robot than a fully informed one.

Looking with a naive view on math we see that initially all statments have not only a probability but a frequency. I just have to look at my children:

My 2.5 year old recently went from detecting single digits (mixing up 6 and 9 of course; who though about using rotation symetric digits#$!) to counting correctly (more or less) up to 14 but cannot apply this to determine cardinality larger than 3 or 4.
The 5 year old can count up to 100 (but often skips numbers) to determine cardinaility and can add up to 20 but not subtract that far. He will put forward random guesses when asked (by his brothers) about 10+10+10 (e.g. "43").
The 7.5 year old adds arbitrarily long numbers - but often gets the decimal places wrong (adding/missing 0s).
The 10 year old can extract and solve simple additive equations (a+b=20, a+4=b) without trying numbers (as his younger brother still would) but by using intuition ("a and b are about (a+b)/2 one 4/2 less and the other 4/2 more") but is far away from solving any equation system. He can find solutions only where he has an intuition.

Note that all these examples are not from lack of rigor or forgetting rules. And they are also not completely random (whatever that means). Each example names concepts that are in the process of being learned and where frequent 'errors' occur, not by noise but by insufficient detection and acquisition of the pattern. If I wanted I could assign probabilities to propositions like "N is the cardinality of set S" for each of the boys.

They show a gradual grasp of the concepts. I'd model probabilities from statements of logic or arithmetic by assigning non-zero probabilties to axioms (actually the axioms do not necessarily get the highest probabilities - at least not for my children). Rules of inference also get probabilities e.g. does none of my children undertand modus ponens. They might but it would initially have a low probability and you have to apply it very often to reach any sensible results. And probabilities are not simply propagated unlimited to other statements via the rules. On the other hand simple arithmetic surely has $P("a+b"="a"+"b")>0.99$ for $a, b < 100$ which also indicates that the probability depends on the size of the arguments (compare with the 53 is prime example).

The example of my 10 year boy is actually not far from Haim Gaifmans modelling: The axioms (arithmetics) are mastered (P(A)~1) and he can apply them but there are sentences which are too complex to handle such that only those that can be handled (by intermediate sentences captured by intuition) can be applied to solve problems reliably. The problem is that there is almost no grey. Either he will be able to solve a problem or he will not. He will most likely not guess on such tasks. If he reaches a wrong result that is not due to probabilistic failure to combine complex intermediate results nor by accumulated processing error (incresing rapidly by boredom) but by mis-judging a rule to apply (thus showing that he learning at the process level).
This leads me to think that Gaifmans modelling may be interesting from a theoretical point but not suitable to model human reasoning limits (not that that'd be his goal).