Eliezer wrote a post warning against unrealistically confident estimates, in which he argued that you can't be 99.99% sure that 53 is prime. Chris Hallquist replied with a post arguing that you can.
That particular case is tricky. There have been many independent calculations of the first hundred prime numbers. 53 is a small enough number that I think someone would notice if Wikipedia included it erroneously. But can you be 99.99% confident that 1159 is a prime? You found it in one particular source. Can you trust that source? It's large enough that no one would notice if it were wrong. You could try to verify it, but if I write a Perl or C++ program, I can't even be 99.9% sure that the compiler or interpreter will interpret it correctly, let alone that the program is correct.
Rather than argue over the number of nines to use for a specific case, I want to emphasize the importance of not assigning things probability zero or one. Here's a real case where approximating 99.9999% confidence as 100% had disastrous consequences.
I developed a new gene-caller for JCVI. Genes are interpreted in units of 3 DNA nucleotides called codons. A bacterial gene starts with a start codon (usually ATG, TTG, or GTG) and ends at the first stop codon (usually TAG, TGA, or TAA). Most such sequences are not genes. A gene-caller is a computer program that takes a DNA sequence and guesses which of them are genes.
The first thing I tried was to create a second-order Markov model on codons, and train it on all of the large possible genes in the genome. (Long sequences without stop codons are unlikely to occur by chance and are probably genes.) That means that you set P = 1 and go down the sequence of each large possible gene, codon by codon, multiplying P by the probability of seeing each of the 64 possible codons in the third position given the codons in the first and second positions. Then I created a second Markov model from the entire genome. This took about one day to write, and plugging these two models into Bayes' law as shown below turned out to work better than all the other single-method gene-prediction algorithms developed over the past 30 years.
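A minimal sketch of that training step, in Python. This is illustrative only, not the actual JCVI code; the function names `codons` and `train` and the toy input are my own. It estimates P(codon3 | codon1, codon2) by counting consecutive codon triplets:

```python
from collections import defaultdict

def codons(seq):
    """Split a DNA string into consecutive 3-nucleotide codons."""
    return [seq[i:i+3] for i in range(0, len(seq) - 2, 3)]

def train(sequences):
    """Return {(c1, c2): {c3: P(c3 | c1, c2)}} from codon-triplet counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        cs = codons(seq)
        # Slide a window of three codons down the sequence.
        for c1, c2, c3 in zip(cs, cs[1:], cs[2:]):
            counts[(c1, c2)][c3] += 1
    # Normalize counts into conditional probabilities.
    return {ctx: {c3: n / sum(nxt.values()) for c3, n in nxt.items()}
            for ctx, nxt in counts.items()}
```

Training the same machinery a second time on the whole genome gives the background model; the two are then combined with Bayes' law as described below.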
But what probability should you assign to a codon sequence that you've never seen? A bacterial genome might have 4 million base pairs, about half of which are in long possible genes and will be used for training. That means your training data for one genome has about 2 million codon triplets. Surprisingly, a little less than half of all possible codon triplets do not occur at all in that data (DNA sequences are not random). What probability do you assign to an event that occurs zero times out of 2 million?
This came up recently in an online argument. Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.
Well, I'd already empirically determined whether that was true for the gene caller. First, due to a coding error, I assigned such events P(X) = 1 / (64^3 * size of training set), which is too small by a factor of about 64^3. Next I tried P(X) = 0.5 / (size of training set), which is approximately correct. Finally I tried P(X) = 0. I tested the results on genomes where I had strong evidence for which sequences were and were not genes.
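In code, the three choices look like this (a sketch; the 2-million figure is the approximate training-set size from above, and 64^3 is the number of possible codon triplets). Measured in bits, the two non-zero choices differ only by a fixed offset:

```python
import math

train_size = 2000000  # ~2 million codon triplets of training data

# Three probabilities tried for a triplet seen zero times in training:
p_buggy   = 1.0 / (64**3 * train_size)  # coding error: too small by ~64^3
p_correct = 0.5 / train_size            # approximately correct pseudocount
p_zero    = 0.0                         # the disastrous choice

# Cost in bits of evidence when such a triplet actually occurs:
bits_buggy   = -math.log(p_buggy, 2)    # ~39 bits
bits_correct = -math.log(p_correct, 2)  # ~22 bits
print(bits_buggy - bits_correct)        # 17.0 -- a constant offset
# whereas -log(0) is infinite: no amount of other evidence can recover.
```

A constant 17-bit difference is easily swamped by the hundreds of other bits in a gene's score, which is why the two non-zero settings behaved almost identically.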
How well do you think each P(X) worked?
The two non-zero probabilities gave nearly the same results, despite differing by 6 orders of magnitude. But using P(X) = 0 caused the gene-caller to miss hundreds of genes per genome, which is a disastrous result. Why?
Any particular codon triplet that was never found in the training set would have a prior of less than one in 4 million. But because a large number of triplets lie in genes outside the training set, some of those triplets (not most, but about a thousand of them) had true priors of nearly one half of being found somewhere in those genes. (You can work it out in more detail by assuming a Zipf law distribution of priors, but I won't get into that.)
So some of them did occur within genes in that genome, and each time one did, its assigned probability of zero annihilated all the hundreds of other pieces of evidence for the existence of that gene, making the gene impossible to detect.
You can think of this using logarithms. I computed
P(gene | sequence) = P(sequence | gene) * P(gene) / P(sequence)
where P(sequence) and P(sequence | gene) are computed using the two Markov models. Each of them is the product of a sequence of Markov probabilities. Ignoring P(gene), which is constant, we can compute
log(P(gene|sequence)) ~ log(P(sequence | gene)) - log(P(sequence)) =
sum (over all codon triplets in the sequence) [ log(P(codon3 | codon1, codon2, gene)) - log(P(codon3 | codon1, codon2)) ]
You can think of this as adding the bits of information it would take to specify that triplet outside of a gene, and subtracting the bits it would take to specify that triplet inside a gene, leaving bits of evidence that it is in a gene.
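That sum can be sketched as follows. The interface is hypothetical (the two models here are plain dicts mapping a codon triplet to its conditional probability), with the pseudocount from above used as a floor for unseen triplets:

```python
import math

def bits_of_evidence(triplets, p_gene, p_background, floor=0.5 / 2000000):
    """Sum over codon triplets of
    log2 P(codon3 | codon1, codon2, gene) - log2 P(codon3 | codon1, codon2)."""
    total = 0.0
    for t in triplets:
        total += math.log(p_gene.get(t, floor), 2)        # bits inside a gene
        total -= math.log(p_background.get(t, floor), 2)  # bits outside a gene
    return total
```

Flooring at a small positive value keeps any single triplet's contribution finite; with a floor of zero, the first unseen triplet would drive the whole sum to minus infinity.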
If we assign P(codon3 | codon1, codon2, gene) = 0, the number of bits of information it would take to specify "codon3 | codon1, codon2" inside a gene is -log(0) = infinity. Assigning P(X) = 0 is claiming to have infinite bits of information that X is false.
Going back to the argument, the accuracy of the probabilities assigned by the Markov model is quite low, probably one to three digits of accuracy in most cases. Yet it was important to assign positive probabilities to events whose probabilities were at least seven orders of magnitude below that.
It didn't matter what probability I assigned to them! With hundreds of other bit scores to add up, changing the number of bits taken away by one highly improbable event by 10 had little impact. It just matters not to make it zero.
A program this simple? Yes.
[EDITED to add: And I did say to test it on the primes up to 100.]
[EDITED again to add ...] Here, just for reference, is what I would write, in less than a minute, to do the job. It is only intended to work for integers >= 2, and need not be especially efficient, since the integers involved are small.
Four short, simple lines corresponding quite exactly to the definition of primality.
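(The code block itself did not survive the site's formatting. A reconstruction consistent with the description — four lines of trial division, straight from the definition of primality, valid in both Python 2.7 and Python 3 — would be:)

```python
def is_prime(n):
    # Trial division: n is prime iff no i in [2, n) divides it.
    for i in range(2, n):
        if n % i == 0: return False
    return True
```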
So, what could be wrong in this program that could make it give a wrong answer for 1159? I can think of the following things. (1) I could have messed up the range of values of i to test against. (2) I could have got my variables muddled up, testing i%n instead of n%i or something of the kind. (3) I could have got my return conditions backwards, making this a not-prime test. (4) I could have messed up the control flow in some way that, e.g., does something crazy when we fall off the end of that loop. (5) I could have mixed up the % (modulus) operator with something else like division or exclusive-or. (6) I could have got myself confused about what the condition for primality actually is. (7) I could have done something else that I haven't been able to think of.
Well, most of those are very low-probability, especially given that I've listed them explicitly and checked the program. Further, it's easy to see that all of them would almost certainly fail on testing against the numbers from 2 to 100, and it seems likely that most errors in category 7 would do so too. (E.g., one can imagine messing up the test in such a way that numbers of the form prime^2 get mistakenly identified as prime, or so that it only catches primes of the form 4k+1, or something.)
And, lo and behold, I then ran is_prime on every number from 2 to 100 and printed the list of those it called prime, which has 25 entries (which I know, or think I know, is the right answer), consists entirely of numbers I believe I know to be prime, and includes every number < 100 that I believe I know to be prime.
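(A self-contained reconstruction of that check — the is_prime here is my guess at the lost code:)

```python
def is_prime(n):
    for i in range(2, n):
        if n % i == 0: return False
    return True

primes = [n for n in range(2, 100) if is_prime(n)]
print(primes)       # 25 primes: 2, 3, 5, ..., 89, 97
print(len(primes))  # 25
```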
I am very comfortable indeed reckoning this as at least 1000:1 evidence that any specific positive integer on the order of 1000 is prime iff my is_prime function says it is and gives the same result on repeated runs.
Other failure modes I haven't listed above but want to mention that I have thought of: There could be integer overflow problems. (But not for numbers this small, and not in Python unless there's a weird Python bug I haven't heard of.) There could be ridiculous errors in Python itself that make it get simple arithmetic on small numbers wrong. (But I'm running Python 2.7.3, which was released some time ago and was for some time the most commonly used Python version; I would surely have heard if it had that sort of error; in any case that class of error is very, very rare.) My computer could be messed up by cosmic rays or something. (Which is why I'd run the program multiple times on 1159.)
Or, of course, my brain could be messed up in some way that specifically interferes with my understanding of prime numbers or of what it takes to check a program's correctness. The first is already dealt with in the discussion above. The second is one of the reasons why I would also check against someone else's list of smallish prime numbers and do a number-theoretic test by hand. (And also, I think, rather improbable.)
(7): an indentation error. But I guess the interpreter will tell you that i is used out of scope. That, or you would have gotten another catastrophic result on numbers below 10. (Edit: okay, that was LessWrong screwing up leading spaces. We can cheat that with unbreakable spaces.)