Nice example of how using a probability of exactly zero can screw you over. Two observations.
Could have done with a link to Eliezer's "0 and 1 are not probabilities" from back in 2008.
I say you can get to 99.99% confidence that 1159 is prime (if it actually is; I haven't checked); probably 99.9999%. Suppose you (a) write a program to check all possible divisors, test that it gives the right answers for everything up to 100, and run it multiple times in case of cosmic rays; (b) look it up in a table of prime numbers; (c) apply, by hand, one of the fancy number-theoretical primality tests (most of these are probabilistic -- but again you can find statements of the form "If a number less than 10^12 passes these specific tests and isn't one of the following short list of exceptions, it is prime"). Then I reckon that apart from theories where what's wrong is your brain a,b,c are clearly extremely close to independent; (a) has well less than 0.001 chance of failure, (b) well less than 0.01, and (c) well less than 0.1; so the only hypothesis you need to consider that might take the probability above 10^-6 is that you're delusional in some way that specifically messes up your ability to tell whether 1159 is prime. Now (d) this is surely extraordinarily rare -- delusions in general aren't so rare, but this is something very specific and weird; and (e) if your mind is that badly screwed up then attempting to work with probabilities is probably kinda meaningless anyway. (Theorems like Cox's say that you should use probabilities as measures of confidence, and compute with them in the standard ways, if you are a rational agent. If you are known to be a catastrophically irrational agent, then what probabilities you assign is probably not the greatest of your worries.)
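For concreteness, the arithmetic behind that 10^-6 figure, treating the three checks as independent (a two-line sketch using the failure bounds stated above):

p_a, p_b, p_c = 1e-3, 1e-2, 1e-1   # stated upper bounds on each check failing
print(p_a * p_b * p_c)             # 1e-06: the chance all three fail together, if independent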
Can you really be sure that a program that you write has at least a 99.9% chance of being correct without performing moderately extensive testing? Personally, I'd probably put significantly more confidence in (b).
A program this simple? Yes.
[EDITED to add: And I did say to test it on the primes up to 100.]
[EDITED again to add ...] Here, just for reference, is what I would write, in less than a minute, to do the job. It is only intended to work for integers >= 2, and need not be bearably efficient for integers that aren't very small.
def is_prime(n):
    for i in range(2,n):
        if n%i == 0: return False
    return True
Four short, simple lines corresponding quite exactly to the definition of primality.
So, what could be wrong in this program that could make it give a wrong answer for 1159? I can think of the following things. (1) I could have messed up the range of values of i to test against. (2) I could have got my variables muddled up, testing i%n instead of n%i or something of the kind. (3) I could have got my return conditions backwards, making this a not-prime test. (4) I could have messed up the control flow in some way that, e.g., does something crazy when we fall off the end of that loop. (5) I could have mixed up the % (modulus) operator with something else like division or exclusive-or. (6) I could have got myself confused about what the condition for primality actually is. (7) I could have done something else that I haven't been able to think of.
Well, most of those are very low-probability, especially given that I've listed them explicitly and checked the program. Further, it's easy to see that all of them would almost certainly fail on testing against the numbers from 2 to 100, and it seems likely that most errors in category 7 would do so too. (E.g., one can imagine messing up the test in such a way that numbers of the form prime^2 get mistakenly identified as prime, or so that it only catches primes of the form 4k+1, or something.)
And, lo and behold, I then did
print [p for p in range(2,100) if is_prime(p)]
and got
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
which has 25 entries (which I know, or think I know, is the right answer), consists entirely of numbers I believe I know to be prime, and includes every number < 100 that I believe I know to be prime.
I am very comfortable indeed reckoning this as at least 1000:1 evidence that any specific positive integer on the order of 1000 is prime iff my is_prime function says it is and gives the same result on repeated runs.
Other failure modes I haven't listed above but want to mention that I have thought of: There could be integer overflow problems. (But not for numbers this small, and not in Python unless there's a weird Python bug I haven't heard of.) There could be ridiculous errors in Python itself that make it get simple arithmetic on small numbers wrong. (But I'm running Python 2.7.3, which was released some time ago and was for some time the most commonly used Python version; I would surely have heard if it had that sort of error; in any case that class of error is very, very rare.) My computer could be messed up by cosmic rays or something. (Which is why I'd run the program multiple times on 1159.)
Or, of course, my brain could be messed up in some way that specifically interferes with my understanding of prime numbers or of what it takes to check a program's correctness. The first is already dealt with in the discussion above. The second is one of the reasons why I would also check against someone else's list of smallish prime numbers and do a number-theoretic test by hand. (And also, I think, rather improbable.)
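For concreteness, here is how one of the hypothetical slips listed above -- item (2), testing i%n instead of n%i -- would show up instantly against the numbers from 2 to 100 (a deliberately buggy variant, written only to illustrate):

def is_prime_swapped(n):
    # Deliberately wrong: tests i % n instead of n % i.
    for i in range(2, n):
        if i % n == 0: return False
    return True

print([p for p in range(2, 100) if is_prime_swapped(p)][:10])
# [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] -- every number looks "prime", so the test catches it at once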
(7): indentation error. But I guess the interpreter will tell you i is used out of scope. That, or you would have gotten another catastrophic result on numbers below 10.
def is_prime(n):
for i in range(2,n):
if n%i == 0: return False
return True
(Edit: okay, that was LessWrong screwing up leading spaces. We can cheat that with unbreakable spaces.)
indentation error
Yes, good point. I'd generalize: "I could have committed some syntax error -- wrong indentation, wrong sort of brackets somewhere, etc. -- that just happens to leave a program with correct syntax but unintended semantics."
I guess the interpreter will tell you i is used out of scope.
Depends on exactly what error occurs. (With the formatting the way it actually looks here it would just be a syntax error.)
That, or you would have gotten another catastrophic result on numbers below 10.
This (aside from the simplicity of the code) is the real reason why I'm still happy with an error probability less than 0.001. It takes a really odd sort of error to produce correct results for the first 100 candidate primes and wrong results thereafter. I could write some programs with that sort of error, but they wouldn't look at all like naive primality checkers with simple typos.
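Just to make that concrete, here is the kind of odd error I have in mind (a deliberately pathological checker, written only for illustration): it only tries divisors up to 10, so it is right for every number from 2 to 120 but wrong for 121 and, as it happens, for 1159.

def is_prime_sneaky(n):
    # Only tries divisors up to 10: correct for 2 <= n <= 120,
    # but wrongly calls 121 (= 11*11) and 1159 (= 19*61) prime.
    for i in range(2, 11):
        if n != i and n % i == 0:
            return False
    return True

print([p for p in range(2, 100) if is_prime_sneaky(p)])   # the correct 25 primes
print(is_prime_sneaky(121), is_prime_sneaky(1159))        # True, True -- both wrong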
OK. I agree. Checking the output on numbers less than 100 vastly decreases the probability of error, especially after enumerating simple errors and realizing that none of them would return anything remotely correct on those numbers.
Well, but you can (a) perform moderately extensive testing, and (b) use redundancy.
If you write 3 programs for verifying primeness (using different algorithms and possibly programming languages/approaches); and if all their results match, then you can assume a much higher confidence in correctness than for a single such program.
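A minimal sketch of that kind of redundancy (three different algorithms -- plain trial division, a 6k±1 loop, and a Fermat probable-prime test -- applied to the same number; the function names and test choices here are mine, purely for illustration):

import random

def trial_division(n):
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

def six_k_plus_minus_one(n):
    if n < 2: return False
    if n < 4: return True                     # 2 and 3
    if n % 2 == 0 or n % 3 == 0: return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def fermat_probably_prime(n, rounds=20):
    # Probabilistic: some composites can slip through, so this is only a cross-check.
    if n < 4: return n in (2, 3)
    return all(pow(random.randrange(2, n - 1), n - 1, n) == 1 for _ in range(rounds))

n = 1159
print(trial_division(n), six_k_plus_minus_one(n), fermat_probably_prime(n))
# All three say False: 1159 = 19 * 61 is composite.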
I can take a stab at it (in python):
def isprime(n):
    if n == 1 or n % 2 == 0: return False
    i = 3
    while i*i <= n:
        if n % i == 0: return False
        i += 2
    return True

print(all(map(isprime, [3,5,7,11,13,17,23,53])))
print(isprime(1159))
(apparently LW's markdown implementation butchers whitespace in code)
When I run that, it tells me that 1159 isn't prime. This program is simple enough that I would, after spending an hour or so at my mental peak proving it correct, trust it with very high stakes.
And... You'd lose the bet. That program does have a mistake in it. See if you can find it.
(Answer: Gjb vf cevzr, ohg lbhe shapgvba pnyyf vg pbzcbfvgr.)
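(Decoded and demonstrated -- a quick run of the function above makes the mistake visible:)

print(isprime(2))      # False -- but 2 is prime; the n % 2 == 0 branch fires first
print(isprime(1159))   # False -- which happens to be right, since 1159 = 19 * 61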
Good catch, but I bet MagnetoHydroDynamics would not lose the bet "after spending an hour or so, at [their] mental peak condition, proving it correct".
True. It's not a perfect illustration. But it's a good, ironic example of how the point that MagnetoHydroDynamics was trying to make can have a pretty serious flaw.
Meh. It was a special case he already knew to call out for special treatment. Would have been caught VERY quickly in testing, just like the same sort of error occurring in the general case (hmm. the primes are 4, 6, 8, 9, 10, 12? nooo...)
99.99% confidence implies that he could write TEN THOUSAND programs of similar difficulty (with hours spent verifying each at peak mental condition) and make only ONE mistake.
As someone who's done quite a bit of programming and reading about programming, I'd be impressed if he made it past a hundred without making two mistakes.
The industry average is 15-50 errors / 1000 lines of code. The amount of effort required to get below 0.1 errors/kLoC, like they do for the space shuttle, is very, very large. One person checking their own work for an hour doesn't stand a chance.
That's what I meant, though I can see that it's not clear.
In this case a mistake would be writing the code, iterating and testing until you were satisfied, pronouncing it done, and then afterwards catching the error.
Hmm. I can definitely buy that a program of more complexity than this - less easily checked - would have that accuracy rate.
But prime checking is super simple to write and super simple to check. The only way you'd get an error through the obvious testing scheme given is to skip the testing.
You're taking 100 bits of testing (which contain around 80 bits of information if not produced by means of the actual pattern) and treating them as around 13 bits of reliability.
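For reference, the arithmetic behind those figures as I read them (an interpretation, not the commenter's own working):

import math

# 99.99% confidence corresponds to about log2(10**4) ~ 13.3 bits.
print(math.log(10**4, 2))

# 100 yes/no test outcomes are at most 100 bits; if roughly a quarter of the
# numbers below 100 are prime, each outcome carries about H(0.25) ~ 0.81 bits,
# i.e. around 81 bits in total.
h = lambda p: -p * math.log(p, 2) - (1 - p) * math.log(1 - p, 2)
print(100 * h(0.25))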
My experience with coding is that stupid obvious mistakes are way more likely than 1/10000. You write something slightly wrong, keep reading it as if it were right, and that's that.
Determining if a number is prime is a bit of a nice case, I suppose, because it's so amenable to testing. The structure of the mistakes you make is unlikely to match the structure of primes, so you'll catch any mistakes more easily.
I'd still consider doing it 10000 times to be extremely difficult. Just adding 10000 six-digit numbers by hand, even with some cross-checking, is quite difficult.
Yes, stupid coding mistakes are more like 1 in 2 than 1 in 10^4; it is the testing that helps here.
Overall the space shuttle software is definitely more complicated than confirming a small prime, but on a line by line basis I don't know. The space shuttle is more of a special case example; not something you compare against.
I'd use the industry average rate until you showed me you could hit 0.1 errors/kloc for a few years. For example, Microsoft hits 10-20 defects/kloc and they put significant effort into testing.
The most likely reason that your bug rate on these programs would be anomalously low is the fact that they're small. Instead of writing one hundred thousand line program, you're writing 10000 ten line programs. The complexities won't compound as much.
The probability of a mistake in a program is not the same as the probability of being wrong about 1159 being composite. At a high reliability level, you factor the number, then you multiply the factors back together and check that as well.
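For example, a hypothetical spot check along those lines:

def factorize(n):
    # Trial-division factorization; fine for small n.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

fs = factorize(1159)
print(fs)                                        # [19, 61]
product = 1
for f in fs:
    product *= f
print(product == 1159)                           # the factors multiply back up
print(all(factorize(f) == [f] for f in fs))      # each factor is itself prime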
apart from theories where what's wrong is your brain
(just amused by the possibility)
Also, it is possible that Peano arithmetic isn't consistent; if so, either the very concept of 'primality' doesn't make any sense, or it can just mess up the primality tests which were used in the creation of (b) and (c), and the connection between "1159 is prime" and "this program outputs True and halts" as well.
Of course, it screws up any application of Cox's theorem here, even worse than in delusion case.
But can you be 99.99% confident that 1159 is a prime?
This doesn't affect the thrust of the post, but 1159 is not prime. Its prime factors are 19 and 61.
I was about 80% sure that 1159 was not prime, based on reading that sentence. It took me <1 minute to confirm this. I can totally be more than 99.99% sure of the primality of any given four-digit number.
In fact, those odds suggest that I'd expect to make one mistake with probability >0.5 if I were to go through a list of all the numbers below 10,000 and classify them as prime or not prime. I think this is ridiculous. I'm quite willing to take a bet at 100 to 1 odds that I can produce an exhaustive list of all the prime numbers below 1,000,000 (which contains no composite numbers), if anyone's willing to stump up at least $10 for the other side of the bet.
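For the record, a sieve of Eratosthenes makes that list cheap to produce and cross-check (this sketch is mine, not part of the bet):

def primes_below(limit):
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p*p::p] = bytearray(len(sieve[p*p::p]))
    return [i for i in range(limit) if sieve[i]]

primes = primes_below(10**6)
print(len(primes))     # 78498 primes below one million
print(primes[:10])     # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]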
You can only simply exponentiate the chance of success if it doesn't correlate over multiple repetitions. I would say that if the list of primes below 10^6 you were referencing has at least one error in the first 10^5, it would be more likely to be faulty later, and vice versa, which means that your gut estimates on the two scales might be noncontradictory.
Right. Say you write code to generate primes. If there's no bug, all your answers are correct. If there's a bug in your code, probably lots of your answers are incorrect.
Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.
This immediately struck me as so very wrong. The worse you can measure, the more events you feel justified assigning zero probability?
It's strange that such claims make it out into the wild. Maybe there was additional context that made it make sense for someone once upon a time, and they generalized to an ill fitting context.
My own response to this is "P(X) = 0 is made up, too. If I want to avoid using a made-up value, I should stop thinking about P(X) altogether. Alternatively, if I want to think about P(X), I should assign it a made-up value that works well. In many contexts epsilon works far better than zero."
I think Phil's experience suggests a reasonable way to attack these problems.
Do the analysis for a series of epsilons, starting with your measurement delta and working down, and see if it makes any difference to the results.
Also, in one of Jaynes' paper on the marginalization "paradox" he suggested that ignoring the existence of a variable gives you a different result than applying an ignorance prior (showing yet again that the term "ignorance prior" is a stupefying oxymoron).
Re: ignoring the existence of a variable... yes, absolutely. I meant in the more sweeping, less useful sense of "go work on a different problem altogether, or perhaps just have a beer and watch telly."
Re: seeing how results vary for different possible values of an unknown variable... yup, agreed.
This immediately struck me as so very wrong. The worse you can measure, the more events you feel justified assigning zero probability?
Assigning more events zero probability leaves you worse off compared to someone who makes accurate estimates, but it doesn't necessarily leave you worse off compared to someone else who measures as poorly as you and makes correspondingly poor estimates.
It's a way of mitigating the damage from not being able to measure well. You're still worse off than a person who can measure well; you're just not as badly off.
No. If you'd only ever seen TAAGCC, period, you would NOT have any sort of license to completely rule out the possibility of anything else. Indeed, the probabilities should be nearly even with a little more weight given to that particular observation.
Applying the Sunrise formula seems appropriate here.
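(That is, Laplace's rule of succession: after seeing an outcome k times in n trials, estimate (k + 1) / (n + 2) rather than k / n, so something never yet observed gets a small positive probability instead of zero. A toy version of the calculation, using the post's roughly 2 million training triplets as the trial count:)

k, n = 0, 2 * 10**6              # an outcome never seen in ~2 million observations
print((k + 1) / float(n + 2))    # ~5e-07: small, but not zero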
Another example where simple probability assignment fails is from "Reasoning with Limited Resources and Assigning Probabilities to Arithmetical Statements" by Haim Gaifman:( http://www.columbia.edu/~hg17/RLR.pdf )
A coin is tossed 50 times. Jack analyzes the sequence of 0’s (heads) and 1’s (tails) and concludes that the sequence is random, with nearly equal probabilities for each side. Then Jill informs him that the coin contains a tiny mechanism, controlled by a microchip, which shifts its gravity center so as to produce a pseudo random sequence–– one that passes the standard randomness tests but is in fact defined by a mathematical algorithm. [...] Jill gives Jack the algorithm and the outcomes of additional 50 tosses fully accord with the calculated predictions. Let h be the hypothesis that the outcome sequence coincides with the sequences produced by the algorithm. Let the evidence be e. Given e, it would be irrational of Jack not to accord h extremely high probability. Had P(h)–– Jack’s prior subjective probability for h––been non-zero, conditionalization on e could have pushed it up. But if P(h) = 0, then, P(h|e) = 0, for every e for which P(e) > 0. Now Jack’s prior probability accommodates the possibility of independent tosses with non-zero probabilities for ‘0’ and ‘1’. Therefore P( ) assigns each finite sequence a nonzero value, hence P(e) > 0. If P(h) = 0, then Jack cannot learn from e what he should, unless he changes probabilities in a way that violates the Bayesian prescription.
And it explicitly questions the modeling of 'unknown' possibilites as 0:
Jack has never considered the possibility of a coin that produces, deterministically, a mathematically defined sequence; hence, if the prior probability is to reflect anything like Jack’s psychological state, P(h) = 0. But could he, in principle, have made place for such possibilities in advance?
Actually "could he, in principle, have made place for such possibilities in advance?" is very, very excellent question.
We can allocate for such possibilities in advance. For example, we can use a simple statistical model for the limitations of our own understanding of reality - I have a certain number of years' experience in making judgements and assumptions about reality; I know that I don't consider all possible explanations, and I can estimate that in x% of cases the 'true' explanation was one that I hadn't considered. So I can make a 'belief budget' for the 'other' category. For example, any question like 'will the coin fall heads or tails' has to include an 'other' option. It may fall on its side.
A great example is the quotation "One of these days in your travels, a guy is going to show you a brand-new deck of cards on which the seal is not yet broken. Then this guy is going to offer to bet you that he can make the jack of spades jump out of this brand-new deck of cards and squirt cider in your ear. But, son, do not accept this bet, because as sure as you stand there, you're going to wind up with an ear full of cider."
If you want to handle reality, you have to model the probability of 'jack of spades jumping out of this brand-new deck of cards and squirting cider in your ear' as non-zero. 10^-99 might be ok, but not 0.
I want to emphasize the importance of not assigning things probability zero or one.
Unless you are dealing with an officer of the law. Answers like, "Yes, I'm 99% certain X" do not go over well with them.
Also, I'm sure Godel would have enjoyed your post. Or at least 99% sure.
Regarding 53 being a prime, you can't just do some naive Bayesian mathematics on some real valued probability of 53 being a prime because that probability would be heavily correlated with uncertainty about every other number and every other mathematical operation, including conversion from decimal notation to other representation, internally done mathematics in visual character recognition, and so on. Such probability is entirely meaningless. One could speak of probability of (53 is a prime | mathematics is not completely wrong), that would be a number very close to 1 (due to repeat testing).
I know this is not your main topic, but are you familiar with Good-Turing estimation? It's a way of assigning non-arbitrary probability to unobserved events.
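A minimal sketch of the simplest Good-Turing idea (the total probability mass reserved for never-seen events is estimated as N1/N, where N1 is the number of event types seen exactly once and N is the total number of observations):

from collections import Counter

def unseen_mass(observations):
    counts = Counter(observations)
    n = sum(counts.values())                        # total observations
    n1 = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return float(n1) / n

# toy codon-triplet data, invented for illustration
triplets = ["ATG", "GCC", "ATG", "TAA", "GCC", "ATG", "CGT"]
print(unseen_mass(triplets))   # 2/7 of the mass is reserved for unseen triplets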
Interesting discussion but I suspect an important distinction may be required between logic and probability theory. Logic is a special case of probability theory where values are restricted to only 0 and 1, that is to 0% and 100% probability. Within logic you may arrive at certain conclusions but generally within probability theory conclusions are not certain but rather assigned a degree of plausibility.
If logic provides, in some contexts, a valid method of reasoning then conclusions arrived at will be either 0% or 100% true. Denying that 100% confidence is ever rational seems to be equivalent to denying that logic ever applies to anything.
It is certainly true that many phenomena are better described by probability than by logic, but can we deny logic any validity? I understand mathematical proofs as being within the realm of logic, where things may often be determined to be either true or false. For instance, Euclid is credited with first proving that there is no largest prime. I believe most mathematicians accept this as a true statement, and that most would agree that 53 is easily proven to be prime.
Denying that 100% confidence is ever rational seems to be equivalent to denying that logic ever applies to anything.
It's just saying that logic is a model that can't describe anything in the real world fully literally. That doesn't mean it's not useful. Abstracting away irrelevant details is bread and butter reductionism.
Yes I agree, there is only a rough isomorphism between the mathematics of binary logic and the real world; binary logic seems to describe a limit that reality approaches but never reaches.
We should consider that the mathematics of binary logic are the limiting case of probability theory; it is probability theory where the probabilities may only take the values of 0 or 1. Probability theory can do everything that logic can but it can also handle those real world cases where the probability of knowing something is something other than 0 or 1, as is the usual case with scientific knowledge.
When you prove something in mathematics, at very least you implicitly assume you have made no mistakes anywhere, are not hallucinating, etc. Your "real" subjective degree of belief in some mathematical proposition, on the other hand, must take all these things into account.
For practical purposes the probability of hallucinations etc. may be very small and so you can usually ignore them. But the OP is right to demonstrate that in some cases this is a bad approximation to make.
Deductive logic is just the special limiting case of probability theory where you allow yourself the luxury of an idealised box of thought isolated from "real world" small probabilities.
edit: Perhaps I could say it a different way. It may be reasonable for certain conditional probabilities to be zero or one, so long as they are conditioned on enough assumptions, e.g. P("53 is a prime" given "I did my math correctly, I am not hallucinating, the external world is real, etc...")=1 might be achievable. But if you try to remove the conditional on all that other stuff, you cannot keep this certainty.
Can you give me an example of a proposition arrived at by what you're calling "logic" here that corresponds to an expected observation in which you have 0% or 100% confidence?
"Tautologies are tautological" is the statement for which I am most certain.
Certainly, if the fate of humanity was the consequence of a false positive and a slap on my wrist was the consequence of a false negative for an affirmative answer to "tautologies are tautological", I would give a negative answer... so by that litmus test, I don't really have 100% confidence. But I still have a strong logical intuition that I ought to be that confident, and that my emotional underconfidence in the proposed scenario is a bit silly.
The problem is, the above hypothetical scenario involves weird and self contradictory stuff (an entity which knows the right answer, my 100% certainty that the entity knows the right answer and can interpret my answer, etc). I think it may be best to restrict probability estimates to empirical matters (I'm 99.99% certain that my calculator will output "2" in response to the input "1+1=") and keep it away from pure logic realms.
+1 for the thought experiment test. I haven't seen that one before; any source or is it something you came up with?
thanks! the specific thought experiment is something I came up with but the general notion of using hypothetical scenarios to gauge one's true certainty is something that I think many people have done at one time or another, though I can't think of a specific example.
One example in classical logic is the syllogism where if the premises are true then the conclusion is by necessity true:
Socrates is a man
All men are mortal
therefore it is true that Socrates is mortal
Another example is mathematical proofs. Here is the Wikipedia presentation of Euclid's proof from around 300 BC that there are infinitely many prime numbers. Perhaps in your terms this proof provides 0% confidence that we will observe the largest prime number.
Take any finite list of prime numbers p1, p2, ..., pn. It will be shown that at least one additional prime number not in this list exists. Let P be the product of all the prime numbers in the list: P = p1p2...pn. Let q = P + 1. Then, q is either prime or not:
1) If q is prime then there is at least one more prime than is listed.
2) If q is not prime then some prime factor p divides q. If this factor p were on our list, then it would divide P (since P is the product of every number on the list); but as we know, p divides P + 1 = q. If p divides P and q then p would have to divide the difference of the two numbers, which is (P + 1) − P or just 1. But no prime number divides 1 so there would be a contradiction, and therefore p cannot be on the list. This means at least one more prime number exists beyond those in the list.
This proves that for every finite list of prime numbers, there is a prime number not on the list. Therefore there must be infinitely many prime numbers.
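For concreteness, the construction in the proof can be run on a small example (the code is mine, only to illustrate the argument):

def smallest_prime_factor(m):
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d
        d += 1
    return m

known_primes = [2, 3, 5, 7, 11, 13]
P = 1
for p in known_primes:
    P *= p
q = P + 1                              # 30031, which happens not to be prime itself
f = smallest_prime_factor(q)
print(q, f, f in known_primes)         # 30031 59 False: a prime missing from the list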
The "Socrates is mortal" one is a good example because nowadays its conclusion has a probability of less than one.
I would be interested if you would care to elaborate a little. Syllogisms have been a mainstay of philosophy for over two millennia, and undoubtedly I have a lot to learn about them.
In my admittedly limited understanding of syllogisms, the conclusion is true given that the premises are true. Truth is more in the structure of the argument than in its conclusion. If Socrates is not mortal, then either he is not a man or not all men are mortal.
Perhaps In your terms this proof provides 0% confidence that we will observe the largest prime number.
Sure, I'm willing to consider that a prediction about the numbers that correspond to observable phenomena.
And you're asserting that the chance that Euclid was wrong about the properties of the numbers we observe is not just vanishingly small, but in fact zero, such that no amount of observed evidence could possibly properly change our minds about it.
Yes?
I have some skepticism about absolute certainty. Logic deals in certainties but it seems unclear if it absolutely describes anything in the real world. I am not sure if observed evidence plays a role in logic. If all men are mortal and if Socrates is a man then Socrates is mortal appears to be true. If we were to observe Socrates being immortal the syllogism would still be true but one of the conditional premises that all men are mortal or that Socrates is a man would not be true.
In science at least where evidence plays a decisive role there is no certainty; scientific theories must be falsifiable, there is always some possibility that an experimental result will not agree with theory.
The examples I gave are true by virtue of logical relationships such as if all A are B and all B are C then all A are C. In this vein it might seem certain that if something is here it cannot be there, however this is not true for quantum systems; due to superposition a quantum entity can be said to be both here and there.
Another interesting approach to this problem was taken by David Deutsch. He considers that any mathematical proof is a form of calculation and all calculation is physical just as all information has a physical form. Thus mathematical proofs are no more certain than the physical laws invoked to calculate them. All mathematical proofs require our mathematical intuition, the intuition that one step of the proof follows logically from the other. Undoubtedly such intuition is the result of our long evolutionary history that has built knowledge of how the world works into our brains. Although these intuitions are formed from principles encoded in our genetics they are no more reliable than any other hypothesis supported by the data; they are not certain.
It may be interesting that although all measurable results in quantum theory are in the form of probabilities there is at least one instance where this theory predicts a certain result. If the same measurement is immediately made a second time on a quantum system the second result will be the same as the first with probability 1. In other words the state of the quantum system revealed by the first measurement is confirmed by the second measurement. It may seem odd that the theory predicts the result of the first measurement as a probability distribution of possible results but predicts only a single possible result for the second measurement.
Wojciech Zurek considers this a postulate of quantum theory (see his paper on quantum Darwinism). (Sorry for the typo in the quote.)
If we consider that information exchange took place between the quantum system and the measuring device in the first measurement then we might view the probability distribution implied by the wave function as having undergone a Bayesian update on the receipt of new information. We might understand that this new information moved the quantum model to predictive certainty regarding the result of the second measurement.
Of course this certainty is only certain within the terms of quantum theory which is itself falsifiable.
I fail to discern your point, here; sorry. Specifically, I don't see what makes this more interesting in context than my expectation, within the limits of precision and reliability of my measuring device, that if I (e.g.) measure the mass of a macroscopic object twice I'll get the same result.
Yes, good point. Classical physics, dealing with macroscopic objects, predicts definite (non-probabilistic) measurement outcomes for both the first and second measurements.
The point I was (poorly) aiming at is that while quantum theory is inherently probabilistic even it sometimes predicts specific results as certainties.
I guess the important point for me is that while theories may predict certainties they are always falsifiable; the theory itself may be wrong.
Ah, I see. Yes, exactly... the theory may be wrong, or we have made a mistake in applying it or interpreting it, etc.
Eliezer wrote a post warning against unrealistically confident estimates, in which he argued that you can't be 99.99% sure that 53 is prime. Chris Hallquist replied with a post arguing that you can.
That particular case is tricky. There have been many independent calculations of the first hundred prime numbers. 53 is a small enough number that I think someone would notice if Wikipedia included it erroneously. But can you be 99.99% confident that 1159 is a prime? You found it in one particular source. Can you trust that source? It's large enough that no one would notice if it were wrong. You could try to verify it, but if I write a Perl or C++ program, I can't even be 99.9% sure that the compiler or interpreter will interpret it correctly, let alone that the program is correct.
Rather than argue over the number of nines to use for a specific case, I want to emphasize the importance of not assigning things probability zero or one. Here's a real case where approximating 99.9999% confidence as 100% had disastrous consequences.
I developed a new gene-caller for JCVI. Genes are interpreted in units of 3 DNA nucleotides called codons. A bacterial gene starts with a start codon (usually ATG, TTG, or GTG) and ends at the first stop codon (usually TAG, TGA, or TAA). Most such sequences are not genes. A gene-caller is a computer program that takes a DNA sequence and guesses which of them are genes.
The first thing I tried was to create a second-order Markov model on codons, and train it on all of the large possible genes in the genome. (Long sequences without stop codons are unlikely to occur by chance and are probably genes.) That means that you set P = 1 and go down the sequence of each large possible gene, codon by codon, multiplying P by the probability of seeing each of the 64 possible codons in the third position given the codons in the first and second positions. Then I created a second Markov model from the entire genome. This took about one day to write, and plugging these two models into Bayes' law as shown below turned out to work better than all the other single-method gene-prediction algorithms developed over the past 30 years.
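A rough sketch of the counting behind such a model (toy data and names are mine, not the JCVI code):

from collections import Counter

def codons(seq):
    # Split a DNA string into consecutive, non-overlapping codons (triplets).
    return [seq[i:i+3] for i in range(0, len(seq) - 2, 3)]

def codon_triplet_counts(training_seqs):
    # Counts of (codon1, codon2, codon3) contexts: the raw statistics behind a
    # second-order Markov model P(codon3 | codon1, codon2).
    counts = Counter()
    for seq in training_seqs:
        cs = codons(seq)
        counts.update(zip(cs, cs[1:], cs[2:]))
    return counts

counts = codon_triplet_counts(["ATGGCCGCCATTGCCTAA"])   # one toy "long possible gene"
print(len(counts), "of", 64 ** 3, "possible codon triplets seen")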
But what probability should you assign to a codon sequence that you've never seen? A bacterial genome might have 4 million base pairs, about half of which are in long possible genes and will be used for training. That means your training data for one genome has about 2 million codon triplets. Surprisingly, a little less than half of all possible codon triplets do not occur at all in that data (DNA sequences are not random). What probability do you assign to an event that occurs zero times out of 2 million?
This came up recently in an online argument. Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.
Well, I'd already empirically determined whether that was true for the gene caller. First, due to a coding error, I assigned such events P(X) = 1 / (64^3 * size of training set), which is too small by about 64^3. Next I tried P(X) = 0.5 / (size of training set), which is approximately correct. Finally I tried P(X) = 0. I tested the results on genomes where I had strong evidence for what were and were not genes.
How well do you think each P(X) worked?
The two non-zero probabilities gave nearly the same results, despite differing by 6 orders of magnitude. But using P(X) = 0 caused the gene-caller to miss hundreds of genes per genome, which is a disastrous result. Why?
Any particular codon triplet that was never found in the training set would have a prior of less than one in 4 million. But because a large number of triplets are in genes outside the training set, that meant some of those triplets (not most, but about a thousand of them) had true priors of nearly one half of being found somewhere in those genes. (You can work it out in more detail by assuming a Zipf law distribution of priors, but I won't get into that.)
So some of them did occur within genes in that genome, and each time one did, its assigned probability of zero annihilated all the hundreds of other pieces of evidence for the existence of that gene, making the gene impossible to detect.
You can think of this using logarithms. I computed

P(gene | sequence) = P(sequence | gene) * P(gene) / P(sequence)

where P(sequence) and P(sequence | gene) are computed using the two Markov models. Each of them is the product of a sequence of Markov probabilities. Ignoring P(gene), which is constant, we can compute

log2 P(sequence | gene) - log2 P(sequence) = sum over codon triplets of [ log2 P(codon3 | codon1, codon2, gene) - log2 P(codon3 | codon1, codon2) ]
You can think of this as adding the bits of information it would take to specify that triplet outside of a gene, and subtracting the bits of information it would take to specify that information inside a gene, leaving bits of evidence that it is in a gene.
If we assign P(codon3 | codon1, codon2, gene) = 0, the number of bits of information it would take to specify "codon3 | codon1, codon2" inside a gene is -log2(0) = infinity. Assigning P(X) = 0 is claiming to have infinite bits of information that X is false.
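A toy illustration of that annihilation (all numbers invented for the example):

# Each codon triplet in a real gene contributes a modest likelihood ratio
# P(triplet | gene) / P(triplet); hundreds of them multiply into strong evidence.
ratios = [2.0] * 300
score = 1.0
for r in ratios:
    score *= r
print(score)                 # astronomically strong evidence for "gene"

# Now one never-seen triplet is assigned P(triplet | gene) = 0:
score *= 0.0 / 0.01
print(score)                 # 0.0 -- a single zero erases every other piece of evidence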
Going back to the argument, the accuracy of the probabilities assigned by the Markov model is quite low, probably one to three digits of accuracy in most cases. Yet it was important to assign positive probabilities to events whose probabilities were at least seven orders of magnitude below that.
It didn't matter what probability I assigned to them! Given hundreds of other bit scores to add up, changing the number of bits taken away by one highly improbable event by 10 had little impact. It just matters not to make it zero.