Last Friday, I finally packaged up the quarterly release of JCVI's automatic prokaryote functional annotation pipeline and distributed it to the other 3 sequencing centers for the Human Microbiome Project.  It looks at genes found in newly-sequenced bacterial genomes and guesses what they do.  As often happens when I release a new version, shortly afterwards, I discovered a major bug that had been hiding in the code for years.

The program takes each new gene and runs BLAST against a database of known genes, and produces a list of identifiers of genes resembling the new genes.  It then takes these identifiers, and calls a program to look in a database for all of the synonyms for these identifiers used in different gene databases.  This lookup step takes 90% of the program's runtime.

I found that the database lookup usually failed, because most identifiers didn't match the regular expression used in the lookup program to retrieve identifiers.  Nobody had noticed this, because nobody had checked the database log files.  I fixed the program so that the database lookup would always work correctly, and re-ran the program.  It produced exactly the same output as before, but took five times as long to run.

So instead of going dancing as I'd planned, I spent Friday evening figuring out why this happened.  It turned out that the class of identifiers that failed to match the regular expression were a subset of a set of identifiers for which the lookup didn't have to be done, because some previously-cached data would give the same results.  Once I realized this, I was able to speed the program up more, by excluding more such identifiers, and avoiding the overhead of about a million subroutine calls that would eventually fail when the regular expression failed to match.  (Lest you think that the regular expression was intentionally written that way to produce that behavior, the regular expression was inside a library that was written by someone else.  Also, the previously-cached data would not have given the correct results prior to a change that I made a few months ago.)

A bug in a program is like a mutation.  Bugs in a computer program are almost always bad.  But this was a beneficial bug, which had no effect other than to make the program run much faster than it had been designed to.  I was delighted to see this proof of the central non-intuitive idea of evolution:  A random change can sometimes be beneficial, even to something as complex and brittle as a 10,000-line Perl program.

New Comment
10 comments, sorted by Click to highlight new comments since:

This is such an awesome programming story that I would like to see it expanded - more exact details of how the bug worked, with more specific code. In particular, I'd like to see this section expanded:

It turned out that the class of identifiers that failed to match the regular expression were a subset of a set of identifiers for which the lookup didn't have to be done, because some previously-cached data would give the same results.

If you really want the thrilling details...

We use a program called BLAST to compare every new gene to every known gene. To reduce the number of comparisons, we use a database called Panda that groups together known genes that produce the same protein sequence. A Panda header is a line containing an identifier and a string called an 'annotation' for each gene, separated by ^|^, like this:

SP|P49991|DNAA_MYCBO Chromosomal replication initiator protein dnaA taxon:1765 {Mycobacterium bovis;} (exp=0; wgp=-1; cg=-1; closed=-1; pub=0; rf_status= ;)^|^RF|NP_853671.1|31791178|NC_002945 chromosomal replication initiation protein taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=1; wgp=1; cg=1; closed=1; pub=1; rf_status=provisional;)^|^GB|CAD92863.1|31616763|BX248334 CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=1; wgp=1; cg=1; closed=1; pub=1; rf_status=;)^|^OMNI|NTL01MB0001||31616763| CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA taxon:233413 {Mycobacterium bovis AF2122/97;} (exp=-1; wgp=-1; cg=-1; closed=-1; pub=-1; rf_status= ;)

The entries in a Panda header are sorted by the reliability of the database that each entry came from. SwissProt (SP) identifiers come first, because SwissProt is curated, meaning a human has looked at and approved every entry in the database. RefSeq (RF) and Genbank (GB) come near the end, because they typically have annotations that were assigned by some other computer program running BLAST and transferring it from a SwissProt match. The less-reliable computer-produced databases are, naturally, the largest; so most matches are to RF or GB identifiers.

So, we only BLAST each gene against the first gene listed in each line of the Panda database. We save the 250 best matches and their Panda headers in a file called the btab.

When trying to assign a function to the new gene, the program, among other things, picks the best matches that pass various tests, and looks to see if any of those genes have known functions. It does this by looking up their identifiers in various databases.

When I inherited the program, this is what it did: Take the identifier of the matched gene, load its Panda header from the btab and store it in a variable. Then look up the Panda header for that identifier using a slow program called yank-panda, and write that over the version it just copied from the btab. Take only the first entry in the returned header, and look up that identifier in the databases.

This was silly, both because we already had the Panda header, and because the first entry in the saved header was always the same as the identifier used to call yank_panda to find the header. It was a very slow way of using an identifier to look itself up. But I realized that many of the databases we checked for function assignments listed genes under the alternate identifiers. So I had the program check all of the genes in the returned Panda header; and this resulted in getting more annotations.

Then I modified the program to just read the copy of the Panda header from the btab; and not call yank-panda at all; making the program run much, much faster. But this caused the output to change so as to miss some annotations. I discovered that, in some cases, the copy of the Panda header in the btab was truncated; and it was difficult to tell whether a header was truncated or not, since the annotation has no standard format.

So I restored the call to yank-panda, and stored the result in the variable that previously had the (possibly-truncated) header copied from the btab, to make sure to get the full header.

The call to yank-panda failed silently for 3 types of identifiers: RefSeq, GenBank, and Omnium. That's because these identifiers have 4 parts instead of 3, like so:

RF|NP_853671.1|31791178|NC_002945

These identifiers fail to match this regular expression, which is deep inside a bioinformatics library that yank_panda calls that parses identifiers:

my $search_match = qr/^([A-Za-z]{2,})|([\w.]+)?|?([\w.]+)?$/;

However, because Panda headers are sorted by database, and RefSeq (RF) and GenBank (GB) identifiers always come near the end, it's never the case both that a GB or RF identifier is the first in the header, and that the header is long enough to be truncated. So in these cases the regular expression fails to match, yank-panda returns nothing, and the program fails to notice and goes ahead and uses the header from the btab, which is the same as would have been returned by yank-panda, because headers starting with RF or GB are never truncated.

So, you could argue that this beneficial bug was possible only because the code was poorly-written in the first place. It could also be the case that the program's original author knew that RF and GB identifiers would fail when calling yank-panda, and that's why he copied the original btab header. That would make this story not very interesting. I don't think he did that, because in his version of the program there was never any reason to call yank-panda; you used only the first entry in the header, which was always present in the btab whether the header was truncated or not. But it still could be that, before that, there was a time when there was a need to call yank-panda; and some programmer years cleverly realized that the databases ordered last in Panda were the same ones that had a non-standard identifier format, and uncleverly designed that into the code without writing any comments or allowing for the possibility that identifier formats or Panda formats or databases used would change.

Tangentially, an example of, roughly speaking, a naturally occurring computer virus. There is a feature/bug in windows that causes computers to rebroadcast ad hoc wireless networks that they have connected to.

It started out using auto-connect features and network names broadcast by default by wireless routers. But these were outcompeted by ones that require human intervention to connect, with the name "Free Public WiFi."

That is a brilliant story.

10,000-line Perl program.

Ouch.

It's nice to see some programming related content on LW. Thanks.

I shouldn't need to tell you that the analog of a locally-beneficial bug in human thinking was discussed here already :)

Cool story.

The site of ours, may be in context here.

http://critticall.com/

Huh. This sounds like a good time to update to the latest version. I haven't done that in... a couple of years... damn availability heuristic...

[-][anonymous]-10

test

[-][anonymous]00

test