Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
Last Friday, I finally packaged up the quarterly release of JCVI's automatic prokaryote functional annotation pipeline and distributed it to the three other sequencing centers for the Human Microbiome Project. It looks at genes found in newly-sequenced bacterial genomes and guesses what they do. As often happens when I release a new version, I discovered shortly afterwards a major bug that had been hiding in the code for years.
The program takes each new gene, runs BLAST against a database of known genes, and produces a list of identifiers of genes resembling the new gene. It then takes these identifiers and calls a program to look in a database for all of the synonyms for these identifiers used in different gene databases. This lookup step takes 90% of the program's runtime.
I found that the database lookup usually failed, because most identifiers didn't match the regular expression used in the lookup program to retrieve identifiers. Nobody had noticed this, because nobody had checked the database log files. I fixed the program so that the database lookup would always work correctly, and re-ran the program. It produced exactly the same output as before, but took five times as long to run.
So instead of going dancing as I'd planned, I spent Friday evening figuring out why this happened. It turned out that the class of identifiers that failed to match the regular expression was a subset of a set of identifiers for which the lookup didn't have to be done, because some previously-cached data would give the same results. Once I realized this, I was able to speed the program up further, by excluding more such identifiers and by avoiding the overhead of about a million subroutine calls that would eventually fail when the regular expression failed to match. (Lest you think that the regular expression was intentionally written that way to produce that behavior, the regular expression was inside a library that was written by someone else. Also, the previously-cached data would not have given the correct results prior to a change that I made a few months ago.)
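For readers who want to see the shape of the accident, here is a toy sketch in Python rather than the pipeline's Perl. Everything in it is made up for illustration: the pattern, the identifiers, the synonym table, and the names `SynonymLookup` and `annotate` are assumptions, not the real code or the real library. It only shows how a lookup that silently refuses most identifiers can give the same answers as one that works, as long as a cache covers everything it refuses.

```python
import re

# Hypothetical pattern standing in for the library's regular expression,
# which the post does not show.
STRICT_PATTERN = re.compile(r"^NP_\d+$")

# Hypothetical synonym database and previously cached data. The cache holds
# the same answers as the database for every identifier it covers.
SYNONYM_DB = {
    "gi|101": ["SP:A1"],
    "gi|102": ["SP:A2"],
    "gi|103": ["SP:A3"],
    "NP_0001": ["SP:B1"],
    "NP_0002": ["SP:B2"],
}
CACHE = {k: v for k, v in SYNONYM_DB.items() if k != "NP_0001"}

IDS = ["gi|101", "gi|102", "NP_0001", "gi|103", "NP_0002"]


class SynonymLookup:
    """Toy lookup that counts subroutine calls and real database queries."""

    def __init__(self, db, pattern=None):
        self.db = db
        self.pattern = pattern  # None means "accept every identifier"
        self.calls = 0
        self.db_queries = 0

    def synonyms(self, identifier):
        self.calls += 1
        # The "bug": identifiers that fail the pattern are dropped silently.
        if self.pattern is not None and not self.pattern.match(identifier):
            return None
        self.db_queries += 1  # this is the expensive step
        return self.db.get(identifier)


def annotate(identifiers, lookup, cache, skip_cached=False):
    """Return synonyms per identifier, falling back to the cache on failure."""
    results = {}
    for ident in identifiers:
        if skip_cached and ident in cache:
            results[ident] = cache[ident]  # never call the lookup at all
            continue
        found = lookup.synonyms(ident)
        if found is None:
            found = cache.get(ident, [])
        results[ident] = found
    return results


def run(pattern, skip_cached):
    lookup = SynonymLookup(SYNONYM_DB, pattern)
    out = annotate(IDS, lookup, CACHE, skip_cached)
    return out, lookup.calls, lookup.db_queries


buggy = run(STRICT_PATTERN, False)  # regex rejects most identifiers
fixed = run(None, False)            # every lookup goes to the database
tuned = run(None, True)             # cached identifiers are skipped up front

for name, (out, calls, queries) in [("buggy", buggy), ("fixed", fixed), ("tuned", tuned)]:
    print(name, "calls:", calls, "db queries:", queries)
```

All three runs return the same synonyms. The "fixed" run simply makes the most database queries, while the "tuned" run makes the fewest and also avoids the subroutine calls that were doomed to fail.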
A bug in a program is like a mutation. Bugs in a computer program are almost always bad. But this was a beneficial bug, which had no effect other than to make the program run much faster than it had been designed to. I was delighted to see this proof of the central non-intuitive idea of evolution: A random change can sometimes be beneficial, even to something as complex and brittle as a 10,000-line Perl program.