I have had the following situation happen several times during my research career: I write code to analyze data; there is some expectation about what the results will be; after running the program, the results are not what I expected; I go back and carefully check the code to make sure there are no errors; sometimes I find one.
No matter how careful you are when writing computer code, I think you are more likely to find a mistake if you believe there is one to find. Unexpected results lead us to suspect a coding error more than expected results do.
Researchers usually do have general expectations about what they will find (e.g., the drug will not increase the risk of the disease; the toxin will not decrease the risk of cancer).
Consider the following graphic:
Here, the green region is consistent with our expectations. For example, if we expect a relative risk (RR) of about 1.5, we might not be too surprised if the estimated RR is anywhere between, say, 0.9 and 2.0. Anything above 2.0 or below 0.9 might make us highly suspicious of an error -- that's the red region. Estimates in the red region are likely to trigger a serious hunt for coding errors. Obviously, if there turns out to be no coding error, the paper will get submitted with the surprising results.
Error scenarios
Let's assume that there is a coding error that causes the estimated effect to differ from the true effect (and assume the sample size is large enough that we can ignore sampling variability).
Consider the following scenario:
Type A. Here, the estimated value is biased, but it's within the expected range. In this scenario, error checking is probably more casual and less likely to be successful.
Next, consider this scenario:
Type B. In this case, the estimated value is in the red zone. This triggers aggressive error checking of the type that has a higher success rate.
Finally:
Type C. In this case it's the true value that differs from our expectations. However, the estimated value is about what we would expect. This triggers casual error checking of the less-likely-to-be-successful variety.
If this line of reasoning holds, we should expect journal articles to contain errors at a higher rate when the results are consistent with the authors' prior expectations. This could be viewed as a type of confirmation bias.
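To make the mechanism concrete, here is a toy Monte Carlo sketch in Python. Every number in it is invented purely for illustration: the 20% chance of a coding error, the 0.9-2.0 "expected" zone, and the casual versus aggressive detection probabilities are assumptions, not estimates of anything real.

    import random

    random.seed(1)

    # Toy parameters -- all made up for illustration.
    EXPECTED_LOW, EXPECTED_HIGH = 0.9, 2.0   # the "green zone" around an expected RR of ~1.5
    P_CODING_ERROR = 0.2                     # chance a study contains a coding error
    P_CATCH_CASUAL = 0.10                    # chance of finding the error after an expected result
    P_CATCH_AGGRESSIVE = 0.95                # chance of finding the error after a surprising result

    def simulate_study():
        true_rr = random.lognormvariate(0.4, 0.3)        # true effect, roughly centred near 1.5
        has_error = random.random() < P_CODING_ERROR
        # A coding error multiplies the estimate by some random bias factor.
        bias = random.lognormvariate(0.0, 0.5) if has_error else 1.0
        estimate = true_rr * bias

        surprising = not (EXPECTED_LOW <= estimate <= EXPECTED_HIGH)
        p_catch = P_CATCH_AGGRESSIVE if surprising else P_CATCH_CASUAL
        if has_error and random.random() < p_catch:
            # Error found and fixed before submission: the reported estimate is the true effect.
            has_error = False
            estimate = true_rr
            surprising = not (EXPECTED_LOW <= estimate <= EXPECTED_HIGH)
        return estimate, has_error, surprising

    published = [simulate_study() for _ in range(100_000)]
    expected_results = [s for s in published if not s[2]]
    surprising_results = [s for s in published if s[2]]

    def error_rate(studies):
        return sum(1 for _, err, _ in studies if err) / len(studies)

    print(f"error rate among 'expected' published results:   {error_rate(expected_results):.3f}")
    print(f"error rate among 'surprising' published results: {error_rate(surprising_results):.3f}")

With parameters like these, the published results that land in the expected zone carry undetected errors at a noticeably higher rate than the surprising ones, which is just the selective-checking story above played out in code.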
How common are programming errors in research?
There are many opportunities for hard-to-detect errors to occur. For large studies, there might be hundreds of lines of code related to database creation, data cleaning, etc., plus many more lines of code for data analysis. Studies also typically involve multiple programmers. I would not be surprised if at least 20% of published studies include results that were affected by at least one coding error. Many of these errors probably had a trivial effect, but I am sure others did not.
Bioinformatics is a neat example of how this bias can arise. Forgive me: I'm going to go into excessively nerdy detail about a specific example, because bioinformatics is cool.
Suppose that a biologist has amino acid sequences for 100 species' versions of the same protein. The species are not closely related, and the protein sequence varies quite a bit from one species to the next. The biologist wants to find parts of the sequence that have remained similar over a long stretch of evolutionary time. The usual way to do this is to try to line up matching parts of the sequences, inserting gaps and accepting mismatches where necessary.
Aligning multiple sequences is a very hard problem, computationally, so we have to use approximate methods. The most common approach is to break it down into a problem we can solve much more easily: align two sequences, then compute an average (profile) sequence for that clump, which can be used to add another sequence to it. And another, and another. These progressive algorithms first compare all pairs of sequences to measure how similar they are, and then start clumping together similar sequences following a tree that looks a lot like a diagram of evolutionary ancestry. At the end, your sequences should be aligned acceptably. You hope.
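The "easy" two-sequence subproblem can be solved exactly by dynamic programming (the Needleman-Wunsch algorithm). Here is a bare-bones Python sketch; the match/mismatch/gap scores are arbitrary toy values of my choosing, whereas real protein aligners use substitution matrices like BLOSUM62 and affine gap penalties.

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
        """Globally align two sequences with simple linear gap costs."""
        n, m = len(a), len(b)
        # score[i][j] = best score for aligning a[:i] with b[:j]
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                up = score[i - 1][j] + gap      # gap in b
                left = score[i][j - 1] + gap    # gap in a
                score[i][j] = max(diag, up, left)

        # Trace back from the bottom-right corner to recover one optimal alignment.
        aligned_a, aligned_b = [], []
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            ):
                aligned_a.append(a[i - 1])
                aligned_b.append(b[j - 1])
                i, j = i - 1, j - 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                aligned_a.append(a[i - 1])
                aligned_b.append("-")
                i -= 1
            else:
                aligned_a.append("-")
                aligned_b.append(b[j - 1])
                j -= 1
        return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))

    print(*needleman_wunsch("HEAGAWGHEE", "PAWHEAE"), sep="\n")

Progressive aligners apply this pairwise step over and over, merging sequences and profiles in the order dictated by the similarity tree.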
Of course, this assumes that some of the sequences are more closely related than others, and that you can form a nice tree shape. And it's very approximate, and there's lots of opportunity for error to creep in. So for some data, this works great, and for some data, it gives nonsense. Another method looks for small, very common subsequences, and iteratively refines the alignment based on these. Again, this works great for some data, and not so well for others. And then of course there are dozens of other methods, based on things like genetic algorithms, or simulated annealing, or hidden Markov models, and all of these have times when they work well and times when they don't.
So what does a biologist do? Try several methods, of course! The Right Way to do this is to run the same input through a bunch of algorithms, check whether several of them give the same results, and then apply any biological knowledge you have to help decide what's working. The way I'm sure some people actually do it is to reach for whatever multiple sequence alignment program they like most, run it, and trust its output if it's not too surprising. If it surprises them, they might try another algorithm. If that one produces something more like what they were expecting to see, they may stop there, because they're busy, damn it, and they have work to do. Most biologists really don't want to be computer scientists, and it's tempting to treat the computer as a magic answer machine. (The obvious solution here is to rent some cloud computing time, run a bunch of different algorithms on your data every time you do a multiple sequence alignment, and have your software automatically compare the results. More dakka helps.)
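If you do run several aligners on the same input, even a crude automatic comparison beats eyeballing. The sketch below reads two alignments of the same sequences from FASTA files and computes a simple agreement score: the fraction of aligned residue pairs in the first alignment that also appear in the second. The filenames and the particular metric are my own illustrative choices, not what any specific tool does.

    from itertools import combinations

    def read_fasta(path):
        """Minimal FASTA reader: returns {name: sequence}; sequences may contain '-' gaps."""
        seqs, name = {}, None
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    seqs[name] = []
                else:
                    seqs[name].append(line)
        return {name: "".join(parts) for name, parts in seqs.items()}

    def aligned_residue_pairs(alignment):
        """Set of (seq1, pos1, seq2, pos2) residue pairs implied by an alignment.

        Two residues count as aligned if they sit in the same column and neither is a gap.
        Positions are ungapped indices into the original sequences.
        """
        names = sorted(alignment)
        length = len(next(iter(alignment.values())))
        counters = {name: -1 for name in names}   # running residue index per sequence
        pairs = set()
        for col in range(length):
            in_column = {}
            for name in names:
                if alignment[name][col] != "-":
                    counters[name] += 1
                    in_column[name] = counters[name]
            for (n1, p1), (n2, p2) in combinations(sorted(in_column.items()), 2):
                pairs.add((n1, p1, n2, p2))
        return pairs

    def agreement(file_a, file_b):
        """Fraction of residue pairs aligned in A that are also aligned in B."""
        pairs_a = aligned_residue_pairs(read_fasta(file_a))
        pairs_b = aligned_residue_pairs(read_fasta(file_b))
        return len(pairs_a & pairs_b) / len(pairs_a)

    # Hypothetical output files from two different aligners run on the same sequences.
    print(f"agreement: {agreement('mafft_out.fasta', 'muscle_out.fasta'):.2%}")

A score near 100% means the two programs essentially agree; a much lower score is exactly the kind of surprise that should trigger the aggressive checking described above, rather than quietly picking whichever output looks nicer.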
I don't know how common this sort of error is in practice, but there's certainly the potential for it.