Dmytry comments on Diseased disciplines: the strange case of the inverted chart - LessWrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (150)
This strikes me as particularly galling because I have in fact repeated this claim to someone new to the field. I think I prefaced it with "studies have conclusively shown...". Of course, it was unreasonable of me to think that what is being touted by so many as well-researched was not, in fact, so.
Mind, it seems to me that defects do follow both patterns: Introducing defects earlier and/or fixing them later should come at a higher dollar cost; that just makes sense. However, it could be the same type of "makes sense" that made Aristotle conclude that heavy objects fall faster than light objects - getting actual data would be much better than reasoning alone, especially as it would tell us just how much costlier, if at all, these differences are - it would be an actual precise tool rather than a crude (and uncertain) rule of thumb.
I do have one nagging worry about this example: These days a lot of projects collect a lot of metrics. It seems dubious to me that no one has tried to replicate these results.
Mostly the ones that are easy to collect: a classic case of "looking under the lamppost where there is light rather than where you actually lost your keys".
Now we're starting to think. Could we (I don't have a prefabricated answer to this one) think of a cheap and easy to run experiment that would help us see more clearly what's going on?
Here's an idea. There are a number of open-source software projects that exist. Many of these are in some sort of version control system which, generally, keeps a number of important records; any change made to the software will include a timestamp, a note by the programmer detailing what was the intention of the change, and a list of changes to the files that resulted from the change.
A simple experiment might then be to simply collate data from either one large project, or a number of smaller projects. The cost of fixing a bug can be estimated from the number of lines of code changed to fix the bug; the amount of time since the bug was introduced can be found by looking back through previous versions and comparing timestamps. A scatter plot of time vs. lines-of-code-changed can then be produced, and investigated for trends.
Of course, this would require a fair investment of time to do it properly.
And time is money, so that doesn't really fit the "cheap and easy" constraint I specified.
Hmmm. I'd parsed 'cheap and easy' as 'can be done by a university student, on a university student's budget, in furtherance of a degree' - which possibly undervalues time somewhat.
At the cost of some amount of accuracy, however, a less time-consuming method might be the following: automate the query, under the assumption that the bug being repaired was introduced at the earliest date on which any of the lines of code modified to fix the bug was last modified (that is, if three lines of code were changed to fix the bug, two of which had last been changed on 24 June and one of which had last been changed on 22 June, then the bug would be assumed to have been introduced on 22 June). Without human inspection of each result, some extra noise will be introduced into the final graph. (Inspection of a small subset of the results by a human (or suitable AGI, if you have one available) could give an estimate of the noise introduced.)
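As a rough illustration of that heuristic, here is a minimal Python sketch. It assumes the per-line "last modified" dates have already been pulled out of version control by some other means; the `BugFix` record, its field names, and the year and fix date in the example are all made up for illustration and are not part of any real tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class BugFix:
    """One bug-fixing commit (hypothetical record, not a real VCS API)."""
    fixed_on: date
    # For each line touched by the fix: the date that line was last
    # modified before the fix (what a "blame" query would report).
    last_modified: List[date]


def estimated_introduction(fix: BugFix) -> date:
    # Heuristic from the comment above: assume the bug was introduced at the
    # earliest last-modified date among the lines changed by the fix.
    return min(fix.last_modified)


def scatter_point(fix: BugFix) -> Tuple[int, int]:
    # x = estimated age of the bug in days, y = lines changed (proxy for cost).
    age_days = (fix.fixed_on - estimated_introduction(fix)).days
    return age_days, len(fix.last_modified)


# The 22 June / 24 June example; the year and the fix date are arbitrary.
fix = BugFix(
    fixed_on=date(2012, 7, 1),
    last_modified=[date(2012, 6, 24), date(2012, 6, 24), date(2012, 6, 22)],
)
print(estimated_introduction(fix))
print(scatter_point(fix))
```

Each fix yields one (age, lines changed) point for the scatter plot. Both coordinates are proxies, so the noise mentioned above applies to both axes.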
By "cheap and easy" what I mean is "do the very hard work of reasoning out how the world would behave if the hypothesis were true, versus if it were false, and locate the smallest observation that discriminates between these two logically possible worlds".
That's hard and time-consuming work (therefore expensive), but the experiment itself is cheap and easy.
My intuition (and I could well be Wrong on this) tells me that experiments of the sort you are proposing are sort of the opposite: cheap in the front and expensive in the back. What I'm after is a mullet of an experiment, business in front and party in back.
An exploratory experiment might consist of taking notes the next time you yourself fix a bug, and writing down the answers to a bunch of hard questions: how did I measure the "cost" of this fix? How did I ascertain that this was in fact a "bug" (vs. some other kind of change)? How did I figure out when the bug was introduced? What else was going on at the same time that might make the measurements invalid?
Asking these questions, ISTM, is the real work of experimental design to be done here.
Well, for a recent bug; first, some background:
Then, to answer your questions in order:
Once the problem code was identified, the fix was done in a few minutes. Identifying the problem code took a little longer, as the problem was a rare and sporadic one - it happened first during a particularly irritating test case (and then, entirely by coincidence, a second time on a similar test case, which caused some searching in the wrong bit of code at first)
A numeric value, displayed to the user, was showing "NaN".
The bug was introduced by failing to consider a rare but theoretically possible test case at the time (near the beginning of a long project) that a certain utility function was produced. I could get a time estimate by checking version control to see when the function in question had been first written; but it was some time ago.
A more recent change made the bug slightly more likely to crop up (by increasing the potential for rounding errors). The bug may otherwise have gone unnoticed for some time.
Of course, that example may well be an outlier.
Hmmm. Thinking about this further, I can imagine whole rafts of changes to the specifications which can be made just before final testing at very little cost (e.g. "Can you swap the positions of those two buttons?") Depending on the software development methodology, I can even imagine pretty severe errors creeping into the code early on that are trivial to fix later, once properly identified.
The only circumstances I can think of that might change how long a bug takes to fix as a function of how long the development has run are:
Good stuff! One crucial nitpick:
That doesn't tell me why it's a bug. How is 'bug-ness' measured? What's the "objective" procedure to determine whether a change is a bug fix, vs something else (dev gold-plating, change request, optimization, etc)?
NaN is an error code. The display was supposed to show the answer to an arithmetical computation; NaN ("Not a Number") means that, at some point in the calculation, an invalid operation was performed (division by zero, arccos of a number greater than 1, or similar).
It is a bug because it does not answer the question that the arithmetical computation was supposed to solve. It merely indicates that, at some point in the code, the computer was told to perform an operation that does not have a defined answer.
That strikes me as a highly specific description of the "bug predicate" - I can see how it applies in this instance, but if you have 1000 bugs to classify, of which this is one, you'll have to write 999 more predicates at this level. It seems to me, too, that we've only moved the question one step back - to why you deem an operation or a displayed result "invalid". (The calculator applet on my computer lets me compute 1/0 giving back the result "ERROR", but since that's been the behavior over several OS versions, I suspect it's not considered a "bug".)
Is there a more abstract way of framing the predicate "this behavior is a bug"? (What is "bug" even a property of?)
By definition, no cheap experiment can give meaningful data about high-cost bugs.
That sounds intuitively appealing, but I'm not quite convinced that it actually follows.
You can try to find people who produce such an experiment as a side-effect, but in that case you don't get to specify parameters (that may lead to a failure to control some variable - or not).
The overall cost of the experiment for all involved parties will not be low, though (although the marginal cost of the experiment relative to just doing business as usual can probably be reduced).
A "high-cost bug" seems to imply tens of hours spent overall on fixing. Otherwise, it is not clear how to measure the cost - from my experience quite similar bugs can take from 5 minutes to a couple of hours to locate and fix without clear signs of either case. Exploration depends on your shape, after all. On the other hand, it should be a relatively small part of the entire project, otherwise it seems to be not a bug, but the entire project goal (this skews data about both locating the bug and cost of integrating the fix).
If 10-20 hours (how could you predict how high-cost a bug will be?) are a small part of a project, you are talking about at least hundreds of man-hours (it is not a good measure of project complexity, but it is an estimate of cost). Now, you need to repeat, you need to try alternative strategies to get more data on early detection and on late detection and so on.
It can be that you have access to some resource that you can spend on this (I dunno, a hundred students with a few hours per week for a year dedicated to some programming practice where you have relative freedom?) but not on anything better; it may be that you can influence the set of measurements of some real projects. But the experiment will only be cheap by making someone else cover the main cost (probably, for a good unrelated reason).
Also notice that if you cannot influence how things are done, only how they are measured, you need to specify what is measured much better than the cited papers do. What is the moment of introduction of a bug? What is the cost of fixing a bug? Note that fixing a high-cost bug may include doing some improvements that were put off before. This putting off could be a decision with a reason, or just irrational. It would be nice if someone proposed a methodology for measuring enough control variables in such a project - not because it would let us run this experiment, but because it would be a very useful piece of research on software project costs in general.
A high-cost bug can also be one that reduces the benefit of having the program by a large amount.
For instance, suppose the "program" is a profitable web service that makes $200/hour of revenue when it is up, and costs $100/hour to operate (in hosting fees, ISP fees, sysadmin time, etc.), thus turning a tidy profit of $100/hour. When the service is down, it still costs $100/hour but makes no revenue.
Bug A is a crashing bug that causes data corruption that takes time to recover; it strikes once, and causes the service to be down for 24 hours, which time is spent fixing it. This has the revenue impact of $200 · 24 = $4800.
Bug B is a small algorithmic inefficiency; fixing it takes an eight-hour code audit, and causes the operational cost of the service to come down from $100/hour to $99/hour. This has the revenue impact of $1 · 24 · 365 = $8760/year.
Bug C is a user interface design flaw that makes the service unusable to the 5% of the population who are colorblind. It takes five minutes of CSS editing to fix. Colorblind people spend as much money as everyone else, if they can; so fixing it increases the service's revenue by about 4.75% to $209.50/hour. This has the revenue impact of $9.50 · 24 · 365 = $83,220/year.
Which bug is the highest-cost? Seems clear to me.
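For anyone who wants to check the arithmetic, a small Python sketch reproducing the three figures follows. The variable names are mine; the dollar amounts are the ones given above. Note that Bug A's figure is a one-off loss while B and C are per-year figures, so the comparison implicitly assumes a horizon of about a year.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years

REVENUE_PER_HOUR = 200.0   # revenue while the service is up
DOWNTIME_HOURS = 24        # Bug A: one outage of 24 hours
SAVING_PER_HOUR = 1.0      # Bug B: operating cost drops from $100 to $99
EXTRA_REVENUE = 9.50       # Bug C: revenue rises from $200 to $209.50

impact = {
    "A (one-off)": REVENUE_PER_HOUR * DOWNTIME_HOURS,
    "B (per year)": SAVING_PER_HOUR * HOURS_PER_YEAR,
    "C (per year)": EXTRA_REVENUE * HOURS_PER_YEAR,
}

# Print the bugs from highest to lowest impact.
for name, dollars in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"Bug {name}: ${dollars:,.0f}")

highest = max(impact, key=impact.get)
```

On these numbers the five-minute CSS fix dominates, which is the point of the example: effort to fix and damage if unfixed can point in opposite directions.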
The definition of cost you use (damage-if-unfixed-by-release) is distinct from all the previous definitions of cost (cost-to-fix-when-found). Neither is easy to measure. Actual cited articles discuss the latter definition.
I asked to include the original description of the values plotted in the article, but this is not there yet.
Of course, existence of the high-cost bug in your definition implies that the project is not just a cheap experiment.
Furthermore, following your example turns the claim that the article contests (as a plausible story without facts behind it) into a matter of simple arithmetic: the longer the bug lives, the higher the time multiplier of its value. On the other hand, given that many bugs become irrelevant because of some upgrade/rewrite before they are found, it is even harder to estimate the number of bugs, let alone the cost of each one. Also, how an inefficiency affects operating costs can be difficult enough to estimate that nobody knows whether it is better to fix a cost-increaser or add a new feature to increase revenue.
Is that a request addressed to me? :)
If so, all I can say is that what is being measured is very rarely operationalized in the cited articles: for instance, the Brady 1999 "paper" isn't really a paper in the usual sense, it's a PowerPoint, with absolutely no accompanying text. The Brady 1989 article I quote even states that these costs weren't accurately measured.
The older literature, such as Boehm's 1976 article "Software Engineering", does talk about cost to fix, not total cost of the consequences. He doesn't say what he means by "fixing". Other papers mention "development cost required to detect and resolve a software bug" or "cost of reworking errors in programs" - those point more strongly to excluding the economic consequences other than programmer labor.
Of course. My point is that you focused a bit too much on misciting instead of going for the quick kill and saying that they measure something underspecified.
Also, if you think that their main transgression is citing things wrong, exact labels from the graphs you show seem to be a natural thing to include. I don't expect you to tell us what they measured - I expect you to quote them precisely on that.
The main issue is that people just aren't paying attention. My focus on citation stems from observing that a pair of parentheses, a name and a year seem to function, for a large number of people in my field, as a powerful narcotic suspending their critical reason.
If this is a tu quoque argument, it is spectacularly mis-aimed.
Or to put that another way, there can't be any low-hanging fruit, otherwise someone would have plucked it already.
A costly, but simple way would be to gather groups of SW engineers and have them work on projects where you intentionally introduce defects at various stages, and measure the costs of fixing them. To be statistically meaningful, this probably means thousands of engineer hours just to that effect.
A cheap (but not simple) way would be to go around as many companies as possible and take the relevant measurements on actual products. This entails a lot of variables, however - engineer groups tend to work in many different ways. This might cause the data to be less than conclusive. In addition, the politics of working with existing companies may also tilt the results of such research.
I can think of simple experiments that are not cheap; and of cheap experiments that are not simple. I'm having difficulty satisfying the conjunction and I suspect one doesn't exist that would give a meaningful answer for high-cost bugs.
(Minor edit: Added the missing "hours" word)
It's not that costly if you do it with university students: Get two groups of 4 university students. One group is told "test early and often". One group is told "test after the code is integrated". For every bug they fix, measure the effort it takes to fix it (by having them "sign a clock" for every task they do). Then, do analysis on when the bug was introduced (this seems easy once the bug is fixed, particularly if they use something like Trac and SVN). All it takes is a month-long project that a group of 4 software engineering students can do. It seems like any university with a software engineering department can do it for the cost of running one course. Seems to me it's under $50K to fund?
Yes, it would be nice to have such a study.
But it can't really be done the way you envision it. Variance in developer quality is high. Getting a meaningful result would require a lot more than 8 developers. And very few research groups can afford to run an experiment of that size -- particularly since the usual experience in science is that you have to try the study a few times before you have the procedure right.
That would be cheap and simple, but wouldn't give a meaningful answer for high-cost bugs, which don't manifest in such small projects. Furthermore, with only eight people total, individual ability differences would overwhelmingly dominate all the other factors.
We know that late detection is sometimes much more expensive, simply because depending on the domain, some bugs can do harm (letting bad data into the database, making your customers' credit card numbers accessible to the Russian Mafia, delivering a satellite to the bottom of the Atlantic instead of into orbit) much more expensive than the cost of fixing the code itself. So it's clear that on average, cost does increase with time of detection. But are those high-profile disasters part of a smooth graph, or is it a step function where the cost of fixing the code typically doesn't increase very much, but once bugs slip past final QA all the way into production, there is suddenly the opportunity for expensive harm to be done?
In my experience, the truth is closer to the latter than the former, so that instead of constantly pushing for everything to be done as early as possible, we would be better off focusing our efforts on e.g. better automatic verification to make sure potentially costly bugs are caught no later than final QA.
But obviously there is no easy way to measure this, particularly since the profile varies greatly across domains.
The real problem with these graphs is not that they were cited wrong. After all, it does look like both are taken from different data sets, however they were collected, and support the same conclusion.
The true problem is that it is hard to say what they measure at all.
If this true problem didn't exist, and these graphs measured something that can be measured, I'd bet that these graphs not being refuted would actually mean that they are both showing the true sign of the correlation. The reason is quite simple: every possible metric gets collected for a stupid presentation from time to time. If the correlation were falsifiable and wrong, we would likely see falsifications on the TheDailyWTF forum as anecdotes.
I don't understand why you think the graphs are not measuring a quantifiable metric, nor why it would not be falsifiable. Especially if the ratios are as dramatic as often depicted, I can think of a lot of things that would falsify it.
I also don't find it difficult to say what they measure: The cost of fixing a bug depending on which stage it was introduced in (one graph) or which stage it was fixed in (other graph). Both things seem pretty straightforward to me, even if "stages" of development can sometimes be a little fuzzy.
I agree with your point that falsifications should have been forthcoming by now, but then again, I don't know that anyone is actually collecting this sort of metric - so anecdotal evidence might be all people have to go on, and we know how unreliable that is.
There are things that could falsify it dramatically, most probably. Apparently they are not true facts. I specifically said "falsifiable and wrong" - in the parts where this correlation is falsifiable, it is not wrong for the majority of projects.
About the dramatic ratio: you cannot falsify a single data point. It simply happened like this - or so the story goes. There are so many things that will be different in another experiment that can change (although not reverse) the ratio without disproving the general strong correlation...
Actually, we do not even know what the axis labels are. I guess they are fungible enough.
Saying that the cost of fixing is something straightforward seems to be too optimistic. Estimating the true cost of the entire project is not always simple when you have more than one project at once and some people are involved with both. What do you call the cost of fixing a bug?
Any metric that contains "cost" in the name gets requested by some manager from time to time somewhere in the world. How it is calculated is another question. Actually, this is the question that matters.