Please Note: This entry is an archived entry from my previous weblog. No new comments may be posted. Also, as of May 2005, I have left Microsoft. I am still happy to respond to any questions or comments about this article, but as I am no longer on the Visual Studio team, I may not be able to provide any helpful answers.
I got a lot of really positive feedback to my recent entry, "The anatomy of a bug." I'm glad some people found it interesting, so I'll try to write some more entries like that in the future.
(It surprised me how long it took to write that entry, so I don't think I'll be able to do it very often. On a similar note, any aspirations I may have had about becoming a book author have been effectively squashed, since I can hardly manage to cobble together 1000 words at a time, let alone enough content for a book. Anyhow...)
I got some great responses from some fellow Big House Bloggers like Eric Lippert, who followed up my post with an excellent one of his own: "How many Microsoft employees does it take to change a lightbulb?" In my comments, Eric says:
Every time we fix a bug it takes up several person-days which could have been spent finding more important bugs, doing security reviews, designing the next version, implementing features, whatever. It is crucial to ensure that we stop bugs from getting written in the first place, and to brutally triage them once they're in.
Eric makes a great point: every bug is expensive, even if the fix seems simple, and every new feature is expensive, even if the implementation seems trivial.
In his own follow-up, Tom makes the following comment: "That said, for a company that makes so much money, the 'it's too hard, not enough time' argument is much, much weaker." The fact that Microsoft has many resources does not make software any easier to develop or bugs any simpler to fix. Money is not a silver bullet. As Eric mentions, each design change, new feature, or bug fix can require many person-weeks of effort. Microsoft does software development on a scale that is virtually unprecedented; this increase in scope and complexity means that bugs are not easier to fix -- they are phenomenally more expensive to fix. The people I work with are creating tools specifically to make programmers more productive, to make testers more effective, and to make software more reliable and secure. Still, nothing can completely remove the human element from this process: we have only so many warm bodies to do the work, and sometimes the fix is too hard, and there isn't enough time. This is also part of managing the chaos. (Again, "we" means "people I work with" in this context.)
Jim Edgar makes a comment about regression testing: "Commonly a large part of the application has to be retested to ensure the 'fix' has not broken anything else." This is absolutely true. We try to minimize the cost of doing regression testing during product development, but as we get closer to the expected ship date, the requirements (and the cost) of verifying that any change or bug fix does not cause regressions grow almost exponentially. This is even more true for shipping a service pack or other post-release patch; in those cases, the regression testing can consume the overwhelming majority of testing resources.
Finally, one commenter (as "Billy Boy," which I assume is a pseudonym) asks a very good question:
While I agree with your overall points in this article, I have to question your proposed solution to the bug. I can understand the concern about adding this check to the main code path. I think this is a valid concern. But it begs the question of why add it as a pre-deployment check as opposed to a post failure check. Once the deployment fails you check if the keys were missing and then do whatever recovery is possible. Personally I would have put up a message box telling the user that registry keys are missing and to import the ProxyPorts.reg in the VS.Net directory. The users are assumed to be developers -- they should understand what this means and how to correct it. It is certainly much more preferable than getting a generic 'Deployment Failed' error.
These are all reasonable points. Why check pre-deployment rather than post-deployment? Part of the answer is a matter of preference, but I think that if you are going to fail, you should fail as early as possible. Also, this bug was a case where the question "Is this going to fail because the keys are missing?" is a lot easier to answer than "Did that fail because the keys were missing?" The errors that get reported in the event of a deployment failure are, unfortunately, not very specific, so it was deemed better to preemptively detect a problem than to diagnose it after a failure had occurred. Also, a device deployment in VS is not an instantaneous operation; it can take many seconds. It's better to give feedback as soon as possible, and if we had fixed the registry problem after deployment instead of before, the user would have to re-deploy in order to get a successful deployment. If we were going to fix the root cause, we should probably fix it in such a way that we don't have to tell the user "You did something, there was a problem, we fixed it, but you have to do that same thing again now." That's not a very polished experience.
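To make the "fail as early as possible" idea concrete, here is a minimal sketch in Python. It is not the Visual Studio code, and everything in it is an assumption made for illustration: the key names in `REQUIRED_KEYS`, the `MissingKeysError` type, and the `key_exists` and `deploy` callables are all hypothetical stand-ins. The point is only the ordering: the cheap, specific question gets asked before the slow operation starts.

```python
from typing import Callable, Iterable, List

# Hypothetical key names, for illustration only; the real keys are not
# listed in this post.
REQUIRED_KEYS = ["ProxyPorts/KeyA", "ProxyPorts/KeyB"]


class MissingKeysError(Exception):
    """Raised before deployment starts, naming exactly what is missing."""

    def __init__(self, missing: Iterable[str]) -> None:
        self.missing: List[str] = list(missing)
        super().__init__("missing registry keys: " + ", ".join(self.missing))


def find_missing_keys(
    required: Iterable[str], key_exists: Callable[[str], bool]
) -> List[str]:
    """Return the required keys for which key_exists reports False."""
    return [key for key in required if not key_exists(key)]


def deploy_with_precheck(
    required: Iterable[str],
    key_exists: Callable[[str], bool],
    deploy: Callable[[], str],
) -> str:
    """Run the cheap check first; only start the slow deploy if it passes."""
    missing = find_missing_keys(required, key_exists)
    if missing:
        # Fail early with a specific error instead of a generic
        # "deployment failed" many seconds later.
        raise MissingKeysError(missing)
    return deploy()
```

A caller would pass in whatever actually knows how to look up a key and whatever actually performs the deployment. The post-failure alternative would call `deploy()` first and only then try to work out whether missing keys were the cause, which is the harder question described above.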
Why not just show a dialog telling the user to import "proxyports.reg"? I think it would have been worse to "almost" fix the bug than to make a readme note and not fix the bug at all. If we did all the work to detect the bug, display the dialog, and include the registry file, plus all the attendant testing, localization, and documentation work, why not go the other 10% and fix the bug the right way? Also, this "near-fix" has almost all of the same problems as the real fix: localization issues, an often-hit code path, etc. Plus there's the same issue of running as a non-admin user and trying to import the reg file, which would fail with RegEdit's generic "could not import settings" dialog instead of something potentially more informative. Also, how does the user get to the reg file? Do we give them a button to open an Explorer window to the right directory? What if the system administrator has prevented access to "regedit.exe"? Sure, the customers who use VS are developers, and they're a smart group, but it's not very nice for software to say, "hey, there's a problem here, and now you have to go and fix it because we won't fix it for you." It's probably better to say, "here's a bug, we unfortunately didn't fix it, but here's a workaround." And with the "half-fix" there's a new issue: including "proxyports.reg" with VS requires making changes to the VS setup. This is no small issue; setup changes require a lot of testing and verification because changes to setup affect a lot more than just the feature in question. Setup changes can affect application compatibility, side-by-side installation, servicing (i.e., service packs, patches, etc.), and a whole host of other scenarios.
In short (or maybe not, by this point), we considered many different kinds of fixes, and weighed the pros and cons of all of them. In the end, we decided not to fix the bug, for all these reasons and more. Shipping a bug is painful, but sometimes a fix that might seem "good enough" is not only not good enough, it might even be worse.
< Microsoft > Posted at November 2, 2003 06:20 PM
I am reminded of an annoying 'feature' on my microwave: When you use the timer, you can't turn it off by pressing 'cancel'. You can only turn it off by pressing the Timer On/Off button.
When you press cancel while the timer's going, it displays a dialog: "Press timer on/off". For goodness sakes, if you can *tell* me what to do to solve this problem, why not do it?
Makes me wonder if GE has similar issues with microwave development as we do with software :-)
Comment by: KC Lemson at November 7, 2003 12:25 PM
I disagree completely. I had to search Google for twenty minutes to find the proxyports.reg fix and I had to disregard all the "re-install Visual Studio" answers. ANY hint from VS would have been welcome. You are suggesting that it is better for software to fail without giving the user any clue as to what failed, unless it can provide a 100% remedy.
Comment by: Bill Guenthner at January 28, 2004 12:39 PM