Please Note: This entry is an archived entry from my previous weblog. No new comments may be posted. Also, as of May 2005, I have left Microsoft. I am still happy to respond to any questions or comments about this article, but as I am no longer on the Visual Studio team, I may not be able to provide any helpful answers.
When people find out I'm a software tester at Microsoft, they'll sometimes start a conversation with me that goes roughly as follows:
Them: "So Microsoft has all these testers, right? Then why does your software have bugs?"
Me: "Well, writing and testing software is difficult. We work very hard to fix all the bugs, but sometimes we miss a few."
Them: (snarkily) "It sounds like you need to hire more testers, then; ha ha! I am so clever."
Me: (holding back my inner annoyance) "Yes... that's very... funny."
I never get the chance to explain some of the subtleties of the find-a-bug-and-fix-a-bug process, which is sometimes not as straightforward as most people assume it to be. I thought I'd take some time to write a few words about one specific experience of dealing with a bug. Maybe this will provide some context; maybe it will be interesting, maybe not.
There is nothing that would make me happier than to fix every bug that is found during testing before the product officially ships. Unfortunately, the realities of product development always dash my happy dreams and I wind up taking a lot of aspirin. I don't care what model of software development you use -- waterfall, spiral, eXtreme, or otherwise -- the process of shipping a piece of software boils down to the same thing: trying to control chaos.
(As an aside, I always picture "eXtreme" programming practitioners using phrases like "Dude, gnarly unit test!" but maybe that's just me. I'm not a big fan. But I digress...)
Software has bugs
Of course, Microsoft ships software with bugs. You might think that, as a software tester, I get very angry about this. You would be right. Mostly, I get angry about bugs that aren't discovered until after the product has been released, since this means that the test team missed something, and we tend to take things like that rather personally. However, there are a few cases where we (and by "we," I mean "me and some other people on my team") find a bug, and perhaps the developer has even produced a fix for the bug, and yet we ship the software without fixing the bug.
I can hear people screaming "You do WHAT?" at the top of their indignant lungs, so please -- calm down. In the interest of lesser opacity, I'll take one (of a very few) of these kinds of bugs as an example, and talk in excruciating detail about what the bug is, and why we decided not to fix it before RTM (release to manufacturing).
The bug and the workaround
First, let's look at the Readme that is part of the "Windows CE Utilities for Visual Studio .NET 2003 Add-on Pack 1.1 (Updated in September 2003)", which is available at http://www.microsoft.com/downloads/details.aspx?familyid=7EC99CA6-2095-4086-B0CC-7C6C39B28762. The Readme includes this note:
I uninstalled ActiveSync and then reinstalled it. Now I can't deploy my application. How do I fix this?
Issue: After uninstalling ActiveSync version 3.5, 3.6, 3.7 or 3.71, deploying fails with the error message "There were deployment errors. Continue?" This typically happens when one version of ActiveSync is uninstalled and a newer version is installed.
Solution: Uninstalling these versions of ActiveSync removes some registry settings that are required for deployment. You must restore these registry settings to enable deployment.
To restore the registry settings
- Import ProxyPorts.reg into the registry.
Note: ProxyPorts.reg is located by default in Program Files\Microsoft Visual Studio .NET 2003\CompactFrameworkSDK\WinCE Utilities\WinCE Proxy Ports Reg.
- Remove the device from the cradle and replace it. If deployment still does not work, soft-reboot the device.
As you might guess, this basically describes a workaround for a bug in VS.NET 2003. What is the actual problem that the workaround solves?
The root cause
VS.NET 2003 depends on ActiveSync in order to debug and deploy device solutions to actual Windows CE devices (as opposed to debugging and deploying to the emulator, which does not require ActiveSync). This is all well and good, except that VS.NET 2003 and ActiveSync are two separate products. This means that they can be installed and uninstalled separately from each other, and these installations and uninstalls can occur in any order.
In order to make the deployment magic happen, VS.NET 2003 depends on having several keys added to the registry that allow the device connection manager to bootstrap the debugger or to get the compiled assemblies deployed to the device. The sticky part is that the keys that VS depends on actually get added to the ActiveSync section of the registry. In fact, if you look at the "proxyports.reg" file that is referenced by the Readme above, you'll see a bunch of keys that look like this:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows CE Services\ProxyPorts]
"DeviceMDM00"=dword:00000bb8
"DeviceMDM01"=dword:00000bb9
...
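To make the excerpt above concrete: the dword: values are 32-bit hexadecimal numbers, so 00000bb8 is 3000 and 00000bb9 is 3001. Here is a minimal Python sketch that reads lines in this format. The function name parse_reg_values is my own invention for illustration, and it handles only the "name"=dword:hex form shown in the excerpt, not the full .reg file format.

```python
import re

# Matches lines of the form "Name"=dword:0000abcd (the only form shown above).
_DWORD_LINE = re.compile(r'^"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})$')


def parse_reg_values(text):
    """Return {value_name: int} for every dword line in a .reg-style snippet."""
    values = {}
    for line in text.splitlines():
        match = _DWORD_LINE.match(line.strip())
        if match:
            values[match.group("name")] = int(match.group("hex"), 16)
    return values


# Only the two values visible in the excerpt; the real file contains more.
snippet = '''[HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows CE Services\\ProxyPorts]
"DeviceMDM00"=dword:00000bb8
"DeviceMDM01"=dword:00000bb9
'''

print(parse_reg_values(snippet))  # {'DeviceMDM00': 3000, 'DeviceMDM01': 3001}
```

The key-path line in brackets is skipped because it does not match the value pattern.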
This normally isn't a problem, even if you install VS before you install ActiveSync. These keys will be added as part of the VS installation process even if ActiveSync isn't already installed (this is assuming you've chosen to include the "Smart Device Programmability" features of VS in the installation, of course). If you later choose to install ActiveSync, it won't clobber these keys, since they don't appear as part of a default ActiveSync installation.
A problem arises, however, if you uninstall ActiveSync after you've installed VS. During the uninstall process, ActiveSync will helpfully clean up after itself by deleting all of the registry keys it owns -- and this includes all the keys under "Windows CE Services". This will happen even if you're upgrading to a newer version of ActiveSync, since you have to uninstall the previous version first. Now, I'd like to point out that this isn't a bad behavior on the part of the ActiveSync uninstaller -- all well-behaved applications are expected to delete their registry keys when they are removed.
So anyhow, you decide to uninstall or upgrade your ActiveSync, and -- presto! -- device deployment in VS is now broken, because these registry keys have disappeared. Bummer.
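The ordering problem is easier to see in a toy model. The sketch below treats the registry as a plain Python dictionary and is not how either installer is actually written. The function names are hypothetical, the key path is abbreviated with HKLM, and only the two values shown in the excerpt are modeled.

```python
# Toy model: the registry is a dict mapping key paths to {value_name: data}.
SERVICES_KEY = r"HKLM\SOFTWARE\Microsoft\Windows CE Services"
PROXY_PORTS_KEY = SERVICES_KEY + r"\ProxyPorts"


def install_vs(registry):
    # VS adds its deployment values under the ActiveSync-owned subtree.
    registry.setdefault(PROXY_PORTS_KEY, {}).update(
        {"DeviceMDM00": 0xBB8, "DeviceMDM01": 0xBB9}
    )


def install_activesync(registry):
    # Assumption for the model: ActiveSync creates its own key
    # and leaves any existing ProxyPorts values alone.
    registry.setdefault(SERVICES_KEY, {})


def uninstall_activesync(registry):
    # The uninstaller removes everything under "Windows CE Services".
    for key in [k for k in registry if k.startswith(SERVICES_KEY)]:
        del registry[key]


def can_deploy(registry):
    # Deployment needs the ProxyPorts values to be present.
    return bool(registry.get(PROXY_PORTS_KEY))


registry = {}
install_vs(registry)
install_activesync(registry)
print(can_deploy(registry))     # True

uninstall_activesync(registry)  # for example, the first step of an upgrade
install_activesync(registry)
print(can_deploy(registry))     # False
```

Neither installer misbehaves on its own; the breakage comes from one product storing its values in a subtree that the other product owns and cleans up.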
How to fix the bug
This isn't a difficult problem to fix, in principle. Simply re-importing the right keys (using the aforementioned "proxyports.reg" file) will work. In the worst case, going to the "Add/Remove Programs" control panel and choosing to "Repair" your VS.NET 2003 installation will also work, although this option takes a while.
So, we knew that uninstalling ActiveSync would break device deployment. We also knew what the workarounds were. You might ask, "Why didn't someone make a fix for the bug?" Well, we did that too.
One of the developers implemented a fix for this bug that did the following on the first deployment using ActiveSync:
The last step was necessary because once the keys were re-added, the ActiveSync connection to the device needed to be reset in order for the changes to take effect. This sounds like a pretty simple fix, right? Let's take a closer look, because there were several complications that were factors in the decision not to take the fix.
Extra dev cost
First of all, it's worth noting that the registry keys in question live in the HKEY_LOCAL_MACHINE hive, which requires Administrator privileges to modify. Sure, a lot of developers run as admin on their machine, but there are plenty of situations (computer labs, for instance) where running as an admin isn't practical. From a defense-in-depth perspective, it's also not wise to run as admin.
So in any case where the user wasn't an administrator, we couldn't silently repair the missing registry keys. What should happen then? Should we prompt the user to log in as admin and fix the problem? Should we point them to a .Reg file and have them fix the problem themselves as admin? It wasn't clear that there was a good course of action to take in the non-admin case.
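The decision described above can be sketched as a small function. This is not the code of the prototype fix; it is a hypothetical illustration of the branching, with invented names (plan_repair and the action strings), and it knows about only the two values visible in the .reg excerpt.

```python
# Only the two values shown in the excerpt; the real file contains more.
EXPECTED = {"DeviceMDM00": 0xBB8, "DeviceMDM01": 0xBB9}


def plan_repair(present, is_admin, expected=EXPECTED):
    """Decide what to do before a deployment; returns (action, missing_names)."""
    missing = sorted(
        name for name, data in expected.items() if present.get(name) != data
    )
    if not missing:
        return ("deploy", [])
    if is_admin:
        # HKEY_LOCAL_MACHINE is writable: re-add keys, then reset the connection.
        return ("repair_and_reset_connection", missing)
    # No silent option exists here; this is the open question in the text.
    return ("ask_user_to_import_reg_as_admin", missing)


print(plan_repair(EXPECTED, is_admin=False))
# ('deploy', [])
print(plan_repair({}, is_admin=True))
# ('repair_and_reset_connection', ['DeviceMDM00', 'DeviceMDM01'])
print(plan_repair({"DeviceMDM00": 0xBB8}, is_admin=False))
# ('ask_user_to_import_reg_as_admin', ['DeviceMDM01'])
```

Every branch other than "deploy" implies at least one dialog, which is where the test cost starts to climb.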
Extra test cost
This brings me to my second point -- testing the fix. The simplest fix we had (described above) required at least one extra dialog to display to the user (and perhaps two or more, depending on how we wanted to handle the non-admin case). For every piece of UI that gets displayed, there are one or (usually) more resource strings that contain the text to display.
Microsoft ships VS.NET in about a dozen different languages, so not only would all of these strings have to be reviewed for correct phrasing, but they would have to be translated into all the different languages, and then each language would have to be tested to make sure that the UI was displayed properly (e.g., none of the text was clipped, no Unicode encoding errors, no spelling mistakes). This is further complicated by the fact that more than one language of VS.NET can be installed on a single machine and that the UI language can be changed at runtime, which increases the testing requirements of the bug fix even more. Throw in the fact that you can install one or more languages of VS.NET onto each of the 20 or so languages that Windows is available in, and you're looking at quite a lot of extra testing.
"Why not automate all of that testing?" I hear you ask. Well, we would have done that anyhow -- it's simply unreasonable to expect all of the additional testing to be done manually. This means that our automated tests would need to be updated, and that all of these tests would have to be re-run to make sure that (a) the bug fix didn't cause any regressions in other product features, and (b) none of the changes to the automation caused any false negative test results.
Then we would have had to add additional automated tests to specifically exercise the new dialogs associated with the registry fix, tests to ensure that the registry keys had indeed been re-imported correctly, and so on. Then we would have had to have all of the internationalization teams re-run all of these tests as well, to cover the localization cases I mentioned above.
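For a rough sense of scale, the language combinations mentioned above can be counted in a few lines of Python. The figures 12 and 20 are my rounding of "about a dozen" and "20 or so", so treat the totals as order-of-magnitude estimates and not as the real size of the test matrix.

```python
from itertools import product
from math import comb

VS_LANGUAGES = 12       # "about a dozen" (rounded; an assumption)
WINDOWS_LANGUAGES = 20  # "20 or so" (rounded; an assumption)

# One VS language installed on one Windows language.
single = len(list(product(range(VS_LANGUAGES), range(WINDOWS_LANGUAGES))))

# Two VS languages installed side by side on one Windows language.
pairs = comb(VS_LANGUAGES, 2) * WINDOWS_LANGUAGES

print(single)  # 240
print(pairs)   # 1320
```

Each configuration would then be multiplied by the number of new dialogs and by the admin and non-admin cases.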
Other costs
So there's a bit more dev work involved in the fix, and certainly a lot more test work involved to verify the fix than one might have imagined originally. The "simple" fix doesn't look quite so simple now.
There are other complications, like the fact that at least part of the code path modified for the sake of the bug fix would be exercised every single time a device project was deployed via F5 (with debugging) or via Ctrl-F5 (without debugging). This may not sound like a big deal, but as a tester, I tend to get nervous when the answer to "How often will the user exercise this code path?" is "all the time." If nothing else, changing an often-used code path increases the risk of causing a regression. As much as I hate shipping software with a bug, I hate shipping software with a regression even more.
There was also a small but not negligible performance impact on the first ActiveSync deployment after starting VS while the integrity of the registry keys was being checked. All of these things added up to the fact that this was not a trivial fix for the bug in question.
Other possible solutions
Could we have asked the ActiveSync team to modify the uninstall behavior of their app to leave the registry keys VS installs alone? Sure, but this doesn't do anything but push most of the dev and test costs from our team to theirs; we'd just be passing the buck. And, as I mentioned before, there's a pretty strong argument to be made that ActiveSync isn't broken at all, and their uninstaller is doing exactly what it should be.
Besides, if the ActiveSync team had decided to change their behavior instead (and, although I don't know for sure, I find it unlikely that they would want to do so), we would still be saddled with the extra test cost of verifying that the fix they implemented actually solved the problem on our end.
An additional complication is the fact that the devices supported in VS.NET 2003 include Pocket PC 2000 and 2002 devices, all of which used ActiveSync 3.5. If the ActiveSync team had changed their product, then we would essentially have made VS.NET 2003 dependent on a newer version of ActiveSync than what many, many customers already had.
Maybe this doesn't sound so bad, but consider the worst-case scenario: Someone goes to CompUSA, and buys a brand new iPaq and a copy of VS.NET 2003. They come home, get their iPaq drivers and ActiveSync installed and working. Then they install VS, and want to write some managed code for their device. Whoops -- we tell them they need to upgrade their version of ActiveSync to be compatible with VS. But now what happens? The user uninstalls ActiveSync 3.5 that came with the device, installs the new (hypothetical) ActiveSync, and -- that's right -- they've broken VS already. So now they need to re-install (or repair, which is essentially the same thing) their VS installation.
Now they've had to install both ActiveSync AND VS.NET 2003 -- twice. By trying to fix the problem we've actually forced a bunch of our customers into the absolute worst-case scenario, or at least a close approximation of it. Not a very good end-user experience, right?
Testing a fix anyhow
Even with all of the issues I've called out, we still went ahead and built a private version of VS that included a prototype fix for the original bug. At least two people on our test team spent portions of two days running various manual and automated tests against the private build to see if the fix was at least reasonably sound.
It was sound, but it didn't include any ability to deal with the non-admin scenario, and it didn't have a particularly robust way of verifying the integrity of the registry keys in question, so it was clear that there was still work to do to get a fix that was good enough to ship.
Of course, the quality of the private fix we tested wasn't a reflection of the abilities of the developer who wrote it; we wanted a quick-and-dirty fix built in a short amount of time, just so we could see how it altered the behavior of the product and to get a better feel for the additional test work involved.
Not enough time
All of this discussion centers around the fact that it would have taken a lot of time to develop, build, and test a proper fix for this particular issue. A fair question to ask at this point would be, "if you knew about the bug before shipping, why didn't you have enough time to fix it?"
Rather than dance around the issue by talking about product schedules and such, I'll just cut to the chase. The bottom line is that there were two major factors:
Once we understood the actual likelihood and impact to the user of encountering the bug, as well as the details of what we had to fix and how we had to test the fix, it was unfortunately too late to take the fix.
The cost to fix a bug is lower the earlier it is found. Of course, it's easiest to fix a bug before you write a single line of code -- this is why it's important to find as many bugs in the specifications as possible, as early as possible. As you get closer to shipping the product, the cost rises exponentially -- to say nothing of what the cost becomes to fix a bug after you've shipped.
The bug stops here
This isn't to say that we were entirely beholden to the "ship date." It's critical to understand that this bug had the following properties, in decreasing order of importance:
If this had been a serious security bug, we would have fixed it. If it had caused data loss, or if it had affected the majority of our users, we would have fixed it. I want to be clear on those points.
The decision not to take the fix was a difficult one, because we all want to ship a completely bug-free product. Given the cost to develop, implement, and test the fix, the time available, the severity of the bug, the worst-case scenario for not fixing it, and the relative priority of other issues at hand, we wound up making the painful decision to not fix this bug, and a handful of others. Given all of those factors, I think we made the right choice. In the meantime, I think we have a reasonable workaround.
Conclusions
Not every single bug is given this level of detailed examination, but the same basic process of investigation holds for basically every product bug that we deal with, from spec review to release and to servicing. For the overwhelming majority of them, we do feel the love and that's a good thing.
I've probably exhausted you with details at this point, but that's a good thing. I hope I've been able to convey the fact that we take our bugs seriously, and we don't postpone fixing them without a lot of investigation and consideration. It's all part of managing the chaos.
Thanks for this very well thought out look from the inside. It all makes a little more sense now.
Comment by: Brian Graf at October 25, 2003 07:15 AM
Great article, thanks a lot.
As a former tester for Windows XP I highly appreciate this openness and effort to explain things that Microsoft never wanted to explain to general public before.
Greetings Stefen
(Discovered your link via Soble, http://radio.weblogs.com/0001011/)
Comment by: Stefen Niemeyer at October 26, 2003 09:14 AM
Very interesting read.
As a hobby beta tester for apps, I understand better now why sometimes the bugs I report aren't fixed (minor ones, of course).
Great article! It's good hearing about testing-type things from those who have more experience than I. Even if I was a pro, though, it's always great hearing about things like this, just to get another point of view.
Comment by: Stef at October 26, 2003 11:03 PM
Nice article. Glad someone has cleared my path. I'm a beta tester for quite a few software companies, but only learned about the bug cycle today.
Comment by: Jabez Gan at October 27, 2003 12:48 AM
Good to hear the details behind some mysteries in sw development. I wish articles like this were more widely circulated as it would make folks more likely to take certain problems in stride rather than go into high whine.
Comment by: Paul Cassel at October 27, 2003 06:12 AM
Thanks, I now understand more about the process of a bug and the fix/not fix steps. :^) Working with the MS testing team, I see why some bugs or problems are not fixed until after the RTS date.
Comment by: John D. Levesque at October 27, 2003 07:39 AM
Now I, for one...being a software tester, REALLY hope this helps all the whiners as to why we can't fix them all. He's right...we DO take it personal! It's why we beta test! For a bug to show up after the fact HURTS, but as this fine article showed, not everything is so "black & white".
Thanks for the explanation for the uninitiated,
Paul
A great read. Nice to see another perspective on the bugging process.
Comment by: Adam Field at October 27, 2003 12:53 PM
Fabulous article. Saved me from having to write it myself! :-)
Every time we fix a bug it takes up several person-days which could have been spent finding more important bugs, doing security reviews, designing the next version, implementing features, whatever. It is crucial to ensure that we stop bugs from getting written in the first place, and to brutally triage them once they're in. And I'll tell you, since I started using C#, whole taxonomies of bugs have disappeared from my code...
Comment by: Eric Lippert at October 27, 2003 05:34 PM
Great article
Comment by: Andrew at October 28, 2003 05:22 AM
The kinds of processes and decisions you make in fixing errors are virtually identical to the ones we use. Yes, there are known bugs that do not get fixed. I suspect, at minimum, your article will help inform people that bug fixes, especially late-date fixes, are expensive. Commonly a large part of the application has to be retested to ensure the "fix" has not broken anything else. Good work!
Comment by: Jim Edgar at October 28, 2003 09:09 AM
While I agree with your overall points in this article, I have to question your proposed solution to the bug. I can understand the concern about adding this check to the main code path. I think this is a valid concern. But it begs the question of why add it as a pre-deployment check as opposed to a post-failure check. Once the deployment fails you check if the keys were missing and then do whatever recovery is possible. Personally I would have put up a message box telling the user that registry keys are missing and to import the ProxyPorts.reg in the VS.Net directory. The users are assumed to be developers -- they should understand what this means and how to correct it. It is certainly much more preferable than getting a generic 'Deployment Failed' error.
Comment by: Billy Boy at October 28, 2003 10:30 AM
The post on anatomy of a bug is a very good read for a recent college grad like me. I have a question to follow up this post. I have had a chance to speak to some other MS testers before and I asked them if they tested for security vulnerabilities in their groups and their answer was no. I am surprised to find that you mention security vulnerability as your first choice. They told me that security testing was done by separate division. My question is at what stage do you begin security testing and what general techniques do you take to test security?
Comment by: Mahi at November 5, 2003 09:59 AM
This is a great question, and I want to provide a more detailed answer in a separate entry, but I'll touch on the major points here. We start testing for security as early in the development process as possible. Even before coding has begun, we review feature and design specifications for security issues. Every team in my division (the Developer Division, responsible for the CLR, the .NET Frameworks, Visual Studio, etc.) has people who are responsible for the security planning and testing process. We do threat modeling for every feature, and we create security tests as part of our test planning and design for every feature. For developers, fixing security bugs is among their highest priorities. It is true that there are dedicated teams around the company who have specific expertise in security testing, and they support the security efforts of individual feature teams. But the responsibility for doing the security work rests squarely with those who design, plan, develop, and test the software.
Comment by: Joe Bork at November 5, 2003 10:40 AM
Hallo,
Thanks for this article. I always try in my software
development to give descriptive errors. "Deployment Failed" is really not a descriptive error message,
and this shows how M$ things about the users. They know the error but they won't tell you. How can I know that there is a possibility to fix my installation? How can I know that a .reg file exists?
I think there is a big design bug in this whole thing. Error Handling!
Tell the user what is wrong, stating errors. Do not
tell them "the deployment failed" It sounds to me like "Get lost ashhole!"
Just now I am getting "There were build errors" also no error is displayed in the "to do list", and the program runs smoothly on the device. Again. The guy who designed, programmed, tested this should
think about this! You state fixing bugs costs money and time. Now I am wasting time and money, because M$ did not spend it. Customer deserve nearly bug free code, my customers expect it from my work.
But I am not M$, I am forced to fix them.
The article was good insight into MS bug policy, but I also agree with Billy Boy, who doubts the proposed solution. I believe your bug-fixing would have been quite more effective, if you had asked developer to find more than one alternative to the solution. There are always more than 3 solutions to a problem:
1. prevent it from happening in the first place (put the keys under your own program's space in registry or some common area for example - nothing changes in the user experience, even though both programs have to be changed in several places, if written ineffectively, but tests should be fine, only one new test added to test the very bug, no language dependency)
2. when the error occurs, make sure if that is caused by the known bug - fix it, let user do their part (your developer proposed it - not the best user experience)
3. let user dig into the big helpfile and find somewhere the answer (worst user experience of those)
So I am still convinced, that MS should refine their debugging/error-fixing strategies. Of course I understand that you probably had more critical errors to fix, but even this kind of bug should eventually get fixed instead of forever giving the bad experience of importing registry entries again under certain conditions.
Chris: You are right, the "Deployment Failed" error message is not particularly helpful, and we're working to make the experience of deployment and debugging more reliable and informative in the next release. I agree that the solution is not very discoverable, but we're trying to publicize the availability of a workaround as much as possible, including links on MSDN and the knowledge base.
With regards to your build errors, I can understand that you are frustrated. If you would like to provide more details about the error you get and the project you're building, we'll be happy to take a look at it. Feel free to post a question in the public newsgroup (see: http://msdn.microsoft.com/newsgroups/loadframes.asp?newsgroup=microsoft.public.dotnet.framework.compactframework). Failing that, you can call PSS or you can email me (joebork@microsoft.com).
On a personal note, however, your use of the moniker "M$" does not help you make your point. If you wish to follow up on this issue with me or my coworkers, I would appreciate it if you did not use it.
Comment by: Joe Bork at November 11, 2003 08:50 AM
"Programmer": I may not have made it as clear as I should have, but your proposal #1 was something we considered but discarded since it would have required changes to the ActiveSync code (an entirely separate product), and this would have created unacceptable version dependencies between ActiveSync and Visual Studio. As for #2, it was not just the developer who proposed the solution we wound up testing. That solution was a result of discussions between members of the development, test, and program management teams, and we decided that it was the most appropriate compromise. In the end, we essentially settled with your #3, which requires the user to search in help, MSDN, or the knowledge base to find the workaround. As I have mentioned before, this is far from an ideal solution, and we want to make the experience much better in the future.
However, it would be improper and incorrect to draw any broad conclusions, like when you say you are convinced "that MS should refine their debugging/error-fixing strategies". My original entry dealt in detail with the process of fixing or not fixing one single bug in one piece of one product that Microsoft makes. I tried to be very explicit about this; indeed, I have not related all of the minutiae about even this one issue that we considered. I am convinced that we have a responsible and effective method to manage the finding, evaluating, and fixing of bugs, and I believe that I am not alone among Microsoft employees in feeling so.
I was afraid that people might misconstrue the article I wrote as an example of how "Microsoft ships buggy software on purpose," even though such a statement ignores the fact that shipping perfect software is impossible. Instead, I believe that the story I shared demonstrates how much effort and consideration at least my coworkers (and I believe Microsoft as a whole) put into resolving bugs and other defects in our software.
Comment by: Joe Bork at November 11, 2003 09:04 AM
i just got started as a software tester. this article really helped me know somethin abt what i will face in future. thanx.
Comment by: anjani at December 18, 2003 03:57 PM
Niccceee pagee
Comment by: Acanty at February 20, 2004 04:45 AM
Very interesting and useful article.
Comment by: mrc at February 28, 2004 01:10 PM
Very interesting article. I would be one of "them" in the article asking the exact questions to the author. After reading this article I am more considerate to testers now. I really spent two days trying to fix the connection problem myself which was caused when I did a windows update, which is another issue for the tester's I guess.
This reminds me of other products' update procedure, like Symantec does. They have that liveupdate, which works fine. Microsoft itself uses windows update, but only for windows. For office you have a separate update, which is not quite clear for john doe user.
If a update (one which would warn the user about the error more precisely, or check for the absence of registry keys, whatever) would be published, it surely would need administrator rights.
This wouldn't reduce the effort for bug fixing, testing and translation, however would be way much better than 1) let the user search MSDN and several help files 2) keep joe user asking to the admin about this fix.
I'm speaking with no absolute knowledge of cause, I admit. But this kind of bug fixing is what I like in products like Symantec and some others.
Comment by: Alexandre Strube at April 28, 2004 05:33 AM
I find the article well written and much needed.
I am a software developer, and I find that in the specific case of this article the problem goes back to the devs.
Why on earth are they writing in another application's registry keys???
Has no one thought of the simple solution to just stop writing keys to the ActiveSync portion of the registry and write them in the portion where they belong, namely that of the VS.NET 2003 mobile dev section?
I am really dumbfounded that Microsoft is actually shipping software that violates good practices like this? I am not sure where, but I would bet I have read somewhere that you shouldn't write in another app's registry section.
My two cents...
Comment by: Sven Aelterman at August 13, 2004 10:01 AM
Post #41 hits the nail on the head. It seems to me that the design itself is kludgy, and workarounds consist of hacks to a kludgy design. It is far better to start with a clean design.
With software getting more and more complex, and with complex interactions between different systems and subsystems, the importance of a well thought out design cannot be underestimated. The first principle of good software design is KISS (keep it simple and stupid).
Comment by: Pankaj Gupta at November 1, 2004 11:28 AM
Good read. Informative, thoughtful, detailed, and it went in-depth enough for the consequences to be understood.
Unfortunately, I can't echo the sentiments of others when they say this will be helpful for explanations. This sort of thing gets explained all the time (admittedly, not nearly so cogently), but it never stops the whiners. Sure, you'll get a few, but you won't get as many as you hope.
Still, I'm adding this to my bookmarks for future use as a reference...maybe it'll help me some in similar situations.
Comment by: Jeff Walden at November 23, 2004 11:52 PM