The Philosophy of Bug-fixing

I spend way too much time fixing bugs. I don’t say that because my code is excessively buggy (although that is also true); I say it because quite a few of the bugs I fix are probably not worth fixing. Not really. Let’s imagine…

Your customer has submitted, along with vague threats of ninja assassins being dispatched to your company headquarters, a bug report about a crash in your software. He cannot really supply any useful information on what led up to the crash but the error message tells you exactly where in the code the crash occurred. Firing up your favourite editor you take a look. The crash happens because an unallocated resource is freed.

A bug:

mystificate(wizard)
 ...

 fire(wizard)

The crash is easily prevented: just check if the resource is allocated before freeing it. The thing is, that resource should be allocated at this point in the code. You know that it usually is, or you would have seen this crash before. The crash, while a bug in itself, is clearly also a symptom of another bug. What do you do and why?

A bug fix:

mystificate(wizard)
 ...

 if (hired(wizard))
    fire(wizard)
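
For the literal-minded, here is a minimal runnable sketch of the same situation in Python. The `Wizard` class, the `hired` flag, and the split into `mystificate_buggy` and `mystificate_patched` are assumptions made purely for illustration; the pseudocode above does not say how any of this is actually implemented.

```python
class Wizard:
    """Hypothetical stand-in for the resource in the pseudocode."""

    def __init__(self, name):
        self.name = name
        self.hired = False  # "allocated" in resource terms


def hire(wizard):
    wizard.hired = True


def hired(wizard):
    return wizard.hired


def fire(wizard):
    # Freeing an unallocated resource is the crash from the bug report.
    if not wizard.hired:
        raise RuntimeError(f"{wizard.name} was fired without being hired")
    wizard.hired = False


def mystificate_buggy(wizard):
    # ... other work elided ...
    fire(wizard)  # crashes whenever the wizard is not hired at this point


def mystificate_patched(wizard):
    # ... other work elided ...
    if hired(wizard):  # the guard prevents the crash
        fire(wizard)
    # Nothing here explains why the wizard was not hired to begin with.
```

The patched version never raises, which is exactly the problem: the symptom is gone, and whatever left the wizard unhired is still out there.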

Any diligent programmer would at least spend a little time considering the implications of this discovery. If you can easily figure out how the resource came to be unallocated then perhaps you can fix that too. And you should. You may also realise that the unallocated resource is a fatal error which absolutely, positively, must be fixed or the plague breaks out. Put on your mining helmet and start digging. Perhaps your product is a pacemaker. Fix the bug.

Most of the time though, the situation is not as clear cut. You just don’t know what the implications are. The bug could be a corner case with no other consequences than the crash that you just fixed. Or it could be a symptom of some intricate failure mode that may cause other bad things to happen randomly. I would guess that most competent programmers would feel decidedly uncomfortable papering over the problem. Some may even find it hard to sleep at night, knowing that the bug is out there, plotting against them. They lie awake worrying about when and how the bug will crop up next. So you start to investigate.

Initial state.

You rule out the most likely causes. You try to repeat the problem. You try to make it happen more often. If you’re an embedded programmer you verify five times that RX and TX pins haven’t been swapped somewhere. You instrument the code. You meditate over reams of trace printouts. You start to suspect that this is a compiler bug, but that is almost never really the case. You read the code (desperation has taken hold). You poke around in a debugger. You remove more and more code, trying to reduce the problem. You start to become certain that this is a compiler bug (it never is). You dream about race conditions. You stop showering. You start to have conversations with an interrupt handler like it was a person. You dig into the assembly. You now know that it is a compiler bug (it isn’t). You mutter angrily to yourself on the subway. A colleague rewrites a completely unrelated part of the code and the symptom inexplicably goes away. Your cat shuns you. You start to question everything; how certain are you that a byte is 8 bits on this machine?

Intermediate stage.

After aeons of hell a co-worker drops by. She looks over your shoulder and asks “isn’t that resource freed right there?” And sure enough — in the code that you’ve been staring at for weeks, a mere 4 lines above the bad free, the resource is freed the first time. To say that the error had been staring you in the face the whole time would be understating it. You kill yourself. The cycle starts again.

The final state.

The terrible truth:

mystificate(wizard)
   fire(wizard)

   hire(izzard)
       if (flag)
           occupy(india)

   if (hired(wizard))
       fire(wizard)

Or maybe the bug turned out to be something really gnarly that could explain dozens of different bug reports.

Should you have fixed that bug? The thing about bugs is that they are unknown. Until you know what is going on you can’t tell whether the bug will cause catastrophic errors or is relatively harmless, right? The bug is what Donald Rumsfeld would call a “known unknown,” something that you know that you don’t know. Not tracking down the bug means living with uncertainty. Fixing bugs increases the quality of the software, so fixing any bug (correctly) makes the software better. You can easily convince yourself that fixing the bug was the right thing to do, even if it turned out to be relatively harmless. But you’re probably wrong.

The “known unknown” argument cuts both ways. Even if the bug turns out to be serious you didn’t know that beforehand. Putting in serious effort before you know what the payoff would be may not be the wisest way to allocate your time. And the question isn’t really “did the fix make the software better” but “was the time spent on the fix the best way to make the software better”.

Living with uncertainty can be difficult, but let’s face it: your code probably has hundreds or thousands of unknown bugs. Does it actually make a difference that you know that this one is out there?

Am I saying that it was wrong to start looking for the bug? Definitely not. If you suspect that something is amiss, you should absolutely strive to understand what is going on. The difficulty lies in knowing when to stop. There’s always a sense that you are this close to finding the problem. If you would just spend 10 more minutes you would figure it out, or stumble upon some clue that would really help. Now you’ve spent half a day on the bug, might as well spend the whole day since you’ve just managed to fit the problem inside your head. You’ve spent a day on it, might as well put in another hour just to see where this latest lead goes. You’ve spent three days and if you don’t fix it now you’ve just wasted three days. You’ve spent two weeks and dropping the whole thing without resolution at this point would cause irreparable psychological damage. The more you dig the deeper the hole becomes.

So what do you do to prevent this downward spiral? Talk to a co-worker. I know it might seem uncomfortable and unsanitary and possibly even unpossible, but it can be an invaluable strategy even if your colleagues are all muppets. Take ten minutes a day to talk about where you are and how it is going. Just describing the problem often makes you realise what the solution is. “So the flag is unset when frobnitz() is called and… and… never mind, I am an idiot” is a common conversation pattern (although “never mind, you are an idiot” is idiomatic in some places). Sometimes this even works with inanimate objects like managers. This is sometimes called “rubber duck debugging”: carefully explain your problem to your rubber duck and the fix will become apparent. Don’t own a rubber duck? What is wrong with you??

Even if describing the problem doesn’t help, a sentient co-worker has a tendency to ask if you have checked 10 different things that are so obvious that how can you even ask, and do you think I’m completely incompetent and, by the way, no, I meant to get around to it. “Did I recompile the code after changing it? No, do you have to?” If your colleague is very polite you may have to insult him a bit before he asks the obvious questions. Obvious questions can be hard on the ego but very useful because the odds are in favour of the obvious things. For some reason the human mind seems to prefer the mystical and complex (“it’s a compiler bug”) to the simple and likely (“I forgot to recompile”).

The final reason it is good to talk to a co-worker is that you get some distance from the problem. Just discussing loosely where you are and what you know and how much time you’ve spent can really help when you start to suspect that you should strap your bug-fixing project to an ejection seat.

Conclusions:
  • Don’t get lost fixing a hard bug; keep an eye on the big picture. This is a lot easier to write than to do.
  • Learn to live with uncertainty. If this is impossible: embrace misery.
  • Co-workers can serve a useful purpose other than being on the opposite side of indentation wars.