I have something of a reputation for fixing obscure bugs. At one point you could probably have called it a weird kind of hobby. The idea was that if you really had tried everything and still hadn’t found your problem, I would take a wager for a dinner: either I find and fix the bug and you buy me dinner, or I buy you dinner. It was a fun thing to do.
Now, of course, simply the perspective of a clueless newbie taking a fresh look at a problem will often flush it out with relatively little effort (those were the easy ones).
But every now and then you’d run into something of a different order: a bug that occurs only once every week or so, but leaves a whole company dead in the water while a bunch of stuff gets restarted. Or a bug that would occur without any indication whatsoever other than, say, subtle data corruption, only realised long after the fact.
For those really tough cases I worked out a method, and the method is this:
ASSUME ABSOLUTELY NOTHING.
People would cringe when I began by shutting down the machine the bug was showing up on and meticulously checking the hardware: the power supply voltages, the temperatures, whether the RAM was seated properly, running memtest for a day or so, and making sure all the cabling was good.
This simple pre-check probably fixed another 30 to 50% of the ‘bugs’. The problem here was not that the software was bad, just that people tend to assume a software problem because software problems are so much more common than hardware problems.
Most of the time, this sort of starting from the ground and working your way up was met with headshaking by those involved; after all, they’d checked ‘everything’ already 10 times over.
After getting rid of this second class of problems you’re left with an even harder category, the ‘flakeys’, bugs that are not hardware related (you’ve hopefully ruled that out by now), that have no direct cause and that are difficult to reproduce in a given amount of time.
Man how I hate those… but, with my reputation solidly on the line it would be silly to give up at this point, after all, this is the first time that you’re really debugging an actual problem.
The next trick in the debugger’s toolbox is divide and conquer. Try to partition the arena where the problem occurs in such a way that each occurrence rules out half of the code that could be responsible, so that after every occurrence the remaining problem is half the size.
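As a rough sketch of that halving idea, assume the suspect code can be split into numbered stages and that you can re-run the first few stages and then check whether the data is still sane. Both assumptions are mine, and so are the names `first_bad_stage`, `state_ok_fn` and `demo_state_ok`. The sketch also assumes the bug reproduces on every run and that once the data is bad it stays bad.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: run the first `stages_to_run` stages on fresh
   input and return nonzero if the resulting state is still valid. */
typedef int (*state_ok_fn)(size_t stages_to_run, void *ctx);

/* Returns the 0-based index of the stage that first breaks the state,
   or n_stages if the full run is fine. */
size_t first_bad_stage(size_t n_stages, state_ok_fn state_ok, void *ctx)
{
    size_t lo = 0;         /* running `lo` stages is assumed good */
    size_t hi = n_stages;  /* running `hi` stages is known bad    */

    assert(state_ok != NULL);

    if (state_ok(n_stages, ctx))
        return n_stages;   /* nothing went wrong on this run */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (state_ok(mid, ctx))
            lo = mid;      /* culprit is in the later half   */
        else
            hi = mid;      /* culprit is in the earlier half */
    }
    return lo;
}

/* Stand-in for a real check: pretends that the stage whose index is
   stored in *ctx is the one that corrupts the state. */
int demo_state_ok(size_t stages_to_run, void *ctx)
{
    size_t bad_stage = *(size_t *)ctx;
    return stages_to_run <= bad_stage;
}
```

With a flaky bug each probe may need many runs before you can trust a "good" answer, which is exactly why the rare ones take so long.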
Repeat this several times and you’ll be literally staring at the bug, which means you can usually add a couple of assertions (which are the C programmer’s way of checking whether their assumptions still hold true at runtime) and on the next run you will find out what the cause of the problem is.
Unless, of course, you don’t…
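To make the assertion idea concrete, here is a minimal sketch using the standard `assert` macro from `<assert.h>`. The ring buffer and its names (`struct ring`, `ring_check`, `ring_push`, `ring_pop`) are made up for illustration; the point is only that the assumptions get checked on entry and exit of every operation.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAPACITY 8

struct ring {
    int    items[RING_CAPACITY];
    size_t head;   /* index of the oldest item */
    size_t count;  /* number of stored items   */
};

/* The assumptions the rest of the code silently relies on. */
static void ring_check(const struct ring *r)
{
    assert(r != NULL);
    assert(r->head < RING_CAPACITY);
    assert(r->count <= RING_CAPACITY);
}

/* Returns 1 on success, 0 if the ring is full. */
int ring_push(struct ring *r, int value)
{
    ring_check(r);
    if (r->count == RING_CAPACITY)
        return 0;
    r->items[(r->head + r->count) % RING_CAPACITY] = value;
    r->count++;
    ring_check(r);
    return 1;
}

/* Returns 1 on success, 0 if the ring is empty. */
int ring_pop(struct ring *r, int *out)
{
    ring_check(r);
    assert(out != NULL);
    if (r->count == 0)
        return 0;
    *out = r->items[r->head];
    r->head = (r->head + 1) % RING_CAPACITY;
    r->count--;
    ring_check(r);
    return 1;
}
```

If something else scribbles over `head` or `count`, the next call aborts with the file and line of the failed check instead of quietly corrupting data. Keep in mind that defining `NDEBUG` compiles the checks away, so make sure the build you are chasing the bug in still has them.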
Now, most of the stuff we’re talking about here is production C code, and even though C is often referred to as ‘glorified assembler’ the distance between the two is still considerable.
Switching the view from C source code to assembly (that -S option on your compiler is good for something after all) you get to see the code the processor actually executes. That doesn’t help in terms of line count, but it does help in terms of removing a veil that hides what is really going on.
Chances are that even when you’ve crossed off all the possible culprits (uninitialised variables, double frees, pointer overwrites, array overwrites and so on, which will take care of a good portion of the remaining issues; fortunately compilers are much better nowadays at spotting these errors) you’re still left with a problem.
That problem is simply that even though looking at the assembly code has given you greater insight into what is going on under the hood, it does not give you the ability to backtrack through more than a very small fraction of a second’s worth of actual execution. In a few seconds a processor can produce a trace longer than a human could analyse in a lifetime.
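One cheap way to cross array overwrites off the list is to put guard words around a suspect buffer and check them at convenient points. The sketch below is only an illustration under my own assumptions: the names (`struct guarded_buf`, `GUARD_WORD`, `guarded_init`, `guarded_intact`) and the buffer size are invented, and it will only catch an overrun that actually lands on one of the guards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUARD_WORD 0xDEADBEEFu  /* arbitrary, easy to spot in a hex dump */

struct guarded_buf {
    uint32_t front_guard;  /* trampled by writes before data[0]   */
    char     data[16];
    uint32_t back_guard;   /* trampled by writes past data[15]    */
};

void guarded_init(struct guarded_buf *b)
{
    assert(b != NULL);
    b->front_guard = GUARD_WORD;
    memset(b->data, 0, sizeof b->data);
    b->back_guard = GUARD_WORD;
}

/* Returns nonzero while both guard words still hold their pattern. */
int guarded_intact(const struct guarded_buf *b)
{
    assert(b != NULL);
    return b->front_guard == GUARD_WORD && b->back_guard == GUARD_WORD;
}
```

Sprinkle `assert(guarded_intact(&buf));` between the steps you suspect, and the first failing check tells you which step did the damage.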
The next trick I would employ at this stage is to see if I could trigger the bug by varying system parameters. Overload the machine on purpose and see if that increases the frequency of occurrence, that sort of thing. After all, if bugs get harder to fix as they occur less frequently, then making them occur more often will make them easier to fix.
If a bug is somehow timing related then manipulating the load will usually bring it about (or make it go away entirely, very frustrating). Other parameters that you can manipulate are the amount of available memory (rip out a few of those DIMMs, gasp!) and the disk load, for instance by running a bunch of benchmarks alongside the program that you’re trying to debug, and so on.
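If you have no benchmark handy, a few lines of C will do for CPU load. This is a minimal sketch using only the standard `clock()` function; the name `burn_cpu` is made up, and a real load test would also want to stress memory and disk.

```c
#include <assert.h>
#include <time.h>

/* Spin for roughly `seconds` of CPU time and return the number of loop
   iterations completed. Returns 0 if the clock is unavailable. */
unsigned long burn_cpu(double seconds)
{
    volatile unsigned long sink = 0;  /* keeps the loop from being optimised out */
    unsigned long iterations = 0;
    clock_t start = clock();
    clock_t budget = (clock_t)(seconds * CLOCKS_PER_SEC);

    assert(seconds >= 0.0);

    if (start == (clock_t)-1)
        return 0;

    while (clock() - start < budget) {
        sink += iterations;
        iterations++;
    }
    return iterations;
}
```

Wrap it in a small program and start one copy per core next to the program under test, then watch whether the bug shows up more often, less often, or not at all.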
In the end, I think I may have bought one or two dinners over the course of many years, and all the rest of the bugs were squashed using these simple techniques. In spite of the name, I never used a debugger for any of this, mostly because when I grew up with computers we didn’t have them, so I learned to do without. Maybe I would have found some of the problems above easier to solve with a debugger, but to date I haven’t run into a problem so severe that I needed one. Yet ;)
The ‘Assume Absolutely Nothing’ rule is a good one I think, and even if times have changed considerably I think it still stands as strong as ever. If you are debugging and you do not check your assumptions then you are building your house on quicksand.