Debugging stories

I just spent 1-2 hours debugging my sgeSetError() function because it segfaulted sometimes.

In the end, it was all because I used snprintf instead of vsnprintf. When I realised, I undid all the changes I had made in the last hour or so and, for about a minute, repeated "Seriously?!" to myself.

It took me over an hour, pretty much just to press 'v' twice (two calls to vsnprintf).
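
For anyone who hits the same thing, here is a minimal sketch of the two-call vsnprintf pattern. It assumes a hypothetical format_message() helper, not the real sgeSetError(), whose code isn't shown here. Inside a variadic function the arguments arrive as a va_list, so they have to go to vsnprintf. Passing the va_list to snprintf compiles, but snprintf then treats the va_list object itself as the first variadic argument, which is undefined behaviour and can crash only some of the time.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration; the name and signature are assumptions.
std::string format_message(const char* fmt, ...)
{
    assert(fmt != nullptr);

    va_list args;
    va_start(args, fmt);

    // A va_list must not be reused after a v* call has consumed it,
    // so take a copy for the second call.
    va_list args_copy;
    va_copy(args_copy, args);

    // First call: write nothing, only measure the required length.
    // Wrong here would be: std::snprintf(nullptr, 0, fmt, args);
    const int needed = std::vsnprintf(nullptr, 0, fmt, args);
    va_end(args);

    if (needed < 0) {            // encoding error reported by vsnprintf
        va_end(args_copy);
        return std::string();
    }

    // Second call: format into a buffer with room for the terminating NUL.
    std::string out(static_cast<std::size_t>(needed) + 1, '\0');
    std::vsnprintf(&out[0], out.size(), fmt, args_copy);
    va_end(args_copy);

    out.resize(static_cast<std::size_t>(needed));  // drop the terminating NUL
    return out;
}
```

With this sketch, format_message("code %d: %s", 42, "oops") yields the string "code 42: oops".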

I hate debugging.
Once, 12 years ago, when I worked in OS development, I was assigned a bug reported by one of our telecom customers: once every few months, a thread on their main production server would stop making progress. It showed up as 'running' in ps, but wasn't using any CPU time. The server was handling millions of interrupts a second, running 300+ threads doing many different things, and nobody could reproduce this in controlled conditions.

The next time it happened, I drove over (in Silicon Valley, everyone was 10 minutes away), dropped the OS into the kernel debugger and saw that, well, it was "running" in the process table. But it wasn't even present in any of the scheduler's queues, so the scheduler never gave it time slices. There was nothing at all suspicious about anything else. I went over every code path in the scheduler, added a bunch of trace code (it had to be fast; I don't remember exactly what I did, I think I filled out a bunch of structs in kernel memory), patched their kernel and let them reboot.

Two car trips and customer-paid lunches later, I tracked it down: it was a three-thread race condition. It happened while the user thread was waking up and was the only awake thread at its priority level. A kernel thread that was adding it to the special scheduler data structure for one-per-priority threads got preempted by another thread at a specific line of code (that thread was also demanding rescheduling), which was safe on its own. But a third kernel thread then preempted the second one, also to reschedule, at another very specific line of code, which was also safe on its own. In combination, the two gave me my first taste of the ABA problem.
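
For readers who haven't met the ABA problem, here is a generic sketch. It is not the scheduler code from the story, which isn't shown here. It uses a textbook lock-free stack pop as an assumed stand-in, and plays the "threads" out as sequential steps on one thread so the interleaving is deterministic. All names (Node, aba_demo) are made up for illustration.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    Node* next;
};

// Replays one fixed interleaving of a lock-free stack pop.
// Returns true if the stale compare-and-swap succeeded and left the
// stack head pointing at a node that had already been removed.
bool aba_demo()
{
    Node c{3, nullptr};
    Node b{2, &c};
    Node a{1, &b};
    std::atomic<Node*> head{&a};          // stack: a -> b -> c

    // "Thread 1" begins a pop: it reads the head and the head's successor,
    // then gets preempted before it can swap the head.
    Node* seen_head = head.load();        // &a
    Node* seen_next = seen_head->next;    // &b
    assert(seen_next == &b);

    // "Thread 2" runs in the gap: pop a, pop b, push a back.
    head.store(&b);                       // popped a, stack: b -> c
    head.store(&c);                       // popped b, stack: c
    a.next = &c;
    head.store(&a);                       // pushed a, stack: a -> c

    // "Thread 1" resumes. The head is &a again (A, then B, then A),
    // so the compare succeeds although the stack changed underneath it.
    Node* expected = seen_head;
    const bool swapped = head.compare_exchange_strong(expected, seen_next);

    // The head now points at b, which is no longer part of the stack.
    return swapped && head.load() == &b;
}
```

In this replay aba_demo() returns true: each step is fine on its own, and only the combination corrupts the structure. Common remedies are a version counter stored next to the pointer, or deferred reclamation such as hazard pointers.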

I love debugging. It's a source of the new and unexpected.
Once I spent 2 days debugging a single function. All I had to do was change
int diff_new=abs(sum_a-i)-(sum_b+j); to int diff_new=abs(sum_a-i+j)-(sum_b-j+i);
I can't believe how stupid I was.

Now I'm trying to convert the same program to Java, and again, I have no idea why it doesn't work. Arghhhh!
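Just yesterday I spent three hours finding out that I needed to change lua_tounsigned( m_L, 1 ) to lua_tounsigned( m_L, lua_gettop( m_L ) ). Index 1 is the bottom of the Lua stack, while lua_gettop( m_L ) is the index of the top element.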
In my latest project, I would get an out-of-bounds error once every few runs (it's not a deterministic algorithm). I never put any validation code in lookup functions, because a) vectors come with asserts in debug mode, b) checks slow down the program significantly (and require me to remove them again afterwards), and c) the program should crash on bad input, because handling the error makes no sense anyway. The only downside is that I get a vague runtime error and have to search for the faulty line myself.
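
As a rough illustration of that trade-off, here is a sketch with two hypothetical lookup helpers (the names are made up, not from the project above). The first uses assert, which reports the file and line when it fires and is compiled out when NDEBUG is defined, so it costs nothing in a release build. The second uses std::vector::at(), which is always checked and throws std::out_of_range.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked only in debug builds: with NDEBUG defined the assert disappears
// and this is a plain unchecked v[i].
int lookup_debug_checked(const std::vector<int>& v, std::size_t i)
{
    assert(i < v.size() && "lookup index out of range");
    return v[i];
}

// Always checked: at() throws std::out_of_range on a bad index.
// Returns false instead of crashing so the caller can log the context.
bool lookup_checked(const std::vector<int>& v, std::size_t i, int& out)
{
    try {
        out = v.at(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

The assert variant still crashes on bad input, which matches point c), but the failure message names the exact line instead of leaving you to hunt for it.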

For the first time in many, many projects, I managed to somewhat organize my code in a readable fashion. I was quickly, and with minimal effort, able to pinpoint the phase where the problem was. I found the line, ran over the process in my head and shortly after understood when and why the problem would occur. So I fixed it, recompiled and ran it a few times. And then it crashed again.

I was convinced my fix should have worked. I repeated the search process and ultimately came to the same piece of code. I wrote down several scenarios on paper, but couldn't figure out why it would still crash. I ran the debugger, which was a pain in the ass, considering the crash happened once every few runs in one of the many, many iterations of that line. I couldn't add a manual check line to put a breakpoint on, because I had no idea why it was still crashing after the fix. I started throwing checks and "fixes" around in places I was convinced the problem couldn't be, out of pure desperation.

Ultimately, I assumed my fix didn't fix it at all, put a check on the case that would have broken the old version, and stumbled across an iteration where it would crash. Using pen and paper to follow the case, I quickly noticed that my fix didn't work. Seconds later, I realized the behaviour of the variables was not consistent with what they should do in the fixed version. Minutes later, I realized I had been fixing the algorithm in a backup copy of the file, not in the copy I was actually building and running.

Exactly 4 seconds later, I fixed the line in the correct file, and the program has run perfectly ever since.