Nightmare Debug Problem

I’m having great difficulty tracking down an issue causing a crash in our overnight build and test suite. I’ve been trying to resolve it for a couple of days now.

We have a nightly build system in place that pulls down all our (Borland BDS 2006) code (a C style SDK), builds it, ups the version and then checks the updated version files into the svn repo. A suite of about 800 tests is then run against that build.

The problem arose when a particular item of HW was added to the suite. Our suite now crashes nightly and hasn’t had a complete run for a while now. I’ve gone through the following to try and get to the bottom of this:

1) Set up on my local machine, attached the debugger and tried to catch any exceptions, but could not reproduce the crash, even with the same test environment and HW. I then ran again on my local machine but without the debugger, in case the issue was timing related. Again it did not crash.
2) I set up remote debugging on the test machine to try and get hold of the stack or any exceptions on crash but when the crash occurred it brought the entire debugger down with it.
3) I then started going through all of the nightly build files using a “binary search” style method until I located the breaking build. I got down to one particular binary but on inspection of the svn log found that it corresponded to a massive merge from another branch. There was a list of merged revisions and range of revisions left on the log.
4) I then started a similar search on the second branch, between the lowest and the highest version numbers in the log and again found what I thought to be the break. However, on inspection I found that the code changes here didn’t seem to correspond with what had been merged forward.
5) Considering the above and the fact that there was an element of uncertainty with my search method (I was only using a small subset of the tests to determine build health, so I could never be completely sure that a build was healthy) I decided to change approach. I got a copy of Memproof in order to see if there were any allocation errors. But I could not get the thing to attach without crashing the entire suite.
6) Set up the debugger on the test machine and checked out the source. Ran a few times. The debugger eventually crashed (without a stack). Violation message in a Borland dll (from the debugger). Have now left the thing running overnight with the debugger again, in the vain hope that when I go in tomorrow it’ll be telling me exactly where the problem is.

…which it probably won’t. I’m getting pretty near my wits’ end. All I can think of now is to try and get hold of an advanced profiling tool or start adding debug output to the code. I’m feeling drained; does anyone have any advice and/or leads here before I go mad?
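
For what it's worth, the build search from steps 3 to 5 can be sketched roughly as below. This is only a sketch under assumptions: `run_build_fn`, `sim_state` and `simulated_run` are hypothetical names made up for illustration, and the "run the subset several times before calling a build healthy" rule is one possible way to reduce the uncertainty mentioned in step 5, not something we actually have in place. It also assumes that every build after the break is bad, which a large merge may not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if one run of the test subset
   against build number `build_index` finished without crashing. */
typedef int (*run_build_fn)(int build_index, void *ctx);

/* A build only counts as healthy if `runs` consecutive runs all pass.
   This trades time for confidence when the crash is intermittent. */
static int build_is_healthy(run_build_fn run, void *ctx, int build_index, int runs)
{
    int i;
    for (i = 0; i < runs; ++i) {
        if (!run(build_index, ctx)) {
            return 0; /* one crash is enough to call the build bad */
        }
    }
    return 1;
}

/* Binary search for the first bad build.
   Preconditions (assumed, not checked by running anything):
   build `good` is healthy, build `bad` crashes, and good < bad. */
int find_first_bad_build(run_build_fn run, void *ctx, int good, int bad, int runs)
{
    assert(run != NULL && good < bad && runs > 0);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (build_is_healthy(run, ctx, mid, runs)) {
            good = mid; /* break is somewhere after mid */
        } else {
            bad = mid;  /* break is at mid or earlier */
        }
    }
    return bad;
}

/* Hypothetical stand-in for the real suite: builds from `first_bad`
   onwards crash only on every `crash_period`-th run overall. */
struct sim_state {
    int first_bad;
    int crash_period;
    int calls;
};

int simulated_run(int build_index, void *ctx)
{
    struct sim_state *s = (struct sim_state *)ctx;
    s->calls++;
    if (build_index >= s->first_bad && s->calls % s->crash_period == 0) {
        return 0; /* simulated intermittent crash */
    }
    return 1;
}
```

With the simulated suite, a single run per build could misjudge a bad build as healthy, whereas three consecutive runs always hit the simulated crash. For the real suite the number of repeats would have to be picked from how often the crash actually shows up.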
you can do it!
What was the new hardware? Did it replace an older unit with a similar function? Is it a piece of hardware that would require a driver, such as a video card or printer? Can you test the memory on the problematic system (MemTest86)?
It's our own HW. Our SW is designed to control our entire range of HW. This particular piece was USB.

We rotate the HW regularly, and prior to this particular rotation we hadn't had a problem.

I haven't used Memtest86 before. Will investigate. The main feature that I need for my purpose is to be able to get back to a code line in the source.
MemTest86 has nothing to do with the software you and/or your company wrote. It tests the physical hardware for errors. You said the issue is not occurring on your machine, so I immediately thought hardware might be the problem.

A USB what? A USB printer? A USB keyboard? Some USB devices still require drivers. If any of your software tries to talk to a driver that has changed, then you might see a failure.

Writing in some output code to your scripts may not be a bad idea for something this complex.
Thank you.

I used the exact same test suite HW on my local machine. But now that you mention it, I didn't check the different driver versions installed on both machines. That might be a lead, as we have different releases of our device drivers.

We don't use scripts for the testing, rather dlls (one per test), which call into the main SDK. Our SDK already has DEBUG output but I fear that this will interfere with timings, so I'm probably going to have to add my own, both to the SDK and the test code. A colleague suggested a try/catch in case there are any allocation exceptions thrown.
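
Something along these lines is what I have in mind for the extra debug output: a small in-memory ring buffer that only copies a short string per call, so it should disturb timings less than writing to a file every time. This is just a sketch, assuming single-threaded use; `trace_add`, `trace_dump` and the buffer sizes are names and values I've made up, not part of our SDK. The obvious catch is that the buffer lives in process memory, so it only helps if something still gets the chance to dump it (for example an exception handler, or a periodic dump between tests).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TRACE_SLOTS   256 /* entries kept; the oldest are overwritten */
#define TRACE_MSG_LEN 64  /* longer messages are truncated */

struct trace_entry {
    unsigned long seq;        /* running number of the message */
    char msg[TRACE_MSG_LEN];
};

static struct trace_entry trace_buf[TRACE_SLOTS];
static unsigned long trace_next_seq = 0;

/* Record one message. No I/O happens here, only a bounded string copy.
   Not thread-safe: assumes a single caller thread. */
void trace_add(const char *msg)
{
    struct trace_entry *e;
    assert(msg != NULL);
    e = &trace_buf[trace_next_seq % TRACE_SLOTS];
    e->seq = trace_next_seq++;
    strncpy(e->msg, msg, TRACE_MSG_LEN - 1);
    e->msg[TRACE_MSG_LEN - 1] = '\0'; /* strncpy does not terminate on truncation */
}

/* Write the retained messages, oldest first, and return how many were written. */
unsigned long trace_dump(FILE *out)
{
    unsigned long count = trace_next_seq < TRACE_SLOTS ? trace_next_seq : TRACE_SLOTS;
    unsigned long first = trace_next_seq - count;
    unsigned long i;
    for (i = first; i < trace_next_seq; ++i) {
        const struct trace_entry *e = &trace_buf[i % TRACE_SLOTS];
        fprintf(out, "%lu %s\n", e->seq, e->msg);
    }
    fflush(out);
    return count;
}
```

The idea would be to call `trace_add` on entry to and exit from each SDK function the tests use, then call `trace_dump` with a log file once the failing call is reached or a handler fires. Writing and flushing to disk on every call would survive a hard crash better, but that is exactly the kind of timing change I'm worried about.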
Also don't forget about physical hardware; equipment can run for 10 years without a problem, then one day... There are in fact several things in your original post that suggest this, including the fact that your remote debugger went down. This suggests to me that either the entire system rebooted or the target system lost its network connection.

Your other options are either an off-hours daytime run (does your office observe Good Friday?) or an overnighter, so that you are physically present and watching the system when it crashes. I'm a Sys Admin myself and I can tell you these are not uncommon in our field.

It can't be either of those two things. I ran the remote debugger several times and it crashed pretty consistently, at the same SDK function call. Also, I've run the suite several times during the day with a smaller subset of tests and am getting regular crashes.

I'm pretty sure that it was a code check-in that caused the crash. I'll just have to keep pushing on. I'll post if I get a solution.