long term bug advice

Hey guys,
I'm running into some issues with a project I'm working on. The project itself is very large, so I can't really post code, but here goes.
The program runs fine for about 2 days. There is no exact time, but some time after about 48 hours it halts with the word "Aborted" on the command line. The program is just a simple bartop game and there's not too much going on. I used valgrind to track down any memory leaks I missed, but that's not the problem. So I guess I'm looking to see if anyone has any idea how to track this down, or any debugging software or techniques they can recommend. Also, I'm using two external libraries, SDL and SDL_mixer. This is becoming a pain; two days is a long time to wait to find a bug.

It's running on Ubuntu 8.04.
This type of crash, where a program fails consistently after a period of time, can be down to resource issues.

You may be continually asking for some resource (memory, handles, etc.) and not releasing it, so eventually you run out of available resources.

This is subtly different from an 'illegal' memory access, which usually raises a segmentation fault and crashes the program. If no more resources are available, the program may behave erratically or simply die.
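
One way to check for this kind of slow resource growth without waiting two days for the crash is to log resource usage periodically and watch for a steady climb. The following is only a minimal sketch, assuming Linux with /proc mounted; the function names (count_open_fds, resident_kb, log_resources) are made up for illustration and are not part of SDL or of the original program.

```cpp
#include <dirent.h>
#include <cstdio>
#include <fstream>
#include <string>

// Count this process's open file descriptors by listing /proc/self/fd.
// Linux-specific. Returns -1 if the directory cannot be opened.
int count_open_fds() {
    DIR* dir = opendir("/proc/self/fd");
    if (!dir) return -1;
    int count = 0;
    while (dirent* entry = readdir(dir)) {
        if (entry->d_name[0] != '.') ++count;  // skip "." and ".."
    }
    closedir(dir);
    return count - 1;  // the listing includes the descriptor opendir() used
}

// Read the resident set size (VmRSS, in kB) from /proc/self/status.
// Returns -1 if the field is not found.
long resident_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            return std::stol(line.substr(6));  // stol skips the leading whitespace
        }
    }
    return -1;
}

// Print one line of resource usage to stderr, tagged by the caller.
void log_resources(const char* tag) {
    std::fprintf(stderr, "[%s] fds=%d rss=%ldkB\n",
                 tag, count_open_fds(), resident_kb());
}
```

Calling log_resources("frame") once a minute from the main loop and redirecting stderr to a file should show within an hour or so whether descriptors or memory are creeping upward.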

You can always post the code somewhere that we can download and check.
Why not get a coredump and load it into the debugger as a start?
It's not dumping the core; I'll have to see why later.

The code is 15,000 lines, so I don't think anyone wants to look at it.

I'll look into dumping the core. Hopefully that will give me a clue as to what is slowly eating up resources.
If this is on Linux, then it is quite likely that the user limit for core files is set to 0. Check with "ulimit -a" and, if so, lift it for the current shell with:

ulimit -c unlimited
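
If the game is launched from a script or a desktop session where the shell limit is easy to forget, the same thing can be requested from inside the program. This is just a sketch using the POSIX getrlimit/setrlimit calls; the helper name enable_core_dumps is hypothetical. Note that it can only raise the soft limit up to the hard limit, so it does nothing useful if the hard limit is itself 0.

```cpp
#include <sys/resource.h>

// Raise the soft core-file size limit to the hard limit for this process.
// Returns true on success.
bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;  // soft limit may be raised up to the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling it once at the top of main() should be enough.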
Indeed it was, thanks for that. Now two days and we'll see if a core dump helps :) While we're at it, can anyone point me to a good reference on how to go through the core dump and see what went wrong?
Start with a simple backtrace to find the function at fault.

"bt" if using gdb.

(You should be able to find gdb references online, if that's the debugger you are using)
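
As a complement to the core file, the program can also print its own backtrace when it receives SIGABRT. The sketch below assumes glibc (backtrace and backtrace_symbols_fd from execinfo.h); the names on_abort and install_abort_backtrace are made up for illustration. backtrace is not formally async-signal-safe, so treat this as a best-effort diagnostic rather than something to rely on.

```cpp
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

// SIGABRT handler: write the current call stack to stderr.
// When the handler returns, abort() still terminates the process,
// so a core file is still produced if the core limit allows it.
extern "C" void on_abort(int) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

// Install the handler. Returns true on success.
bool install_abort_backtrace() {
    void* warmup[1];
    backtrace(warmup, 1);  // do any one-time setup now, not inside the handler

    struct sigaction sa = {};
    sa.sa_handler = on_abort;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGABRT, &sa, nullptr) == 0;
}
```

Linking with -rdynamic usually makes the printed frames show function names instead of bare addresses.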
So I finally got it to crash and I have the core file...
The output is not all that helpful, but I do believe the problem now lies in an OpenGL call :)
Here is the backtrace. Any help from here is greatly appreciated.

Program terminated with signal 6, Aborted.
[New process 5984]
[New process 5985]
#0  0xb7f6d410 in ?? () from /lib/ld-linux.so.2
(gdb) bt
#0  0xb7f6d410 in ?? () from /lib/ld-linux.so.2
#1  0xb7b43085 in modfl () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7b44a01 in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0xb718ff48 in ?? () from /usr/lib/libGLcore.so.1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

The stack is corrupt, as the debugger says.

Are you allocating arrays on the stack, and perhaps walking beyond the array boundary?
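
To illustrate the kind of overrun meant here: writing one element past a stack array silently overwrites whatever sits next to it, which can include the saved return address. A minimal sketch of a checked alternative follows; set_score and the 8-slot table are hypothetical and not taken from the poster's code.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Store a value in a fixed-size table, refusing out-of-range slots.
// With a raw array, scores[slot] = value for slot >= 8 would corrupt
// neighbouring stack memory; at() throws std::out_of_range instead.
bool set_score(std::array<int, 8>& scores, std::size_t slot, int value) {
    try {
        scores.at(slot) = value;
        return true;
    } catch (const std::out_of_range&) {
        return false;  // caller can log the bad index
    }
}
```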
"Aborted" usually indicates an assertion failure or a direct call to abort(). The corrupt stack is typically an indication that something has gone terribly awry in your program. There are many ways to corrupt the stack. The most common is a buffer overrun as jsmith suggests. Though the last corrupt stack I had to debug was due to a double-free because of a class that didn't properly implement a copy constructor. The one prior to that was due to a method signature mismatch between header and source for a C function; the function was being called with an incorrect number of arguments.

I'd start by compiling with -g -fmudflap (or -fmudflapth) and see where that gets you. That should detect most buffer overruns. If that doesn't help, try running under ElectricFence.

Because of the stack corruption, I would not put too much stock in the error being related to an OpenGL call.
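
For reference, the double-free case mentioned above usually looks like a class that owns a raw pointer but relies on the compiler-generated copy, so two objects end up deleting the same buffer. A minimal sketch of the usual fix (deep copy plus copy-and-swap assignment) is below; SampleBuffer is a hypothetical class, not something from the poster's program.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical class that owns a heap-allocated buffer of samples.
class SampleBuffer {
public:
    explicit SampleBuffer(std::size_t n) : size_(n), data_(new short[n]()) {}

    // Deep copy: the new object gets its own allocation, so the two
    // destructors never delete the same pointer.
    SampleBuffer(const SampleBuffer& other)
        : size_(other.size_), data_(new short[other.size_]) {
        std::copy(other.data_, other.data_ + size_, data_);
    }

    // Copy-and-swap: 'other' is already a deep copy; swap with it and
    // let its destructor release the old buffer.
    SampleBuffer& operator=(SampleBuffer other) {
        std::swap(size_, other.size_);
        std::swap(data_, other.data_);
        return *this;
    }

    ~SampleBuffer() { delete[] data_; }

    short* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    short* data_;
};
```

Without the user-defined copy constructor and assignment operator, copying a SampleBuffer would duplicate only the pointer, and the second destructor to run would free memory that was already freed.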
Thanks, I'll check those and get back to you.
I am unable to use either mudflap or efence, as both stop right away with segfault errors, though the program itself runs. The documentation leads me to believe the C++ support is not that good yet. I'm going to look over my code for leaks myself again, but if anyone has another idea, or an idea of why these don't work, I'm definitely open to it.
I can personally vouch for efence working fine on linux w/gcc 3.3
And I can vouch for mudflap working with g++ 4.3. What OS (distro & version) and tool versions are you using?
I did get them to work, but they weren't as much help as I was hoping.
efence halts on the first error, which happens to be deep inside the initialization of SDL_mixer and has nothing to do with my code. I assume it's a survivable error, but efence won't continue past this point. Same with mudflap.
What is the error?

I don't think you should be getting any errors inside the SDL libraries unless you are misusing the API.
Here is the output from running in gdb...

(gdb) run
Starting program: /home/marcus/Desktop/CvD/Game 
[Thread debugging using libthread_db enabled]

  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
Unable to initialize serial port!
[New Thread 0xb67706f0 (LWP 19086)]
[New Thread 0xafc7fb90 (LWP 19089)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xafc7fb90 (LWP 19089)]
0xb7dcb5d7 in memalign () from /usr/lib/libefence.so.0
(gdb) bt
#0  0xb7dcb5d7 in memalign () from /usr/lib/libefence.so.0
#1  0xb7dcb88b in malloc () from /usr/lib/libefence.so.0
#2  0xb3cdb8f8 in pa_xmalloc () from /usr/lib/libpulse.so.0
#3  0xb3d023c6 in ?? () from /usr/lib/libpulse.so.0
#4  0xb3cffdf4 in ?? () from /usr/lib/libpulse.so.0
#5  0xb3cfffb4 in ?? () from /usr/lib/libpulse.so.0
#6  0xb3d02ed3 in ?? () from /usr/lib/libpulse.so.0
#7  0xb78b24ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#8  0xb7c2649e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb) 


When I step through the program, it errors when initializing the SDL_mixer stuff. For reference, I have included that snippet of code.

int audio_rate = 22050;
Uint16 audio_format = AUDIO_S16SYS;
int audio_channels = 2;
int audio_buffers = 4096;

if (Mix_OpenAudio(audio_rate, audio_format, audio_channels, audio_buffers) != 0)
{
  fprintf(stderr, "Unable to initialize audio: %s\n", Mix_GetError());
  exit(1);
}


The error is within the Mix_OpenAudio function.
Don't discount bugs in libpulse: https://bugzilla.redhat.com/buglist.cgi?component=pulseaudio&product=Fedora

This segfault indicates a real problem and should not be ignored. It may be corrupting the heap or stack in a way that doesn't immediately crash your program, but is responsible for the crash later on.

Does this crash for you?

#include "SDL/SDL_mixer.h"
#include <iostream>

int main()
{
  int audio_rate = 22050;
  Uint16 audio_format = AUDIO_S16SYS;
  int audio_channels = 2; 
  int audio_buffers = 4096;
 
  if (Mix_OpenAudio(audio_rate, audio_format, audio_channels, audio_buffers) != 0) 
  {
    std::cerr << "Unable to initialize audio: " << Mix_GetError() << std::endl;
    return 1;
  }

  Mix_CloseAudio();  // the matching shutdown call for Mix_OpenAudio()
  return 0;
}


It works for me. If it fails for you, you need to upgrade pulseaudio. Otherwise, it indicates a problem in your code prior to this point, though I would be surprised if neither efence nor mudflap flagged the point of the error.

Indeed it did fail. I then ran it through gdb and stepped through the program, and this is the result...

(gdb) break 1
Breakpoint 1 at 0x80488e0: file testSDL.cpp, line 1.
(gdb) run
Starting program: /home/marcus/Desktop/test 
[Thread debugging using libthread_db enabled]
[New Thread 0xb79086d0 (LWP 9979)]
[Switching to Thread 0xb79086d0 (LWP 9979)]

Breakpoint 1, main () at testSDL.cpp:4
4	int main()
(gdb) next
main () at testSDL.cpp:6
6	  int audio_rate = 22050;
(gdb) 
7	  Uint16 audio_format = AUDIO_S16SYS;
(gdb) 
8	  int audio_channels = 2; 
(gdb) 
9	  int audio_buffers = 4096;
(gdb) 
11	  if (Mix_OpenAudio(audio_rate, audio_format, audio_channels, audio_buffers) != 0) 
(gdb) 

  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
[New Thread 0xb2c53b90 (LWP 9982)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb2c53b90 (LWP 9982)]
0xb7e77ee6 in free () from /usr/lib/libefence.so.0
(gdb) bt
#0  0xb7e77ee6 in free () from /usr/lib/libefence.so.0
#1  0xb6dbd685 in pa_xfree () from /usr/lib/libpulse.so.0
#2  0xb6de42b2 in ?? () from /usr/lib/libpulse.so.0
#3  0xb6de1e1d in ?? () from /usr/lib/libpulse.so.0
#4  0xb6de1fb4 in ?? () from /usr/lib/libpulse.so.0
#5  0xb6de4ed3 in ?? () from /usr/lib/libpulse.so.0
#6  0xb7a974ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#7  0xb7cd349e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb) 


I'll take a look at the bug list, see what version I'm using, and whether I can upgrade or possibly downgrade. Could you post the version you are using?
Come to think of it, the sound does eventually cut out before the program crashes, so I would lean toward this being part of the problem.
I have version 9.14-0ubuntu20 of libpulse.