Valgrind: segmentation fault error

I have a serial code that works perfectly and prints out data as expected. I rewrote a parallel version of it using OpenMP, testing individual for loops in the main code as I went, but when running the full code I get:
 
segmentation fault

So, I ran valgrind with this parallel code and I got the following error messages:

==2451== Thread 6:
==2451== Conditional jump or move depends on uninitialised value(s)
==2451==    at 0x4C5178C: sqrt (w_sqrt_compat.c:31)
==2451==    by 0x11233F: absComp(double (*) [2], double*) [clone ._omp_fn.0] (U.cpp:1247)
==2451==    by 0x4DA877D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2451==    by 0x4FEE608: start_thread (pthread_create.c:477)
==2451==    by 0x4F0F292: clone (clone.S:95)
==2451==  Uninitialised value was created by a heap allocation
==2451==    at 0x483E340: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2451==    by 0x10AB88: main (U.cpp:171)
==2451==
==2451== Thread 1:
==2451== Conditional jump or move depends on uninitialised value(s)
==2451==    at 0x4C5178C: sqrt (w_sqrt_compat.c:31)
==2451==    by 0x11233F: absComp(double (*) [2], double*) [clone ._omp_fn.0] (U.cpp:1247)
==2451==    by 0x4DA08E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2451==    by 0x10D478: absComp(double (*) [2], double*) (U.cpp:1244)
==2451==    by 0x10D5CE: max_absComp(double (*) [2]) (U.cpp:1289)
==2451==    by 0x10D780: potentialk(double (*) [2], double (*) [2], double (*) [2], double (*) [2], double (*) [2], double*, double*, double*, double, int) (U.cpp:1355)
==2451==    by 0x10B80B: main (U.cpp:455)
==2451==  Uninitialised value was created by a heap allocation
==2451==    at 0x483E340: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2451==    by 0x10AB88: main (U.cpp:171)


I have never run Valgrind with OpenMP before, so I tried to follow the lines it points to without really understanding them. How can everything be initialized correctly in the serial code but not in the parallel code?

If you want to take a look at the full code:
https://www.zipshare.com/download/eyJhcmNoaXZlSWQiOiI3YThmOGZiZi03NjdmLTRmNTQtYWM0ZC1hMDk0OTMzZTU5MjciLCJlbWFpbCI6ImFzdHJvbHVqeUBnbWFpbC5jb20ifQ==
Please post the routine this alludes to, rather than requiring forum users to download a large zip file and fathom it out.

At a rough guess it is associated with an attempt to
abs(some complex number)
in some procedure called absComp(), and you didn't initialise that complex number, either in that routine or further up the call stack.

It is pointing to stuff on line 1247 of U.cpp. Trace back the behaviour of all variables from that, all the way to main() if necessary. Be careful that all elements of dynamic arrays have been suitably initialised: "Uninitialised value was created by a heap allocation".
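For anyone who has not met this class of report before, here is a minimal toy example of my own (not the OP's code) showing both the problem and the fix:

#include <cmath>
#include <cstdio>
#include <fftw3.h>

int main(){
    const int nx = 4, nyk = 3;

    // fftw_malloc, like malloc, does NOT zero the memory it returns.
    fftw_complex *a = (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex));

    // Reading an element before writing it is what produces
    // "Conditional jump or move depends on uninitialised value(s)" inside sqrt.
    // The fix: give every element a defined value before any read.
    for (int i = 0; i < nx*nyk; i++){ a[i][0] = 0.0; a[i][1] = 0.0; }

    double m = std::sqrt(a[0][0]*a[0][0] + a[0][1]*a[0][1]);   // now well defined
    std::printf("%f\n", m);
    fftw_free(a);
}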
Can you upload the complete project so someone can run it?

...download a large zip file

52KB large???
I suspect, from context,
static const int nyk = ny/2 + 1;
should probably be
static const int nyk = ny/2;
but I'll leave it to @thmm to sort out.
For the record, I couldn't figure it out. phik in main() would appear to have the right structure given how it's used in absComp(), and it's fully memset()'d with 42.
My suggestion would be to tell Valgrind to attach a debugger when the faulty jump happens so you can see what exactly is going on. Just a couple of stack traces is not nearly enough information. Also, if you do this, I'd also suggest turning off parallelism. It's going to make inspecting previous stack frames significantly easier.

On an unrelated note, I don't think you're using threads the most effective way, OP. You're giving them too small workloads and you're synchronizing them too often. I would not be surprised if for every additional thread you got no more than a 5-10% performance improvement.
phik is allocated in main() via
	fftw_complex *phik;
	phik = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 
	memset(phik, 42, nx*nyk* sizeof(fftw_complex));


It's then passed to potentialk() via pairs of doubles:
int potentialk(double invnk[][ncomp], double dndxk[][ncomp], double dndyk[][ncomp], double phik[][ncomp], double potSourcek[][ncomp], double kx[], double ky[], double ninvksqu[], double err_max, int max_iter)

Then it gets sent to max_absComp as pairs of doubles:
double max_absComp(double arr3D[][ncomp])

and thence to the routine absComp as pairs of doubles:
void absComp(double arr3D[][ncomp], double arrOut[])

where it appears to crash out.

I take it that fftw_complex does just correspond to a pair of doubles and that memset will correctly initialise all real and imaginary parts as required?


Yeah, just turn off parallelisation (which is probably as simple as omitting the compiler/linker option -fopenmp) and see what happens. It's just possible that the fftw library has OpenMP calls in itself (which may or may not be buggy).
fftw_complex is apparently defined as typedef double fftw_complex[2];, so its sizeof is that of two doubles. The allocated buffer therefore has a raw size consistent with how arr3D is used. Additionally, declaring arr3D the way it is and passing phik to it has the right semantics. Personally I would have just declared arr3D as a pointer and done the pointer arithmetic myself, but at this point it's merely a matter of style.
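Concretely, here is a self-contained sketch of why passing phik type-checks and behaves as expected (I've used a stand-in typedef so it compiles without fftw3.h, and added an explicit length parameter where the OP's version uses globals):

#include <cmath>
#include <cstdio>

typedef double my_complex[2];   // stand-in for fftw_complex from fftw3.h
const int ncomp = 2;

// A parameter declared double arr3D[][ncomp] adjusts to double (*arr3D)[ncomp],
// which is exactly the type of a my_complex* (i.e. an fftw_complex*).
void absComp(double arr3D[][ncomp], double arrOut[], int n){
    for (int i = 0; i < n; i++)
        arrOut[i] = std::sqrt(arr3D[i][0]*arr3D[i][0] + arr3D[i][1]*arr3D[i][1]);
}

int main(){
    my_complex buf[4] = {{3,4},{0,0},{1,0},{0,2}};
    double out[4];
    absComp(buf, out, 4);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   // prints 5 0 1 2
}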

What I'm now not sure I checked is whether, between where it's memset()'d and where it's read, phik is overwritten with values coming from uninitialized memory. E.g.
memset(buffer, 0, size);
int foo;
buffer[0] = foo;
I'm not sure if Valgrind is capable of detecting that condition, or if it would just flag the read from uninitialized memory.
Well, since the OP isn't going to post it, here's the offending function potentialk().

int potentialk(double invnk[][ncomp], double dndxk[][ncomp], double dndyk[][ncomp], double phik[][ncomp], double potSourcek[][ncomp], double kx[], double ky[], double ninvksqu[], double err_max, int max_iter){
	
	//function outputs phik and iterations ---> it solves for the potential using the spectral method

	// invnk is F[1/n] where F is fourier 
	// potsourcek = F[s]
	// ninvksqu= 1/k^2

	 // Initialize variables used in the function
	fftw_complex *dphidxk;
	dphidxk = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 
	
	fftw_complex *dphidyk;
	dphidyk = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 
	
	fftw_complex *gradNgradPhi_x;
	gradNgradPhi_x = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 
	
	fftw_complex *gradNgradPhi_y;
	gradNgradPhi_y = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 
	
	fftw_complex *RHS;
	RHS = (fftw_complex*) fftw_malloc(nx*nyk* sizeof(fftw_complex)); 

	
	double phik_max, phik_max_old;
	double it_error;

	// Begin counter for the number of iterations it takes to calculate phi
	int count = 0;
	
	// Begin while loop
	do{			
		// Calculate phi derivatives
		derivk(phik, kx, dphidxk);
		derivk(phik, ky, dphidyk);		
		// Do convolutions for grad n dot grad phi term
		// gradNgradphi = [ikx phik] * [ikx nk] - [iky phik] * [iky nk], where * is the convolution
		convolve2D(dndxk, dphidxk, gradNgradPhi_x);
		convolve2D(dndyk, dphidyk, gradNgradPhi_y);
		
		// Subtract gradNgradphi from the source term. Calculate RHS of the equation:
		#pragma omp parallel for collapse(3)
		for (int i = 0; i < nx; i++){
			for (int j = 0; j < nyk; j++){
				for (int k = 0; k < ncomp; k++){
					RHS[j + nyk*i][k] = potSourcek[j + nyk*i][k] - gradNgradPhi_x[j + nyk*i][k] - gradNgradPhi_y[j + nyk*i][k];
				}
			}
		}
		
		// Convolve RHS with invnk
		convolve2D(RHS, invnk, RHS);			
		
		// Calculate maximum of absolute value of previous phi
		phik_max_old = max_absComp(phik);		
		
		// Multiply by ninvksqu to get the updated phi: this will output potential in Fourier space, phik
		Arr3DArr2DMult(RHS, ninvksqu, phik);

		// Calculate maximum of absolute value of updated phi(new phi)
		phik_max = max_absComp(phik); //by
		
		// Increase iteration count by 1
		count = count + 1;
		
		// Calculate error
		it_error = fabs((phik_max-phik_max_old)/phik_max);	//err_max is the error we want to converge to	
		
		// If error is too high and we haven't reached the max iterations yet, repeat iterations
	 }while( it_error > err_max && count <= max_iter); // and instead &&
	 
	
	fftw_free(dphidxk);
	fftw_free(dphidyk);
	fftw_free(gradNgradPhi_x);
	fftw_free(gradNgradPhi_y);
	fftw_free(RHS);
	 // Output the number of iterations taken
	 return count;
	 
}



Here is max_absComp():
double max_absComp(double arr3D[][ncomp]){
	// Take the absolute value
	double *absArr;
	absArr = (double*) fftw_malloc(nx*nyk*sizeof(double));
	memset(absArr, 42, nx*nyk* sizeof(double)); //test here, if you pass 2D array to func decays to pointer and sizeof doesn't give size of array

	absComp(arr3D, absArr);
	
	// Calculate the max value
	double maxVal = max2Dk(absArr); //by
	fftw_free(absArr);
	
	return maxVal;
}


and here is absComp():
void absComp(double arr3D[][ncomp], double arrOut[]){
	#pragma omp parallel for collapse(2)
	for (int i = 0; i < nx ; i++){
		for (int j = 0 ; j < nyk; j++){
			arrOut[j + nyk*i] = sqrt(arr3D[j + nyk*i][0]*arr3D[j + nyk*i][0] + arr3D[j + nyk*i][1]*arr3D[j + nyk*i][1]);
		}
	}
}



He could just comment out the omp statement in absComp() and see what happens.

I think the OP needs to try a few things and report back. I don't have these fft libraries to run them.
Thanks everyone!

@thmm

Can you upload the complete project so someone can run it?


Well, I added a zipshare file in hopes it gives you access to the full code, but I don't know is there a better way to do this?


Yeah, just turn off parallelisation (which is probably as simple as omitting the compiler/linker option -fopenmp) and see what happens. It's just possible that the fftw library has OpenMP calls in itself (which may or may not be buggy).

Turning the parallelisation off lets the code run, but wouldn't that just run the serial version instead? I attempted to run without -fopenmp but with the FFTW threads, like:

 
g++ U.cpp -lfftw3_threads -lfftw3 -lm -g -o ./Parallel/test.out

This ran the code; however, my potential solver iteration is almost stuck at a specific value, meaning something is wrong with potentialk().

He could just comment out the omp statement in absComp() and see what happens.


I will try doing this and report back as soon as the code is done running, to see if that solves the problem. Thanks a lot.

@helios
My suggestion would be to tell Valgrind to attach a debugger when the faulty jump happens so you can see what exactly is going on. Just a couple of stack traces is not nearly enough information

These are the flags I am using with Valgrind:
 
g++ U.cpp -lfftw3_threads -lfftw3 -lm -g -fopenmp -o ./Parallel/test.out && valgrind --track-origins=yes ./Parallel/test.out



Also, I read about someone having a similar issue, and it was suggested the segmentation fault might have to do with multiple threads updating one value at the same time. Could this be the issue here? I just don't see how a serial code that runs fine with no segmentation fault creates this problem when parallelized. The link is here:
https://stackoverflow.com/questions/29247593/openmp-segfault

@helios


On an unrelated note, I don't think you're using threads the most effective way, OP. You're giving them too small workloads and you're synchronizing them too often. I would not be surprised if for every additional thread you got no more than a 5-10% performance improvement.


This is something I am trying hard to work on. I noticed that with -fopenmp turned off the code takes 18 minutes instead of the 20 minutes the full serial code takes (granted, potentialk() is still acting funky and results are incorrect since nothing is being updated in the potential solver). I am wondering whether this has anything to do with FFTW's built-in threading or whether it is the OpenMP in my code. If that's the case, how can I improve the performance of my specific code? My sole purpose is to reduce my runtime significantly and thereby improve the resolution of my results. What would you suggest? Thanks
JamieAl,
Why do you persist in saying the serial version is running fine when, by your own admission, "granted, potentialk() is still acting funky and results are incorrect since nothing is being updated in the potential solver"? Fix the serial version first. What does the serial version give with gdb or Valgrind? Any out-of-bounds error will not necessarily appear as a segfault: it depends on what, if anything, is at the memory address you are trying to access.

One nice thing about OpenMP is that you can switch it off simply by omitting -fopenmp on the compile line. The compiler will then simply ignore your #pragma omp directives. You don't have to change any code.

There is simply no point in believing that making faulty code run faster will remove the faults. Also, FFTs are expensive operations: I doubt that changing your own code will make that much difference when the code spends 90% of its time in external library routines.
Okay, I found something. phi in main() is uninitialized before it's passed to r2cfft(). memsetting it gets rid of the first avalanche of errors. At a glance it would appear Pi has the exact same problem. This is all I have time for, for now.

This is some real shitty code, OP. Sorry for being so blunt.
1. Liberally sprinkling omp parallel for collapse(foo) won't make your code faster. The added complexity might even make it slower.
2. No attempt at object orientation. How many times do you need to do j + width*i before realizing you need a matrix class?
3. Meaningless variable names.
4. Repetition, repetition, repetition. The error I mentioned above could have been avoided if you'd just created a function that both allocated and initialized the memory (see the sketch below).
5. Confusing function parameters where it's unclear which are inputs and which are outputs.
6. Functions with too many parameters. (Would be solved by encapsulating in classes.)
7. Meaningless comments that only serve to add visual noise.

Honestly, it's so bad I might do a full refactor later, if I can find the time.
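Regarding point 4 above, I mean something along these lines -- a rough sketch with names of my own choosing, not code from the project:

#include <cstring>
#include <fftw3.h>

// Allocate an nx*nyk array of fftw_complex and zero every element,
// so no caller can ever forget the initialisation step.
fftw_complex* alloc_complex_zeroed(int nx, int nyk){
    fftw_complex *p = (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex));
    std::memset(p, 0, nx*nyk*sizeof(fftw_complex));   // all-zero bytes are 0.0 doubles
    return p;
}

// Usage (in main() or anywhere else):
//   fftw_complex *phik = alloc_complex_zeroed(nx, nyk);
//   ...
//   fftw_free(phik);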

I will try [commenting out the omp statement]
That's not necessary. It will have the same effect as not using -fopenmp

Also, I read someone having a similar issue and it was suggested the segmentation fault might have to do with multiple threads updating one value at the same time. Can this be the issue here?
I don't think so. Unlike that poster's code, your code doesn't contain race conditions. At least not in the parts I checked.

I just don't see how a serial code that runs fine with no segmentation fault error when parallelized creates this problem.
That's the fun part of undefined behavior.

EDIT:
This is something I am trying hard to work on. I noticed that with -fopenmp turned off the code takes 18 minutes instead of the 20 minutes the full serial code takes (granted, potentialk() is still acting funky and results are incorrect since nothing is being updated in the potential solver). I am wondering whether this has anything to do with FFTW's built-in threading or whether it is the OpenMP in my code. If that's the case, how can I improve the performance of my specific code? My sole purpose is to reduce my runtime significantly and thereby improve the resolution of my results. What would you suggest?
The immediate problem is how you structured the parallelization. OpenMP isn't magic. You can't expect efficient use of threads by just making every single for loop parallel and synchronizing after each one. Unfortunately designing performant parallel code is a subject complex enough for a book, not something I can usefully communicate in a forum post.
For now, let's just say that you first need to understand which data depends on the value of which data. Things that can be computed independently are good parallelization candidates.
In your case you're seeing poor parallel scaling performance because you're treating your CPU as if it was a GPU. GPUs are very good at handling massive parallelization (hundreds or thousands of threads) of very short and simple functions. E.g. computing each cell of a large matrix multiplication in parallel. CPUs are exactly the opposite; they're very good at handling low parallelization (on desktop computers usually 4-8 threads, rarely more than 32) of very complex functions. E.g. Computing two independent large matrix multiplications in parallel.
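To give at least one concrete, toy illustration of what "fewer, larger parallel regions" means in OpenMP (arrays and sizes invented by me, not taken from the OP's code): independent loops can share a single parallel region and skip the barrier between them, instead of paying a fork/join plus a barrier for every small loop.

#include <cstdio>
#include <vector>

int main(){
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n), c(n, 2.0), d(n);

    // One parallel region; the two loops write different arrays, so the
    // first loop's barrier can be dropped with nowait.
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++) b[i] = 2.0*a[i];

        #pragma omp for
        for (int i = 0; i < n; i++) d[i] = c[i]*c[i];
    }   // single implicit barrier here

    std::printf("%f %f\n", b[0], d[0]);
}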
^^^
the first threading optimization you want to do is on totally independent things, like reading a large file while setting up drawing for your GUI while checking online to see that the program is up to date, or something like that where none of it is (yet) talking to the other parts. I am NOT a fan of code that splits off into threads and then has the threads sit idle waiting on mutex locks to clear instead of crunching. It frequently defeats the purpose if you don't juggle cats to get it just so, and then the hardware changes and how you split it is suboptimal again...

the newest cpus are running 10+ cores now. I forget the rule of thumb, but more than 3 threads per core is really asking a lot from the box for most programs.

linear algebra has been done to death. You can look up how to optimally split a big multiply up to perform well on various types of machines with a little legwork. How big are your matrices? 20 years ago, we did not try to split before 100x100 or so. Today that is probably at least 300x300 roughly? Are you working in big matrix space?
Found a race condition on the variable 'dummy' in iArr3DMult(). The variable is written and read in parallel without any synchronization. Declaring the variable inside the innermost loop fixes it.
It's so bad...

Found another race condition of the same type in calcCollFreqk().
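For the record, the pattern looks roughly like this (function and variable names are my reconstruction, not the exact code):

// Racy: 'dummy' lives outside the loop, so under the pragma it is shared
// and every thread writes and reads the same variable.
void mult_racy(const double *a, const double *b, double *out, int n){
    double dummy;
    #pragma omp parallel for
    for (int i = 0; i < n; i++){
        dummy = a[i]*b[i];   // data race: concurrent writes/reads of 'dummy'
        out[i] = dummy;
    }
}

// Fixed: the temporary is declared inside the loop body, so each iteration
// (and hence each thread) gets its own copy. private(dummy) on the pragma
// would achieve the same thing.
void mult_fixed(const double *a, const double *b, double *out, int n){
    #pragma omp parallel for
    for (int i = 0; i < n; i++){
        double t = a[i]*b[i];
        out[i] = t;
    }
}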
@lastchance

JamieAl,
Why do you persist in saying the serial version is running fine when, by your own admission, "granted, potentialk() is still acting funky and results are incorrect since nothing is being updated in the potential solver"?

OK, so you're right: what I tried to run was an older version of my code where I forgot to properly initialize phi. As @helios pointed out:

Okay, I found something. phi in main() is uninitialized before it's passed to r2cfft().


When this is fixed in the serial code, Valgrind runs with no errors:
==2704== Memcheck, a memory error detector
==2704== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2704== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==2704== Command: ./../../Datasim1/test.out
==2704==
Iteration = 0    t = 0.1000000000   phi_iter = 1
.... 


Jeez guys, thanks for pointing this disaster out!!

@helios
This is some real shitty code, OP. Sorry for being so blunt.

lol, unfortunately I am aware. I find it hard to learn the correct methodology in approaching all of this while trying to do some research projects with deadlines. But one of my goals is to focus on improving the code after achieving the main goal (getting correct results). I appreciate the help here; I have learned more from you guys than from any class I have taken, to be honest.

I added the following line in my initializing for loop in the "parallel" code, and now it runs under Valgrind with no "obvious" errors:

for (int i = 0; i < nx; i++){
		for (int j = 0; j < ny; j++){
		  ne[j + ny*i] =  (a * tanh(b*(XX[j + ny*i] + c)) + d + a2 * tanh(b*(XX[j + ny*i] + c2)) + d2 + .02*cos(4*M_PI * YY[j + ny*i]/Ly) *( exp( - bg * pow(XX[j + ny*i] - xg,2)) + exp(-bg* pow( XX[j + ny*i] - Lx + xg,2)  ) ) )    * 1.E11; 	
	
			Ti[j + ny*i] = 1000.;
			Te[j + ny*i] = 1000.;
			phi[j + ny*i] = 1.; // THIS LINE added
			
			
			// Calculate pressures here
			Pi[j + ny*i] = ne[j + ny*i] * Ti[j + ny*i] * kb;
			Pe[j + ny*i] = ne[j + ny*i] * Te[j + ny*i] * kb;
			
		}
	}


Again, I appreciate all the great commentary and help. I will stop doing this:
1. Liberally sprinkling omp parallel for collapse(foo) won't make your code faster.

since some for loops might not need parallelizing at all. Also, if nested loops are dependent (one iteration depends on another), OpenMP won't be much help there. Finally, the race conditions @helios pointed out are something I wasn't sure how to fix, so I will now go back and change all of that.

@jonnin
How big are your matrices? 20 years ago, we did not try to split before 100x100 or so. Today that is probably at least 300x300 roughly? Are you working in big matrix space?


Currently I am working with 256x256 for this specific code; ideally I would like to go higher.
I'll be pushing my changes here: https://github.com/Helios-vmg/refactor-example
main() is looking quite a bit cleaner already. Don't bother trying to compile it yet.

Out of curiosity, what is this supposed to be simulating? Some sort of subatomic interaction, it would seem, but what?

I find it hard to learn the correct methodology in approaching all of this while trying to do some research projects with deadlines.
That's understandable. The only real way to get good at designing code is to write lots of it over a long time. When you think enough times "what an idiot! What was I thinking when I did this?" you'll understand which things you should do and which you shouldn't. Obviously it's going to take you longer if coding is not your primary activity, but rather mainly a tool.
@helios

I'll be pushing my changes here: https://github.com/Helios-vmg/refactor-example
main() is looking quite a bit cleaner already. Don't bother trying to compile it yet.


Thanks, I will be checking it as I fix some race condition errors as well!


Out of curiosity, what is this supposed to be simulating? Some sort of subatomic interaction, it would seem, but what?


This particular code is simulating a type of plasma instability called the gradient drift instability (GDI) that forms in the F region of the ionosphere. GDI serves as a generation mechanism for ionospheric turbulence, which negatively impacts the propagation and accuracy of GPS signals.

"what an idiot! What was I thinking when I did this?" you'll understand which things you should do and which you shouldn't. Obviously it's going to take you longer if coding is not your primary activity, but rather mainly a tool.


That is very true, and that's why I intend to code more and run more simulations, so that eventually I feel a bit more comfortable with coding. Thanks again, these are all very helpful tips.
@JamieAl,
Why don't you simply use std::vector rather than all those dynamic arrays? Then you will never have to use a malloc or a free (which will reduce the line count in your code by 20% or so to start with, and the likelihood of bugs by a similar amount).

Your library routines may not directly use vectors, but they do use flat 1-d arrays, so if V is a vector then you can simply pass a pointer to its data buffer: V.data(). It will still work.
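For example, a sketch of what that looks like with FFTW (the FFTW manual notes that std::complex<double> is bit-compatible with fftw_complex, so the cast below is legitimate; the sizes are just placeholders):

#include <complex>
#include <vector>
#include <fftw3.h>

int main(){
    const int nx = 256, ny = 256, nyk = ny/2 + 1;

    std::vector<double> phi(nx*ny, 0.0);                 // already initialised
    std::vector<std::complex<double>> phik(nx*nyk);      // value-initialised to 0

    // Library routines still receive plain pointers:
    fftw_plan p = fftw_plan_dft_r2c_2d(nx, ny, phi.data(),
                                       reinterpret_cast<fftw_complex*>(phik.data()),
                                       FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    // No fftw_free or delete needed: the vectors clean up after themselves.
}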

BTW, there is no point in timing a debug build compiled with -g and without optimisation. When you do come to timing, turn -O3 on.
