OpenMP: Parallel code takes longer to run than serial code?

I have a serial code with a simple forward FFT function and a few other functions; in main there is a basic for loop used for initialization. This code is a much smaller version of the original 2D code that I am trying to parallelize.

I ran a simple test with no errors, but the answers I am getting don't look right. The run times for the parallel code with 1 to 16 threads are:
num_threads = 1Compute Time: 779.000000 seconds
num_threads = 2Compute Time: 6931.000000 seconds
num_threads = 3Compute Time: 618.000000 seconds
num_threads = 4Compute Time: 580.000000 seconds
num_threads = 5Compute Time: 621.000000 seconds
num_threads = 6Compute Time: 922.000000 seconds
num_threads = 7Compute Time: 869.000000 seconds
num_threads = 8Compute Time: 678.000000 seconds
num_threads = 9Compute Time: 692.000000 seconds
num_threads = 10Compute Time: 668.000000 seconds
num_threads = 11Compute Time: 699.000000 seconds
num_threads = 12Compute Time: 524.000000 seconds
num_threads = 13Compute Time: 56581.000000 seconds
num_threads = 14Compute Time: 3057.000000 seconds
num_threads = 15Compute Time: 4134.000000 seconds
num_threads = 16Compute Time: 3287.000000 seconds


Whereas the serial code has a run time of:
 
Compute Time: 365.000000 seconds


My parallel code:
#define _USE_MATH_DEFINES // for C++
#define FFTW_ESTIMATE (1U << 6)
#include <math.h>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <iostream>  // for cout..
#include <iomanip>
#include <ctime>
#include <omp.h>
#include "fftw3.h"   // instead of this, include the file fftw_threads.h
#include <valarray>  // interesting stuff
#include <vector>
#include <cstring>


static const int nx =  16; //256;
static const int ny = 16; // 256;
static const int nyk = ny/2 + 1;
static const int ncomp = 2;
using namespace std;

// g++ U.cpp -lfftw3_threads -lfftw3 -lm -g -fopenmp -o test.out 
// g++ U.cpp -lfftw3 -lm -g -fopenmp -o test.out 
int main();
void r2cfft(double rArr[], double cArr[][ncomp]);
void print2DPhysArray(double arr[]);
double makeSpatialMesh2D(double dx, double dy, double xarr[], double yarr[]);

int main (void){
    int fftw_threads_init(void); // should only be called once (in your main() function); it performs any one-time initialization required to use threads on your system. It returns zero if successful and non-zero on error
    // hyperbolic tan parameters
    double L = 2*M_PI;	//useless 
	double Lx = 200000., Ly = 80000.; //different
	//double Lx = L, Ly = L;
	double dx = Lx / nx;
	double dy = Ly / ny;
	//I.Cs:  Parameters for initializing hyperbolic tangent IC
	double xg = Lx * (19./24); //xg center of hyperbolic tangent func
	double m = 0.5;
	double lg = 12000.; //lg is width of hyperbolic tangent func
	double o = 1.;
	double a = (o-m)/2.; //difference from background?
	double c = -xg;
	double d = (o + m)/2.; //size of the cloud ??
	double a2 = -1*a;
	double c2 = - Lx - c;
	double d2 = d - m;
	
	double b = abs(((((tanh(log(m/o)/4)) * a) + d) * pow(cosh(log(m/o)/4), 2))/(a * lg));
	double bg = 2./lg/lg;
    double start_time; // time before parallel region
    double run_time; //time 

    //test
    double *ne;
    ne = (double*) fftw_malloc(nx*ny*sizeof(double));

    fftw_complex *nek;
    nek = (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex));
    memset(nek, 42, nx*nyk* sizeof(fftw_complex)); 

    double *XX;
	XX = (double*) fftw_malloc(nx*ny*sizeof(double));
	//print2DPhysArray(XX);

	double *YY;
	YY = (double*) fftw_malloc(nx*ny*sizeof(double));

    double kmax = makeSpatialMesh2D(dx, dy, XX, YY); // returns XX, YY, ..


    // start a threading loop
    for (int nThreads= 1; nThreads<=16; nThreads++){
        omp_set_num_threads(nThreads); // asks openmp to create a team of threads
        clock_t start_time = clock(); // or use: start_time= omp_get_wtime();
#pragma omp parallel // start parallel region or fork threads
{
#pragma omp single // cause one thread to print number of threads used by the program
        printf("num_threads = %d", omp_get_num_threads());
//#pragma omp for reduction(+ : sum) // this is to split up loop iterations among the team of threads. It includes 2 clauses: 1) creates a private variable and 2) causes threads to compute their sums locally and then combine their local sums into a single global value
#pragma omp for // more natural option to parallel execution of for loops: no need for new loop bounds         
        for (int i = 0; i < nx; i++){
            for (int j = 0; j < ny; j++){
                ne[j + ny*i] =  (a * tanh(b*(XX[j + ny*i] + c)) + d + a2 * tanh(b*(XX[j + ny*i] + c2)) + d2 + .02*cos(4*M_PI * YY[j + ny*i]/Ly) *( exp( - bg * pow(XX[j + ny*i] - xg,2)) + exp(-bg* pow( XX[j + ny*i] - Lx + xg,2)  ) ) )    * 1.E11; 	
                //print2DPhysArray(ne);
            }
                
        }

} // convert to Fourier space
        r2cfft(ne, nek);
        clock_t run_time = clock() - start_time; // end time
        printf("Compute Time: %f seconds\n",(double) run_time); // %f is for double try 1f

    // print ne and nek for test
    //print2DPhysArray(ne);
}
    fftw_free(ne);
    fftw_free(nek);


   return 0;


}


void r2cfft(double rArr[], double cArr[][ncomp]){
    
    fftw_plan r2c; 

    // FFTW Thread parameters
    int nbthreads=2;
    fftw_init_threads();
    fftw_plan_with_nthreads(nbthreads);
    //fftwf_init_threads();
    //fftwf_plan_with_nthreads(nbthreads);
	
   // void fftw_plan_with_nthreads(int nbthreads); // before calling any plan OR: fftw_plan_with_nthreads(omp_get_max_threads())
    #pragma omp critical (FFTW)

    r2c = fftw_plan_dft_r2c_2d(nx, ny, &rArr[0], &cArr[0], FFTW_ESTIMATE);
    fftw_execute(r2c);

   
    fftw_destroy_plan(r2c);
    fftw_cleanup();
    // Thread clean
    fftw_cleanup_threads();
    //fftwf_cleanup_threads();
 // you should create/destroy plans only from a single thread, but can safely execute multiple plans in parallel.
}	
 
void print2DPhysArray(double arr[]) { 
    #pragma omp parallel

    #pragma omp for 
	for (int i = 0; i < nx; i++) {
		for (int j = 0 ; j < ny ; j++) {
			printf("%+4.2le  ",arr[j + ny*i]);

		}
		printf("\n");		
	}
}

double makeSpatialMesh2D(double dx, double dy, double xarr[], double yarr[]){
	// Make x coordinates. We want them to change as we change the zeroth index and be constant as we change the first index.
	#pragma omp parallel

    #pragma omp for
	for(int i = 0; i< nx; i++){
		for(int j = 0; j< ny; j++){
			xarr[j + ny*i] = i*dx;
			yarr[j + ny*i] = j * dy;			
		}		
	}

		if (dx < dy){
			return 1/(2*dx);
		}else {
			return 1/(2*dy);
		}
		

}

Don’t put “omp single” in the middle of the timed, parallel loop. Any output should go outside the parallel region.

BTW, you are allowed to use
#pragma omp parallel for
rather than separate statements.

At some point it would be worth asking your operating system how many threads it is prepared to give you. If you try to use more than that, it will just timeshare them.
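For example, a minimal sketch of such a timing loop, with the thread count reported outside the timed region, the combined directive, and omp_get_wtime() for wall-clock time (the arrays and the loop body here are simple stand-ins, not your actual ones):

#include <cstdio>
#include <cmath>
#include <vector>
#include <omp.h>

int main()
{
    const int nx = 256, ny = 256;
    std::vector<double> ne(nx * ny), XX(nx * ny, 1.0);

    for (int nThreads = 1; nThreads <= 16; nThreads++)
    {
        omp_set_num_threads(nThreads);
        printf("num_threads = %d, ", nThreads);          // report outside the timed, parallel region

        double t0 = omp_get_wtime();                      // wall-clock time, not accumulated CPU time
        #pragma omp parallel for                          // combined directive
        for (int i = 0; i < nx; i++)
            for (int j = 0; j < ny; j++)
                ne[j + ny*i] = std::tanh(XX[j + ny*i]);   // stand-in for the real expression

        printf("Compute Time: %f seconds\n", omp_get_wtime() - t0);
    }
}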
So I think one of the issues was in the FFTW function: I was passing int nbthreads = 2; and the threading setup calls, which is unnecessary. I have simplified it to:

void r2cfft(double rArr[], double cArr[][ncomp]){

    fftw_plan r2c;

    #pragma omp critical (FFTW)
    r2c = fftw_plan_dft_r2c_2d(nx, ny, &rArr[0], &cArr[0], FFTW_ESTIMATE);

    fftw_execute(r2c);

    #pragma omp critical (FFTW)
    fftw_destroy_plan(r2c);

    fftw_cleanup();
}


Don’t put “omp single” in the middle of the timed, parallel loop. Any output should go outside the parallel region.

I see, that makes sense. I guess I was trying to figure out which number of threads returns the shortest runtime. I am still figuring things out and not quite sure how to decide on that.

#pragma omp parallel for

Thanks, I wasn't sure if I could.

At some point it would be worth asking your operating system how many threads it is prepared to give you

Which function can I use for this?

Also, just a basic question: is it normal that every time I run my code I get a different runtime for each thread count?

JamieAI wrote:
Which function can I use for this?

To be honest, you can probably just run
#include <iostream>
#include "omp.h"

int main()
{
    #pragma omp parallel
    std::cout << "Hello, world.\n";
}

and count how many times it writes "Hello, world.". Unless your operating system puts an explicit limit on the number of threads it can spawn, this is typically the number of processors times the hyperthreading factor (typically 2).

Alternatively, call omp_get_max_threads(), but make sure that you do it from a serial region, not from inside a parallel one.
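For reference, a minimal sketch that queries this from a serial region (omp_get_num_procs() is shown as well, since it reports the number of logical processors the runtime sees):

#include <cstdio>
#include <omp.h>

int main()
{
    // Both calls are made from serial code, before any parallel region.
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
}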



JamieAI wrote:
Is it normal that every time I run my code I get a different runtime for each thread count?

Yes.
Your computer is probably doing lots of other things in the background and your threads will, very democratically, have to share time with those. Also, the discrete clock "ticks" may be quite coarse, so unless the number of them is large the answer may be fairly inaccurate. Do several runs and take an average.
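Something along these lines, for example (a sketch only; the inner loop is a trivial stand-in for whatever work you are actually timing):

#include <cstdio>
#include <cmath>
#include <omp.h>

int main()
{
    const int repeats = 5;
    double total = 0.0, sink = 0.0;

    for (int r = 0; r < repeats; r++)
    {
        double t0 = omp_get_wtime();
        for (int i = 0; i < 10000000; i++)     // stand-in for the work being timed
            sink += std::sin(i);
        total += omp_get_wtime() - t0;
    }

    printf("average over %d runs: %f seconds (sink = %g)\n", repeats, total / repeats, sink);
}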



There are overheads for spawning and then joining lots of threads, especially if you have to wait for one tardy member to finish. Your speed is unlikely to be linear in the number of threads. I wouldn't usually set it to the maximum number available.

Yes.
Your computer is probably doing lots of other things in the background and your threads will, very democratically, have to share time with those


You can override this by setting your priority very high (I assume your threading tool has a way to do that?). This is 'rude' and can even cause instability (never mind angry co-workers if the machine is shared) if you run too long without giving the OS some relief, but depending on what you are doing and how fast you need it, you can use this. Always leave 1 thread free for the OS if you decide to do this.
Thanks everyone! I did notice another thing that seems problematic: my output is almost random. For example, going from 1 thread to 2 gives a larger runtime, then it's back to a smaller runtime from 2 to 3, which indicates terrible scalability and maybe other things. Am I understanding this correctly? How can I go about fixing this?

num_threads = 1Compute Time: 779.000000 seconds
num_threads = 2Compute Time: 6931.000000 seconds
num_threads = 3Compute Time: 618.000000 seconds
num_threads = 4Compute Time: 580.000000 seconds
num_threads = 5Compute Time: 621.000000 seconds
num_threads = 6Compute Time: 922.000000 seconds
num_threads = 7Compute Time: 869.000000 seconds
num_threads = 8Compute Time: 678.000000 seconds
num_threads = 9Compute Time: 692.000000 seconds
num_threads = 10Compute Time: 668.000000 seconds
num_threads = 11Compute Time: 699.000000 seconds
num_threads = 12Compute Time: 524.000000 seconds
num_threads = 13Compute Time: 56581.000000 seconds
num_threads = 14Compute Time: 3057.000000 seconds
num_threads = 15Compute Time: 4134.000000 seconds
num_threads = 16Compute Time: 3287.000000 seconds
How many times did you get the same result? I ask because those are long run times (> 10 min) and the temptation to run it only once is pretty high. If you get the same result multiple times (try at least 5 runs) then you need to figure out why. When 2 threads take 10x longer than 1 thread, that tells me, as a starting point for what to look for, that something is locked up in a mutex and your computer is spending all its time waiting (it could be something else, but something somewhere is doing a lot of nothing).

A smaller problem is in order if possible; 10 seconds on 1 thread is plenty to see what is going on.
To be honest, the result changes every time and looks random. If I run the code with one thread several times, I get:

Compute Time: 5908.000000 seconds
Compute Time: 1113.000000 seconds
Compute Time: 1127.000000 seconds
Compute Time: 1119.000000 seconds

Which again looks very random, and it still takes SO long for a simple for loop, when the serial code usually takes 300+ seconds.

I have changed the code to something like this, and I'm really not sure what's going on:
int nThreads = 1;
omp_set_num_threads(nThreads);
clock_t start_time = clock();
#pragma omp parallel for
for (int i = 0; i < nx; i++){
    for (int j = 0; j < ny; j++){
        ne[j + ny*i] = (a * tanh(b*(XX[j + ny*i] + c)) + d + a2 * tanh(b*(XX[j + ny*i] + c2)) + d2 + .02*cos(4*M_PI * YY[j + ny*i]/Ly) *( exp( - bg * pow(XX[j + ny*i] - xg,2)) + exp(-bg* pow( XX[j + ny*i] - Lx + xg,2) ) ) ) * 1.E11;
        Ti[j + ny*i] = 1000.;
        Te[j + ny*i] = 1000.;

        Pi[j + ny*i] = ne[j + ny*i] * Ti[j + ny*i] * kb;
        Pe[j + ny*i] = ne[j + ny*i] * Te[j + ny*i] * kb;
    }
}

r2cfft(ne, nek);
r2cfft(Ti, Tik);
r2cfft(Te, Tek);

clock_t run_time = clock() - start_time;
printf("Compute Time: %f seconds\n", (double) run_time);

Nah, that 1 thread run isn't random; it's likely it cached something after the first run, making subsequent runs faster, though. The 3 after are close enough to each other.

That said:
the first run tells you how long it may take if it needs to load the data from RAM to cache or whatever else it needs to do.
the 3 runs after that tell you how much work it really needs to do, with the page-fault problem removed and the data already in an organized format.
and those 2 things together suggest that perhaps the data could be organized better in memory so you access it with fewer page faults (?). It's worth a look.

If you are worried about time, you still need to profile the single-threaded loops and figure out where they are slow.
tanh(b*(XX[j + ny*i] + c))
can you say tmp = tanh(b*(XX[j + ny*i] + c))

and later use tmp twice instead of computing the expression twice? Or is your compiler smart enough that it does not matter? (Some compilers can condense repeated expressions, some can't, and whether they can also depends on the complexity of the expression.)
pow(anything,2) is notably slower than anything*anything.
You need to take a pass over the math part and tune it first.
Same thing with the indexing: how many j + ny*i computations can a temp value save you?
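For example, a sketch of the inner loop with the flat index and the repeated sub-expressions hoisted into temporaries, and pow(x,2) replaced by x*x (the symbols a, b, c, c2, d, d2, a2, bg, xg, Lx, Ly, XX, YY, ne are assumed to be the ones from the code above):

for (int i = 0; i < nx; i++) {
    for (int j = 0; j < ny; j++) {
        const int    idx = j + ny*i;        // compute the flat index once per iteration
        const double x   = XX[idx];
        const double y   = YY[idx];
        const double e1  = x - xg;          // pow(e1, 2) becomes e1*e1
        const double e2  = x - Lx + xg;
        ne[idx] = (a  * tanh(b*(x + c))  + d
                 + a2 * tanh(b*(x + c2)) + d2
                 + .02*cos(4*M_PI*y/Ly) * (exp(-bg*e1*e1) + exp(-bg*e2*e2))) * 1.E11;
    }
}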
chrono is a better timer than clock, but since you are in the seconds realm, it does not matter.
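For reference, a minimal std::chrono sketch that reports elapsed wall-clock time in seconds:

#include <chrono>
#include <cstdio>

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    // ... the work being timed goes here ...
    auto t1 = std::chrono::steady_clock::now();
    printf("elapsed: %f seconds\n", std::chrono::duration<double>(t1 - t0).count());
}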

What values actually CHANGE in there? That is, are you filling something with the same computations over and over that you could pull up to the outer for loop? I am losing track trying to unravel it in my head, but it looks like a bunch of stuff does not change...
I looked for a while and I see almost nothing in the ne[] = expression that is different from one inner-loop iteration to the next...

You know that Ti and Te are 1000, so don't look them up again in the expressions at the bottom; substitute the constant 1000 back in.
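For instance (a sketch only, reusing the idx temporary from the sketch above; kb is assumed to be the same constant as in your code):

const double T0 = 1000.;           // Ti and Te are this constant everywhere
Ti[idx] = T0;
Te[idx] = T0;
Pi[idx] = ne[idx] * T0 * kb;       // no need to read Ti/Te back from memory
Pe[idx] = ne[idx] * T0 * kb;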

make it screaming fast for 1 thread. Then split it up after that with confidence that it will go faster still.