Optimization using OpenMP: Parallel code is slow

I made a post in the past about optimization and parallelization where I got lots of help and feedback. I am attempting to use OpenMP with the FFTW library and just test a single function. My current results show that my parallel code is either the same speed or slightly slower than the serial code, which is obviously wrong. I am expecting, I guess, at least twice the speed, depending on the number of threads? Also, I know FFTW has its own multi-threading features, but I wasn't sure if it would interfere with other parts of the code that use OpenMP.

Example code:
Main.cpp
static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

Eigen::initParallel();
Eigen::setNbThreads(5);

Eigen::Tensor<double, 3> Re(nx,ny,nz);
for(int i = 0; i < nx; i++){
    for(int j = 0; j < ny; j++){
        for(int k = 0; k < nz; k++){
            Re(k,i,j) = /* initialized with some function */;
        }
    }
}
Eigen::Tensor<std::complex<double>, 3> Im(nx,ny,nz);

double start_time, run_time;
start_time = omp_get_wtime();
r2cfft3d(Re, Im); // function takes real input and performs a forward FFT to produce the complex output
run_time = omp_get_wtime() - start_time;
cout << "Time in s: " << run_time << "s\n";


My function:

void r2cfft3d(Eigen::Tensor<double, 3>& rArr, Eigen::Tensor<std::complex<double>, 3>& cArr){

    #pragma omp parallel
    {
        fftw_complex *input_array;
        fftw_complex *output_array;

        input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
        output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

        #pragma omp parallel for
        for (int i = 0; i < nx; ++i) {
            for (int j = 0; j < ny; ++j) {
                for (int k = 0; k < nz; ++k) {
                    input_array[k + nz * (j + ny * i)][REAL] = rArr(k,i,j);
                    input_array[k + nz * (j + ny * i)][IMAG] = 0;
                }
            }
        }

        #pragma omp critical (make_plan)
        {
            fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
            fftw_execute(forward);
            fftw_destroy_plan(forward);
            fftw_cleanup();
        }

        #pragma omp for nowait
        for(int i = 0; i < nx; ++i){
            for(int j = 0; j < ny; ++j) {
                for(int k = 0; k < nz; ++k) {
                    cArr(k,i,j).real(output_array[k + nz * (j + ny * i)][REAL]);
                    cArr(k,i,j).imag(output_array[k + nz * (j + ny * i)][IMAG]);
                }
            }
        }

        fftw_free(input_array);
        fftw_free(output_array);
    }
}


I am using omp_get_wtime() to get the parallel time and clock_t to get the serial time; for some reason this is the only way I can get an accurate comparison. For now, the serial code returns:
 
Time in s: 0.044047s

and the parallel code returns:
 
Time in s: 0.0490732s


Someone suggested pre-allocating my buffers input_array/output_array and passing them to the function, but this gives me very little improvement in my time, and it also looks sort of messy. Thanks.
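
(In case it helps, this is roughly what that suggestion looks like -- just a sketch; the extra parameters and the caller code are placeholder names, not my actual code:)

// Sketch of the "pre-allocate and pass in" idea: the caller owns the buffers
// and reuses them across calls, so the function no longer mallocs/frees.
void r2cfft3d(Eigen::Tensor<double, 3>& rArr,
              Eigen::Tensor<std::complex<double>, 3>& cArr,
              fftw_complex* input_array,      // pre-allocated, nx*ny*nz elements
              fftw_complex* output_array);    // pre-allocated, nx*ny*nz elements

// caller:
fftw_complex* in  = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
fftw_complex* out = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
r2cfft3d(Re, Im, in, out);   // reuse in/out for every subsequent call
// ...
fftw_free(in);
fftw_free(out);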
heh... have you ever printed that array index?
k + nz * (j + ny * i) I mean?

consider..
int dx{};

...blah blah for loops
input_array[dx++][REAL] = rArr(k,i,j);

there are ~2 million loops as written.

also, any way to just initialize to zeros (bytewise or value wise) so you don't need to assign 0 to all the complex parts...?

since it's just going sequentially anyway, you may be able to convert it to a single for loop and just provide the i,j,k values to rArr with less nested-loop overhead. Compilers are smart, but I don't think they can see that this is a sequential indexing problem or eliminate the loop jumps, at least not all of them.

finally, this may need a different MP approach. If I were threading it by hand, I would probably do something like 250K elements per thread, sequentially... that is, thread out [0] to [250k] in thread 1, then 250k+1 to 500k in thread 2, etc. Maybe it's doing that; I am not experienced enough with OpenMP to say how it is splitting the work up.
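
something like this, untested, just to show the shape of the single-loop idea (it assumes the element order of rArr.data() matches what FFTW expects -- check the storage order):

// untested sketch: one running index instead of recomputing k + nz*(j + ny*i)
std::memset(input_array, 0, nx*ny*nz * sizeof(fftw_complex)); // zero the whole buffer once, so the imaginary parts need no per-element assignment; needs <cstring>

const double* src = rArr.data();
for (int idx = 0; idx < nx*ny*nz; ++idx)
    input_array[idx][REAL] = src[idx];   // imaginary parts already 0 from the memset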
You can make a real-to-complex plan which should be a little faster than the complex-to-complex plan (half the size). Just call fftw_plan_dft_r2c_3d instead, leaving out the FFTW_FORWARD argument because it's implied. Although that only fills half the array I think.

It may be possible to get a speedup by operating directly on the tensor object's storage, accessible via the member function data().
Beware of problems involving storage-order mismatches. As we've discovered before, FFTW works with basically any reasonable memory layout as long as you're careful to specify it properly.

There is a member function called Eigen::Tensor::swap_layout that could help, although the documentation says "only the default column major layout is currently fully supported, and it is therefore not recommended to attempt to use the row major layout at the moment":
https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html

It will be beneficial to re-use FFTW plans. If you re-use them you could reasonably ask FFTW to find a particularly fast decomposition with FFTW_MEASURE.
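
Something like this (a sketch only; `in`, `out`, and `nsteps` are placeholders, FFTW_MEASURE overwrites the arrays while planning, and fftw_execute_dft requires the new arrays to have the same size and alignment as the originals):

// Plan once (expensive with FFTW_MEASURE), then execute many times.
fftw_plan plan = fftw_plan_dft_3d(nx, ny, nz, in, out, FFTW_FORWARD, FFTW_MEASURE);

for (int step = 0; step < nsteps; ++step) {
    // ... refill `in` for this step ...
    fftw_execute_dft(plan, in, out);   // reuse the plan; no re-planning per call
}
fftw_destroy_plan(plan);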

I know basically nothing about OpenMP but read this page carefully:
https://www.fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html
It briefly mentions OpenMP compatibility.

The following program is definitely wrong because I didn't address storage order nor even read enough of Eigen's documentation to make sure it's reasonable. But as a smoke test, on my computer, this program takes about 3ms per FFT.
average Time in s: 0.00291012s
This is just a decent computer from late 2019. The CPU is an AMD Ryzen 9 3950X, with 16 cores.

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

#include "D:\jamieal\20230227\eigen\unsupported\Eigen\CXX11\Tensor"
#include "omp.h"
#include "fftw3.h"

#include <iostream>

int main()
{
  Eigen::Tensor<double, 3> Re(nx, ny, nz);
  Re.setRandom();

  Eigen::Tensor<std::complex<double>, 3> Im(nx, ny, nz);

  // Let FFTW use its own threads (one per available hardware thread).
  fftw_init_threads();
  fftw_plan_with_nthreads(omp_get_max_threads());
  fftw_plan plan = fftw_plan_dft_r2c_3d(nx, ny, nz, Re.data(), reinterpret_cast<fftw_complex*>(Im.data()), FFTW_MEASURE);

  // FFTW_MEASURE overwrites the input while planning, so refill it.
  Re.setRandom();

  double start_time = omp_get_wtime();
  for (int i = 0; i < 1000; ++i)
  {
    fftw_execute(plan);
  }
  double run_time = omp_get_wtime() - start_time;
  std::cout << "average Time in s: " << run_time / 1000 << "s\n";
}


You could also investigate hardware-accelerated implementations e.g.
https://docs.nvidia.com/cuda/cufft/
Thanks everyone! Sorry, I've got multiple jobs on top of finishing a grad program and I am struggling to engage here in a timely manner. I will respond to everyone, but I noticed something in @mbozzi's post.

@mbozzi

double run_time = omp_get_wtime() - start_time;
std::cout << "average Time in s: " << run_time / 1000 << "s\n";

Why are you dividing by 1000? According to the documentation, omp_get_wtime returns elapsed wall clock time in seconds, not ms. I think that might be true for clock() only?
Just checking here if I am mistaken or comparing times incorrectly.

https://www.openmp.org/spec-html/5.0/openmpsu160.html
The division is because the code runs 1000 iterations, not to convert milliseconds->seconds.
@mbozzi
I see, I read somewhere that:
The clock() function returns CPU time, not wall time. Instead, use gettimeofday().


Is this true? Now I am starting to question whether the way I am comparing time between the serial and parallel code is the problem.
Just call fftw_plan_dft_r2c_3d instead, leaving out the FFTW_FORWARD argument because it's implied. Although that only fills half the array I think.


I think I might do that; testing fftw_plan_dft_r2c_3d against fftw_plan_dft_3d (which is what I have) shows a significant difference in performance. I notice you were able to pass the Eigen tensor to fftw_plan_dft_r2c_3d directly, which is something I was not able to do so easily with fftw_plan_dft_3d. Hence all the for loops copying data between Eigen and the C++ array, wasting so much time as @jonnin pointed out. But I think I have an idea of what to test and I will return with my results/updates. Thanks.
clock is generally low resolution, usually in ms, and yes, it's the process time, not the wall-clock time. That may be good if your process is being kicked around.

chrono tools are pretty good for timing.
timing stuff can be tricky. The first time you run a program (or part of a large program), it may take 5 to 10 times longer than the second time, because the OS/hardware/etc. cached things. On a busy computer your process can be interrupted... it took 10 seconds, and 9.999999 of them were spent sleeping while the operating system and background garbage called Microsoft to see if your copy of Office was still valid today. What do you want to know if you threaded the work 20 ways: the total CPU time, or the wall-clock time? Are you measuring CPU load/time/activity, or mad-human-at-the-keyboard delay? I don't need the answers, but all that stuff and more can get in the way.
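
for plain wall-clock timing, something like this with <chrono> does the job (minimal self-contained sketch; put the work you want to time where the comment is):

#include <chrono>
#include <iostream>

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    // ... the work you want to time ...
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = t1 - t0;   // seconds, as a double
    std::cout << "Time in s: " << elapsed.count() << "s\n";
}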

I notice you were able to pass Eigen matrix to fftw_plan_dft_r2c_3d directly, which is something I was not able to do so easily with fftw_plan_dft_3d.

I was too short on time to check that FFTW correctly understood the layout of the tensor's storage. So double check that code closely, because you might have to use one of the "advanced" interfaces.

Also I think you should measure wall clock time because that's what you're trying to optimize.
Okay, I guess I found a few issues with the above example. First thing: apparently I cannot just use reinterpret_cast<fftw_complex*>(Im.data()); since that just sets a pointer, and I would need to copy the data using memcpy, as in the following: memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

So I implemented the following test:
void r2cfft3d(Eigen::Tensor<double, 3>& rArr, Eigen::Tensor<std::complex<double>, 3>& cArr){

    fftw_complex *input_array;
    input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
    memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

    fftw_complex *output_array;
    output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

    fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);

    fftw_execute(forward);
    fftw_destroy_plan(forward);
    fftw_cleanup();

    for(int i = 0; i < nx; ++i){
        for(int j = 0; j < ny; ++j) {
            for(int k = 0; k < nz; ++k) {
                cArr(i,j,k).real(output_array[i + nz * (j + ny * k)][REAL]);
                cArr(i,j,k).imag(output_array[i + nz * (j + ny * k)][IMAG]);
            }
        }
    }

    fftw_free(input_array);
    fftw_free(output_array);
}

The only problem with the above code is that it returns a tensor with columns/rows off from the correct output (what I expect).

For example, this returns the following:
     1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11 -9.53674e-07 -9.53674e-07 -9.53674e-07  2.86102e-06            0            0            0            0
       1e+11        1e+11        1e+11        1e+11        1e+11        5e+11        1e+11        1e+11 -4.76837e-06 -4.76837e-06 -4.76837e-06 -1.62125e-05 -1.52588e-05 -1.52588e-05 -1.52588e-05 -1.52588e-05
       1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11 -9.53674e-07 -9.53674e-07 -9.53674e-07  2.86102e-06            0            0            0            0
       1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11        1e+11 -8.58307e-06 -8.58307e-06 -8.58307e-06 -4.76837e-06 -7.62939e-06 -7.62939e-06 -7.62939e-06 -7.62939e-06

While the correct answer is
1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11
1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11
1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 5e+11 1e+11 1e+11 1e+11 1e+11 1e+11
1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11 1e+11

It seems like some columns are being filled with zeros for some reason and the rows and columns are off as well (i.e. location of 5e11)

I guess it's worth noting I am taking the forward FFT and then the inverse FFT to compare the output with what I originally used to initialize my input (I thought this would be an easier test to check).

I am trying to use the following (TensorMap) to map my c++ array into an Eigen tensor instead of using a for loop:
https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html#title3
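
Roughly what I have in mind (untested sketch; the storage order still has to line up for this view to be valid):

// Untested sketch: view the FFTW output buffer as an Eigen tensor, no copy.
// std::complex<double> and fftw_complex share the same layout (double[2]).
Eigen::TensorMap<Eigen::Tensor<std::complex<double>, 3>>
    view(reinterpret_cast<std::complex<double>*>(output_array), nx, ny, nz);

// view(i,j,k) now reads directly out of output_array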
Well, I was able to fix this by using memcpy() and got rid of all the for loops, but now the serial time is the same as with the old code. However, I will test parallelizing the new code since it's more streamlined, and hopefully it works:

fftw_complex *input_array;
input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

fftw_complex *output_array;
output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);

fftw_execute(forward);
fftw_destroy_plan(forward);
fftw_cleanup();

memcpy(cArr.data(), output_array, nx*ny*nz * sizeof(fftw_complex));

fftw_free(input_array);
fftw_free(output_array);
My most recent attempt using OpenMP only:

Now, for the serial code, both attempts return the same/similar serial time: ``Serial Time in SEC: 0.039189s``.
However, adding the OpenMP parallelization attempt again SLOWS the time down: ``Parallel Time in SEC: 0.0934717s``.
Example code:

#pragma omp parallel
    {
        fftw_complex *input_array;
        input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
        memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

        fftw_complex *output_array;
        output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

        #pragma omp single
        {
            fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
            fftw_execute(forward);
            fftw_destroy_plan(forward);
            fftw_cleanup();
        }

        memcpy(cArr.data(), output_array, nx*ny*nz * sizeof(fftw_complex));

        fftw_free(input_array);
        fftw_free(output_array);
    }
}

Also, I tried using fftw_plan_dft_r2c_3d above instead, and for some reason it kept returning a much slower serial time: Serial Time in SEC: 0.485722s. Don't know what that was about!
But if I run with OpenMP and FFTW following the instructions:
https://www.fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html
with the flags:
-lfftw3 -lfftw3_threads -fopenmp
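
For reference, the full compile/link line would look roughly like this (file and output names are placeholders; the FFTW docs suggest listing -lfftw3_threads before -lfftw3):

g++ -O3 -fopenmp main.cpp -lfftw3_threads -lfftw3 -lm -o fft_test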

Example
fftw_complex *input_array;
input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

fftw_complex *output_array;
output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());

fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);

fftw_execute(forward);
fftw_destroy_plan(forward);
fftw_cleanup();

memcpy(cArr.data(), output_array, nx*ny*nz * sizeof(fftw_complex));

fftw_free(input_array);
fftw_free(output_array);

This returns: Parallel Time in SEC: 0.0281432s, which makes me sort of question the results. It feels like this is way too easy and straightforward? How can I get twice the speed just like that :/