How to implement FFTW efficiently in C/C++?

From the FFTW documentation:
1
2
3
You are probably recreating the plan before every transform, rather than creating it once and reusing it for all transforms of the same size. FFTW is designed to be used in the following way:
1. First, you create a plan. This will take several seconds.
2. Then, you reuse the plan many times to perform FFTs. These are fast.

FFTW documentation:
https://fftw.org/faq/section3.html#:~:text=If%20you%20need%20to%20transform%20another%20array%20of,really%20use%20the%20plan_manyroutines%20in%20FFTW%27s%20%22advanced%22%20interface.

What I have been doing is call a function every time I compute an FFT where inside the function I create my FFTW plan and then execute it so for example my function looks like:
1
2
3
4
5
6
7
8
9
10
11
12
void r2cfft(double rArr[], double cArr[][ncomp]){
fftw_plan r2c; 

// Define the FFT plan
r2c = fftw_plan_dft_r2c_2d(nx, ny, &rArr[0], &cArr[0], FFTW_ESTIMATE);

fftw_execute(r2c);
fftw_destroy_plan(r2c);
fftw_cleanup();

	
}

Where this function in the main code is called hundreds of times. Does this mean I am " recreating the plan before every transform"!
If so, is there a way for me to create an FFTW plan once and then executing it as many times as needed while keeping a similar structure of my function? Thanks!
Indeed, you want to create the plan once and then execute it as often as needed! Re-create and destroy the plan every time is bad.

I think you'd have to do something like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#define DIM_0 128
#define DIM_1 128

// Keep plan and associate buffers in static global variables, so that it can be re-used
static fftw_plan g_plan = NULL;
static double *g_buffer_in = NULL;
static fftw_complex *g_buffer_out = NULL;
static int g_initialized = 0;

// Note: Size of 'in' is DIM_0 x DIM_1, whereas size of 'out' is DIM_0 x (DIM_1/2)+1 
// https://www.fftw.org/fftw3_doc/Multi_002dDimensional-DFTs-of-Real-Data.html
void r2cfft(double *const in, fftw_complex *const out)
{
    if (!g_initialized)
    {
        g_buffer_in = fftw_alloc_real(DIM_0 * DIM_1);
        g_buffer_out = fftw_alloc_complex(DIM_0 * ((DIM_1 / 2) + 1));
        if (!(g_buffer_in && g_buffer_out))
        {
            abort(); /*probably out of memory*/
        }

        g_plan = fftw_plan_dft_r2c_2d(DIM_0, DIM_1, g_buff_in, g_buff_out, FFTW_PATIENT);
        if (!g_plan)
        {
            abort(); /*failed to create the plan*/
        }

        g_initialized = 1;
    }

    memcpy(g_buffer_in, in, sizeof(double) * DIM_0 * DIM_1);
    fftw_execute(g_plan);
    memcpy(out, g_buff_out, sizeof(fftw_complex) * DIM_0 * ((DIM_1 / 2) + 1));
}

// Call this once you are done, in order to avoid memory leaks!
void r2cfft_cleanup(void)
{
    if (g_initialized)
    {
        fftw_destroy_plan(g_plan);
        fftw_free(g_buffer_in);
        fftw_free(g_buffer_out);
        g_plan = NULL;
        g_buffer_in = NULL;
        g_buffer_out = NULL;
        g_initialized = 0;
    }
}

This is the "pure C" version. In C++ you probably want to wrap this in a class, instead of using global variables to keep the state 😏

Also, be ware that this function is not thread-safe at all! Use, e.g., a pthread_mutex, to make it safe for multi-threading.
Last edited on
Also, be ware that this function is not thread-safe at all! Use, e.g., a pthread_mutex, to make it safe for multi-threading.
Or prepare a separate plan for each thread to avoid serializing the work.
...or like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
// structure to keep the context (use a separate one per thread)
typedef struct
{
    size_t dim_0, dim_1;
    fftw_plan plan;
    double *buffer_in;
    fftw_complex *buffer_out;
}
r2cfft_t;

// call once to initialize
int r2cfft_initialize(r2cfft_t *const ctx, const size_t dim_0, const size_t dim_1)
{
    memset(ctx, 0, sizeof(r2cfft_t));
    if (!((ctx->dim_0 = dim_0) && (ctx->dim_1 = dim_1)))
    {
        return 0;
    }

    ctx->buffer_in = fftw_alloc_real(dim_0 * dim_1);
    if (!ctx->buffer_in)
    {
        return 0;
    }

    ctx->buffer_out = fftw_alloc_complex(dim_0 * ((dim_1 / 2) + 1));
    if (!ctx->buffer_out)
    {
        fftw_free(g_buffer_in);
        return 0;
    }

    ctx->plan = fftw_plan_dft_r2c_2d(dim_0, dim_1, ctx->buff_in, ctx->buff_out, FFTW_PATIENT);
    if (!ctx->plan)
    {
        fftw_free(ctx->buffer_in);
        fftw_free(ctx->buffer_out);
        return 0;
    }

    return 1;
}

// call many times to execute
void r2cfft_execute(r2cfft_t *const ctx, double *const in, fftw_complex *const out)
{
    memcpy(ctx->buffer_in, in, sizeof(double) * ctx->dim_0 * ctx->dim_1);
    fftw_execute(ctx->plan);
    memcpy(out, ctx->buff_out, sizeof(fftw_complex) * ctx->dim_0 * ((ctx->dim_1 / 2) + 1));
}

// call once to clean up
void r2cfft_cleanup(r2cfft_t *const ctx)
{
    fftw_destroy_plan(ctx->_plan);
    fftw_free(ctx->buffer_in);
    fftw_free(ctx->buffer_out);
    memset(ctx, 0, sizeof(r2cfft_t));
}
Last edited on
There's a C++ solution to a similar problem I supplied in an old thread of yours (last year?).
The relevant part is below.

It's based on the assumption that you will be using FFTW3 to execute individual plans in parallel. Therefore the plans are stored in a global cache, one plan for each array size. Further, this code was designed as a drop-in replacement for your existing program, which re-allocated all of its arrays constantly. It takes special care to use the same plan with different arrays, even if both buffers have the same size.

Ideally, you would never allocate memory inside of your program's main simulation loop. All the arrays and plans should be allocated ahead of time - at startup - and be kept around and reused for the entire simulation.

The best approach seems to be one that adopts the core idea of @kigar's suggestion - keep the buffer and the associated plans in a bundle, and keep them around forever.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
#include <map>
#include <tuple>

using namespace std;

using u32 = uint32_t;
using key_type = tuple<u32, u32, u32, u32>;
map<key_type, fftw_plan> r2c_fftw_plan_cache;
map<key_type, fftw_plan> c2r_fftw_plan_cache;

// Unconditionally create a new plan for use in r2cfft1d.
//
// This function is expensive because it allows FFTW to run its benchmark code. 
// The caller is intended to cache its result so that it can be re-used.
// 
// WARNING(mbozzi): The returned plan must not be passed to fftw_execute.
// Use the new-array execute function fftw_execute_dft_r2c instead.
[[nodiscard]] static fftw_plan make_r2c_plan(double* in, fftw_complex* out, u32 ri, u32 ci, u32 ro, u32 co)
{
	// Copy input and output buffers because FFTW's measurement process would clobber both.
	// 
	// It's probably best to give FFTW real data to work with when measuring
	// If we don't bother, parameters in, out become unnecesssary
	int const i_buffer_size = ri * ci;
	int const o_buffer_size = ro * co;
	auto* i = fftw_alloc_real(i_buffer_size);    if (!i) std::abort();
	auto* o = fftw_alloc_complex(o_buffer_size); if (!o) std::abort();

	static_assert(std::is_trivially_copyable<fftw_complex>::value,
		"can memcpy fftw_complex"); // (but not std::copy it)
	std::memcpy(i, in, i_buffer_size * sizeof * i);
	std::memcpy(o, out, o_buffer_size * sizeof * o);

	int const ci_ = ci; // FFTW wants an int* not a u32*.

	// https://www.fftw.org/fftw3_doc/Advanced-Complex-DFTs.html
	// https://www.fftw.org/fftw3_doc/Advanced-Real_002ddata-DFTs.html
	fftw_plan const the_plan = fftw_plan_many_dft_r2c(
		1,         // compute rank-one transforms.
		&ci_,      // the rank-one transforms are this long in their first (only) dimension
		ri,        // do one transform per row
		i,         // the input data is here
		nullptr,   // the input data is not a row-major subarray of a larger array
		ri,        // within this input array the next element is offset ri elements from prior
		1,         // address of next input array is offset 1 element from start of prior
		o,         // the output data is here
		nullptr,   // the output data is not a row-major subarray of a larger array
		ro,        // within this output array the next element is offset ro elements from prior
		1,         // address of next output array is offset 1 element from start of prior
		FFTW_MEASURE
	);

	fftw_free(i);
	fftw_free(o);

	return the_plan;
}

// See make_r2c_plan.
//  
// WARNING(mbozzi): The returned plan must not be passed to fftw_execute.  
// Use the new-array execute function fftw_execute_dft_c2r instead.
[[nodiscard]] static fftw_plan make_c2r_plan(fftw_complex* in, double* out, u32 ri, u32 ci, u32 ro, u32 co)
{
	int const i_buffer_size = ri * ci;
	int const o_buffer_size = ro * co;

	auto* i = fftw_alloc_complex(i_buffer_size); if (!i) std::abort();
	auto* o = fftw_alloc_real(o_buffer_size);    if (!o) std::abort();

	std::memcpy(i, in, i_buffer_size * sizeof * i);
	std::memcpy(o, out, o_buffer_size * sizeof * o);

	int const co_ = co;

	fftw_plan const the_plan = fftw_plan_many_dft_c2r(
		1, &co_, ri, i, nullptr, ri, 1, o, nullptr, ro, 1, FFTW_MEASURE);

	fftw_free(i);
	fftw_free(o);

	return the_plan;
}

[[nodiscard]] static constexpr key_type make_key(u32 ri, u32 ci, u32 ro, u32 co)
{
	return make_tuple(ri, ci, ro, co);
}

// Get an existing plan from r2c_fftw_plan_cache or make one with make_r2c_plan
// if it does not exist.
[[nodiscard]] fftw_plan get_r2c_plan(double* i, fftw_complex* o, u32 ri, u32 ci, u32 ro, u32 co)
{
	auto const key = make_key(ri, ci, ro, co);
	auto const iter = r2c_fftw_plan_cache.find(key);

	if (iter == r2c_fftw_plan_cache.end())
		r2c_fftw_plan_cache[key] = make_r2c_plan(i, o, ri, ci, ro, co);

	return r2c_fftw_plan_cache[key];
}

[[nodiscard]] fftw_plan get_c2r_plan(fftw_complex* i, double* o, u32 ri, u32 ci, u32 ro, u32 co)
{
	auto const key = make_key(ri, ci, ro, co);
	auto const iter = c2r_fftw_plan_cache.find(key);

	if (iter == c2r_fftw_plan_cache.end())
		c2r_fftw_plan_cache[key] = make_c2r_plan(i, o, ri, ci, ro, co);

	return c2r_fftw_plan_cache[key];
}
Last edited on
Thanks everyone! You've provided A LOT of great examples and honestly sometimes it's hard to choose. I think my main issue is that I am mixing C and C++ in my code and ending up in this situation.
For now, I got a quick example working from what @kigar64551 suggested at the first comment! It looks like that I will be able to get an example working from the second comment of @kigar64551 with structures as well. Again, thanks @mbozzi for literally the millionth time. You've provided a lot of insight and help and my apologies for being a slow learner! I think I will have to get back to the example you just provided as I am struggling to follow with just from the first glance.

I do have a lot of questions regarding parallelization and thread safety and I just need to reread some of your comments and get my thoughts organized.

For now this is the test code I have working based on what @kigar64551 provided:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
static const int nx = 128;
static const int ny = 128;
static const int ncomp = 2;
static const int nyk = ny/2 + 1;

static fftw_plan g_plan = NULL;
static double *g_buffer_in = NULL;
static fftw_complex *g_buffer_out = NULL;
static int g_initialized = 0;
//for c2r
static fftw_plan g_plan1 = NULL;
static fftw_complex *g_buffer_in1 = NULL;
static double *g_buffer_out1 = NULL;
static int g_initialized1 = 0;

void r2cfftNew(double rArr[], double cArr[][ncomp]); 
void r2cfft_cleanup(void);
void c2rfftNew(double cArr[][ncomp],double rArr[]);
void c2rfft_cleanup(void);
int main(){	

double *Function;
Function= (double*) fftw_malloc(nx*ny*sizeof(double)); 
	for (int i = 0; i < nx; i++){
		for (int j = 0; j < ny; j++){
		  Function[j + ny*i]  = //some initialization; 
			
		}
	}

fftw_complex *Functionk;
Functionk= (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex)); 
memset(Functionk, 42, nx*nyk* sizeof(fftw_complex));

r2cfftNew(Function,Functionk);

double *Out;
Out = (double*) fftw_malloc(nx*ny*sizeof(double)); 
memset(Out, 42, nx*ny* sizeof(double));

c2rfftNew(Functionk,Out); //Out = Function
r2cfft_cleanup();
c2rfft_cleanup();


//fftw_free stuff
}
	void r2cfftNew(double rArr[], double cArr[][ncomp]) 
	{
    	if (!g_initialized)
    	{
        	g_buffer_in = fftw_alloc_real(nx * ny); 
        	g_buffer_out = fftw_alloc_complex(nx * nyk);
        	if (!(g_buffer_in && g_buffer_out))
        	{
            	abort(); /*probably out of memory*/
        	}

        	g_plan = fftw_plan_dft_r2c_2d(nx, ny, g_buffer_in, g_buffer_out, FFTW_PATIENT);
        	if (!g_plan)
        	{
            	abort(); /*failed to create the plan*/
        	}

        	g_initialized = 1;
    	}

    	memcpy(g_buffer_in, rArr, sizeof(double) * nx * ny);
    	fftw_execute(g_plan);
    	memcpy(cArr, g_buffer_out, sizeof(fftw_complex) * nx * nyk);
	}
	// Call this once you are done, in order to avoid memory leaks!
	void r2cfft_cleanup(void)
	{
    	if (g_initialized)
    	{
        	fftw_destroy_plan(g_plan);
        	fftw_free(g_buffer_in);
        	fftw_free(g_buffer_out);
        	g_plan = NULL;
        	g_buffer_in = NULL;
        	g_buffer_out = NULL;
        	g_initialized = 0;
    }
	}
	void c2rfftNew(double cArr[][ncomp],double rArr[]) 
	{
    	if (!g_initialized1)
    	{
        	g_buffer_in1 = fftw_alloc_complex(nx * nyk);
        	g_buffer_out1 = fftw_alloc_real(nx * ny); 
        	if (!(g_buffer_in1 && g_buffer_out1))
        	{
            	abort(); /*probably out of memory*/
        	}

        	g_plan1 = fftw_plan_dft_c2r_2d(nx, ny, g_buffer_in1,g_buffer_out1, FFTW_PATIENT);
        	if (!g_plan1)
        	{
            	abort(); /*failed to create the plan*/
        	}

        	g_initialized1 = 1;
    	}

		memcpy( g_buffer_in1,cArr, sizeof(fftw_complex) * nx * nyk);
    	        fftw_execute(g_plan1);
		memcpy(rArr, g_buffer_out1, sizeof(double) * nx * ny);
		//(rArr, 1.0 / (nx*ny), rArr); //renormalize 
    	
	}
	void c2rfft_cleanup(void)
	{
    	if (g_initialized1)
    	{
        	fftw_destroy_plan(g_plan1);
        	fftw_free(g_buffer_in1);
        	fftw_free(g_buffer_out1);
        	g_plan1 = NULL;
        	g_buffer_in1 = NULL;
        	g_buffer_out1 = NULL;
        	g_initialized1 = 0;
    }
	}

Last edited on
I do have a lot of questions regarding parallelization and thread safety and I just need to reread some of your comments and get my thoughts organized.

As far as FFTW is concerned, only fftw_execute() is thread-safe. All other functions should only be called from one thread at a time! So it is probably best to create all plans in one thread. Either that, or the creation of the plans needs to be serialized. Anyways, once a plan was created, it is perfectly safe to use the same plan by multiple threads at the same time. But, if you want to do this, then you have to use the "new array" functions, so that each thread can use its own separate input/output buffers, even when they all share the same plan:
https://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html


Another more general problem (not specific to FFTW) is that the following "initialize on first call" pattern is not thread safe:
1
2
3
4
5
6
7
8
9
10
11
12
static int g_initialized = 0;

void some_function(...)
{
    if (!g_initialized)
    {
        // do one-time initialization here!
        g_initialized = 1;
    }

    // do the actual computation here!
}

This can be solved, in pure C, by giving each thread its own separate "context" (i.e. set of state variables), which will be initialized or cleaned-up separately by each thread; you probably want to wrap the "context" in a struct. In C++, each thread can have its own separate instance (object) of the class that wraps the state. But, even if you give each thread its own separate "context", you still have to synchronize calls to fftw_plan*() functions and friends, e.g. by using a pthread_mutex_t or std::mutex, for the reasons pointed out before...

C++ example:
1
2
3
4
5
6
7
8
9
10
11
class MyFftwClass {
public:
    MyFftwClass(void);
    ~MyFftwClass(void);
    void execute(double *const in, fftw_complex *const out);

private:
    static fftw_plan s_plan; // <-- shared by all instances
    double *m_buffer_in;
    fftw_complex *m_buffer_out;
};
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
std::mutex g_my_fftw_mutex; // <-- must lock before using any fftw_*() function, except for fftw_execute*()

MyFftwClass::MyFftwClass(void)
{
    // serialize the initialization!
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);

    // create plan *once* for all instances/threads
    if (!s_plan)
    {
        s_plan = fftw_plan_xyz(...);
        if (!s_plan)
        {
            throw new std::runtime_error("Failed to create plan!");
        }
    }

    // allocate separate buffers for each instance
    m_buffer_in = fftw_alloc_real(...);
    m_buffer_out = fftw_alloc_complex(...);
    if (!(m_buffer_in && m_buffer_out))
    {
        throw new std::runtime_error("Failed to allocate memory!");
    }
}

MyFftwClass::~MyFftwClass(void)
{
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);
    fftw_free(m_buffer_in);
    fftw_free(m_buffer_out);
}

void MyFftwClass::execute(double *const in, fftw_complex *const out)
{
    // No serialization is needed here
    memcpy(m_buffer_in, in, ...);
    fftw_execute_dft_r2c(s_plan, m_buffer_in, m_buffer_out);
    memcpy(out, m_buff_out, ...);
}

fftw_plan MyFftwClass::s_plan = NULL;

(Be aware that each thread is supposed to use its own separate instance/object of the MyFftwClass class)

(I think that the memcpy() is still needed, as fftw_execute*() needs to use buffers that have been allocated with fftw_alloc*() functions, but the arguments/pointers passed into MyFftwClass::execute() can not, in general, be assumed to have been allocated like that)
Last edited on
I think that the memcpy() is still needed, as fftw_execute*() needs to use buffers that have been allocated with fftw_alloc*() functions


I was under this impression too, but it turns out to be unnecessary (though it should be preferred)
https://www.fftw.org/fftw3_doc/SIMD-alignment-and-fftw_005fmalloc.html

Well, while you are not "required" to use fftw_alloc*(), it may result in SIMD not being used:

...we recommend allocating your transform data with fftw_malloc and de-allocating it with fftw_free. These have exactly the same interface and behavior as malloc/free, except that for a SIMD FFTW they ensure that the returned pointer has the necessary alignment.

You are not required to use fftw_malloc. You can allocate your data in any way that you like, from malloc to new (in C++) to a fixed-size array declaration. If the array happens not to be properly aligned, FFTW will not use the SIMD extensions.

(not being able to use SIMD, e.g. SSE or AVX, probably means a major slow-down ☚ī¸)
Last edited on
Thanks all! I was able to reproduce all the examples you provided, which is extremely helpful! Now, I have this example where I try to parallelize and achieve thread-safety. Does this look correct to you?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
static const int nx = 128;
static const int ny = 128;
static const int ncomp = 2;
static const int nyk = ny/2 + 1;

class MyFftwClass {
public:
    MyFftwClass(void);
    ~MyFftwClass(void);
    void execute(double rArr[], double cArr[][ncomp]);

private:
    static fftw_plan s_plan; // <-- shared by all instances
    double *m_buffer_in;
    fftw_complex *m_buffer_out;
};


class MyFftwClass1 {
public:
    MyFftwClass1(void);
    ~MyFftwClass1(void);
    void execute1(double cArr[][ncomp],double rArr[]);

private:
    static fftw_plan s_plan1; // <-- shared by all instances
	fftw_complex *m_buffer_in1;
    double *m_buffer_out1;
};

int main(){	
	

double *Function;
Function= (double*) fftw_malloc(nx*ny*sizeof(double)); 
	for (int i = 0; i < nx; i++){
		for (int j = 0; j < ny; j++){
		  Function[j + ny*i]  = //some initialization; 
			
		}
	}

fftw_complex *Functionk;
Functionk= (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex)); 
memset(Functionk, 42, nx*nyk* sizeof(fftw_complex));

MyFftwClass r2c1; //declare r2c1 of type MyFftwClass
r2c1.execute(Function,Functionk);

double *Out;
Out = (double*) fftw_malloc(nx*ny*sizeof(double)); 
memset(Out, 42, nx*ny* sizeof(double));

MyFftwClass1 c2r1; //declare r2c1 of type MyFftwClass
c2r1.execute1(Functionk,Out);
//fftw_free stuff
}
std::mutex g_my_fftw_mutex; // <-- must lock before using any fftw_*() function, except for fftw_execute*()
MyFftwClass1::MyFftwClass1(void)
{
    // serialize the initialization!
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);

	// allocate separate buffers for each instance
    m_buffer_in1 = fftw_alloc_complex(nx * nyk);
    m_buffer_out1 = fftw_alloc_real(nx * ny); 
    if (!(m_buffer_in1 && m_buffer_out1))
    {
        throw new std::runtime_error("Failed to allocate memory!");
    }

    // create plan *once* for all instances/threads
    if (!s_plan1)
    {
        s_plan1 = fftw_plan_dft_c2r_2d(nx, ny, m_buffer_in1, m_buffer_out1, FFTW_PATIENT);
        if (!s_plan1)
        {
            throw new std::runtime_error("Failed to create plan!");
        }
    }

}
MyFftwClass1::~MyFftwClass1(void)
{
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);
	fftw_destroy_plan(s_plan1);
    fftw_free(m_buffer_in1);
    fftw_free(m_buffer_out1);
}

void MyFftwClass1::execute1(double cArr[][ncomp],double rArr[]) //void MyFftwClass::execute(double *const in, fftw_complex *const out)
{
    // No serialization is needed here
	memcpy(m_buffer_in1,cArr, sizeof(fftw_complex) * nx*(nyk));
    fftw_execute_dft_c2r(s_plan1, m_buffer_in1, m_buffer_out1); //instead of fftw_excute(plan)
	memcpy(rArr, m_buffer_out1,sizeof(double) * nx*ny);
	//(rArr, 1.0 / (nx*ny), rArr); //renormalize 
}

fftw_plan MyFftwClass1::s_plan1 = NULL;

std::mutex g_my_fftw_mutex; // <-- must lock before using any fftw_*() function, except for fftw_execute*()
MyFftwClass::MyFftwClass(void)
//int r2cfft_initialize(r2cfft_t *const ctx, const std::size_t nx, const std::size_t ny)
{
    // serialize the initialization!
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);

	//allocating buffers first before plan creating gets rid off seg fault error

	// allocate separate buffers for each instance
    m_buffer_in = fftw_alloc_real(nx * ny); 
    m_buffer_out = fftw_alloc_complex(nx * nyk);
    if (!(m_buffer_in && m_buffer_out))
    {
        throw new std::runtime_error("Failed to allocate memory!");
    }

    // create plan *once* for all instances/threads
   if (!s_plan)
    {
        s_plan = fftw_plan_dft_r2c_2d(nx, ny, m_buffer_in, m_buffer_out, FFTW_PATIENT);
        if (!s_plan)
        {
            throw new std::runtime_error("Failed to create plan!");
        }
    }

    
}

MyFftwClass::~MyFftwClass(void)
{
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);
    fftw_free(m_buffer_in);
    fftw_free(m_buffer_out);
    fftw_destroy_plan(s_plan);
}

void MyFftwClass::execute(double rArr[], double cArr[][ncomp]) //void MyFftwClass::execute(double *const in, fftw_complex *const out)
{
    // No serialization is needed here
    memcpy(m_buffer_in, rArr,  sizeof(double) * nx*ny);
    fftw_execute_dft_r2c(s_plan, m_buffer_in, m_buffer_out); //instead of fftw_excute(plan)
    memcpy(cArr, m_buffer_out, sizeof(fftw_complex) * nx*(nyk));
}

fftw_plan MyFftwClass::s_plan = NULL;


I am not very familiar with classes since I have been mixing C++/C so this was a bit confusing to me, however, this seems to work. I mean the ffts are returning the expected values. Now, I wanna try to parallelize but first I wanna make sure the above code follows the logic everyone provided.
Again, thanks
Cheers!

Would you give an broad overview of the design of your program?

I think you're working on a long-running, fixed-size simulation, where each step depends on the results of the previous step. This is just my impression from a while ago, so I could be wrong.

The long-running part means that the amount of CPU time spent starting-up should be small relative to the CPU time spent executing the entire program.

The fixed-size part suggests that it's possible to:
- Do all the memory allocation at startup, and
- Do all the FFTW planning at startup, and
- Start all your worker threads at startup.

Am I close at all? Does that seem to be possible?
Last edited on
Would you give an broad overview of the design of your program?


Well, I will try my best to give a description of the ''design'' since I am unfortunately by no means a coder/programmer.

Basically the code solves for two PDEs using two numerical methods: 1. Spectral methods (iterative solver) and 2. Rung-Kutta 4 (time stepping method since there's time dependence in my PDE i.e. d/dt). Therefore, the idea is my main function does a bunch of calculations (some mathematical terms and so on) then the most important part of code is the iterative solver and the time stepper. So for every single time step, my iterative solver will go however many iterations till it converges. Now, since spectral methods here rely heavily on FFT I am basically having to take FFT inside every single function that will be passed into the iterative solver as well as within the time stepper for loop and that's why I am focusing on parallelizing my FFTs. I can provide a "simple" script to give a better idea since you'd have a different perspective to the ''design''.

Main code{
//some calculations and initializations
//print iterations in command line
for{
//update time and iterations
//begin time step loop here
for (){
//FFTs and more calculations

//iterative solver function here

}
}
}


Do all the memory allocation at startup, and

This is something I have been struggling with. I am not sure how possibly I could allocate memory at startup, especially with all the variables being initialized to be passed to functions and so on.

Do all the FFTW planning at startup, and

I think I was able to achieve this with the code example provided above.

Start all your worker threads at startup.

This sounds like a good idea too and will try doing that.
I am trying to follow with FFTW parallel (multi-threading) documentation here:
https://fftw.org/fftw2_doc/fftw_4.html

clearly something is wrong cause I am getting no speed up at all!

s of the number of threads. What exactly are you measuring in your performance tests? Is it the time spent calling fftw_execute() or in the entire script/part of it? As others have mentioned, planning the FFT takes a long time, which is why you only do it once and then call the FFT many times after that. So what you want to compare is how long fftw_execute() takes, e.g. to be called 1000 times with different N_Threads. Also you should use FFTW_MEASURE as a planner flag for best performance. My compiler flags are -lfftw3_omp -lfftw3 -Ofast

So I do the FFT plan once and then lock it to make sure it's thread-safe. Then, excute FFTs many times (e.g. 1000 times) as possible with different N_threads. With the above code the main function can look like:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
MyFftwClass r2c1; //declare r2c1 of type MyFftwClass

const int N = 1000;
for (int nThreads =1; nThreads <= 16; nThreads++){
        double start_time, run_time;
        start_time = omp_get_wtime();
       
        omp_set_num_threads(nThreads); // openMP
        #pragma omp parallel for 
        for ( int i = 0; i< N; i++){
            r2c1.execute(Function,Functionk);
        }
        //clock_t end = clock();
        run_time = omp_get_wtime() - start_time;
        //cout << "Threads: " << nThreads <<  "   Sum: " << sum << " Time:  " << (double)(end-start) / CLOCKS_PER_SEC << "s\n";
        cout << "Threads: " << nThreads <<  "Parallel Time in s: " <<  run_time << "s\n";
        
    }


From documentation void fftw_plan_with_nthreads(int nthreads); should be called before creating a plan that you want to parallelize, but I wasn't sure if it's okay to call it in the main code before the declaration of MyFftwClass r2c1; or inside the function definition.

The above code doesn't seem to really work since it returns:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Threads: 1Parallel Time in s: 0.0321368s
Threads: 2Parallel Time in s: 0.0331744s
Threads: 3Parallel Time in s: 0.0238543s
Threads: 4Parallel Time in s: 0.0248886s
Threads: 5Parallel Time in s: 0.0213162s
Threads: 6Parallel Time in s: 0.0202103s
Threads: 7Parallel Time in s: 0.0206512s
Threads: 8Parallel Time in s: 0.0182187s
Threads: 9Parallel Time in s: 0.0163617s
Threads: 10Parallel Time in s: 0.0170322s
Threads: 11Parallel Time in s: 0.0165179s
Threads: 12Parallel Time in s: 0.0164231s
Threads: 13Parallel Time in s: 0.0182663s
Threads: 14Parallel Time in s: 0.0181124s
Threads: 15Parallel Time in s: 0.017698s
Threads: 16Parallel Time in s: 0.0179136s


I have 6 cores so I sort of expected 6 times speedup.

Also, the flags I used:
 
-lfftw3_omp -lfftw3 -Ofast -fopenmp
I'm surprised there's even a correlation between thread count and elapsed time. Unless I'm missing something, you're not splitting the work among the threads, you're making all the threads do the same amount of work as if the thread was not part of a parallel workload.

 
f();
Time taken: 1

1
2
for (int i = 0; i < n; i++)
    f();
Time taken: n

1
2
3
4
5
6
7
8
9
std::vector<std::thread> threads;
for (int j = 0; j < n; j++){
    threads.emplace_back([](){
        for (int i = 0; i < n; i++)
            f();
    });
}
for (auto &t : threads)
    t.join();
Time taken: ~1
Registered users can post here. Sign in or register to post.