Dear experts,

I realize this may be an ambiguous, case-by-case question, but if you have relevant know-how, I would appreciate your help.
Many people have probably pre-allocated memory before a loop rather than inside it (e.g. std::vector<Obj> dat(buffer_size);). Avoiding repeated allocation of such a region is a basic way to speed up a program.
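To make the pattern concrete, here is a minimal serial sketch of what I mean (Obj and the loop body are hypothetical stand-ins for my real code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical payload type standing in for Obj in my real code.
struct Obj { double x[8]; };

// The buffer is allocated once outside the loop and reused each iteration,
// instead of constructing std::vector<Obj> dat(buffer_size) inside the loop.
std::size_t run_preallocated(std::size_t loop_num, std::size_t buffer_size) {
    std::vector<Obj> dat(buffer_size);  // one allocation, reused throughout
    std::size_t touched = 0;
    for (std::size_t i = 0; i < loop_num; ++i) {
        dat[i % buffer_size].x[0] = static_cast<double>(i);  // stand-in work
        ++touched;
    }
    return touched;
}
```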
I tried to apply this to a parallel loop. In short, I got a speedup in the serial run, but a slowdown in the parallel run (compared with allocating the memory inside the loop on every iteration). The more threads I use, the slower the program runs.
(To avoid data races, a separate dat is prepared for each thread; I tried 1 to 32 threads.)
I suspect this is due to Non-Uniform Memory Access (NUMA), because all the pre-allocated regions are created by a single thread, with no consideration of memory locality.
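My (possibly wrong) understanding is that under the OS's default first-touch policy, pages land on the NUMA node of the thread that first writes them. A sketch of what I think might help, using plain std::thread so each thread allocates and touches its own buffer (Obj is a hypothetical stand-in):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct Obj { double x[8]; };  // stand-in for Obj in my real code

// First-touch sketch: each thread allocates and writes its own buffer, so
// the OS (on Linux, by default) should place those pages on the NUMA node
// of the touching thread, not on the node of the main thread.
std::vector<std::vector<Obj>> first_touch_alloc(std::size_t nthreads,
                                                std::size_t buffer_size) {
    std::vector<std::vector<Obj>> dats(nthreads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t)
        workers.emplace_back([&dats, t, buffer_size] {
            dats[t].assign(buffer_size, Obj{});  // allocation + first touch here
        });
    for (auto& w : workers)
        w.join();
    return dats;
}
```

Whether this actually binds each buffer to the right node under a TBB worker pool is exactly what I am unsure about.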
If you know how to make each thread access a pre-allocated memory region near its own CPU, without data races, I would appreciate your advice.
Pseudocode is below (I use TBB):
// thread number is set
std::size_t max_thread_num = 16;

// pre-allocated memory region per thread
std::vector<std::vector<Obj>> dats(max_thread_num);
for (auto& d : dats)
    d = std::vector<Obj>(buffer_size);

tbb::parallel_for(tbb::blocked_range<std::size_t>(0, LOOP_NUM),
    [&](const tbb::blocked_range<std::size_t>& r) {
        // assumes the arena runs at most max_thread_num threads,
        // so thread_id stays within the bounds of dats
        int thread_id = tbb::task_arena::current_thread_index();
        auto& dat = dats[thread_id];
        // std::vector<Obj> dat(buffer_size); // (repeated-allocation version)
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            // process something using dat
        }
    });
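One approach I am considering (not yet verified) is a thread-local buffer, so each buffer is allocated and first touched by the thread that uses it, with no shared indexing at all (Obj is again a hypothetical stand-in):

```cpp
#include <cstddef>
#include <vector>

struct Obj { double x[8]; };  // stand-in for Obj in my real code

// One buffer per OS thread: the first call on each thread allocates and
// touches the buffer on that thread, so its pages should end up on that
// thread's local NUMA node. Caveat: the buffer lives until the thread
// exits, which for TBB worker threads can be the whole program lifetime.
std::vector<Obj>& local_buffer(std::size_t buffer_size) {
    thread_local std::vector<Obj> dat;
    if (dat.size() != buffer_size)
        dat.assign(buffer_size, Obj{});  // sized once per thread
    return dat;
}
```

Inside the parallel_for body I would then call auto& dat = local_buffer(buffer_size); instead of indexing dats by thread id. I believe TBB's tbb::enumerable_thread_specific serves a similar purpose; I would welcome comments on whether either helps with the NUMA placement.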