I have started working on multithreading and I am getting a little bit confused. Assuming that I have a huge array of double number, a million of small numbers for example, and I want to find the sum of every single element. Without using multithreading I will have to add to a variable every array[i] number. However multithreading could speed up the process I can divide the array to 4 parts for instance, create 4 threads calculate the sum of each part return it to the main function and the sum them, how could I make such a program? An example with a code could be really usefully to help me understand what exactly is going on, thanks in advance!
> However multithreading could speed up the process
> I can divide the array to 4 parts for instance, create 4 threads calculate the sum of each part
Yes. But there will not be a four fold increase in speed: summing up numbers is a fast operation and doing things asynchronously introduces its own overhead.
> how could I make such a program?
> An example with a code could be really usefully to help me understand what exactly is going on
When you (and implementations) switch to C++17, you would be able to use std::reduce
with an appropriate execution policy, say std::execution::parallel_unsequenced_policy https://en.cppreference.com/w/cpp/algorithm/reduce
I tried the OpenMP library since it looked like a fast and a simple way to implement it on my code however the results before using the library were faster than the ones after including it
1 2 3 4 5 6 7
Simple for:
Sum: 2.39484e-05
Time taken: 1.422164
For running in two threads:
Sum: 2.39484e-05
Time taken: 3.720002
@KarampistisDimitris,
I'm surprised. If I comment out the OpenMP lines in my code then it takes the same time as a single thread.
What compiler options are you using? Are you sure that you haven't left debug options on? Could you show your code?
OpenMP is so easy to use compared with the C++ standard library thread routines and the fact that it is available out of the box for the Gnu compilers (and a few others: https://www.openmp.org/resources/openmp-compilers-tools/ ) is a big plus.
g++ (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have the debug options on since I am still developing the program, the options where on, on both simple for and with the OpenMP library (should I turn them off?).
g++ -Wall -g -fopenmp -std=c++11 simple.cpp -o out
I did not use the exact same code I made some research online and I found that I can parallel a for using
1 2
#pragma omp parallel for num_threads(2)
instead of
1 2
omp_set_num_threads( nThreads );
#pragma omp parallel for reduction( + : sum )
I just found the first way easier to implement (It may actually be a wrong decision, correct me if I am wrong)
I should turn the debugging options (particularly -g) off if you are interested in timing the threads.
Obviously there are several different ways of writing the parallel for ... but how are you doing the reduction to get the sum? What happens if you try to get 4 threads rather than 2?
Your code actually is faster than a simple for, I just tried it on my system
This is the code I am using to multithread the sum
1 2 3 4 5 6 7 8 9 10 11
#pragma omp parallel num_threads(2)
{
// This code will be executed by two threads.
// Chunks of this loop will be divided amongst
// the (two) threads of the current team.
#pragma omp for
for(int n=0; n<100000000; ++n) sum += A[n];
}
//Source: https://bisqwit.iki.fi/story/howto/openmp/
//Slightly modified
1 2 3 4 5 6 7 8 9
Your code:
Threads: 2 Sum: 6.28319 Time: 0.633998 s
Single thread
Threads: 1 Sum: 6.28319 Time: 0.668649 s
Source code:
Threads: 2 Sum: 6.28319 Time: 0.708978 s
*EDIT* You are actually right the source code is not always giving the right sum sometimes one thread is lost. Reduction is needed
some questions...
are you running this thing in debug mode??
I ran the dumb for loop and its 0.35 in debug mode for me (and I have a meh ish work laptop.. its not powerful at all). Its too fast to register on a crude clock() timer in a release build.
I put in a high res chrono timer and get
0.00000030 sec
in release mode.
and that makes *sense*. your cpu is 3.0 neighborhood gigahertz. Thats 3 billion operations per second. You are only adding 100M values, and addition isnt very complicated.
and I get that its an example, .. what do you want to do for real? For this example, if you really wanted to DO this, you would sum it in the loop that populates the array, not 2 loops (one to make and one to sum). As a counter example to a way to actually shave some time off the task. It might be cheaper on total time taken program wide if you multi threaded the array population loop, which actually does a little work!
At this point I am trying to learn some things about multithread, so that I can multithread this code
1 2 3 4 5
for(int i = 0; i < N; i++){
for(int j = 0; j < M; j++){
c[i][j] = b[i][j] + a[i][j];
}
}
However before going instantly on the apove code before nested for loops and 2d array I want to make sure that I have understand the basics. I am currently experimenting on simpler examples. After that I will try other mathematical expressions inside the for. When you are saying >On debug mode do you mean through gdb or any other similar program? If so I am not running it through a debugging program just executing it using ./out
I mean when you compiled it. Did you do g++ -O3 or similar? The default option is debug enabled code which is MUCH slower. If you take that out, you can't use the debugger anymore, but the code is faster.
multi-threading may actually slow you down on release code builds in a lot of cases if you don't have a very large amount of data or a complicated problem. Its kind of tricky to get a feel for it.
you can write that for loop as a single loop if a,b,c are arrays or single memory blocks. Learning multi threading is important, and its a strange way to think. You have to think about it on the front and back ends.. First you have to have an algorithm for the task that can make good use of the MP and then often the last thing you do after tuning the workhorse loop/function/etc is actually break it up and thread it. Tuning the loop in the middle is a whole lot of learning as well, but its equally important.
I literally just noticed it I was using the default option -O0 changing to -O3 is actually increasing the speed
Moreover if I get it right generally speaking you are not 100% sure whether multithreading will make your code faster since some CPUs may be busy so the on thread will have to wait for the others to end, does not it?
constint N = 100000000;
int sign;
double sum = 0;
staticdouble A[N];
for ( int i = 0, sign = 1; i < N; i++, sign = -sign )
{
A[i] = sign * 4.0 / ( 1.0 + 2.0 * i );
}
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for(int n=0; n<100000000; ++n) sum += A[n];
high_resolution_clock::time_point t2 = high_resolution_clock::now();
duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
I get .35 for -O0, .034 for -O1, and e-7 for O2 and O3
Its not just busy CPU. Asking for a thread takes time in both the OS and the hardware, time that very fast things (like this loop taking e-7 sec) can actually finish the task faster than the overhead of splitting it up into threads takes. And its not hard and fast. If you looped 100 times, you would see it.. 100M maybe the threads beat it... somewhere there is a break even point. CPUs and OS do NOT wait on threads to end. They context switch (also very expensive) and give a little time to each thread on most desktop type OS. There are some systems or controls to modify that behavior, but that is the typical PC way. But more than 1 thread per CPU is counterproductive on most hardware. Leaving 1 cpu for the OS and burning the rest with your code is a solid approach for typical needs.
I put it on my hot-box @ home. .24 / 026 / 3e-7 … interestingly no faster on the most optimal version, indicating it is maybe bound by a factor of clock speed.