Multi-threading and its effect on performance

I've just started learning multi-threading and have been working my way through it. A couple of questions about the code below:

#include <iostream>
#include <thread>
#include <vector>
#include <numeric>
#include <cstdint>   // uint64_t, uint16_t
#include <cstdlib>   // system("pause")

void AccumulateRange(uint64_t& sum, uint64_t start, uint64_t end) {
	sum = 0;
	for (uint64_t i = start; i < end; ++i)
		sum += i;
}

void printVector(const std::vector<uint64_t>&  vec) {
	for (auto v : vec)
		std::cout << v << ' ';
	std::cout << '\n';
}


int main() {

	const uint64_t number_of_elements{ 1000000000 }; // one billion elements
	const uint16_t number_of_threads{ 10 };
	const uint64_t step{ number_of_elements / number_of_threads }; // uint64_t avoids a narrowing conversion in the braced init

	std::vector<std::thread> threads;
	std::vector<uint64_t> partial_sums(number_of_threads);

	for (int i = 0; i < number_of_threads; ++i)
		threads.push_back(std::thread(AccumulateRange, std::ref(partial_sums[i]),
			i * step, (i + 1) * step));

	for (auto& t : threads)
		if (t.joinable())
			t.join();

//	for (int i = 0; i < 10; ++i)
//		AccumulateRange(partial_sums[i], i * step, (i + 1) * step);

	uint64_t total = std::accumulate(partial_sums.begin(), partial_sums.end(), uint64_t(0));
	printVector(partial_sums);

	std::cout << "Total = " << total << '\n';

	system("pause");
	return 0;
}


1) Is this a good, simple example of multi-threading in C++?

2) Here, ten threads besides the main() thread are used to improve performance. But the time needed to perform the task is the same with or without those ten helper threads! Either way, my PC takes about 6 or 7 seconds to carry out the task! Why? I thought the program would be much faster with the helper threads.

> const uint16_t number_of_threads{ 10 };
How many actual cores do you have?

> printVector(partial_sums);
You need to get the times without any I/O.
I/O is expensive, and I/O to your console screen is even more expensive than just writing to a file.

If you want to measure the effect of threads...
https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now
Put your timing code around just the work.
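Something like this (a minimal sketch; the loop is just stand-in work, and the I/O happens only after the second time stamp):

#include <chrono>
#include <iostream>

int main() {
	const auto t0 = std::chrono::steady_clock::now();

	volatile unsigned long long sum = 0;	// volatile: keep the loop from being optimised away
	for (unsigned long long i = 0; i < 100000000; ++i)
		sum = sum + i;

	const auto t1 = std::chrono::steady_clock::now();

	// only now do any I/O
	std::cout << "elapsed: "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
		<< " millisecs\n";
}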
> How many actual cores do you have?
(Core i3-4160): 2 cores, 4 threads

> You need to get the times without any I/O. If you want to measure the effect of threads...
https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now
I did these. The output:
With helper threads: 3.9 seconds
Without them: 4.15 seconds.
Still no big difference!
> no big difference!

I get a fairly significant difference till the number of threads reaches hardware_concurrency.
(Note: quite similar, but not identical code).

#include <iostream>
#include <cstdint>
#include <future>
#include <chrono>
#include <vector>
#include <cassert>
#include <thread>

std::uintmax_t accumulate_range( std::uint64_t start, std::uint64_t end ) {

	volatile std::uintmax_t sum = 0; // volatile: inhibit a clever optimiser from optimising the loop away
	for( ; start < end ; ++start ) sum = sum + start ;
	return sum ;
}

void test( unsigned int nthreads ) {

    static const unsigned long long N = 2*3*5*7*2'000'000LL ;

    const unsigned long long N_PER_THREAD = N / nthreads ;

    std::vector< std::future<std::uintmax_t> > futures ;

    const auto start = std::chrono::steady_clock::now() ;
    unsigned int first_val = 1 ;
    for( unsigned int i = 0 ; i < nthreads ; ++i ) {

        futures.push_back( std::async( std::launch::async, accumulate_range, first_val, first_val+N_PER_THREAD ) ) ;
        first_val += N_PER_THREAD ; // move on to the start of the next sub-range
    }

    std::uintmax_t result = 0 ;
    for( auto& fut : futures ) result += fut.get() ;

    const auto end = std::chrono::steady_clock::now() ;

    assert( result == N * (N+1) / 2 ) ; // sanity check
    using namespace std::chrono ;
    std::cout << "#threads: " << nthreads << " elapsed: " << duration_cast<milliseconds>(end-start).count() << " millisecs\n" ;
}

int main() {

    std::cout << "hardware_concurrency: " << std::thread::hardware_concurrency() << '\n' ;
    for( unsigned int nthreads = 1 ; nthreads < 10 ; ++nthreads ) test(nthreads) ;
}


On coliru:
hardware_concurrency: 4
#threads: 1 elapsed: 1141 millisecs
#threads: 2 elapsed: 602 millisecs
#threads: 3 elapsed: 483 millisecs
#threads: 4 elapsed: 372 millisecs
#threads: 5 elapsed: 402 millisecs
#threads: 6 elapsed: 310 millisecs
#threads: 7 elapsed: 312 millisecs
#threads: 8 elapsed: 318 millisecs
#threads: 9 elapsed: 305 millisecs

https://coliru.stacked-crooked.com/a/8133fe731c85e99b

From the reported values, it would appear that the implementation uses a thread pool.
> With helper threads: 3.9 seconds
> Without them: 4.15 seconds.
> Still no big difference!
TBH, without showing your latest code, it's not that informative.

With CPU-bound threads, increasing the number of threads beyond the number of physical cores will not further increase performance and can even degrade it! Note that there is a performance cost in actually creating and setting up a thread. That's one reason why moving from single- to multi-threaded code doesn't always give the expected performance improvement, and in some cases can make things worse! It's also why using a multi-threaded execution policy with the policy-aware standard algorithms doesn't necessarily improve performance.
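For example (a minimal sketch of a policy-aware algorithm, C++17; whether std::execution::par actually runs in parallel, and whether it helps, is up to the implementation -- with g++ this typically needs linking against TBB, i.e. -ltbb):

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>
#include <cstdint>

int main() {
	std::vector<std::uint64_t> v(10'000'000);
	std::iota(v.begin(), v.end(), std::uint64_t{0});

	// sequential
	const auto s1 = std::reduce(std::execution::seq, v.begin(), v.end(), std::uint64_t{0});
	// parallel execution policy: permitted, but not required, to use multiple threads
	const auto s2 = std::reduce(std::execution::par, v.begin(), v.end(), std::uint64_t{0});

	std::cout << (s1 == s2) << '\n';	// 1
}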
ooo, fun..

hardware_concurrency: 20
#threads: 1 elapsed: 484 millisecs
#threads: 2 elapsed: 242 millisecs
#threads: 3 elapsed: 162 millisecs
#threads: 4 elapsed: 121 millisecs
#threads: 5 elapsed: 97 millisecs
#threads: 6 elapsed: 86 millisecs
#threads: 7 elapsed: 73 millisecs
#threads: 8 elapsed: 64 millisecs
#threads: 9 elapsed: 59 millisecs
#threads: 10 elapsed: 52 millisecs
#threads: 11 elapsed: 53 millisecs
#threads: 12 elapsed: 49 millisecs
#threads: 13 elapsed: 46 millisecs
#threads: 14 elapsed: 43 millisecs
#threads: 15 elapsed: 40 millisecs
#threads: 16 elapsed: 37 millisecs
#threads: 17 elapsed: 35 millisecs
#threads: 18 elapsed: 33 millisecs
#threads: 19 elapsed: 32 millisecs
#threads: 20 elapsed: 48 millisecs
#threads: 21 elapsed: 50 millisecs
#threads: 22 elapsed: 48 millisecs
#threads: 23 elapsed: 50 millisecs
#threads: 24 elapsed: 48 millisecs

your needs and whatnot vary, but after an analysis like the above, I would cut the threads off where the linear progression stops -- that is, where 484/#threads is no longer close to the measured time -- around 7 threads, and no more than 10, is where I would stop this one. You should also always leave 1 thread on the computer free, unless it's a special-purpose box and the only thing you care about is THIS program. Note that 12 threads is about 10x faster, not too far off linear.
@Salem C
> TBH, without showing your latest code, it's not that informative.
Here you are. You can comment out the threads part (the thread-creation loop and the join loop) and uncomment the sequential for-loop below them to see the difference in their performance times.

#include <iostream>
#include <thread>
#include <vector>
#include <numeric>
#include <chrono>    // for the timing below
#include <cstdint>   // uint64_t, uint16_t
#include <cstdlib>   // system("pause")

void AccumulateRange(uint64_t& sum, uint64_t start, uint64_t end) {
	sum = 0;
	for (uint64_t i = start; i < end; ++i)
		sum += i;
}

void printVector(const std::vector<uint64_t>&  vec) {
	for (auto v : vec)
		std::cout << v << ' ';
	std::cout << '\n';
}

int main() {

	const uint64_t number_of_elements{ 1000000000 }; // one billion elements
	const uint16_t number_of_threads{ 10 };
	const uint64_t step{ number_of_elements / number_of_threads }; // uint64_t avoids a narrowing conversion in the braced init
	std::vector<uint64_t> partial_sums(number_of_threads);

	// record start time
	auto start = std::chrono::high_resolution_clock::now();

	std::vector<std::thread> threads;
	for (int i = 0; i < number_of_threads; ++i)
		threads.push_back(std::thread(AccumulateRange, std::ref(partial_sums[i]),
			i * step, (i + 1) * step));

	for (auto& t : threads)
		if (t.joinable())
			t.join();    

//	for (int i = 0; i < 10; ++i)
//		AccumulateRange(partial_sums[i], i * step, (i + 1) * step);

	uint64_t total = std::accumulate(partial_sums.begin(), partial_sums.end(), uint64_t(0));

	// record end time (before any output, so no I/O is included in the timing)
	auto end = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double> diff = end - start;

//	printVector(partial_sums);
	std::cout << "Total = " << total << '\n';
	std::cout << diff.count() << " s\n";

	system("pause");
	return 0;
}


@Seeplus:
1) I completely agree with you; what you mentioned is exactly what I was thinking of when comparing the execution of the program with and without those helper threads!
So in what kinds of programs do you reach for multi-threading, please?
2) And do you choose the number of threads, when using multi-threading, based on the number of actual cores? That is, since my system has two physical cores, it's better to add only one extra thread besides main() when performance matters. Right?


@jonnin
I'm not sure I completely got what you meant, but I'd be thankful if you used my code to demonstrate the performance enhancement from multiple threads, if you see it differently from my thinking above. Unfortunately I can't understand JLBorges's code; it's too complex for me.
Unfortunately I can't understand JLBorges's code; it's too complex for me.
- I consider JL to be one of our best coders. It is well worth your time to go through it slowly and try to understand it; look up what you don't know, and ask if you still don't get it. You very likely WILL learn something from his posts.

- All you needed to get from my ramble was that you need to find the point of diminishing returns. Ignoring thread overhead, a typical routine runs twice as fast on 2 processors as on 1 when split evenly, and 3 times faster on 3 processors. This is what I mean by linear returns from threading. After doing a study like the one above that runs the code over different numbers of threads, find the LEAST number of threads you need to get a reasonable speedup. As you can see, on my 20-core processor, the difference between 10 threads and 24 is pretty much nothing gained for throwing more threads on the pile. The difference between 7 and 10 threads is tangible but small. So, just common sense: pick a value between 7 and 10 that you feel will do your code 'well enough' and run that many, and no more. Also, if the CPU does not support that many -- say it's an old 4-core machine -- then you run 3 and leave one free. Always leave a core free for the OS to do its background garbage... there is always something eating a CPU in a modern OS, unless you are on an embedded computer with a minimal OS or an R&D box.
so you need to know how many threads your computer supports, which the test code detects, and run one less than that, up to N, where N is found from the testing results; in our case N is 7 to 10. This isn't perfect, as hardware varies, but it's about as good as I know how to do it without running a test program at install time to find the best values for that specific machine (and even then, that varies with how many other programs the machine is running).
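in code, that rule of thumb might look something like this (just a sketch; max_useful is whatever your timing study found -- 7 to 10 here):

#include <algorithm>
#include <iostream>
#include <thread>

// pick a worker count: no more than the measured point of diminishing
// returns, and leave one hardware thread free for the OS
unsigned pick_thread_count(unsigned max_useful) {
	const unsigned hw = std::thread::hardware_concurrency();	// may be 0 if unknown
	if (hw <= 1) return 1;
	return std::min(max_useful, hw - 1);
}

int main() {
	std::cout << pick_thread_count(10) << '\n';	// e.g. 10, capped at cores-1
}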

also note that this is for O(n) problems. this matters ;) an N·log N sort actually does fewer operations split than on one CPU... sorting 1 million items and reassembling them on 4 CPUs, for example, turns about 20 million operations (n = 1M, log2(1M) ≈ 20) into about 19M (4 × 250k × log2(250k) (≈18), plus N to reassemble the 4 parts back into 1)... it's pretty much the same concept as above, but the actual gains from threading vary depending on the algorithm, the complexity of any reassembly, and so on. It doesn't matter too much, since you do the same kind of analysis to choose the number of threads to split into: choose the point where the diminishing returns of more threads aren't worthwhile.
@frek

Just be careful with high_resolution_clock; notice how JLBorges used steady_clock? steady_clock is the best for timing things.

https://en.cppreference.com/w/cpp/chrono/steady_clock

It is quite common for people to wrongly think high_resolution_clock is better.
This is an option:

#include <chrono>
#include <type_traits>

// the highest resolution clock that is suitable for measuring time intervals
//     the high resolution clock if it is monotonic, otherwise the steady clock
using clock_type = std::conditional< std::chrono::high_resolution_clock::is_steady,
                                     std::chrono::high_resolution_clock,
                                     std::chrono::steady_clock >::type ;
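Usage is then the same as with any other chrono clock, e.g. (a small self-contained sketch, repeating the alias so it compiles on its own):

#include <chrono>
#include <iostream>
#include <type_traits>

using clock_type = std::conditional< std::chrono::high_resolution_clock::is_steady,
                                     std::chrono::high_resolution_clock,
                                     std::chrono::steady_clock >::type ;

int main() {
	const auto t0 = clock_type::now();
	volatile long long sum = 0;	// stand-in work; volatile keeps the loop alive
	for (long long i = 0; i < 10'000'000; ++i) sum = sum + i;
	const auto t1 = clock_type::now();
	std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
	          << " millisecs\n";
}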


Another option is OpenMP. Compile with:
g++ -fopenmp test.cpp
or
cl /openmp /EHsc test.cpp

https://coliru.stacked-crooked.com/a/6e425ebe3a042414

(TBH, it makes it miles faster if you simply turn the -O3 optimisation on, though.)

#include <iostream>
#include <chrono>
#include <cassert>
#include <thread>
#include "omp.h"                                           // OpenMP
using namespace std;
using namespace std::chrono;


void test( unsigned int nthreads )
{
   const unsigned long long N = 2*3*5*7*2000000ull;

   auto start = steady_clock::now();

   unsigned long long result = 0 ;
   omp_set_num_threads( nthreads );                        // OpenMP
   #pragma omp parallel for reduction( + : result )        // OpenMP
   for ( long long i = 1; i <= N; i++ ) result += i;

   auto end = steady_clock::now() ;

   assert( result == N * ( N + 1 ) / 2 );                  // sanity check
   cout << "#threads: " << nthreads << " elapsed: " << duration_cast<milliseconds>( end - start ).count() << " millisecs\n" ;
}

int main()
{
   cout << "hardware_concurrency: " << std::thread::hardware_concurrency() << '\n' ;
   for ( unsigned int nthreads = 1; nthreads < 10; nthreads++ ) test( nthreads );
}



On my (work) PC - not Coliru:
hardware_concurrency: 8
#threads: 1 elapsed: 1093 millisecs
#threads: 2 elapsed: 580 millisecs
#threads: 3 elapsed: 404 millisecs
#threads: 4 elapsed: 309 millisecs
#threads: 5 elapsed: 229 millisecs
#threads: 6 elapsed: 201 millisecs
#threads: 7 elapsed: 165 millisecs
#threads: 8 elapsed: 144 millisecs
#threads: 9 elapsed: 155 millisecs



On Coliru:
hardware_concurrency: 4
#threads: 1 elapsed: 304 millisecs
#threads: 2 elapsed: 153 millisecs
#threads: 3 elapsed: 108 millisecs
#threads: 4 elapsed: 83 millisecs
#threads: 5 elapsed: 84 millisecs
#threads: 6 elapsed: 83 millisecs
#threads: 7 elapsed: 81 millisecs
#threads: 8 elapsed: 75 millisecs
#threads: 9 elapsed: 74 millisecs

When your threads all use the same areas of memory, that memory becomes the bottleneck.

Running your latest code on my system:
$ ./foo
Total = 499999999500000000
2.10132 s


Changing AccumulateRange so each loop iteration of each thread doesn't write to adjacent items in the partial_sums vector:
void AccumulateRange(uint64_t& sum, uint64_t start, uint64_t end) {
        uint64_t localSum = 0;          // accumulate in a thread-local variable
        for (uint64_t i = start; i < end; ++i)
                localSum += i;
        sum = localSum;                 // one write through the reference, at the end
}

$ ./foo
Total = 499999999500000000
1.09507 s

> When your threads all use the same areas of memory, that memory becomes the bottleneck.

Shared data (within the same cache line) that is only read (but not updated) by multiple threads does not cause false sharing.
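If the slots are updated (not just read), one way to avoid false sharing is to give each thread's slot a whole cache line to itself (a sketch, assuming 64-byte cache lines; the over-aligned vector element relies on C++17's aligned allocation):

#include <cstdint>
#include <iostream>
#include <vector>

struct alignas(64) padded_sum {	// one slot per thread, one cache line per slot
	std::uint64_t value = 0;
};

int main() {
	std::vector<padded_sum> partial_sums(10);
	partial_sums[0].value = 42;	// each thread would update only its own slot
	std::cout << sizeof(padded_sum) << '\n';	// 64: padded to a full line
}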
@JLBorges
Please keep this post and the replies you write in simple language.

volatile std::uintmax_t sum = 0; // volatile: inhibit a clever optimiser from optimising the loop away
1) volatile means, hey compiler, try not to consider me as const and step forward to optimize the code, right?

2) Why did you think a good compiler may optimize the for loop code and avoid going through it?

3) Why didn't you use sum += start; instead of sum = sum + start;, please? (Any reason!?)

4) Why did you use static storage duration here, please? static const unsigned long long N = 2*3*5*7*2'000'000LL;. You wanted an object which is initialized once and doesn't change throughout the program's lifetime. So why not just const? Why not constexpr or even static constexpr?

5) In first_val += N_PER_THREAD;, why did you assume unsigned int can hold the data of an unsigned long long, please? If they can both hold the same range of values, why did you define N as unsigned long long?

6) Why did you declare async this way: ... std::async( std::launch::async, .... I mean why using std::launch::async?

These are my questions about your code down to the std::async calls. Once the above questions are properly sorted out so that I understand them well, I'll continue with further questions through to the end of your code.
Thanks for your time.
> volatile means, hey compiler, try not to consider me as const and step forward to optimize the code, right?

Reads and writes to a volatile object are deemed to be part of the 'observable behaviour' of the program.
So, sum = sum + start; would involve updating (reading and writing) the value of sum in each iteration of the loop.


> Why did you think a good compiler may optimize the for loop code and avoid going through it?

I've observed similar optimisations; optimisers are capable of doing quite clever things.

int foo()
{
    int sum = 0  ;
    for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
    return sum ;

    // the optimiser knows that the sum of the series 0+1+2+..+999 is 499500    
    // optimise the loop (and sum) away; apply the as-if rule and rewrite the function as    
    // return 499500 ;  
}

https://gcc.godbolt.org/z/GT3r6aPsz


> Why didn't you use sum += start; instead of sum = sum + start;

sum is a volatile object; sum += start; (compound assignment of a volatile object) is deprecated.
http://coliru.stacked-crooked.com/a/7750d87ba4d268d4


> So why not only const? Why not constexpr or even static constexpr?

It could have been any of those; it wouldn't have made a tangible difference.


> In first_val += N_PER_THREAD;, why did you assume unsigned int can hold the data of an unsigned long long

'Usual arithmetic conversions': https://en.cppreference.com/w/cpp/language/operator_arithmetic#Conversions


> why using std::launch::async?

We want the task to be executed on a different thread (different from the calling thread).
See: https://en.cppreference.com/w/cpp/thread/launch
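The difference can be observed directly (a small sketch: a deferred task runs lazily, on whichever thread calls get()):

#include <future>
#include <iostream>
#include <thread>

int main() {
	const auto main_id = std::this_thread::get_id();

	auto a = std::async(std::launch::async,    [] { return std::this_thread::get_id(); });
	auto d = std::async(std::launch::deferred, [] { return std::this_thread::get_id(); });

	std::cout << "async ran on the calling thread? " << (a.get() == main_id) << '\n';	// 0
	std::cout << "deferred ran on the calling thread? " << (d.get() == main_id) << '\n';	// 1
}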
frek wrote:
1) volatile means, hey compiler, try not to consider me as const and step forward to optimize the code, right?


No, the opposite of that. Here is the reference, emphasis mine:

https://en.cppreference.com/w/cpp/language/cv

cppreference wrote:
Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access).


Not sure what you mean by "not consider me as const". const is const, unless one casts it away with a const_cast (which is strongly discouraged) or the variable is mutable. But neither of those affects volatile. const volatile is allowed, and is non-mutable.

volatile is often used because we want to see the visible side effect.
> I've observed similar optimisations; optimisers are capable of doing quite clever things.
OK, in the example you wrote, although the compiler was clever and optimised the for loop into returning the value 499500, that's harmless, since sum is a local variable: whether its value was actually updated or is still 0 can't be observed from outside, right? I guess it'd be undesirable in situations where we also need the variable's value apart from the return value, like:
#include <iostream>

int foo(int& sum)
{
    sum = 0 ;
    for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
    return sum ;
}

int main() {
    int sum {0};
    std::cout << foo(sum) << ' ';
    // do something with sum
}

Agree?

> 'Usual arithmetic conversions'.
So in this case, you meant there's no need to declare first_val as unsigned long long: if the value it gets is small enough (say, when nthreads == 10), it fits as is; otherwise, if the value is big (say, when nthreads == 9), the usual conversion is applied and the type of first_val becomes unsigned long long. Right?

> We want the task to be executed on a different thread (different from the calling thread).
You mean we're doing it wrong in the example below, and not causing the task to be executed on a different thread?
https://coliru.stacked-crooked.com/a/590c559f0a221b12

@TheIdeasMan:
I mean that volatile is not const: it may constantly change, and we want the compiler to know this.
Last edited on
frek wrote:
I mean that volatile is not const: it may constantly change, and we want the compiler to know this.


But the point is: not to optimise the code, which is the opposite of your original statement.

Edit:

Variables may be const volatile; this is called cv-qualified. So volatile does not mean non-const.

Perhaps it is an interesting choice of word for a keyword, given its English meaning seems to be the opposite of its meaning in C++.
https://www.dictionary.com/browse/volatile

@JLBorges

noted your statement about strength reduction versus optimised away completely - cheers :+)
> I guess it'd be undesirable in situations where we also need the variable's value apart from the return value, like:

The loop can still be optimised away.
int& foo( int& sum )
{
    sum = 0  ;
    for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
    return sum ;

    // the optimiser knows that the sum of the series 0+1+2+..+999 is 499500    
    // optimise the loop (and sum) away; apply the as-if rule and rewrite the function as    
    // return sum = 499500 ; 
}

https://gcc.godbolt.org/z/arWso3jjq


> and are not calling the task to be executed on a different thread?

It is left to the implementation.
The policy defaults to std::launch::async | std::launch::deferred (i.e. both flags are set).
If more than one flag is set, it is implementation-defined which policy is selected. For the default (both flags set), the standard recommends (but doesn't require) utilizing available concurrency and deferring any additional tasks.
https://en.cppreference.com/w/cpp/thread/async



> ... If the value it gets is small enough ...

Usual arithmetic conversions are always applied;
these conversions are based on the types of the operands, not their values.
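This can be checked at compile time (a small sketch):

#include <type_traits>

int main() {
    unsigned int a = 1 ;
    unsigned long long b = 2 ;
    // the type of a+b depends only on the operand types, never on their values
    static_assert( std::is_same< decltype(a+b), unsigned long long >::value,
                   "a is converted to unsigned long long before the addition" ) ;
}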


> But the point is: not to optimise the code

Reads from and writes to volatile objects can't be optimised away.
Other optimisations may be applied to operations on volatile objects. For example, strength reduction:
void foo(volatile unsigned  int& sum )
{
    sum = sum / 16  ; 
    //optimise: strength reduction; replace division by a bitwise shift
    // sum = sum >> 4U ;
}

https://gcc.godbolt.org/z/e7e7x4K8c