I've been doing some multi-threading in my program. This program essentially iterates over a dataset and for each line computes some stuff.
My multi-threading function basically gives each thread a different offset for the starting index of the dataset iteration, and then increments this index by the number of threads. For example, let's say I have 4 cores; each core will then handle the following lines of the dataset (a quick code sketch follows the listing):
th1: 0 4 8
th2: 1 5 9
th3: 2 6 10
th4: 3 7 11
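In code, the pattern looks roughly like this (a minimal sketch; process_line and the dataset contents are placeholders for my actual data and per-line computation):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    // stand-in for the real per-line computation
    std::size_t process_line(const std::string& line) { return line.size(); }

    void worker(const std::vector<std::string>& dataset,
                std::size_t offset, std::size_t stride)
    {
        // each thread starts at its own offset and jumps ahead by the thread count
        for (std::size_t i = offset; i < dataset.size(); i += stride)
            process_line(dataset[i]);
    }

    int main()
    {
        std::vector<std::string> dataset(12, "example line");
        const std::size_t n_threads = 4; // e.g. 4 cores -> th1..th4
        std::vector<std::thread> threads;
        for (std::size_t t = 0; t < n_threads; ++t)
            threads.emplace_back(worker, std::cref(dataset), t, n_threads);
        for (auto& th : threads)
            th.join();
    }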
This threading works: when changing from 1 to 2 threads, my program's runtime is almost halved. However, when setting 3, 4, or more cores, the performance is still the same as with 2 cores.
My machine has 4 cores, 8 logical cores, according to task manager.
I'm using std::thread::hardware_concurrency() to get the number of cores, and it returns 8 for my program.
I checked how many lines are computed by each thread, and the amount is correct (correctly split).
I am using std::async for my threading, and the .get() is not in the loop where I instantiate those asyncs.
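For reference, my launch pattern is roughly the following (simplified; worker stands in for my real per-thread function):

    #include <cstddef>
    #include <future>
    #include <vector>

    void worker(std::size_t /*offset*/, std::size_t /*stride*/)
    {
        // per-thread work goes here
    }

    int main()
    {
        const std::size_t n_threads = 4;
        std::vector<std::future<void>> futures;
        // launch everything first...
        for (std::size_t t = 0; t < n_threads; ++t)
            futures.push_back(std::async(std::launch::async, worker, t, n_threads));
        // ...and only wait afterwards, so the tasks actually overlap
        for (auto& f : futures)
            f.get();
    }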
Multi-threading is tricky! Just using more threads doesn't necessarily speed things up 😏
For example, often the threads need to synchronize, e.g. to access a "shared" data structure that can only be accessed by one thread at a time. The more threads, the more synchronization overhead! Consequently, at some point, the threads will spend more time on synchronization (e.g. waiting to acquire the lock) than doing their actual work. So more threads isn't always better. In some cases this can be worked around by using "lock-free" algorithms or "atomic" operations instead of locks.
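For example, handing out work items via a shared index can be done with an atomic instead of a mutex. A minimal sketch (the numbers are made up):

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    int main()
    {
        std::atomic<std::size_t> next{0};  // shared index, no lock required
        const std::size_t total = 100000;
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([&next, total] {
                // fetch_add() hands each index to exactly one thread
                for (std::size_t i = next.fetch_add(1); i < total;
                     i = next.fetch_add(1))
                {
                    // ...process item i...
                }
            });
        for (auto& th : threads)
            th.join();
    }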
Do you use any kind of "shared" data structure in your code? For example, where do the threads get the "lines" from?
It is also very possible that you have reached the point where the throughput (overall runtime) isn't limited by computations, but by memory bandwidth! Once memory bandwidth becomes the limiting factor, adding more threads won't help and can even be detrimental. Sometimes a better memory access pattern can improve performance (by making better use of the CPU caches), but I think your strategy of making each thread process an "interleaved" subset of the data, rather than subdividing the data into n contiguous blocks, makes sense.
(Also, you must be very careful about how you "measure" performance, in order not to fool yourself with incorrect/misleading measurements.)
How long does each thread run for? If it's very short it might not be worth using extra threads because it takes time creating them.
kigar64551 wrote:
Sometimes a better memory access pattern can improve performance (by making better use of the CPU caches), but I think your strategy of making each thread process an "interleaved" subset of the data, rather than subdividing the data into n contiguous blocks, makes sense.
I don't know how big each element in the dataset is but if they're small, and they're being modified by the threads, then you might suffer from "false sharing".
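i.e. if each thread writes its result into adjacent slots of a shared array, the writes keep invalidating the same cache line on the other cores, even though no data is logically shared. Padding each slot to its own cache line avoids that. A rough sketch (64 bytes is a typical cache-line size; needs C++17 for the over-aligned allocation):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // each per-thread slot gets its own cache line
    struct alignas(64) PaddedSum {
        std::size_t value = 0;
    };

    int main()
    {
        const std::size_t n_threads = 4;
        std::vector<PaddedSum> sums(n_threads);
        std::vector<std::thread> threads;
        for (std::size_t t = 0; t < n_threads; ++t)
            threads.emplace_back([&sums, t, n_threads] {
                for (std::size_t i = t; i < 100000; i += n_threads)
                    sums[t].value += i; // each thread only touches its own line
            });
        for (auto& th : threads)
            th.join();
    }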
This may be orthogonal to what you are learning, but since C++17 the for_each algorithm has the ability to automatically multi-thread (std::execution::par), and since C++20 to vectorise operations (std::execution::unseq).
IMO, this is much easier. I read somewhere that the first step in multi-threading is to use any facilities available in the STL first, before going on to other, trickier methods.
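For example (a sketch; note that with gcc/clang the parallel policies typically require linking against TBB to actually run in parallel):

    #include <algorithm>
    #include <execution>
    #include <string>
    #include <vector>

    // stand-in for the per-line computation
    std::size_t process_line(const std::string& line) { return line.size(); }

    int main()
    {
        std::vector<std::string> dataset(8000, "example line");
        // std::execution::par lets the library decide how to split the work
        std::for_each(std::execution::par, dataset.begin(), dataset.end(),
                      [](const std::string& line) { process_line(line); });
    }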
I also have to ask: how many data points do you have? Hopefully it's a decent number, like 1'000'000. Also, how complex is the calculation? I would imagine one would want data and operations that take a reasonable amount of time to complete, say 1 or 2 seconds of execution time without the multi-threading.
Further to kigar64551's suggestion about measurements: what are they exactly, and how did you obtain them?
Finally, if you really want to ramp things up and have a Nvidia GPU, then investigate CUDA programming. This will really blow things out of the water, because the GPU will have thousands of cores compared to the handful in the CPU.
Thank you all for your interesting answers. I'm going to reply to each in the same order.
@kigar64551 You raise many points that, even if I don't think they apply here, are worth checking. On my side, I do not use any locks; however, I'm getting the data from a vector. Maybe the vector has a lock on its getters [] and .at().
To "measure" I just have a timer starting at the beginning of my program and finishing at the end, then displaying the final real time. I could indeed try to add one into the threads to get the CPU time.
@Peter87 the threads are handling the longest (timewise) function of my program. I'm pretty sure they are in the right place. Also, having 3 threads instead of 2 should not add so much overhead that it cancels out the time gained by using an extra thread.
Elements in the data structure are quite small (a bunch of short strings, ~150-200 chars in total at most). I do not modify them in the threads.
@TheIdeasMan I think using a for_each in my code will create an excessive number of threads (at least if I run it on the data). Also, I have a quite time-consuming function that I call inside my thread (but before entering the loop, so I do it only once) which cannot be run outside of the threads. Using the for_each would most likely force me to call this function n times, which would really hurt performance.
My data size can vary from 10 to 8,000 lines. For the small datasets I do not care about threading, as they are fast enough, but the 8,000-line one takes much longer, as it contains both more lines and more attributes (which impacts my computation speed because of the algorithm I use).
I think going for a GPU, at least here, is overkill. Also, I'd like to be able to use my 8 cores correctly before running a thousand.
Anyway, thank you all. I'll take a look at the execution time of each thread and double-check the vector library.
So after some extra measurements, it seems that it is now working as expected (nothing unusual).
From 1 to 8 cores used, the time correctly decreases, but not from 8 to 10 (which is normal, since I only have 8 logical cores).
I think the mistake was the following:
For the tests I've just done, I set the shuffle seed to the same value and limited the number of steps to 5, then took the last timestamp as the measurement. I think that before, when I was eyeballing the time, I was in fact letting my program run more iterations (for the same real time) when using more cores. My algorithm's complexity increases over the iterations, so of course, after more iterations, the time for each thread was longer.
For measurement, make sure that you use a "high precision" timer (e.g. QueryPerformanceCounter() on Win32) and that the overall runtime isn't too short. Otherwise, unavoidable random fluctuations may screw up the results...
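In portable C++ that could look like this (std::chrono::steady_clock is the monotonic clock intended for interval measurements):

    #include <chrono>
    #include <iostream>

    int main()
    {
        const auto start = std::chrono::steady_clock::now();
        // ...run the workload here...
        const auto stop = std::chrono::steady_clock::now();
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
        std::cout << "elapsed: " << elapsed.count() << " ms\n";
    }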
Zaap wrote:
I think using a for_each in my code will create an excessive number of threads (at least if I run it on the data).
So why do you think this? And why do you think it's a problem?
Zaap wrote:
Also, I have a quite time-consuming function that I call inside my thread (but before entering the loop, so I do it only once) which cannot be run outside of the threads. Using the for_each would most likely force me to call this function n times, which would really hurt performance.
The for_each is the loop which has the threads inside it, so that doesn't make sense to me. In other words: it depends where one puts the for_each. Also, you used thread in the singular ("my thread") and in the plural ("outside of the threads"), which makes it confusing to figure out what you mean.