Multithreading and shared memory

Hi,

This isn't really a C++ specific question, more low-level programming in general.

In class we were tasked with applying a filter to an image in parallel using threads (C99 using pthread.h).
I would have just loaded the image once in the main thread, allocated as one chunk, and passed pointers to the threads. But going by the lectures, I'm led to believe it's better to do separate allocations for each subsection of the image and pass those instead, then merge when the threads are done. This is supposed to be faster.
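
For reference, something like this is what I had in mind for the single-allocation version. It's a simplified sketch with a placeholder filter (invert), not my actual assignment code, and loading the bitmap is omitted:

/* Sketch of the single-allocation approach: one buffer, each thread
 * gets a disjoint range of rows. The "filter" is just a placeholder. */
#include <pthread.h>

#define NUM_THREADS 4

struct slice {
    unsigned char *pixels;   /* the whole image, one allocation */
    int width;
    int row_start, row_end;  /* this thread handles rows [row_start, row_end) */
};

static void *filter_rows(void *arg)
{
    struct slice *s = arg;
    for (int y = s->row_start; y < s->row_end; ++y)
        for (int x = 0; x < s->width; ++x)
            s->pixels[y * s->width + x] ^= 0xFF;  /* placeholder "filter": invert */
    return NULL;
}

void apply_filter(unsigned char *pixels, int width, int height)
{
    pthread_t tids[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    int rows_per_thread = height / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; ++i) {
        slices[i].pixels    = pixels;
        slices[i].width     = width;
        slices[i].row_start = i * rows_per_thread;
        slices[i].row_end   = (i == NUM_THREADS - 1) ? height
                                                     : (i + 1) * rows_per_thread;
        pthread_create(&tids[i], NULL, filter_rows, &slices[i]);
    }
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
}

The version from the lectures would instead malloc a separate buffer for each thread's rows, memcpy each subsection in, and memcpy the results back after joining.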

My reasoning would be that hardware can't allow multiple CPUs to have the same section of RAM cached, or the caches could go out of sync if any writes are performed. Addresses returned by malloc could be separated by more than the cache size, so this may not be an issue with separate allocations, whereas if the image is one contiguous chunk, the issue is guaranteed to come up.

Does this make sense? Would you perform those optimizations in a real-world scenario?
Almost all performance questions come back to "it depends".

Making a thread is a little slow. Breaking the image into different memory locations and putting it back together later is a little slow. Is the filter so brutally intense that it dwarfs all these other costs, and is the image large enough to justify all this extra work? There are probably other questions to ask, but those are the obvious ones without seeing any code. I guess what I am saying is that splitting stuff up is the way to go when it is justified; all computers now have many CPUs and you should code to that. Just remember that a single thread can rip through a LOT of computations/data very quickly, so you need a modest-sized problem before the overhead is worth it.

The other obvious question is: how long does it really take, and how fast does the user need it? If it's an image editor where you look at one picture and click a filter, half a second of delay is nothing and fancy code probably isn't required. If it's a video game, or a cruncher processing 2 million x-ray images from a database, half a second per image is suddenly a lot less appealing (that's a million seconds... ewww!). And if it's something like the x-ray example, you still come back to: is having 10 CPUs each process one whole image the way to do it, or is each CPU doing 1/10 of an image the way to do it? All of this should pass through your head when you decide what to optimize, how much, and which way, given what you are trying to accomplish.

The bottom line, though, is yes: if you need the speed and the data is suitable to the approach, use it, milk the machine for all it's worth, and knock it out fast.
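
If you want a concrete answer to "how long does it really take," just time both versions on a realistic image and compare. Here is a rough sketch of that kind of harness; the two filter functions are empty placeholders to swap your real single-threaded and pthread versions into:

/* Rough "measure it first" harness. The two filter functions are empty
 * placeholders; substitute your real single-threaded and pthread versions.
 * Build with: cc -std=c99 -O2 -pthread time_filter.c */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <time.h>

#define W 1920
#define H 1080

static unsigned char image[W * H];

static void filter_single(unsigned char *px, int w, int h)
{
    (void)px; (void)w; (void)h;   /* placeholder: plain loop over the image */
}

static void filter_threaded(unsigned char *px, int w, int h)
{
    (void)px; (void)w; (void)h;   /* placeholder: pthread version */
}

static double time_it(void (*f)(unsigned char *, int, int))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f(image, W, H);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("single threaded: %.4fs\n", time_it(filter_single));
    printf("multi threaded:  %.4fs\n", time_it(filter_threaded));
    return 0;
}

If the single-threaded number is already well under anything the user would notice, stop there.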
You just described a problem known as false sharing, but overestimated its impact.

My reasoning would be that hardware can't allow multiple CPUs to have the same section of RAM cached, or the caches could go out of sync if any writes are performed.

Hardware designers have accounted for this in a way that doesn't involve preventing data from being shared between multiple processors. Instead, cache memory is divided into "blocks" or "lines" (synonymous) that are generally small (64 bytes is typical). The hardware keeps cache blocks in sync (or "coherent") by means of a cache coherence protocol:
https://en.wikipedia.org/wiki/Cache_coherence
Essentially, if one core modifies a block of memory that is in the cache of other cores, the non-local copies are invalidated. This invalidation works on individual blocks, rather than the entire cache or swathes of it at once.

It is still possible to encounter problems due to the behavior you describe: contrast the unavoidable "true sharing" with the fundamentally unnecessary "false sharing":
https://en.wikipedia.org/wiki/False_sharing
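
If you want to see false sharing in isolation, here is a contrived sketch, unrelated to your image code: four threads each increment their own counter, first with the counters packed next to each other (so they likely share one cache line), then with each counter padded out to a line of its own. The 64-byte line size is an assumption; check your platform.

/* Contrived false-sharing illustration, assuming a 64-byte cache line.
 * Each thread hammers its own counter: "packed" counters sit next to
 * each other, "padded" counters are spaced one line apart.
 * Build with: cc -std=c99 -O2 -pthread false_sharing.c */
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 4
#define ITERATIONS  50000000UL
#define CACHE_LINE  64

static volatile unsigned long packed[NUM_THREADS];

static struct padded_counter {
    volatile unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];  /* spacer; aligning the
        array to the line size would make the demo even stricter */
} padded[NUM_THREADS];

static void *bump_packed(void *arg)
{
    int id = *(int *)arg;
    for (unsigned long i = 0; i < ITERATIONS; ++i)
        ++packed[id];            /* neighbors' writes keep invalidating this line */
    return NULL;
}

static void *bump_padded(void *arg)
{
    int id = *(int *)arg;
    for (unsigned long i = 0; i < ITERATIONS; ++i)
        ++padded[id].value;      /* each counter lives on its own line */
    return NULL;
}

static double run(void *(*worker)(void *))
{
    pthread_t tids[NUM_THREADS];
    int ids[NUM_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; ++i) {
        ids[i] = i;
        pthread_create(&tids[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("packed (false sharing): %.2fs\n", run(bump_packed));
    printf("padded (no sharing):    %.2fs\n", run(bump_padded));
    return 0;
}

Both variants do exactly the same amount of work per thread; on typical desktop hardware the padded run finishes several times faster, which is the whole effect in a nutshell.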

There's way too much background to cover in a forum post, but if you find this interesting, you could read Ulrich Drepper's somewhat dated paper What Every Programmer Should Know About Memory
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
Or pick up a copy of Hennessy and Patterson's Computer Architecture: A Quantitative Approach
https://www.amazon.com/Computer-Architecture-Quantitative-Approach-Kaufmann/dp/0128119055
And read Chapter 2.

Does this make sense? Would you perform those optimizations in a real-world scenario?

It seems unlikely that the changes you propose would make a significant impact, but it depends on the scenario. The goal is to deliver the highest-quality product possible within the budget and specification. Just evaluate every technical decision against this bottom line.
Thank you very much. This was a far better answer than my professor would have given.

Some of the details in that pdf went over my head, but I am filing it away for future reference.

The filter was a pretty cheap operation applied to a single bitmap, so it ended up being overkill, as you suspected. The memcpy and thread overhead alone probably cost more performance than I would ever have saved by avoiding a few cache-block reloads.