shared memory

I have been working with MPI for a while, and I have some questions about what is really happening behind the scenes. I know that in a shared-memory architecture a segment of memory is accessible by more than one processor, but how large is that segment, and where and when is it specified during programming? Imagine using shared memory on a system with 8 cores and 16 GB of RAM. Does that mean that when using MPI and running with 8 processes, each process gets 1/8 of all the memory, and each can only share its own 1/8? I don't have a clear picture of what each processor's memory is. I would be really grateful if someone could enlighten me a bit.
Each processor separately gets the amount of memory it requires. There is no reason for these shares to be equal in size if you are using allocated memory with different allocations on each processor. Only if the memory requirements of each processor happen to be the same will you match sizes.

The whole point about MPI is that it is distributed memory, not shared. It might not even be on the same machine!
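For illustration, here is a rough sketch (the sizes are made up, and it assumes nothing beyond a standard MPI installation) of each process allocating only what it needs, in memory that is private to that process:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each process allocates only what it needs; the sizes need not match.
    // (The sizes here are arbitrary, purely for illustration.)
    std::vector<double> work((rank + 1) * 1000);

    // This memory is private to this process. Another rank can only see it
    // if we explicitly send it with an MPI call.
    std::printf("Rank %d allocated %zu doubles of its own\n", rank, work.size());

    MPI_Finalize();
}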
"The whole point about MPI is that it is distributed memory, not shared. It might not even be on the same machine."


So if there is not enough memory for one process on the current machine, will that process get its required memory from the memory of another machine it has been parallelized with?
"if there is not enough memory for one process on the current machine, will that process get its required memory from the memory of another machine"


No, the process gets its memory from the same machine that it is running on.

If there isn't enough memory then you are stuffed (or, at least, the paging to hard disk becomes frightening).
I know my question might seem a bit silly, but when we run code in a serial version, do we actually use all cores or just one core? And how does using distributed memory decrease the workload on the CPU?
If you run your code in serial you just use one core.

MPI may share out the work amongst what would otherwise be idle processors ("many hands make light work"), so making the wall-clock time shorter, but it certainly doesn't reduce the total workload or memory summed over all processors.
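As a rough sketch of what "sharing out the work" means (not real production code; the sum is just a stand-in for real work): each rank takes a slice of the loop and the partial results are combined at the end. The total work is unchanged; only the wall-clock time shrinks.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // Share N iterations between the processes:
    // each rank handles roughly N/nproc of them.
    const int N = 1000000;
    double partial = 0.0;
    for (int i = rank; i < N; i += nproc) partial += 1.0 / (i + 1);

    // The total work (and memory) summed over all ranks is the same as in
    // serial; the iterations simply run concurrently.
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("Sum = %f\n", total);

    MPI_Finalize();
}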
Actually, I ran into this issue in my code: I was expecting to reduce the workload and runtime by using MPI, but that is not the case. It seems to me that there should be some criteria for when using MPI can be helpful, otherwise using MPI may be pointless, but I have no idea what those criteria might be.
Sorry if my question is going beyond the scope. Is it a good idea to use CUDA with MPI as a combination? I have worked with CUDA for a short time, and I have been working with MPI for a while, and now I'm thinking about using both MPI and CUDA together. I searched to find out which parallelization scheme is better, but there was no consensus. Is there any limitation or shortcoming of one of them that is not the case for the other?
You can reduce the wall-clock time by using MPI, because you have different processors sharing the work.

The best jobs for MPI are those which require very little communication between processors (because that is inherently slow). The analogy that I like to give is that dividing up the work can certainly reduce the time taken ... unless you have a lot of meetings, which (personal opinion) are inherently time-wasting! You also need to balance the load as much as possible (because your overall runtime is governed by that of the slowest processor).

I use MPI a lot (for CFD, with domain decomposition to split up the work). I have never used CUDA, so couldn't comment.
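For a flavour of what domain decomposition looks like (a much-simplified sketch, not real CFD code): each rank owns a slab of a 1-D array and swaps a single "halo" cell with each neighbour. That halo swap is the only communication needed per iteration, which is why the scheme scales well.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // Each rank owns a slab of the domain plus one halo cell at each end.
    const int nlocal = 100;                  // cells owned by this rank (illustrative)
    std::vector<double> u(nlocal + 2, rank); // u[0] and u[nlocal+1] are halos

    int left  = (rank == 0)         ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nproc - 1) ? MPI_PROC_NULL : rank + 1;

    // Swap boundary values with the neighbours.
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[nlocal], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
}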

You can combine MPI (distributed memory) with OpenMP (threads sharing memory within a single node), but the effectiveness of that is very architecture-dependent.
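A bare-bones sketch of that hybrid approach (assuming an MPI library built with thread support and an OpenMP-capable compiler):

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv)
{
    // Request thread support so MPI and OpenMP can coexist safely.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // MPI distributes the work between machines; OpenMP threads share
    // the memory within each process.
    #pragma omp parallel
    {
        std::printf("Rank %d, thread %d of %d\n",
                    rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
}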
Thanks for your good answers as always.
MPI starts multiple processes. Each process has its own core(s) and memory, just like you can run a browser and a compiler -- two different processes -- simultaneously.

When MPI was written, a typical machine had exactly one core and one CPU, and each process of an "MPI application" ran on a different machine. MPI -- the Message Passing Interface -- is a library that allows those processes to communicate.

Let's say that your MPI program is started with four processes. Each has its own memory. Each can have an array. Then they can all run an MPI function that copies data from the array of the first process to the arrays of the other processes. That copy might be faster when the processes are on the same machine than when they are on different machines.
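That copy is, for instance, what MPI_Bcast does. A minimal sketch:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every process has its own copy of this array in its own memory.
    double data[4] = {0, 0, 0, 0};
    if (rank == 0) { data[0] = 1; data[1] = 2; data[2] = 3; data[3] = 4; }

    // Copy the array of process 0 into the arrays of all other processes.
    MPI_Bcast(data, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    std::printf("Rank %d now has data[0] = %f\n", rank, data[0]);

    MPI_Finalize();
}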


There is auto-vectorization. Totally unrelated to MPI. The compiler uses the MMX/SSE/AVX instructions rather than their scalar versions. The process still uses only one core, but can do some operations somewhat in parallel -- as much as the streaming instruction sets allow.
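For example, a compiler may vectorize a loop like the one below (whether it actually does depends on the compiler, on optimisation flags such as -O2/-O3, and on the target CPU):

#include <cstddef>

// With optimisation enabled, the compiler can often vectorise this loop,
// adding several doubles per instruction instead of one at a time.
void add(const double* a, const double* b, double* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}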


There is threading. A process creates separate threads. They all see the memory of the process. They can be executed on different cores, and hence in parallel.
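A small sketch with std::thread: both threads read and write the same vector, because they share the process's memory, and each can run on its own core.

#include <thread>
#include <vector>
#include <cstddef>

int main()
{
    // All threads see the same memory of the process.
    std::vector<double> data(1000, 0.0);

    auto worker = [&data](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] += 1.0;
    };

    // Two threads, each working on its own half of the vector; they may
    // run on different cores, hence in parallel.
    std::thread t1(worker, 0, data.size() / 2);
    std::thread t2(worker, data.size() / 2, data.size());
    t1.join();
    t2.join();
}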


You can write an application that uses auto-vectorization, threading, and MPI: one process per machine uses many cores, each core is used as efficiently as possible, and the workload is divided among multiple machines that communicate with MPI.

Whether you use threading and/or MPI, there is always some overhead from communication/synchronization of the workers. Therefore, a parallel application always does more work / consumes more resources than a sequential one.

(You can add GPGPU to your app for extra credit.)
@keskiverto thanks for your explanation.
I remember a long time ago, when doing my first CFD project (solving the Navier-Stokes equations) on a machine with an Intel Core i3 and 2 GB of RAM, I got an "out of memory" error. Now I understand that one of the reasons for using MPI is to avoid this kind of problem by using the memory of different machines.
@keskiverto In the case of GPGPU, will the device memory of the different machines be accessible across all the machines working in parallel? For example, core A on machine A needs information computed on the GPU of machine B. How would we handle that? Should we copy the information from the GPU of machine B to the CPU of machine B and then communicate it to machine A?
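Something like this staging pattern is what I have in mind (just a sketch, assuming the plain CUDA runtime API rather than a CUDA-aware MPI, and with the kernel itself left out):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000;
    std::vector<double> host(n, 0.0);

    if (rank == 1)                     // "machine B": the result lives on its GPU
    {
        double* dev = nullptr;
        cudaMalloc((void**)&dev, n * sizeof(double));
        // ... kernel fills dev ...
        // Stage the result through host memory, then send it with MPI.
        cudaMemcpy(host.data(), dev, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        MPI_Send(host.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    else if (rank == 0)                // "machine A" receives the staged copy
    {
        MPI_Recv(host.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
}

(Some MPI implementations are CUDA-aware and can take device pointers directly, but staging through host memory is the portable route.)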