How much RAM does my code need?

Good evening!

I want to run an MPI C++ code on a cluster node with N cores. Each MPI process needs 1 core. The node's memory will be evenly distributed over the used cores.
Each mpi process does equivalent work, i.e. needs the same amount of time and memory.

In order to know how many processes I can launch on the node without running out of memory, I need a reliable gauge of the memory a single process is going to use. I have my local Linux laptop to test on. I tried using "top", but I am a bit unsure about my conclusions. Which number is the one that matters, VIRT or RES?

Any other recommendations on how to proceed?

Best,
PiF
I think this is not so easy to answer. Each process has its own separate "virtual" address space, which contains all the memory that is allocated to the process. This includes statically allocated memory (e.g. global variables), the stack for each of the process' threads, as well as memory allocated dynamically via malloc() and co.

But: Certainly not all of a process' "virtual" memory needs to actually be in the "physical" RAM at the same time, or all the time! Memory pages that are not currently needed can be swapped out. They'll be swapped in again, if they need to be accessed at a later time. So, it's very possible that the sum of the "virtual" memory of all of your running processes is way larger than the "physical" RAM; it is not necessarily/usually a problem!

So you should be more concerned about the "working set" of a process. That's the part of the process' (virtual) memory that the process is using actively, in a given time interval. If the "working sets" of all of your running processes add up to more than the available "physical" RAM (minus a certain fraction of the RAM reserved for the operating system itself), then you'll end up with what is called "memory thrashing". It means that the same memory pages will constantly have to be swapped in and out, resulting in a massive slow down...

You can use a tool like htop to monitor the global (physical) memory usage. As long as that remains somewhat below 100%, you are probably fine. But be aware that the operating system uses a certain fraction of the memory as cache. The memory that is currently used as cache can be freed up at any time, by dropping the cached data, if the system is running out of "free" memory. So, memory that is used as cache effectively can be thought of as "free" memory (at least potentially) too. In htop the different "types" of memory usages are indicated by colors: The green bars indicate "used" memory, whereas yellow bars indicate "cache".
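If you prefer to check those two numbers from inside the program itself, rather than watching top/htop, here is a minimal sketch - Linux-specific, it simply parses /proc/self/status, where VmSize roughly corresponds to "VIRT" and VmRSS to "RES":

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
   ifstream status( "/proc/self/status" );   // Linux-specific
   string line;
   while ( getline( status, line ) )
   {
      // VmSize ~ "VIRT" in top/htop, VmRSS ~ "RES" (both reported in kB)
      if ( line.rfind( "VmSize:", 0 ) == 0 || line.rfind( "VmRSS:", 0 ) == 0 )
         cout << line << '\n';
   }
}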
Kigar64551 is absolutely right that it is all too convoluted to give you a simple answer.

Create some expectations. Record how long it takes a single process to run. If they all truly are that similar, then if any process takes more than twice that long to produce a result (or does so more than N times), tell it to shut down cleanly and log an error. Once that system is in place, stress test the thing.

I think the better way to go about it is to ask: what kind of data are you trying to crunch? Why not figure out how to break the data down into manageable bites rather than overloading RAM to begin with?
Hi guys,

thanks for your elaborate responses! I did hope that it was a bit more straightforward, but life's hard I guess.

I am a PhD student (applied maths) and I run numerical schemes that generate samples from probability distributions. Each iteration generates a sample. With a serial code, I would need to let the scheme run for hours to days in order to collect enough samples. I can, however, parallelize it by starting N processes to collect the samples, each one using a different random seed. The biggest memory load would then be the vector of samples being collected. I can easily gauge this (e.g. 200 million double-precision floats = 1.6 GB of data), but I am unsure about all the other memory requirements (other variables, functions, libraries, and so on).

I guess I will go with trial and error for a while...
Aside from your large sample vector/array, the rest of the variables are going to be fairly insignificant.

Also, it doesn't matter how many threads you have: there is only one copy of the code (functions, libraries, etc.).

Plus, there should only be one copy of, say, your 100M samples, if each sample[i] is independent of any other sample[j]. You would then set up your N threads to work on a separate 1/Nth block of the overall array.
The memory requirement for the program code and libraries is probably in the range of hundreds of kilobytes, or maybe a few megabytes - so quite negligible compared to 1.6 GiB of sample size.

So I think the biggest memory hog would be the 1.6 GiB of sample data. But do you really need to keep all of the 1.6 GiB in memory at the same time? Instead of generating all the 1.6 GiB of sample data at once and then process it as a whole, would it be possible to generate the sample data "on the fly" while it is being processed? If so, you wouldn't have to keep all the 1.6 GiB of sample data in memory at the same time, but only the next "chunk" to be processed. This could greatly reduce the memory demand of your application!
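Just to illustrate the idea - a rough sketch only, where generate_sample() and process_chunk() are stand-ins for whatever your scheme actually does - the memory demand then stays at the chunk size instead of the full 1.6 GiB:

#include <iostream>
#include <random>
#include <vector>
using namespace std;

double generate_sample()                            // stand-in for one iteration of the scheme
{
   static mt19937 gen( 12345 );
   static normal_distribution<double> dist( 0.0, 1.0 );
   return dist( gen );
}

void process_chunk( const vector<double>& chunk )   // stand-in for the actual analysis / output
{
   double sum = 0.0;
   for ( double x : chunk ) sum += x;
   cout << "chunk of " << chunk.size() << " samples, mean = " << sum / chunk.size() << '\n';
}

int main()
{
   const long long totalSamples = 10000000;   // total number of samples to generate
   const size_t    chunkSize    = 1000000;    // ~8 MB resident instead of the full data set

   vector<double> chunk;
   chunk.reserve( chunkSize );

   for ( long long i = 0; i < totalSamples; i++ )
   {
      chunk.push_back( generate_sample() );
      if ( chunk.size() == chunkSize )
      {
         process_chunk( chunk );   // after this the samples can be discarded...
         chunk.clear();            // ...the capacity is kept, so no reallocation next round
      }
   }
   if ( !chunk.empty() ) process_chunk( chunk );   // leftovers
}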

Another important question: Does the memory requirement for the actual computations - variables, buffers for intermediate results, and so on - depend on the size of the input (sample), or is it fixed? If the memory requirement does depend on the input size, does it grow proportionally, quadratically, exponentially, ...?
In MPI each processor has its own copy of the code, so the program code part will be linearly proportional to the number of processors used. However, as @kigar64551 points out, this is of the order of a few MB or less. It's a hell of a lot less than Google Chrome consumes!

The memory for the data part depends on how much of the total data each processor "sees". If you simply split the array store amongst processors then it will consume about the same total whether you have 1 processor or 16. If, however, you have a "master-slave" relationship (not allowed to use that term at work any more!) then the root processor may have a copy of all the other processors' data, so ending with a total requirement for multiple processes of about twice what you would have for a serial computation.
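A rough sketch of the difference (hypothetical sizes, and obviously not your actual computation): each rank owns only its own slice, and only if you gather everything on the root does the root additionally need a buffer for the complete data set - which is what roughly doubles the total.

#include <iostream>
#include <vector>
#include "mpi.h"
using namespace std;

int main( int argc, char* argv[] )
{
   MPI_Init( &argc, &argv );
   int rank, size;
   MPI_Comm_rank( MPI_COMM_WORLD, &rank );
   MPI_Comm_size( MPI_COMM_WORLD, &size );

   const int N = 1000000;                  // total number of samples (illustration only)
   const int localN = N / size;            // assume N is divisible by size, for simplicity

   vector<double> local( localN, rank );   // every rank stores only its own slice

   vector<double> all;                     // the full copy exists only on the root
   if ( rank == 0 ) all.resize( N );

   MPI_Gather( local.data(), localN, MPI_DOUBLE,
               all.data(),   localN, MPI_DOUBLE, 0, MPI_COMM_WORLD );

   if ( rank == 0 ) cout << "Root now holds " << all.size() << " doubles\n";

   MPI_Finalize();
}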

1.6 GB sounds a lot for what you have described your application to be.

The following MPI code (pi by Monte-Carlo methods) has almost no data store. On my PC the memory usage is about 1.2MB * (number of processors).

#include <iostream>
#include <random>
#include <ctime>
#include "mpi.h"
using namespace std;

int main( int argc, char* argv[] )
{
   const int N = 1 << 30;
   int root = 0;

   clock_t t1 = clock();

   int rank, size;
   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &rank );
   MPI_Comm_size( MPI_COMM_WORLD, &size );

   mt19937 gen( time(0) + rank );
   uniform_real_distribution<double> dist( 0.0, 1.0 );
   int localN = N / size;

   int sum = 0;
   for ( int i = 0; i < localN; i++ )
   {
      double x = dist( gen ), y = dist( gen );
      sum += ( x * x + y * y ) < 1.0;
   }

// cout << "Processor " << rank << " with " << localN << " trials: " << 4.0 * sum / localN << '\n';

   int globalSum;
   MPI_Reduce( &sum, &globalSum, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD );
   clock_t t2 = clock();
   if ( rank == root ) cout << "Overall with " << N << " trials: " << 4.0 * globalSum / N
                            << "      Time = " << ( t2 - t1 + 0.0 ) / CLOCKS_PER_SEC << " s\n";

   MPI_Finalize();
}


On 4 processors:
mpiexec -n 4 pi 
Overall with 1073741824 trials: 3.14163      Time = 4.636 s
Hi guys,

I am still interested in learning more about gauging memory consumption.

The situation is still the same: I have N independent processes, each of which collects n samples. The vector of samples is by far the most memory consuming object.

@lastchance

1.6 GB sounds a lot for what you have described your application to be.

The following MPI code (pi by Monte-Carlo methods) has almost no data store. On my PC the memory usage is about 1.2MB * (number of processors).

This toy example is not an accurate representation of what I am doing, as it is not storing the samples but discarding them. To stay with the example of your sum calculation: I need to store the intermediate sums (to examine their distribution). In high-dimensional settings (e.g. molecular dynamics), the memory consumption can easily cross hundreds of GBs.


As recommended in one of the earlier posts, I started playing around with htop to view memory.
Here is a test code to see what htop shows for arrays of various sizes.
int main(int argc, char *argv[]){

   double arr[1];   // vary size to see what htop shows

   while(true){     // to keep the process shown in htop
   }

   return 0;
}


For 1 array element, htop shows this:

VIRT  RES  SHR
4380  756  692

For 100,000 array elements, I have:

VIRT  RES  SHR
5040  804  740

I have no clue where these numbers come from. I expected RES to grow by about 800 kB, since the array holds 100,000 doubles at 8 bytes each.
Instead, the largest difference is in VIRT (and even that is less than 800 kB). I am not solid enough on the different memory types to understand what's going on.

Also, in the case of one element, why is the memory consumption still so large? Is this the C++ standard library that is being loaded?


If all you are doing is calculating sums over data ... then you don't need to store the entire vector of data. For each data point you do whatever analysis you want, add it to the sum and then forget it. (Which, I think, was the gist of kigar64551's posts.)

Without seeing your actual code it's difficult to say further. However, my comment still stands: in MPI every processor has its own copy of the code, plus whatever it needs to store its own data. On a common node it may come from a single large pool; however, each processor's data is not directly available to the others - it's not shared data.

As for size: well, vectors in C++ have some room for expansion, so they may tie up some memory that isn't currently being used. It is possible to make them more slimline if you want, with the shrink_to_fit() member function.
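For example (sizes picked arbitrarily, just to show size() versus capacity()):

#include <iostream>
#include <vector>
using namespace std;

int main()
{
   vector<double> v;
   for ( int i = 0; i < 1000000; i++ ) v.push_back( i );

   // size() is what you logically stored; capacity() is what is actually allocated
   // (the vector grows in jumps, to amortise the cost of reallocations)
   cout << "size: " << v.size() << "   capacity: " << v.capacity() << '\n';

   v.shrink_to_fit();   // non-binding request to release the unused capacity
   cout << "after shrink_to_fit(), capacity: " << v.capacity() << '\n';
}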
Not only the sum, but also the mean and variance can be computed with an "online" (one-pass) algorithm, so that each input value only needs to be visited once. Therefore you do not need to keep all input values in memory at the same time! Instead you can process your input values in a "one by one" fashion and simply discard the values that have already been processed. Google for "Welford's online algorithm" if you need details...
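A minimal sketch of that one-pass approach (the random numbers here are just a placeholder for your actual samples):

#include <iostream>
#include <random>
using namespace std;

int main()
{
   mt19937 gen( 42 );
   normal_distribution<double> dist( 0.0, 1.0 );   // placeholder for the actual sample source

   long long n = 0;
   double mean = 0.0, M2 = 0.0;

   for ( long long i = 0; i < 10000000; i++ )
   {
      double x = dist( gen );      // visit each sample once, then forget it
      n++;
      double delta = x - mean;
      mean += delta / n;
      M2   += delta * ( x - mean );
   }

   cout << "mean: " << mean << "   variance: " << M2 / ( n - 1 ) << '\n';
}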
Guys, I never said I was computing a mere sum... I need the samples stored (i.e. no on-the-fly processing here), and I thus need to know the best way to gauge memory requirements of my code.

It was recommended (by you Kigar) in earlier answers that I could use "htop" to monitor the memory requirement of a single process. While this sounds promising, I am unclear about its behavior, see my last post. Can you make sense of the numbers in my little array example?
Well, keep in mind that the (virtual) memory space of your process contains both program code and data.

Furthermore, even a "minimal" C program consisting only of a main() function that prints "Hello world!" and nothing else, will have to load - at runtime - the C runtime library and maybe further fundamental "system" libraries into its memory space. Consequently, the memory occupied by your double arr[1] array, i.e. a single double value (8 bytes), is quite negligible, compared to all the loaded program code (libraries).

Of course, this will change if you make the array a whole lot bigger. If you make the array large enough, then the amount of memory occupied by program code (libraries) quickly becomes negligible.

Another thing to keep in mind is that memory is usually allocated in chunks referred to as "pages". Typically, the page size is at least 4 KB. Hence, if the "net" size of the double array (in bytes) is not an exact multiple of the page size, then its size needs to be rounded up to the next integer multiple of the page size!

Last but not least, just because some memory pages are allocated in the process' virtual memory space, they do not have to actually be present in the physical RAM - and certainly not all the time. For example, if you allocate a very large array, but you never actually access all the elements of the array, then that array will occupy a whole lot of virtual addresses, yes, but it will never actually be loaded into the physical RAM...
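Here is a little experiment that makes this visible (Linux-specific again, it reads the VmRSS line from /proc/self/status; ideally compile without optimizations, so that the "touching" loop is not elided):

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

static void printRSS( const string& label )
{
   ifstream status( "/proc/self/status" );   // Linux-specific
   string line;
   while ( getline( status, line ) )
      if ( line.rfind( "VmRSS:", 0 ) == 0 )
         cout << label << " -> " << line << '\n';
}

int main()
{
   printRSS( "at start" );

   const size_t n = 100000000;               // 100 million doubles ~ 800 MB of virtual memory
   double* big = new double[n];              // not initialized, so the pages are not touched yet
   printRSS( "after new[]" );                // RES barely changes here...

   for ( size_t i = 0; i < n; i++ ) big[i] = 1.0;   // now actually write to every page
   printRSS( "after touching every element" );      // ...but now it is really resident

   double sum = 0.0;                                // read everything back, so the writes are
   for ( size_t i = 0; i < n; i++ ) sum += big[i];  // not optimised away
   cout << "(checksum: " << sum << ")\n";

   delete[] big;
}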

(very roughly, "RES" is the fraction of "VIRT" that is, at this very moment, held in the physical RAM; and "SHR" is the fraction of "VIRT" that is shareable with other processes, e.g. shared library code)

_________________

Edit:

More details about what exactly makes up the virtual memory space of a process is available via:
/proc/<PROCESS_ID>/maps

See also:
https://stackoverflow.com/a/1401595
Thanks Kigar,

this helps a lot!

This explains why there is significant memory consumption even for "empty" programs.
It also explains why, when increasing the array size to 100k elements, only the virtual memory consumption increased, since I did not access or initialize the array elements. When I initialize the array in a for-loop, the memory consumption also shows up in RES.

With respect to SHR: I understand it is part of VIRT and that it is shared with other processes.
Is the following correct?
Suppose that process A and B need access to the same library. I assume the library is then stored in SHR of both processes. When A starts to "actively use" the library, it is loaded in A's RES. When B then actively starts using the library, is it loaded in B's RES? Or can B use the RES of A?

Just trying to figure out in what way the values of VIRT and SHR are interesting to me when running N independent MPI jobs on a node...

edit:
I found this article to be useful: https://web.archive.org/web/20120520221529/http://emilics.com/blog/article/mconsumption.html

It states to RES:
"RSS (Resident Set Size), therefore, is an indicator that will show the memory consumption when the process is running by it self without sharing anything with other processes. For practical situations where libraries are being shared, RSS (Resident Set Size) will over estimate the amount of memory being consumed by the process. Using to measure memory consumption of a process is not wrong but you may want to keep in mind of this behaviour. "

As long as the overestimation of used RAM is not too large, using RES to gauge the total memory used on the compute node by N processes should be fine. It's better to slightly overestimate the RAM needed than to run out of memory.


Furthermore, it introduces PSS (Proportional Set Size):
"PSS (Proportional Set Size) is a relatively new indicator that can be used to measure memory consumption of a single process. It is not be available on all Linux systems yet but if it is available, it may come in handy. The concept is to split the memory amount of shared pages evenly among the processes that are using them.
This is how PSS (Proportional Set Size) calculates memory consumption: If there are N processes that are using a shared library, each process is consuming one N-th of the shared libraries pages."


If we are talking about N mpi processes that follow the same algorithm, use the same libraries and so on, I would assume that this metric would be the best for my aim, no?
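In case it is useful to anyone else, here is a little sketch of how I would try to read PSS directly - assuming the kernel provides /proc/<PID>/smaps_rollup (newer kernels); on older kernels one would have to sum up the individual Pss: lines of /proc/<PID>/smaps instead:

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
   // aggregated view of /proc/self/smaps; assumes a reasonably recent kernel
   ifstream rollup( "/proc/self/smaps_rollup" );
   string line;
   while ( getline( rollup, line ) )
      if ( line.rfind( "Pss:", 0 ) == 0 )   // e.g. "Pss:   1234 kB"
         cout << line << '\n';
}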
It also explains why, when increasing the array size to 100k elements, only the virtual memory consumption increased, since I did not access or initialize the array elements. When I initialize the array in a for-loop, the memory consumption also shows up in RES.

Yeah, "virtual" memory usually is only actually allocated in the "physical" RAM when it is accessed/initialized for the first time. But even if you access/initialize the whole array in a long loop, then it is possible that "RES" will not grow linearly, because the parts of the array that you accessed least recently may get swapped out from "physical" RAM; they'd be swapped in again, if you access them again at a later time.

Suppose that process A and B need access to the same library. I assume the library is then stored in SHR of both processes. When A starts to "actively use" the library, it is loaded in A's RES. When B then actively starts using the library, is it loaded in B's RES? Or can B use the RES of A?

Shared library code only needs to be held in the "physical" RAM once, even if used by multiple processes. The exact same "physical" memory pages will simply be mapped into the "virtual" address space of any process that needs them. This does not violate process isolation, because those "shared" pages will be either "read only" or at least "copy on write" (COW). Also, the "shared" pages do not have to be loaded into the "physical" RAM separately for each process; if one process already caused them to be loaded into the "physical" RAM, then they'll be readily available there for all other processes too! However, I'm not exactly sure whether those "shared" pages would be counted separately in the "RES" shown for each individual process, but I think that in the global "physical" memory usage they will be counted only once for the whole system.

Maybe we can say: The "RES" shown for a specific process is the fraction of that process' "VIRT" which - at this very moment - is held in the physical RAM; but it is possible for the "VIRT" of multiple processes to partly overlap (because of shared memory pages), and therefore their "RES" may in fact partly overlap too.

If we are talking about N mpi processes that follow the same algorithm, use the same libraries and so on, I would assume that this metric would be the best for my aim, no?

I still think that, for a program that is not just a "minimal" demo program, but that actually processes large amounts of data, the memory occupied by the program code (and required libraries) quickly becomes negligible, compared to the memory occupied by the data. Even more so when you start N instances of the same program, because for the reasons discussed before, the program code and the shared libraries need to be held in "physical" RAM only once, regardless of how many instances of the program you start. But for the data, each instance of the program will obviously allocate its own separate memory buffers (those are independent for each process!), which means that the "physical" memory occupied by data will increase in proportion to the number of instances, up to the point where it is no longer possible to hold all data of all processes in the "physical" RAM at the same time and therefore some of the data needs to be swapped out to the disk...
Thank you Kigar,

I think I have a better idea now how it all works.
Topic archived. No new replies allowed.