I have parallelized my code using MPI with a master-slave scheme. The code is too big to post here.
It works fine on my Mac, but I have a problem when I run it on Linux.
When I run the code on Linux for a short run of around a hundred steps, it works fine and generates all the output files, but for thousands of steps it seems to hang. When I run squeue, the job is no longer in the queue, which usually means it has finished, but none of the output files are generated. When I check the Slurm output file, though, it looks as if the job is still running.
I don't have this problem when I run it on my Mac. I compile it with the latest gcc and g++ compilers.
I am confused! Does anyone have any idea what might be going wrong with the code?
Thank you for your reply. I worked on it a little more. I think it runs out of memory.
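One way to check the out-of-memory suspicion is to print each process's peak memory every so often and see whether it keeps growing with the step count. The following is only a sketch: `peak_rss_kb` and `log_peak_memory` are hypothetical helper names, not part of the original code, and it relies on the POSIX call `getrusage`, which reports `ru_maxrss` in kilobytes on Linux but in bytes on macOS.

```cpp
#include <sys/resource.h>  // getrusage (POSIX)
#include <cstdio>

// Peak resident set size of the calling process, in kilobytes.
// Returns -1 if getrusage fails.
long peak_rss_kb() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
#ifdef __APPLE__
    return usage.ru_maxrss / 1024;  // macOS reports bytes
#else
    return usage.ru_maxrss;         // Linux reports kilobytes
#endif
}

// Hypothetical helper: one line per checkpoint, so growth shows up
// in the Slurm output file.
void log_peak_memory(int rank, long step) {
    std::printf("rank %d step %ld peak RSS %ld kB\n", rank, step, peak_rss_kb());
    std::fflush(stdout);  // flush so the line is not lost if the job is killed
}
```

Calling `log_peak_memory(rank, step)` every few hundred steps on each rank should be enough to see a trend. For a job that has already ended, Slurm's `sacct` command also reports a `MaxRSS` field that can be compared against the memory requested for the job.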
I am broadcasting a lot of data to all the slave nodes in every iteration of a large loop.
It seems that the buffer is not being freed. I tried sending a finalize signal to the slaves to tell them that the data transfer is over, but the broadcast only goes through for the first iteration and after that it waits forever.