At my university institute we have a "cluster", i.e. thousands of CPUs that we can use to run our simulations with different input parameters simultaneously.
In my case, I use different random seeds for a random number generator.
However, it seems like on some of the cluster nodes my code does not complete. There is no error message or anything (well, apart from a cluster error message that the task unexpectedly stopped). From the output that is written to a file, I can see that the code randomly stops at a point in the program where nothing should go wrong. It does not happen with all tasks, so on most cluster nodes the program finishes just fine.
When I try to reproduce the error on my own computer, using the same input parameters (including the random seed) of one of the failed cluster runs, the code completes just fine.
Now, my random number generator produces numbers in an interval whose two bounds are obtained via floating-point arithmetic (see my other thread), so even if I use the same random seed on my own computer, the execution might not be completely equivalent.
Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?
I am compiling the code on my own computer and only executing it on the cluster. Maybe I should try to compile it on the cluster nodes themselves...
Just wanted to ask whether people here are familiar with such behavior.
Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?
Most often this is due to some 'undefined behavior' issue: code that works most of the time but occasionally fails must be doing something unpredictable.
Without code, however, I can't see what that might be.
It could also be an infinite loop. Are the programs sharing anything? Could it be a deadlock? Those 3 things (undefined behavior, infinite loop, or deadlock) are all possible here.
The best way for us to help is for you to post your code. It's nearly impossible otherwise.
In your other thread, you mention that the random number generator takes a and b which are the result of some floating point calculations. Could you be dividing by zero or otherwise creating invalid values?
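To make that check concrete, here is a minimal sketch of validating the bounds before they reach the generator. The names is_valid_interval and checked_uniform are made up for illustration, and I am assuming the code uses something like std::uniform_real_distribution, which I can't know without seeing it. Passing NaN, infinite, or reversed bounds to that distribution is undefined behavior, so failing loudly is easier to debug.

```cpp
#include <cmath>
#include <random>
#include <stdexcept>

// Hypothetical helper: true only if a and b are finite, ordered, and
// their difference does not overflow.
bool is_valid_interval(double a, double b) {
    return std::isfinite(a) && std::isfinite(b) && a < b
        && std::isfinite(b - a);
}

// Hypothetical wrapper: draws one number between a and b, throwing an
// exception instead of running into undefined behavior on bad bounds.
double checked_uniform(std::mt19937_64& rng, double a, double b) {
    if (!is_valid_interval(a, b)) {
        throw std::invalid_argument("bad interval for uniform distribution");
    }
    std::uniform_real_distribution<double> dist(a, b);
    return dist(rng);
}
```

An uncaught exception at least terminates the program with a message naming the problem, which is more than a silent stop gives you.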
Does the cluster give you a core file for the stopped programs? You could examine that with a debugger to find the problem.
Programs exit with a return code (the value returned from main(), or a larger value indicating the reason that the program was terminated by the operating system). Does the cluster tell you the return code?
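If the cluster software does not report the return code, a small wrapper can. The following is a rough sketch for POSIX systems only (I am assuming the nodes run Linux or similar); describe_status is a hypothetical helper name, and the command in the usage comment is a placeholder, not your actual program.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // POSIX only: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG

// Hypothetical helper: turns the wait status returned by std::system()
// (on POSIX) into a readable description.
std::string describe_status(int status) {
    if (WIFEXITED(status)) {
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    }
    if (WIFSIGNALED(status)) {
        return "killed by signal " + std::to_string(WTERMSIG(status));
    }
    return "unknown status " + std::to_string(status);
}

// Usage (placeholder command line):
//   int status = std::system("./simulation --seed 42");
//   std::cerr << describe_status(status) << std::endl;
```

A "killed by signal" result is the interesting case: for example, signal 11 (SIGSEGV) points at a memory error in the program, while signal 9 (SIGKILL) usually means something outside the program killed it.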
If you can't get a core file then you may need to add some debugging output to help see what the program is doing.
But if you post your code, someone here might be able to locate the problem.
I misread something; it's not an infinite loop if it crashed.
If it crashed, don't rule out disk full or out of memory or other local problems that have nothing to do with your code (directly).
The core output file for the failed runs simply states that the task died...
There is another outputfile that simply prints what otherwise would be given by "cout<<...". From this file, I can see that it happens at completely random points in the code (and not, say, always during the same procedure).
Wouldn't numerical errors like dividing by zero give a runtime error or something like that?
I can't post the code. On one hand, it is too large. On the other hand, I am not sure whether I am allowed to publish it here^^"
If you can't directly use a debugger, then all I can suggest is to add more print statements. Since it sounds like you are redirecting output to files, make sure you flush the stream after each print statement; otherwise buffered output is lost when the program dies, and the file will not show where it actually crashed.
Edit:
Possibly more important: I have seen no information about what your environment is, other than that you said it's single-threaded. What is the OS? How much memory do you have to work with? Is this on a virtual machine?
Edit 2:
You also should conduct some sanity tests. Does a simpler program that just runs in a busy loop of some sort, doing basic computation, also crash when run in this environment?
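Something along these lines would do as a sanity test. This is only a sketch: busy_checksum is a made-up name, and the xorshift constants are just a convenient way to keep the CPU busy with a deterministic integer computation.

```cpp
#include <cstdint>

// Hypothetical sanity test: a deterministic integer computation that
// must give the same result for the same iteration count on every
// healthy machine.
std::uint64_t busy_checksum(std::uint64_t iterations) {
    std::uint64_t x = 0x9E3779B97F4A7C15ULL;  // arbitrary nonzero start value
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // xorshift64 step: cheap, deterministic bit mixing
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        sum += x;  // unsigned overflow wraps around, which is well defined
    }
    return sum;
}

// Usage: print busy_checksum(n) for some large n on your own computer
// and on the cluster nodes, then compare the printed values.
```

If this also dies on some nodes, or prints a different value there, the problem is very unlikely to be in your simulation code.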
Thank you again! Unfortunately, I cannot answer most of your questions. But I will keep testing my stuff on the cluster and talk to the admin later, also with respect to your suggestions!
I believe whether it crashes on a divide by zero depends on a compiler flag. It may or may not. Divide by zero sets a floating point nan value that usually propagates. Eventually everything you print will have 'nan' in it. You can check your compiler flags and see if you can find the control for this if you think it happened.
I suppose I'm being picky, but the result of a floating-point divide by zero is actually +/- infinity unless it's 0.0 / 0.0, which is "not-a-number" (nan). I agree that the result tends to be contagious. Almost any expression containing inf or nan will result in inf or (more likely) nan.
The problem seemed to stem from the cluster hardware. It was really hot in my part of the world at that time, and the current theory is that the hardware overheated.
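A small sketch to illustrate the difference, assuming IEEE 754 doubles and default compiler settings (options such as -ffast-math can change this behavior). classify_quotient is a made-up name.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: divides two doubles and reports what kind of
// value came out. Assumes IEEE 754 floating point.
std::string classify_quotient(double num, double den) {
    double q = num / den;
    if (std::isnan(q)) {
        return "nan";
    }
    if (std::isinf(q)) {
        return q > 0 ? "+inf" : "-inf";
    }
    return "finite";
}

// Usage:
//   classify_quotient(1.0, 0.0)   gives "+inf"
//   classify_quotient(-1.0, 0.0)  gives "-inf"
//   classify_quotient(0.0, 0.0)   gives "nan"
```

Calling std::isfinite on intermediate results is a cheap way to find where inf or nan first appears. Integer division by zero is a different matter: it is undefined behavior in C++ and typically does kill the program.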
The error suddenly stopped appearing.