Code simply stops randomly?

Jun 30, 2019 at 9:52pm

Hello guys,

I am experiencing strange code behavior.

At my university institute we have a "cluster", i.e. thousands of CPUs that we can use to run our simulations with different input parameters simultaneously.

In my case, I use different random seeds for a random number generator.

However, it seems that on some of the cluster nodes my code does not complete. There is no error message or anything (well, apart from a cluster message saying that the task stopped unexpectedly). From the output that is written to a file, I can see that the code randomly stops at a point in the program where nothing should go wrong. It does not happen with all tasks; on most cluster nodes the program finishes just fine.

When I try to reproduce the error on my own computer, using the same input parameters (including the random seed) of one of the flawed cluster runs, the code completes just fine.

Now, my random number generator draws numbers from an interval whose two endpoints are obtained via floating-point arithmetic (see my other thread), so even if I use the same random seed on my own computer, the execution might not be completely equivalent.

Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?
I am compiling the code on my own computer and only executing it on the cluster. Maybe I should try to compile it on the cluster nodes themselves...

Just wanted to ask whether people here are familiar with such behavior.

Best,
PhysicsIsFun.

Last edited on Jun 30, 2019 at 9:55pm

Jun 30, 2019 at 10:06pm

Niccolo (720)

Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?

Most often this is due to some 'undefined behavior' issue. Code that works most of the time but occasionally fails must be doing something unpredictable.

Without code, however, I can't see what that might be.

Jun 30, 2019 at 10:09pm

dutch (2548)

That's the signature of undefined behavior. Most likely there is an error in your code that only shows up in specific circumstances.

Jul 1, 2019 at 1:07am

jonnin (11491)

Could also be an infinite loop. Are the programs sharing anything? Could it be a deadlock? Those 3 things (undefined behavior, infinite loop, or deadlock) are all possible here.

Jul 1, 2019 at 10:32am

PhysicsIsFun (297)

Thanks for your input.

Would an infinite loop not keep running... like... infinitely?^^

edit: Deadlock seems to be a concept from multithreading, right? I am not doing that, each simulation is only executed by a single CPU.

Last edited on Jul 1, 2019 at 11:11am

Jul 1, 2019 at 11:50am

dhayden (5799)

The best way for us to help is for you to post your code. It's nearly impossible otherwise.

In your other thread, you mention that the random number generator takes a and b which are the result of some floating point calculations. Could you be dividing by zero or otherwise creating invalid values?
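One way to rule this out is to check the bounds right before they reach the generator. The following is only a minimal sketch: the names `is_valid_interval` and `draw_uniform` are made up for illustration, and it assumes the numbers are drawn with `std::uniform_real_distribution`, which the thread does not confirm.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical helper: true only if a and b are finite, strictly ordered,
// and their difference does not overflow to infinity.
bool is_valid_interval(double a, double b) {
    return std::isfinite(a) && std::isfinite(b) && a < b && std::isfinite(b - a);
}

// Hypothetical wrapper around the draw. The assert aborts with a file/line
// message if the bounds are bad (only when NDEBUG is not defined).
double draw_uniform(std::mt19937_64& rng, double a, double b) {
    assert(is_valid_interval(a, b) && "invalid bounds for random interval");
    std::uniform_real_distribution<double> dist(a, b);
    return dist(rng);
}
```

Calling `draw_uniform(rng, a, b)` in place of the direct draw would turn a silent inf/nan bound into an immediate, visible abort in a debug build.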

Does the cluster give you a core file for the stopped programs? You could examine that with a debugger to find the problem.

Programs exit with a return code (the value returned from main(), or a larger value indicating the reason that the program was terminated by the operating system). Does the cluster tell you the return code?
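If the cluster does not report the code, a small launcher can. This is a sketch under the assumption of a POSIX system (it uses the `<sys/wait.h>` macros); `describe_status` and `run_and_describe` are hypothetical names, and whether your batch system lets you wrap the job like this is something to check with the admin.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // POSIX only: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG

// Hypothetical helper: translate a wait status (as returned by std::system
// or waitpid on POSIX) into a human-readable description.
std::string describe_status(int status) {
    if (status == -1) {
        return "could not run command";
    }
    if (WIFEXITED(status)) {
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    }
    if (WIFSIGNALED(status)) {
        return "killed by signal " + std::to_string(WTERMSIG(status));
    }
    return "unknown status";
}

// Hypothetical helper: run a command through the shell and describe how it ended.
std::string run_and_describe(const std::string& command) {
    return describe_status(std::system(command.c_str()));
}
```

Shells such as bash typically report a process killed by signal N as exit code 128 + N, so a reported 139 would point to signal 11 (a segmentation fault).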

If you can't get a core file then you may need to add some debugging output to help see what the program is doing.

But if you post your code, someone here might be able to locate the problem.

Jul 1, 2019 at 1:02pm

jonnin (11491)

I misread something; it's not an infinite loop if it crashed.
If it crashed, don't rule out a full disk, running out of memory, or other local problems that have nothing to do with your code (directly).

Jul 1, 2019 at 3:47pm

PhysicsIsFun (297)

@dhayden

The core output file for the defect runs simply states that the task died...
There is another output file that simply contains what would otherwise be printed by "cout<<...". From this file, I can see that it happens at completely random points in the code (and not, say, always during the same procedure).

Wouldn't numerical errors like dividing by zero give a runtime error or something like that?

I can't post the code. On one hand, it is too large. On the other hand, I am not sure whether I am allowed to publish it here^^"

Jul 1, 2019 at 4:32pm

Ganado (6832)

If you can't directly use a debugger, then all I can suggest is to add more print statements. Since it sounds like you are redirecting the output to files, make sure you flush the stream after each print statement; otherwise buffered output can be lost in the crash and you can't tell exactly where it stopped.
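A minimal sketch of what flushing could look like, assuming the log goes through a C++ stream. `log_checkpoint` and `open_autoflush_log` are hypothetical helper names, not anything from the original code.

```cpp
#include <cassert>
#include <fstream>
#include <ostream>
#include <string>

// Hypothetical helper: write one progress line and flush it immediately,
// so the line is handed to the OS even if the program dies right afterwards.
void log_checkpoint(std::ostream& out, const std::string& message) {
    out << message << '\n' << std::flush;
}

// Alternative: open a log file in "unit buffered" mode, in which the stream
// flushes itself after every output operation.
std::ofstream open_autoflush_log(const std::string& path) {
    assert(!path.empty());
    std::ofstream out(path);
    out << std::unitbuf;
    return out;
}
```

`std::endl` also flushes, and `std::cerr` is unit buffered by default, so writing the checkpoints to `std::cerr` is another option. Flushing after every line costs some speed, so it is probably best limited to the debugging runs.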

Edit:
Possibly more important: I've seen no information about what your environment is, other than that it's single-threaded. What is the OS? How much memory do you have to work with? Is this on a virtual machine?

If this is Windows, dump files can be generated when an application crashes.
https://docs.microsoft.com/en-us/windows/desktop/wer/collecting-user-mode-dumps
You can then examine the crash dump file to figure out which part of the stack you were on, and other states.

I don't know what capabilities other OSes have.

Edit 2:
You also should conduct some sanity tests. Does a simpler program that just runs in a busy loop of some sort, doing basic computation, also crash when run in this environment?
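Such a sanity test could look roughly like the sketch below. It is only an illustration: `busy_checksum` and `node_is_consistent` are invented names, and the xorshift loop is just a convenient deterministic workload, not anything from the actual simulation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical workload: a deterministic xorshift64 loop whose running sum
// depends on every iteration, so the result is identical on every healthy machine.
std::uint64_t busy_checksum(std::uint64_t iterations) {
    std::uint64_t state = 0x9E3779B97F4A7C15ULL;
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        sum += state;
    }
    return sum;
}

// Runs the workload twice and compares the results. The volatile read keeps
// the compiler from merging the two calls into a single computation.
bool node_is_consistent(std::uint64_t iterations) {
    volatile std::uint64_t n = iterations;
    const std::uint64_t first = busy_checksum(n);
    const std::uint64_t second = busy_checksum(n);
    return first == second;
}
```

If a program this simple also dies on the same nodes, or prints a different checksum there than on your own machine, the problem is unlikely to be in the simulation code.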

Last edited on Jul 1, 2019 at 4:44pm

Jul 1, 2019 at 4:45pm

dhayden (5799)

The core output file for the defect runs simply states that the task died...

Can you get a stack trace from the core file? That alone may point to the problem. Better yet would be to examine the core file with a debugger.

Jul 2, 2019 at 9:43am

PhysicsIsFun (297)

Hi guys,

thank you again! Unfortunately, I cannot answer most of your questions. But I will keep testing my stuff on the cluster and talk to the admin later, also with respect to your suggestions!

Thanks!

Jul 2, 2019 at 1:19pm

jonnin (11491)

I believe whether it crashes on a divide by zero depends on a compiler flag, so it may or may not. Divide by zero sets a floating point nan value that usually propagates. Eventually everything you print will have 'nan' in it. You can check your compiler flags and see if you can find the control for this if you think it happened.
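Independently of compiler flags, standard C++ lets you ask after the fact whether a floating-point exception flag was raised, using `<cfenv>`. The sketch below assumes IEEE 754 arithmetic and no aggressive options such as fast-math; `division_flags` is a hypothetical name. Actually trapping (crashing) on the exception needs platform-specific calls, for example the GNU extension `feenableexcept`, which is not used here.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: perform one division and report which of the
// FE_DIVBYZERO / FE_INVALID flags it raised (0 if neither).
// The volatile variables force the division to happen at run time,
// between clearing and testing the flags.
int division_flags(double numerator, double denominator) {
    volatile double n = numerator;
    volatile double d = denominator;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double quotient = n / d;
    (void)quotient;  // only the side effect on the flags matters here
    return std::fetestexcept(FE_DIVBYZERO | FE_INVALID);
}
```

With this, `division_flags(1.0, 0.0)` should include `FE_DIVBYZERO` and `division_flags(0.0, 0.0)` should include `FE_INVALID`, while an ordinary division returns 0. Testing the flags at a few checkpoints of the simulation would show whether such a division ever happened.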

Jul 2, 2019 at 8:27pm

dhayden (5799)

divide by zero sets a floating point nan value

I suppose I'm being picky, but the result of divide by zero is actually +/- infinity unless it's 0.0 / 0.0, which is "not-a-number" (nan). I agree that the result tends to be contagious. Almost any expression containing inf or nan will result in inf or (more likely) nan.
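A small sketch to see this for yourself, assuming IEEE 754 doubles (the usual case) and no fast-math style compiler options. `classify` and `classify_quotient` are made-up names for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical helper: name the category a double falls into.
std::string classify(double x) {
    if (std::isnan(x)) {
        return "nan";
    }
    if (std::isinf(x)) {
        return x > 0 ? "+inf" : "-inf";
    }
    return "finite";
}

// Hypothetical helper: divide and classify the result.
std::string classify_quotient(double numerator, double denominator) {
    return classify(numerator / denominator);
}
```

Under those assumptions `classify_quotient(1.0, 0.0)` gives "+inf", `classify_quotient(-1.0, 0.0)` gives "-inf", and `classify_quotient(0.0, 0.0)` gives "nan". Combining infinities, e.g. inf - inf, gives "nan" as well, which is how the values spread through later results.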

Jul 16, 2019 at 4:13pm

PhysicsIsFun (297)

Hi again,

just to let you know:

The problem seemed to stem from the cluster hardware. It was really hot in my part of the world at that time, and the current theory is that the hardware overheated.
The error suddenly stopped appearing.

Jul 16, 2019 at 8:16pm

Niccolo (720)

@PhysicsIsFun,

The error suddenly stopped appearing.

ooooooooo I hate it when that happens!

"Undefined behavior from the HARDWARE!"

...was wondering what happened to you...glad it makes sense.

Check the Freon, clean the coils :)

Jul 17, 2019 at 10:30am

MikeyBoy (5631)

Heh... ascribing hard-to-diagnose problems to "thermal issues" became a bit of a running joke at a previous job :)

Last edited on Jul 17, 2019 at 10:30am

Jul 17, 2019 at 3:15pm

PhysicsIsFun (297)

Well, as long as the error does not reappear, I take it as a convenient explanation and move on with my life :D

Jul 17, 2019 at 3:20pm

MikeyBoy (5631)

And keeping a close eye on the weather forecast :P
