Hello,
I am trying to find out why my MPI/parallel code hangs. The code works well when I run
mpirun -n 1
and
mpirun -n 2
seems to run without issues as well. But when I run with 4 or more workers (up to 24), it hangs. I added output statements at the beginning and end of each function that also print the MPI worker number, and it seems that when the number of workers is 4 or more, about half of the processes work fine and the other half hang somewhere. How can I troubleshoot such an issue?
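The tracing described above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: ScopeTrace is a hypothetical helper name, and the rank is passed in as a plain integer (in the real code it would come from Utilities::MPI::this_mpi_process(mpi_communicator)). Each message is built as one string and flushed immediately, so that a line from a process that hangs later is not lost in a buffer.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper (not part of deal.II or MPI): prints an "Entering"
// line when constructed and a "Leaving" line when destroyed, so the exit
// message is also printed when the scope is left through an exception.
class ScopeTrace {
public:
  ScopeTrace(std::ostream &out, unsigned int rank, std::string name)
      : out_(out), rank_(rank), name_(std::move(name)) {
    log("Entering");
  }
  ~ScopeTrace() { log("Leaving"); }

  ScopeTrace(const ScopeTrace &) = delete;
  ScopeTrace &operator=(const ScopeTrace &) = delete;

private:
  void log(const char *what) {
    // Build the whole line first, then write and flush it in one go.
    std::ostringstream line;
    line << "process#" << rank_ << " " << what << " " << name_ << '\n';
    out_ << line.str() << std::flush;
  }

  std::ostream &out_;
  unsigned int rank_;
  std::string name_;
};
```

Used as the first statement of a function, e.g. `ScopeTrace trace(std::cout, rank, "findRecirculationPoint");`, it shows which ranks entered a function and never left it.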
Inside run.cpp, the function findRecirculationPointMain is called, which in turn calls findRecirculationPoint. I run the code on a Linux cluster.
Utilities::MPI::max comes from a third-party library (deal.II), documented here:
http://dealii.org/8.3.0/doxygen/deal.II/namespaceUtilities_1_1MPI.html
Here is the code of findRecirculationPointMain (I believe some processes hang inside it):
// Searches for the root of Velocity1.
inline double findRecirculationPointMain(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator){
    cout << "process#" << Utilities::MPI::this_mpi_process(mpi_communicator) << "Entering findRecirculationPointMain" << std::endl;
    // Sentinel meaning "not found". Written as 1e6: in C++, 10^6 is a bitwise XOR and evaluates to 12.
    double xCoordTemp = 1e6;
    try{
        xCoordTemp = findRecirculationPoint(dof_handler, velocity, leftPoint, rightPoint, mpi_communicator);
    } catch (const ExceptionBase &) { // not found; catch by reference to avoid slicing
        xCoordTemp = 1e6;
    }
    // Collective call: every process in mpi_communicator must reach this line.
    // -max(-x) over all processes is the minimum of xCoordTemp.
    double xCoord = -1. * Utilities::MPI::max ( (-xCoordTemp), mpi_communicator );
    cout << "process#" << Utilities::MPI::this_mpi_process(mpi_communicator) << "Leaving findRecirculationPointMain: xCoord=" << xCoord << std::endl;
    return xCoord;
}
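For reference, the last reduction in the function above relies on the identity min(x) = -max(-x) to pick the smallest candidate over all processes using only Utilities::MPI::max. Below is a minimal serial sketch of that identity, with no MPI involved: the per-process values are simply assumed to sit in a std::vector, and the names minViaNegatedMax and kNotFound are hypothetical, chosen only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sentinel for "no root found on this process". Written as 1e6 because in
// C++ the expression 10^6 is a bitwise XOR and evaluates to 12.
constexpr double kNotFound = 1e6;

// Hypothetical serial stand-in for the reduction: each entry plays the role
// of xCoordTemp on one process. Returns the minimum, computed as -max(-x).
double minViaNegatedMax(const std::vector<double> &perProcess) {
  assert(!perProcess.empty());
  double maxOfNegated = -perProcess.front();
  for (double x : perProcess)
    maxOfNegated = std::max(maxOfNegated, -x);
  return -maxOfNegated;
}
```

With illustrative inputs such as {kNotFound, kNotFound, 1.544, kNotFound}, the function returns the one real candidate; if every entry is kNotFound, the sentinel itself comes back. In the MPI version the same result is only produced once every process has contributed its value.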
Here is the output generated by my code for 4 workers:
process#1run.cpp: Before calling findRecirculationPoint()
process#1Entering findRecirculationPointMain
process#1 Entering findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
process#2run.cpp: Before calling findRecirculationPoint()
process#2Entering findRecirculationPointMain
process#2 Entering findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
process#3run.cpp: Before calling findRecirculationPoint()
process#3Entering findRecirculationPointMain
process#3 Entering findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
curDrag-2.4596
Current point is outside averaging interval, but previous point is inside it. Done with averaging! Exiting doTimeStepping
Current iteration=489 timestep=0.32
process#0run.cpp: Before calling findRecirculationPoint()
process#0Entering findRecirculationPointMain
process#0 Entering findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
process#2In findRecirculationPoint: Root found near 1.544
process#0 In findRecirculationPoint: Root NOT found!
process#0 Leaving findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
process#1 In findRecirculationPoint: Root NOT found!
process#1 Leaving findRecirculationPoint(DoFHandler<2>& dof_handler, LA::MPI::BlockVector& velocity, double leftPoint, double rightPoint, const MPI_Comm & mpi_communicator)
mpirun: killing job...
(yeah I have to kill the job since nothing happens for many hours)
Thank you.