consistent crash of MPICH during transient simulation

Evan Cummings

Jul 16, 2018, 10:02:14 AM
to fenics-support
Hi there,

I've been experiencing a consistent crash of a transient simulation at iteration 1014.  Even after changing parameters, it always crashes at exactly that iteration, with this message:

Process 0: Solving nonlinear variational problem.
  Process 0: Newton iteration 0: r (abs) = 3.572e+14 (tol = 1.000e-10) r (rel) = 1.000e+00 (tol = 1.000e-09)
Traceback (most recent call last):
  File "EISMINT_II_A.py", line 125, in <module>
    callback   = cb_ftn)
  File "/home/pf4d/software/cslvr/cslvr/model.py", line 3164, in transient_solve
    self.transient_iteration(momentum, mass, dt, adaptive, annotate)
  File "/home/pf4d/software/cslvr/cslvr/d3model.py", line 1054, in transient_iteration
    momentum.solve()
  File "/home/pf4d/software/cslvr/cslvr/momentumbp.py", line 383, in solve
    annotate = annotate, solver_parameters = params['solver'])
  File "/home/pf4d/local/python-2.7.15/lib/python2.7/site-packages/dolfin_adjoint/solving.py", line 356, in solve
    ret = backend.solve(*args, **kwargs)
  File "/home/pf4d/local/dolfin-2017.2.0.post0/lib/python2.7/site-packages/dolfin/fem/solving.py", line 300, in solve
    _solve_varproblem(*args, **kwargs)
  File "/home/pf4d/local/dolfin-2017.2.0.post0/lib/python2.7/site-packages/dolfin/fem/solving.py", line 349, in _solve_varproblem
    solver.solve()
RuntimeError: *** Error: Duplication of MPI communicator failed (MPI_Comm_dup

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 21714 RUNNING AT torch
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions


There doesn't seem to be any particular reason why this iteration should crash; I do the same operations as in the other 1013...

The documentation for ``EXIT CODE: 134`` doesn't help much.
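
For what it's worth, the exit code is just the signal in disguise: mpiexec reports 128 plus the signal number, and 134 - 128 = 6 is ``SIGABRT``, which matches the ``Aborted (signal 6)`` line above. A one-line check:

import signal
print(134 - 128 == signal.SIGABRT)   # True: exit code 134 means abort(), i.e. signal 6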


Anyone have any ideas?

Thanks,
Evan

Evan Cummings

Jul 16, 2018, 10:06:20 AM
to fenics-support
Also, I should mention that I am running dolfin version ``2017.2.0``, and:

$ mpiexec --version
HYDRA build details:
    Version:                                 3.2
    Release Date:                            Wed Nov 11 22:06:48 CST 2015
    CC:                              gcc     -O3 -DNDEBUG -fPIC    -O3 -DNDEBUG     
    CXX:                             g++     -O3 -DNDEBUG -fPIC    -O3 -DNDEBUG     
    F77:                             /opt/rh/devtoolset-4/root/usr/bin/gfortran      
    F90:                             /opt/rh/devtoolset-4/root/usr/bin/gfortran      
    Configure options:                       '--disable-option-checking' '--prefix=/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/install' '--with-device=ch3:sock' '--enable-shared' '--disable-static' '--disable-mpe' 'CFLAGS=-fPIC -O3 -DNDEBUG -O2' 'LDFLAGS= ' 'CPPFLAGS= -O3 -DNDEBUG -I/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/superbuild/mpi/src/src/mpl/include -I/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/superbuild/mpi/src/src/mpl/include -I/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/superbuild/mpi/src/src/openpa/src -I/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/superbuild/mpi/src/src/openpa/src -D_REENTRANT -I/home/buildslave/dashboards/buildbot/paraview-pvbinsdash-linux-shared-release_superbuild/build/superbuild/mpi/src/src/mpi/romio/include' 'CXXFLAGS=-fPIC -O3 -DNDEBUG -O2' 'FC=/opt/rh/devtoolset-4/root/usr/bin/gfortran' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'LIBS=-lpthread '
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       
    Demux engines available:                 poll select

Adrian Jackson

Jul 16, 2018, 10:09:24 AM
to Evan Cummings, fenics-support
Hi,

I wonder if MPI is running out of communicators. I've seen MPICH limits
of 2048 communicators per process on systems in the past. If the
communicators aren't being freed during the run, this could be the issue.
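
If you want to confirm that this is what's happening, the failure is easy to reproduce outside of FEniCS. A minimal sketch, assuming mpi4py is available and run under mpiexec, that duplicates MPI_COMM_WORLD without ever freeing, the way a solver object created fresh each iteration might:

# minimal sketch: exhaust MPICH's context-ID table (often ~2048 entries
# per process) by duplicating communicators and never freeing them.
from mpi4py import MPI

comms = []
try:
    for _ in range(65536):
        comms.append(MPI.COMM_WORLD.Dup())   # duplicated, never freed
except MPI.Exception as exc:
    print("MPI_Comm_dup failed after %d duplications: %s"
          % (len(comms), exc))
finally:
    for c in comms:                          # the remedy: free what you dup
        c.Free()

If something in the solve path duplicates a communicator or two per iteration and never frees them, hitting the limit at around iteration 1014 would be plausible.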

cheers

adrianj

On 16/07/2018 15:02, Evan Cummings wrote:
> Hi there,
>
> I've been experiencing a consistent crash of a transient simulation at
> iteration 1014.  Even changing parameters and such, the simulation
> always crashes at iteration 1014 with the message :
>
> |
> Entercode here...Process0:Solvingnonlinear variational problem.

--
Tel: +44 131 6506470 skype: remoteadrianj

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Evan Cummings

Jul 18, 2018, 3:34:15 AM
to fenics-support
Hi Adrian,

Thanks for the input.

After changing the variational problem significantly, I now receive an error at iteration 1012:

 Internal error 1 in DMUMPS_LOAD_RECV_MSGS       32764
 Internal error 1 in DMUMPS_LOAD_RECV_MSGS       22057
application called MPI_Abort(MPI_COMM_WORLD, -99) - process 1
application called MPI_Abort(MPI_COMM_WORLD, -99) - process 0

I cannot find any information online about ``DMUMPS_LOAD_RECV_MSGS``... any ideas?

-Evan

Adrian Jackson

Jul 18, 2018, 3:42:19 AM
to Evan Cummings, fenics-support
Hi Evan,

It's not one I've seen before, but it would suggest an error in MUMPS:

http://mumps.enseeiht.fr/index.php?page=dwnld

Do you have MUMPS installed? What version is installed?
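
If MUMPS is there, one quick experiment is to see whether the crash follows the solver: point the Newton solver at a different direct backend. A hedged sketch against the dolfin 2017.2 ``solve()`` API, using a toy nonlinear problem (``superlu_dist`` assumes your PETSc was configured with it):

# sketch: the same Newton solve, with the linear solver swapped away from
# MUMPS; in cslvr the same dictionary would be passed via params['solver'].
from dolfin import *

mesh = UnitSquareMesh(8, 8)
V    = FunctionSpace(mesh, 'CG', 1)
u    = Function(V)
v    = TestFunction(V)
F    = inner((1 + u**2)*grad(u), grad(v))*dx - Constant(1.0)*v*dx
bc   = DirichletBC(V, Constant(0.0), 'on_boundary')

solve(F == 0, u, bc,
      solver_parameters={'newton_solver':
                         {'linear_solver': 'superlu_dist'}})  # was 'mumps'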

cheers

adrianj


Evan Cummings

Jul 18, 2018, 4:10:55 AM
to fenics-support
Yes, I am using MUMPS; I installed it with PETSc and linked it to DOLFIN.  The exact method I used to install it is:

function install_petsc()
{
  # install petsc :
  cd $SFT_DIR;
  #git clone -b maint https://bitbucket.org/petsc/petsc petsc;
  #cd $SFT_DIR/petsc;
  # NB: the exact URL here is an assumption; PETSc release tarballs are
  # served from http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/
  wget http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-${PETSC_VERSION}.tar.gz \
       -O ${SFT_DIR}/petsc-${PETSC_VERSION}.tar.gz;
  mkdir -p  ${SFT_DIR}/petsc-${PETSC_VERSION};
  tar -xzf ${SFT_DIR}/petsc-${PETSC_VERSION}.tar.gz \
             -C ${SFT_DIR}/petsc-${PETSC_VERSION} \
             --strip-components 1;
  rm ${SFT_DIR}/petsc-${PETSC_VERSION}.tar.gz;
  cd $SFT_DIR/petsc-${PETSC_VERSION};
  export PETSC_DIR=$(pwd);
  ./configure --with-cc=mpicc \
              --with-cxx=mpicxx \
              --with-fc=mpifort \
              --COPTFLAGS="-O2" \
              --CXXOPTFLAGS="-O2" \
              --FOPTFLAGS="-O2" \
              --with-c-support=1 \
              --with-cxx-dialect=C++11 \
              --with-debugging=0 \
              --with-shared-libraries=1 \
              --with-boost-dir=${BOOST_DIR} \
              --with-hdf5-dir=${HDF5_DIR} \
              --with-blas-lib=${OPENBLAS_DIR}/lib/libopenblas.a \
              --with-lapack-lib=${OPENBLAS_DIR}/lib/libopenblas.a \
              --download-scalapack=1 \
              --download-blacs=1 \
              --download-hypre=1 \
              --download-metis=1 \
              --download-mumps=1 \
              --download-parmetis=1 \
              --download-ptscotch=1 \
              --download-spai=1 \
              --download-elemental=1 \
              --download-ml=1 \
              --download-suitesparse=1 \
              --download-superlu=1 \
              --download-superlu_dist=1 \
              --prefix=${PREFIX}/petsc-${PETSC_VERSION};
  make PETSC_DIR=${PETSC_DIR} PETSC_ARCH=arch-linux2-c-opt all;
  make PETSC_DIR=${PETSC_DIR} PETSC_ARCH=arch-linux2-c-opt install;
  export PETSC_DIR=${PREFIX}/petsc-${PETSC_VERSION};
  make PETSC_DIR=${PETSC_DIR} PETSC_ARCH="" test;

  # install petsc4py 
  #pip install --no-cache-dir https://bitbucket.org/petsc/petsc4py/downloads/petsc4py-${PETSC4PY_VERSION}.tar.gz --prefix=${PYTHON_DIR};
  #git clone -b maint https://bitbucket.org/petsc/petsc4py petsc4py;
  #cd petsc4py;
  cd $SFT_DIR;
  url="https://bitbucket.org/petsc/petsc4py/downloads/\
       petsc4py-${PETSC4PY_VERSION}.tar.gz";
  url=$(tr -d ' ' <<< "$url");   # remove spaces to fit the url within 80 chr
  wget $url;
  tar -xzvf petsc4py-${PETSC4PY_VERSION}.tar.gz;
  rm petsc4py-${PETSC4PY_VERSION}.tar.gz;
  cd $SFT_DIR/petsc4py-${PETSC4PY_VERSION};
  pip install . -v;
}
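
After installing, a quick sanity check along these lines (a sketch, not part of the script above) confirms which PETSc petsc4py actually picked up, and which MUMPS the ``--download-mumps`` option fetched:

# sanity check: PETSc version seen by petsc4py, and the MUMPS version
# string from the header that --download-mumps installs under
# $PETSC_DIR/include (path is an assumption of this sketch).
import os, re
from petsc4py import PETSc

print(PETSc.Sys.getVersion())        # expect (3, 8, 3)

hdr = os.path.join(os.environ['PETSC_DIR'], 'include', 'dmumps_c.h')
with open(hdr) as f:
    print(re.search(r'MUMPS_VERSION\s+"([^"]+)"', f.read()).group(1))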

Evan Cummings

Jul 18, 2018, 4:23:07 AM
to fenics-support
I suppose I should clarify: MUMPS is statically compiled, and it is version 5.1.1, which PETSc 3.8.3 downloads from its mirror....

Evan Cummings

Jul 20, 2018, 3:23:32 AM
to fenics-support
Update:

After changing the mesh of the problem and increasing the number of MPI processes from 2 to 4, I get an error after 1012 identical iterations:

PBLAS ERROR 'Illegal IY = -202796224, IY must be at least 1'
from {0,1}, pnum=1, Contxt=0, in routine 'PDSWAP'.

PBLAS ERROR 'Parameter number 8 had an illegal value'
from {0,1}, pnum=1, Contxt=0, in routine 'PDSWAP'.

{0,1}, pnum=1, Contxt=0, killed other procs, exiting with error #-8.

application called MPI_Abort(MPI_COMM_WORLD, -8) - process 1
PBLAS ERROR 'Illegal IY = -1383410880, IY must be at least 1'
from {0,2}, pnum=2, Contxt=0, in routine 'PDSWAP'.

PBLAS ERROR 'Parameter number 8 had an illegal value'
from {0,2}, pnum=2, Contxt=0, in routine 'PDSWAP'.

{0,2}, pnum=2, Contxt=0, killed other procs, exiting with error #-8.

application called MPI_Abort(MPI_COMM_WORLD, -8) - process 2
PBLAS ERROR 'Array subscript out of bounds: IY = 1108869952, DESCY[M_] = 2034'
from {0,3}, pnum=3, Contxt=0, in routine 'PDSWAP'.

PBLAS ERROR 'Parameter number 8 had an illegal value'
from {0,3}, pnum=3, Contxt=0, in routine 'PDSWAP'.

{0,3}, pnum=3, Contxt=0, killed other procs, exiting with error #-8.

application called MPI_Abort(MPI_COMM_WORLD, -8) - process 3


It looks like there is a problem with PBLAS; I have linked a self-compiled OpenBLAS version to all the DOLFIN dependencies...  
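
One cheap experiment, since the OpenBLAS here is self-compiled (this is a guess, not something established above): OpenBLAS spawns its own worker threads by default, and a threaded BLAS inside MPI solvers is a known source of trouble, so pinning it to one thread would rule that out.

# hedged experiment: force single-threaded BLAS. These variables must be
# set before the BLAS library is loaded, so this belongs at the very top
# of the driver script (here, EISMINT_II_A.py).
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS']      = '1'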
