"Fatal error in PMPI_Bcast" in the Dirac response_nmr_spin-spin test


JR

Mar 14, 2013, 4:01:12 AM
to dirac...@googlegroups.com
Hi,

We are testing Dirac 12.3 on a new Finnish supercomputer. Running the response_nmr_spin-spin calculation from Dirac's test directory fails with a curious error in the 2nd iteration of the HF calculation:

########## START ITERATION NO.   2 ##########   Wed Feb 20 14:57:27 2013

* GETGAB: label "GABAO1XX" not found; calling GABGEN.
SCR        scr.thr.    Step1    Step2  Coulomb  Exchange   WALL-time
  QM-QM nuclear repulsion energy :      9.055003146800
Application 873572 exit codes: 134
Application 873572 resources: utime ~2s, stime ~0s
 ====  below this line is the stderr stream  ====
Rank 1 [Wed Feb 20 14:57:27 2013] [c3-0c2s12n2] Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1682)......: MPI_Bcast(buf=0x2aaab2596968, count=1, dtype=0x4c000829, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1488).: 
MPIR_Bcast_intra(1247): 
MPIR_SMP_Bcast(1156)..: Failure during collective
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
libc.so.6          00002AAAB0562B55  Unknown               Unknown  Unknown
libc.so.6          00002AAAB0564131  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAEC75F36  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAEC4B2ED  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAEC4B456  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAECDCD90  Unknown               Unknown  Unknown
libmpich_intel.so  00002AAAAEC55F9A  Unknown               Unknown  Unknown
dirac.x            00000000006540A6  interface_to_mpi_         444  interface_to_mpi.F90
dirac.x            0000000000A2B188  hernod_                  1391  herpar.F
dirac.x            00000000004AAB67  dirnod_                  1141  dirac.F
dirac.x            000000000047F5BC  MAIN__                    157  main.F90
dirac.x            000000000047EEEC  Unknown               Unknown  Unknown
libc.so.6          00002AAAB054EC36  Unknown               Unknown  Unknown
dirac.x            000000000047EDE9  Unknown               Unknown  Unknown
_pmiu_daemon(SIGCHLD): [NID 00754] [c3-0c2s12n2] [Wed Feb 20 14:57:27 2013] PE RANK 1 exit signal Aborted
[NID 00754] 2013-02-20 14:57:27 Apid 873572: initiated application termination

The machine is a massively parallel (MPP) supercomputer (Cray XC30 family, Intel Xeon Sandy Bridge processors) running an x86_64 Linux operating system (CNL). The 32-bit-integer version of Dirac 12.3 was built with the Intel compilers (through Cray's wrappers), CMake 2.8.10.2, and Cray's MPI (MPICH2-based). Intel MKL (Composer XE 2013.1.117) was also used; Cray's tools initially messed up the linking (a bug), but tech support advised specifying the link line explicitly:

./setup --fc=ftn --cc=cc --cxx=CC --mpi=on --explicit-libs="-L$MATH_ROOT/lib/intel64 -Wl,-rpath=$MATH_ROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -mkl=sequential" ${BUILDDIR}

after which setup+make went through without problems. I have tried to find a solution to the MPI error together with the machine's tech support staff, but we are still stuck. I'm wondering whether this looks familiar to anyone, or whether someone could say if this is a bug/problem in the MPI installation or something in Dirac that just happens to explode with this particular MPI implementation. Any ideas?


Regards,

Juho

Stefan Knecht

Mar 14, 2013, 4:12:51 AM
to dirac...@googlegroups.com
dear Juho,

indeed a curious crash. what was the command line to run the test? how many processes did you ask for?

the first SCF iteration, when starting from the bare-nucleus guess, does not run in parallel,
so if it is a parallel issue it is "normal" that the crash only shows up in iteration 2.

with best regards,

stefan


-- 

Stefan Knecht
Laboratory of Physical Chemistry - The Reiher Group -
ETH Zuerich
HCIG 230
Wolfgang-Pauli-Str. 10
CH-8093 Zuerich
Schweiz

phone: +41 44 633 22 19   fax: +41 44 633 15 94
email: stefan...@phys.chem.ethz.ch
web: http://www.theochem.uni-duesseldorf.de/users/stefan/index.htm
     http://www.reiher.ethz.ch/people/knechste

JR

Mar 14, 2013, 4:38:34 AM
to dirac...@googlegroups.com
Dear Stefan,

I have tried with a varying number of processes, both within one node and across multiple nodes, always with the same result. The calculation that produced the output above used just 2 processes on a single node, i.e., a minimal parallel test. The pam command was

pam --mb=500 --scratch=${SCRATCH} --basis=${BASISDIR} --inp=bss_rkb+mfsso.spsp.inp --mol=H2O.mol --mpi=2

where the SCRATCH variable points to a shared scratch directory visible to all nodes, and BASISDIR to the Dirac basis-set directories. The pam script was modified to use 'aprun' instead of 'mpirun', but otherwise it is stock.


Regards,

Juho

Stefan Knecht

Mar 14, 2013, 5:21:21 AM
to dirac...@googlegroups.com
dear Juho,

On 14/03/13 09.38, JR wrote:
> Dear Stefan,
>
> I have tried with a varying number of processes, both within one node and across multiple nodes, always with the same result. The calculation that produced the output above used just 2 processes on a single node, i.e., a minimal parallel test. The pam command was
>
> pam --mb=500 --scratch=${SCRATCH} --basis=${BASISDIR} --inp=bss_rkb+mfsso.spsp.inp --mol=H2O.mol --mpi=2
>
> where the SCRATCH variable points to a shared scratch directory visible to all nodes, and BASISDIR to the Dirac basis-set directories. The pam script was modified to use 'aprun' instead of 'mpirun', but otherwise it is stock.
looks all fine to me. can you send me the output of the crashing test and (if possible) the output of your setup line? maybe that gives me a hint:
 
$  ./setup --fc=ftn --cc=cc --cxx=CC --mpi=on --explicit-libs="-L$MATH_ROOT/lib/intel64 -Wl,-rpath=$MATH_ROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -mkl=sequential" ${BUILDDIR}

i guess you made sure that no other MPI library could accidentally be loaded in place of the one you expect? these things can happen...
i don't have much experience with Cray machines (no access), so i certainly haven't tested Cray's MPICH2 myself, but others on the users list might have.

@all: Anyone running Dirac on Cray?


However, Intel MPI works fine with Dirac, and it is also a derivative of MPICH2.

with best regards,

stefan

JR

Mar 14, 2013, 6:16:37 AM
to dirac...@googlegroups.com
Dear Stefan,

I will attach the test calculation output here, and paste the (much shorter) setup output below:

-- User set math libraries: -L/opt/intel/composer_xe_2013.1.117/mkl//lib/intel64 -Wl,-rpath=/opt/intel/composer_xe_2013.1.117/mkl//lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -mkl=sequential
-- mpi.mod matches current compiler, setting -DUSE_MPI_MOD_F90
-- MPI-2 support found
CMake Warning:
  Manually-specified variables were not used by the project:
    ENABLE_BLAS
    ENABLE_LAPACK

 FC=ftn CC=cc CXX=CC cmake -DENABLE_MPI=ON -DENABLE_SGI_MPT=OFF -DENABLE_OPENMP=OFF -DENABLE_BLAS=ON -DENABLE_LAPACK=ON -DENABLE_TESTS=OFF -DENABLE_64BIT_INTEGERS=OFF -DEXPLICIT_LIBS="-L/opt/intel/composer_xe_2013.1.117/mkl//lib/intel64 -Wl,-rpath=/opt/intel/composer_xe_2013.1.117/mkl//lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -mkl=sequential" -DCMAKE_BUILD_TYPE=Release /homeappl/home/juhorouk/appl_sisu/DIRAC-12.3
-- The Fortran compiler identification is Intel
-- The C compiler identification is Intel 13.0.0.20121010
-- The CXX compiler identification is Intel 13.0.0.20121010
-- Check for working Fortran compiler: /opt/cray/craype/1.00/bin/ftn
-- Check for working Fortran compiler: /opt/cray/craype/1.00/bin/ftn  -- works
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Checking whether /opt/cray/craype/1.00/bin/ftn supports Fortran 90
-- Checking whether /opt/cray/craype/1.00/bin/ftn supports Fortran 90 -- yes
-- Check for working C compiler: /opt/cray/craype/1.00/bin/cc
-- Check for working C compiler: /opt/cray/craype/1.00/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/craype/1.00/bin/CC
-- Check for working CXX compiler: /opt/cray/craype/1.00/bin/CC -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found Git: /usr/bin/git  
-- Found MPI_C: /opt/cray/craype/1.00/bin/cc  
-- Found MPI_CXX: /opt/cray/craype/1.00/bin/CC  
-- Found MPI_Fortran: /opt/cray/craype/1.00/bin/ftn  
-- Performing Test MPI_COMPATIBLE
-- Performing Test MPI_COMPATIBLE - Success
-- Performing Test MPI_COMPILER_MATCHES
-- Performing Test MPI_COMPILER_MATCHES - Success
-- Performing Test MPI_ITYPE_MATCHES
-- Performing Test MPI_ITYPE_MATCHES - Success
-- Performing Test MPI_2_COMPATIBLE
-- Performing Test MPI_2_COMPATIBLE - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /homeappl/home/juhorouk/appl_sisu/DIRAC-12.3/sisu-intel-mklseq
   configure step is done
   now you need to compile the sources

The MPI library is loaded, like everything else, via a module system that ensures no conflicting software can be loaded (e.g., only one MPI module can be loaded at a time). In addition, I believe Cray's MPI is the only option on this machine, so there should be no danger of interfering MPI libraries lying around, as is the case on many other systems.


Regards,

Juho
bss_rkb+mfsso.spsp_H2O.out.txt

Stefan Knecht

Mar 14, 2013, 8:08:59 AM
to dirac...@googlegroups.com
dear Juho,

thanks for the extra information. actually, i can't see any obvious problem.
could you try the following:
- modify line 39 of cmake/parallel-environment/ConfigParallelEnvironment.cmake:
old: add_definitions(-DUSE_MPI_MOD_F90)
new: # add_definitions(-DUSE_MPI_MOD_F90)
- rerun ./setup and recompile


this will turn off the mpi-fortran90 interface. maybe something goes wrong there. just a guess.
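
for illustration only -- this is not DIRAC's actual interface_to_mpi.F90, and the module/routine names in the sketch are made up -- a preprocessor toggle like USE_MPI_MOD_F90 typically switches between the Fortran 90 "mpi" module and the legacy mpif.h include, roughly like this:

! sketch_mpi_iface.F90 -- illustration only, not DIRAC code
module sketch_mpi_iface
#ifdef USE_MPI_MOD_F90
   use mpi            ! F90 module: named constants, plus interface checking
                      ! where the MPI library provides explicit interfaces
   implicit none
#else
   implicit none
   include 'mpif.h'   ! legacy include: constants only, no interface checking
#endif
contains
   ! a wrapper in the spirit of such an interface layer
   subroutine sketch_bcast_i1(ibuf, n, root, comm)
      integer, intent(inout) :: ibuf(*)
      integer, intent(in)    :: n, root, comm
      integer :: ierr
      call mpi_bcast(ibuf, n, mpi_integer, root, comm, ierr)
   end subroutine sketch_bcast_i1
end module sketch_mpi_iface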

with best regards,

stefan

JR

Mar 18, 2013, 8:53:56 AM
to dirac...@googlegroups.com
Dear Stefan,

I commented out the line you mentioned, re-ran setup (in a new build directory) and recompiled the program. I still get the same error in the test calculation; should I expect to see something different, even if it is still an error? The stderr output points to the same routines/files as before.


Regards,

Juho

Stefan Knecht

Mar 18, 2013, 8:58:16 AM
to dirac...@googlegroups.com
dear Juho,


On 18/03/13 13.53, JR wrote:
> Dear Stefan,
>
> I commented out the line you mentioned, re-ran setup (in a new build directory) and recompiled the program. I still get the same error in the test calculation; should I expect to see something different, even if it is still an error? The stderr output points to the same routines/files as before.
no, probably not; it was just a "wild guess". bad news, though.
anyway, i got access to a Cray machine in Norway today (thanks to Kenneth :) ), where i will try to compile the code the same way you did and to reproduce your error (i hope it reproduces).
if it does, i can debug it and hopefully come up with a satisfying solution for you (and for potential users on other Cray machines as well).

Stefan Knecht

Mar 18, 2013, 7:11:31 PM
to dirac...@googlegroups.com
dear Juho,

the "good news" is that i can reproduce the crash on our Cray system as well (using Cray's MPICH), which gives me hope of solving this issue by debugging. :)

with best regards,

stefan

Stefan Knecht

Mar 18, 2013, 7:57:32 PM
to dirac...@googlegroups.com
dear Juho,

found the problem, and the bug has been fixed on the release branch; the fix will go into patch v12.6, which will be released soon.
until then, the patch below is all you need: it touches only three lines in one file, src/abacus/herpar.F, more specifically in the subroutine rvinit.

below are the changes: lines starting with "-" are old (to be deleted), lines starting with "+" are new (they fix the bug).

hope this helps.

with best regards,

stefan

diff --git a/src/abacus/herpar.F b/src/abacus/herpar.F
index f6583f9..c8731eb 100644
--- a/src/abacus/herpar.F
+++ b/src/abacus/herpar.F
@@ -1390,7 +1390,7 @@ C
       IF(NFMAT.GT.0) THEN       
         CALL MEMGET('REAL',KFMAT,LFMAT,WORK,KFREE,LFREE)
         CALL MEMGET('INTE',KIRD,NDMAT,WORK,KFREE,LFREE)
-        CAll interface_mpi_BCAST(WORK(KIRD),NDMAT,MPARID,
+        call interface_mpi_bcast_i1_work_f77(WORK(KIRD),NDMAT,MPARID,
      &               global_communicator)
       ELSE
         KFMAT = KFREE
@@ -1400,12 +1400,12 @@ C
         KIFC = KFREE
       ELSEIF (ITYPE.EQ.2) THEN
         CALL MEMGET('INTE',KIFC,NDMAT,WORK,KFREE,LFREE)
-        CAll interface_mpi_BCAST(WORK(KIFC),NDMAT,MPARID,
-     &                 global_communicator)
+        call interface_mpi_bcast_i1_work_f77(WORK(KIFC),NDMAT,MPARID,
+     &               global_communicator)
       ELSE
         CALL MEMGET('INTE',KIFC,NFMAT,WORK,KFREE,LFREE)
-        CAll interface_mpi_BCAST(WORK(KIFC),NFMAT,MPARID,
-     &                  global_communicator)
+        call interface_mpi_bcast_i1_work_f77(WORK(KIFC),NFMAT,MPARID,
+     &               global_communicator)
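
p.s. for the curious: as far as i can tell, the old code handed a slice of the REAL*8 WORK array that actually holds integers to the generic bcast wrapper, so the data was not broadcast as NDMAT default integers. below is a tiny standalone sketch of that pitfall -- it is NOT DIRAC code, and all names in it are made up for illustration:

! sketch_bcast.F90 -- illustration only, not DIRAC code
program sketch_bcast
   use mpi
   implicit none
   integer, parameter :: ndmat = 4
   real(8)  :: work(ndmat)    ! stand-in for a REAL*8 work array
   integer  :: idata(ndmat)   ! the integers that actually live in that slice
   integer  :: ierr, myrank

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, myrank, ierr)

   if (myrank == 0) idata = (/ 11, 22, 33, 44 /)

   ! correct: broadcast ndmat default integers with an integer MPI datatype
   call mpi_bcast(idata, ndmat, mpi_integer, 0, mpi_comm_world, ierr)

   ! the schematic "wrong" variant: resolving on the declared REAL(8) type
   ! would broadcast ndmat reals, i.e. the wrong datatype and the wrong
   ! number of bytes for the integer data stored there
   ! call mpi_bcast(work, ndmat, mpi_double_precision, 0, mpi_comm_world, ierr)

   if (myrank /= 0) print *, 'rank', myrank, 'got', idata
   call mpi_finalize(ierr)
end program sketch_bcast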

JR

Mar 19, 2013, 7:15:51 AM
to dirac...@googlegroups.com
Dear Stefan,

I applied the patch (manually), ran some test calculations and they finished successfully. Thank you for solving the issue so quickly!

Regards,

Juho

Stefan Knecht

Mar 19, 2013, 7:25:59 AM
to dirac...@googlegroups.com
dear Juho,

great, i am glad that it worked. :)

with best regards,

stefan

Radovan Bast

Mar 19, 2013, 9:00:54 AM
to dirac...@googlegroups.com
hi,
i will create the official 12.6 patch probably tomorrow
(i want to wait for one round of nightly testing).
best regards,
  radovan