Fatal error in PMPI_Bcast: Other MPI error, error stack:


Ania Szumska

Oct 10, 2017, 6:16:32 AM10/10/17
to dirac-users
Hi, 

I have a problem with calculating excitation energies. I have calculated many molecules, but for some I get an error at the beginning of the first iteration:

########## START MICROITERATION NO.   1 ##########   Mon Oct  9 15:22:23 2017

SCR        scr.thr.    Step1    Step2  Coulomb  Exchange   WALL-time
rank 1 in job 1  cx1-101-13-3_36672   caused collective abort of all ranks
  exit status of rank 1: return code 1 

 ====  below this line is the stderr stream  ====
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2434)........: MPI_Bcast(buf=0x2b37f7a35028, count=285700608, dtype=0x4c000829, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1807)...: 
MPIR_Bcast(1835)........: 
I_MPIR_Bcast_intra(2016): Failure during collective
MPIR_Bcast_intra(1596)..: 
MPIR_Bcast_binomial(256): message sizes do not match across processes in the collective routine: Received -32766 but expected -2009362432
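As an aside on the error stack above: the negative "expected" size is consistent with a 32-bit integer overflow of the message size in bytes. Assuming 8-byte elements (an assumption; the dtype handle 0x4c000829 is not decoded here), a broadcast of count=285700608 elements exceeds the 2 GiB limit of a signed 32-bit byte count and wraps to exactly the value printed in the log. A minimal sketch of the arithmetic:

```python
# Reproduce the wrapped message size from the error stack above,
# assuming 8 bytes per element (the size of dtype 0x4c000829 is
# an assumption here, not taken from the log).
count = 285700608                 # count= from the MPI_Bcast call
size_bytes = count * 8            # true message size in bytes
INT32 = 2**31
# reinterpret the byte count as a signed 32-bit integer
wrapped = (size_bytes + INT32) % 2**32 - INT32
print(size_bytes, wrapped)        # prints: 2285604864 -2009362432
```

The wrapped value matches the "expected -2009362432" in the error message, which points to a message size overflowing 32-bit arithmetic rather than to a shortage of memory.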

I run jobs using cluster scripts:

#PBS -l walltime=71:58:02
#PBS -l select=1:ncpus=4:mem=68gb
module load dirac/15-alt intel-suite/2015 mpi
pam-dirac --mw=2000 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=***.xyz --inp=***.inp


Please find two output files in the attachment. One is from a molecule that was calculated successfully; the other contains the error. There is only a small difference between them (three additional CH3 groups).
I would appreciate your help. Has any of you had a similar problem? I tried increasing the memory and the number of nodes, but nothing solved the problem.

All the best,
Anna 

BTSeT2_C1_I2.out
BTSeT2_I1.out

Radovan Bast

Oct 14, 2017, 11:17:44 AM10/14/17
to dirac...@googlegroups.com
dear Anna,

the error suggests that increasing memory will not help and that increasing the number of nodes might only mask it.
it is probably a problem in the configuration or a bug in the code.

i suggest two things:
1) find out whether the problem is reproducible: does it crash reliably? in other words, submit it a few times and see whether all calculations stop at the same place.
2) if yes, are you able to simplify the test case while still reproducing the error?

best regards,
  radovan

--
You received this message because you are subscribed to the Google Groups "dirac-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dirac-users+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/dirac-users.
For more options, visit https://groups.google.com/d/optout.

Ilias Miroslav, doc. RNDr., PhD.

Oct 14, 2017, 1:17:10 PM10/14/17
to dirac...@googlegroups.com


Hi Anna,


it might be an improper memory setting: for the failing response module, save DFPCMO, increase the node memory to its maximum value,

and increase the --aw parameter to close to the memory limit divided by the number of threads,

because in your calculations aw (or ag) is only at the default value, which is 16gb. This may not be enough for your larger system.

I wrote this tutorial for relCC (DIRAC development version); maybe I should extend it to linear response calculations as well.

Ah, see in your output:

 Peak memory usage    (Mb) :     15289.00  <---- this is per thread
       reached at subroutine : dftgrd_+0x89e

This is close to 16GB! Therefore, for your larger system you must set aw to a higher value.
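The suggested arithmetic ("memory limit divided by the number of threads") can be sketched as follows. The node memory and process count are taken from the job script earlier in this thread; that --aw is given in megawords of 8 bytes is an assumption here and should be checked against the pam help output:

```python
# Rough per-process memory budget for the --aw setting:
# node memory limit divided by the number of MPI processes,
# expressed in megawords (1 word = 8 bytes, the unit assumed here).
mem_per_node_gb = 68        # from "#PBS -l select=1:ncpus=4:mem=68gb"
nprocs = 4                  # ncpus=4 in the same job script
bytes_per_word = 8
megawords = mem_per_node_gb * 10**9 // nprocs // bytes_per_word // 10**6
print(megawords)            # prints: 2125
```

So with 68gb shared by 4 processes, an --aw around 2000 megawords would sit close to the per-process limit, well above the --mw=2000 used so far with aw left at its default.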


M.






From: dirac...@googlegroups.com <dirac...@googlegroups.com> on behalf of Radovan Bast <radova...@gmail.com>
Sent: 14 October 2017 17:17
To: dirac...@googlegroups.com
Subject: Re: [dirac-users] Fatal error in PMPI_Bcast: Other MPI error, error stack:

Ania Szumska

Nov 3, 2017, 11:25:14 AM11/3/17
to dirac-users
Dear Radovan and Miroslav,

Thank you very much for your feedback. In order to use more memory I need DIRAC installed with the --int64 flag, am I right?

Best regards,
Anna 

Radovan Bast

Nov 6, 2017, 4:15:06 PM11/6/17
to dirac...@googlegroups.com
dear Anna,

then yes, but then you also need an int64-enabled MPI installation, and the same goes for the math libraries.

good luck!
  radovan


Ania Szumska

Jan 3, 2018, 8:36:45 AM1/3/18
to dirac-users
Dear Radovan, 

Even with DIRAC built with int64 I still get the same error. I prepared two minimal example inputs; the only difference between them is replacing sulphur with selenium. Both structures were optimised using Gaussian. In the sulphur version everything is fine; in the selenium case I get a memory error at the beginning of the excited-state calculations. I'm wondering whether something is wrong with my machine. Could you try to reproduce these outputs on your computer? I would be very grateful. I get this error for a few of the bigger molecules that I need. If you have any suggestions for what else I could change in the script, I would be happy to give it a try.

I'm using a university cluster, with run scripts as follows:

#PBS -l walltime=71:58:02
#PBS -l select=1:ncpus=4:mem=96gb

module load dirac/16 intel-suite/2015 mpi

pam-dirac --mw=2400 --aw=3200 --mpi=4 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=BTSeT2_C1_I2.xyz --inp=BTSeT2_C1_I2_minimal.inp

or 

pam-dirac --mw=2400 --aw=3200 --mpi=4 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=BTT2_C1_I2.xyz --inp=BTT2_C1_I2_minimal.inp

Best wishes,
Ania
BTSeT2_C1_I2.xyz
BTSeT2_C1_I2_minimal.inp
BTSeT2_C1_I2_minimal_BTSeT2_C1_I2.out
BTT2_C1_I2.xyz
BTT2_C1_I2_minimal.inp
BTT2_C1_I2_minimal_BTT2_C1_I2.out

Radovan Bast

Jan 5, 2018, 7:03:39 AM1/5/18
to dirac...@googlegroups.com
dear Ania,

is there any chance to expose the same problem with a smaller test case? smaller basis, "bad" thresholds, fewer iterations, ...

i am asking because the prospect of debugging a problem that takes 70 hours on 4 cores is not very attractive :-)

it might not be possible to reduce the system while keeping the problem present, but we should really try. i know it is a lot of effort.

best regards,
  radovan



Ania Szumska

Jan 5, 2018, 7:31:26 AM1/5/18
to dirac-users
I will do my best and let you know.

Anna 