Fatal error in PMPI_Bcast: Other MPI error, error stack:


Ania Szumska

Oct 10, 2017, 6:16:32 AM10/10/17
to dirac-users
Hi, 

I have a problem with calculating excitation energies. I have calculated many molecules, but for some I get an error at the beginning of the first iteration:

########## START MICROITERATION NO.   1 ##########   Mon Oct  9 15:22:23 2017

SCR        scr.thr.    Step1    Step2  Coulomb  Exchange   WALL-time
rank 1 in job 1  cx1-101-13-3_36672   caused collective abort of all ranks
  exit status of rank 1: return code 1 

 ====  below this line is the stderr stream  ====
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2434)........: MPI_Bcast(buf=0x2b37f7a35028, count=285700608, dtype=0x4c000829, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1807)...: 
MPIR_Bcast(1835)........: 
I_MPIR_Bcast_intra(2016): Failure during collective
MPIR_Bcast_intra(1596)..: 
MPIR_Bcast_binomial(256): message sizes do not match across processes in the collective routine: Received -32766 but expected -2009362432
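As an aside on the error stack above: the negative "expected" size is consistent with a 32-bit integer overflow of the message size in bytes. Assuming 8-byte elements (an assumption; the dtype handle 0x4c000829 is not decoded here), a broadcast of count=285700608 elements exceeds the 2 GiB limit of a signed 32-bit byte count and wraps to exactly the value printed in the log. A minimal sketch of the arithmetic:

```python
# Reproduce the wrapped message size from the error stack above,
# assuming 8 bytes per element (the size of dtype 0x4c000829 is
# an assumption here, not taken from the log).
count = 285700608                 # count= from the MPI_Bcast call
size_bytes = count * 8            # true message size in bytes
INT32 = 2**31
# reinterpret the byte count as a signed 32-bit integer
wrapped = (size_bytes + INT32) % 2**32 - INT32
print(size_bytes, wrapped)        # prints: 2285604864 -2009362432
```

The wrapped value matches the "expected -2009362432" in the error message, which points to a message size overflowing 32-bit arithmetic rather than to a shortage of memory.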

I run jobs using cluster scripts:

#PBS -l walltime=71:58:02
#PBS -l select=1:ncpus=4:mem=68gb
module load dirac/15-alt intel-suite/2015 mpi
pam-dirac --mw=2000 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=***.xyz --inp=***.inp


Please find two output files in the attachment. One is from a molecule that was calculated successfully; the other contains the error. There is only a small difference between them (three additional CH3 groups).
I would appreciate your help. Has any of you had a similar problem? I tried increasing the memory and the number of nodes, but nothing solved the problem.

All the best,
Anna 

BTSeT2_C1_I2.out
BTSeT2_I1.out

Radovan Bast

Oct 14, 2017, 11:17:44 AM10/14/17
to dirac...@googlegroups.com
dear Anna,

the error suggests that increasing memory will not help and that increasing the number of nodes might only mask it.
it is probably a problem in the configuration or a bug in the code.

i suggest two things:
1) find out whether the problem is reproducible: does it crash reliably? in other words, submit it a few times and see whether all calculations stop at the same place.
2) if yes, are you able to simplify the test case while still reproducing the error?

best regards,
  radovan

--
You received this message because you are subscribed to the Google Groups "dirac-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dirac-users+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/dirac-users.
For more options, visit https://groups.google.com/d/optout.

Ilias Miroslav, doc. RNDr., PhD.

Oct 14, 2017, 1:17:10 PM10/14/17
to dirac...@googlegroups.com


Hi Anna,


it might be an improper memory setting: for the failing response module, save DFPCMO, increase the node memory to its maximum value,

and increase the --aw parameter to close to the memory limit divided by the number of threads,

because in your calculations aw (or ag) is only at the default value, which is 16gb. This may not be enough for your larger system.

I wrote this tutorial for relCC (DIRAC development version); maybe I should extend it to linear response calculations as well.

Ah, see in your output:

 Peak memory usage    (Mb) :     15289.00  <---- this is per thread
       reached at subroutine : dftgrd_+0x89e

This is close to 16GB! Therefore, for your larger system you must set aw to a higher value.
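The suggested arithmetic ("memory limit divided by the number of threads") can be sketched as follows. The node memory and process count are taken from the job script earlier in this thread; that --aw is given in megawords of 8 bytes is an assumption here and should be checked against the pam help output:

```python
# Rough per-process memory budget for the --aw setting:
# node memory limit divided by the number of MPI processes,
# expressed in megawords (1 word = 8 bytes, the unit assumed here).
mem_per_node_gb = 68        # from "#PBS -l select=1:ncpus=4:mem=68gb"
nprocs = 4                  # ncpus=4 in the same job script
bytes_per_word = 8
megawords = mem_per_node_gb * 10**9 // nprocs // bytes_per_word // 10**6
print(megawords)            # prints: 2125
```

So with 68gb shared by 4 processes, an --aw around 2000 megawords would sit close to the per-process limit, well above the --mw=2000 used so far with aw left at its default.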


M.






From: dirac...@googlegroups.com <dirac...@googlegroups.com> on behalf of Radovan Bast <radova...@gmail.com>
Sent: 14 October 2017 17:17
To: dirac...@googlegroups.com
Subject: Re: [dirac-users] Fatal error in PMPI_Bcast: Other MPI error, error stack:

Ania Szumska

Nov 3, 2017, 11:25:14 AM11/3/17
to dirac-users
Dear Radovan and Miroslav,

Thank you very much for your feedback. In order to use more memory I need DIRAC installed with the --int64 flag, am I right?

Best regards,
Anna 

Radovan Bast

Nov 6, 2017, 4:15:06 PM11/6/17
to dirac...@googlegroups.com
dear Anna,

then yes, but then you also need an int64-enabled MPI installation, and the same goes for the math libraries.

good luck!
  radovan


Ania Szumska

Jan 3, 2018, 8:36:45 AM1/3/18
to dirac-users
Dear Radovan, 

Even with DIRAC built with int64 I still get the same error. I prepared two minimal example inputs; the only difference between them is replacing sulphur with selenium. Both structures were optimised using Gaussian. In the sulphur version everything is fine; in the selenium case I get a memory error at the beginning of the excited-state calculations. I'm wondering whether something is wrong with my machine. Could you try to reproduce these outputs on your computer? I would be very grateful. I get this error for a few of the bigger molecules that I need. If you have any suggestions for what else I could change in the script, I would be happy to give it a try.

I'm using a university cluster, with run scripts as follows:

#PBS -l walltime=71:58:02
#PBS -l select=1:ncpus=4:mem=96gb

module load dirac/16 intel-suite/2015 mpi

pam-dirac --mw=2400 --aw=3200 --mpi=4 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=BTSeT2_C1_I2.xyz --inp=BTSeT2_C1_I2_minimal.inp

or 

pam-dirac --mw=2400 --aw=3200 --mpi=4 --scratch=$TMPDIR --machfile=$PBS_NODEFILE --mol=BTT2_C1_I2.xyz --inp=BTT2_C1_I2_minimal.inp

Best wishes,
Ania
BTSeT2_C1_I2.xyz
BTSeT2_C1_I2_minimal.inp
BTSeT2_C1_I2_minimal_BTSeT2_C1_I2.out
BTT2_C1_I2.xyz
BTT2_C1_I2_minimal.inp
BTT2_C1_I2_minimal_BTT2_C1_I2.out

Radovan Bast

Jan 5, 2018, 7:03:39 AM1/5/18
to dirac...@googlegroups.com
dear Ania,

is there any chance to expose the same problem with a smaller test case? smaller basis, "bad" thresholds, fewer iterations, ...

i am asking because the prospect of debugging a problem that takes 70 hours on 4 cores is not very attractive :-)

it might not be possible to reduce the system while keeping the problem present, but we should really try. i know it is a lot of effort.

best regards,
  radovan



Ania Szumska

Jan 5, 2018, 7:31:26 AM1/5/18
to dirac-users
I will do my best and let you know.

Anna 