MPI Issues?


Claudia M

Aug 29, 2023, 10:41:37 AM
to ADDA questions and answers
Hello,

I have been running ADDA in parallel on my university's supercomputer with no problems. Recently, I was given priority access to an older part of the supercomputer to speed up my queuing time, but now I am running into issues. I run ADDA in exactly the same way as I did on the newer machines, but I cannot figure out whether the issue is on the ADDA side, the MPI side, or both.

I normally use "mpirun ./adda_mpi ...." with all of the particle parameters in a script called myhost.sh and then run it from the command line with "sbatch .... ./myhost.sh" with all the time and node constraints.
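Roughly, the setup looks like this (the resource requests and ADDA options below are just placeholders, not my actual values):

    #!/bin/bash
    # myhost.sh -- placeholder parameters, the real command line is longer
    mpirun ./adda_mpi -lambda 0.569 -grid 140 -eps 5 ...

    # submitted from the command line with, e.g.:
    sbatch --time=02:00:00 --nodes=4 --ntasks-per-node=32 ./myhost.sh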

The error I'm now getting is multiple lines that say:
"INFO: (ringID=100) No real dipoles are assigned" and then it goes into the calculations with:

"box dimensions: 140x121x140
lambda: 0.569   Dipoles/lambda: 10.6939
Required relative residual norm: 1e-05
Total number of occupied dipoles: 1779680
Memory usage for MatVec matrices (per processor): 25.1 MB
Calculating Green's function (Dmatrix)
Fourier transform of Dmatrix
[elf37:1172628:0:1172628]    ud_iface.c:779  Fatal: transport error: Endpoint timeout"

Then, it ends with:
"mpirun noticed that process rank 93 with PID 3989517 on node elf35 exited on signal 6 (Aborted)."

I have a feeling it has something to do with the versions of everything but I'm not sure. I can send the whole error output as an attachment, too, if that helps.

Thanks so much,
Claudia Morello




Maxim Yurkin

Aug 29, 2023, 4:56:32 PM
to adda-d...@googlegroups.com
Dear Claudia,

I am not sure about the particular problem, but it may be related to different versions of MPI or other libraries being used
for compilation and at runtime. Do you run the same ADDA binary on different parts of the supercomputer, i.e., do you choose
the required part only when running sbatch? If yes, then a simple solution may be to compile ADDA anew for the older part
(its versions of the operating system, compiler, and MPI). In my experience, the most common difference is in the MPI,
since such clusters often have many versions installed simultaneously. You can probably recompile ADDA on the login node,
but first enable the correct MPI version (e.g., using modules, if they are used on your cluster) - the same one as used
on the target nodes (where you plan to run ADDA). The same can probably be done for gcc or whatever compiler you are
using.
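For example, something along these lines on the login node (the module names are only an illustration - they depend on your cluster):

    module purge
    module load gcc/<version> openmpi/<version>   # same versions as on the older nodes
    cd adda/src
    make mpi                                      # rebuilds adda_mpi against that MPI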

A more complicated issue arises if the OS version is different. Sometimes it doesn't affect ADDA, but sometimes you will need
to compile ADDA under the same OS. Since you most probably cannot change the login node, you will need to compile it
through sbatch: either run a separate job to compile ADDA before your standard ones (production runs), or add the
compilation command to one of the production runs (this incurs some waste of computer time, but is quick to test), as sketched below.
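A rough sketch of the second option (again, module names and resource requests are placeholders):

    #!/bin/bash
    #SBATCH --time=02:00:00
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=32
    module load gcc/<version> openmpi/<version>
    make -C adda/src mpi      # compile on (the first of) the allocated older nodes
    mpirun ./adda_mpi ...     # then the usual production run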

Finally, 'adda_mpi -V' may help to get some information on the MPI used for compilation, which can be contrasted against
the one used on the older nodes. So if the above ideas do not solve your issue, please send all the versions in the next
message.
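Something like:

    ./adda_mpi -V                  # version/build information of the binary
    ldd ./adda_mpi | grep -i mpi   # which MPI shared library it actually loads at runtime

(the ldd check is a generic trick, independent of ADDA).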

Maxim.

Claudia M

Aug 29, 2023, 6:17:26 PM
to ADDA questions and answers
Hello again,

Thanks! It was a version mismatch problem!

However, I am getting concerned about the "No real dipoles are assigned" issue, because it is still happening. It seems to happen when I give ADDA too many resources (a high number of CPUs). I ran a couple of small runs with different numbers of cores, and the Mueller matrices do not agree. I have not changed any of the ADDA input parameters, just the number of nodes and tasks per node for MPI. What does "no real dipoles" mean?

Thanks,
Claudia Morello

Maxim Yurkin

Aug 30, 2023, 5:08:29 AM
to adda-d...@googlegroups.com
Since this message is labeled as INFO (not ERROR or WARNING), it should have no effect on the simulation results. The
particle is divided over the processors (cores) by slices perpendicular to the z-axis, see Section 6.6 "Partition over
processors in parallel mode" of the manual. So when the number of processors is larger than the number of slices, you
will get such a message. Alternatively, one or two slices may be assigned to a processor, but they may contain no
dipoles due to the complicated shape of the particle (with voids, or some aggregate). Real dipoles are those occupied by
the particle, in contrast to the voxels of the circumscribing rectangular grid, all of which are used for the FFT-accelerated
matrix-vector multiplication. Overall, this message indicates that parallel performance is not that good, i.e., the
computational time does not decrease proportionally to the number of employed cores. So it is recommended to decrease the
number of cores per run (to have at least 1 or 2 slices per core) and run several independent runs instead (this can
usually be done easily with the batch system).
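As a rough illustration with the numbers from your first message: the grid is 140x121x140, so there are at most 140 slices along z. Requesting, say, 192 MPI processes then guarantees that some of them get no dipoles, while about 70 processes would give every core at least two slices (the exact partition may differ a bit, since the grid along z may be adjusted internally).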

However, you also mention that Mueller matrices do not agree - that is definitely unexpected (and worrisome), if the
difference . Can you please share the whole output directories of these two runs (including both log and mueller files)?

Maxim.

Maxim Yurkin

Aug 30, 2023, 5:21:23 AM
to adda-d...@googlegroups.com
I just realized that I haven't finished writing the sentence. I meant that the difference in Mueller matrices is
worrisome if it is significant (not on the order of the threshold for the convergence of the iterative solver, which is 1e-5
by default). Actually, the number of cores may slightly affect the discretization of the particle, but let's look at the
log files first.

Claudia M

Aug 30, 2023, 9:38:44 AM
to ADDA questions and answers
Hello,

Ah okay that makes sense! The differences are smaller than the threshold. I won't worry about it too much then.

Thanks again,
Claudia Morello