Inquiry about MPI usage

Thea

Jun 29, 2025, 11:44:26 PM

to adda-d...@googlegroups.com

Dear ADDA Research Team,

How should the value of N in mpiexec -n <N> be chosen? I briefly compared different numbers of processes for several particle sizes on a 64-core instance and found that the calculation speed still differs between them. In the attached data table, the results for N greater than 28 and less than 60 are missing because those runs failed with a segmentation fault (caused by accessing an unmapped memory address), possibly because N exceeds nx. Is there a rule of thumb for choosing the N that gives the fastest calculation?

I look forward to your reply.

Best regards,

Thea

Maxim Yurkin

Jun 30, 2025, 11:30:56 AM

to adda-d...@googlegroups.com

Thea, please look at this answer: https://groups.google.com/g/adda-discuss/c/fVKbRjMxvak/m/WROW2iQxAwAJ (and the whole
thread around it), as well as https://groups.google.com/g/adda-discuss/c/1aaShDx0u30 . In principle, the more nodes you
use, the lower the parallel efficiency. This is especially problematic for shared-memory multi-core systems. First, ADDA
performs unnecessary memory exchange in this case, see https://github.com/adda-team/adda/issues/38 . This can be
circumvented by running several ADDA instances, each using one or several cores, if memory is not an issue. Second,
I've noticed that memory bandwidth may become a bottleneck (even if the above issue is solved); then it is hard to use
all cores efficiently with any DDA code. Additionally, the best values of N are the ones that do not have large prime
factors and are divisors of both nz and 2*nx (or numbers slightly larger). So, in your case I would test N equal to 30
and 45 (for the largest size).
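
To make the divisor rule concrete, here is a minimal Python sketch that lists the values of N up to the number of cores that divide both nz and 2*nx and have no large prime factors. The function names, the threshold of 7 for a "large" prime factor, and the example grid of nx = nz = 90 are assumptions made for illustration, not values taken from ADDA or from the attached table. The "numbers slightly larger" case is not covered.

```python
from math import gcd


def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (1 for n == 1)."""
    largest, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            largest, n = p, n // p
        p += 1
    return n if n > 1 else largest


def candidate_process_counts(nx: int, nz: int, max_procs: int, max_prime: int = 7) -> list[int]:
    """List N <= max_procs that divide both nz and 2*nx exactly.

    max_prime is an assumed cutoff for what counts as a "large" prime factor.
    """
    common = gcd(nz, 2 * nx)  # N divides both numbers iff it divides their gcd
    return [
        n
        for n in range(1, max_procs + 1)
        if common % n == 0 and largest_prime_factor(n) <= max_prime
    ]


if __name__ == "__main__":
    # Hypothetical grid: nx = nz = 90 on a 64-core machine.
    print(candidate_process_counts(nx=90, nz=90, max_procs=64))
    # prints [1, 2, 3, 5, 6, 9, 10, 15, 18, 30, 45]
```

The list only narrows down which N are worth timing; the fastest one still has to be found by measurement.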

Coming back to the second issue above, modern processors are used efficiently if a lot of dipoles are assigned to each of
them. This number cannot be smaller than one full slice, but if the slice is not that wide, one slice per processor would
also be very inefficient (that is what your test also indicates). For your relatively small particles (but still requiring
significant simulation time due to, e.g., orientation averaging), using only several cores can be the best option. The
above idea of several independent runs in parallel can be useful for systematic parameter sweeps.
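
The idea of several independent runs can be sketched in Python roughly as follows. This is an illustration under assumptions, not an official ADDA tool: the helper names (build_commands, run_job, run_sweep), the choice of a sweep over particle size, and the output directory names are hypothetical. The executable name adda and the options -size and -dir are assumed to match your build, so check them against the manual.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(sizes: list[float], executable: str = "adda") -> list[list[str]]:
    """Build one command line per size (assumed options: -size and -dir)."""
    return [[executable, "-size", str(s), "-dir", f"run_size_{s}"] for s in sizes]


def run_job(cmd: list[str]) -> int:
    """Run one command to completion, discard its console output, return its exit code."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def run_sweep(commands: list[list[str]], max_parallel: int) -> list[int]:
    """Run the commands with at most max_parallel active at once.

    Each thread only waits for its child process, so threads are sufficient here.
    The returned exit codes are in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_job, commands))
```

For example, run_sweep(build_commands([1.0, 2.0, 4.0]), max_parallel=3) would keep three single-core runs active at once. Whether this beats one MPI run depends on the available memory and memory bandwidth, as discussed above.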

The remaining problematic niche is a single moderately large problem in random orientation, which you want to calculate
as fast as possible. This thread provides some workarounds (and repeats most of the above discussion) -
https://groups.google.com/g/adda-discuss/c/Jqh0Fjmphfc . The participants have actually done such simulations, so maybe
they will be able to provide more details.

Last, but not least, ADDA should run fine for any choice of N, even N>nx. The only immediate suggestion is to
double-check the MPI installation, see this thread: https://groups.google.com/g/adda-discuss/c/TW9uGfzMUME . If that does
not help, please create an issue at https://github.com/adda-team/adda/issues providing as many details as possible (it is
better to have all details for a single failing run than some details for several runs).

Maxim.

Maxim Yurkin

Jul 3, 2025, 5:54:22 AM

to adda-d...@googlegroups.com

I have created issue https://github.com/adda-team/adda/issues/341 for an external script to perform orientation
averaging with ADDA (when an individual problem is only moderately large). It also discusses existing capabilities, but no
ideal solution exists yet.

Maxim.
