Multinode jobs have poor scaling


Nathan Keilbart

Apr 11, 2023, 2:19:29 PM4/11/23
to cp2k
Hello,

I recently finished compiling on an Intel-based HPC machine and appear to have a working binary. I initially tested a case as a single-node job and got what appear to be relatively good times. Upon increasing to two nodes, I actually saw an increase in time per step instead of a decrease. The nodes are connected with InfiniBand, I believe, so there shouldn't be an issue with node-to-node communication. I'm wondering if I set some flag wrong when compiling, and what I should look into to find out what's going on here. Let me know what kind of information I can provide. Thanks

Nathan

Eric Patterson

Apr 11, 2023, 2:31:14 PM4/11/23
to cp...@googlegroups.com
Hello Nathan,

What are your settings for MPI processes and OMP threads? On the machine I’m using (an older Intel machine with 48 physical cores per node and Omni-Path interconnect), I found good multi-node performance with 12 MPI processes per node and 4 OMP threads per process. Assigning all cores to MPI was nearly 3x slower and often resulted in memory issues. I did quite a bit of testing to come up with this configuration.

I imagine this could depend heavily on the type of job (mine are periodic cell optimizations and vibrational analysis, nothing fancy), so I definitely recommend doing some testing to see what works for you.
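For what it's worth, here's a minimal sketch of how such a split could be requested under Slurm (the scheduler, the `cp2k.psmp` binary name, and the input/output file names are assumptions; adjust for your system):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12   # 12 MPI ranks per node
#SBATCH --cpus-per-task=4      # 4 OMP threads per rank (12 x 4 = 48 cores)

# Give each MPI rank its OMP thread count from the Slurm allocation
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun cp2k.psmp -i job.inp -o job.out
```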

Cheers,
Eric



Nathan Keilbart

Apr 12, 2023, 6:17:27 PM4/12/23
to cp2k
Hi Eric,

Thanks for the response. I went ahead and tested this as you suggested. I have nodes with 56 processors each, and I tested two nodes, comparing against the time per step I'm getting for one node. I tested the following settings:

1 MPI 112 Threads
2 MPI 56 Threads
4 MPI 28 Threads
8 MPI 14 Threads
16 MPI 7 Threads

I end up getting the best performance at 4 MPI processes and above, but this is still slower than simply using one node with 56 MPI processes. Thanks for the suggestion though.
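(Each of those splits fills the 112 cores across the two 56-core nodes; a quick sketch to enumerate them:)

```shell
# enumerate MPI-rank / OMP-thread splits that fill 112 cores (2 nodes x 56)
TOTAL=112
for ranks in 1 2 4 8 16; do
  threads=$((TOTAL / ranks))
  echo "${ranks} MPI x ${threads} threads"
done
```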

I'm not an experienced CP2K user; I'm just using an input file my colleague gave me to test out the installation I'm helping with. I would expect that a system of 100 water molecules and a single Au atom would still see some acceleration from using two nodes, but I might be wrong. Any other thoughts?

Eric Patterson

Apr 14, 2023, 11:18:19 AM4/14/23
to cp...@googlegroups.com
Hi Nathan,

Are those the total processes/threads running across two nodes? If so, maybe your interconnect is not as good as it should be. If that's what you're requesting per node, then I would suggest cutting the threads in half so you're not hyperthreading.
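One quick way to check whether those counts are physical cores or hyperthreads (assuming Linux compute nodes with `lscpu` available):

```shell
# Thread(s) per core = 2 means hyperthreading is on;
# physical cores = Socket(s) x Core(s) per socket
lscpu | grep -E '^(Socket|Core|Thread)'
```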

I honestly have very little experience with MPI codes. CP2K is the first code I’ve used where MPI makes sense to use. I’m afraid I’m near the end of my useful comments… Perhaps someone else in the group or one of your sysadmin people can help you a bit more?

 - Eric


Nathan Keilbart

Apr 19, 2023, 5:41:55 PM4/19/23
to cp2k
Yeah, if no one else has any suggestions here I'll try reaching out to my system admin people to see if there's some bottleneck on our end. I've built other codes (VASP, QE, etc.) that don't show these limitations, so I'm wondering if it's a setting I used when installing.