gromacs mpi-based multiple walkers may require all parallelization to be pushed into mpirun (and out of -ntomp) to get good performance

Chris Neale

Jul 25, 2016, 2:27:37 PM
to PLUMED users
Dear users:

I just want to report some scaling issues with plumed 2.2.3 metadynamics MPI-based multiple walkers in gromacs 5.1.2.

Summary: with MPI-based multiple walkers, there is a huge performance degradation with gromacs option -ntomp not equal to 1 (at least when using GPUs and at least for this test system). The solution is to push all parallelization into mpirun.

Details:

I have a system that runs at 35 ns/day on a single node (12 physical cores with hyperthreading to 24 logical cores + 4 GPUs). When I turn on metadynamics for a single walker, the speed is 29 ns/day (so a slowdown factor of 1.2x, which seems reasonable to me).

If I do file-based multiple walkers, the speed is the same. However, if I do MPI-based multiple walkers, the speed goes down to 8 ns/day per walker (each walker is given its own node, so the speed should be about 29 ns/day). 

The only way I found to get the expected 29 ns/day per walker with MPI-based multiple walkers is to push all of the parallelization into MPI and out of OpenMP.

Specifically, this is slow:

mpirun -npernode 4 /home/cneale/exec/GROMACS/exec/gromacs-5.1.2_plumed-2-2.3/gpu_mpi/bin/gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme ${NPME} -cpt 60 -maxh ${MAXH} -cpi MD_.cpt -ntomp 6 -gpu_id 0123 -plumed plumed.dat -multi 2

But this is fast:

mpirun -bind-to core:overload-allowed -npernode 24 /home/cneale/exec/GROMACS/exec/gromacs-5.1.2_plumed-2-2.3/gpu_mpi/bin/gmx_mpi mdrun -notunepme -deffnm MD_ -dlb yes -npme ${NPME} -cpt 60 -maxh ${MAXH} -cpi MD_.cpt -ntomp 1 -gpu_id 000000111111222222333333 -plumed plumed.dat -multi 2

where the "-bind-to core:overload-allowed" was only required because of the hyperthreading (I think). The differences are in (1) the mpirun -npernode option and (2) the gromacs -ntomp option; this combination then also requires changes to (3) the gromacs -gpu_id option.
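The two -gpu_id strings above follow the same pattern: one digit per rank on the node, with the ranks split into equal contiguous blocks per GPU. A minimal sketch of that pattern is below. The function name gpu_id_string and its arguments are hypothetical, and it assumes that every rank on the node uses a GPU (no separate PME-only ranks) and that there are fewer than 10 GPUs per node, so each ID is a single digit.

```bash
#!/bin/sh
# Sketch: build a -gpu_id string that maps the ranks on one node to GPUs
# in contiguous blocks. gpu_id_string is a hypothetical helper, not part
# of gromacs or plumed.
gpu_id_string() {
    ranks_per_node=$1
    gpus_per_node=$2
    # This sketch only handles an even split of ranks across GPUs.
    if [ $((ranks_per_node % gpus_per_node)) -ne 0 ]; then
        echo "ranks per node must be a multiple of GPUs per node" >&2
        return 1
    fi
    per_gpu=$((ranks_per_node / gpus_per_node))
    ids=""
    gpu=0
    while [ "$gpu" -lt "$gpus_per_node" ]; do
        i=0
        # Append this GPU's ID once for every rank assigned to it.
        while [ "$i" -lt "$per_gpu" ]; do
            ids="${ids}${gpu}"
            i=$((i + 1))
        done
        gpu=$((gpu + 1))
    done
    echo "$ids"
}

gpu_id_string 4 4     # prints 0123
gpu_id_string 24 4    # prints 000000111111222222333333
```

The first call corresponds to the -npernode 4 command and the second to the -npernode 24 command.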

Note that when I used 24 separate walkers (instead of 2) there was a further overhead, but it was only on the order of 5% (26.7 vs 28.3 ns/d).

Chris Neale

Jul 25, 2016, 2:33:33 PM
to PLUMED users
Important addition: on the cluster in question, nodes are connected by ethernet. I suppose that it might be different with IB, though I can't see why since each simulation is supposed to be within its own node.

Giovanni Bussi

Jul 26, 2016, 6:45:15 AM
to plumed...@googlegroups.com
Walkers should only communicate every PACE steps. Can you check if without plumed (or with larger PACE) you get the same overhead?

Other hints:
- Increase PACE
- Use grids (always use grids with METAD)
- Try v2.3 from github (there has been some optimization of this part)
- Hyperthreading: in my experience, this usually does not accelerate gromacs or plumed

Giovanni

--
Giovanni Bussi
Scuola Internazionale Superiore di Studi Avanzati - SISSA
via Bonomea 265, 34136 Trieste, Italy
          http://srnas.sissa.it

Chris Neale

Jul 26, 2016, 11:56:11 AM
to PLUMED users
Dear Giovanni:
 
 Thank you for the great suggestion to try -multi without plumed. This turns out to be a gromacs issue rather than a plumed issue. Details are below. I will post this issue to the gromacs list.
 
 (1) Increasing PACE from 2500 to 25000 does not increase speed.
 
 (mpirun -npernode 24) -ntomp 1; PACE=2500 :: 28.3 ns/day
 (mpirun -npernode 24) -ntomp 1; PACE=25000 :: 28.5 ns/day

 (mpirun -npernode 4) -ntomp 6; PACE=2500 :: 8.6 ns/day
 (mpirun -npernode 4) -ntomp 6; PACE=25000 :: 8.6 ns/day
 
 (1b) Without plumed, the same reliance on -ntomp 1 with -multi still exists (surprised me)
 
 (mpirun -npernode 24) -ntomp 1 :: 34.2 ns/day
 (mpirun -npernode 4) -ntomp 6 :: 9.8 ns/day             <-- good catch, so it's not a plumed issue after all!
 (mpirun -npernode 4) -ntomp 3 :: 10.0 ns/day 
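To make the comparison explicit, the ratio between the -ntomp 1 and -ntomp 6 throughputs can be computed from the ns/day figures reported above. This is only a small arithmetic sketch; the helper name ratio is hypothetical.

```bash
#!/bin/sh
# Sketch: ratio of fast (-ntomp 1) to slow (-ntomp 6) throughput,
# using the ns/day values reported in this post.
ratio() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

ratio 28.3 8.6    # with plumed, PACE=2500: prints 3.3
ratio 34.2 9.8    # without plumed:         prints 3.5
```

The slowdown is roughly the same with and without plumed, which is consistent with the conclusion that plumed is not the cause.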
  
 (2) I am using grids.
 
 metad: METAD ARG=dist.z SIGMA=0.0125 HEIGHT=0.8 PACE=2500 INTERVAL=-2.2,0.5 BIASFACTOR=60.0 TEMP=310.0 WALKERS_MPI GRID_MIN=-3.2 GRID_MAX=1.5 GRID_SPACING=0.00125 GRID_WFILE=GRID GRID_WSTRIDE=250000 STORE_GRIDS
 
 No hyperthreading (2 walkers): 25.6 ns/day
 Hyperthreading (2 walkers): 28.3 ns/day
 
 (3) Will try v2.3 when I get some time and will report back then.
 
 (4) Hyperthreading.
 
 This effect of hyperthreading increasing performance is reproducible even without plumed (single node, no "-multi"). Could be due to the fact that this is a hybrid CPU/GPU run and I am CPU-bound. Also, see the last entry in part 1b, above, to show that without hyperthreading the -ntomp >1 issue still exists.

Chris Neale

Jul 26, 2016, 1:03:57 PM
to PLUMED users
After some more tests, it seems that I simply was not using mpirun correctly. Specifying both the -np and -npernode options to mpirun also solves the problem. So e.g. "mpirun -np 8 -npernode 4 gmx_mpi -ntomp 6 -gpu_id 0123" works well. The openmpi manual for mpirun does not make it clear to me that -npernode should be used in conjunction with (not instead of) the -np option, but that seems to be the case.
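A minimal sketch of how the -np value in that example relates to the rest of the layout, assuming one node per walker as in the runs above, so that the total rank count is the number of walkers times the ranks per node (2 x 4 = 8). The function launch_line and its arguments are hypothetical, the command is only printed rather than executed, and the remaining mdrun options from the earlier commands are left out.

```bash
#!/bin/sh
# Sketch: derive the total rank count for mpirun from the per-node layout
# and print the resulting launch line. launch_line is a hypothetical helper.
launch_line() {
    n_walkers=$1        # one node per walker; also passed to -multi
    ranks_per_node=$2   # value for -npernode
    ntomp=$3            # OpenMP threads per rank, value for -ntomp
    gpu_ids=$4          # per-node GPU mapping, value for -gpu_id
    total_ranks=$((n_walkers * ranks_per_node))
    echo "mpirun -np ${total_ranks} -npernode ${ranks_per_node} gmx_mpi mdrun -ntomp ${ntomp} -gpu_id ${gpu_ids} -multi ${n_walkers}"
}

launch_line 2 4 6 0123
# prints: mpirun -np 8 -npernode 4 gmx_mpi mdrun -ntomp 6 -gpu_id 0123 -multi 2
```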

Chris Neale

Jul 26, 2016, 5:28:32 PM
to PLUMED users
Just to follow up for completeness, here is my post to the gromacs mailing list and a reply from a knowledgeable gromacs developer:
