Dear Giovanni:
Thank you for the great suggestion to try -multi without plumed. This turns out to be a gromacs issue rather than a plumed issue. Details are below. I will post this issue to the gromacs list.
(1) Increasing PACE from 2500 to 25000 does not increase speed.
(mpirun -npernode 24) -ntomp 1; PACE=2500 :: 28.3 ns/day
(mpirun -npernode 24) -ntomp 1; PACE=25000 :: 28.5 ns/day
(mpirun -npernode 4) -ntomp 6; PACE=2500 :: 8.6 ns/day
(mpirun -npernode 4) -ntomp 6; PACE=25000 :: 8.6 ns/day
(1b) Without plumed, the same dependence on -ntomp 1 with -multi still exists (this surprised me).
(mpirun -npernode 24) -ntomp 1 :: 34.2 ns/day
(mpirun -npernode 4) -ntomp 6 :: 9.8 ns/day <-- good catch, so it's not a plumed issue after all!
(mpirun -npernode 4) -ntomp 3 :: 10.0 ns/day
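To put the numbers in (1) and (1b) side by side, here is a minimal Python sketch that only does arithmetic on the ns/day values reported above. The layout labels such as "24x1" (MPI ranks per node x OpenMP threads) are shorthand introduced for this example, not gromacs options.

```python
# ns/day values copied from the benchmarks in (1) and (1b); nothing here is newly measured.
# Keys are illustrative shorthand: "<ranks per node>x<OpenMP threads per rank>".
with_plumed = {"24x1": 28.3, "4x6": 8.6}                  # PACE=2500 runs
without_plumed = {"24x1": 34.2, "4x6": 9.8, "4x3": 10.0}  # -multi, no plumed

# Slowdown of each -ntomp > 1 layout relative to 24 ranks x 1 thread, without plumed.
baseline = without_plumed["24x1"]
slowdown = {
    layout: baseline / rate
    for layout, rate in without_plumed.items()
    if layout != "24x1"
}

# Fractional throughput lost to plumed for the 24x1 layout.
plumed_cost = 1.0 - with_plumed["24x1"] / without_plumed["24x1"]

for layout, factor in slowdown.items():
    print(f"{layout}: {factor:.2f}x slower than 24x1 (no plumed)")
print(f"plumed cost at 24x1: {plumed_cost:.1%}")
```

The roughly 3.4x to 3.5x gap between the layouts is present without plumed, while plumed itself accounts for a much smaller difference at 24x1.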
(2) I am using grids.
metad: METAD ARG=dist.z SIGMA=0.0125 HEIGHT=0.8 PACE=2500 INTERVAL=-2.2,0.5 BIASFACTOR=60.0 TEMP=310.0 WALKERS_MPI GRID_MIN=-3.2 GRID_MAX=1.5 GRID_SPACING=0.00125 GRID_WFILE=GRID GRID_WSTRIDE=250000 STORE_GRIDS
No hyperthreading (2 walkers): 25.6 ns/day
Hyperthreading (2 walkers): 28.3 ns/day
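As a rough check of the grid settings in the METAD line above, the following Python sketch estimates the grid size from GRID_MIN, GRID_MAX and GRID_SPACING. It is plain arithmetic on the quoted values; the exact bin count plumed allocates may differ slightly because of its own rounding, and the hills-per-write figure assumes PACE and GRID_WSTRIDE are both counted in MD steps.

```python
# Values copied from the METAD line; this is arithmetic only, not a plumed call.
grid_min, grid_max = -3.2, 1.5
grid_spacing = 0.00125
sigma = 0.0125
pace = 2500
grid_wstride = 250000

# Approximate number of grid bins along dist.z (plumed's own rounding may differ by a bin).
n_bins = round((grid_max - grid_min) / grid_spacing)

# Number of grid points per Gaussian width.
points_per_sigma = round(sigma / grid_spacing)

# Hills deposited between grid writes, assuming both settings are in MD steps.
hills_per_write = grid_wstride // pace

print(f"approx. bins: {n_bins}")
print(f"grid points per SIGMA: {points_per_sigma}")
print(f"hills between grid writes: {hills_per_write}")
```

A one-dimensional grid of a few thousand bins is small, which fits the observation that the slowdown also appears without plumed.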
(3) Will try v2.3 when I get some time and will report back then.
(4) Hyperthreading.
The performance gain from hyperthreading is reproducible even without plumed (single node, no "-multi"). This could be because this is a hybrid CPU/GPU run and I am CPU-bound. Also, see the last entry in part 1b above, which shows that the -ntomp >1 issue still exists without hyperthreading.