Terrible performance across InfiniBand


Ronald Cohen

Mar 21, 2016, 4:48:40 PM
to cp2k
On the DCO machine deepcarbon I find decent single-node MPI performance, but running on the same number of processors across two nodes is terrible, even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:


 
On 16 cores on 1 node: total time 530 seconds
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.015    0.019  530.306  530.306
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4103         0.029              44140.             6191.05
 MP_Allreduce        21860         7.077                263.                0.81
 MP_Gather              62         0.008                320.                2.53
 MP_Sync                54         0.001
 MP_Alltoall         19407        26.839             648289.              468.77
 MP_ISendRecv        21600         0.091              94533.            22371.25
 MP_Wait            238786        50.545
 MP_comm_split          50         0.004
 MP_ISend            97572         0.741             239205.            31518.68
 MP_IRecv            97572         8.605             239170.             2711.98
 MP_Memory          167778        45.018
 -------------------------------------------------------------------------------


On 16 cores on 2 nodes: total time 5053 seconds!!

SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.311    0.363 5052.904 5052.909


-------------------------------------------------------------------------------
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4119         0.258              43968.              700.70
 MP_Allreduce        21892      1546.186                263.                0.00
 MP_Gather              62         0.049                320.                0.40
 MP_Sync                54         0.071
 MP_Alltoall         19407      1507.024             648289.                8.35
 MP_ISendRecv        21600         0.104              94533.            19656.44
 MP_Wait            238786       513.507
 MP_comm_split          50         4.096
 MP_ISend            97572         1.102             239206.            21176.09
 MP_IRecv            97572         2.739             239171.             8520.75
 MP_Memory          167778        18.845
 -------------------------------------------------------------------------------

Any ideas? The code was built with the latest gfortran, and I built all of the dependencies, using this arch file:

CC   = gcc
CPP  =
FC   = mpif90
LD   = mpif90
AR   = ar -r
PREFIX   = /home/rcohen
FFTW_INC   = $(PREFIX)/include
FFTW_LIB   = $(PREFIX)/lib
LIBINT_INC = $(PREFIX)/include
LIBINT_LIB = $(PREFIX)/lib
LIBXC_INC  = $(PREFIX)/include
LIBXC_LIB  = $(PREFIX)/lib
GCC_LIB = $(PREFIX)/gcc-trunk/lib
GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
GCC_INC = $(PREFIX)/gcc-trunk/include
DFLAGS  = -D__FFTW3 -D__LIBINT -D__LIBXC2\
    -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
    -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3 
CPPFLAGS   =
FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
    -fopenmp -ftree-vectorize -funroll-loops\
    -mtune=native  \
     -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
     -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
LIBS    =  \
    $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
    $(FFTW_LIB)/libfftw3.a\
    $(FFTW_LIB)/libfftw3_threads.a\
    $(LIBXC_LIB)/libxcf90.a\
    $(LIBXC_LIB)/libxc.a\
    $(PREFIX)/lib/liblapack.a  $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a  \
    $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a  -lelpa_openmp -lgomp -lopenblas
LDFLAGS = $(FCFLAGS)  -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib 

It was run with OMP_NUM_THREADS=2 across the two nodes and OMP_NUM_THREADS=1 on the single node.

I am now checking whether OMP_NUM_THREADS=1 on two nodes is faster than OMP_NUM_THREADS=2, but I do not think it will be.

Ron Cohen



Glen MacLachlan

Mar 21, 2016, 5:04:04 PM
to cp...@googlegroups.com

Are you conflating MPI with OpenMP? OMP_NUM_THREADS sets the number of threads used by OpenMP, and OpenMP doesn't work in a distributed-memory environment unless you piggyback on MPI, which would be hybrid use. I'm not sure CP2K ever worked optimally in hybrid mode, or at least that's the impression I've gotten from reading the comments in the source code.

As for MPI, are you sure your MPI stack was compiled with IB bindings? I had similar issues and the problem was that I wasn't actually using IB. If you can, disable eth and leave only IB and see what happens.
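
One quick sanity check, assuming a reasonably standard OpenMPI install, is to see whether the openib BTL component was built into your MPI at all:

ompi_info | grep btl

If openib does not show up among the MCA btl components listed, your OpenMPI build never had IB support to begin with.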

Glen


Cohen, Ronald

Mar 21, 2016, 5:12:01 PM
to cp2k
Yes, I am using hybrid mode. But even if I set OMP_NUM_THREADS=1, performance is terrible.

---
Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
rco...@carnegiescience.edu
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
https://www.linkedin.com/profile/view?id=163327727


Cohen, Ronald

Mar 21, 2016, 5:13:06 PM
to cp2k
Sorry, regarding the second question: I ran configure for openmpi-1.10.2 and it seemed to detect the InfiniBand. But perhaps it is not set up properly to build correctly on this machine. It is a good point.

Ron



Glen MacLachlan

Mar 21, 2016, 5:36:24 PM
to cp...@googlegroups.com

It's hard to talk about performance when you set OMP_NUM_THREADS=1, because there is so much overhead associated with OpenMP that launching a single thread is almost always a performance killer. In fact, OMP_NUM_THREADS=1 never rivals a true single-threaded build performance-wise because of that overhead. No one sets OMP_NUM_THREADS=1 unless they are playing around; we never do that in production jobs. How about when you scale up to 4 or 8 threads?
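
For example, a two-node hybrid run of the psmp binary with 8 MPI ranks and 4 threads each might look roughly like this (the exact binary name and launcher options depend on your build and scheduler, so treat this as a sketch):

export OMP_NUM_THREADS=4
mpirun -np 8 -npernode 4 -hostfile $PBS_NODEFILE cp2k.psmp -i H2O-64.inp -o H2O-64.out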

Glen

P.S. I see you're in DC...so am I. I support CP2K for the chemists at GWU. Hope you aren't using Metro to get around the DMV :p

Ronald Cohen

Mar 21, 2016, 6:05:37 PM
to cp...@googlegroups.com
In my experience in general, and according to the CP2K web pages in particular, that is not the case; please see the CP2K performance page. I am now sure the problem is the OpenMPI build not using the proper InfiniBand libraries or drivers.

Thank you!

Ron

Sent from my iPad

Andreas Glöss

Mar 22, 2016, 4:21:25 AM
to cp2k
Hi Ron,

There are several things in your ARCH file that don't fit together, or at least make no sense to me.
1) -I$(MKLROOT)/include: MKL is not used in your case.
2) Reference (netlib) LAPACK, ScaLAPACK, and OpenBLAS will never give you peak performance; better to use MKL if it is available.
3) Not sure, but has CP2K + ELPA 2015-11-10 ever been tested?

Please provide a snippet of the TIMINGS section (the first ~30 lines); maybe we can locate the problem from there.

By the way, even though PSMP should run most efficiently on an MPI+OpenMP machine, we usually find that the pure POPT version (no OpenMP) runs faster. Could you try this as well: 2 nodes, each running 16 MPI tasks?
To do this, please remove '-fopenmp' and '-lgomp', and compile and link the non-threaded versions of FFTW3 and ELPA.
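
As a rough sketch (the exact ELPA library name and module directory depend on how you built the non-threaded version; here I assume '-lelpa' and an 'elpa-2015.11.001' directory), the relevant arch-file lines would change roughly like this:

# FCFLAGS without -fopenmp, pointing at a non-OpenMP ELPA module directory
FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
    -ftree-vectorize -funroll-loops -mtune=native \
    -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) \
    -I$(PREFIX)/include/elpa-2015.11.001/modules
# LIBS without libfftw3_threads.a, libgomp.a and -lgomp, linking the serial ELPA
LIBS    = $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
    $(FFTW_LIB)/libfftw3.a \
    $(LIBXC_LIB)/libxcf90.a $(LIBXC_LIB)/libxc.a \
    $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a \
    $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a -lelpa -lopenblas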

Best regards,
Andreas


Andreas Glöss

Mar 22, 2016, 4:25:57 AM
to cp2k
Hi Ron,

Please send the output of the 'ompi_info' command as well, preferably as an attached file.

Best regards,
Andreas

Glen MacLachlan

Mar 22, 2016, 10:33:28 AM
to cp...@googlegroups.com
Hi Ron, 

There's a chance that OpenMPI wasn't configured to use IB properly. Why don't you disable tcp and see if you are using IB? It's easy:
mpirun --mca btl ^tcp ...

Regarding OpenMP:
I'm not sure we're converging on the same discussion anymore, but setting OMP_NUM_THREADS=1 does not eliminate the multithreading overhead -- you need to compile without -fopenmp to get a measure of true single-thread performance.


Best,
Glen

==========================================
Glen MacLachlan, PhD
HPC Specialist  for Physical Sciences &
Professorial Lecturer, Data Sciences
Office of Technology Services
The George Washington University
725 21st Street
Washington, DC 20052
Suite 211, Corcoran Hall
==========================================


Cohen, Ronald

Mar 22, 2016, 12:05:01 PM
to cp2k, Fox, Peter
Thank you so much. It is a bit difficult because I did not set up this machine and do not have root access, but I know it is a mess. I stepped back to just try the HPL benchmark.
I am finding 100 GFLOPS single-node performance with N=2000 and 16 cores, but 1.5 GFLOPS using two nodes with 8 cores per node. So there is definitely something really wrong. I need to get this working before I can worry about threads or CP2K.
Was that a caret in your command above:

mpirun --mca btl ^tcp

?

I looked through my OpenMPI build and it seems to have found the InfiniBand includes, such as they exist on the machine, but I could not find the expected MXM or Mellanox drivers anywhere on the machine.

I am CCing Peter Fox, the person who volunteers his time for this machine, and who has root access!

Sincerely,

Ron



Glen MacLachlan

Mar 22, 2016, 12:12:57 PM
to cp...@googlegroups.com, Fox, Peter
Yeah, the ^ is a regular expression character that means ignore what comes after -- think of it as a negation. 

Glen MacLachlan

Mar 22, 2016, 12:15:59 PM
to cp...@googlegroups.com, Fox, Peter
Sorry, it's more accurate to say the circumflex "^" is a regex character that reverses the match.

Glen MacLachlan

Mar 22, 2016, 12:21:13 PM
to cp...@googlegroups.com, Fox, Peter
There are more ways to benchmark MPI than you can shake a stick at, but NASA has a pretty simple suite of tests called NPB (the NAS Parallel Benchmarks) that is really easy to compile and run:


You can benchmark MPI by itself, OpenMP by itself, or MPI+OpenMP together.
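
For the MPI side, a rough sketch of building and running one of the communication-heavy kernels looks like this (the exact make targets, problem classes, and directory names depend on the NPB release you download, so treat this as a guide rather than a recipe):

cd NPB3.3-MPI
# create config/make.def from the provided template and set the MPI compilers there
make ft CLASS=B NPROCS=16
mpirun -hostfile $PBS_NODEFILE -np 16 bin/ft.B.16

Comparing the reported Mop/s when the 16 ranks sit on one node versus two nodes gives you a quick read on what the interconnect is costing you.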

Cohen, Ronald

Mar 22, 2016, 12:21:19 PM
to cp2k
OK, I ran xhpl with those flags and got the same 1 GFLOPS performance as without. So I guess my OpenMPI is not using IB. I wonder how to turn that on! My config.log for the build seems to show that it found InfiniBand. I attached it in case you have time to look. Thank you so much!

Ron



config.log

Cohen, Ronald

Mar 22, 2016, 12:29:20 PM
to cp2k
So what was expected when I ran this test? Thanks!
Ron



Glen MacLachlan

Mar 22, 2016, 12:34:26 PM
to cp...@googlegroups.com
Check with your admin to see what networks are available, but if you disable tcp using 'mpirun --mca btl ^tcp' then you should be giving MPI no choice but to use IB. You can also increase the verbosity by adding --mca btl_openib_verbose 1.
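
Putting those together with your PBS hostfile, a test run would look something like this (adjust -np and the hostfile to your scheduler):

mpirun --mca btl ^tcp --mca btl_openib_verbose 1 -hostfile $PBS_NODEFILE -np 16 ./xhpl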

Also, did you run ompi_info --all as Andreas suggested? 

Cohen, Ronald

Mar 22, 2016, 1:48:24 PM
to cp2k
So I must be using IB but just getting poor performance. Attached is the output of ompi_info --all. The problem is that there is essentially no admin available, and I don't have root access. Thank you again!


Ron



ompi_info.out

Cohen, Ronald

Mar 22, 2016, 2:05:57 PM
to cp2k
Dear Glen,

I built NPB. Which test do you recommend running? I have run several, and it is not clear what to look for.

Sincerely,

Ron



Cohen, Ronald

Mar 22, 2016, 2:49:39 PM
to cp2k, Fox, Peter, Ding Pan, Craig Schiffries
I explicitly put

mpirun --mca btl openib,self -hostfile $PBS_NODEFILE -n 16 xhpl > xhpl.out

for the HPL benchmark and still get terrible 1 GFLOPS performance on 2 nodes, but no errors.
So it seems I am running on InfiniBand but getting terrible performance.
Does that mean there is a hardware problem?

Thank you again for your help!

Sincerely,

Ron



Glen MacLachlan

Mar 22, 2016, 3:07:46 PM
to cp...@googlegroups.com
Hi Ron, 

I think this is sort of off topic for the CP2K folks and more along the lines of OpenMPI, but I'm happy to continue the discussion -- I'm afraid they might ask us to take it elsewhere, though.

So you want to do a couple of things:
  1. Vary the number of tasks and look for scaling -- you need to do this across multiple nodes to see what effect InfiniBand is having. I assume you know how to ask your scheduler to distribute the tasks across multiple nodes.
  2. Look for the throughput that you expect to be getting from your InfiniBand fabric. Did you mention what InfiniBand you are running? QDR? FDR? You can compare the NPB results for your IB and Ethernet networks (see the sketch after this list). Do you know what your Ethernet throughput is? GigE? 10GigE? You may want to have a look at this benchmark report that used NPB and NWChem, among others: http://www.dell.com/Downloads/Global/Power/ps1q10-20100215-Mellanox.pdf
Also, not having an admin handy or root access is not too bad of an impediment. You can stand up your own instance of OpenMPI without special privileges. Before you start chasing too many benchmarks (which can be difficult to resist), you may want to spin up your own OpenMPI instance and see if you can beat the Ethernet performance.
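
If the perftest utilities happen to be installed (they usually come with the OFED stack), you can also take MPI out of the picture entirely and measure the raw fabric bandwidth between two nodes, roughly like this:

node1$ ib_write_bw
node2$ ib_write_bw node1

A QDR or FDR link should report several GB/s; if you only see a few hundred MB/s or less, the problem is below MPI.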

By the way, when you type ifconfig do you see an interface that looks like ib0 or ib1 or something like that?

Cohen, Ronald

Mar 22, 2016, 3:09:22 PM
to cp2k
Yes, thank you so much. Basically I am getting mud even with 2 nodes, so using more would not be any better. I understand it is off topic, so I won't bother you. I have to get this working before I can worry about CP2K performance!

Ron



Glen MacLachlan

Mar 22, 2016, 3:14:26 PM
to cp...@googlegroups.com
No, no...don't misunderstand. I don't mind helping -- I want to figure this out too! Just saying we might want to take it over to the OpenMPI message boards. There you'll get hundreds of OpenMPI experts looking at your problem.



Cohen, Ronald

Mar 22, 2016, 3:15:38 PM
to cp2k
Oh--thank you so much! I will write there.

Ron



Glen MacLachlan

Mar 22, 2016, 3:17:41 PM
to cp...@googlegroups.com
You want to subscribe to the "user" list and post your messages there. I'll look for your messages on that board. 

Cohen, Ronald

Mar 22, 2016, 3:52:13 PM
to cp2k
Yes, I applied and am waiting. BTW, do you know how I can find out what kind of InfiniBand we have? I think it is Mellanox, but I don't know how to find out more. I found

ls ./lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/infiniband/hw
mthca  mlx4  ipath  qib  nes  ..  cxgb3  .  cxgb4

and not much else on the machine.

Ron



Cohen, Ronald

Mar 22, 2016, 3:58:42 PM
to cp2k
I did this:
ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00ec:9301
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            56 Gb/sec (4X FDR)
        link_layer:      InfiniBand

So it seems it is 4X FDR and should get a peak of 56 Gb/sec!

Ron
 


Glen MacLachlan

Mar 22, 2016, 4:00:44 PM
to cp...@googlegroups.com
You can usually figure that out by looking at the IB drivers or RDMA devices.
Try these:
  1. ibstat
  2. ibstatus
  3. ibv_devinfo
If you are running Mellanox, there will be a string that reads mlx4 or something similar. There are a host of IB tools that evolved in parallel but independently, and there is a lot of overlap in what they do. You can see what you have by typing ib and then hitting [TAB] a few times.

Glen MacLachlan

Mar 22, 2016, 4:01:32 PM
to cp...@googlegroups.com
Yes, that's correct. 

Cohen, Ronald

Mar 22, 2016, 4:03:51 PM
to cp2k
Yes:
[rcohen@deepcarbon 2nodesb]$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.500
        node_guid:                      0002:c903:00ec:9300
        sys_image_guid:                 0002:c903:00ec:9303
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Ron



Cohen, Ronald

Mar 23, 2016, 1:29:34 PM
to cp2k
So the problem is solved! I needed to rebuild OpenMPI, pointing it at the Torque directory:

> ./configure --prefix=/home/rcohen --with-tm=/opt/torque
> make clean
> make -j 8
> make install
>
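
For what it's worth, a quick way to confirm that the Torque support actually made it into the rebuilt OpenMPI (assuming a standard build) is:

ompi_info | grep tm

which should now list the tm plm and ras components.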


So I want to thank you so much! My time for the 64-molecule H2O benchmark with 16 MPI processes, 8 on each of two nodes with OMP_NUM_THREADS=2, went from 5052 seconds to 266 seconds with this simple fix! Now I will do further checking and tuning.
Thank you!

Ron

---
Ron Cohen
reco...@gmail.com
skypename: ronaldcohen
twitter: @recohen3


On Wed, Mar 23, 2016 at 11:00 AM, Ronald Cohen <reco...@gmail.com> wrote:
> Dear Gilles,
>
> --with-tm fails. I have now built with
> ./configure --prefix=/home/rcohen --with-tm=/opt/torque
> make clean
> make -j 8
> make install
>




Glen MacLachlan

Mar 23, 2016, 1:31:34 PM
to cp...@googlegroups.com

Glad it worked out!
