Terrible performance across InfiniBand


Ronald Cohen

Mar 21, 2016, 4:48:40 PM
to cp2k
On the DCO machine deepcarbon I find decent single-node MPI performance, but running on the same number of processors across two nodes is terrible, even with the InfiniBand interconnect. This is the CP2K H2O-64 benchmark:


 
On 16 cores on 1 node: total time 530 seconds
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.015    0.019  530.306  530.306
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4103         0.029              44140.             6191.05
 MP_Allreduce        21860         7.077                263.                0.81
 MP_Gather              62         0.008                320.                2.53
 MP_Sync                54         0.001
 MP_Alltoall         19407        26.839             648289.              468.77
 MP_ISendRecv        21600         0.091              94533.            22371.25
 MP_Wait            238786        50.545
 MP_comm_split          50         0.004
 MP_ISend            97572         0.741             239205.            31518.68
 MP_IRecv            97572         8.605             239170.             2711.98
 MP_Memory          167778        45.018
 -------------------------------------------------------------------------------


On 16 cores on 2 nodes: total time 5053 seconds!!

SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.311    0.363 5052.904 5052.909


-------------------------------------------------------------------------------
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS  TOT TIME [s]  AVE VOLUME [Bytes]  PERFORMANCE [MB/s]
 MP_Group                5         0.000
 MP_Bcast             4119         0.258              43968.              700.70
 MP_Allreduce        21892      1546.186                263.                0.00
 MP_Gather              62         0.049                320.                0.40
 MP_Sync                54         0.071
 MP_Alltoall         19407      1507.024             648289.                8.35
 MP_ISendRecv        21600         0.104              94533.            19656.44
 MP_Wait            238786       513.507
 MP_comm_split          50         4.096
 MP_ISend            97572         1.102             239206.            21176.09
 MP_IRecv            97572         2.739             239171.             8520.75
 MP_Memory          167778        18.845
 -------------------------------------------------------------------------------

Any ideas? The code was built with the latest gfortran, and I built all of the dependencies, using this arch file:

CC   = gcc
CPP  =
FC   = mpif90
LD   = mpif90
AR   = ar -r
PREFIX   = /home/rcohen
FFTW_INC   = $(PREFIX)/include
FFTW_LIB   = $(PREFIX)/lib
LIBINT_INC = $(PREFIX)/include
LIBINT_LIB = $(PREFIX)/lib
LIBXC_INC  = $(PREFIX)/include
LIBXC_LIB  = $(PREFIX)/lib
GCC_LIB = $(PREFIX)/gcc-trunk/lib
GCC_LIB64  = $(PREFIX)/gcc-trunk/lib64
GCC_INC = $(PREFIX)/gcc-trunk/include
DFLAGS  = -D__FFTW3 -D__LIBINT -D__LIBXC2\
    -D__LIBINT_MAX_AM=7 -D__LIBDERIV_MAX_AM1=6 -D__MAX_CONTR=4\
    -D__parallel -D__SCALAPACK -D__HAS_smm_dnn -D__ELPA3 
CPPFLAGS   =
FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
    -fopenmp -ftree-vectorize -funroll-loops\
    -mtune=native  \
     -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) -I$(MKLROOT)/include \
     -I$(GCC_INC) -I$(PREFIX)/include/elpa_openmp-2015.11.001/modules
LIBS    =  \
    $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
    $(FFTW_LIB)/libfftw3.a\
    $(FFTW_LIB)/libfftw3_threads.a\
    $(LIBXC_LIB)/libxcf90.a\
    $(LIBXC_LIB)/libxc.a\
    $(PREFIX)/lib/liblapack.a  $(PREFIX)/lib/libtmglib.a $(PREFIX)/lib/libgomp.a  \
    $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a  -lelpa_openmp -lgomp -lopenblas
LDFLAGS = $(FCFLAGS)  -L$(GCC_LIB64) -L$(GCC_LIB) -static-libgfortran -L$(PREFIX)/lib 

It was run with OMP_NUM_THREADS=2 across the two nodes and OMP_NUM_THREADS=1 on the single node.

I am now checking whether OMP_NUM_THREADS=1 on two nodes is faster than OMP_NUM_THREADS=2, but I do not think it will be.

Ron Cohen



Glen MacLachlan

Mar 21, 2016, 5:04:04 PM
to cp...@googlegroups.com

Are you conflating MPI with OpenMP? OMP_NUM_THREADS sets the number of threads used by OpenMP, and OpenMP doesn't work in a distributed-memory environment unless you piggyback on MPI, which would be hybrid use. I'm not sure CP2K ever worked optimally in hybrid mode, or at least that's the impression I've gotten from reading the comments in the source code.

As for MPI, are you sure your MPI stack was compiled with IB bindings? I had similar issues and the problem was that I wasn't actually using IB. If you can, disable eth and leave only IB and see what happens.
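
One quick sanity check, assuming a reasonably standard OpenMPI install, is to see whether the openib BTL component was built into your MPI at all:

ompi_info | grep btl

If openib does not show up among the MCA btl components listed, your OpenMPI build never had IB support to begin with.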

Glen


Cohen, Ronald

Mar 21, 2016, 5:12:01 PM
to cp2k
Yes, I am using hybrid mode. But even if I set OMP_NUM_THREADS=1, performance is terrible.

---
Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
rco...@carnegiescience.edu
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
https://www.linkedin.com/profile/view?id=163327727


Cohen, Ronald

Mar 21, 2016, 5:13:06 PM
to cp2k
Sorry, regarding the second question: I ran configure for openmpi-1.10.2 and it seemed to detect the InfiniBand. But perhaps it is not set up properly to build correctly on this machine. It is a good point.

Ron



Glen MacLachlan

Mar 21, 2016, 5:36:24 PM
to cp...@googlegroups.com

It's hard to talk about performance when you set OMP_NUM_THREADS=1, because there is so much overhead associated with OpenMP that launching a single thread is almost always a performance killer. In fact, OMP_NUM_THREADS=1 never rivals a true single-threaded build performance-wise because of that overhead. No one sets OMP_NUM_THREADS=1 unless they are playing around; we never do that in production jobs. How about when you scale up to 4 or 8 threads?
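
For example, a two-node hybrid run of the psmp binary with 8 MPI ranks and 4 threads each might look roughly like this (the exact binary name and launcher options depend on your build and scheduler, so treat this as a sketch):

export OMP_NUM_THREADS=4
mpirun -np 8 -npernode 4 -hostfile $PBS_NODEFILE cp2k.psmp -i H2O-64.inp -o H2O-64.out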

Glen

P.S. I see you're in DC...so am I. I support CP2K for the chemists at GWU. Hope you aren't using Metro to get around the DMV :p

Ronald Cohen

Mar 21, 2016, 6:05:37 PM
to cp...@googlegroups.com
In my experience in general, and according to the CP2K web pages in particular, that is not the case; please see the CP2K performance page. I am now sure the problem is the OpenMPI build not using the proper InfiniBand libraries or drivers.

Thank you!

Ron

Sent from my iPad

Andreas Glöss

Mar 22, 2016, 4:21:25 AM
to cp2k
Hi Ron,

There are several things in your ARCH file that don't fit together, or at least make no sense to me.
1) -I$(MKLROOT)/include: MKL is not used in your case.
2) Reference (netlib) LAPACK, ScaLAPACK, and OpenBLAS will never give you peak performance; better to use MKL if it is available.
3) Not sure, but has CP2K + ELPA 2015-11-10 ever been tested?

Please provide a snippet of the TIMINGS section (the first ~30 lines); maybe we can locate the problem from there.

By the way, even though PSMP should run most efficiently on an MPI+OpenMP machine, we usually find that the pure POPT version (no OpenMP) runs faster. Could you try this as well: 2 nodes, each running 16 MPI tasks?
To do this, please remove '-fopenmp' and '-lgomp', and compile and link the non-threaded versions of FFTW3 and ELPA.
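
As a rough sketch (the exact ELPA library name and module directory depend on how you built the non-threaded version; here I assume '-lelpa' and an 'elpa-2015.11.001' directory), the relevant arch-file lines would change roughly like this:

# FCFLAGS without -fopenmp, pointing at a non-OpenMP ELPA module directory
FCFLAGS = $(DFLAGS) -O2 -ffast-math -ffree-form -ffree-line-length-none\
    -ftree-vectorize -funroll-loops -mtune=native \
    -I$(FFTW_INC) -I$(LIBINT_INC) -I$(LIBXC_INC) \
    -I$(PREFIX)/include/elpa-2015.11.001/modules
# LIBS without libfftw3_threads.a, libgomp.a and -lgomp, linking the serial ELPA
LIBS    = $(PREFIX)/lib/libscalapack.a $(PREFIX)/lib/libsmm_dnn_sandybridge-2015-11-10.a \
    $(FFTW_LIB)/libfftw3.a \
    $(LIBXC_LIB)/libxcf90.a $(LIBXC_LIB)/libxc.a \
    $(PREFIX)/lib/liblapack.a $(PREFIX)/lib/libtmglib.a \
    $(PREFIX)/lib/libderiv.a $(PREFIX)/lib/libint.a -lelpa -lopenblas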

Best regards,
Andreas


Andreas Glöss

Mar 22, 2016, 4:25:57 AM
to cp2k
Hi Ron,

Please send the output of the 'ompi_info' command as well, preferably as an attached file.

Best regards,
Andreas

Glen MacLachlan

Mar 22, 2016, 10:33:28 AM
to cp...@googlegroups.com
Hi Ron, 

There's a chance that OpenMPI wasn't configured to use IB properly. Why don't you disable tcp and see if you are using IB? It's easy:
mpirun --mca btl ^tcp ...

Regarding OpenMP:
I'm not sure we're converging on the same discussion anymore, but setting OMP_NUM_THREADS=1 does not eliminate the multithreading overhead -- you need to compile without -fopenmp to get a measure of true single-thread performance.


Best,
Glen

==========================================
Glen MacLachlan, PhD
HPC Specialist  for Physical Sciences &
Professorial Lecturer, Data Sciences
Office of Technology Services
The George Washington University
725 21st Street
Washington, DC 20052
Suite 211, Corcoran Hall
==========================================


Cohen, Ronald

Mar 22, 2016, 12:05:01 PM
to cp2k, Fox, Peter
Thank you so much. It is a bit difficult because I did not set up this machine and do not have root access, but I know it is a mess. I stepped back to just try the HPL benchmark.
I am finding 100 GFLOPS single-node performance with N=2000 and 16 cores, but 1.5 GFLOPS using two nodes with 8 cores per node. So there is definitely something really wrong. I need to get this working before I can worry about threads or CP2K.
Was that a caret in your command above:

mpirun --mca btl ^tcp

?

I looked through my OpenMPI build and it seems to have found the InfiniBand includes, such as they exist on the machine, but I could not find the expected MXM or Mellanox drivers anywhere on the machine.

I am CCing Peter Fox, the person who volunteers his time for this machine, and who has root access!

Sincerely,

Ron



Glen MacLachlan

Mar 22, 2016, 12:12:57 PM
to cp...@googlegroups.com, Fox, Peter
Yeah, the ^ is a regular expression character that means ignore what comes after -- think of it as a negation. 

Glen MacLachlan

Mar 22, 2016, 12:15:59 PM
to cp...@googlegroups.com, Fox, Peter
Sorry, it's more accurate to say the circumflex "^" is a regex character that reverses the match.

Glen MacLachlan

Mar 22, 2016, 12:21:13 PM
to cp...@googlegroups.com, Fox, Peter
There are more ways to benchmark MPI than you can shake a stick at, but NASA has a pretty simple suite of tests called NPB (the NAS Parallel Benchmarks) that is really easy to compile and run:


You can benchmark MPI by itself, OpenMP by itself, or MPI+OpenMP together.
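
For the MPI side, a rough sketch of building and running one of the communication-heavy kernels looks like this (the exact make targets, problem classes, and directory names depend on the NPB release you download, so treat this as a guide rather than a recipe):

cd NPB3.3-MPI
# create config/make.def from the provided template and set the MPI compilers there
make ft CLASS=B NPROCS=16
mpirun -hostfile $PBS_NODEFILE -np 16 bin/ft.B.16

Comparing the reported Mop/s when the 16 ranks sit on one node versus two nodes gives you a quick read on what the interconnect is costing you.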

Cohen, Ronald

Mar 22, 2016, 12:21:19 PM
to cp2k
OK, I ran xhpl with those flags and got the same 1 GFLOPS performance as without. So I guess my OpenMPI is not using IB. I wonder how to turn that on! My config.log for the build seems to show that it found InfiniBand. I attached it in case you have time to look. Thank you so much!

Ron



config.log

Cohen, Ronald

Mar 22, 2016, 12:29:20 PM
to cp2k
So what was expected when I ran this test? Thanks!
Ron



Glen MacLachlan

Mar 22, 2016, 12:34:26 PM
to cp...@googlegroups.com
Check with your admin to see what networks are available, but if you disable tcp using 'mpirun --mca btl ^tcp' then you should be giving MPI no choice but to use IB. You can also increase the verbosity by adding --mca btl_openib_verbose 1.
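
Putting those together with your PBS hostfile, a test run would look something like this (adjust -np and the hostfile to your scheduler):

mpirun --mca btl ^tcp --mca btl_openib_verbose 1 -hostfile $PBS_NODEFILE -np 16 ./xhpl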

Also, did you run ompi_info --all as Andreas suggested? 

Cohen, Ronald

Mar 22, 2016, 1:48:24 PM
to cp2k
So I must be using IB but just getting poor performance. Attached is the output of ompi_info --all. The problem is that there is essentially no admin available, and I don't have root access. Thank you again!


Ron



ompi_info.out

Cohen, Ronald

Mar 22, 2016, 2:05:57 PM
to cp2k
Dear Glen,

I built NPB. Which test do you recommend running? I have run several, and it is not clear what to look for.

Sincerely,

Ron



Cohen, Ronald

Mar 22, 2016, 2:49:39 PM
to cp2k, Fox, Peter, Ding Pan, Craig Schiffries
I explicitly put

mpirun --mca btl openib,self -hostfile $PBS_NODEFILE -n 16 xhpl > xhpl.out

for the HPL benchmark and still get terrible 1 GFLOPS performance on 2 nodes, but no errors.
So it seems I am running on InfiniBand but getting terrible performance.
Does that mean there is a hardware problem?

Thank you again for your help!

Sincerely,

Ron



Glen MacLachlan

Mar 22, 2016, 3:07:46 PM
to cp...@googlegroups.com
Hi Ron, 

I think this is sort of off topic for the CP2K folks and more along the lines of OpenMPI, but I'm happy to continue the discussion -- I'm afraid they might ask us to take it elsewhere, though.

So you want to do a couple of things:
  1. Vary the number of tasks and look for scaling -- you need to do this across multiple nodes to see what effect InfiniBand is having. I assume you know how to ask your scheduler to distribute the tasks across multiple nodes.
  2. Look for the throughput that you expect to be getting from your InfiniBand fabric. Did you mention what InfiniBand you are running? QDR? FDR? You can compare the NPB results for your IB and Ethernet networks (see the sketch after this list). Do you know what your Ethernet throughput is? GigE? 10GigE? You may want to have a look at this benchmark report that used NPB and NWChem, among others: http://www.dell.com/Downloads/Global/Power/ps1q10-20100215-Mellanox.pdf
Also, not having an admin handy or root access is not too bad of an impediment. You can stand up your own instance of OpenMPI without special privileges. Before you start chasing too many benchmarks (which can be difficult to resist), you may want to spin up your own OpenMPI instance and see if you can beat the Ethernet performance.
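
If the perftest utilities happen to be installed (they usually come with the OFED stack), you can also take MPI out of the picture entirely and measure the raw fabric bandwidth between two nodes, roughly like this:

node1$ ib_write_bw
node2$ ib_write_bw node1

A QDR or FDR link should report several GB/s; if you only see a few hundred MB/s or less, the problem is below MPI.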

By the way, when you type ifconfig do you see an interface that looks like ib0 or ib1 or something like that?

Cohen, Ronald

Mar 22, 2016, 3:09:22 PM
to cp2k
Yes, thank you so much. Basically I am getting mud even with 2 nodes, so using more would not be any better. I understand it is off topic, so I won't bother you. I have to get this working before I can worry about CP2K performance!

Ron



Glen MacLachlan

Mar 22, 2016, 3:14:26 PM
to cp...@googlegroups.com
No, no...don't misunderstand. I don't mind helping -- I want to figure this out too! Just saying we might want to take it over to the OpenMPI message boards. There you'll get hundreds of OpenMPI experts looking at your problem.



Cohen, Ronald

Mar 22, 2016, 3:15:38 PM
to cp2k
Oh--thank you so much! I will write there.

Ron



Glen MacLachlan

Mar 22, 2016, 3:17:41 PM
to cp...@googlegroups.com
You want to subscribe to the "user" list and post your messages there. I'll look for your messages on that board. 

Cohen, Ronald

Mar 22, 2016, 3:52:13 PM
to cp2k
Yes, I applied and am waiting. BTW, do you know how I can find out what kind of InfiniBand we have? I think it is Mellanox, but I don't know how to find out more. I found

ls ./lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/infiniband/hw
mthca  mlx4  ipath  qib  nes  ..  cxgb3  .  cxgb4

and not much else on the machine.

Ron



Cohen, Ronald

Mar 22, 2016, 3:58:42 PM
to cp2k
I did this:
ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00ec:9301
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            56 Gb/sec (4X FDR)
        link_layer:      InfiniBand

So it seems it is 4X FDR and should get a peak of 56 Gb/sec!

Ron
 


Glen MacLachlan

Mar 22, 2016, 4:00:44 PM
to cp...@googlegroups.com
You can usually figure that out by looking at the IB drivers or RDMA devices.
Try these:
  1. ibstat
  2. ibstatus
  3. ibv_devinfo
If you are running Mellanox, there will be a string that reads mlx4 or something similar. There are a host of IB tools that evolved in parallel but independently, and there is a lot of overlap in what they do. You can see what you have by typing ib and then hitting [TAB] a few times.

Glen MacLachlan

Mar 22, 2016, 4:01:32 PM
to cp...@googlegroups.com
Yes, that's correct. 

Cohen, Ronald

Mar 22, 2016, 4:03:51 PM
to cp2k
Yes:
[rcohen@deepcarbon 2nodesb]$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.500
        node_guid:                      0002:c903:00ec:9300
        sys_image_guid:                 0002:c903:00ec:9303
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Ron



Cohen, Ronald

Mar 23, 2016, 1:29:34 PM
to cp2k
So the problem is solved! I needed to rebuild OpenMPI, pointing it at the Torque directory:

> ./configure --prefix=/home/rcohen --with-tm=/opt/torque
> make clean
> make -j 8
> make install
>
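
For what it's worth, a quick way to confirm that the Torque support actually made it into the rebuilt OpenMPI (assuming a standard build) is:

ompi_info | grep tm

which should now list the tm plm and ras components.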


So I want to thank you so much! My time for the 64-molecule H2O benchmark with 16 MPI processes, 8 on each of two nodes with OMP_NUM_THREADS=2, went from 5052 seconds to 266 seconds with this simple fix! Now I will do further checking and tuning.
Thank you!

Ron

---
Ron Cohen
reco...@gmail.com
skypename: ronaldcohen
twitter: @recohen3


On Wed, Mar 23, 2016 at 11:00 AM, Ronald Cohen <reco...@gmail.com> wrote:
> Dear Gilles,
>
> --with-tm fails. I have now built with
> ./configure --prefix=/home/rcohen --with-tm=/opt/torque
> make clean
> make -j 8
> make install
>




Glen MacLachlan

Mar 23, 2016, 1:31:34 PM
to cp...@googlegroups.com

Glad it worked out!
