cp2k speedup on multicore machines


cavallo

Jan 24, 2008, 9:12:52 AM1/24/08
to cp2k

Dear All,

I just started to work with cp2k, and I would like to know the speedup
I can expect with cp2k on dual-core em64t and opteron machines.

I tested a simple job on an em64t. Running on 2 cpus the speedup is
1.07 instead of 2. Going up to 4 cpus the speedup is 1.28 instead of
4. Teo Laino was kind enough to test my input on a Cray, and he got a
speedup of 1.50 on 2 cpus, and 2.79 on 4 cpus, which is definitely
better.

Since this is my first cp2k installation, I would like to know if the
speedup I got is in line with other people's experience on em64t and
opteron multicore machines (I hope not), or whether I have to work on
compiling a better executable (in that case, any tip is very welcome).
There's no problem with memory or disk access, since two serial jobs
running simultaneously are not slowed down.

So far, I compiled cp2k on Fedora 7 with Intel Fortran 10.0.025, the
Intel MKL libraries, and gcc-compiled fftw3 and mpich2.

Thanks,
Luigi

Nichols A. Romero

Jan 24, 2008, 9:34:23 AM1/24/08
to cp...@googlegroups.com
It all depends on the system size. Larger systems will scale better in general.

Reston, VA
443-567-8328 (C)
410-278-2692 (O)

cavallo

Jan 25, 2008, 12:57:21 AM1/25/08
to cp2k

I tried the QS H2O-256 benchmark, and cp2k on 2 cpus is even slower.

1 cpu = 22579 secs
2 cpus = 24942 secs

The em64t is equipped with 8MB of ram. I don't know if this can be a
bottleneck with 256 waters.
Does anyone have some benchmarks? Absolute execution times on 1 cpu are
also welcome.
Thanks
Luigi

Axel

Jan 25, 2008, 8:32:35 AM1/25/08
to cp2k
luigi,

before being able to make any comments or give you data to compare to,
please provide more information about how you compiled cp2k:
which compiler, which flags, which libraries etc. some specs
about the hardware, particularly cpu and memory, would be great, too.

these days it is not that easy to say: i have an x GHz cpu
and you have a 2x GHz cpu, so you should be running twice as fast.
the same goes for multi-core. under some unfavorable circumstances,
the second core can be a complete waste.

cheers,
axel.

Fawzi Mohamed

Jan 25, 2008, 8:46:02 AM1/25/08
to cp...@googlegroups.com
Hi luigi,

256 water should scale very well beyond 2 cpus.

Have you used the same optimizations with 1 and 2 cpus (or is it even
the same executable)?

what happens if you do top during a run? (some commands for these checks
are sketched below)
* does the 1-cpu executable really use only 1 cpu (or do lapack/fft use
multithreading and grab two cpus)?
* does the 2-cpu run really go on two processors? do you have some
problems with cpu affinity?
* is the timing cpu time? i.e. 2 cpus = twice the time, but half the
real time?
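
For example, something along these lines answers those questions with
standard linux tools (the binary and input names are just the ones used
in this thread; <pid> and out.log are placeholders):

top                      # watch %CPU and RES per process while cp2k runs
taskset -pc <pid>        # list the cores a given rank is allowed to run on
/usr/bin/time mpiexec -n 2 cp2k.popt H2O-256.inp > out.log
                         # compare elapsed (wall) time against user/sys cpu time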

ciao
Fawzi

Fawzi Mohamed

Jan 25, 2008, 9:06:13 AM1/25/08
to cp...@googlegroups.com
> The em64t is equipped with 8MB of ram. Don't know if this can be a
> bottleneck with 256 waters.

well, 8MB of RAM is for sure a bottleneck ;)
8GB should not be, but, well, 6 hours for 10 steps... maybe it did swap, and if you swap then everything is over. Do check: if memory consumption is larger than your physical memory, cpu usage drops drastically; again, you should see this in top.

Use a smaller system if this is the problem; the 32 water benchmark should already scale well... (a quick check for swapping is sketched below)
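
For instance, something like this is usually enough to see whether the
run went into swap (plain linux tools, nothing cp2k specific):

free -g       # total/used memory and swap, in GB
vmstat 1 5    # nonzero si/so columns mean pages are being swapped in/out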

ciao
Fawzi

Teodoro Laino

Jan 25, 2008, 9:28:00 AM1/25/08
to cp...@googlegroups.com
The problem, unfortunately, is far from being so obvious..

I think the MB is just a mistyping.. (GB instead..)
And the problem is not only related to the benchmark input but also to small systems.
I suggested that Luigi move to the benchmark test since in that case (a larger one) you should definitely see scaling (it is more difficult
to observe with smaller systems). IO is reduced to the minimum in the tests.. so don't consider disk access.

Luigi, I just posted the info you gave me about compilers/hardware.. I don't have the compilation flags.. Can you post them here in the group?

-------------------------------------------------------------------------------------------------------------------------------------------------
Linux k119 2.6.21-1.3194.fc7 #1 SMP Wed May 23 22:47:07 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
gcc version 4.1.2 20070925 (Red Hat 4.1.2-27)
ifort Build 20070613 Package ID: l_fc_c_10.0.025
intel mkl 10.0.1.014
fftw-3.1.2
mpich2-1.0.6p1
-------------------------------------------------------------------------------------------------------------------------------------------------

We use mostly the same setup on our local machines.. The only differences are the kernel (ours is 2.6.16.21-0.8-smp; could this be an issue, Axel?) and mpich2 (we use 1.0.5p4)...

Definitely it's not an OMP issue.. following Manuel's suggestion I asked Luigi to set OMP_NUM_THREADS=1 and nothing changed in the timings..

Let's see if we can figure out what's going on..
Sure.. it's a little bit weird...
Teo

Fawzi Mohamed

Jan 25, 2008, 11:15:39 AM1/25/08
to cp...@googlegroups.com
I know that I have almost perfect scaling with 32 H2O going from 1 to 2 cpus.

You have to check what you are measuring (be sure to measure real time).

You have to be sure that the mpi correctly sets cpu affinity (so that the processes do not hop between processors and get slow).
I know LAM/MPI and OpenMPI both can do it; I assume mpich also does.

With LAM/MPI you have to write the host file like this:
myhost  cpu=2

and not
myhost
myhost

ciao
Fawzi

cavallo

Jan 26, 2008, 2:31:37 PM1/26/08
to cp2k

Thanks to all.

Yes, 8MB is a typo, the machine has 8GB of ram. It is an HP ProLiant DL140,
with 2 em64t Intel Xeon 5160 CPUs @ 3.00GHz according to /proc/cpuinfo.

Below are the kernel/compilers/libraries I used to compile cp2k, and
after that you can find the compile options I used.
I prepared an mpich2 machine file as
10.10.10.119 cpu=2 (or with two lines containing the ip), and after
mpdboot -n 2 -f mpi.file I see this with ps x:

python2.5 /home/programs/mpich2/64/1.0.6p1/bin/mpd.py --ncpus=2 -e -d

Any idea ?
Thanks,
Luigi

Linux k119 2.6.21-1.3194.fc7 #1 SMP Wed May 23 22:47:07 EDT 2007
x86_64 x86_64 x86_64 GNU/Linux
gcc version 4.1.2 20070925 (Red Hat 4.1.2-27)
ifort Build 20070613 Package ID: l_fc_c_10.0.025
intel mkl 10.0.1.014
fftw-3.1.2
mpich2-1.0.6p1

INTEL_INC= /home/programs/intel/64/fce/10.0.025/include/
FFTW3_INC= /home/programs/fftw3/em64t/include/

MKL_LIB= /home/programs/intel/64/mkl/10.0.1.014/lib/em64t/
FFTW3_LIB= /home/programs/fftw3/em64t/lib/

CC = cc
CPP =
FC = mpif90
LD = mpif90
AR = ar -r
DFLAGS = -D__INTEL -D__FFTSG -D__parallel -D__BLACS -D__SCALAPACK -D__FFTW3
CPPFLAGS = -I$(INTEL_INC) -I$(FFTW3_INC)
FCFLAGS = $(DFLAGS) -I$(INTEL_INC) -I$(FFTW3_INC) -O3 -xW -heap-arrays 64 -funroll-loops -fpp -free
LDFLAGS = $(FCFLAGS) -I$(INTEL_INC) -L$(MKL_LIB)

LIBS = $(MKL_LIB)/libmkl_scalapack.a \
       $(MKL_LIB)/libmkl_blacs.a \
       $(MKL_LIB)/libmkl_em64t.a \
       $(MKL_LIB)/libguide.a \
       -lpthread \
       $(FFTW3_LIB)/libfftw3.a


Matt W

Jan 26, 2008, 3:40:20 PM1/26/08
to cp2k
Do you have an example of any code running well on 2 cores?

Matt

Axel

Jan 29, 2008, 8:37:58 PM1/29/08
to cp2k
hi!

sorry for the delay, it took me a while to get some numbers together.
my machine is a dual processor intel xeon 5150 @ 2.66GHz (woodcrest).

first off, contrary to fawzi's statement, cpu affinity in OpenMPI has
to be explicitly enabled (e.g. via setting mpi_paffinity_alone=1 in
~/.openmpi/mca-params.conf). what both LAM/MPI and particularly
OpenMPI do have activated by default, however, are algorithms that can
take advantage of locality and that require the correct specification
of nodes.
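
for reference, that setup would look roughly like this (the slots
syntax follows the usual OpenMPI hostfile convention; the node name is
a placeholder):

# ~/.openmpi/mca-params.conf
mpi_paffinity_alone = 1

# hostfile passed to mpirun: tell OpenMPI how many slots (cores) each node has
node01 slots=2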

to have better control, i'm not using the paffinity feature in OpenMPI
but use 'numactl --physcpubind=xx bash' to restrict the cpus available
to mpirun. with my setup i can either run one copy of cp2k.popt on
each of the two cpus, or both on the two cores of the same cpu.

i get the following walltimes (all with the cp2k.popt binary):
1 cpu                             1010.39 s
2 cpu (2 cores on one cpu)         624.90 s
2 cpu (1 core each on 2 cpus)      550.87 s
4 cpu (with processor affinity)    362.83 s

so there is some significant speedup to be had. it is also reduced in
the case of quad core machines. we are currently buying them anyway,
because they cost almost the same as dual core, and with the
dual-dual-core layout one effectively has twice the cpu cache when
using only half the cores.

i guess speed thus depends a lot on the setup of the machine,
particularly on the speed of the memory vs. the speed of the cpu,
the quality of the MPI implementation, and the cache efficiency of the
compiled code (higher optimization and vectorization rarely help for
large package codes).

here are my compiler settings:
CC = cc
CPP =
FC = mpif90 -FR
LD = mpif90
AR = ar -r
DFLAGS = -D__INTEL -D__FFTSG -D__FFTW3 \
         -D__parallel -D__BLACS -D__SCALAPACK
CPPFLAGS = -traditional -C $(DFLAGS) -P
FCFLAGS = $(DFLAGS) -O2 -unroll -tpp6 -pc64 -fpp
LDFLAGS = -i-static -openmp $(FCFLAGS) -L/opt/intel/mkl/9.0/lib/em64t \
          -Wl,-rpath,/opt/intel/fce/9.1.040/lib:$(HOME)/openmpi/lib
LIBS = $(HOME)/lib/libscalapack.a $(HOME)/lib/blacsF77init_MPI-LINUX-0.a \
       $(HOME)/lib/blacs_MPI-LINUX-0.a -lmkl_lapack -lmkl_em64t -lfftw3

cheers,
axel.

Fawzi Mohamed

Jan 30, 2008, 1:06:35 AM1/30/08
to cp...@googlegroups.com
Hi Axel,

nice numbers... just for the record

> first off, contrary to fawzi's statement, cpu affinity in OpenMPI has
> to be explicitely enabled (e.g. via setting mpi_paffinity_alone=1 in
> ~/.openmpi/mca-params.conf).

well, I didn't say it was automatic :). what I meant was that I knew
that both LAM/MPI (on which I tested) and OpenMPI could do it, if
configured to do so, and then I described how to enable it with LAM/MPI...

> however what both LAM/MPI and
> particularly
> OpenMPI have activated by default are algorithms that can take
> advantage
> of locality and that require the correct specification of nodes.

yep, I think that Open MPI is the best choice for a well-performing
MPI, especially with multiple cores.
There is active development and they try to take advantage of the
latest hardware.

ciao
Fawzi

cavallo

Jan 30, 2008, 6:15:21 AM1/30/08
to cp2k

Thanks Axel,

Clearly something was wrong on my machine. The problem was related to
something in mpich2: I rebuilt it from scratch and things are much
better now.
Using cp2k/tests/QS/benchmarks/H2O-64.inp and H2O-256.inp as they
are, I got the following execution times and speedups:

           H2O-64.inp          H2O-256.inp
           secs   speedup      secs    speedup
1 core     2347   1.00         27518   1.00
2 cores    1286   1.83         16526   1.66
4 cores     863   2.72         16311   1.68

However, besides the MD steps, a lot of time is spent in the initial
~50 steps of SCF wavefunction optimization.

Since none of my execution times is close to yours, I wonder which
test you ran. Can you post the input/recipe if anything was changed
from the one in the cp2k test dir? Besides speedup, I am of course also
interested in absolute execution times, and your tests are sort of a
final target for me, since you tested on a machine extremely similar to
the one I am testing now.

Final question is: what's better, amd or intel ? Any experience on
this ?
Thanks again,
Luigi

Teodoro Laino

Jan 30, 2008, 6:31:55 AM1/30/08
to cp...@googlegroups.com
Ciao Luigi,

as we said already on a similar topic in this mailing list, there are
a few things that can speed up the SCF during an MD.
I will repeat them here hoping that these suggestions will also help
other people.
The most fundamental one is a good extrapolator and a good
preconditioner for the SCF.

Instead of those used in the tests you ran, you may want to try:

&QS
EXTRAPOLATION ASPC
EXTRAPOLATION_ORDER 3
&END

and in &OT

PRECONDITIONER FULL_SINGLE_INVERSE

Of course the choice of the preconditioner depends very much on the
system you're running, but as a preliminary trial this should be ok.


Moreover, every time OT is used to optimize the wavefunction it is
HIGHLY suggested to use a nested SCF procedure..
This does not help normal diagonalization schemes at all, but it
improves the convergence with OT tremendously.
This is the way to go:

&SCF
MAX_SCF 30
&END
&OUTER_SCF
MAX_SCF 5
&END

Don't use too many steps for the inner SCF level (something in the
range 30-40 should be ok) and only a few for the outer one (5-10)..
Try it and you will see a great improvement in the first SCF
optimization.. (a sketch of how these pieces fit together in one input
follows below)
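
Putting the fragments above together, the relevant part of the input
tree would look roughly like this (section nesting as I understand the
CP2K input reference of that time; double-check against your version's
manual before copying):

&FORCE_EVAL
  &DFT
    &QS
      EXTRAPOLATION ASPC
      EXTRAPOLATION_ORDER 3
    &END QS
    &SCF
      MAX_SCF 30
      &OT
        PRECONDITIONER FULL_SINGLE_INVERSE
      &END OT
      &OUTER_SCF
        MAX_SCF 5
      &END OUTER_SCF
    &END SCF
  &END DFT
&END FORCE_EVAL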

Keep in mind that the absolute running time depends on many things:
CUTOFF, basis set, preconditioner, extrapolation, threshold for the
convergence of the SCF, precision required in the integration and
collocation of the density, and so on..
So a full comparison can be done only with exactly the same input
file.

Cheers,
Teo

Axel

Jan 30, 2008, 7:31:19 AM1/30/08
to cp2k


On Jan 30, 6:15 am, cavallo <lcava...@unisa.it> wrote:
> Thanks Axel,
>
> Clearly something was wrong in my machine. The problem was related to
> something in mpich2. I rebuild it from scratch and things are much
> better now.

> Using the cp2k/tests/QS/benchmarks/ H2O-64.inp H2O-256.inp as they
> are, I got the following execution times and speedups:
>
> H2O-64.inp H2O-256.inp
> secs secs
> 1 cores 2347 1.00 27518 1.00
> 2 cores 1286 1.83 16526 1.66
> 4 cores 863 2.72 16311 1.68
>
> However, beside the MD steps, lot of time is spent in the starting 50
> steps for scf wf optimization.

this is just the way quickstep works. the first initial guess is not
very good, but for each subsequent step the extrapolator takes care
that the initial scf guess is improved. the main trick for efficient
MD is to tailor the extrapolator and SCF convergence to be most
efficient, and also to set it up so that you conserve energy.

> Since none of my exec times is close to yours, I wonder which test
> your ran. Can you post the input/recipe if any change from that in the
> cp2k test dir ? Beside speedup, I am also interested in absolute

i was using the 32 water example from the same directory where
you took the 64 and 256 water inputs from. i will run those when i get
into the office later.

> execution times, of course, and your tests are sort of final target
> for me, since you tested on a machine extremely similar to the one I
> am testing now.
>
> Final question is: what's better, amd or intel ? Any experience on
> this ?

there is no clear winner. it is the whole package that matters.
at high clock rates and core counts, memory bandwidth and
latencies become more important than the cpu type and speed.

how much a lack of memory bandwidth affects you depends on
your specific job. large quickstep jobs are affected the most.
right now, you can get a pretty good deal on 45nm intel quad cores.
amd cpus are by design less affected by memory bandwidth
restrictions, but _require_ you to use working NUMA control (cpu
and memory affinity) for good performance. this becomes more
evident when you have four-way machines (see the numactl sketch below).
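
as an illustration of what such NUMA control can look like in practice
(the node number is just an example; 'numactl --hardware' shows the
layout of a given box, and the input/output names are placeholders):

# pin a 2-process job to the cpus and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 mpirun -np 2 cp2k.popt H2O-64.inp > out.log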

cheers,
axel.

> Thanks again,
> Luigi

Axel

Jan 30, 2008, 7:38:23 AM1/30/08
to cp2k


On Jan 30, 1:06 am, Fawzi Mohamed <fa...@gmx.ch> wrote:
> Hi Axel,
>
> nice numbers... just for the record
>
> > first off, contrary to fawzi's statement, cpu affinity in OpenMPI has
> > to be explicitely enabled (e.g. via setting mpi_paffinity_alone=1 in
> > ~/.openmpi/mca-params.conf).
>
> well I didn't say it was automatic :), what I meant was that I knew
> that both LAM MPI (on which I tested) and openMPI could do it, if

please check the LAM/MPI sources. to the best of my knowledge
there is no processor affinity in LAM/MPI. i just downloaded the 7.1.4
sources and there are no calls to the respective system or library
functions.

> configured to do so, and then I told how to have it with LAM MPI...

sorry, but as i wrote, this only tells the MPI about locality, _not_
processor affinity. schedulers are still kicking processes around.


> > however what both LAM/MPI and
> > particularly
> > OpenMPI have activated by default are algorithms that can take
> > advantage
> > of locality and that require the correct specification of nodes.
>
> yep, I think that Open MPI is the best choice to have a well
> performing MPI, especially with multiple cores.
> There is active development and they try to take advantage of the
> latest hardware.

yes. OpenMPI has the best collective algorithms
amongst all available open source packages.

cheers,
axel.

>
> ciao
> Fawzi

Shawn T. Brown

Jan 30, 2008, 8:24:21 AM1/30/08
to cp...@googlegroups.com
I have a quad-core Clovertown here; would running these on it with OpenMPI be useful?

cavallo

Jan 30, 2008, 10:23:54 AM1/30/08
to cp2k
Dear all,

these are the final benchmarks on the following machine:
HP ProLiant DL140 with two dual-core Intel Xeon 5160 CPUs @ 3.00GHz.
kernel 2.6.21-1.3194.fc7
gcc version 4.1.2 20070925 (Red Hat 4.1.2-27)
ifort Build 20070613 Package ID: l_fc_c_10.0.025
intel mkl 10.0.1.014
fftw-3.1.2
mpich2-1.0.6p1

These are with the inputs from cp2k/tests/QS/benchmarks, no changes:

           h2o-32.inp         h2o-64.inp          h2o-256.inp
           secs  speedup      secs   speedup      secs    speedup
1 core      908  1.00         2347   1.00         27518   1.00
2 cores     511  1.78         1286   1.83         16526   1.66
4 cores     329  2.76          863   2.72         16311   1.68

These are the results for the 64w test with Teo's speedup tips (see
above):
1 core    1229   1.00
2 cores    671   1.83
4 cores    663   1.85

During the next days I'll try to run the same tests with OpenMPI.
Ciao,
Luigi

Axel

Jan 30, 2008, 5:47:07 PM1/30/08
to cp2k
for the sake of completeness, here is my data in the same style:

machine: 2x Intel(R) Xeon(R) CPU 5150 @ 2.66GHz (woodcrest)
cache size: 4096 KB

fedora core 6, kernel 2.6.22.14-72.fc6 #1 SMP
intel fortran 9.1.040 Build 20061101
intel mkl-9.0, fftw3, OpenMPI-1.2.1
optimization: -O2 -unroll -tpp6 -pc64

serial runs using: numactl --physcpubind=3 cp2k.sopt
single runs using: numactl --physcpubind=3 mpirun -np 1 cp2k.popt
dual-a runs using: numactl --physcpubind=0,1 bash ; mpirun -np 2 cp2k.popt
dual-b runs using: numactl --physcpubind=0,2 bash ; mpirun -np 2 cp2k.popt
quad runs using:   mpirun -np 4 cp2k.popt

OpenMPI with mpi_paffinity_alone = 1 in ~/.openmpi/mca-params.conf

# 32-water benchmark input: total wall time, scaling
serial   985.73   1.03
single  1010.39   1.00
dual-a   550.87   1.83
dual-b   624.90   1.62
quad     362.83   2.78

# 64-water benchmark input: total wall time, scaling
serial  2527.28   1.04
single  2616.59   1.00
dual-a  1402.91   1.86
quad     940.68   2.78

so scaling is quite similar.

shawn, it would be great if you could check on your machine, and in
particular use numactl to run an -np 4 job so that each process is on a
different dual-core die. i'm very curious to see whether you already
saturate the memory bandwidth with a single node, so that 4-way/node
would be faster than 8-way/node (a possible invocation is sketched
below).
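
for what it's worth, on a dual-socket quad-core box that could look
something like this (the core numbering is machine dependent, so check
/proc/cpuinfo or 'numactl --hardware' first; the input name is just a
placeholder):

# restrict the shell to one core per dual-core die (example numbering), then launch 4 ranks
numactl --physcpubind=0,2,4,6 bash
mpirun -np 4 cp2k.popt H2O-64.inp > H2O-64.out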

cheers,
axel.