Running CP2K in parallel using threads on a PC

2,024 views

Nikhil Maroli

unread,
Sep 20, 2019, 9:09:07 AM9/20/19
to cp2k
Dear all,

I have installed all versions of CP2K on my workstation, which has 2 x 12-core processors (48 threads in total).

I want to run CP2K in parallel using 42 threads. Can anyone share the commands I should use?

I have tried 

mpirun -n 42 cp2k.popt -i inp.inp -o out.out

After running this command, memory usage rises to 100% and the whole system freezes (I have 128 GB of RAM).

Any suggestions would be greatly appreciated.

Pierre Cazade

unread,
Sep 20, 2019, 10:45:55 AM9/20/19
to cp...@googlegroups.com
Hello Nikhil,

With the command "mpirun -n 42 cp2k.popt -i inp.inp -o out.out", you are requesting 42 MPI processes, not 42 OpenMP threads. MPI usually relies on replicated data, which means that, for a poorly programmed piece of software, it will request a total amount of memory equal to the memory required by a serial run times the number of processes. This can very quickly become problematic, in particular for QM calculations. OpenMP, however, relies on shared memory: the data is normally not replicated but shared between threads, so in an ideal scenario the amount of memory needed for 42 OpenMP threads is the same as for a single one.

This may explain why your calculation freezes: you are out of memory. On your workstation, you should use the executable "cp2k.ssmp", which is the OpenMP-only version. Then you don't need the mpirun command:

cp2k.ssmp -i inp.inp -o out.out

To control the number of OpenMP threads, set the environment variable OMP_NUM_THREADS, e.g. in bash: export OMP_NUM_THREADS=48

Now, if you need to balance between MPI and OpenMP, you should use the executable named cp2k.psmp. Here is such an example:

export OMP_NUM_THREADS=24
mpirun -n 2 cp2k.psmp -i inp.inp -o -out.out

In this example, I am requesting two MPI processes, each of which can use up to 24 OpenMP threads.
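As a quick sanity check before launching, the rank/thread split can be derived from the core count. A small bash sketch (variable names are illustrative, and cp2k itself is not invoked here):

```shell
# Choose an MPI x OpenMP split that exactly covers the available cores.
TOTAL_CORES=48            # e.g. 2 sockets x 12 cores x 2 hardware threads
OMP_NUM_THREADS=24        # OpenMP threads per MPI process
RANKS=$(( TOTAL_CORES / OMP_NUM_THREADS ))

# With the values above this prints the launch line for 2 MPI processes:
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS} mpirun -n ${RANKS} cp2k.psmp -i inp.inp -o out.out"

# Warn if the split does not cover the cores exactly (oversubscription risk):
if [ $(( RANKS * OMP_NUM_THREADS )) -ne "${TOTAL_CORES}" ]; then
  echo "warning: ${RANKS} x ${OMP_NUM_THREADS} != ${TOTAL_CORES} cores" >&2
fi
```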

Hope this clarifies things for you.

Regards,
Pierre
--
You received this message because you are subscribed to the Google Groups "cp2k" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cp2k+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cp2k/39284c57-f6eb-463e-81a6-3a123596a9f2%40googlegroups.com.

-- 
Dr Pierre Cazade, PhD
AD3-023, Bernal Institute,
University of Limerick,
Plassey Park Road,
Castletroy, co. Limerick,
Ireland

Nikhil Maroli

unread,
Sep 20, 2019, 11:29:23 AM9/20/19
to cp...@googlegroups.com
Thank you very much for your reply.
Could you please tell me how to use GPU in cp2k?
I have installed all the libraries and compiled CP2K with CUDA, but I couldn't find any instructions on assigning the GPU to calculations.

Pierre Cazade

unread,
Sep 20, 2019, 11:47:12 AM9/20/19
to cp...@googlegroups.com
Hi Nikhil,

This is an excellent question. I have not tried the GPU version of CP2K yet; I am actually trying to compile it on the cluster I use.

Normally, you only need to install the CUDA libraries and set the environment variables properly. The executable then detects the presence of the GPU automatically, provided you have installed the NVIDIA driver. At least, this is how GROMACS behaves, for example. Which Linux distribution are you using?

If you use the GPU, avoid using too many processes: ideally, one per GPU.

Regards,
Pierre

PS: Regarding your previous post: rather than "mpirun -n 2", try "mpirun -np 2". Finally, for a multi-node calculation on a cluster, you can use "mpirun -np 8 -ppn 2". The "-np" flag tells mpirun the total number of MPI processes requested, and "-ppn" tells it how many processes per node you want. In this example, I am using 4 nodes and want 2 MPI processes on each of them, so 8 in total. Of course, don't forget to set OMP_NUM_THREADS as well.
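The arithmetic behind "-np 8 -ppn 2" can be sketched in bash (numbers taken from the example above; the echo only prints the launch line, it does not run cp2k):

```shell
NODES=4      # number of cluster nodes requested
PPN=2        # MPI processes per node (-ppn)
NP=$(( NODES * PPN ))   # total MPI processes (-np)

# Prints the launch line from the example above: -np 8 -ppn 2
echo "mpirun -np ${NP} -ppn ${PPN} cp2k.psmp -i inp.inp -o out.out"
```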

Nikhil Maroli

unread,
Sep 20, 2019, 11:54:51 AM9/20/19
to cp...@googlegroups.com
Hello,

I have been using GROMACS with a GPU since 2014. Running CP2K doesn't show any information about the GPU. I'm using Ubuntu 16.04 LTS.



--
Regards,
Nikhil Maroli

Pierre Cazade

unread,
Sep 20, 2019, 1:01:36 PM9/20/19
to cp...@googlegroups.com
Hello,

Well, I assumed CP2K would do something similar. Since you are able to use GROMACS on your workstation without any issue, your system seems to be set up correctly for CUDA. Let's wait for a more experienced user to address this issue; I am also looking forward to the answer.

Regards,
Pierre

Chn

unread,
Sep 27, 2019, 10:03:54 PM9/27/19
to cp...@googlegroups.com
Hi Pierre,
I tried to combine OpenMP with MPI as you mentioned above for a vibrational analysis. I requested 6 MPI processes and got 6 output files named *-r-number.out; however, each file reports:
 GLOBAL| Total number of message passing processes                             1
 GLOBAL| Number of threads for this process                                  1
 GLOBAL| This output is from process                                       0
I also used the 'top' command and found that only 6 cores were busy.
I use 2 x 24-core processors, and set:
export OMP_NUM_THREADS=8
mpirun -n 6 /lib/CP2K/cp2k/exe/local/cp2k.psmp -i project.inp -o output.out
Any suggestion will be greatly appreciated..



On Friday, September 20, 2019 at 10:45:55 PM UTC+8, Pierre Cazade wrote:

Pierre-André Cazade

unread,
Sep 28, 2019, 8:14:54 AM9/28/19
to cp...@googlegroups.com
Hi Nikhil,

As you are using a mix of MPI and OpenMP, you have to use the executable with the extension psmp.

You can find a table describing all the executables in section 3 of the "how to compile" page:


Yet, it does not explain why your calculation behaved as if there were 6 independent calculations.

Please try the same calculation with the psmp executable and let me know how it goes.

Regards,
Pierre

 

From: cp...@googlegroups.com on behalf of Chn <chen...@gmail.com>
Sent: Saturday, September 28, 2019 3:03 a.m.
To: cp2k
Subject: Re: [CP2K:12283] Running Cp2k in parallel using thread in a PC
 
Hi Pierre,
I tried to combine openMP with MPI as you mentioned above when I do vibrational analysis. I required 6 MPI threads and got 6 output files named as *-r-number.out, however in each file it printed that:
 GLOBAL| Total number of message passing processes                             1
 GLOBAL| Number of threads for this process                                  1
 GLOBAL| This output is from process                                       0
also I used 'top' command and found that only 6 cores were busy. 
I use 2 x 24 core processor, and set:
export OMP_NUM_THREADS=8
mpirun -n 6 /lib/CP2K/cp2k/exe/local/cp2k.popt -i project.inp -o output.out

Ari Paavo Seitsonen

unread,
Sep 28, 2019, 3:08:34 PM9/28/19
to cp...@googlegroups.com
Hello,

  Maybe you received the six files because of this:

<<

NPROC_REP {Integer}

Specify the number of processors to be used per replica environment (for parallel runs). In case of mode selective calculations more than one replica will start a block Davidson algorithm to track more than only one frequency  [Edit on GitHub]

This keyword cannot be repeated and it expects precisely one integer.

Default value: 1
>>

https://manual.cp2k.org/trunk/CP2K_INPUT/VIBRATIONAL_ANALYSIS.html#list_NPROC_REP
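If the goal is a single replica that spans all six MPI processes, the keyword would be set in the vibrational analysis section of the input. A sketch only; see the manual page linked above for the exact semantics:

```
&VIBRATIONAL_ANALYSIS
  ! With 6 MPI processes, NPROC_REP 6 gives one replica spanning all of them;
  ! the default (NPROC_REP 1) gives six one-process replicas and six output files.
  NPROC_REP 6
&END VIBRATIONAL_ANALYSIS
```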

    Greetings from Paris,

       apsi



--
-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-=*=-
  Ari Paavo Seitsonen / Ari.P.S...@iki.fi / http://www.iki.fi/~apsi/
    Ecole Normale Supérieure (ENS), Département de Chimie, Paris
    Mobile (F) : +33 789 37 24 25    (CH) : +41 79 71 90 935

Chn

unread,
Sep 29, 2019, 11:31:19 PM9/29/19
to cp2k
Hi Pierre,
The six files were obtained with cp2k.psmp; when I used .ssmp, it output only one file, named *-r-0.out. The number of files seems to be connected to the number of processes I used, but I don't know whether it is really running as one parallel job.
Regards,
chn

On Saturday, September 28, 2019 at 8:14:54 PM UTC+8, Pierre-André Cazade wrote:

Chn

unread,
Sep 29, 2019, 11:38:32 PM9/29/19
to cp2k
Hi,
The keyword NPROC_REP has a default value of 1, so there should be only one *-r-*.out output file per job. Can I conclude that I got six files simply because I used six MPI processes? Is that normal in a parallel job?
regards,
chn

On Sunday, September 29, 2019 at 3:08:34 AM UTC+8, Ari Paavo Seitsonen wrote:
Message has been deleted

Pierre-André Cazade

unread,
Sep 30, 2019, 10:34:57 AM9/30/19
to cp2k
Hi,

I find the error you get strange: CP2K with MPI+OpenMP works fine for me. Just for the sake of validation, could you try:

export OMP_NUM_THREADS=X

mpirun -np Y cp2k.psmp -i input > output

rather than

 mpirun -np Y cp2k.psmp -i input -o output

Just to check.

Regards,
Pierre

Chn

unread,
Oct 4, 2019, 3:02:10 AM10/4/19
to cp2k
Hi,

Thanks for your reply! It really helped me.

chn

On Monday, September 30, 2019 at 10:17:29 PM UTC+8, Travis wrote:
Hi,

Vibrational analysis is done numerically for each mode. With many atoms, it takes a very long time. NPROC_REP requests that you split this job into smaller segments and use N processors for each of those tasks. For example,

mpiexec_mpt -np 2304 cp2k.7.0.psmp foo.inp > foo.log

with,

   NPROC_REP  576

dumps 4 files (numbered 0 to 3). These files follow the SCF procedure for each calculation. The final spectral data is written to the main output file (foo.log, above). This is written at the very end, not piecewise. You should also get a Molden format file for visualizing the spectrum. I'm using OMP_NUM_THREADS set to 1 in the example above. For jobs on a single node, I'd go with OMP_NUM_THREADS equal 1 when you already parallelize your calculations by splitting them up like this.
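The file count in the example above follows from the ratio of total MPI ranks to NPROC_REP. A sketch of the arithmetic only, not a CP2K invocation:

```shell
NP=2304          # total MPI ranks passed to mpiexec
NPROC_REP=576    # processors per replica (CP2K input keyword)
REPLICAS=$(( NP / NPROC_REP ))

echo "${REPLICAS} replica output files, numbered 0 to $(( REPLICAS - 1 ))"
# -> 4 replica output files, numbered 0 to 3
```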

-T

Chn

unread,
Oct 4, 2019, 3:07:20 AM10/4/19
to cp2k
Hi,

I have solved the problem: the environment variables I tried to set with 'export' were not taking effect. Perhaps something is wrong with my OS.
Thanks a lot for your reply!

regards,
chn

On Monday, September 30, 2019 at 10:34:57 PM UTC+8, Pierre-André Cazade wrote:

Matthew Graneri

unread,
May 18, 2022, 5:35:32 AM5/18/22
to cp2k
Hi Pierre,

I found this thread really valuable! Unfortunately, being very new to AIMD and very unfamiliar with computation in general, I was wondering if I might get some advice. We have an HPC cluster at my university where each node has 34 processors and ~750 GB of RAM available; it runs a Slurm queueing system.

Until now, I've run all my jobs using: mpirun -np $SLURM_NTASKS cp2k.popt -i input.inp -o output.out
where $SLURM_NTASKS is whatever number of processors I've allocated to the job via the --ntasks=x flag.

So instead, I'm thinking it might be more appropriate to use the .psmp executable, but I'm not sure what the difference between OpenMP and MPI threads is, what ratio of OMP to MPI threads would be most effective for speeding up an AIMD job, or how many threads of each type can be added before the parallelisation becomes less efficient.

Do you (or anyone else) have any advice on the matter? Is it better to have more OMP or MPI threads? And how many OMP threads per MPI thread would be appropriate? What kinds of ratios are most effective at speeding up calculations?

I would really appreciate any help I can get!

Regards,

Matthew

Pierre-André Cazade

unread,
May 18, 2022, 6:23:50 AM5/18/22
to cp...@googlegroups.com

Hi Matthew,

 

Unfortunately, there's no single way to determine the best MPI/OpenMP split. It is system, calculation type, and hardware dependent, so I recommend testing the performance. The first thing to check is whether your CPUs are multithreaded. For example, if they have 34 cores and 2 virtual cores per physical core (68 virtual cores in total), you could try OMP_NUM_THREADS=2 and keep mpirun -np (34 * #nodes).

 

Roughly speaking, MPI creates multiple replicas of the calculation (called processes), each replica dealing with part of the work; CP2K is efficiently parallelized with MPI. OpenMP generates multiple threads on the fly, generally to parallelize a loop. OpenMP can be used within an MPI process but not the other way around. Typically, more MPI processes consume more memory than the same number of OpenMP threads. To use multiple nodes, MPI is mandatory and more efficient. These are generalities and, again, combining both is best, but the ideal ratio varies. Testing is the best course of action: check which combination yields the largest number of ps/day with the minimum hardware resources. Doubling the hardware does not double the output, so increasing the number of nodes becomes a waste of resources at some point. As a rule of thumb, if the increase in output is less than 75-80% of the ideal case, it is not worth it.
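The 75-80% rule of thumb above can be written as a small check. The throughput numbers here are illustrative, not measurements:

```shell
PSDAY_BASE=100       # measured ps/day on N nodes
PSDAY_DOUBLE=160     # measured ps/day on 2N nodes
EFF=$(( 100 * PSDAY_DOUBLE / (2 * PSDAY_BASE) ))   # % of ideal scaling

echo "scaling efficiency: ${EFF}%"   # -> scaling efficiency: 80%
if [ "${EFF}" -lt 75 ]; then
  echo "doubling the node count is probably a waste of resources"
fi
```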

 

As you can see, there is a lot of trial and error; no systematic rule, I'm afraid.

 

Regards,

Pierre

Sam Broderick

unread,
May 20, 2022, 7:16:43 AM5/20/22
to cp2k
Hi Everyone

While I haven't figured out the GPU side of things (by the way, only part of CP2K is GPU-optimized), I found the following approach useful for mpirun. Note that many people do not recommend hyper-threading for this kind of application, so this command does not use hyper-threading.

     mpirun -n 2 --bind-to numa --map-by numa -display-map cp2k.psmp -i my-cp2k-run.inp > my-cp2k-run.out

  1. The '--bind-to numa' and '--map-by numa' options make use of the OS's understanding of the processor topology.
  2. Together they neatly place the MPI ranks per CPU socket.
  3. The '-display-map' option writes the MPI rank assignments at the beginning of the output.
Hope this helps!

Kind Regards

Sam

Matthew Graneri

unread,
Jun 6, 2022, 7:58:42 AM6/6/22
to cp2k
Hi Pierre,

Sorry it's taken so long to reply. Your reply really helped. Thank you!

Regards,

Matthew

Corrado Bacchiocchi

unread,
Feb 14, 2023, 5:54:21 AM2/14/23
to cp2k
Hi Everyone,

thanks for the many suggestions.

I have a single-node server with two Xeon Gold 6154 CPUs @ 3.00 GHz, 18 cores each, 36 in total, 72 threads.
I have found that the following launch command:

mpirun -np 72 --bind-to hwthread  cp2k.psmp -i cp2k.inp -o cp2k.out

performs about 5x faster than

mpirun -n 2 --bind-to numa --map-by numa -display-map cp2k.psmp -i cp2k.inp -o cp2k.out

Regards
Corrado

Léon Luntadila Lufungula

unread,
Jun 29, 2023, 4:01:05 PM6/29/23
to cp2k
Hi Corrado,

I have a similar (somewhat older) single-node server with two Xeon Gold 6152 CPUs @ 2.10 GHz, 22 cores each, 44 in total, 88 threads (hyper-threading enabled). I am currently running calculations with only OpenMP parallelization, as I have a limited amount of RAM (32 GB), but I was wondering whether I could also benefit from mpirun with the --bind-to hwthread option, perhaps with -np 2 and OMP_NUM_THREADS=44?

Out of interest, how much RAM do you have in your machine? I am thinking of suggesting we add more RAM to ours so that I can run heavier calculations, because I easily hit the memory limit. For some calculations I currently have to idle half of my processors so that a 44-core run can use all the RAM on the node.

Kind regards,
Léon
