Memory usage in hybrid DFT calculations


Maxime Van den Bossche

Jun 2, 2020, 4:47:05 AM
to cp...@googlegroups.com
Dear CP2K developers,


I have started to benchmark CP2K on our clusters, and there are a few things that I don't
fully understand about the memory consumption of hybrid DFT calculations.


1. What needs to be considered when setting MAX_MEMORY such that CP2K's total memory usage
does not exceed a certain total amount "X" per compute node? Here, MAX_MEMORY is the maximum
amount of memory per MPI process, in MiB, that the HFX module is allowed to use.

At first, I thought it would simply involve dividing this total amount "X" by the number of MPI processes
per node, and then subtracting a modest amount that will be needed by the rest of the program
(i.e. for modules other than the HFX module).

In the attached down-scaled version of the LiH-HFX benchmark, for example, the GGA prerun
on a 28-core node utilizes circa 250 MiB per MPI process with cp2k.popt. In a subsequent HFX
calculation on the same node with MAX_MEMORY = 1500 MiB, the HFX module uses all of its
allowed memory, which is fine. But the job's total memory usage amounts to around 3000 MiB
per MPI process, as given by CP2K's "max memory usage/rank" output and a check with "htop".

This is considerably larger than the 1500 + 250 = 1750 MiB per MPI process that I naively expected.
Hence, if I were to target a total memory usage of 1750 MiB per MPI process, I would need to use
a significantly lower value for MAX_MEMORY. So is there a more sophisticated "rule of thumb"
for predicting the total memory usage, or does this involve some trial and error?
Or is this not even how CP2K is supposed to behave?
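
For concreteness, the naive budget described above boils down to the following arithmetic
(a rough sketch in Python, using the numbers from the benchmark above; the 1750 MiB figure is
just the per-rank target I had in mind):

# Naive per-rank budget (the assumption questioned above), using the numbers from
# the down-scaled LiH-HFX benchmark on a 28-core node.
target_per_rank_mib = 1750   # intended total memory per MPI process
baseline_mib = 250           # observed usage of the GGA prerun per rank
naive_max_memory = target_per_rank_mib - baseline_mib
print(naive_max_memory)      # 1500 MiB -- yet the observed total usage was ~3000 MiB/rank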


2. What should I expect from using multiple OpenMP threads per MPI process in terms of memory
consumption? Suppose that, with OMP_NUM_THREADS=1, MAX_MEMORY is such that 50% of the ERIs
cannot be stored and need to be (re)calculated on the fly. Suppose then that I set
OMP_NUM_THREADS=2, use half the number of MPI processes, and double MAX_MEMORY, so that the
memory usage of the HFX module stays the same. Should this then allow all ERIs to be
stored in memory?
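
To make the bookkeeping behind this question explicit, here is a small sketch (plain Python),
using the MAX_MEMORY = 1500 MiB and the 28-core node from question 1, of the total per-node
memory available to the HFX module for storing ERIs:

# Total per-node memory available for stored ERIs = ranks * MAX_MEMORY (MiB).
setups = {
    "28 ranks x 1 thread": {"ranks": 28, "max_memory": 1500},
    "14 ranks x 2 threads": {"ranks": 14, "max_memory": 3000},
}
for name, s in setups.items():
    print(name, "->", s["ranks"] * s["max_memory"], "MiB of ERI storage per node")
# Both setups give 42000 MiB per node: the swap leaves the per-node ERI storage unchanged.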


Best,
Maxime
inputs.zip

Maxime Van den Bossche

Jul 14, 2020, 5:10:24 AM
to cp2k
Hello,

I received a response from Prof. Hutter (pasted below), which helps to resolve
the questions in my original post:

1. Compared to GGA calculations, HFX calculations need extra memory not only for storing
the ERIs, but also for the density matrix and the KS matrix.

2. Reducing the number of MPI processes, increasing the number of OpenMP threads, and scaling
up the MAX_MEMORY value accordingly has no influence on the percentage of ERIs that can
be stored. This is simply because the total memory available to the HFX module stays constant
and the integrals are not duplicated across the MPI processes.

However, trading MPI processes for OpenMP threads can still be beneficial: the reduced
duplication of the 'baseline' and of the density/KS matrices across the MPI processes may
allow MAX_MEMORY to be increased further (provided that the overhead per OpenMP
thread is not too large).

best,
Maxime


-------------------------------------------------------------------------------------------------------------------------------

Hi

 

The MAX_MEMORY keyword is used for the storing of integrals per MPI task. Each MPI task
needs, in addition, memory for the replicated density and KS matrices. The OpenMP threads also
need some memory to hold local data; these are mostly buffers for integrals, which can be
rather large for large basis sets.

The total memory needed per MPI task (using N OpenMP threads) is

Baseline + MAX_MEMORY + 2 full matrices + N*(OpenMP overhead)

All of this is system dependent, and finding the best combination of MPI/OpenMP also depends
on your hardware, e.g. CPUs per node and memory per node.
In most cases a good starting point is to use one MPI task per CPU and N OpenMP threads
(no hyperthreading). I wouldn't go beyond 8 or 12 OpenMP threads, though, but rather increase
the number of MPI tasks.

regards

Juerg Hutter

-------------------------------------------------------------------------------------------------------------------------------
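
To make the formula above concrete, here is a rough per-node accounting sketch in Python.
The baseline value is the GGA-prerun figure from the first post; the matrix size and the
per-thread OpenMP overhead are hypothetical placeholders, since these are system dependent.

# Per-node accounting following the formula above:
#   per_rank = Baseline + MAX_MEMORY + 2*full_matrix + N_threads*(OpenMP overhead)
# Sizes in MiB; full_matrix and omp_overhead are made-up placeholder values.
def max_memory_budget(node_mem, n_ranks, n_threads,
                      baseline=250, full_matrix=100, omp_overhead=50):
    """Largest MAX_MEMORY per rank that keeps the whole node under node_mem MiB."""
    fixed_per_rank = baseline + 2 * full_matrix + n_threads * omp_overhead
    return node_mem // n_ranks - fixed_per_rank

node_mem = 28 * 1750  # node budget "X" corresponding to 1750 MiB/rank on a 28-core node
for ranks, threads in [(28, 1), (14, 2), (7, 4)]:
    budget = max_memory_budget(node_mem, ranks, threads)
    print(f"{ranks:2d} ranks x {threads} threads: MAX_MEMORY <= {budget} MiB "
          f"({ranks * budget} MiB of ERI storage per node)")
# Fewer ranks mean fewer per-node copies of the baseline and of the two matrices, so more
# of the node budget can go into MAX_MEMORY, as long as the per-thread overhead stays small.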