Dear CP2K developers,
I have started to benchmark CP2K on our clusters, and there are a few things that I don't
fully understand about the memory consumption of hybrid DFT calculations.
1. What needs to be considered when setting MAX_MEMORY so that CP2K's total memory usage
does not exceed a given total amount "X" per compute node? Here, MAX_MEMORY is the maximum
amount of memory per MPI process, in MiB, that the HFX module is allowed to use.
At first, I thought it would simply be a matter of dividing the total amount "X" by the number of
MPI processes per node, and then subtracting a modest amount needed by the rest of the program
(i.e., by modules other than the HFX module).
In the attached down-scaled version of the LiH-HFX benchmark, for example, the GGA prerun
on a 28-core node uses roughly 250 MiB per MPI process with cp2k.popt. In a subsequent HFX
calculation on the same node with MAX_MEMORY = 1500 MiB, the HFX module uses all of its
allowed memory, which is fine. But the job's total memory usage amounts to around 3000 MiB
per MPI process, as given by CP2K's "max memory usage/rank" output and a check with "htop".
This is considerably larger than the 1500 + 250 = 1750 MiB per MPI process that I naively expected.
Hence, if I were to target a total memory usage of 1750 MiB per MPI process, I would need to use
a significantly lower value for MAX_MEMORY. So is there a more reliable rule of thumb
for predicting the total memory usage, or does this involve some trial and error?
Or is this not even how CP2K is supposed to behave?
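To make the numbers concrete, this is the budgeting arithmetic I had in mind (a small Python sketch using the illustrative values from the run above; the variable names are my own):

```python
# Naive per-process memory budget I assumed (all values in MiB).
total_per_node = 1750 * 28      # target "X" for a 28-core node
n_mpi_per_node = 28
non_hfx_overhead = 250          # per-rank usage seen in the GGA prerun

# Naive rule: whatever is left after the non-HFX overhead goes to MAX_MEMORY.
max_memory = total_per_node // n_mpi_per_node - non_hfx_overhead
print(max_memory)               # 1500 MiB per rank

# Observed instead: with MAX_MEMORY = 1500, "max memory usage/rank"
# reports ~3000 MiB, i.e. roughly double my naive estimate of 1750 MiB.
```

So the question is which term in this naive budget is wrong, or what it is missing.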
2. What should I expect from using multiple OpenMP threads per MPI process in terms of memory
consumption? Suppose that MAX_MEMORY is such that 50% of the ERIs cannot be stored and
need to be (re)calculated on-the-fly, for OMP_NUM_THREADS=1. Suppose then that I set
OMP_NUM_THREADS=2, use half the number of MPI processes, and double the MAX_MEMORY,
to arrive at the same memory usage by the HFX module. Should this then allow all ERIs to be
stored in memory?
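The trade-off I am imagining is the following (again a sketch with made-up but representative numbers; the reasoning about replicated data is my assumption, not something I have verified):

```python
# Thread/rank trade-off from question 2 (all values in MiB).
max_memory_1t = 1500                 # per rank, OMP_NUM_THREADS = 1
ranks_1t = 28                        # one rank per core

max_memory_2t = 2 * max_memory_1t    # doubled MAX_MEMORY
ranks_2t = ranks_1t // 2             # half the ranks, OMP_NUM_THREADS = 2

# The per-node HFX budget is identical in both setups:
assert max_memory_1t * ranks_1t == max_memory_2t * ranks_2t  # 42000 MiB

# My (unverified) assumption: with fewer ranks, less data is replicated
# across MPI processes, so more of the HFX budget is free for ERIs.
# Does that let the remaining 50% of the ERIs fit in memory?
```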
Best,
Maxime