Hybrid functional calculation problem


Lucas Lodeiro

Nov 21, 2020, 7:21:05 PM
to cp...@googlegroups.com
Hi all, 
I need to perform a hybrid calculation with CP2K 7.1 on a big system (over 1000 atoms). I have studied the manual, the tutorials, and some videos by CP2K developers to improve my input, but the program aborts the calculation when the HF part is running... I watched the memory usage on the fly, and there is no peak that would explain the failure (I used 4000 MB with 220 processors).
The output does not give any explanation... Suspecting memory, I tried a large-memory node on our cluster, using 15000 MB with 220 processors, but the program exits at the same point without a message, just killing the process.
The output shows a warning:

 *** WARNING in hfx_energy_potential.F:591 :: The Kohn Sham matrix is not  ***
 *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
 *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
 *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***

but I read that this is not a very serious issue, and that the calculation should continue rather than crash.
I also decreased EPS_PGF_ORB, but the warning and the problem persist.

I do not know whether the problem could be located in other parts of my input... for example, I use PBE0-TC-LR (with PBC in XY) and ADMM. In the ADMM options, I use ADMM_PURIFICATION_METHOD = NONE, because I read that ADMM1 is the only variant useful for smearing calculations.

I ran this system with PBE (for the first guess for PBE0), and there is no problem in that case.
Moreover, I tried other CP2K versions (7.0, 6.1 and 5.1) compiled on the cluster (with libint_max_am=6), and the calculation crashes, but shows this message:

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/       CP2K and libint were compiled with different LIBINT_MAX_AM.    *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                                hfx_libint_wrapper.F:134 *
 *******************************************************************************


 ===== Routine Calling Stack =====

            2 hfx_create
            1 CP2K

It seems this problem is not present in version 7.1, as the program does not show it, and the compilation information does not report a LIBINT_MAX_AM value...

If somebody could give me some advice, I would appreciate it. :)
I attach the input file and the output file for version 7.1.

Regards - Lucas Lodeiro

system.inp
system.out

fabia...@gmail.com

Nov 22, 2020, 10:53:27 AM
to cp2k
Dear Lucas,

cp2k computes the four-center integrals during (or prior to) the first SCF cycle. I assume the job ran out of time during this task. For a system with more than 1000 atoms this takes a lot of time; with only 220 CPUs it could in fact take several hours.

To speed up the calculation you should use SCREEN_ON_INITIAL_P T and restart from a well-converged PBE wfn. Other than that, there is little you can do besides giving the job more time and/or CPUs. (Of course, reducing CUTOFF_RADIUS 8.62 would help too, but could negatively affect the result.)
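For reference, a minimal sketch of how this could look in the input (keyword names are from the CP2K manual; the restart file name, the surrounding sections, and the threshold value are placeholders to adapt to your own input):

```
&DFT
  ! Placeholder path: the converged PBE wavefunction to restart from
  WFN_RESTART_FILE_NAME  system-PBE-RESTART.wfn
  &SCF
    SCF_GUESS RESTART
  &END SCF
  &XC
    &HF
      &SCREENING
        ! Screen ERIs using the density matrix of the restart wfn
        SCREEN_ON_INITIAL_P  TRUE
        EPS_SCHWARZ          1.0E-6
      &END SCREENING
    &END HF
  &END XC
&END DFT
```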

Cheers,
Fabian

Lucas Lodeiro

Nov 22, 2020, 11:24:13 AM
to cp...@googlegroups.com
Dear Fabian,

Thanks for your advice. I forgot to clarify the execution time... my mistake.
The calculation runs for 5 to 7 minutes and then stops... the walltime was set to 72 hours, so I do not believe that is the problem. Now I am running the same input on a smaller cluster (different from the one with the problematic crash) with 64 processors and 250 GB RAM, and the calculation works fine (very slowly, 9 hours per SCF step, but it runs... the total RAM assigned for the ERIs is not sufficient, but the problem does not appear)... It is not practical to use this small cluster, so I need to fix the problem on the big one, where I can use more RAM and more processors (more than 220 is possible). But as the program does not show what is happening, I cannot tell the cluster admin anything to recompile or fix. :(

This is the output in the little cluster:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------

  HFX_MEM_INFO| Est. max. program size before HFX [MiB]:                    1371

 *** WARNING in hfx_energy_potential.F:605 :: The Kohn Sham matrix is not  ***

 *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
 *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
 *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***

  HFX_MEM_INFO| Number of cart. primitive ERI's calculated:       27043173676632
  HFX_MEM_INFO| Number of sph. ERI's calculated:                   4879985997918
  HFX_MEM_INFO| Number of sph. ERI's stored in-core:                116452577779
  HFX_MEM_INFO| Number of sph. ERI's stored on disk:                           0
  HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:        4763533420139
  HFX_MEM_INFO| Total memory consumption ERI's RAM [MiB]:                 143042
  HFX_MEM_INFO| Whereof max-vals [MiB]:                                     1380
  HFX_MEM_INFO| Total compression factor ERI's RAM:                         6.21
  HFX_MEM_INFO| Total memory consumption ERI's disk [MiB]:                     0
  HFX_MEM_INFO| Total compression factor ERI's disk:                        0.00
  HFX_MEM_INFO| Size of density/Fock matrix [MiB]:                           266
  HFX_MEM_INFO| Size of buffers [MiB]:                                        98
  HFX_MEM_INFO| Number of periodic image cells considered:                     5
  HFX_MEM_INFO| Est. max. program size after HFX  [MiB]:                    3834

     1 NoMix/Diag. 0.40E+00 ******     5.46488333    -20625.2826573514 -2.06E+04

About SCREEN_ON_INITIAL_P, I read that to use it you need a very good guess (better than the converged GGA one), for example the last step or frame of a GEO_OPT or MD... Is it really useful when the guess is the GGA wavefunction?
About CUTOFF_RADIUS, I read that 6 or 7 is a good compromise, and as my cell is approximately twice that, I used the minimum image convention to choose 8.62, which is near the recommended value (6 or 7). If I reduce it, does the computational cost diminish considerably?

Regards - Lucas


fabia...@gmail.com

Nov 22, 2020, 11:42:47 AM
to cp2k
Can cp2k access all the memory on the cluster? On Linux you can use
ulimit -s unlimited
to remove the stack-size limit on a process.

I usually use SCREEN_ON_INITIAL_P. I found that for large systems it is faster to run two energy minimizations with the keyword enabled (such that the second restarts from a converged PBE0 wfn) than to run a single minimization without SCREEN_ON_INITIAL_P. But that probably depends on the system you simulate.

You should converge the cutoff with respect to the properties you are interested in. Run a test system with increasing cutoff and look at, e.g., the energy, the PDOS, etc.

Number of sph. ERI's calculated on the fly:        4763533420139
This number should always be 0. If it is larger, increase the memory cp2k has available.

Fabian

Matt W

Nov 22, 2020, 11:55:18 AM
to cp2k
Your input has

        &MEMORY
          MAX_MEMORY           4000
          EPS_STORAGE_SCALING  0.1
        &END MEMORY

This means that each MPI task (which can be multiple cores) should be able to allocate 4 GiB of memory _exclusively_ for the two-electron integrals. If less than that is available, it will crash because the memory allocation cannot occur. I guess your main cluster has less memory per task than the smaller one. You need to leave space for the operating system and the rest of the cp2k run besides the two-electron integrals.
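As an illustration (the numbers here are hypothetical; pick them for your own nodes): on a 192 GB node running 44 MPI ranks, each rank has roughly 4.3 GB in total, so MAX_MEMORY must be set well below that to leave headroom, e.g.:

```
&MEMORY
  ! ~2.5 GiB per MPI rank reserved exclusively for ERI storage;
  ! the remaining ~1.8 GiB per rank is left for the OS and the
  ! rest of the cp2k run (matrices, buffers, etc.)
  MAX_MEMORY           2500
  EPS_STORAGE_SCALING  0.1
&END MEMORY
```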

There is another thread from earlier this year where Juerg answers HFX memory questions in more detail.

Matt

Lucas Lodeiro

Nov 22, 2020, 5:08:17 PM
to cp...@googlegroups.com
Hi Fabian and Matt,

About access to the memory: I have run calculations for months using 90% of the node RAM without problems. But to check, I set ulimit -s unlimited. There are some changes: before using ulimit, the calculation crashed while RAM usage was quite low (15%); after using ulimit, the calculation still crashes, but RAM usage shows a sustained rise up to the limit before the crash. This is a change. I attach an image.

About SCREEN_ON_INITIAL_P, I will use it on the small cluster. I like the idea of running two calculations as climbing steps.

I know that the number of ERIs calculated on the fly should be 0, and if it is different from zero, I need more RAM to store them so they are not recalculated at each SCF step. But in the case of the small cluster, I am already using all processor and RAM resources. By the way, the calculation runs without problems when the ERIs are calculated on the fly at each SCF step; it is just very slow.

About what Matt comments: on the small cluster, I have a single node with 250 GB RAM. So I use MAX_MEMORY = 2600, which is a total of 166.4 GB for the ERIs (the output reports 143 GB), leaving the rest for the whole program.
On the big cluster, we have access to many nodes with 44 processors and 192 GB RAM, and 9 nodes with 44 processors and 768 GB RAM. In the first case, I use 5 nodes (220 processors) with all the memory (960 GB), setting MAX_MEMORY = 4000 (4.0 GB * 220 proc = 880 GB RAM for ERIs). In the second case, I use 5 nodes (220 processors) with all the memory (3840 GB), setting MAX_MEMORY = 15000 (15.0 GB * 220 proc = 3300 GB RAM for ERIs).
In both cases the calculation crashes... I do not know if I am being naive, but 3.3 TB of RAM seems, at least, enough to store a large share of the ERIs...

Using the data reported in the output of the small cluster:
  HFX_MEM_INFO| Number of sph. ERI's calculated:                   4879985997918
  HFX_MEM_INFO| Number of sph. ERI's stored in-core:                116452577779
  HFX_MEM_INFO| Number of sph. ERI's stored on disk:                           0
  HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:        4763533420139

The stored ERIs are about 1/42 of the total, and use 166.4 GB (143 GB reported)... So if I want to store all of them, I need 166.4 GB * 42 = ~7.0 TB... Is that correct?
I could get 7.0 TB of RAM using 9 nodes with 768 GB each. But I am not convinced that the amount of RAM is the problem, because on the small cluster the calculation runs while computing almost all ERIs at each SCF step...

I am a little surprised that the calculation runs on the small cluster but not on the big one.
Do you suspect some other related problem?

Regards - Lucas



graph.png

fabia...@gmail.com

Nov 23, 2020, 12:18:18 PM
to cp2k
Your graph nicely shows that cp2k runs out of memory. As Matt wrote, you have to decrease MAX_MEMORY to leave enough memory for the rest of the program. Here are some details on memory consumption with HF: https://groups.google.com/g/cp2k/c/DZDVTIORyVY/m/OGjJDJuqBwAJ

Of course you can recalculate some of the ERIs in each SCF cycle, but that slows down the minimization by a lot; I'd advise against doing that. Try to use screening, set a proper value for MAX_MEMORY, and use all the resources you have to store the ERIs.

Fabian

Lucas Lodeiro

Nov 24, 2020, 12:53:51 AM
to cp...@googlegroups.com
Thanks for your advice!

Now I can run it, at least. It is very slow, but it runs. The difference between the small and the big cluster was that, on the small one, the total RAM consumption is practically MPI_PROCESSES * (baseline + MAX_MEMORY + 2 full matrices), as Prof. Hutter explains, but on the big one there are some cluster processes which consume 5 or 10% of each node... so I had to optimize MAX_MEMORY by running some tests...

About the ERIs, it is very difficult to get 7 TB for them... I can take 4 TB without problems, but taking a whole cluster partition is difficult. I will try using the SCREENING option to speed things up, computing some ERIs on the fly.

Regards - Lucas Lodeiro


Matt W

Nov 24, 2020, 6:28:33 AM
to cp2k
I think there is an option to run mixed MPI/OpenMP. If you run the cp2k.psmp executable and give 2 or 4 threads per MPI process, you can give more memory per process for the integrals. If calculating on the fly is dominating, that might be a good option.
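A sketch of such a launch, assuming an mpirun-style launcher and the psmp binary name (the launcher flags, rank count, and file names are placeholders to adapt to your scheduler):

```
# 44-core node: 11 MPI ranks x 4 OpenMP threads instead of 44 x 1,
# so each rank can be given ~4x more MAX_MEMORY for the ERIs
export OMP_NUM_THREADS=4
mpirun -np 11 cp2k.psmp -i system.inp -o system.out
```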

Matt

Lucas Lodeiro

Nov 24, 2020, 10:30:54 AM
to cp...@googlegroups.com
Thanks Matt!

Now I am using the popt flavor of 7.1; I will ask for the psmp flavor of CP2K to be compiled. Today I optimized MAX_MEMORY, and with 400 processors each SCF step needs 30 minutes (calculating most of the ERIs on the fly), which is an affordable time for the whole calculation.

Regards - Lucas Lodeiro
