Possible Memory Leak


Matthew Emerson

Sep 20, 2022, 17:21:07
to cp2k
Dear CP2K Developers/Community,

I have attached an input file which I believe shows an example of a possible memory leak in CP2K. It is a typical NVT DFT simulation of molten MgCl2 with PBE-D3 dispersion corrections. 

I have tried well-tested CP2K builds on our local cluster (v6.1, v8.1, v2022.1), our university supercomputer (v6.1, v8.1), and even the system-wide installations of CP2K on Cori at Oak Ridge National Lab (v8.1 and v9.1). In every case, memory usage grows linearly with time until either the node locks up/dies from running out of memory or the job hits the maximum walltime (ORNL). I've done enough testing that I can predict fairly accurately how many MD steps a job will survive on a given machine with X amount of RAM and Y MPI ranks.
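With the linear growth described above, the number of MD steps a job can survive follows from simple arithmetic. The sketch below is only an illustration, assuming every MPI rank starts from the same baseline footprint and then grows by a constant amount per MD step; the function name and all numbers in the usage note are hypothetical, not measurements from this thread.

```python
from typing import Optional


def steps_until_oom(node_ram_gib: float, ranks: int,
                    baseline_gib_per_rank: float,
                    growth_mib_per_step_per_rank: float) -> Optional[int]:
    """Estimate how many MD steps fit before the node runs out of memory.

    Model (an assumption): each rank starts at a fixed baseline and then
    grows by a constant amount per MD step. Returns None if there is no
    growth, i.e. memory never runs out under this model.
    """
    headroom_gib = node_ram_gib - ranks * baseline_gib_per_rank
    if headroom_gib <= 0:
        return 0  # the baseline alone already exceeds the node's memory
    growth_gib_per_step = ranks * growth_mib_per_step_per_rank / 1024
    if growth_gib_per_step <= 0:
        return None  # no growth: the job never runs out under this model
    return int(headroom_gib // growth_gib_per_step)
```

For example, `steps_until_oom(192, 56, 1.5, 2)` gives 987 steps for a hypothetical 192 GiB node running 56 ranks that start at 1.5 GiB each and grow by 2 MiB per rank per step. OS and MPI buffer overhead also consume memory, so under this model the result is an upper bound.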

I normally wouldn’t email about things like this, but I’ve tried multiple combinations (with unit testing) of GCC, OpenMPI, OpenBLAS/MKL, etc., and nothing seems to work. I am hoping this is simply an input file issue or my own error.


Any help will be much appreciated.

Matthew S. Emerson
Margulis Research Group
Department of Chemistry
The University of Iowa

mgcl2-300ions.xyz
aimd-nvt.inp

Krack Matthias (PSI)

Sep 21, 2022, 05:00:21
to cp...@googlegroups.com

Hello Matthew

 

I have run your case on our local compute cluster with CP2K v2022.1 (gnu 11.2.0, OpenMPI 4.1.3) using 144 CPU cores. I observe only a small increase in memory usage after the usual initial growth during the first MD steps (see attached plot).
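Telling that initial growth apart from a genuine leak amounts to fitting a slope only to the samples taken after the warm-up phase. A minimal sketch, assuming memory has already been recorded as (MD step, MiB) pairs; the function name and the warm-up cutoff are hypothetical choices, and an ordinary least-squares fit is just one reasonable estimator.

```python
from typing import Sequence


def growth_after_warmup(steps: Sequence[int], mem_mib: Sequence[float],
                        warmup_steps: int) -> float:
    """Least-squares slope of memory usage (MiB per MD step), ignoring
    samples taken before `warmup_steps`, where allocations normally ramp up."""
    pts = [(s, m) for s, m in zip(steps, mem_mib) if s >= warmup_steps]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two samples after the warm-up period")
    mean_s = sum(s for s, _ in pts) / n
    mean_m = sum(m for _, m in pts) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in pts)
    sxy = sum((s - mean_s) * (m - mean_m) for s, m in pts)
    if sxx == 0:
        raise ValueError("samples after warm-up must span more than one step")
    return sxy / sxx
```

A slope close to zero after the warm-up matches the plateau described above, while a steady positive slope corresponds to the linear growth reported in the first message.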

 

Best regards

 

Matthias

 

--
You received this message because you are subscribed to the Google Groups "cp2k" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cp2k+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cp2k/0c20a8f2-3658-4429-a2e9-7f4aa0edb321n%40googlegroups.com.

Matthew Emerson

Sep 21, 2022, 09:38:43
to cp2k
Dear Dr. Matthias, 

Do you have an ARCH file for this build that you could point me to? I would like to build using the same settings to test at ORNL.

Sincerely,
Matthew S. Emerson

Krack Matthias (PSI)

Sep 21, 2022, 10:23:41
to cp...@googlegroups.com

Hi Matthew

 

I used this arch file to build the current CP2K release (version 2022.1) on our local cluster, basically by running

  • source arch/Linux-gnu-x86_64.psmp

in the main CP2K folder and then running make as suggested once the CP2K toolchain has been built successfully. The same procedure is used for the continuous regression testing (see the first two entries in the CP2K dashboard; just click the “OK” link to see the details).

 

HTH

 

Matthias

 

Matthew Emerson

Sep 23, 2022, 10:54:53
to cp2k
Dear Dr. Matthias, 

Sorry for the late response; I wanted to test these things carefully. Below is what I have found.

All versions of CP2K that I have tested still appear to have a memory leak. The issue is less severe with version 2022.1, but as you can see in the graph "CP2K-2022.1.png", memory usage keeps growing with time, and our runs are long. For version 8.1 (graph "CP2K-8.1.png"), the problem is much more severe on the same hardware (56 MPI ranks).
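Memory curves like these can be collected on Linux by periodically sampling each rank's resident set size (the `VmRSS` field of `/proc/<pid>/status`). A minimal sketch, assuming a Linux node and that the PIDs of the CP2K ranks are already known (for example from `pgrep`); the helper names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches lines such as "VmRSS:     123456 kB" in /proc/<pid>/status.
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Return VmRSS (in kB, as the kernel reports it) from the text of a
    /proc/<pid>/status file, or None if the field is absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def total_rss_mib(pids: Iterable[int]) -> float:
    """Sum the resident memory of the given processes in MiB (Linux only)."""
    total_kib = 0
    for pid in pids:
        try:
            with open(f"/proc/{pid}/status") as fh:
                total_kib += parse_vmrss_kib(fh.read()) or 0
        except FileNotFoundError:
            pass  # the rank has already exited
    return total_kib / 1024
```

Calling `total_rss_mib` at a fixed interval and logging the result next to the current MD step gives a per-node memory curve that can be compared across CP2K versions and MPI libraries.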

My primary concern is the correctness of the calculations. We can of course restart the program, but how do I know that a code that appears to be leaking is not also corrupting data? If you continue the run you started, I predict its memory usage will keep growing. Thank you for your time.


Sincerely,
Matthew S. Emerson
CP2K-8.1.png
CP2K-2022.1.png

Krack Matthias (PSI)

Sep 23, 2022, 12:00:17
to cp...@googlegroups.com

Dear Matthew

 

The memory growth for v2022.1 does not look too bad, and it may be small enough to survive longer runs.

I remember issues with memory leaks caused by the MPI implementation. OpenMPI in particular showed such problems in the past, which is why I used only MPICH for years: leak checking in CP2K was impossible with OpenMPI. Have a look, for instance, at this issue from January this year.

The presence of memory leaks usually does not imply that the results are wrong.

 

Best regards


Francois Gygi

Sep 25, 2022, 16:15:59
to cp2k
If you are using the Intel MKL library, you may want to consider changing its version. We have observed memory leaks (with Qbox, another code) when using
(1) intel/19.1.1
(2) intelmpi/2019.up7+intel-19.1.1
(3) mkl/2020.up1

and the leaks disappeared after switching to these older versions:
(1) intel/18.0
(2) intelmpi/2018.2.199+intel-18.0
(3) mkl/2018.up2

There is also a documented risk of memory leaks when using the fast matrix multiply in the oneAPI Intel MKL library:

Francois