memory issues with EOM-CCSD

Peterson, Kirk
Nov 29, 2020, 6:15:35 PM
to dirac-users

Dear Dirac experts,

 

I've been using the new EOM-CCSD module for a few weeks now as an alternative to FS-CCSD, and in general it works great.  I've run into a recurring problem, however, that only occurs when I use a Hamiltonian with spin-orbit coupling on a molecule with only one symmetry irrep.  I've tried 4-component Dirac-Coulomb and 2-component X2C, both with the same general result: once the job reaches the EOM diagonalization steps, the memory grows and grows until eventually the OS kills the job via the oom-killer.  The issue is not node specific, and these compute nodes have nearly 400 GB of RAM.  In my last attempt I was running over just 4 cores, each allocated with --mw 1400 --aw 4000, so only about 160 GB in total.  It got through 10 iterations, as noted below; at that point the system log showed that memory had run out and the oom-killer was invoked on dirac.x.

 

I'm running DIRAC19, a fairly recent build from GitHub, using OpenMPI (64-bit integers).

 

The EOM part of the input was very simple:

 

.EOMCC
*EOMCC
.EE
1 6

regards,   -Kirk

Output snippet:

 

Iteration                    10

  Number of OMP threads, procs in use:     1     1
    Eigenvalue      1 :  -0.786041E-02
    Eigenvalue      2 :  -0.268021E-02
    Eigenvalue      3 :   0.118441E-01
    Eigenvalue      4 :   0.212588E-01
    Eigenvalue      5 :   0.266271E-01
    Eigenvalue      6 :   0.311534E-01

DIRAC pam run in /home/kipeters/projects/Bowen/ThAu2/dirac/X2C

 

====  below this line is the stderr stream  ====

--------------------------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

--------------------------------------------------------------------------

mpirun noticed that process rank 0 with PID 0 on node compute-0-9 exited on signal 9 (Killed).

--------------------------------------------------------------------------

Andre Gomes
Nov 30, 2020, 6:34:11 PM
to dirac...@googlegroups.com
Dear Kirk,

Thank you for your question.

What you are seeing is not, strictly speaking, abnormal, and will be seen irrespective of the Hamiltonian you use.

The current EOM implementation has a shortcoming: it is more memory hungry during the Davidson procedure than it should (or could) be.

Roughly speaking, at each iteration the memory usage grows by about 4 * (number of unconverged roots) * dim(T1+T2) per MPI process, as the number of trial and sigma vectors is increased.  Note that dim(T1+T2) is not the full amplitude length, since these are stored in triangular (packed) form.

This memory consumption increases by a further factor of 2 when the code operates in complex-algebra mode, which may or may not apply in your case.
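
To make this rule of thumb concrete, here is a small back-of-envelope sketch in Python (not code from DIRAC itself; the 8/16-byte word sizes, the packed-T2 estimate, and the spinor counts in the example are assumptions made purely for illustration):

# Back-of-envelope estimate of the per-iteration memory growth in the
# EOM-CCSD Davidson step, following the rule of thumb above:
#   growth per iteration ~ 4 * (unconverged roots) * dim(T1+T2)  per MPI process,
# doubled in complex-algebra mode.

def packed_amplitude_dim(n_occ, n_virt):
    """Rough dim(T1+T2), with T2 kept in triangular (packed) form (assumed)."""
    dim_t1 = n_occ * n_virt
    dim_t2 = (n_occ * (n_occ - 1) // 2) * (n_virt * (n_virt - 1) // 2)
    return dim_t1 + dim_t2

def growth_per_iteration_gb(n_unconverged_roots, dim_t, complex_algebra=True):
    bytes_per_amplitude = 16 if complex_algebra else 8   # assumed double precision
    n_new_vectors = 4 * n_unconverged_roots              # new trial + sigma vectors
    return n_new_vectors * dim_t * bytes_per_amplitude / 1024**3

if __name__ == "__main__":
    # Made-up spinor counts, roughly the size of a heavy-element calculation.
    dim_t = packed_amplitude_dim(n_occ=40, n_virt=400)
    per_iter = growth_per_iteration_gb(n_unconverged_roots=6, dim_t=dim_t)
    print(f"dim(T1+T2) ~ {dim_t:.2e}")
    print(f"~{per_iter:.0f} GB added per iteration per MPI process")
    print(f"~{10 * per_iter:.0f} GB after 10 iterations if no roots converge")

With made-up numbers of that order, the growth comes out to a couple of tens of GB per process per iteration, which would be broadly consistent with a ~400 GB node filling up after about ten iterations once the static allocation is added on top.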

We are considering how to change this in future releases (though most likely not for DIRAC 2021): reduce the memory usage by another factor of 2, and enforce from the outset the maximum memory set via the --aw keyword, which is currently only partially honored in the EOM code (--aw is checked when allocating storage for the intermediates, but not when updating the trial or sigma vectors).

So, in practice, what I can suggest right now is one of the following:

- Run the job sequentially in this case; if that does not take too long, you should be able to manage at least 38-39 iterations, since you can complete 9 iterations with 4 MPI processes (the arithmetic behind this estimate is sketched just after this list).
- Try reducing the value of --aw so that it provides only the minimum needed for the ground-state calculation and the setup of the EOM intermediates.  That may free up some memory for the Davidson step, but it may not be enough to get you to convergence.  If you still have the scratch space (or, more specifically, the files from MOLTRA), you could try this second option with a restart: you can then skip the SCF, MOLTRA and integral-sorting steps entirely, although you would have to redo the CCSD iterations.  I can help you set this up.
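
The iteration estimate in the first suggestion is just proportionality (my restatement, not anything DIRAC computes): with fixed node memory and every MPI process growing its trial/sigma storage at roughly the same rate, the number of iterations that fit scales inversely with the number of processes, ignoring the static per-process allocation.

def estimated_iterations(observed_iterations, observed_procs, new_procs):
    # Iterations that fit before the node fills scale ~ 1 / (number of MPI processes).
    return observed_iterations * observed_procs // new_procs

# ~9-10 iterations completed with 4 MPI processes on the same node:
print(estimated_iterations(9, 4, 1), estimated_iterations(10, 4, 1))   # -> 36 40

which brackets the 38-39 iterations mentioned above.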

I hope this helps, and I'm sorry you hit these limitations.

All the best,

Andre

Andre Gomes
Nov 30, 2020, 6:35:57 PM
to dirac...@googlegroups.com

A quick addendum to my previous message: the estimate above is for EOM-EE-CCSD; for EOM-IP and EOM-EA the increase in memory usage will be somewhat less severe, due to the smaller dimensions of the R1 and R2 vectors.

Peterson, Kirk
Nov 30, 2020, 7:10:27 PM
to dirac...@googlegroups.com

Dear Andre,

 

Thank you for your reply, this is very helpful.  It gives me some good ideas on how to circumvent this issue.

 

best,

 

-Kirk
