crash in relccsd


Peterson, Kirk

Oct 17, 2022, 10:18:36 AM
to dirac...@googlegroups.com, Gulzari Malli

Dear Dirac experts,

 

I've got a really large DC-CCSD(T) calculation (224 correlated electrons!) that fails just after the MP2 calculation. I've included most of the last part of the output below. My first guess is that we haven't given it enough memory: the output says the job requires about 115 GB of RAM, and it was allocated 120 GB. Does this look familiar to anyone? This version was compiled with 64-bit integers (GNU compilers) with OpenMPI (64-bit) on the Cray at NERSC. The test cases are fine and I've run smaller CCSD(T) jobs successfully.

 

regards,

 

-Kirk

 

 

  MP2 results

 SCF energy :                            -109682.222279765090207
 MP2 correlation energy :                     -3.981023068481005
 Total MP2 energy :                      -109686.203302833571797
 T1 diagnostic :                               0.000003526477619

 CCSD options :
 Maximum number of iterations :           30
 Maximum size of DIIS space :              8
 Convergence criterium :             0.1E-11

DIRAC pam run in /global/cscratch1/sd/rthomas/malli/Rollin.input

 

 ====  below this line is the stderr stream  ====

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x155553c563df in ???
#1  0x1023229 in readdi_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/gp/io.F:47
#2  0x122ae03 in waio_intio_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/waio.F:305
#3  0x122c2b2 in waio_intio_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/waio.F:227
#4  0x122c2b2 in master.0.rread
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/waio.F:415
#5  0x122c2b2 in rread_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/waio.F:398
#6  0x120c2d6 in getvovo_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/ccgetv.F:390
#7  0x11e26a9 in amplitude_equation_t1_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/cceqn_t1_amplitudes.F:156
#8  0x11e0b83 in cceqn_driver_amplitudes_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/cceqn_driver_amplitudes.F:107
#9  0x6bd689 in ccener_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/ccdriv.F:2575
#10  0x6d573b in ccmain_
        at /global/cscratch1/sd/rthomas/malli/Dirac64/src/relccsd/ccdriv.F:904

 

Visscher, L. (Luuk)

Oct 20, 2022, 5:07:39 AM
to dirac...@googlegroups.com
Dear Kirk,

The error points to a routine that fetches the <vo||vo>-type integrals, which are (roughly) 4 times more numerous than the t2 amplitudes and <vv||oo> integrals addressed in MP2. It may be that memory ran out when allocating this array, which then causes a failure when trying to fetch that data from file (the first time the array is really used).
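The factor of roughly 4 can be checked with a short count. The sketch below is only an illustration under stated assumptions: the antisymmetrized <vv||oo> class is taken to keep unique index pairs (a<b, i<j), the <vo||vo> class is taken to have no such pair reduction, point-group symmetry is ignored, and 16 bytes per element (complex double) is assumed. The occupied count of 224 comes from the original post; `n_virt = 300` is a hypothetical value, not taken from the actual job.

```python
def unique_pairs(n):
    """Number of index pairs p < q that can be formed from n spinors."""
    return n * (n - 1) // 2


def integral_counts(n_occ, n_virt):
    """Element counts for the two integral classes (no point-group symmetry).

    <vv||oo>: antisymmetric in both pairs, so only a<b and i<j are counted.
    <vo||vo>: mixed pairs, assumed here to have no pair reduction.
    """
    vvoo = unique_pairs(n_virt) * unique_pairs(n_occ)
    vovo = (n_virt * n_occ) ** 2
    return vvoo, vovo


def gigabytes(n_elements, bytes_per_element=16):
    """Memory in GB, assuming complex double precision by default."""
    return n_elements * bytes_per_element / 1e9


n_occ = 224    # correlated electrons, from the original post
n_virt = 300   # hypothetical number of virtual spinors

vvoo, vovo = integral_counts(n_occ, n_virt)
ratio = vovo / vvoo
print(f"<vv||oo>: {gigabytes(vvoo):.1f} GB")
print(f"<vo||vo>: {gigabytes(vovo):.1f} GB")
print(f"ratio   : {ratio:.2f}")
```

Under these assumptions the ratio is 4 * [v/(v-1)] * [o/(o-1)], which approaches 4 as the number of spinors grows. The absolute sizes printed here are upper bounds for a symmetry-free calculation and should not be read as the actual memory use of RELCCSD.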

The only solution is probably to use more memory still. For this reason we started developing exacorr, which works with distributed memory, but it cannot yet use symmetry, so the total amount of memory required is still larger.

best regards,

Luuk



Trond Saue

Oct 20, 2022, 6:01:10 AM
to dirac...@googlegroups.com
We are working to get the best of both worlds.
Trond


Peterson, Kirk

Oct 20, 2022, 9:47:28 AM
to dirac...@googlegroups.com

Dear Luuk,

 

Thanks for your reply on this. I had also guessed that it just needed a bit more memory. Unfortunately 120 GB per node is all we can request at NERSC, so it seems this job is simply too large to run there. Building Dirac with Exacorr is on my to-do list at NERSC, since their new machine stresses GPUs a lot.

 

best regards,

 

-Kirk

 


Kenneth Dyall

Oct 20, 2022, 11:13:36 AM
to dirac...@googlegroups.com
In terms of algorithms, is it possible to fetch the integrals in blocks? For example, loop over oo pairs and fetch a block of <vo||vo> integrals for only one oo pair at a time. This would reduce the memory used as it wouldn't be necessary to have all the integrals in memory at the same time. The I/O would also be spread out a bit. But maybe that's being done already?
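The block-wise idea can be sketched in a few lines. This is a toy illustration only, not the RELCCSD I/O layer: the file layout (one contiguous block of doubles per oo pair), the file name `vovo.bin`, and the functions `write_blocks`, `iter_blocks`, and `contract_blockwise` are all hypothetical.

```python
import os
import tempfile
from array import array


def write_blocks(path, n_blocks, block_len):
    """Write a toy integral file: n_blocks consecutive blocks of doubles.

    Block b is filled with the value b + 1 so the result is easy to check.
    """
    with open(path, "wb") as f:
        for b in range(n_blocks):
            array("d", [float(b + 1)] * block_len).tofile(f)


def iter_blocks(path, n_blocks, block_len):
    """Yield one block at a time; only block_len values are resident."""
    with open(path, "rb") as f:
        for _ in range(n_blocks):
            block = array("d")
            block.fromfile(f, block_len)
            yield block


def contract_blockwise(path, n_blocks, block_len, amplitudes):
    """Accumulate sum over blocks of sum_k block[k] * amplitudes[k]."""
    total = 0.0
    for block in iter_blocks(path, n_blocks, block_len):
        total += sum(x * t for x, t in zip(block, amplitudes))
    return total


def demo():
    """Run the toy contraction on a temporary file and return the result."""
    n_blocks, block_len = 6, 4          # e.g. 6 oo pairs, 4 values per pair
    amplitudes = [0.5] * block_len      # stand-in for a slice of amplitudes
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vovo.bin")
        write_blocks(path, n_blocks, block_len)
        return contract_blockwise(path, n_blocks, block_len, amplitudes)


if __name__ == "__main__":
    print(demo())
```

Peak memory in this scheme scales with one block rather than with the whole integral class, at the cost of more, smaller reads. It only works directly when the order of the data on file matches the order the contraction needs.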

Visscher, L. (Luuk)

Oct 25, 2022, 8:57:15 AM
to dirac...@googlegroups.com
Dear Ken,

I do fetch this in blocks, but for some contractions the integrals need to be re-sorted, and for this class that is done in memory.

Luuk
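A small sketch of why a re-sort tends to defeat simple blocking. The index orders used here are hypothetical and chosen only for illustration: the source is assumed to be stored as (i, j, a, b), i.e. one block per oo pair, and the target order is assumed to be (a, i, b, j). Which permutations RELCCSD actually performs is not stated in the thread.

```python
def offset(idx, dims):
    """Row-major offset of a multi-index in an array with the given dims."""
    off = 0
    for k, n in zip(idx, dims):
        off = off * n + k
    return off


def resort_ijab_to_aibj(src, n_occ, n_virt):
    """Return a copy of src reordered from (i, j, a, b) to (a, i, b, j)."""
    src_dims = (n_occ, n_occ, n_virt, n_virt)
    dst_dims = (n_virt, n_occ, n_virt, n_occ)
    dst = [0.0] * len(src)
    for i in range(n_occ):
        for j in range(n_occ):
            for a in range(n_virt):
                for b in range(n_virt):
                    dst[offset((a, i, b, j), dst_dims)] = \
                        src[offset((i, j, a, b), src_dims)]
    return dst


def target_span_of_block(i, j, n_occ, n_virt):
    """Lowest and highest target offsets written by source block (i, j)."""
    dst_dims = (n_virt, n_occ, n_virt, n_occ)
    offsets = [offset((a, i, b, j), dst_dims)
               for a in range(n_virt) for b in range(n_virt)]
    return min(offsets), max(offsets)


n_occ, n_virt = 2, 3                       # tiny toy dimensions
src = [float(k) for k in range(n_occ * n_occ * n_virt * n_virt)]
dst = resort_ijab_to_aibj(src, n_occ, n_virt)
print(target_span_of_block(0, 0, n_occ, n_virt), "of", len(dst), "elements")
```

In this toy layout a single contiguous source block is scattered over most of the target array, so a straightforward re-sort wants the whole target resident at once. Out-of-core sorting schemes exist, but they trade memory for extra I/O passes.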

