Error [hashv2: key not found 2 ] for CCSDTQ calculation


Dominic Chien

Jan 9, 2021, 11:53:29 AM
to NWChem Forum
I have a problem with a large CCSDTQ calculation using the latest NWChem (7.0.2). The input and the build script are given below (full input/output is available upon request for NWChem developers).

On a 2-node computing system (each node has 2x 64-core AMD Rome processors with 512 GB RAM), I tried different 2emet algorithms (i.e. 16, 15, 14, 13, 4, 3, 2, with io=ga), and I got different errors.

For 2emet = 16, it hit the [strided_to_subarray_dtype] error discussed at https://github.com/nwchemgit/nwchem/issues/100:

(2emet 16, 2eorb)
##########################################################################################
v2    file size   =         18806764
4-index algorithm nr.  16 is used
imaxsize =       20
imaxsize ichop =        0
p[32] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 100144
p[30] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 345136
forrtl: error (78): process killed (SIGTERM)
##########################################################################################

So I applied the environment variables (COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0) and the job went a little further, but then it hit another error with the message “hashv2: key not found 2                   0”:
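
A minimal sketch of how that workaround might be applied in a job script. The two variable names are the ones listed in the run label that follows; my understanding that they switch off the MPI derived-datatype paths in ComEx is an assumption. The launcher lines in the comments are illustrative only (rank count and file names are placeholders), since whether the environment is forwarded to remote ranks depends on the MPI in use.

```shell
#!/usr/bin/env bash
# Settings used for the runs reported in this post. Assumption: they make
# ComEx avoid MPI derived datatypes for strided get/put operations.
export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0

# Illustrative launch lines (placeholders, not taken from the original job):
#   Open MPI:  mpirun -x COMEX_ENABLE_GET_DATATYPE -x COMEX_ENABLE_PUT_DATATYPE -np <nranks> nwchem input.nw
#   Intel MPI: mpirun -genv COMEX_ENABLE_GET_DATATYPE 0 -genv COMEX_ENABLE_PUT_DATATYPE 0 -np <nranks> nwchem input.nw
```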

(2emet 16, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
##########################################################################################

t4 file size   =      35932110259
t4 file name   = /scratch/shchien/CCSDTQ-Test-1//CCSDTQ-Test-1.t4

CCSDTQ iterations
--------------------------------------------------------
Iter          Residuum       Correlation     Cpu    Wall
--------------------------------------------------------
key=                  1547
key=                   238
key=                  2218
hashv2: key not found 2                   0
------------------------------------------------------------------------
------------------------------------------------------------------------
key=                 22162
key=                 14713
hashv2: key not found 2                   0
------------------------------------------------------------------------
------------------------------------------------------------------------
 current input line :
    0:
------------------------------------------------------------------------
 current input line :
    0:
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------
For more information see the NWChem manual at
https://github.com/nwchemgit/nwchem/wiki

##########################################################################################

A similar error occurred for 2emet = 2, 3, 4, 13, 14 and 15:
(2emet 15, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 14, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 13, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 4, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 3, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 2, 2eorb, with  COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
##############################################################################################################
CCSDTQ iterations
--------------------------------------------------------
Iter          Residuum       Correlation     Cpu    Wall
--------------------------------------------------------
key=                 18152
hashv2: key not found 2                   0
------------------------------------------------------------------------
------------------------------------------------------------------------
 current input line :
    0:
------------------------------------------------------------------------
key=                 38233
hashv2: key not found 2                   0
------------------------------------------------------------------------
------------------------------------------------------------------------
 current input line :
    0:
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------

##############################################################################################################

Most annoying to me (as a system admin) is that these errors leave a lot of junk files in /dev/shm that require manual removal. (I thought we had discussed this error for version 6.6 or 6.7, and that it had been fixed.)
##############################################################################################################
[root@hpcnode055 shm]# df
Filesystem        1K-blocks        Used    Available Use% Mounted on
tmpfs             264040960   141854272    122186688  54% /dev/shm

[shchien@hpcnode009 CCSDTQ-Test-1]$ cd /dev/shm
[shchien@hpcnode009 CCSDTQ-Test-1]$ ls dev/shm
cmx11806834260000037326000001  cmx1180683426000003803200006p
cmx11806834260000037326000002  cmx1180683426000003803200006q
cmx1180683426000003732600004u  cmx1180683426000003803200006r
cmx1180683426000003732600004v  cmx1180683426000003803200006s
cmx11806834260000037326000062  cmx1180683426000003803200006u
cmx11806834260000037326000064  cmx11806834260000038033000001
cmx11806834260000037326000065  cmx11806834260000038033000002
cmx11806834260000037327000001  cmx1180683426000003803300004u
cmx11806834260000037327000002  cmx1180683426000003803300004v
cmx1180683426000003732700004u  cmx11806834260000038033000062
cmx1180683426000003732700004v  cmx11806834260000038033000064
cmx11806834260000037327000062  cmx11806834260000038033000066

##############################################################################################################
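
Until the leak itself is fixed, the manual cleanup can be scripted. The following is a sketch only: `clean_cmx_shm` is a hypothetical helper name, and it assumes the leftover segments are regular files named `cmx*`, as in the listing above. It should only be run on a node after the owning job has exited.

```shell
#!/usr/bin/env bash
# Remove leftover ComEx shared-memory files (named cmx*) owned by the current
# user from the given directory (default: /dev/shm). Prints each removed file.
clean_cmx_shm() {
    local dir="${1:-/dev/shm}"
    # -maxdepth 1 : do not descend into subdirectories
    # -user       : never touch segments that belong to other users' jobs
    find "$dir" -maxdepth 1 -type f -name 'cmx*' -user "$(id -u)" -print -delete
}
```

Calling `clean_cmx_shm` with no argument acts on /dev/shm; a scheduler epilogue script would be a natural place for it.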

This “hashv2: key not found 2” error has only been discussed once before, but the solution was not quite clear to me (ignore 2eorb?):
https://nwchemgit.github.io/Special_AWCforum/st/id1748/hashv2__key_not_found_with_TCE_a....html

After I removed 2eorb, these algorithms were not recognized by NWChem (tce_energy: invalid 2emet:                   16)
##############################################################################################################
4-index algorithm nr.  16 is used
4-index algorithm nr.  16 is used
Fock matrix recomputed
1-e file size   =            17800
1-e file name   = /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.f1int.000000
Cpu & wall time / sec            9.0            9.1
tce_energy: invalid 2emet:                   16
tce_energy: invalid 2emet:                   16
------------------------------------------------------------------------
------------------------------------------------------------------------

##############################################################################################################



Here is part of the input:
##########################################################################################
echo
start CCSDTQ-Test-1
memory stack 3500 mb heap 200 mb global 11500 mb

permanent_dir /home/shchien/nwchem/CCSDTQ-Test-1/
SCRATCH_DIR /scratch/shchien/CCSDTQ-Test-1/

charge 1
geometry units angstrom
zmatrix
(The geometry was removed due to NDA with user)
end
end

basis "ao basis" spherical
 * library cc-pvtz-dk
end

scf
 vectors  input CCSDTQ-Test-1_scf_mo
 rohf
 doublet
 thresh 1e-8
 maxiter 200
end

RELATIVISTIC
 DOUGLAS-KROLL ON
END

tce
 SCF
 CCSDTQ
 thresh 1e-6
 io ga
 freeze atomic
 attilesize 20
 tilesize   3
 2emet 1
end

set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:save_interval 10
set tce:tceiop 2048
set tce:nts t

task scf energy
task tce energy
##########################################################################################



Here is how I compiled NWChem. I tried this with IntelMPI/Intel compiler, OpenMPI/Intel compiler and OpenMPI/UCX 1.10/Intel compiler:
##########################################################################################
module purge
export MODULEPATH=/cm/local/modulefiles:/etc/modulefiles:/usr/share/modulefiles:/usr/share/Modules/modulefiles:/cm/shared/modulefiles/compiler:/cm/shared/modulefiles/library:/cm/shared/modulefiles/mpi
module load intel
module load  cuda/11.0.2  cuda/blas/11.0.2 cuda/fft/11.0.2
export NWCHEM_TOP=/home/shchien/nwchem-7.0.2
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=MPI-PR
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export DISABLE_GAMIRROR=y
export USE_GAGITHUB=y

export BLAS_SIZE=8
export BLASOPT="-mkl -liomp5 -lpthread -lm -ldl -qopenmp -qopenmp-simd "
#export BLASOPT="/apps/openblas/0.3.6/lib/libopenblas_gcc_i8_s.a -lpthread -ldl "
export LAPACK_LIB=$BLASOPT

export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK=" ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a  ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a"
export SCALAPACK_LIB="$SCALAPACK $BLASOPT"
unset SCALAPACK_LIB
#unset SCALAPACK

export MPI_INCLUDE="/cm/shared/common_software_stack/apps/libraries/mpi/openmpi/4.1.0/hpcx_icc/include"
export LIBMPI="-L/cm/shared/common_software_stack/apps/libraries/mpi/openmpi/4.1.0/hpcx_icc/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -ldl -Wl,--export-dynamic -lutil  "

export NWCHEM_MODULES="all"

export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE

export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
export MPICC=mpicc
export MPIFC=mpif90

export TCE_CUDA=Y
export CUDA_LIBS="-L /cm/shared/apps/cuda11.0/toolkit/11.0.2/lib64 -lcublas -lcudart "
export CUDA_FLAGS="-arch sm_70"
export CUDA_INCLUDE="-I. -I/usr/local/cuda/include"
export CUDA=nvcc
##########################################################################################

Dominic Chien

Jan 10, 2021, 12:33:34 AM
to NWChem Forum

I removed 2eorb, and it seems that only 2emet = 1, 2 or 3 can be used for computation for CCSDT or CCSDTQ. 

After removing 2eorb and using 2emet = 1, 2, or 3, I hit a divide-by-zero error immediately after this line is printed:

"2-e (intermediate) file name = /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.v2i.001"

##########################################################################################


 Parallel file system coherency ......... OK

 Integral file          = /scratch/shchien/CCSDTQ-Test-1//CCSDTQ-Test-1.aoints.00
 Record size in doubles =    65536    No. of integs per rec  =    43688
 Max. records in memory =       17    Max. records in file   = 35280390
 No. of bits per label  =        8    No. of bits per value  =       64

 #quartets = 6.915D+05 #integrals = 3.871D+07 #direct =  0.0% #cached =100.0%

File balance: exchanges=    93  moved=   146  time=   0.0

  3 ga offset               13350 size_xx_perproc                4450mx    4
  2 ga offset                8900 size_xx_perproc                4450mx    4
 size_1e                 17800
  0 ga offset                   0 size_xx_perproc                4450mx    4
 WRITE TENSOR
  filename: /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.f1int.000000
  unit nr:       77
  file size:           4450
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:            1
  1 ga offset                4450 size_xx_perproc                4450mx    4

 Fock matrix recomputed
 1-e file size   =            17800
 1-e file name   = /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.f1int.000000
 Cpu & wall time / sec           10.2           14.6

 tce_ao2e_disk: fast2e>1
 half-transformed integrals on disk

 2-e (intermediate) file size =       690449200
 2-e (intermediate) file name = /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.v2i.001
[hpcnode009:91404:0:91404] Caught signal 8 (Floating point exception: integer divide by zero)
[hpcnode009:91389:0:91389] Caught signal 8 (Floating point exception: integer divide by zero)
[hpcnode009:91386:0:91386] Caught signal 8 (Floating point exception: integer divide by zero)

##########################################################################################

How can I fix this problem?

May I confirm, based on this code in tce_energy.F,

      if(.not.intorb) THEN  !--------------
        if (read_integrals(2)) then
          call errquit('tce_energy: cannot restart without 2eorb',
     1                  911,GA_ERR)

that TCE methods cannot be restarted without 2eorb, so no CCSDT or CCSDTQ calculation can be restarted even if the integrals were saved?

On Sunday, January 10, 2021 at 12:53:29 AM UTC+8, Dominic Chien wrote:

Edoardo Aprà

Jan 11, 2021, 12:47:33 PM
to NWChem Forum
On Saturday, January 9, 2021 at 9:33:34 PM UTC-8 Dominic Chien wrote:
How can I fix this problem?

By providing a complete input file (complete in all its parts, including the geometry) and the associated error/output files that reproduce the problem.

Dominic Chien

Feb 17, 2021, 10:49:00 PM
to NWChem Forum
I submitted a reply a week ago. Why hasn't it been posted here yet?

On Tuesday, January 12, 2021 at 1:47:33 AM UTC+8, Edoardo Aprà wrote:

Edoardo Aprà

Feb 18, 2021, 12:15:55 AM
to NWChem Forum
Could you try to post it again?

Edoardo Aprà

Feb 18, 2021, 1:14:20 PM
to NWChem Forum
Posting email on Dominic Chien's behalf

Please find the attached input and output (with different 2emet settings) as well as the environment settings for building NWChem.

In the end, the user decided to let MRCC crawl on 2 cores with 4 TB for a month rather than continue the endless testing of different 2emet combinations with no success.


Thanks!

Best Regards,
Dominic

test.inp
2emeteq1.zip
test.inp

Edoardo Aprà

Feb 18, 2021, 1:16:56 PM
to NWChem Forum
Posting  on Dominic Chien's behalf

More output files
2emets.zip

Kowalski, Karol

Feb 18, 2021, 8:17:11 PM
to nwchem...@googlegroups.com

Dominic,

 

  1. The 2emet options have not been extended to CCSDT/CCSDTQ. There is a simple reason for this: 2emet works well for bigger systems with CCSD, and with CCSDT and CCSDTQ you will never reach those system sizes. The main problem is storing the T3/T4 amplitudes.
  2. The user can use a larger tile size for CCSDTQ (otherwise the program will be slow).
  3. CCSDTQ with 140 orbitals is a pretty big simulation and will require a lot of memory and cores. Please try at least 1000 cores, or even more; the GA for T4 may be pretty big.

Please use the following input:

 

echo
start TestInput-Q
memory stack 1200 mb heap 100 mb global 5000 mb

permanent_dir /home/chiensh/nwchem/testing
SCRATCH_DIR   /scratch/chiensh/tmp

charge 2
geometry units angstrom
 zmatrix
  Cr
  C  1  r1
  H  2  r2  1  a1
  H  2  r2  1  a1  3  +d1
  H  2  r2  1  a1  3  -d1
  constants
  r1   1.92348
  r2   1.09246
  a1 102.50924
  d1 120.23282
 end
end

basis "ao basis" spherical
  * library cc-pvtz-dk
end

scf
  vectors  input  TestInput-Q_scf_mo
  rohf
  doublet
  thresh 1e-8
  maxiter 200
end

RELATIVISTIC
  DOUGLAS-KROLL ON
END

tce
  SCF
  CCSDTQ
  thresh 1e-6
  io ga
  freeze atomic
  tilesize   8
end

task tce energy

 

 


Dominic Chien

Feb 18, 2021, 9:28:59 PM
to NWChem Forum
Thank you Edo and Karol!

Regarding point 1, is there a way for a user to estimate the memory and core requirements before submitting the job? A very rough estimate is good enough for better job scheduling.

I ran a number of tests with tile sizes > 3, and NWChem attempted to request a ridiculously large shmmax (e.g. a few TB). I understand that there may not be sufficient memory for GA and that this can be solved by using more nodes, but there is not much we can do about such a large shmmax requirement.

For point 3, yes, it is a big calculation indeed, and that is why I suggested the user use more nodes; NWChem seems to be the only package that can handle CCSDTQ on distributed systems. The user has finished the job with Molpro/MRCC on a large-memory node with 4 TB; although the node has 88 CPU cores, MRCC can basically only utilize 22 cores efficiently for this calculation due to the I/O bandwidth.

The 1000-core requirement can be satisfied by 8 nodes (each with 128 cores and 512 GB), and it is acceptable to me if NWChem can finish these kinds of jobs in days rather than months. With such a large number of cores, will all the CPUs be fully utilized, or is it just that the job needs more memory?

On Friday, February 19, 2021 at 9:17:11 AM UTC+8, karol.k...@pnnl.gov wrote:

Dominic Chien

Mar 8, 2021, 5:10:44 AM
to NWChem Forum
Thank you Karol,

Based on your input, the job managed to pass the integral transformation on 8 nodes (1024 cores), but it hit another error, and I am not sure whether it is due to NWChem, ARMCI, Intel MPI or the IB driver.

To run the job, I had to set shmmax on each node to 100 GB; otherwise it hits the insufficient shared memory problem described before.
I used the following command to run the job:

mpirun -genv I_MPI_PIN on  -genv MV2_USE_APM 0 -genv I_MPI_FABRICS shm:ofa -genv I_MPI_OFA_USE_XRC 1   -genv I_MPI_OFA_NUM_ADAPTERS 1 -genv I_MPI_OFA_ADAPTER_NAME mlx5_0 -genv I_MPI_OFA_NUM_PORTS 1 -genv MALLOC_MMAP_MAX_=0  -genv MALLOC_TRIM_THRESHOLD_=-1 -genv=ARMCI_DEFAULT_SHMMAX=8192  ./nwchem input >& output

Here are the last few lines before it crashed 
==============================================================================
tce_mo2e: fast2e=1
 2-e integrals stored in memory
  0 ga offset                   0 size_xx_perproc            56024070mx    4
 WRITE TENSOR
  filename: /home/chiensh/nwchem/testing/TestInput-Q.v2int.000000
  unit nr:       78
  file size:       56024070
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:          214
  1 ga offset            56024070 size_xx_perproc            56024070mx    4
  2 ga offset           112048140 size_xx_perproc            56024070mx    4
  3 ga offset           168072210 size_xx_perproc            56024070mx    4

 2-e file size   =        224096280
 2-e file name   = /home/chiensh/nwchem/testing/TestInput-Q.v2int.000000
 Cpu & wall time / sec            9.5            9.8
 T1-number-of-tasks                    32

 t1 file size   =              727
 t1 file name   = /local/TestInput-Q.t1
 t1 file handle =       -998
 T2-number-of-boxes                   936

 t2 file size   =           434990
 t2 file name   = /local/TestInput-Q.t2
 t2 file handle =       -995

 t3 file size   =        166424658
 t3 file name   = /local/TestInput-Q.t3
2: WARNING:armci_set_mem_offset: offset changed -149422080 to -471310336
387: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
769: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
258: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
901: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
644: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
513: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160
133: WARNING:armci_set_mem_offset: offset changed -149422080 to -471900160

 t4 file size   =      46795018459
 t4 file name   = /local/TestInput-Q.t4

 CCSDTQ iterations
 --------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall
 --------------------------------------------------------
mlx5: hpcnode105: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000003de 00000000 00000000 00000000
00000000 9d005304 080016c5 80ee3dd3
Last System Error Message from Task 128:: Protocol not supported
128: error ival=4
(rank:128 hostname:hpcnode105 pid:122726):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
mlx5: hpcnode085: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000080 00000000 00000000 00000000
00000000 00008813 100149f3 65f41dd3
0: error ival=10
(rank:0 hostname:hpcnode085 pid:30788):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
==============================================================================
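
As a very rough aid for the memory question raised earlier in the thread, the sizes printed in the log can be turned into a lower bound. This sketch assumes the TCE "file size" values are counts of 8-byte double-precision words, which is not confirmed anywhere in this thread; `tce_size_gib` and `tce_size_gib_per_node` are hypothetical helper names. The iterations typically need more than one array of this size (for example residuals and any DIIS history), so the real requirement will be higher.

```shell
#!/usr/bin/env bash
# Convert a TCE "file size" (assumed: number of 8-byte doubles) to whole GiB.
tce_size_gib() {
    local ndoubles="$1"
    echo $(( ndoubles * 8 / 1024 / 1024 / 1024 ))
}

# Same figure divided evenly over a number of nodes (whole GiB per node).
tce_size_gib_per_node() {
    local ndoubles="$1" nodes="$2"
    echo $(( ndoubles * 8 / 1024 / 1024 / 1024 / nodes ))
}

# t4 size from the log above, on the 8 nodes used for this run
tce_size_gib 46795018459
tce_size_gib_per_node 46795018459 8
```

Under that assumption the t4 array alone is about 348 GiB in total, or about 43 GiB per node on 8 nodes, before any additional copies are counted.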



On Friday, February 19, 2021 at 9:17:11 AM UTC+8, karol.k...@pnnl.gov wrote:

Dominic,

Edoardo Aprà

Mar 8, 2021, 2:40:51 PM
to NWChem Forum
Dominic,
In order to avoid the memory problems, you might want to try ARMCI_NETWORK=MPI-PR, instead of  ARMCI_NETWORK=OPENIB.
MPI-PR has a more robust handling of large shared memory allocations.
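
One practical point when planning such a run: with MPI-PR, one MPI rank on each node serves as a progress rank and does not take part in the computation. The sketch below only does that bookkeeping; `mpipr_compute_ranks` is a hypothetical helper name, and the 8 nodes with 128 ranks each are the figures mentioned earlier in this thread. Note also that ARMCI_NETWORK is a build-time setting, so the binary has to be built with it (the trace above references openib.c, which suggests the binary used there was an OPENIB build).

```shell
#!/usr/bin/env bash
# Number of ranks that do NWChem work under ARMCI_NETWORK=MPI-PR, where one
# rank per node is set aside as a progress (communication) rank.
mpipr_compute_ranks() {
    local nodes="$1" ranks_per_node="$2"
    echo $(( nodes * (ranks_per_node - 1) ))
}

# 8 nodes with 128 ranks each, as in the run described above
mpipr_compute_ranks 8 128
```

For 8 nodes with 128 ranks each this leaves 1016 of the 1024 ranks for computation.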