I have a problem with a large CCSDTQ calculation using the latest NWChem (7.0.2). The input and the build script are given below (the full input/output is available upon request for NWChem developers).
On a 2-node computing system (each node has 2x64-core AMD Rome processors with 512 GB RAM), I tried different 2emet algorithms (i.e. 16, 15, 14, 13, 4, 3, 2, all with io=ga), and I got different errors.
For 2emet = 16, the job hit the [strided_to_subarray_dtype] error discussed at
https://github.com/nwchemgit/nwchem/issues/100,
(2emet 16, 2eorb)
##########################################################################################
v2 file size = 18806764
4-index algorithm nr. 16 is used
imaxsize = 20
imaxsize ichop = 0
p[32] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 100144
p[30] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 345136
forrtl: error (78): process killed (SIGTERM)
##########################################################################################
So I set the environment variables suggested there and the job got a little further, but then it hit
another error with the message: “hashv2: key not found 2 0”
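For reference, a minimal sketch of how the two COMEX variables can be set in a job script. The variable names and values are the ones used in this post; the commented-out launch line (mpirun, rank count, input and output file names) is only an assumption about a typical Open MPI launch and has to be adapted to the actual job.

```shell
# Turn off the GET/PUT datatype options in ComEx (values as used in this post).
export COMEX_ENABLE_GET_DATATYPE=0
export COMEX_ENABLE_PUT_DATATYPE=0

# Assumed launch line, not taken from the post: with Open MPI, -x forwards
# an environment variable to the remote ranks.
# mpirun -np 64 -x COMEX_ENABLE_GET_DATATYPE -x COMEX_ENABLE_PUT_DATATYPE \
#     nwchem CCSDTQ-Test-1.nw > CCSDTQ-Test-1.out

# Print the settings so they end up in the job log.
echo "COMEX_ENABLE_GET_DATATYPE=$COMEX_ENABLE_GET_DATATYPE"
echo "COMEX_ENABLE_PUT_DATATYPE=$COMEX_ENABLE_PUT_DATATYPE"
```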
(2emet 16, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
##########################################################################################
t4 file size = 35932110259
t4 file name = /scratch/shchien/CCSDTQ-Test-1//CCSDTQ-Test-1.t4
CCSDTQ iterations
--------------------------------------------------------
Iter Residuum Correlation Cpu Wall
--------------------------------------------------------
key= 1547
key= 238
key= 2218
hashv2: key not found 2 0
------------------------------------------------------------------------
------------------------------------------------------------------------
key= 22162
key= 14713
hashv2: key not found 2 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------
For more information see the NWChem manual at
https://github.com/nwchemgit/nwchem/wiki
##########################################################################################
Similar errors occurred for 2emet = 2, 3, 4, 13, 14 and 15
(2emet 15, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 14, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 13, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 4, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 3, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
(2emet 2, 2eorb, with COMEX_ENABLE_GET_DATATYPE=0 and COMEX_ENABLE_PUT_DATATYPE=0 )
##############################################################################################################
CCSDTQ iterations
--------------------------------------------------------
Iter Residuum Correlation Cpu Wall
--------------------------------------------------------
key= 18152
hashv2: key not found 2 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
key= 38233
hashv2: key not found 2 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------
##############################################################################################################
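To make the sweep over the 2emet values reproducible, one input file per algorithm can be generated from a single template. This is only a sketch: `make_2emet_inputs` and the output file names are hypothetical, and it assumes the template contains exactly one `2emet <number>` line, as in the tce block of the input.

```shell
# make_2emet_inputs TEMPLATE OUTDIR N...
# Hypothetical helper: writes OUTDIR/2emet_N.nw for each N, replacing the
# number on the existing "2emet" line of TEMPLATE and leaving every other
# line unchanged.
make_2emet_inputs() {
    template=$1
    outdir=$2
    shift 2
    mkdir -p "$outdir" || return 1
    for n in "$@"; do
        # \1 keeps the original indentation and the "2emet" keyword.
        sed -E "s/^([[:space:]]*2emet[[:space:]]+)[0-9]+/\1${n}/" \
            "$template" > "$outdir/2emet_${n}.nw"
    done
}
```

It could be called as `make_2emet_inputs CCSDTQ-Test-1.nw sweep 16 15 14 13 4 3 2`. The start name, permanent_dir and SCRATCH_DIR lines are copied unchanged, so the generated inputs should be run one at a time or edited to use separate directories.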
Most annoying to me (as a system admin) is that these errors leave a lot of junk files in /dev/shm that require manual removal. (I thought we had discussed this problem in version 6.6 or 6.7, and that it had been fixed.)
##############################################################################################################
[root@hpcnode055 shm]# df
Filesystem 1K-blocks Used Available Use% Mounted on
…
tmpfs 264040960 141854272 122186688 54% /dev/shm
…
[shchien@hpcnode009 CCSDTQ-Test-1]$ cd /dev/shm
[shchien@hpcnode009 CCSDTQ-Test-1]$ ls dev/shm
cmx11806834260000037326000001 cmx1180683426000003803200006p
cmx11806834260000037326000002 cmx1180683426000003803200006q
cmx1180683426000003732600004u cmx1180683426000003803200006r
cmx1180683426000003732600004v cmx1180683426000003803200006s
cmx11806834260000037326000062 cmx1180683426000003803200006u
cmx11806834260000037326000064 cmx11806834260000038033000001
cmx11806834260000037326000065 cmx11806834260000038033000002
cmx11806834260000037327000001 cmx1180683426000003803300004u
cmx11806834260000037327000002 cmx1180683426000003803300004v
cmx1180683426000003732700004u cmx11806834260000038033000062
cmx1180683426000003732700004v cmx11806834260000038033000064
cmx11806834260000037327000062 cmx11806834260000038033000066
…
##############################################################################################################
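Until the leak itself is fixed, the leftover segments can be removed with a small helper like the one below. This is a sketch, not an official NWChem tool: `clean_cmx_segments` is a hypothetical name, and it assumes that every stale segment matches `cmx*` (as in the listing above) and belongs to the user who runs the function.

```shell
# clean_cmx_segments [DIR]
# Hypothetical helper: deletes regular files named cmx* in DIR (default
# /dev/shm) that are owned by the calling user, printing each removed path.
# Files of other users and files with other names are left alone.
clean_cmx_segments() {
    dir=${1:-/dev/shm}
    find "$dir" -maxdepth 1 -type f -name 'cmx*' -user "$(id -un)" -print -delete
}
```

Calling `clean_cmx_segments` with no argument in a job epilogue cleans /dev/shm on that node. It should only run when the user has no other NWChem job active on the node, because live segments match the same pattern.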
This “hashv2: key not found 2” error has been discussed only once before, but the solution was not quite clear to me (ignore 2eorb?):
https://nwchemgit.github.io/Special_AWCforum/st/id1748/hashv2__key_not_found_with_TCE_a....html
After I removed 2eorb, these algorithms were not recognized by NWChem (tce_energy: invalid 2emet: 16)
##############################################################################################################
4-index algorithm nr. 16 is used
4-index algorithm nr. 16 is used
Fock matrix recomputed
1-e file size = 17800
1-e file name = /home/shchien/nwchem/CCSDTQ-Test-1//CCSDTQ-Test-1.f1int.000000
Cpu & wall time / sec 9.0 9.1
tce_energy: invalid 2emet: 16
tce_energy: invalid 2emet: 16
------------------------------------------------------------------------
------------------------------------------------------------------------
##############################################################################################################
Here is part of the input:
##########################################################################################
echo
start CCSDTQ-Test-1
memory stack 3500 mb heap 200 mb global 11500 mb
permanent_dir /home/shchien/nwchem/CCSDTQ-Test-1/
SCRATCH_DIR /scratch/shchien/CCSDTQ-Test-1/
charge 1
geometry units angstrom
zmatrix
(The geometry was removed due to NDA with user)
end
end
basis "ao basis" spherical
* library cc-pvtz-dk
end
scf
vectors input CCSDTQ-Test-1_scf_mo
rohf
doublet
thresh 1e-8
maxiter 200
end
RELATIVISTIC
DOUGLAS-KROLL ON
END
tce
SCF
CCSDTQ
thresh 1e-6
io ga
freeze atomic
attilesize 20
tilesize 3
2emet 1
end
set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:save_interval 10
set tce:tceiop 2048
set tce:nts t
task scf energy
task tce energy
##########################################################################################
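Since the NWChem memory directive is a per-process request, the memory line above fixes an upper bound on the number of ranks that fit on one node. The arithmetic below is only a sanity check under stated assumptions: 512 GB is taken as 512*1024 MB, OS and file-cache usage are ignored, and the number of ranks per node actually used is not stated in this post. `max_ranks_per_node` is a hypothetical helper name.

```shell
# max_ranks_per_node NODE_RAM_MB STACK_MB HEAP_MB GLOBAL_MB
# Hypothetical helper: integer number of ranks whose combined
# stack + heap + global request still fits into NODE_RAM_MB.
max_ranks_per_node() {
    per_rank_mb=$(( $2 + $3 + $4 ))
    echo $(( $1 / per_rank_mb ))
}

# Values from the memory line above; node RAM assumed to be 512*1024 MB.
max_ranks_per_node $((512 * 1024)) 3500 200 11500
```

Each rank requests 3500 + 200 + 11500 = 15200 MB, so running one rank per core (128 per node) would ask for far more than the installed RAM.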
Here is how I compiled NWChem. I tried this with IntelMPI/IntelCompiler, OpenMPI/IntelCompiler and OpenMPI/UCX 1.10/IntelCompiler.
##########################################################################################
module purge
export MODULEPATH=/cm/local/modulefiles:/etc/modulefiles:/usr/share/modulefiles:/usr/share/Modules/modulefiles:/cm/shared/modulefiles/compiler:/cm/shared/modulefiles/library:/cm/shared/modulefiles/mpi
module load intel
module load cuda/11.0.2 cuda/blas/11.0.2 cuda/fft/11.0.2
export NWCHEM_TOP=/home/shchien/nwchem-7.0.2
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=MPI-PR
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export DISABLE_GAMIRROR=y
export USE_GAGITHUB=y
export BLAS_SIZE=8
export BLASOPT="-mkl -liomp5 -lpthread -lm -ldl -qopenmp -qopenmp-simd "
#export BLASOPT="/apps/openblas/0.3.6/lib/libopenblas_gcc_i8_s.a -lpthread -ldl "
export LAPACK_LIB=$BLASOPT
export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK=" ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a"
export SCALAPACK_LIB="$SCALAPACK $BLASOPT"
unset SCALAPACK_LIB
#unset SCALAPACK
export MPI_INCLUDE="/cm/shared/common_software_stack/apps/libraries/mpi/openmpi/4.1.0/hpcx_icc/include"
export LIBMPI="-L/cm/shared/common_software_stack/apps/libraries/mpi/openmpi/4.1.0/hpcx_icc/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -ldl -Wl,--export-dynamic -lutil "
export NWCHEM_MODULES="all"
export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE
export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
export MPICC=mpicc
export MPIFC=mpif90
export TCE_CUDA=Y
export CUDA_LIBS="-L /cm/shared/apps/cuda11.0/toolkit/11.0.2/lib64 -lcublas -lcudart "
export CUDA_FLAGS="-arch sm_70"
export CUDA_INCLUDE="-I. -I/usr/local/cuda/include"
export CUDA=nvcc
##########################################################################################