I cannot reproduce the "Too many opened files on the system" problem, even using the same number of nodes and the same core count.
The problem seems to happen randomly, but it appears to be related to high core and node counts, and it also depends on the number of files opened by other processes on the system.
So I did another study with the same input, using 128 cores on each of 32 nodes, i.e. 4096 cores. The CISDTQ calculation completed, but much more slowly due to overhead.
I killed the job at the last iteration (#16) to preserve the temp files in /dev/shm.
There are only 6994 files in /dev/shm, i.e.
[root@hpcnode117 ~]# ls /dev/shm/sem.cmx0000001032000004* |wc -l
128
[root@hpcnode117 ~]# ls /dev/shm/cmx0000001032000004* |wc -l
6858
[root@hpcnode117 ~]# ls /dev/shm/* |grep -v cmx |wc -l
8
I also dumped the list of open files on one of the nodes during the second-to-last iteration (#15): there were 3899442 files open on the system, which is quite close to the system limit I set in /proc/sys/fs/file-max (5480000).
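For future reproduction runs, it may be cheaper to sample the kernel's own counter periodically instead of taking a full lsof dump near the end; a minimal sketch, assuming Linux, where /proc/sys/fs/file-nr reports allocated handles, free handles, and file-max:

```shell
#!/bin/sh
# Log the system-wide open-file counters every 30 seconds.
# Fields in /proc/sys/fs/file-nr: allocated handles, free handles, file-max.
while :; do
    printf '%s ' "$(date +%s)"
    cat /proc/sys/fs/file-nr
    sleep 30
done
```

Run in the background on each node during the job; the log then shows at which iteration the allocated-handle count approaches file-max.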
Looking further into these open files, about 3.6M of them are related to nwchem:
[root@hpcnode117 ~]# cat temp |grep nwchem |wc -l
3699965
[root@hpcnode117 ~]# cat temp |grep sem.cmx |wc -l
65024
[root@hpcnode117 ~]# cat temp |grep -v sem|grep cmx |wc -l
3483476
[root@hpcnode117 ~]# cat temp |grep -v nwchem |wc -l
199477
For comparison, about 190000 files are open on an idle system:
[root@hpcnode117 ~]# lsof |wc -l
192954
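One caveat when reading these counts: lsof prints one row per process holding a file, including cwd, txt, and memory-mapped entries that do not consume kernel file handles, so its line count need not equal the kernel's allocated-handle count. Comparing the two is one way to sanity-check the numbers (a sketch, assuming Linux):

```shell
# lsof line count vs. the kernel's allocated file handles; the first
# can differ substantially (and is 0 here if lsof is not installed).
lsof 2>/dev/null | wc -l
awk '{print $1}' /proc/sys/fs/file-nr
```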
I hope this information helps in analysing the cause of the problem.
Thanks!