Questions in NWChem benchmark for MRCCSD calculations


JEN-SHIANG YU

Mar 16, 2021, 2:34:47 PM
to NWChem Forum
Dear Colleague,

Some time ago I requested the input/output files for the MRCCSD benchmark in NWChem.
Now I am trying to run the benchmark on an HPC system with 100 Gb InfiniBand as the interconnect. The goal is to use all 900 nodes, each with 192 GB of RAM and 56 Intel Xeon Platinum 8280 cores. I have successfully compiled NWChem 7.0.2 with the Intel 2021.1 compiler using ARMCI_NETWORK=OPENIB. The queuing system is SLURM.

I started with 400 nodes (22,400 cores) and have two problems with the run:
(1) The environment variables
export USE_NOFSCHECK=1
export USE_NOIO=1
did not skip the directory-check step (i.e., the generation of *.dir_check_p.* files), so each run waits about 10 minutes before the SCF calculation starts. Was I wrong to expect these two variables to bypass the I/O check?

(2) The calculation died after the generation of "Reference static distribution", with the GA error messages copied below. I also tried increasing and decreasing the tilesize (20, 24, and 28), but it made no difference. I attach the complete output for tilesize=28 at the link below; could you please offer some suggestions to help get this large calculation running? Thank you very much in advance!

BWCCSD output with 22,400 cores, tilesize=28: 
SLURM script:
 
Sincerely,
Jen-Shiang

==== some info near the end of output copied below ===
.....
Ref.   2 Half 2-e         109.38         123.84
V 2-e /work/u31jsy00/scrat in bytes=       215017721856
Ref.  18 Half 2-e         110.27         124.90
V 2-e /work/u31jsy00/scrat in bytes=       215017721856
Ref.  17 Half 2-e         110.80         125.46
V 2-e /work/u31jsy00/scrat in bytes=       215017721856
Ref.  19 Half 2-e         109.98         126.84
V 2-e /work/u31jsy00/scrat in bytes=       215017721856
(rank:16128 hostname:cpn3289 pid:88419):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/iterator.c:armci_stride_info_init():35 cond:(stride_levels>=0)
15680: error ival=10
(rank:15680 hostname:cpn3281 pid:68436):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
10472:Segmentation Violation error, status=: 11
(rank:10472 hostname:cpn3188 pid:139469):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/signaltrap.c:SigSegvHandler():315 cond:0
10080: error ival=10
(rank:10080 hostname:cpn3181 pid:141678):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
2688:Segmentation Violation error, status=: 11
(rank:2688 hostname:cpn3049 pid:123287):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/signaltrap.c:SigSegvHandler():315 cond:0
2240: error ival=10
(rank:2240 hostname:cpn3041 pid:98406):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
5040:Segmentation Violation error, status=: 11
(rank:5040 hostname:cpn3091 pid:22580):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/signaltrap.c:SigSegvHandler():315 cond:0
4480: error ival=10
(rank:4480 hostname:cpn3081 pid:63134):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
(rank:18368 hostname:cpn3329 pid:25206):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/iterator.c:armci_stride_info_init():35 cond:(stride_levels>=0)
(rank:18648 hostname:cpn3334 pid:94473):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/common/iterator.c:armci_stride_info_init():35 cond:(stride_levels>=0)
17920: error ival=10
(rank:17920 hostname:cpn3321 pid:205483):ARMCI DASSERT fail. ../../ga-5.7.2/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
.......
==== End of copied info ===  

Edoardo Aprà

Mar 16, 2021, 3:28:23 PM
to NWChem Forum

(1) USE_NOIO (which also triggers USE_NOFSCHECK) must be defined at compile time, not in the SLURM script.
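As a minimal rebuild sketch (the NWCHEM_TOP path is a placeholder for your own source tree; consult the NWChem build documentation for the full set of required variables):

```shell
# Set USE_NOIO in the environment *before* compiling, then rebuild.
# NWCHEM_TOP below is a placeholder; point it at your NWChem 7.0.2 sources.
export NWCHEM_TOP=/path/to/nwchem-7.0.2
export USE_NOIO=y     # compile-time flag; also enables USE_NOFSCHECK
cd "$NWCHEM_TOP/src"
make clean
make
```

Re-exporting USE_NOIO in the SLURM script has no effect, because the flag only changes what gets compiled into the binary.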

(2) As I have already stated a few times (https://groups.google.com/g/nwchem-forum/c/SqYVweRRr8U/m/qHfYN9XhAwAJ),
to avoid these memory problems you might want to try compiling NWChem with ARMCI_NETWORK=MPI-PR instead of ARMCI_NETWORK=OPENIB.
MPI-PR has more robust handling of large shared-memory allocations.
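A hedged sketch of the MPI-PR rebuild (again, NWCHEM_TOP is a placeholder, and your MPI stack must be in the build environment):

```shell
# Rebuild using the MPI progress-rank (MPI-PR) port instead of OPENIB.
export NWCHEM_TOP=/path/to/nwchem-7.0.2   # placeholder path
export ARMCI_NETWORK=MPI-PR
cd "$NWCHEM_TOP/src"
make clean
make
```

Note that, if I recall the MPI-PR documentation correctly, one MPI rank per node is dedicated to communication progress, so the SLURM job should launch one extra process per node (e.g. 57 ranks per node to obtain 56 compute ranks).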

PS: Please use the current NWChem website for future reference, e.g. https://nwchemgit.github.io/Benchmarks.html#parallel-performance-of-the-multireference-coupled-cluster-mrcc-methods. The old NWChem website domain has been taken over by cybersquatters.
