Parallel calculation breaks on 2 or more nodes


Yuan Yin

Jul 21, 2022, 9:31:03 AM
to DMFTwDFT
Dear developers,

Fortunately, I can run a full CSC calculation on any single node of my servers. However, parallel calculations with the DMFTwDFT codes on 2 or more nodes (e.g., node1 with 20 cores + node2 with 20 cores, 40 cores in total) fail. Usually the first 3-5 calculation loops run well, then the programs (dmft.x or ctqmc) become zombie processes and everything stops.
Here is the state of the programs while they are running:

Screenshot 2022-07-21 204503.jpg
Screenshot 2022-07-21 204307.jpg
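
Since the screenshots do not survive as text, the same information can be captured in a form that is easy to paste into a thread. The following is a minimal sketch, assuming a Linux node where `ps -eo pid,stat,comm` is available; the helper names `parse_ps` and `zombies` are hypothetical, and only the process names `dmft.x` and `ctqmc` come from the post. It reports the state column for those processes, so one can tell whether they are truly zombies (`Z`) or stuck in another state such as `D` or `S`.

```python
import shutil
import subprocess

# Process names taken from the post; everything else here is illustrative.
WATCHED = ("dmft.x", "ctqmc")


def parse_ps(text, watched=WATCHED):
    """Parse `ps -eo pid,stat,comm` output into (pid, state, name) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, stat, comm = parts
        name = comm.split()[0]  # drops a trailing "<defunct>" marker
        if name in watched:
            rows.append((int(pid), stat, name))
    return rows


def zombies(rows):
    """Keep only rows whose state code starts with Z (zombie)."""
    return [row for row in rows if row[1].startswith("Z")]


if __name__ == "__main__":
    if shutil.which("ps") is None:
        print("ps not found on this machine")
    else:
        out = subprocess.run(
            ["ps", "-eo", "pid,stat,comm"],
            capture_output=True, text=True, check=True,
        ).stdout
        rows = parse_ps(out)
        for pid, stat, name in rows:
            print(pid, stat, name)
        print("zombie count:", len(zombies(rows)))
```

Running this on each node while the job appears stalled gives a per-node snapshot of the process states.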

Could you help me figure out what happens during program execution?

Many thanks!
Yuan

Uthpala Herath

Jul 21, 2022, 2:30:30 PM
to Yuan Yin, DMFTwDFT
Dear Yuan, 

I'm glad you got the calculations to run. 
Regarding the multiple-node issue, I suspect it has something to do with how the interconnects are set up on your cluster.
Can you check whether the issue persists with other programs?
Your system administrator might also be able to help.
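
One way to try this with another program is a small MPI test that is independent of DMFTwDFT. The following is only a sketch, assuming `mpi4py` is installed and built against the same MPI library the cluster uses; the function names and the number of rounds are hypothetical. It reports how many ranks landed on each host and then repeats a collective operation, so a hang that appears only across nodes would show up here as well.

```python
import socket
from collections import Counter

try:
    # Assumption: mpi4py is available on the cluster.
    from mpi4py import MPI
except ImportError:
    MPI = None


def ranks_per_host(hostnames):
    """Count how many MPI ranks reported each hostname."""
    return dict(Counter(hostnames))


def main(rounds=100):
    if MPI is None:
        print("mpi4py not available; skipping the MPI check")
        return
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Collect every rank's hostname on rank 0 to confirm the node layout.
    hosts = comm.gather(socket.gethostname(), root=0)
    if rank == 0:
        print("ranks per host:", ranks_per_host(hosts))

    # Repeated collectives: a stall here points at MPI or the interconnect.
    for i in range(rounds):
        total = comm.allreduce(rank, op=MPI.SUM)
        assert total == size * (size - 1) // 2
        if rank == 0 and i % 10 == 0:
            print("round", i, "ok", flush=True)


if __name__ == "__main__":
    main()
```

Saved under a name of your choice (for example `check_mpi.py`) and launched across both nodes with the same launcher and host options used for the DMFTwDFT run, it should print the rank layout followed by progress lines. If it also stalls, the problem is likely in the MPI or network setup rather than in dmft.x or ctqmc.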

Thank you, 

Best,
Uthpala

--
Uthpala Herath
Postdoctoral Associate
Department of Mechanical Engineering and Materials Science
Duke University
Durham, NC
