Linux Cluster - Simulation killed by signal 11 and 9


Rohan Barot

Oct 13, 2021, 8:13:14 AM
to FDS and Smokeview Discussions
Hello Community,

I want to run a large simulation with around 170 million cells. For that I have built a Linux cluster from four identical workstations, each with 24 cores and 48 processors and 128 GB RAM (512 GB in total). All four workstations have FDS 6.7.6 and MPI version 3.1 installed.

I have already set the stack size to unlimited on all four workstations.
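A stack limit set in an interactive shell does not necessarily apply to processes that the MPI launcher starts over ssh, so it can be worth checking the limit that a launched process actually sees. The following is a minimal sketch using Python's standard `resource` module; the script name `check_stack.py` used below is hypothetical.

```python
import resource
import socket


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limit_report():
    """Return one line with this host's soft and hard stack limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return (f"{socket.gethostname()}: stack soft={describe_limit(soft)} "
            f"hard={describe_limit(hard)}")


if __name__ == "__main__":
    print(stack_limit_report())
```

Assuming Python 3 is installed on every workstation, launching it the same way as the FDS job, for example `mpiexec -hosts w1,w2,w3,w4 -n 4 python3 check_stack.py`, would print the limit that each launched process inherits.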

I have created three simplified test models, each with 96 meshes (one mesh per core).

Model 1 with 24 million cells runs perfectly, with a maximum RAM consumption of 15 GB on each workstation.

Model 2 with 68 million cells crashed after a few minutes. At the time of the crash, the workstations had a maximum RAM consumption of around 52 GB (around 40%). The error is shown in the attached error.png.
 

Model 3 with 98 million cells crashes as soon as it starts, with the same error.

I would like to ask: if the error is caused by running out of memory, why is the RAM consumption only around 40%, even though I have already set the stack size to unlimited?
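The figures reported above allow a rough back-of-the-envelope projection. The sketch below assumes that cells are split evenly across the four workstations and that memory grows linearly with the cell count; both are simplifying assumptions, not FDS guarantees, and the function names are made up for illustration.

```python
WORKSTATIONS = 4
RAM_PER_WORKSTATION_GB = 128


def gb_per_million_cells(total_cells_millions, peak_gb_per_workstation):
    """Observed memory cost per million cells on one workstation."""
    cells_per_workstation = total_cells_millions / WORKSTATIONS
    return peak_gb_per_workstation / cells_per_workstation


def projected_gb(total_cells_millions, rate_gb_per_million):
    """Projected peak memory per workstation under linear scaling."""
    return rate_gb_per_million * total_cells_millions / WORKSTATIONS


# Rates derived from the two reported data points.
rate_model1 = gb_per_million_cells(24, 15)   # Model 1: 24 M cells, 15 GB
rate_model2 = gb_per_million_cells(68, 52)   # Model 2: 68 M cells, 52 GB

if __name__ == "__main__":
    print(f"Model 1 rate: {rate_model1:.2f} GB per million cells")
    print(f"Model 2 rate: {rate_model2:.2f} GB per million cells")
    for cells in (98, 170):
        need = projected_gb(cells, rate_model2)
        verdict = "fits in" if need < RAM_PER_WORKSTATION_GB else "exceeds"
        print(f"{cells} M cells: about {need:.0f} GB per workstation, "
              f"{verdict} {RAM_PER_WORKSTATION_GB} GB")
```

Under these assumptions, the 98 million cell model would need roughly 75 GB per workstation, which is consistent with the observation that total RAM alone does not explain the crash, while the 170 million cell target would need roughly 130 GB and so would not fit in 128 GB.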

I am running the job using the command:

mpiexec -hosts w1,w2,w3,w4 -n 96 -print-all-exitcodes fds job.fds

 
I would be grateful if you could share your views on how to solve this problem. I want to run the actual simulation with around 170 million cells, so any expert advice on handling models of this size would be very welcome. The simplified FDS file is attached.

Regards
Rohan
error.png
Test-Mesh-96.fds

John Van Workum

Oct 14, 2021, 9:09:16 AM
to FDS and Smokeview Discussions
Hi Rohan,

That error usually indicates an issue with MPI on a particular node (in your case node 35).  I was able to run your Test-Mesh-96.fds on our cluster with 24 core nodes, 128GB RAM, FDS 6.7.6, Intel MPI, Infiniband. But we have hyperthreading disabled. I would first disable hyperthreading on your workstation nodes. Check for RAM or network issues with node 35 (dmesg or IPMI server health). 
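One thing a `dmesg` check could look for is the kernel's out-of-memory killer, which ends processes with signal 9. Signal 9 can also appear when the MPI launcher terminates the remaining ranks after one rank fails, so a match is a hint rather than proof. A minimal sketch, assuming the kernel log has been saved to a text file and that OOM events contain phrases such as "Out of memory" or "oom-killer" (the exact wording varies between kernel versions; the script and file names below are hypothetical):

```python
import re
import sys

# Phrases commonly seen in kernel OOM-killer messages (assumed, not exhaustive).
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom-kill", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Read a kernel log that was saved beforehand.
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            matches = find_oom_lines(handle.read())
        for line in matches:
            print(line)
        print(f"{len(matches)} possible OOM line(s) found")
    else:
        print("usage: python3 find_oom.py <saved-kernel-log>")
```

On each workstation, this could be used as `dmesg > dmesg_w1.txt` followed by `python3 find_oom.py dmesg_w1.txt`.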

John

Kevin McGrattan

Oct 14, 2021, 9:44:49 AM
to fds...@googlegroups.com
Your 96-mesh test case is running successfully on our Linux cluster, which has nodes with 8 cores. I am using 12 nodes. No hyperthreading. Each MPI process appears to be using 1.6 GB RAM.

Rohan Barot

Nov 1, 2021, 9:12:04 AM
to FDS and Smokeview Discussions

Hi John and Kevin,

Thank you very much for the reply. We disabled hyperthreading, but it did not help; the same segmentation error appears. @John, as you mentioned that there could be an issue with one particular node (node 35), we changed the order of the nodes in the job start command. Now the error is reported for a different node, not node 35.

Thanks in advance!

Kind regards
Rohan 
fehler.png

John Van Workum

Nov 2, 2021, 9:00:50 AM
to FDS and Smokeview Discussions
Describe the network you are using between the 4 workstations.

Rohan Barot

Nov 2, 2021, 9:56:55 AM
to FDS and Smokeview Discussions
Hi,

We have a Beowulf cluster. All nodes have static IP addresses and are connected to each other via passwordless SSH. We use an NFS folder on the head node, which is mounted on the other nodes at the same mount point. For test purposes, everything is in a DMZ and connected through a switch. The firewall is deactivated to avoid any further errors. The setup works for a limited number of cells. Doesn't that mean that communication between the nodes has been established? If you need any other information, please let me know.

Thanks again for the support.

Regards
Rohan

Kevin McGrattan

Nov 2, 2021, 10:40:53 AM
to fds...@googlegroups.com
My advice to you is to compile and run the simple "hello world" program in the FDS GitHub repository


It may be helpful to clone the fds repository and use the makefile that is in the same directory. This hello world program will help you diagnose the problems with a simple Fortran program that has nothing to do with FDS. 


John Van Workum

Nov 2, 2021, 10:53:15 AM
to fds...@googlegroups.com
What is your network hardware? Switch, Ethernet, IB? Have you verified there are no network errors? Your largest test case could be affected by network issues. 
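On Linux, one quick way to look for network errors is the per-interface counters in `/proc/net/dev`. The sketch below parses that file's standard layout (two header lines, then sixteen counters per interface) and lists interfaces with non-zero error or drop counts. It only covers kernel-level counters on the local host; switch-side problems would need to be checked separately, and the function names are made up for illustration.

```python
def parse_net_dev(text):
    """Map interface name -> rx/tx error and drop counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def interfaces_with_errors(counters):
    """Return the sorted names of interfaces with any non-zero counter."""
    return sorted(name for name, c in counters.items() if any(c.values()))


if __name__ == "__main__":
    try:
        with open("/proc/net/dev", encoding="ascii") as handle:
            stats = parse_net_dev(handle.read())
    except OSError:
        print("/proc/net/dev is not available on this system")
    else:
        for name in interfaces_with_errors(stats):
            print(name, stats[name])
```

Running it on each workstation before and after a failed job and comparing the counters would show whether errors accumulate during the run.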

--

John Van Workum
Principal, Sabalcore

3505 Lake Lynda Drive, Suite 200, Orlando, FL 32817


