MPI job fails on specific node combinations on PBS cluster (deal.II 9.7.0)


ME20D503 NEWTON

1:53 AM
to deal.II User Group

Hello deal.II community,

I am working with the deal.II finite element library and recently transitioned from workstation-based simulations to an HPC environment. My simulations are 3D problems with a large number of degrees of freedom, so I am using MPI parallelisation.

I am facing an issue while running an MPI-based deal.II application on our institute’s HPC cluster, and I would appreciate your guidance.

Software details:

  • deal.II version: 9.7.0

  • deal.II module: dealii_9.7.0_intel

  • MPI launcher available on system: /usr/bin/mpiexec (OpenMPI 4.1.5)

  • Intel oneAPI environment is sourced in the job script

  • Scheduler: PBS Pro (version 23.06.06)

HPC node configuration (from pbsnodes):

  • node1: 32 CPUs, 125 GB RAM

  • node2: 32 CPUs, 126 GB RAM

  • node3: 32 CPUs, 504 GB RAM

  • node4: 32 CPUs, 504 GB RAM

Observed behaviour:

  • The code runs correctly on any single node.

  • The code runs correctly when using node3 + node4 together.

  • The code fails when using node1 + node2 together, or other mixed node combinations.

The PBS job script and error file are attached for your reference.


Question:
Does this behaviour indicate a known issue related to MPI launcher usage, node allocation, or deal.II configuration on PBS-based clusters? Any guidance on how such node-combination-dependent failures should be diagnosed from the deal.II side would be very helpful.

Thank you for your time and support.

Best regards,
Newton


well_3variants_0degree.o546
script

Praveen C

2:14 AM
to dea...@googlegroups.com
Have you tried running a hello world example?

What happens if you put

mpirun -np 64 hostname

in your PBS script?
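
For reference, a minimal PBS test script along these lines might look like the sketch below (job name, walltime, and the resource selection are placeholder assumptions for illustration; the module name is the one you quoted):

#!/bin/bash
#PBS -N mpi_hostname_test
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l walltime=00:05:00
#PBS -j oe

cd $PBS_O_WORKDIR

# Load the same environment that was used to build the deal.II application
module load dealii_9.7.0_intel

# Launch one process per allocated core and print which node each process lands on
mpirun -np 64 hostname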

best
praveen


ME20D503 NEWTON

4:34 AM
to deal.II User Group
Yes, I tried that. It successfully launches 64 processes, 32 on node1 and 32 on node2, but then gives the same error as attached. I am able to run the hello world test on any single node and on node3 + node4 together, but it fails on node1 + node2 together, or on other mixed node combinations, even after the processes have been launched successfully. The error file is attached as "well_3variants_0degree.o546" in the earlier mail; exactly the same error appears for hello world too.

Veerendra Koilakuntla

8:01 AM
to dea...@googlegroups.com
From the behaviour you describe, this problem is most likely not coming from deal.II but from the MPI setup or the cluster configuration. The code runs fine on any single node but fails when certain nodes are mixed, which points to an MPI or system-level issue.

You load the dealii_9.7.0_intel module and source Intel oneAPI, yet the job is launched with /usr/bin/mpiexec, which belongs to OpenMPI. In my opinion this mismatch can cause node-dependent failures: the MPI library the code was compiled against and the launcher used to start it must be the same. You can check which MPI library the executable is actually linked against with ldd ./your_executable | grep mpi.

It is also quite possible that node3 and node4 have a different network or communication setup than node1 and node2, such as a different interconnect or different MPI transport settings, which can make the code run on some node combinations and fail on others. To test this, try forcing MPI to use TCP communication only; if the code then runs fine, the issue is with the cluster configuration and not with deal.II.

The way to troubleshoot is therefore: first make sure the MPI build and the MPI launcher are consistent, then launch MPI using the PBS hostfile.
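
A sketch of these checks, assuming the executable is called ./my_app (a placeholder) and that the launcher is either OpenMPI or Intel MPI; adjust names and flags to your installation:

# 1. Check which MPI library the executable is linked against
ldd ./my_app | grep -i mpi

# 2. Check which launcher is actually picked up inside the job
which mpirun mpiexec
mpirun --version

# 3a. Force TCP-only communication with OpenMPI, using the PBS hostfile
mpirun --mca pml ob1 --mca btl tcp,self -np 64 -hostfile $PBS_NODEFILE ./my_app

# 3b. Force TCP-only communication with Intel MPI (launcher from oneAPI)
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=tcp
export I_MPI_DEBUG=5          # prints the chosen fabric/provider at startup
mpirun -np 64 -machinefile $PBS_NODEFILE ./my_app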

