ULFM on ompi/main getting stuck in MPI_Finalize()

Rodrigo Coacci

Feb 3, 2023, 2:42:25 PM2/3/23
to User Level Fault Mitigation
Hi everyone, 

I'm currently trying ULFM from Open MPI's main branch (specifically commit 68395556ce). While running on a single node everything works fine, but as soon as I add another node, the living processes get stuck in MPI_Finalize().

The test program I'm using is a variant (basically with more printf's) of https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/02.err_handler.c. I've attached the result of ompi_info.

The cluster in question is a production/development cluster with InfiniBand, GPUs, and Ethernet, but I didn't enable UCX in the Open MPI install (leaving CUDA enabled, as seen in ompi_info.txt), and it seems to be using TCP without problems. I tried forcing it to use the Ethernet interface (via btl_tcp_if_include) but got the same results.
I'm running it through the cluster's Slurm installation (unfortunately it's 20.11.9; as you probably know, that's hard to change on a production cluster) using sbatch and the following mpirun command line:

mpirun --with-ft ulfm --display-comm --display-comm-finalize  err_handler

The --display-comm output confirms that it's using TCP for communication between nodes.
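For reference, the submission looks roughly like the following sketch (the job name, node/task counts, and interface name eth0 are illustrative placeholders, not the actual values used):

```shell
#!/bin/bash
#SBATCH --job-name=ulfm-test       # placeholder job name
#SBATCH --nodes=2                  # the hang only appears with more than one node
#SBATCH --ntasks-per-node=4        # placeholder task count

# mpirun picks up the Slurm allocation; --with-ft ulfm enables the
# ULFM fault-tolerance runtime, and the --display-comm flags print
# which components each communicator uses at setup and finalize.
mpirun --with-ft ulfm \
       --display-comm --display-comm-finalize \
       --mca btl_tcp_if_include eth0 \
       err_handler
```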

While writing this e-mail I found out that after disabling shared memory (with --mca btl ^sm) the living processes exit (none get stuck in MPI_Finalize()), but the job never finishes: prted/prterun/srun processes keep running, depending on the node.
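Concretely, the variant that avoids the MPI_Finalize() hang (but still leaves runtime processes behind) is a sketch like this:

```shell
# The ^sm syntax excludes the shared-memory BTL, forcing all traffic
# over TCP. With this, ranks get past MPI_Finalize(), but the
# prted/prterun/srun processes linger on the nodes.
mpirun --with-ft ulfm --mca btl ^sm \
       --display-comm --display-comm-finalize \
       err_handler
```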

So there seem to be two issues here: one related to the sm component, and the other related to Slurm.

Is there anything else I should try? Unfortunately, I've found few resources on debugging Open MPI itself.
 
Should I open a bug report on GitHub?

Regards,
Rodrigo.

ompi_info.txt

George Bosilca

Feb 6, 2023, 11:03:01 AM2/6/23
to ul...@googlegroups.com
Hi Rodrigo,

Thanks for the report; we are looking into it. In fact, there was a similar bug report, and I think a solution has been pushed into PRRTE, but it has not yet been merged into OMPI. Meanwhile, you might want to give it a try with [1].

Please create an issue; it allows us to track the problems and see how they get fixed.

Thanks,
  George.

[1]


Rodrigo Coacci

Feb 10, 2023, 11:15:23 AM2/10/23
to ul...@googlegroups.com
Thanks, I'll open a bug report on GitHub ASAP with this information.
On the other hand, I'm trying to use ULFM for my Master's thesis. Is there a version/release/commit you would consider a "stable" target? Perhaps rc7? The last 4.x ULFM?
Any advice?

Regards,
      Rodrigo


George Bosilca

Feb 12, 2023, 8:17:07 PM2/12/23
to ul...@googlegroups.com
The next stable release is 5.0; it might be safe to assume things there break less easily. Personally, I tend to upgrade to main as often as possible. I always keep around the install directory and the SHA of the last version I was happy with, so if the upgrade proves of lesser quality I can safely switch back to a working state.
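A minimal sketch of that workflow, assuming builds from a git checkout into per-commit install prefixes (the paths and parallelism level are illustrative, not a prescribed setup):

```shell
# Record the exact commit and install into a prefix named after it,
# so a known-good build can be kept alongside newer ones.
cd ompi
SHA=$(git rev-parse --short HEAD)
./configure --prefix="$HOME/opt/ompi-$SHA"
make -j8 install

# Switch between installs by pointing PATH at the desired prefix;
# reverting to the last good build is just changing this one line.
export PATH="$HOME/opt/ompi-$SHA/bin:$PATH"
```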

Once we release 5.0, that will be the stable version. Hopefully the release will happen soon.
  George.


Rodrigo Coacci

Feb 13, 2023, 5:52:45 PM2/13/23
to ul...@googlegroups.com
Thanks for the pointers. I'll try the 5.0 branch and see if I can make that work. BTW, this is the issue I created on GitHub: https://github.com/open-mpi/ompi/issues/11404. I've seen that the submodules were updated on GitHub, so I'll try again and report back whether the issues are gone.

Regards,
      Rodrigo

