api folder in openmpi-tests-public/ulfm-testing question

Howard Pritchard

Aug 22, 2022, 2:22:54 PM
to User Level Fault Mitigation
Hello, 

Are the tests in the openmpi-tests-public/ulfm-testing/api folder expected to run successfully against the current Open MPI main branch (checked out today)?

We're seeing failures when we run the script. Four of the tests from the api folder failed: "ULFM error returns after failure", "ULFM error during ANY_SOURCE after failure", "ULFM error insulation from failure in another comm", and "ULFM shrink after failure is compliant".

We added --with-ft=mpi to the configure line for Open MPI.

Thanks,

Howard

George Bosilca

Aug 22, 2022, 2:39:16 PM
to ul...@googlegroups.com
That is not supposed to happen. Three of the tests explicitly return 0, and the last one (ULFM error insulation from failure in another comm) returns the default (which is expected to be 0).

Do you have the output of your run?

Thanks,
  George.


Jacob Tronge

Aug 22, 2022, 5:02:26 PM
to User Level Fault Mitigation
Hello,

I've attached the output of two different runs that we did.

The first file (slurm-pml-ob1.out) is from running with `OMPI_MCA_pml=ob1`; four tests fail.

The second one (slurm-no-uct.out) is from running with both `OMPI_MCA_pml=ob1` and `OMPI_MCA_btl=^uct`; seven tests fail.

Thanks,
Jake
slurm-no-uct.out
slurm-pml-ob1.out
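
For anyone reproducing the two configurations described above, here is a minimal sketch of how the environment for each run could be built. It relies only on the fact that Open MPI reads MCA parameters from environment variables prefixed with `OMPI_MCA_`. The helper name `mca_env` is hypothetical and not part of Open MPI or the test suite.

```python
import os


def mca_env(base=None, **params):
    """Return a copy of `base` (default: the current environment) with
    OMPI_MCA_<name>=<value> set for each keyword argument.

    `mca_env` is a hypothetical helper written for this illustration.
    """
    env = dict(os.environ if base is None else base)
    for name, value in params.items():
        env[f"OMPI_MCA_{name}"] = str(value)
    return env


# The two configurations from the message above, built on an empty base
# so that only the MCA settings are visible.
ob1_only = mca_env({}, pml="ob1")
ob1_no_uct = mca_env({}, pml="ob1", btl="^uct")
```

Either dictionary can be passed as the `env=` argument of `subprocess.run` when launching the test script (with `base` left at its default so the rest of the environment is inherited). The script's path is not given in this thread, so the launch itself is not shown.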

George Bosilca

Aug 22, 2022, 5:40:26 PM
to ul...@googlegroups.com
The only thing I can say from these log files is that there seems to be a regression in PRTE, where a non-zero error code is returned even though the MPI job completed successfully. Indeed, if you look at the log you can see that the test prints "TEST PASSED" (which in the code is followed by a return 0), but mpirun returns something different.

  George.
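
The symptom described above (the test prints "TEST PASSED" but the launcher exits non-zero) can be separated from a genuine test failure by looking at both signals together. The following is only a sketch under stated assumptions: the function name `classify_run` and its result labels are hypothetical, and the "TEST PASSED" marker is taken from the message above rather than checked against every test in the suite.

```python
def classify_run(output: str, returncode: int) -> str:
    """Classify one test run from its captured output and the exit code
    reported by the launcher.

    Hypothetical helper; the labels below are illustrative only.
    """
    printed_pass = "TEST PASSED" in output
    if returncode == 0:
        # The launcher agrees that the job succeeded.
        return "pass"
    if printed_pass:
        # The test reported success, yet the launcher exit code is non-zero:
        # the pattern attributed above to a possible launcher-side regression.
        return "launcher-mismatch"
    # Non-zero exit code and no success marker in the output.
    return "fail"
```

Applied to the captured output and exit code of each test, a "launcher-mismatch" result would point at the runtime rather than at the ULFM test itself.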

