Hello,
I am testing ULFM in openmpi-5.0.0rc7 with a simple program and am getting this error. I launch the program with 3 processes, each on a different node (using PBS). P0 sends data to P1, P1 to P2, and P2 to P0 in an infinite loop, with a 1-second sleep after each receive. I kill P1 by logging into the node where it is running and issuing kill -9 <pid>.
I observe that the failure is detected, the communicator is successfully revoked and shrunk, and communication continues without P1 as expected. However, after a few iterations I get this error, which causes P0 to abort. Since the error comes from the openucx library, I tried setting --mca btl ^uct, but then the program either runs very slowly or stops advancing after a few iterations: even after a good 10 minutes (I currently stop the execution by killing the job), the data counter (incremented by 1 after each receive) never exceeds 6 across multiple tests.
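For illustration, the ring peers before and after the shrink can be derived as in the sketch below. The helper names ring_next and ring_prev are hypothetical, and this is a simplified sketch rather than the exact test program. It assumes the peers are recomputed from the rank and size of the communicator returned by MPIX_Comm_shrink, and that the survivors keep their relative order in it.

```c
#include <assert.h>

/* Rank that `rank` sends to in a ring of `size` processes.
 * In the MPI program, `rank` and `size` would come from MPI_Comm_rank and
 * MPI_Comm_size on the current (possibly shrunken) communicator. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Rank that `rank` receives from in the same ring. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}
```

With 3 processes, P0 sends to rank 1 and receives from rank 2. After P1 is killed and the communicator is shrunk to size 2, the old P2 becomes rank 1 under the ordering assumption above, so the two survivors simply exchange data with each other.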
Looking into it further, I found that this comes from the openucx library's internal polling, which (depending on the communication standard used) can also call polling functions from other libraries, such as ibv_poll_cq from libibverbs for InfiniBand. I was able to work around the issue by building openmpi against a custom openucx library, modified so that the uct_ib_mlx5_check_completion function ignores the MLX5_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR error code. With that change I no longer get the error and have seen no additional problems, but I am not sure whether there are other side effects. Furthermore, this seems like something that should ideally be handled in the ULFM layer itself.
Is there a solution to this that I can apply at the openmpi/ULFM level or at the application level? Thanks for your attention.
Sarthak