Transport retry count exceeded error

Sarthak Joshi

Oct 25, 2022, 2:10:49 AM
to User Level Fault Mitigation
Hello,
I am trying to test ULFM in openmpi-5.0.0rc7 using a simple program and am getting the error attached below. I'm launching the program with 3 processes, all on different nodes (using PBS). In the program, P0 sends data to P1, P1 to P2, and P2 to P0 in an infinite loop, with a 1-second sleep after each receive. I kill P1 by logging into the node where it's running and running kill -9 <pid>.
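
The structure of the test is roughly the following (a simplified sketch, not the exact attached code; the real ulfmtest.c also sets MPI_ERRORS_RETURN on the communicator and performs the recovery described below):

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size, counter = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        while (1) {
            if (rank == 0)  /* rank 0 starts the ring */
                MPI_Send(&counter, 1, MPI_INT, next, 0, comm);
            MPI_Recv(&counter, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
            counter++;      /* data counter, incremented after each receive */
            printf("rank %d: counter = %d\n", rank, counter);
            sleep(1);       /* 1-second sleep after each receive */
            if (rank != 0)
                MPI_Send(&counter, 1, MPI_INT, next, 0, comm);
        }
        return 0;           /* not reached; the run is stopped externally */
    }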

I am observing that the failure is detected, the communicator is successfully revoked and shrunk, and communication continues while excluding P1, as expected. However, after a few iterations I get this error, which causes P0 to abort. Since the error was coming from the openucx library, I tried setting the --mca btl ^uct parameter, but then the program either runs very slowly or does not advance at all after a few iterations: even after running for a good 10 minutes (I currently stop the execution by killing the job), the data counter (incremented by 1 after each receive) does not exceed 6 over multiple tests.
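
For context, the recovery follows the standard ULFM pattern, roughly like this (a sketch using the documented MPIX calls; my actual code differs in details):

    #include <mpi.h>
    #include <mpi-ext.h>  /* ULFM: MPIX_Comm_revoke, MPIX_Comm_shrink, ... */

    /* Assumes comm was created with MPI_Comm_dup(MPI_COMM_WORLD, &comm)
     * and MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN) was called.  */
    int rc, eclass;
    rc = MPI_Recv(&counter, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPI_Comm newcomm;
            MPIX_Comm_revoke(comm);            /* ensure all ranks see the failure */
            MPIX_Comm_shrink(comm, &newcomm);  /* drop the dead rank(s) */
            MPI_Comm_free(&comm);
            comm = newcomm;
            /* recompute rank, size, next and prev on the shrunken comm */
        }
    }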

Looking into it further, I found that this is an outcome of the openucx library's internal polling operation, which can (depending on the transport used) also end up calling polling functions from other libraries, such as ibv_poll_cq from libibverbs for InfiniBand. I was able to bypass this issue by building openmpi against a custom openucx build, modified so that the uct_ib_mlx5_check_completion function ignores the MLX5_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR error code. With that change, I stopped receiving this error without any other apparent problems, but I'm not sure whether there will be side effects. Furthermore, this seems like something that should ideally be handled in the ULFM layer itself.
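
Conceptually, my change amounts to an early-out of this shape (a rough sketch, not the actual UCX source: the function name and the syndrome constant are real, but the surrounding structure and variable names are approximate):

    /* Hypothetical sketch of the change inside uct_ib_mlx5_check_completion(),
     * before the normal error path. The peer's death is already handled by
     * ULFM, so the transport-retry-exceeded completion is swallowed here.    */
    struct mlx5_err_cqe *ecqe = (struct mlx5_err_cqe *)cqe;
    if (ecqe->syndrome == MLX5_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR) {
        return;  /* ignore this completion error instead of reporting it */
    }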

I wanted to ask if there is a solution to this that I can apply at the openmpi/ulfm level or at the application level. Thanks for your attention.
Sarthak 
erroroutput.txt
uctdisabledoutput.txt
ulfmtest.c

Sarthak Joshi

Oct 28, 2022, 2:34:39 AM
to User Level Fault Mitigation
As a further update to this, I am also encountering the issue when testing with a slightly modified benchmark from the Fault Tolerance Research Hub website. Specifically, I used the revshrinkkill.c program from the SC'21 tutorial. I commented out the section of the program responsible for randomly killing the executing process and instead killed a process manually by logging into the node (again running 3 processes on 3 different nodes). After a few seconds, one of the other processes also aborted with the same error. It seems this was previously being tested by having processes kill themselves internally, so they were only dying in essentially "safe" states.
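
For reference, the self-kill I commented out looks roughly like this (paraphrased, not the exact tutorial source):

    /* Paraphrased from the benchmark: the victim rank kills itself between
     * MPI calls, i.e. at a point of its own choosing, a "safe" state.     */
    /*
    if (rank == victim) {
        raise(SIGKILL);
    }
    */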

Aurelien Bouteiller

Nov 1, 2022, 4:21:46 PM
to User Level Fault Mitigation
Sarthak, what version of UCX are you using?

Also, w.r.t. the TCP error you saw in the prior email, you may want to try again, but this time limiting the TCP BTL to using only select interfaces (with Open MPI, this error is often associated with having non-routable virtual machine interfaces on the nodes), e.g., adding --mca btl_tcp_if_include ib0 (or whatever routable interface is correct on your system).
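
For example, something along these lines (assuming a 3-rank run of your test binary; adjust the names for your system):

    mpirun --with-ft ulfm --mca btl_tcp_if_include ib0 -np 3 ./ulfmtest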

Aurelien

Sarthak Joshi

Nov 2, 2022, 2:41:26 AM
to User Level Fault Mitigation
I have tested this using openucx versions 1.9.0, 1.11.0, and 1.13.1 (the latest release) and got the error on all of them. I tried adding the flag you mentioned. I used ucx_info -d to find the network interfaces (got ib0 and eno1 for the tcp transport). Using eno1 still gives the error, and using ib0 results in the MPI processes not starting/ending immediately (the prterun and prted processes still spawn on the corresponding nodes).
Sarthak
