I built the latest upstream kernel (5.9.6) with the BBR congestion control algorithm enabled on CentOS Stream 8. Most of the time it works well, but when I run the Intel MPI Sendrecv benchmark, the run fails partway through.
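For reference, here is a minimal sketch (my own verification helper, not part of the benchmark, assuming BBR was selected system-wide via the net.ipv4.tcp_congestion_control sysctl) showing how the congestion control algorithm in use on a TCP socket can be queried or pinned through the TCP_CONGESTION socket option:

/* Minimal sketch: check which congestion control a TCP socket uses.
 * Assumes BBR was enabled system-wide (net.ipv4.tcp_congestion_control=bbr)
 * and the tcp_bbr module is loaded; not part of IMB-MPI1 itself. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    char cc[16] = {0};
    socklen_t len = sizeof(cc);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* A socket can also be pinned to a specific algorithm explicitly. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", strlen("bbr")) < 0)
        perror("setsockopt(TCP_CONGESTION)"); /* fails if tcp_bbr is unavailable */

    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, &len) == 0)
        printf("congestion control in use: %s\n", cc);

    return 0;
}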
Here is the log of the failing run:
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Tue Nov 17 16:18:46 2020
# Machine : x86_64
# System : Linux
# Release : 5.9.6-16.el8.x86_64
# Version : #1 SMP Tue Nov 10 02:01:06 UTC 2020
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 Sendrecv -iter 1000000
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 54 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000000 0.70 0.70 0.70 0.00
1 1000000 0.70 0.70 0.70 2.87
2 1000000 0.71 0.71 0.71 5.64
4 1000000 0.73 0.73 0.73 10.92
8 1000000 0.73 0.73 0.73 21.80
[clr-pnp-server-11:43696] 111 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[clr-pnp-server-11:43696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[clr-pnp-server-11:43696] 55 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
16 1000000 0.74 0.74 0.74 43.53
32 1000000 1.01 1.01 1.01 63.56
64 655360 0.96 0.96 0.96 133.65
128 327680 1.34 1.34 1.34 191.25
256 163840 1.63 1.63 1.63 314.23
512 81920 1.65 1.65 1.65 619.67
1024 40960 1.84 1.84 1.84 1115.00
2048 20480 2.58 2.58 2.58 1588.86
4096 10240 3.51 3.51 3.51 2335.63
8192 5120 5.26 5.26 5.26 3113.20
16384 2560 9.42 9.42 9.42 3477.34
32768 1280 19.39 19.39 19.39 3380.30
65536 640 38.35 38.37 38.36 3415.78
131072 320 76.83 76.93 76.88 3407.75
262144 160 140.89 140.92 140.90 3720.51
524288 80 269.36 269.60 269.48 3889.44
1048576 40 508.51 509.85 509.18 4113.27
2097152 20 1071.69 1087.46 1079.58 3856.96
[1605601137.391752] [clr-pnp-server-11:43705:0] sock.c:344 UCX ERROR recv(fd=144) failed: Bad address
[1605601137.392362] [clr-pnp-server-11:43704:0] sock.c:344 UCX ERROR recv(fd=42) failed: Connection reset by peer
[1605601137.392396] [clr-pnp-server-11:43704:0] sock.c:344 UCX ERROR sendv(fd=-1) failed: Bad file descriptor
[1605601137.392396] [clr-pnp-server-11:43705:0] sock.c:344 UCX ERROR sendv(fd=-1) failed: Bad file descriptor
[clr-pnp-server-11:43704] *** An error occurred in MPI_Sendrecv
[clr-pnp-server-11:43704] *** reported by process [1759838209,0]
[clr-pnp-server-11:43704] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[clr-pnp-server-11:43704] *** MPI_ERR_OTHER: known error not in list
[clr-pnp-server-11:43704] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[clr-pnp-server-11:43704] *** and potentially your MPI job)
[clr-pnp-server-11:43696] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
Based on the log, it looks like the TCP connection is unexpectedly reset by the peer, which then leads to a segmentation fault in the OpenMPI and UCX libraries.
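For context on how I read those UCX sock.c messages: "Connection reset by peer" is ECONNRESET, "Bad address" is EFAULT (recv() was handed an invalid buffer pointer), and "Bad file descriptor" is EBADF (the socket has already been torn down). A small illustration of how those cases are usually distinguished at a recv() call site (my own sketch, not UCX source code):

/* Illustration only (not UCX source): the errno values behind the
 * UCX sock.c messages above, distinguished at a recv() call site. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns bytes received, 0 on orderly close, -1 on error. */
ssize_t recv_once(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);
    if (n >= 0)
        return n;

    if (errno == ECONNRESET)   /* "Connection reset by peer": remote end aborted the link */
        fprintf(stderr, "peer reset connection on fd=%d\n", fd);
    else if (errno == EFAULT)  /* "Bad address": buf is not a valid pointer (caller-side bug) */
        fprintf(stderr, "invalid receive buffer for fd=%d\n", fd);
    else if (errno == EBADF)   /* "Bad file descriptor": fd already closed/invalidated */
        fprintf(stderr, "stale descriptor fd=%d\n", fd);
    else
        fprintf(stderr, "recv(fd=%d) failed: %s\n", fd, strerror(errno));
    return -1;
}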
Here is the segmentation fault backtrace:
systemd-coredump[14186]: Process 13166 (IMB-MPI1) of user 1000 dumped core.
Stack trace of thread 13166:
#0 0x00007f76f1bfcf10 n/a (n/a)
#1 0x00007f76f26ff24b uct_tcp_iface_progress (libuct.so.0)
#2 0x00007f76f293442a ucp_worker_progress (libucp.so.0)
#3 0x00007f7704f27fa4 opal_progress (libopen-pal.so.40)
#4 0x00007f7705a39d6d ompi_request_default_wait (libmpi.so.40)
#5 0x00007f7705a9c255 ompi_coll_base_sendrecv_actual (libmpi.so.40)
#6 0x00007f7705a9ca6b ompi_coll_base_allreduce_intra_recursivedoubling (libmpi.so.40)
#7 0x00007f7705a4d989 PMPI_Allreduce (libmpi.so.40)
#8 0x0000562bbda3f0de n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#9 0x0000562bbda452b0 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#10 0x0000562bbda0f944 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#11 0x00007f770548d85d __libc_start_main (libc.so.6)
#12 0x0000562bbda0e18e n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
We don't see this issue with other TCP congestion control algorithms such as cubic or reno. Moreover, if we remove the UCX dependency from OpenMPI, the issue also disappears.
It looks like there is some compatibility issue between the UCX library and BBR. Do you have any insight into this issue? Thanks.