Intel MPI benchmark fails when enabling BBR in CentOS 8 Stream


clar...@gmail.com

Nov 18, 2020, 9:26:44 PM
to BBR Development
Hi BBR team,

I built the latest upstream kernel, 5.9.6, and enabled the BBR congestion control algorithm on CentOS 8 Stream. Most of the time it works well, but when I try to run the Intel MPI Sendrecv benchmark, it fails.
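
For reference, BBR was enabled system-wide roughly like this (a sketch of the setup; the fq qdisc pairing is the conventional recommendation for BBR):

  sudo modprobe tcp_bbr                                # make sure the BBR module is loaded
  sudo sysctl -w net.core.default_qdisc=fq             # fq pacing, conventionally paired with BBR
  sudo sysctl -w net.ipv4.tcp_congestion_control=bbr   # switch the default congestion control
  sysctl net.ipv4.tcp_congestion_control               # verify: should print "bbr"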

Here is the log.

#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Nov 17 16:18:46 2020
# Machine               : x86_64
# System                : Linux
# Release               : 5.9.6-16.el8.x86_64
# Version               : #1 SMP Tue Nov 10 02:01:06 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# IMB-MPI1 Sendrecv -iter 1000000 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
# ( 54 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0      1000000         0.70         0.70         0.70         0.00
            1      1000000         0.70         0.70         0.70         2.87
            2      1000000         0.71         0.71         0.71         5.64
            4      1000000         0.73         0.73         0.73        10.92
            8      1000000         0.73         0.73         0.73        21.80
[clr-pnp-server-11:43696] 111 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[clr-pnp-server-11:43696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[clr-pnp-server-11:43696] 55 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
           16      1000000         0.74         0.74         0.74        43.53
           32      1000000         1.01         1.01         1.01        63.56
           64       655360         0.96         0.96         0.96       133.65
          128       327680         1.34         1.34         1.34       191.25
          256       163840         1.63         1.63         1.63       314.23
          512        81920         1.65         1.65         1.65       619.67
         1024        40960         1.84         1.84         1.84      1115.00
         2048        20480         2.58         2.58         2.58      1588.86
         4096        10240         3.51         3.51         3.51      2335.63
         8192         5120         5.26         5.26         5.26      3113.20
        16384         2560         9.42         9.42         9.42      3477.34
        32768         1280        19.39        19.39        19.39      3380.30
        65536          640        38.35        38.37        38.36      3415.78
       131072          320        76.83        76.93        76.88      3407.75
       262144          160       140.89       140.92       140.90      3720.51
       524288           80       269.36       269.60       269.48      3889.44
      1048576           40       508.51       509.85       509.18      4113.27
      2097152           20      1071.69      1087.46      1079.58      3856.96
[1605601137.391752] [clr-pnp-server-11:43705:0]           sock.c:344  UCX  ERROR recv(fd=144) failed: Bad address
[1605601137.392362] [clr-pnp-server-11:43704:0]           sock.c:344  UCX  ERROR recv(fd=42) failed: Connection reset by peer
[1605601137.392396] [clr-pnp-server-11:43704:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
[1605601137.392396] [clr-pnp-server-11:43705:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
[clr-pnp-server-11:43704] *** An error occurred in MPI_Sendrecv
[clr-pnp-server-11:43704] *** reported by process [1759838209,0]
[clr-pnp-server-11:43704] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[clr-pnp-server-11:43704] *** MPI_ERR_OTHER: known error not in list
[clr-pnp-server-11:43704] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[clr-pnp-server-11:43704] ***    and potentially your MPI job)
[clr-pnp-server-11:43696] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Based on the log, it looks like the TCP connection is unexpectedly reset by the peer, which leads to a segmentation fault in the Open MPI and UCX libraries.

Here is the segmentation fault backtrace.

systemd-coredump[14186]: Process 13166 (IMB-MPI1) of user 1000 dumped core.
                                                           
Stack trace of thread 13166:
#0  0x00007f76f1bfcf10 n/a (n/a)
#1  0x00007f76f26ff24b uct_tcp_iface_progress (libuct.so.0)
#2  0x00007f76f293442a ucp_worker_progress (libucp.so.0)
#3  0x00007f7704f27fa4 opal_progress (libopen-pal.so.40)
#4  0x00007f7705a39d6d ompi_request_default_wait (libmpi.so.40)
#5  0x00007f7705a9c255 ompi_coll_base_sendrecv_actual (libmpi.so.40)
#6  0x00007f7705a9ca6b ompi_coll_base_allreduce_intra_recursivedoubling (libmpi.so.40)
#7  0x00007f7705a4d989 PMPI_Allreduce (libmpi.so.40)
#8  0x0000562bbda3f0de n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#9  0x0000562bbda452b0 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#10 0x0000562bbda0f944 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#11 0x00007f770548d85d __libc_start_main (libc.so.6)
#12 0x0000562bbda0e18e n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
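
The n/a frames can presumably be symbolized by reopening the dump with systemd's coredumpctl, assuming the matching debuginfo packages for UCX and Open MPI are installed:

  coredumpctl list IMB-MPI1   # locate the dump recorded above
  coredumpctl gdb 13166       # reopen it in gdb for a fully symbolized backtrace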

We don't see this issue with other TCP congestion control algorithms such as CUBIC or Reno. Moreover, if we remove the UCX dependency from Open MPI, the issue also disappears.
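
For reproduction purposes, one way to take UCX out of the path without rebuilding Open MPI is to force the ob1 PML and the plain TCP BTL, along these lines (a sketch; the component names are the usual Open MPI 4.x ones and may vary by build):

  # run the same benchmark over Open MPI's own TCP transport instead of UCX
  mpirun --mca pml ob1 --mca btl tcp,self IMB-MPI1 Sendrecv -iter 1000000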

It looks like there is a compatibility issue between the UCX library and BBR. Do you have any insight into this issue? Thanks.

Neal Cardwell

Nov 19, 2020, 2:36:01 PM
to clar...@gmail.com, BBR Development
Thanks for the report.

BBR congestion control mainly just changes the timing of packet transmissions, so it should not cause a compatibility problem with a user-space library or application, unless that library or application makes assumptions about the timing of events (e.g., timer timeout values), and those assumptions happen to be violated when BBR causes transfers to proceed faster or slower than the assumed timing.

The output "sendv(fd=-1) failed: Bad file descriptor" is suspicious. The fd of -1 is generally used in user-space code to indicate a value for a variable that is expected to hold a file descriptor but does not currently. That the program is using such an fd variable looks like a bug (race condition?) in the user-space code that causes it to corrupt its own notion of what file descriptors are open or closed, or what file descriptors should be used for socket system calls. You might try a web search to see if other people have seen the same symptoms, or if bug fixes or more recent releases are available.

best,
neal



clar...@gmail.com

Nov 19, 2020, 9:22:19 PM
to BBR Development
Thanks Neal for your feedback.

I suspect the error is caused by the unexpected "Connection reset by peer": once the TCP connection has been reset, the socket fd is released, which then leads to the subsequent segmentation fault. What I'm unsure about is what triggers the connection reset in the first place. Since I don't see connection resets with the other congestion control algorithms, that's why I say there may be a compatibility issue.
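
To find out which side sends the RST and what happens just before it, I plan to capture the resets on the wire and double-check the per-connection congestion control, roughly like this (the interface and filter may need adapting):

  sudo tcpdump -i any -w resets.pcap 'tcp[tcpflags] & (tcp-rst) != 0'   # capture TCP RST segments
  ss -ti   # per-socket info; should show "bbr" plus retransmit counters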

I don't have much knowledge of the UCX library, so I will also report this issue to the UCX community.