Intel MPI benchmark fails when enabling BBR in CentOS 8 Stream


clar...@gmail.com

Nov 18, 2020, 9:26:44 PM
to BBR Development
Hi BBR team,

I built the latest upstream kernel, 5.9.6, and enabled the BBR congestion control algorithm on CentOS 8 Stream. Most of the time it works well, but when I try to run the Intel MPI Sendrecv benchmark, it fails.
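
For reference, BBR was enabled system-wide roughly like this (a sketch of the setup; the fq qdisc pairing is the conventional recommendation for BBR):

  sudo modprobe tcp_bbr                                # make sure the BBR module is loaded
  sudo sysctl -w net.core.default_qdisc=fq             # fq pacing, conventionally paired with BBR
  sudo sysctl -w net.ipv4.tcp_congestion_control=bbr   # switch the default congestion control
  sysctl net.ipv4.tcp_congestion_control               # verify: should print "bbr"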

Here is the log.

#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Nov 17 16:18:46 2020
# Machine               : x86_64
# System                : Linux
# Release               : 5.9.6-16.el8.x86_64
# Version               : #1 SMP Tue Nov 10 02:01:06 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# IMB-MPI1 Sendrecv -iter 1000000 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
# ( 54 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0      1000000         0.70         0.70         0.70         0.00
            1      1000000         0.70         0.70         0.70         2.87
            2      1000000         0.71         0.71         0.71         5.64
            4      1000000         0.73         0.73         0.73        10.92
            8      1000000         0.73         0.73         0.73        21.80
[clr-pnp-server-11:43696] 111 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[clr-pnp-server-11:43696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[clr-pnp-server-11:43696] 55 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
           16      1000000         0.74         0.74         0.74        43.53
           32      1000000         1.01         1.01         1.01        63.56
           64       655360         0.96         0.96         0.96       133.65
          128       327680         1.34         1.34         1.34       191.25
          256       163840         1.63         1.63         1.63       314.23
          512        81920         1.65         1.65         1.65       619.67
         1024        40960         1.84         1.84         1.84      1115.00
         2048        20480         2.58         2.58         2.58      1588.86
         4096        10240         3.51         3.51         3.51      2335.63
         8192         5120         5.26         5.26         5.26      3113.20
        16384         2560         9.42         9.42         9.42      3477.34
        32768         1280        19.39        19.39        19.39      3380.30
        65536          640        38.35        38.37        38.36      3415.78
       131072          320        76.83        76.93        76.88      3407.75
       262144          160       140.89       140.92       140.90      3720.51
       524288           80       269.36       269.60       269.48      3889.44
      1048576           40       508.51       509.85       509.18      4113.27
      2097152           20      1071.69      1087.46      1079.58      3856.96
[1605601137.391752] [clr-pnp-server-11:43705:0]           sock.c:344  UCX  ERROR recv(fd=144) failed: Bad address
[1605601137.392362] [clr-pnp-server-11:43704:0]           sock.c:344  UCX  ERROR recv(fd=42) failed: Connection reset by peer
[1605601137.392396] [clr-pnp-server-11:43704:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
[1605601137.392396] [clr-pnp-server-11:43705:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
[clr-pnp-server-11:43704] *** An error occurred in MPI_Sendrecv
[clr-pnp-server-11:43704] *** reported by process [1759838209,0]
[clr-pnp-server-11:43704] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[clr-pnp-server-11:43704] *** MPI_ERR_OTHER: known error not in list
[clr-pnp-server-11:43704] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[clr-pnp-server-11:43704] ***    and potentially your MPI job)
[clr-pnp-server-11:43696] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Based on the log, it looks like the TCP connection is unexpectedly reset by the peer, which leads to a segmentation fault in the Open MPI and UCX libraries.

Here is the segmentation fault backtrace.

systemd-coredump[14186]: Process 13166 (IMB-MPI1) of user 1000 dumped core.
                                                           
Stack trace of thread 13166:
#0  0x00007f76f1bfcf10 n/a (n/a)
#1  0x00007f76f26ff24b uct_tcp_iface_progress (libuct.so.0)
#2  0x00007f76f293442a ucp_worker_progress (libucp.so.0)
#3  0x00007f7704f27fa4 opal_progress (libopen-pal.so.40)
#4  0x00007f7705a39d6d ompi_request_default_wait (libmpi.so.40)
#5  0x00007f7705a9c255 ompi_coll_base_sendrecv_actual (libmpi.so.40)
#6  0x00007f7705a9ca6b ompi_coll_base_allreduce_intra_recursivedoubling (libmpi.so.40)
#7  0x00007f7705a4d989 PMPI_Allreduce (libmpi.so.40)
#8  0x0000562bbda3f0de n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#9  0x0000562bbda452b0 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#10 0x0000562bbda0f944 n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
#11 0x00007f770548d85d __libc_start_main (libc.so.6)
#12 0x0000562bbda0e18e n/a (/home/pnp/.phoronix-test-suite/installed-tests/pts/intel-mpi-1.0.1/mpi-benchmarks-IMB-v2019.3/IMB-MPI1)
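
The n/a frames can presumably be symbolized by reopening the dump with systemd's coredumpctl, assuming the matching debuginfo packages for UCX and Open MPI are installed:

  coredumpctl list IMB-MPI1   # locate the dump recorded above
  coredumpctl gdb 13166       # reopen it in gdb for a fully symbolized backtrace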

We don't see this issue with other TCP congestion control algorithms such as CUBIC or Reno. Moreover, if we remove the UCX dependency from Open MPI, the issue also disappears.
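
For reproduction purposes, one way to take UCX out of the path without rebuilding Open MPI is to force the ob1 PML and the plain TCP BTL, along these lines (a sketch; the component names are the usual Open MPI 4.x ones and may vary by build):

  # run the same benchmark over Open MPI's own TCP transport instead of UCX
  mpirun --mca pml ob1 --mca btl tcp,self IMB-MPI1 Sendrecv -iter 1000000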

It looks like there is a compatibility issue between the UCX library and BBR. Do you have any insight into this issue? Thanks.

Neal Cardwell

Nov 19, 2020, 2:36:01 PM
to clar...@gmail.com, BBR Development
Thanks for the report.

BBR congestion control mainly just changes the timing of packet transmissions, so it should not cause a compatibility problem with a user-space library or application, unless that library or application makes assumptions about the timing of events (e.g., timer timeout values), and those assumptions happen to be violated when BBR causes transfers to proceed faster or slower than the assumed timing.

The output "sendv(fd=-1) failed: Bad file descriptor" is suspicious. The fd of -1 is generally used in user-space code to indicate a value for a variable that is expected to hold a file descriptor but does not currently. That the program is using such an fd variable looks like a bug (race condition?) in the user-space code that causes it to corrupt its own notion of what file descriptors are open or closed, or what file descriptors should be used for socket system calls. You might try a web search to see if other people have seen the same symptoms, or if bug fixes or more recent releases are available.

best,
neal



clar...@gmail.com

Nov 19, 2020, 9:22:19 PM
to BBR Development
Thanks Neal for your feedback.

I suspect the error is caused by the unexpected "Connection reset by peer": once the TCP connection has been reset, the socket fd is released, which then leads to the subsequent segmentation fault. What I'm unsure about is what triggers the connection reset in the first place. Since I don't see connection resets with the other congestion control algorithms, that's why I say there may be a compatibility issue.
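
To find out which side sends the RST and what happens just before it, I plan to capture the resets on the wire and double-check the per-connection congestion control, roughly like this (the interface and filter may need adapting):

  sudo tcpdump -i any -w resets.pcap 'tcp[tcpflags] & (tcp-rst) != 0'   # capture TCP RST segments
  ss -ti   # per-socket info; should show "bbr" plus retransmit counters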

I don't have much knowledge of the UCX library, so I will also report this issue to the UCX community.