STF Crash with 100G mlx5

John H

Feb 9, 2021, 8:37:00 AM
to TRex Traffic Generator

Hi,

Have you seen this crash before, or do you have any tips on how to debug or solve it?

We see it consistently in STF mode after running an internet mix for more than 20-30 minutes, sometimes up to an hour:

WATCHDOG: task 'master' has not responded for more than 2.99471 seconds - timeout is 2 seconds
*** traceback follows ***
1 0x56ad35 ./_t-rex-64() [0x56ad35]
2 0x7f960826c5d0 /lib64/libpthread.so.0(+0xf5d0) [0x7f960826c5d0]
3 0x7f96072d08d7 ioctl + 7
4 0x7f9606f92b77 mlx5_ifreq + 119
5 0x7f9606f9378c mlx5_link_update + 540
6 0x8387eb rte_eth_link_get_nowait + 123
7 0x4d7345 DpdkTRexPortAttr::update_link_status_nowait() + 23
8 0x4dc18f CGlobalTRex::check_for_ports_link_change() + 37
9 0x4e208b CGlobalTRex::handle_slow_path() + 23
10 0x4e2213 CGlobalTRex::run_in_master() + 347
11 0x4fb1e2 ./_t-rex-64() [0x4fb1e2]
12 0x80fe23 rte_eal_mp_remote_launch + 227
13 0x4e54c8 main_test(int, char**) + 1477
14 0x7f96071fe3d5 __libc_start_main + 245
15 0x4feed5 ./_t-rex-64() [0x4feed5]

uname -r
3.10.0-957.el7.x86_64

ofed_info
MLNX_OFED_LINUX-5.0-2.1.8.0 (OFED-5.0-2.1.8)

lspci
00:07.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
00:08.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

TREX Version : v2.87

hanoh haim

Feb 9, 2021, 9:11:20 AM
to John H, TRex Traffic Generator
Hi John, 
OFED 5.0-2 is not officially tested with v2.87. Could you try the latest master branch with the latest OFED 5.2?
see here

Thanks
Hanoh

John H

Feb 10, 2021, 8:57:24 AM
to TRex Traffic Generator
So does the latest master work with OFED 5.2 or is it not 100% tested yet?

Thanks!

hanoh haim

Feb 10, 2021, 9:12:58 AM
to John H, TRex Traffic Generator
The latest code in master is stable and works with OFED 5.2 on CX-5 EN cards.


John H

Feb 11, 2021, 12:05:08 PM
to TRex Traffic Generator
I'm getting this when running with latest from master:
EAL: so/x86_64/libmlx5-64.so: undefined symbol: rte_flow_expand_rss

Does this .so need to be rebuilt?

hanoh haim

Feb 11, 2021, 12:21:27 PM
to John H, TRex Traffic Generator
You need to build from scratch and install OFED 5.2 

John H

Feb 12, 2021, 9:53:23 AM
to TRex Traffic Generator
Hi Hanoh,

I was able to rebuild the .so and get things running. Traffic seems stable, but I'm seeing consistent packet drops with 2.88 that I don't see with 2.87. I'm running the same traffic, for the same duration, on the same setup.

ethtool -S ens7 | grep rx_discards_phy
     rx_discards_phy: 2393917

2.87
summary stats
 --------------
 Total-pkt-drop       : 0 pkts
 Total-tx-bytes       : 1254087753958 bytes
 Total-tx-sw-bytes    : 0 bytes
 Total-rx-bytes       : 1254087754342 byte
 
 Total-tx-pkt         : 1250019050 pkts
 Total-rx-pkt         : 1250019056 pkts
 Total-sw-tx-pkt      : 0 pkts
 Total-sw-err         : 0 pkts
 Total ARP sent       : 0 pkts
 Total ARP received   : 0 pkts


2.88
 summary stats
 --------------
 Total-pkt-drop       : 2405969 pkts
 Total-tx-bytes       : 1254087753958 bytes
 Total-tx-sw-bytes    : 0 bytes
 Total-rx-bytes       : 1251463825401 byte
 
 Total-tx-pkt         : 1250019050 pkts
 Total-rx-pkt         : 1247613081 pkts
 Total-sw-tx-pkt      : 0 pkts
 Total-sw-err         : 0 pkts
 Total ARP sent       : 0 pkts
 Total ARP received   : 0 pkts


The above are from an internet mix, but I can also reproduce it with HTTP traffic alone:
./t-rex-64 -f cap2/http_simple.yaml -m 100000 -c 8 -d 100
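
As a cross-check on the numbers above, here is a minimal Python sketch that parses the two summaries and compares tx/rx packet counts with the reported drops. The helper names (`parse_summary`, `missing_packets`, `loss_percent`) are made up for illustration and are not part of TRex; the input is copied from the output pasted above.

```python
import re

def parse_summary(text):
    """Map each 'Name : <int> <unit>' line of a pasted summary to an int."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(\d+)\b", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def missing_packets(stats):
    """Packets sent but not received back (tx minus rx)."""
    return stats["Total-tx-pkt"] - stats["Total-rx-pkt"]

def loss_percent(stats):
    """Missing packets as a percentage of packets sent."""
    return 100.0 * missing_packets(stats) / stats["Total-tx-pkt"]

# Lines copied from the two summaries in this post.
V287 = """
 Total-pkt-drop       : 0 pkts
 Total-tx-pkt         : 1250019050 pkts
 Total-rx-pkt         : 1250019056 pkts
"""
V288 = """
 Total-pkt-drop       : 2405969 pkts
 Total-tx-pkt         : 1250019050 pkts
 Total-rx-pkt         : 1247613081 pkts
"""
# Value from the `ethtool -S ens7` line quoted earlier in this post.
RX_DISCARDS_PHY = 2393917

if __name__ == "__main__":
    for label, text in (("2.87", V287), ("2.88", V288)):
        s = parse_summary(text)
        print(label, "tx-rx:", missing_packets(s),
              "reported drop:", s["Total-pkt-drop"],
              "loss %:", round(loss_percent(s), 4))
    drop_288 = parse_summary(V288)["Total-pkt-drop"]
    print("drops not counted in rx_discards_phy:", drop_288 - RX_DISCARDS_PHY)
```

On the 2.88 run, tx minus rx equals the reported Total-pkt-drop, and rx_discards_phy covers most of that figure. This assumes the ethtool counter was read after the same run, which the numbers suggest but do not prove.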

Could it be NIC firmware related?

Thanks

John H

Feb 12, 2021, 1:05:57 PM
to TRex Traffic Generator
Still seeing drops after a firmware update.

ofed_info
MLNX_OFED_LINUX-5.2-2.2.0.0 (OFED-5.2-2.2.0):

ibstat
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.29.2002
        Hardware version: 0
        Node GUID: 0x98039b03009894bf
        System image GUID: 0x98039b03009894be

./t-rex-64 --version
 Version : v2.88   
 DPDK version : DPDK 21.02.0-rc1  
..
Compiled with GCC     :   4.8.5 20150623 (Red Hat 4.8.5-44)
Compiled with glibc   :   2.17 (host: 2.17)

Besart Dollma

Feb 13, 2021, 2:56:55 AM
to TRex Traffic Generator
Hi John,
There is no v2.88 yet, so I guess you mean v2.86 vs v2.87.
This is a known issue: there is a bug in the MLX driver in the new DPDK. The Mellanox/Nvidia team is working on it. Take a look here: https://github.com/cisco-system-traffic-generator/trex-core/issues
Best,
Bes

hanoh haim

Feb 13, 2021, 11:49:01 AM
to Besart Dollma, TRex Traffic Generator

John H

Mar 1, 2021, 10:08:35 AM
to TRex Traffic Generator
I noticed that 2.88 is out. Were the packet drops fixed, or is it still an issue?

hanoh haim

Mar 1, 2021, 2:54:18 PM
to John H, TRex Traffic Generator
The issue is still open. There are actually a few issues with DPDK 21.01 and mlx5. Contact Mellanox for an ETA.
