Support of Mellanox ConnectX-5 on HPE server


Johannes Luther

Mar 25, 2022, 2:59:13 AM3/25/22
to TRex Traffic Generator
Hi TRex team,
I'm trying to set up TRex on a HPE ProLiant DL360 Gen10 server:
CPU: 2 x Intel Xeon
Memory: 192 GB Memory per CPU (6 x 32 GB Memory slots / CPU)
NIC: 2x HPE Eth 100Gb 1p 842QSFP28
=> MT27800 Family [ConnectX-5]
Distribution: CentOS Linux release 7.9.2009 (Core)


So I followed the Mellanox instructions described here: https://trex-tgn.cisco.com/trex/doc/trex_appendix_mellanox.html.

The OFED installation works without any issues and "ibv_devinfo" shows the desired information (attached: ibv_devinfo_ibdev2netdev.txt). "dpdk_setup_ports.py -t" shows the adapters (4: 12:00.0 / 7: d8:00.0 / output attached).

I modified the /etc/trex_cfg.yaml file accordingly (attachment).

When trying to run trex I get an error:
$ sudo ./t-rex-64-debug -f cap2/dns.yaml -c 1 -m 1 -d 10
The ports are bound/configured.
Starting  TRex v2.96 please wait  ...
net_mlx5: mlx5_os.c:2229: mlx5_os_pci_probe(): probe of PCI device 0000:d8:00.0 aborted after encountering an error: Cannot allocate memory
common_mlx5: mlx5_common_pci.c:256: drivers_probe(): Failed to load driver = net_mlx5.

EAL: Requested device 0000:d8:00.0 cannot be used
ERROR in DPDK map
Could not find requested interface d8:00.0


I guess the "Cannot allocate memory" output is the key here.
So I'm checking the DRAM channels:
sudo dmidecode -t memory | grep CHANNEL
=> This does not return any output... so I tried without the filter, and there is some output (attached, truncated: dmidecode-t_memory). I guess the main difference from the documentation is that the "Bank Locator" is "Not Specified" in my output. I have no idea whether this is a problem or not. "dpdk_setup_ports.py -m" does not show anything:

$ sudo ./dpdk_setup_ports.py -m
+------+
| NUMA |
+======+
+------+
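As a side note, the NUMA node of each port can also be read straight from sysfs; here is a quick sketch (the PCI addresses are the ones from my dpdk_setup_ports.py output, so adjust them for your system; the fallback just prints "unknown" on hosts without these devices):

```shell
# Print the NUMA node each ConnectX-5 port is attached to.
# The PCI addresses below are from my dpdk_setup_ports.py output.
for dev in 0000:12:00.0 0000:d8:00.0; do
  node=$(cat "/sys/bus/pci/devices/$dev/numa_node" 2>/dev/null || echo "unknown")
  echo "$dev -> NUMA node $node"
done
```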

Any help or hints are highly appreciated :)

Best regards
Johannes

trex_cfg.yaml
dmidecode-t_memory.txt
ibv_devinfo_ibdev2netdev.txt
dpdk_setup_ports.txt

hanoh haim

Mar 27, 2022, 10:02:20 AM3/27/22
to Johannes Luther, TRex Traffic Generator
Hi Johannes,
There are two issues with this configuration.

1. You have one NIC / one port per NUMA node -- this is why we recommend a NIC with two ports.
2. The second port's NUMA node is 3, and we support a maximum of 2.

For the second issue, try to move the NIC to a different slot so it is located in NUMA node 0 (instead of 3), or else NUMA node 1 if that is not possible.
If the NICs end up in nodes 0 and 1, you will need to add a "dummy" port for each dual port to keep the memory local (each dual port should be on the same NUMA node for best performance).
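A sketch of how the dummy ports could look in /etc/trex_cfg.yaml, assuming the two NICs end up in NUMA nodes 0 and 1 (the PCI addresses are taken from your output; adjust to your system):

```yaml
# Sketch only: pair each physical port with a "dummy" port so each
# dual stays local to one NUMA node. PCI addresses are from your output.
- port_limit: 4
  version: 2
  interfaces: ['12:00.0', 'dummy', 'd8:00.0', 'dummy']
```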

Thanks
Hanoh






--
Hanoh
Sent from my iPhone

Johannes Luther

Mar 28, 2022, 1:37:48 AM3/28/22
to TRex Traffic Generator
Hello Hanoh,
thank you for your response. I'll try to let the server guys tweak some BIOS settings.
I have two processors, so I expected only two NUMA nodes (at least from my simplified point of view, without deep hardware knowledge).
However, for some reason there are four NUMA nodes, and the logical cores are assigned to them:

$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 36 37 38 39 40 41 42 43 44
node 0 size: 96172 MB
node 0 free: 92833 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 45 46 47 48 49 50 51 52 53
node 1 size: 96765 MB
node 1 free: 93365 MB
node 2 cpus: 18 19 20 21 22 23 24 25 26 54 55 56 57 58 59 60 61 62
node 2 size: 96749 MB
node 2 free: 93901 MB
node 3 cpus: 27 28 29 30 31 32 33 34 35 63 64 65 66 67 68 69 70 71
node 3 size: 96764 MB
node 3 free: 93752 MB
node distances:
node   0   1   2   3
  0:  10  21  31  31
  1:  21  10  31  31
  2:  31  31  10  21
  3:  31  31  21  10

There's a setting in the BIOS called "Sub-NUMA clustering", which is enabled. I guess this could be the reason (or a setting called "NUMA group size optimization"). From the HPE documentation:

Sub-NUMA cluster on Intel Xeon Scalable processor family: when enabled, sub-NUMA clustering divides the processor's cores, cache, and memory into multiple NUMA domains. Enabling this feature can increase performance for workloads that are NUMA-aware and optimized. Note that when this option is enabled, up to 1 GB of system memory may become unavailable.
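A quick way to see the effect of this BIOS setting is lscpu (just a sketch; with sub-NUMA clustering enabled my box reports 4 NUMA nodes for 2 sockets, and after disabling it this should drop to 2):

```shell
# Compare socket count with NUMA node count; a mismatch (e.g. 2
# sockets but 4 nodes) hints at sub-NUMA clustering being enabled.
lscpu | grep -E '^(Socket\(s\)|NUMA node\(s\))'
```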

For anyone encountering this on a HPE server, here's a link to the "Red Hat Enterprise Linux NUMA support for HPE ProLiant servers" technical whitepaper: https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/whitepaper/pdfs/a00039147enw.pdf

So I'll try to fiddle around with this a little bit.

Regarding point 1 of your list: would this adapter be a better choice?
Mellanox MCX623106AS-CDAT Ethernet 100Gb 2-port QSFP56 Adapter for HPE

Many thanks and best regards,
Johannes

Johannes Luther

Mar 28, 2022, 2:32:51 AM3/28/22
to TRex Traffic Generator
Quick update: I disabled "sub-NUMA clustering" on the server, and now I only have NUMA nodes 0 and 1, and TRex works on those interfaces.
Now I'll try out what the speed limitations are - however, it's not about performance in our case (so 20 Gbit/s will be enough in any case).

However, the first back-to-back connected test run (example: "sudo ./t-rex-64 -f avl/sfr_delay_10_1g.yaml -c 4 -m 35 -d 100 -p") works as expected.
When I try it a second time, I get the error "Failed resolving dest MAC for default gateway:2.2.2.2 on port 0". After a reboot everything works again (for one test :) )
In dmesg I get the following log message at that time:
[  782.272396] mlx5_core 0000:12:00.0: mlx5_destroy_flow_table:2205:(pid 3709): Flow table 262157 wasn't destroyed, refcount > 1
[  782.300754] mlx5_core 0000:d8:00.0: mlx5_destroy_flow_table:2205:(pid 3709): Flow table 262157 wasn't destroyed, refcount > 1

So I know this has nothing to do with the initial question, but: any advice here (apart from proposing a reboot)? :)

hanoh haim

Mar 28, 2022, 4:32:58 AM3/28/22
to Johannes Luther, TRex Traffic Generator
Hi Johannes, 

The NUMA issue is interesting, thanks for pointing this out. 
The issue with "mlx5_core 0000:12:00.0: mlx5_destroy_flow_table:2205:(pid 3709)" seems like an OFED issue.
Could you make sure the OFED version is the recommended one, GA 5.3-1?
If it is GA 5.3-1, I would contact Mellanox.
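For checking the installed version, ofed_info should work (a small sketch; the guard just keeps it from failing on hosts without MLNX_OFED):

```shell
# Print the installed Mellanox OFED version, if any.
if command -v ofed_info >/dev/null 2>&1; then
  ofed_info -s    # e.g. MLNX_OFED_LINUX-5.3-1.0.0.1
else
  echo "MLNX_OFED not installed"
fi
```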

Thanks
Hanoh

Johannes Luther

Mar 28, 2022, 7:19:14 AM3/28/22
to TRex Traffic Generator
Hi Hanoh,
I followed the Mellanox documentation https://trex-tgn.cisco.com/trex/doc/trex_appendix_mellanox.html and, in combination with CentOS 7.9, I used MLNX_OFED_LINUX-5.2-1.0.4.0-rhel7.9-x86_64.tgz.

So does that mean 5.3-1 is OK as well? I'll try that out.
The current version is 5.5, but this has not been tested by you guys, right?

Johannes Luther

Mar 28, 2022, 7:29:29 AM3/28/22
to TRex Traffic Generator
Quick update: With  "MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz" it works...

hanoh haim

Mar 28, 2022, 7:34:22 AM3/28/22
to Johannes Luther, TRex Traffic Generator
Hi Johannes, 
Sorry for that; there was a typo, and I fixed it a few days after we released the last version.

See here, the typo is no longer there, and there is a matrix of supported versions:
https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_appendix_mellanox.asciidoc

It will be fixed in the next release.

thanks
Hanoh

Johannes Luther

Mar 28, 2022, 8:23:04 AM3/28/22
to TRex Traffic Generator
Hello Hanoh,
thanks again for your time and help! One last side question... does anybody know if these 100G adapters are 40G compatible? When cabling them through a 40GE Cisco Nexus switch, the links don't come up.

Johannes Luther

Apr 14, 2022, 7:59:47 AM4/14/22
to TRex Traffic Generator
Answered this myself. The Mellanox cards are 40GE capable. For direct-attach cabling, make sure that only active optical cables (AOC) are used, not passive copper DAC.
See also the NVIDIA doc: https://docs.nvidia.com/networking/m/view-rendered-page.action?abstractPageId=19804680
Example of a working cable: QSFP-H40G-AOC3M (active optical)
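To verify what a port actually negotiated, ethtool can help; a hedged sketch (IFACE is a placeholder, substitute the netdev name that ibdev2netdev reports):

```shell
# Show negotiated speed and link modes for the port.
IFACE=ens1   # placeholder; take the real name from ibdev2netdev
if command -v ethtool >/dev/null 2>&1 && ip link show "$IFACE" >/dev/null 2>&1; then
  ethtool "$IFACE" | grep -E 'Speed|link modes'
else
  echo "interface $IFACE not found; set IFACE to your port"
fi
```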