Hi BeeGFS users,
Our test setup:
- Two server nodes with 2 * ConnectX-5 cards, on 2 different IPs, Ubuntu 20.04
- One client node with 2 * ConnectX-6 cards, on 2 different IPs, Ubuntu 22.04
- All using MLNX OFED 5.4 LTS
- Each server runs 1 * meta + 2 * storage (2 * meta + 4 * storage in total)
I'm using beegfs-net to check the connections. When multirail is not enabled, all connections are RDMA.
However, when we turn on the multirail feature using the connRDMAInterfacesFile option, all connections fall back to TCP.
In both cases, we've made sure the configuration is correct:
- /proc/fs/beegfs/<clientID>/client_info shows the expected interfaces
- All machines are ibping-able
Additionally, after turning on the debug flags, we found that the RDMA connection failures come from the call to rdma_resolve_addr:
[14828.300006] beegfs: mount(33435): IBVSocket_connectByIP:189: rdma_resolve_addr failed, src client-ip1, dst node1-ip1
When establishing a connection from an explicit source IP, the attempt fails. When connecting with src=NULL (any), it works:
Establishing new RDMA connection from any to: beegfs-meta@node1-ip1
Connected: beegfs-meta@node1-ip1 (protocol: RDMA)
Preferred IP addr is xxxxx
Establishing new RDMA connection from client-ip1 to: beegfs-storage@node1-ip1
Connect failed: beegfs-storage@node1-ip1 (protocol: RDMA)
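
For anyone comparing a longer log, the pattern above (source "any" succeeds, explicit source fails) can be pulled out with a small script. This is only a rough sketch: the two regular expressions assume the message formats quoted above, and summarize is a hypothetical helper name, not part of BeeGFS.

```python
import re

# Assumption: log lines look like the ones quoted in this post.
ATTEMPT = re.compile(r"Establishing new RDMA connection from (\S+) to: (\S+)")
RESULT = re.compile(r"(Connected|Connect failed): (\S+) \(protocol: (\w+)\)")


def summarize(lines):
    """Pair each connection attempt with the result line that follows it.

    Returns a list of (source, target, succeeded) tuples.
    """
    pending = {}   # target -> source of the most recent attempt
    outcomes = []
    for line in lines:
        attempt = ATTEMPT.search(line)
        if attempt:
            source, target = attempt.groups()
            pending[target] = source
            continue
        result = RESULT.search(line)
        if result:
            status, target, _protocol = result.groups()
            source = pending.pop(target, None)
            if source is not None:
                outcomes.append((source, target, status == "Connected"))
    return outcomes


if __name__ == "__main__":
    sample = [
        "Establishing new RDMA connection from any to: beegfs-meta@node1-ip1",
        "Connected: beegfs-meta@node1-ip1 (protocol: RDMA)",
        "Preferred IP addr is xxxxx",
        "Establishing new RDMA connection from client-ip1 to: beegfs-storage@node1-ip1",
        "Connect failed: beegfs-storage@node1-ip1 (protocol: RDMA)",
    ]
    for source, target, ok in summarize(sample):
        print(f"{source} -> {target}: {'ok' if ok else 'FAILED'}")
```

Grouping the resulting tuples by source should make it easy to confirm whether every failure involves an explicit source address.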
We've tried several things, none of which worked:
- Reboot the machine
- Give it some traffic
- Disable TCP fallback using
- Upgrade firmware
- Upgrade MLNX OFED driver
Any help or suggestions would be appreciated. Thank you!