BeeGFS-Client not detecting Infiniband interface

Thomas Keller

Jun 24, 2022, 1:44:14 PM
to beegfs-user
Dear BeeGFS Community

I am setting up a BeeGFS test environment on two nodes of an HPC cluster. That has worked really well, with the exception of Infiniband. The BeeGFS client seems unable to detect my IB interfaces and therefore insists on running over Ethernet.

I can natively (no IPoIB) pong between the two Infiniband nodes. When forcing the client to use IB (connTCPFallbackEnabled = false), I see the following lines in my log file:

 Jun24 19:11:34 *mount(27468) [App_logInfos] >> Usable NICs: eno4(TCP)
 (2) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Init] >> Waiting for beegfs-mgmtd@master:8008...
 (3) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [NodeConn (acquire stream)] >> Connected: beegfs-...@127.0.0.1:8006 (protocol: TCP)
 (2) Jun24 19:11:34 *beegfs_DGramLis(27469) [Heartbeat incoming] >> New node: beegfs-mgmtd master [ID: 1];
 (3) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Init] >> Management node found. Downloading node groups...
 (3) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [NodeConn (acquire stream)] >> Connected: beegfs...@192.168.1.10:8008 (protocol: TCP)
 (2) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Sync] >> Nodes added (sync results): 2 (Type: beegfs-meta)
 (2) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Sync] >> Nodes added (sync results): 2 (Type: beegfs-storage)
 (3) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Init] >> Node registration...
 (2) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Registration] >> Node registration successful.
 (3) Jun24 19:11:34 *beegfs_XNodeSyn(27470) [Init] >> Init complete.
 (1) Jun24 19:11:34 *mount(27468) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta master [ID: 1]
 (2) Jun24 19:11:34 *mount(27468) [Messaging (RPC)] >> Unable to connect to: beegfs-meta master [ID: 1]
 (0) Jun24 19:11:34 *mount(27468) [Mount sanity check] >> Retrieval of root directory entry failed. Are all metadata servers running and registered at the management daemon? (Error: Communication error)
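
For readers who want to check this programmatically, the transports the client detected can be read from the `Usable NICs` line. The following is a minimal Python sketch, assuming the `name(TRANSPORT)` layout seen in the log line above; the helper names `usable_nics` and `has_rdma` are hypothetical and not part of BeeGFS.

```python
import re

# Matches entries such as "eno4(TCP)". The name(TRANSPORT) layout is an
# assumption taken from the log excerpt above, not from BeeGFS documentation.
NIC_PATTERN = re.compile(r"(\S+?)\((\w+)\)")


def usable_nics(log_line):
    """Return (interface, transport) pairs from a 'Usable NICs:' log line."""
    marker = "Usable NICs:"
    if marker not in log_line:
        return []
    # Only look at the text after the marker so process names such as
    # "*mount(27468)" are not mistaken for NIC entries.
    _, _, tail = log_line.partition(marker)
    return NIC_PATTERN.findall(tail)


def has_rdma(log_line):
    """True if at least one listed NIC reports the RDMA transport."""
    return any(transport == "RDMA" for _, transport in usable_nics(log_line))


if __name__ == "__main__":
    line = "Jun24 19:11:34 *mount(27468) [App_logInfos] >> Usable NICs: eno4(TCP)"
    print(usable_nics(line))
    print(has_rdma(line))
```

If `has_rdma` is false for the client log, the client never saw an RDMA-capable interface, so the later "Connect failed on all available routes" message is a consequence rather than the root cause.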


I double-checked that libbeegfs-ib is installed and rebuilt the BeeGFS client with buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 (not using a separate driver). But whatever I try, the beegfs-net command only shows TCP connections.
 
I am using a Mellanox Technologies MT27500 Family [ConnectX-3] adapter

Thank you for any hint of what I could do next or where the problem might be.

Cheers

Thomas

Thomas Keller

Jul 6, 2022, 4:58:10 AM
to beegfs-user
I made a bit of headway with my RDMA problem: I assigned IP addresses to the Infiniband interfaces and re-installed BeeGFS. Now BeeGFS recognizes the RDMA transport but complains in the logs:

(3) Jul06 10:44:45 *mount(15614) [NodeConn (acquire stream)] >> Connect failed: beegfs-...@192.168.100.10:8003 (protocol: RDMA)
(4) Jul06 10:44:45 *mount(15614) [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegfs-...@192.168.100.10:8003
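
Since assigning IP addresses to the Infiniband interfaces was what made the RDMA transport show up, it can help to verify that state directly. Below is a minimal Python sketch for Linux, assuming the usual sysfs layout under `/sys/class/net`; the function names are hypothetical. It relies on the kernel's link type 32 (`ARPHRD_INFINIBAND`) and the `SIOCGIFADDR` ioctl.

```python
import fcntl
import os
import socket
import struct

ARPHRD_INFINIBAND = 32  # link type reported in /sys/class/net/<if>/type
SIOCGIFADDR = 0x8915    # Linux ioctl: get the IPv4 address of an interface


def infiniband_netdevs(sysfs_root="/sys/class/net"):
    """Return the names of network interfaces whose link type is Infiniband."""
    found = []
    for name in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, name, "type")) as fh:
                if int(fh.read().strip()) == ARPHRD_INFINIBAND:
                    found.append(name)
        except (OSError, ValueError):
            continue  # not an interface directory, or unreadable
    return found


def ipv4_address(ifname):
    """Return the interface's IPv4 address as a string, or None if unset."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            request = struct.pack("256s", ifname.encode()[:15])
            reply = fcntl.ioctl(sock.fileno(), SIOCGIFADDR, request)
    except OSError:
        return None
    # Bytes 20..23 of the returned ifreq hold the IPv4 address.
    return socket.inet_ntoa(reply[20:24])


if __name__ == "__main__":
    for dev in infiniband_netdevs():
        print(dev, ipv4_address(dev) or "no IPv4 address assigned")
```

An Infiniband interface that prints "no IPv4 address assigned" matches the situation in the first post, where the client listed only `eno4(TCP)`.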

I increased the logLevel to 5, but BeeGFS stays as terse about the RDMA connection failures as it was with logLevel 2/3.

I really don't know how I could investigate this further. Any idea would be greatly appreciated.

Thanks

Thomas

Vinci Chow

Jul 6, 2022, 8:25:09 AM
to beegfs-user
Did you rebuild beegfs-client with OFED drivers? RDMA will not work otherwise, at least that's the case up to version 7.2. 

Thomas Keller

Jul 6, 2022, 10:35:43 AM
to beegfs-user
Thank you for your hint! I am using version 7.3.0 of BeeGFS. If I understand the Quick Start Guide correctly, the OFED drivers are no longer necessary.

Hoping to get something more specific from BeeGFS, I disabled TCP fallback in /etc/beegfs/beegfs-client.conf. The log file now tells me:

(3) Jul06 13:51:02 *mount(12661) [NodeConn (acquire stream)] >> Connect failed: beegf...@192.168.1.100:8005 (protocol: RDMA)
(1) Jul06 13:51:02 *mount(12661) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta master [ID: 1]
(2) Jul06 13:51:02 *mount(12661) [Messaging (RPC)] >> Unable to connect to: beegfs-meta master [ID: 1]

For some reason the BeeGFS client seems unable to connect to BeeGFS meta, although all BeeGFS components (Meta, Storage, Management, and Client) are on the same machine. Local firewall rules are disabled, just for good measure.

On the other hand, Meta claims it is listening on TCP and Infiniband interfaces:

(3) Jul06 13:50:59 Main [DGramLis] >> Listening for UDP datagrams: Port 8005
(3) Jul06 13:50:59 Main [ConnAccept] >> Listening for RDMA connections: Port 8005
(3) Jul06 13:50:59 Main [ConnAccept] >> Listening for TCP connections: Port 8005

Thanks again

Thomas

Vinci Chow

Jul 6, 2022, 11:33:20 AM
to beegfs-user
How many IB interfaces do you have on each machine? RDMA will not work if you do not set up multi-homed routing tables.

Thomas Keller

Jul 29, 2022, 1:15:00 PM
to beegfs-user
Sorry for the very late reply. We do have two IB interfaces per node.

As a test: I can "pong" both IB interfaces on the master node from a compute node. The reverse also works (ponging both interfaces on the compute node from the master node).

Our Infiniband topology is organized in a fat tree / binary tree configuration.

You are right, BeeGFS still complains about routing: "Connect failed on all available routes: beegfs-meta master [ID: 1]". But then why can I reach the interfaces with pong while BeeGFS cannot?

Running all BeeGFS components on the same host still throws the same error as above.

Is there anything else I can do to increase debugging output?

Cheers, and thank you for your suggestions.

Thomas

Thomas Keller

Aug 10, 2022, 8:45:47 AM
to beegfs-user
After lots of sweat, tears and swearing, BeeGFS and Infiniband are finally working together. In the end, I resolved the problem by flashing the latest Mellanox firmware version (stock firmware from https://network.nvidia.com/support/firmware/firmware-downloads/) onto our Infiniband adapters. In our case, the adapters are part of an Oracle Big Data Appliance and had some outdated Oracle-customized firmware on them.

For those that are facing the same issue, I followed https://blog.swineson.me/en/mellanox-connectx-3-firmware-flashing-and-configuration-for-both-ethernet-and-infiniband-in-2021/ to flash the firmware. The instructions are for Windows, but in my case they worked with Ubuntu 20.04 too.

Flashing from one firmware vendor to another seems to be discouraged (there is a warning before flashing the firmware, as you might break the HCA), but in my case, this was the only way to get BeeGFS working.
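
Before deciding to reflash, it may be worth recording which firmware the adapters currently run. The following is a minimal Python sketch, assuming a Linux host where the RDMA driver exposes `fw_ver`, `board_id` and `hca_type` under `/sys/class/infiniband/<device>/` (the mlx4 driver used by ConnectX-3 normally does; other drivers may omit some of these). The function name `hca_firmware` is hypothetical.

```python
import os


def hca_firmware(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device to its firmware-related sysfs attributes.

    Attributes a driver does not expose are reported as None.
    """
    result = {}
    if not os.path.isdir(sysfs_root):
        return result  # no RDMA devices registered, or not a Linux host
    for dev in sorted(os.listdir(sysfs_root)):
        attrs = {}
        for name in ("fw_ver", "board_id", "hca_type"):
            try:
                with open(os.path.join(sysfs_root, dev, name)) as fh:
                    attrs[name] = fh.read().strip()
            except OSError:
                attrs[name] = None
        result[dev] = attrs
    return result


if __name__ == "__main__":
    for dev, attrs in hca_firmware().items():
        print(dev, attrs)
```

Comparing `fw_ver` with the release listed on the vendor's download page shows whether the firmware is outdated, and `board_id` hints at whether the card carries stock or OEM-customized firmware.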

Thanks again for the support.

Thomas