The host's two IB cards are passed through to the storage and meta VMs;
the mgmt server has no IB card.
I can successfully run ibping from the storage server to the meta server and to all client nodes.
The two clients are recognized correctly and their connections are initialized as RDMA connections.
Log of the mgmt server:
Worker2 [Node registration] >> New node: beegfs-client 4B396-661566B6-gpu0.cluster [ID: 17]; RDMA; Source: 192.168.1.20:36986
My meta server doesn't connect through RDMA, and neither does the storage node.
[root@bee-meta ~]# tail -f /var/log/beegfs-meta.log
(2) Apr09 17:38:46 Main [Register node] >> Node registration successful.
(3) Apr09 17:38:46 Main [NodeConn (acquire stream)] >> Connected: beegfs...@192.168.1.11:8008 (protocol: TCP)
(2) Apr09 17:38:46 Main [printSyncResults] >> Nodes added (sync results): 1 (Type: beegfs-storage)
(3) Apr09 17:38:46 Main [App] >> Registration and management info download complete.
(3) Apr09 17:38:46 Main [DGramLis] >> Listening for UDP datagrams: any Port 8005
(3) Apr09 17:38:46 Main [ConnAccept] >> Listening for TCP connections: Port 8005
(3) Apr09 17:38:46 Main [App] >> Restored 1 sessions and 0 mirrored sessions
(1) Apr09 17:38:46 Main [App] >> Version: 7.4.3
(2) Apr09 17:38:46 Main [App] >> LocalNode: beegfs-meta bee-meta [ID: 2]
(2) Apr09 17:38:46 Main [App] >> Usable NICs: ens19(TCP) ens18(TCP)
Storage logs:
tail -f /var/log/beegfs-storage.log
(2) Apr09 17:38:50 Main [App] >> Usable NICs: ens19(TCP) ens18(TCP)
(2) Apr09 17:38:50 Main [App] >> Storage targets: 1
(3) Apr09 17:38:50 Main [RegDGramLis] >> Listening for UDP datagrams: any Port 8003
(2) Apr09 17:38:50 Main [Register node] >> Node registration successful.
(2) Apr09 17:38:50 Main [InternodeSyncer.cpp:607] >> Storage targets registration successful.
(2) Apr09 17:38:50 Main [Sync results] >> Nodes added: 1 (Type: beegfs-meta)
(3) Apr09 17:38:50 Main [App] >> Registration and management info download complete.
(3) Apr09 17:38:50 Main [DGramLis] >> Listening for UDP datagrams: any Port 8003
(3) Apr09 17:38:50 Main [ConnAccept] >> Listening for TCP connections: Port 8003
(3) Apr09 17:38:50 Main [App] >> 1 sessions restored.
So I can't explain why they behave so differently when the setup is the same everywhere, and why the storage and meta servers won't recognize the InfiniBand cards.
The only difference I noticed is that lsmod shows slightly different results on meta and storage than on the clients:
[root@bee-meta ~]# lsmod | grep ib
ib_ipoib 147456 0
ib_cm 118784 2 rdma_cm,ib_ipoib
ib_umad 28672 0
mlx5_ib 409600 0
ib_uverbs 159744 8 rdma_ucm,mlx5_ib
ib_core 401408 8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c 16384 4 nf_conntrack,nf_nat,nf_tables,xfs
mlx5_core 1810432 1 mlx5_ib
libata 270336 2 ata_piix,ata_generic
Client:
[root@gpu0 ~]# lsmod | grep ib
ib_ipoib 155648 0
ib_cm 114688 2 rdma_cm,ib_ipoib
ib_umad 28672 0
libnvdimm 200704 1 nfit
libcrc32c 16384 4 nf_conntrack,nf_nat,nf_tables,xfs
mlx5_ib 466944 0
ib_uverbs 143360 2 rdma_ucm,mlx5_ib
ib_core 442368 9 beegfs,rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core 2146304 1 mlx5_ib
libahci 40960 1 ahci
libata 266240 2 libahci,ahci
mlx_compat 16384 12 beegfs,rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
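To make the comparison of the two lsmod listings less error-prone than reading them side by side, something like the following could be used. It is only a sketch: the function name lsmod_diff and the file names in the usage line are made up for illustration, and it assumes each host's lsmod output has been saved to a text file first.

```shell
# Print module names that occur in only one of two saved lsmod outputs.
# lsmod_diff is a hypothetical helper name, not part of any package.
lsmod_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Column 1 of lsmod output is the module name; sort it for comm(1).
    awk '{print $1}' "$1" | LC_ALL=C sort -u > "$a"
    awk '{print $1}' "$2" | LC_ALL=C sort -u > "$b"
    # comm -23: lines only in the first file; comm -13: only in the second.
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/only-in-first: /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/only-in-second: /'
    rm -f "$a" "$b"
}
```

Usage would be along the lines of "lsmod > meta.txt" on the meta server, "lsmod > client.txt" on a client, then "lsmod_diff meta.txt client.txt" on whichever machine holds both files.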
I have tested almost everything and I can't explain why the beegfs-meta and storage services don't recognize the IB interfaces and only use TCP over the 10G network.
Does anyone know of a similar case?
Thanks in advance
Greetings
Omnia