FhGFS Infiniband problem

Lars Oergel

Mar 8, 2012, 11:01:09 AM
to fhgfs-user
Hello,

I've set up a small test installation of FhGFS 2011.04.r14 on our
openSUSE 11.2 based Infiniband cluster (4 storage nodes, 1 meta node,
2 client nodes), but I can't get it working with Infiniband support.

We use kernel 2.6.31.12-0.1-desktop and a custom OFED version
1.5.1.rc1. The server processes seem to see the Infiniband devices
(I see "/dev/infiniband/uverbs0" and "infinibandevent" in the output of
lsof | grep <storage-server-pid>). The client kernel module has been
built correctly and uses the ib_core module:

<clientnode>:~ # lsmod|grep fhgfs
fhgfs 308176 1
fhgfs_client_opentk 40264 1 fhgfs
rdma_cm 43060 2 fhgfs_client_opentk,rdma_ucm
ib_core 89740 11 fhgfs_client_opentk,rdma_ucm,rdma_cm,ib_cm,iw_cm,ib_sa,ib_uverbs,ib_umad,mlx4_ib,ib_mthca,ib_mad

So everything _seems_ to be fine, but benchmarking and looking at the
logs reveals the problem:

<one_storage_node>:~# cat /var/log/fhgfs-storage.log
(4) Mar08 16:36:03 Main [App] >> Initializing components...
(3) Mar08 16:36:03 Main [DGramLis] >> Listening for UDP datagrams: Port 8003
(3) Mar08 16:36:03 Main [StreamLis] >> Listening for TCP connections: Port 8003
(4) Mar08 16:36:03 Main [App] >> Components initialized.
(1) Mar08 16:36:03 Main [App] >> Version: 2011.04-r14
(2) Mar08 16:36:03 Main [App] >> LocalNodeID: 03cl11
(2) Mar08 16:36:03 Main [App] >> Usable NICs: eth0(TCP)
(4) Mar08 16:36:03 Main [App] >> Extended list of usable NICs:
+ eth0[inet addr: 10.111.11.3; bcast addr: 10.111.255.255; hw addr: 00.30.48.f0.19.54; metric: 0; bandwidth: 2; type: TCP]
(2) Mar08 16:36:03 Main [App] >> Storage targets: 1
(4) Mar08 16:36:03 Main [App] >> Detaching process...
(4) Mar08 16:36:03 Main [App] >> Starting up components...
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Waiting for management node...
(4) Mar08 16:36:03 Worker1 [Worker1] >> Ready (TID: 27208)
(4) Mar08 16:36:03 Worker2 [Worker2] >> Ready (TID: 27210)
(4) Mar08 16:36:03 Worker3 [Worker3] >> Ready (TID: 27211)
(4) Mar08 16:36:03 Worker4 [Worker4] >> Ready (TID: 27212)
(4) Mar08 16:36:03 Worker5 [Worker5] >> Ready (TID: 27213)
(4) Mar08 16:36:03 Worker6 [Worker6] >> Ready (TID: 27214)
(4) Mar08 16:36:03 Worker7 [Worker7] >> Ready (TID: 27215)
(4) Mar08 16:36:03 Worker8 [Worker8] >> Ready (TID: 27216)
(4) Mar08 16:36:03 DirectWorker1 [DirectWorker1] >> Ready (TID: 27217)
(4) Mar08 16:36:03 Main [App] >> Components running.
(4) Mar08 16:36:03 Main [App] >> Joining component threads...
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Downloading node groups...
(3) Mar08 16:36:03 HBeatMgr [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.111.11.1:8008
(2) Mar08 16:36:03 DGramLis [Heartbeat incoming] >> New node [ID: 01cl11; Type: Management; Source: 10.111.11.1]
(2) Mar08 16:36:03 DGramLis [Heartbeat incoming] >> Number of nodes in the system: 1 (Type: Management)
(3) Mar08 16:36:03 HBeatMgr [NodeConn (acquire stream)] >> Connected: 10.111.11.1:8008
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Node registration...
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Node registration successful.
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Storage targets registration successful.
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Init complete.
(3) Mar08 16:36:06 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.1:57916 [SockFD: 16]
(3) Mar08 16:37:14 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.1:57917 [SockFD: 18]
(3) Mar08 16:47:41 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.7:53984 [SockFD: 20]

=> The line "Usable NICs: eth0(TCP)" shows the problem: the Infiniband
devices are not detected and are therefore not used. :-(
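
For anyone who wants to script this check, here is a minimal sketch that
pulls the NIC list out of the log and tests for an RDMA entry. The
function names are made up for illustration, and the "(RDMA)" tag is an
assumption by analogy with the "(TCP)" tag in the log above; I have not
confirmed the exact spelling the daemon prints for Infiniband NICs.

```shell
#!/bin/sh
# Hypothetical helpers; not part of FhGFS.

# Print the NIC list from the most recent "Usable NICs:" line of a log file.
# $1: path to a log file such as /var/log/fhgfs-storage.log
usable_nics() {
    grep 'Usable NICs:' "$1" | tail -n 1 | sed 's/.*Usable NICs: *//'
}

# Succeed if that NIC list contains an RDMA-tagged entry.
# ASSUMPTION: RDMA-capable NICs are tagged "(RDMA)", mirroring "(TCP)".
has_rdma_nic() {
    usable_nics "$1" | grep -q '(RDMA)'
}

# Usage on a storage node:
#   usable_nics /var/log/fhgfs-storage.log
#   has_rdma_nic /var/log/fhgfs-storage.log || echo "no RDMA NIC in use"
```

Run against the log above, usable_nics would print just "eth0(TCP)".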

What's my mistake? Can you point me in the right direction?

Thank you very much,

Lars Oergel

Sven Breuner

Mar 8, 2012, 12:46:24 PM
to fhgfs...@googlegroups.com
Hi Lars,

Does this print any useful output on the storage server:
$ fhgfs-opentk-lib-update-ib

From what I see below, it seems like you don't have IP over IB enabled.
$ modprobe ib_ipoib
(...and set up IP addresses afterwards)

Even though fhgfs is using native Infiniband RDMA, it requires IP
addresses for the Infiniband interfaces. The reason for this is that
fhgfs is using the OFED RDMA Connection Manager to establish a native IB
connection. The connection manager also works on other RDMA-enabled
interconnects besides IB (such as RoCE) and is thus based on IP addresses.
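
A quick way to verify both prerequisites (module loaded, IP address
assigned) could look like the sketch below. The helper names and the
interface name ib0 are assumptions for illustration; the helpers only
parse the output of lsmod and "ip -o -4 addr show" and change nothing on
the node.

```shell
#!/bin/sh
# Hypothetical helpers; not part of FhGFS or OFED.

# Succeed if `lsmod` output on stdin lists the ib_ipoib module.
ipoib_loaded() {
    grep -q '^ib_ipoib[[:space:]]'
}

# Print the IPv4 address(es) of interface $1,
# reading `ip -o -4 addr show` output on stdin.
ipv4_of_iface() {
    awk -v dev="$1" '$2 == dev && $3 == "inet" { print $4 }'
}

# Usage on a real node (ib0 is an assumed interface name):
#   lsmod | ipoib_loaded || echo "ib_ipoib is not loaded"
#   ip -o -4 addr show | ipv4_of_iface ib0
```

If the second command prints nothing, the interface has no IPv4 address
yet, so the RDMA Connection Manager has nothing to address it by.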

Best regards,
Sven
Fraunhofer

Lars Oergel

Mar 22, 2012, 10:13:51 AM
to fhgfs-user
Hi Sven,

Thank you for your quick answer. I think the cause is our disabled
IPoIB, because fhgfs-opentk-lib-update-ib just says "Setting symlink in
/opt/fhgfs/lib: libfhgfs-opentk.so -> libfhgfs-opentk-enabledIB.so",
which looks good to me.
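
The symlink reported by that message can also be checked directly. A
minimal sketch, using only the path and file names quoted above; the
function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper; not part of FhGFS.

# Succeed if symlink $1 points at the IB-enabled opentk library.
opentk_ib_enabled() {
    target=$(readlink "$1") || return 1   # fails if $1 is not a symlink
    case "${target##*/}" in               # compare the file name only
        libfhgfs-opentk-enabledIB.so) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   opentk_ib_enabled /opt/fhgfs/lib/libfhgfs-opentk.so && echo "IB build selected"
```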

At the moment, I don't have an ib_ipoib kernel module, because our
vendor didn't compile it, and I'm having trouble compiling it so that
it fits into our existing Infiniband environment. My problem is that
I'm testing on our production cluster for CFD computations, so I must
be careful what I do... :-)

But I think this is the problem, so many thanks for your reply.

Best regards

Lars

On Mar 8, 18:46, Sven Breuner <sven.breu...@itwm.fraunhofer.de>
wrote: