Hello,
I've set up a small test installation of FhGFS 2011.04-r14 on our
openSUSE 11.2 based InfiniBand cluster
(4 storage nodes, 1 meta node, 2 client nodes), but I can't get it
working with InfiniBand support.
We use kernel 2.6.31.12-0.1-desktop and a custom OFED version
1.5.1.rc1. The server processes
seem to see the InfiniBand devices (I see "/dev/infiniband/uverbs0"
and "infinibandevent" in the output of
lsof | grep <storage-server-pid>). The client kernel module has been
built correctly and uses the ib_core module:
<clientnode>:~ # lsmod|grep fhgfs
fhgfs 308176 1
fhgfs_client_opentk 40264 1 fhgfs
rdma_cm 43060 2 fhgfs_client_opentk,rdma_ucm
ib_core 89740 11 fhgfs_client_opentk,rdma_ucm,rdma_cm,ib_cm,iw_cm,ib_sa,ib_uverbs,ib_umad,mlx4_ib,ib_mthca,ib_mad
So, everything _seems_ to be fine, but benchmarking and a look at the
logs show the problem:
<one_storage_node>:~# cat /var/log/fhgfs-storage.log
(4) Mar08 16:36:03 Main [App] >> Initializing components...
(3) Mar08 16:36:03 Main [DGramLis] >> Listening for UDP datagrams: Port 8003
(3) Mar08 16:36:03 Main [StreamLis] >> Listening for TCP connections: Port 8003
(4) Mar08 16:36:03 Main [App] >> Components initialized.
(1) Mar08 16:36:03 Main [App] >> Version: 2011.04-r14
(2) Mar08 16:36:03 Main [App] >> LocalNodeID: 03cl11
(2) Mar08 16:36:03 Main [App] >> Usable NICs: eth0(TCP)
(4) Mar08 16:36:03 Main [App] >> Extended list of usable NICs:
+ eth0[inet addr: 10.111.11.3; bcast addr: 10.111.255.255; hw addr: 00.30.48.f0.19.54; metric: 0; bandwidth: 2; type: TCP]
(2) Mar08 16:36:03 Main [App] >> Storage targets: 1
(4) Mar08 16:36:03 Main [App] >> Detaching process...
(4) Mar08 16:36:03 Main [App] >> Starting up components...
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Waiting for management node...
(4) Mar08 16:36:03 Worker1 [Worker1] >> Ready (TID: 27208)
(4) Mar08 16:36:03 Worker2 [Worker2] >> Ready (TID: 27210)
(4) Mar08 16:36:03 Worker3 [Worker3] >> Ready (TID: 27211)
(4) Mar08 16:36:03 Worker4 [Worker4] >> Ready (TID: 27212)
(4) Mar08 16:36:03 Worker5 [Worker5] >> Ready (TID: 27213)
(4) Mar08 16:36:03 Worker6 [Worker6] >> Ready (TID: 27214)
(4) Mar08 16:36:03 Worker7 [Worker7] >> Ready (TID: 27215)
(4) Mar08 16:36:03 Worker8 [Worker8] >> Ready (TID: 27216)
(4) Mar08 16:36:03 DirectWorker1 [DirectWorker1] >> Ready (TID: 27217)
(4) Mar08 16:36:03 Main [App] >> Components running.
(4) Mar08 16:36:03 Main [App] >> Joining component threads...
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Downloading node groups...
(3) Mar08 16:36:03 HBeatMgr [NodeConn (acquire stream)] >> Establishing new TCP connection to: 10.111.11.1:8008
(2) Mar08 16:36:03 DGramLis [Heartbeat incoming] >> New node [ID: 01cl11; Type: Management; Source: 10.111.11.1]
(2) Mar08 16:36:03 DGramLis [Heartbeat incoming] >> Number of nodes in the system: 1 (Type: Management)
(3) Mar08 16:36:03 HBeatMgr [NodeConn (acquire stream)] >> Connected: 10.111.11.1:8008
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Node registration...
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Node registration successful.
(2) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Storage targets registration successful.
(3) Mar08 16:36:03 HBeatMgr [HBeatMgr] >> Init complete.
(3) Mar08 16:36:06 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.1:57916 [SockFD: 16]
(3) Mar08 16:37:14 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.1:57917 [SockFD: 18]
(3) Mar08 16:47:41 StreamLis [StreamLis] >> Accepted new connection from 10.111.11.7:53984 [SockFD: 20]
=> The line "Usable NICs: eth0(TCP)" shows the problem: the InfiniBand
devices are not detected and
therefore not used. :-(
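For checking the same thing on the other nodes, here is a rough sketch of
how the "Usable NICs" line could be pulled out of a log. The function name
usable_nics is made up for illustration, and the "RDMA" label is only my
assumption of how an InfiniBand interface would be listed; the log above
only confirms the name(TYPE) form for eth0(TCP).

```python
import re

def usable_nics(log_text):
    """Return (name, type) pairs from the last 'Usable NICs:' line of a log."""
    lines = [l for l in log_text.splitlines() if "Usable NICs:" in l]
    if not lines:
        return []
    # Everything after the marker is a space-separated list like "eth0(TCP)".
    entries = lines[-1].split("Usable NICs:", 1)[1]
    return re.findall(r"(\S+?)\((\w+)\)", entries)

# Line taken from the storage log above.
sample = "(2) Mar08 16:36:03 Main [App] >> Usable NICs: eth0(TCP)"
nics = usable_nics(sample)
print(nics)
# Assumption: an RDMA-capable NIC would carry the type label "RDMA".
print(any(nic_type == "RDMA" for _, nic_type in nics))
```

On a real node the text would come from reading /var/log/fhgfs-storage.log
instead of the sample string.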
What's my mistake? Can you point me in the right direction?
Thank you very much,
Lars Oergel