solved: another 'unknown storage target' story


Harry Mangalam

Sep 11, 2014, 5:32:59 PM
to fhgfs...@googlegroups.com


We hit an interesting failure that temporarily raised my blood pressure more than a little.  Since it was due to a simple problem that might yet bite some others, here's the description.

All nodes run CentOS 6.4: 1 MDS running the meta, mgmt, and admon services, plus 4 storage servers with 2x2 and 2x3 RAIDs, all on QDR IB.

We were upgrading our BeeGFS (bfs) to r9 a couple of days ago and were having some problems convincing CentOS to give us the correct libibverbs library.
=====
Starting FhGFS Storage Server: libibverbs: Warning: couldn't load driver 'mlx5': /usr/lib64/libmlx5-rdmav2.so: symbol ibv_cmd_create_qp_ex, version IBVERBS_1.1 not defined in file libibverbs.so.1 with link time reference
libibverbs: Warning: couldn't load driver 'mlx4': /usr/lib64/libmlx4-rdmav2.so: symbol ibv_cmd_create_flow, version IBVERBS_1.0 not defined in file libibverbs.so.1 with link time reference
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
librdmacm: Fatal: no RDMA devices found
=====
i.e., bfs would come up with TCP, but not RDMA.


We finally tried 'yum updating' all the bfs nodes to see if that would do it.  It did not - we still had to manually copy and link the correct lib into place, which finally solved it.  We're still investigating why that was necessary; the update seems to have insisted on providing an older version.
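
For the record, the manual fix was roughly the following (a sketch from memory; the exact filenames and versions will differ on other systems, and /root/ofed-libs is just a hypothetical stash of the known-good copies):

=====
# See which libibverbs the runtime linker actually resolves:
ldconfig -p | grep libibverbs

# Put the known-good library back and re-point the symlink:
cp /root/ofed-libs/libibverbs.so.1.0.0 /usr/lib64/
ln -sf /usr/lib64/libibverbs.so.1.0.0 /usr/lib64/libibverbs.so.1
ldconfig
=====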

However, after the update and reboot, the clients were unable to mount the filesystem, claiming that there weren't enough valid targets available.  (BP starts to rise.)  Sure enough, on the servers with 2x3 RAIDs, only 1 of the RAIDs was mounted; the others failed with 'already mounted or busy'.

Skipping a few hours of rising BP: we discovered that the yum update had re-triggered the old device-renumbering bug and rearranged our RAID device names, so that only 1 of them was still valid.  Because of the rearrangement, bfs had tried to allocate targetIDs to the wrong devices and failed, but the stale targets were still listed.  After reading up on the 'unknown storage target' error, a quick edit of the fstab plus unmapping the stale targets via 'fhgfs-ctl --unmaptarget <targetID>' put us back in business.
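
For anyone hitting the same thing, the recovery was along these lines (a sketch; I'm quoting the listing mode from memory, so check 'fhgfs-ctl --help' on your version):

=====
# Show the targets the mgmt daemon knows about; the stale entries
# from the failed init show up here:
fhgfs-ctl --listtargets --nodetype=storage

# Unmap each target that got attached to the wrong device:
fhgfs-ctl --unmaptarget <targetID>
=====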

Another reason to use disk UUIDs rather than device names in fstab.
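
i.e., something like this (the UUID and mount point below are made up):

=====
# Find the filesystem UUID of each RAID device:
blkid /dev/sdb

# Then mount by UUID in /etc/fstab so a device renumbering
# can't shuffle the targets, e.g.:
# UUID=3f2a0c9e-1111-2222-3333-444455556666  /mnt/fhgfs/target01  xfs  noatime  0 0
=====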

The explanation in the FAQ was accurate and worked as described, and the next mount attempt brought everything back to normal.  We have also edited the FirstRun options to prevent some of these problems in the future.

There were several points in the process, though, that made me wish for paid support, and now that we're going to have about a PB running under bfs, we're going to revisit that with our budget people.

Frank Kautz

Sep 12, 2014, 6:00:42 AM
to fhgfs...@googlegroups.com
Hello Harry,

the IB problem sounds like a mixture of OFED and Red Hat based IB
packages. This happens when you are using OFED and do a "yum update",
which installs the Red Hat based packages. You can uninstall the OFED
packages with the uninstall script from OFED and install all IB-related
packages with 'yum groupinstall "Infiniband Support"'. But be careful:
the Red Hat packages do not support RDMA over Converged Ethernet
(RoCE), while OFED does.
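
Roughly like this (the uninstall script name varies with the OFED
distribution; Mellanox OFED ships it as ofed_uninstall.sh, so treat
the path as an example):

=====
# Remove the OFED stack with its own uninstaller:
/usr/sbin/ofed_uninstall.sh

# Then install the distribution IB stack:
yum groupinstall "Infiniband Support"
=====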

kind regards
Frank