Hello,
we have a problem with InfiniBand RDMA on the following versions:
beegfs-*-7.0-el7.noarch
CentOS 7.5.1804
We are not using the latest kernel version
(kernel-3.10.0-862.11.6.el7.x86_64) because RDMA doesn't work at all on it.
Instead we use the previous one - kernel-3.10.0-862.9.1.el7.x86_64.
The kernel module was compiled using vanilla libs. We also use the vanilla
Mellanox drivers. The network cards on all boxes are Mellanox MT27520 Family
ConnectX-3 Pro.
The setup consists of 6 servers, each running storage and meta services
and storing data on local storage. We use RAID-6 with 16 x 10 TB HDDs for
storage and RAID-1 with 2 x 480 GB SSDs.
Starting beegfs-client works fine, and the only anomalies are the following
kernel messages:
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Everything works fine until we run a heavier load: the IO-500 benchmark with
just 8 clients on one node.
On kernel-3.10.0-862.9.1.el7.x86_64 we hit the following issue:
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy sendCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy recvCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy recvCompChannel
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to dealloc pd
which is similar to the situation described in this thread:
https://groups.google.com/forum/#!topic/fhgfs-user/bobA3YEvEMc
The end result was that we had to restart everything to get BeeGFS
working again.
Has anyone hit similar issues?
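For anyone who wants to check their own logs for the same pattern, here is a
minimal sketch in Python. It assumes syslog lines in the format shown above;
the function name count_cleanup_failures and the sample text are made up for
illustration and are not part of BeeGFS. It counts the failed cleanups per
resource and tolerates the line wrap between the function name and the message.

```python
import re
from collections import Counter

# Matches the beegfs-storage cleanup failures quoted above. "\s+" tolerates
# a line wrap between the function name and the message text.
CLEANUP_FAILURE = re.compile(
    r"__IBVSocket_cleanupCommContext:\s+Failed to (?:destroy|dealloc) (\w+)"
)


def count_cleanup_failures(log_text: str) -> Counter:
    """Count failed IBVSocket cleanups per resource (sendCQ, recvCQ, ...)."""
    return Counter(CLEANUP_FAILURE.findall(log_text))


# Sample input copied from the log excerpt in this post (hypothetical usage).
sample = """Sep 20 11:13:30 storage01 beegfs-storage[1454]:
__IBVSocket_cleanupCommContext: Failed to destroy sendCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]:
__IBVSocket_cleanupCommContext: Failed to destroy recvCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]:
__IBVSocket_cleanupCommContext: Failed to destroy recvCompChannel
Sep 20 11:13:30 storage01 beegfs-storage[1454]:
__IBVSocket_cleanupCommContext: Failed to dealloc pd
"""

for resource, count in count_cleanup_failures(sample).items():
    print(f"{resource}: {count}")
```

Running this over the full log of each storage server should show whether the
failures are limited to one box or spread across all six.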
Cheers
emir