Problems with BeeGFS 7 on CentOS 7 with RDMA


Emir Imamagic

Sep 20, 2018, 10:39:57 AM
to fhgfs...@googlegroups.com
Hello,

we have a problem using InfiniBand RDMA access with the following versions:
beegfs-*-7.0-el7.noarch
CentOS 7.5.1804

We are not using the latest kernel version
(kernel-3.10.0-862.11.6.el7.x86_64) because RDMA doesn't work at all on it.
Instead we use the previous one, kernel-3.10.0-862.9.1.el7.x86_64.

The kernel module was compiled using the vanilla libs. We also use the
vanilla Mellanox drivers. The network cards on all boxes are Mellanox
MT27520 Family ConnectX-3 Pro.

The setup consists of 6 servers, each running storage and meta services
and storing data on local storage. We use RAID-6 with 16 x 10 TB HDDs for
storage and RAID-1 with 2 x 480 GB SSDs.

Starting beegfs-client works fine, and the only anomaly is the following
repeated kernel message:
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
Sep 20 16:23:40 storage01 kernel: __IBVSocket_createCommContext: enabling unsafe global rkey

Everything works fine until we run a heavier load: the io-500 benchmark
with just 8 clients on one node.

On kernel-3.10.0-862.9.1.el7.x86_64 we hit the following issue:
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy sendCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy recvCQ
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy recvCompChannel
Sep 20 11:13:30 storage01 beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to dealloc pd
which is similar to the situation described in this thread:
https://groups.google.com/forum/#!topic/fhgfs-user/bobA3YEvEMc

In the end we had to restart everything in order for BeeGFS to continue
working.

Has anyone hit similar issues?

Cheers
emir
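
For anyone who wants to check their own servers for the same pattern, here is a minimal sketch that tallies the __IBVSocket_* messages quoted above. It assumes each message sits on a single syslog line (the mail client wrapped them in this post); the function names tally_ibv_messages and cleanup_failures are made up for illustration and are not part of BeeGFS.

```python
import re
from collections import Counter

# Matches the RDMA socket messages quoted in this thread, e.g.
# "... beegfs-storage[1454]: __IBVSocket_cleanupCommContext: Failed to destroy sendCQ"
# Assumption: one message per line, as in an unwrapped syslog file.
IBV_LINE = re.compile(r"(__IBVSocket_\w+):\s+(.+?)\s*$")


def tally_ibv_messages(lines):
    """Count (function, message) pairs for every __IBVSocket_* log line."""
    counts = Counter()
    for line in lines:
        match = IBV_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def cleanup_failures(counts):
    """Total number of __IBVSocket_cleanupCommContext lines in a tally."""
    return sum(
        n for (func, _msg), n in counts.items()
        if func == "__IBVSocket_cleanupCommContext"
    )


if __name__ == "__main__":
    import sys

    # Usage: python tally_ibv.py < /var/log/messages
    tally = tally_ibv_messages(sys.stdin)
    for (func, msg), n in tally.most_common():
        print(f"{n:6d}  {func}: {msg}")
```

A non-zero count for __IBVSocket_cleanupCommContext would correspond to the server-side errors described above, while lines from __IBVSocket_createCommContext are the rkey message seen at client start.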

Emir Imamagic

Sep 23, 2018, 2:12:15 AM
to fhgfs...@googlegroups.com
Hello,

a short update just for the record. We had to downgrade the kernel on the
servers to the one from CentOS 7.4 (3.10.0-693.el7.x86_64) in order to get
rid of the server-side errors (__IBVSocket_cleanupCommContext).

The message "__IBVSocket_createCommContext: enabling unsafe global rkey"
appears with the kernels from both CentOS 7.4 and 7.5, but it doesn't seem
to affect client performance. According to
https://patchwork.kernel.org/patch/9313483/ it is just a warning, so
we'll ignore it.

It would be interesting to hear if anyone else is seeing them.

Cheers
emir
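
To summarise the kernel reports so far, here is a small sketch that looks up the running kernel release against the three builds mentioned in this thread. The table only restates what was reported above for this one setup (BeeGFS 7.0, ConnectX-3 Pro); it is not a general compatibility list, and the name reported_status is made up for illustration.

```python
import platform

# Kernel releases and the RDMA behaviour reported for them in this thread.
# Assumption: this reflects one site's experience only.
REPORTED = {
    "3.10.0-862.11.6.el7.x86_64": "RDMA reported not working at all",
    "3.10.0-862.9.1.el7.x86_64": "RDMA works, cleanupCommContext errors under load",
    "3.10.0-693.el7.x86_64": "no server-side errors reported",
}


def reported_status(release):
    """Return what this thread reported for a kernel release string."""
    return REPORTED.get(release, "not mentioned in this thread")


if __name__ == "__main__":
    # platform.release() gives the same string as `uname -r` on Linux.
    release = platform.release()
    print(release, "->", reported_status(release))
```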

Alexander Åhman

Oct 1, 2018, 10:47:17 AM
to fhgfs...@googlegroups.com
Hi,
Yes, we have seen it too. One of our login nodes (a BeeGFS client)
crashed, and this was the last printout before the machine rebooted itself.

kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
kernel: __IBVSocket_createCommContext: enabling unsafe global rkey
kernel: __IBVSocket_createCommContext: enabling unsafe global rkey

I don't know whether the reboot has anything to do with BeeGFS, but
somehow all the other nodes were affected and switched over to a 1G
fail-over Ethernet link. None of them wanted to use IB RDMA after this
(connection error). Very strange...
After some reboots everything is working again over RDMA, except for the
metadata servers.

By the way, we are also using CentOS 7.5.1804 and BeeGFS 7.0.

Regards,
Alexander

Alexander Åhman

Oct 1, 2018, 10:57:12 AM
to fhgfs...@googlegroups.com
Sorry for "spamming" the list. Apparently my e-mail client isn't working too great either. Only the last mail should have been sent, disregard the rest.

Regards,
Alexander

James Burton

Oct 1, 2018, 10:58:45 AM
to fhgfs...@googlegroups.com
I have seen the "enabling unsafe global rkey" message before. It appears to be a warning and can be disregarded.

We're using Oracle Linux 7.5, which is a RHEL clone with a different kernel, and BeeGFS 7 on a very similar setup, and we are having no problems with RDMA.

--
You received this message because you are subscribed to the Google Groups "beegfs-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fhgfs-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625

jiangbo...@gmail.com

Oct 6, 2018, 8:14:08 AM
to fhgfs-user

Hello,

How can I package an RPM from the latest BeeGFS 7.1 code? The wiki says there is a find command in 7.1, but I cannot find the implementation of that command in the code.

Thank you for answering.


