beegfs-mgmtd: Too many open files


James Burton

Jan 18, 2019, 10:15:12 AM
to fhgfs...@googlegroups.com
Greetings:

This morning users were having problems connecting to the beegfs filesystem.

Looking at the beegfs-mgmtd.log on the mgmtd server showed the following error:

(0) Jan18 09:25:59 StreamLis [StreamLis] >> Trying to continue after connection accept error: Error during socket accept(): Too many open files

I have already increased the limit of open files to 1000000, but this seems to only delay the problem.

[root@bgfs001 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030239
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1030239
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
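
For reference, a shell ulimit doesn't necessarily carry over to a daemon started by systemd, so the limit may also need to be set on the service itself. A rough sketch of how that can be done, assuming the unit is named beegfs-mgmtd.service:

[root@bgfs001 ~]# systemctl edit beegfs-mgmtd
# in the editor that opens, add:
[Service]
LimitNOFILE=1000000
[root@bgfs001 ~]# systemctl restart beegfs-mgmtd
# confirm the running daemon picked it up:
[root@bgfs001 ~]# grep 'Max open files' /proc/$(pidof beegfs-mgmtd)/limits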

A workaround is to kill and restart the mgmtd, but I would like to know what the root cause is and how to prevent this from happening again.

I am running BeeGFS 7.1.1 on Oracle Linux 7.5 (mostly a RHEL clone) with kernel 4.1.12-103.9.4.el7uek.x86_64.

Any help would be appreciated.

Thanks,

Jim Burton



--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625

Roland Pabel

Jan 22, 2019, 2:59:25 AM
to fhgfs...@googlegroups.com
Hi,

Just as an addendum:

This happened to us a few days ago, too: exactly the same error message, and our
ulimit for open files is 1048576. We have two BeeGFS filesystems, but only one was
affected. We're running BeeGFS 7.1.1 on CentOS 7 (3.10.0-862.14.4.el7.x86_64).

Whenever I look at the open files of beegfs-mgmtd with lsof, the count is nowhere
near a million; it is more in the ballpark of one connection per cluster node.
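
For anyone who wants to check on their own system, the count can be taken roughly like this (assuming the process is named beegfs-mgmtd):

# descriptors held by the daemon, via lsof
lsof -p $(pidof beegfs-mgmtd) | wc -l
# or straight from /proc, without lsof
ls /proc/$(pidof beegfs-mgmtd)/fd | wc -l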

Roland
Dr. Roland Pabel
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Weyertal 121, Raum 3.07
D-50931 Köln

Tel.: +49 (221) 470-89589
E-Mail: pa...@uni-koeln.de


Jonathon Anderson

Jun 3, 2019, 12:16:06 PM
to beegfs-user
We are also experiencing this issue and don't have a root cause yet. We've been engaged with TPQ on it, but they haven't identified one either.

With our load we're seeing this issue every 24-72 hours, so we're now restarting beegfs-mgmtd twice daily to prevent it from happening.
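
For reference, the proactive restart is nothing more than a cron entry along these lines (a sketch; the file path is hypothetical, and it assumes systemd manages the service and that clients tolerate a brief mgmtd restart):

# /etc/cron.d/beegfs-mgmtd-restart
0 6,18 * * * root /usr/bin/systemctl restart beegfs-mgmtd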

We have some rough indications that the issue might be load-dependent (that is, dependent on cumulative activity since the last time beegfs-mgmtd was restarted) because the shortest time-to-failure we've seen was the night after we had done some benchmarking with mdtest.
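
For context, by "benchmarking with mdtest" we mean a metadata-heavy MPI run along these lines (the parameters here are purely illustrative, not what we actually ran):

# each of 128 tasks creates/stats/removes 10000 files, 3 iterations,
# each task in its own working directory
mpirun -np 128 mdtest -n 10000 -i 3 -u -d /beegfs/mdtest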

Until we found this thread we assumed we were the only site experiencing this, and expected it might be related to our high number of storage targets. Might still be related, though.

We have a planned outage this Wednesday (5 June 2019) during which, among other things, we're going to hit the filesystem hard with mdtest and see if we can cause a failure. We're also pulling gdb backtraces of beegfs-mgmtd every 5 minutes, so if we see another failure we should be able to catch it in the act.
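
The backtrace collection itself is roughly the following, fired from cron every 5 minutes (a sketch; the output path is arbitrary, and note that attaching gdb briefly pauses the daemon):

gdb -p $(pidof beegfs-mgmtd) -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt' \
    > /var/log/mgmtd-bt.$(date +%s).log 2>&1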

If we're unable to provoke another failure, we'll likely need to stop our proactive restarts long enough to let it fail naturally again, and hope we get some good data out of the gdb backtraces.

Has anyone else learned more information about this failure mode since it was first reported?

~jonathon

Sternberger

Jun 4, 2019, 8:51:41 AM
to beegfs-user
Hello!

We have the same problem here:
CentOS 7.6
BeeGFS 7.1.2

We have now been waiting for a fix for six months!

cheers

Sven Sternberger
System Engineer
DESY IT Hamburg, Germany