Good morning,
On Tue, 2017-01-10 at 02:31:52 +0100, Sven Breuner wrote:
> Hi Steffen,
>
> Steffen Grunewald wrote on 09.01.2017 15:21:
> >I'm experiencing mysterious "hangs" of BeeGFS daemons (meta and/or storage).
> >(Load up beyond 120, not even ps would work any longer.)
>
> You should check to make sure that you are not using a kernel on the servers
> that is affected by this problem:
>
> https://access.redhat.com/solutions/1386323
CentOS 7.2, with the kernel not yet upgraded:
storage02: Linux storage02 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
storage01: Linux storage01 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
so this one doesn't seem to apply.
I'll have to upgrade to a newer kernel soon anyway (once Intel provides their
OPA driver for it).
> >A forced reboot removes the open log files - the machines are using XFS.
> >Is there a way to send a second copy to an rsyslogd?
>
> There is currently no way to have the servers send their messages through
> rsyslog, although I can see that this would be interesting.
So my only chance to find out what exactly happened is to cat the right
local log file - if I can get this far at all.
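In the meantime I may try a workaround on my side: have a small watcher feed
the local BeeGFS log files into syslog, and let a normal rsyslog forwarding
rule ship that off to a remote host. A rough, untested sketch (file globs and
tag names are just my guess at what makes sense here):

  #!/bin/bash
  # Feed local BeeGFS server logs into syslog so rsyslog can forward them.
  # File globs and tags are assumptions for illustration only.
  for f in /var/log/beegfs-*-meta.log /var/log/beegfs-*-storage.log; do
      [ -e "$f" ] || continue
      tag=$(basename "$f" .log)
      # -F keeps following across rotation; logger -t sets the syslog tag
      tail -n0 -F "$f" | logger -t "$tag" &
  done
  wait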
Now, there's a big gap between the last rotated beegfs-*.log file and the
restart time, and all newer log files start at restart time (which seems
to indicate there was nothing to be rotated this time):
-rw-r--r-- 1 root root 5484 Jan 7 17:03 beegfs-scratch-meta.log.old-1
-rw-r--r-- 1 root root 89381 Jan 9 07:34 dmesg
-rw-r--r-- 1 root root 10742 Jan 9 07:34 boot.log
# head -n1 bee*log
==> beegfs-client.log <==
(1) Jan09 07:34:37 Main [App] >> BeeGFS Helper Daemon Version: 2015.03-r22
==> beegfs-home-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.
==> beegfs-home-storage.log <==
(3) Jan09 07:34:37 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8019
==> beegfs-scratch-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.
==> beegfs-scratch-storage.log <==
(3) Jan09 07:34:38 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8020
==> beegfs-work-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.
==> beegfs-work-storage.log <==
(3) Jan09 07:34:38 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8021
> But what do you mean by "a forced reboot removes the open log files"? Are
> they completely gone? Because usually, the BeeGFS services only rotate them
> to "beegfs-...log.old-1" unless log rotation is explicitly disabled in the
> config files.
The log files seem to have still been open at the time of the reset, and they
apparently get removed during the XFS mount. That's all I can say right now.
> Did you check the kernel log for messages that might be relevant?
Nothing close to the time when the FS became inaccessible for the first time
(around 3am on Jan 9), but there's a bunch of warnings like this one starting
about 9 hours earlier:
Jan 8 17:24:35 controller fm0_sm[1734]: WARN [PmAsyncRcv]: PM: PmPrintFailPort: Unable to Get(PortStatus) storage02 hfi1_0 Guid 0x001175010163452c LID 0x24 Port 1
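For the record, this is roughly how I scanned the logs around the incident,
in case someone spots a smarter way (the time window is just the one relevant
here):

  # kernel messages in a window around the first hang (~03:00 on Jan 9)
  journalctl -k --since "2017-01-09 01:00" --until "2017-01-09 05:00"
  # same window from the flat syslog file, in case journald missed anything
  grep '^Jan  9 0[1-4]:' /var/log/messages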
> >Also, how do I quickly find which daemon doesn't respond anymore? I might
> >start a beegfs-fsck and wait for the "servers incomplete" complaint, but
> >to be able to check quickly might save a life. Thinking about something
> >similar to "lfs check servers". Any suggestions?
>
> "beegfs-check-servers" will try to establish a connection to each service
> (without transmitting any data over this connection), so it is the most
> light-weight way to check if the servers are running.
I didn't know about this one, thanks for the pointer. Now if only it had a
well-defined return code? (There's no manual page, and --help doesn't tell me.)
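Until the exit status is documented, I'll probably just wrap it and derive a
return code from the output myself, something like this (the failure pattern
is a guess I'd have to verify against what the tool actually prints for a
dead service):

  #!/bin/bash
  # Wrapper sketch: turn beegfs-check-servers output into an exit code.
  # The grep pattern below is an assumption -- verify against real output.
  out=$(beegfs-check-servers 2>&1)
  rc=$?
  echo "$out"
  if [ $rc -ne 0 ] || echo "$out" | grep -qiE 'unreachable|fail|error'; then
      exit 1
  fi
  exit 0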
> "beegfs-df" needs to communicate with each metadata and storage service to
> request free space, so it is also a way to check if each service is
> responsive.
I would have to do this for all mount points, right?
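If so, a small loop should do, something along these lines; this assumes the
usual "<mountpoint> <client config>" lines in /etc/beegfs/beegfs-mounts.conf,
and that this beegfs-df version accepts -p <mountpoint> (I still need to
check its --help for that):

  # Sketch: run beegfs-df once per BeeGFS mount point.
  while read -r mnt cfg; do
      [ -z "$mnt" ] && continue
      echo "=== $mnt ==="
      beegfs-df -p "$mnt"
  done < /etc/beegfs/beegfs-mounts.conf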
> "beegfs-ctl --listtargets --nodetype=meta --state" (or "--nodetype=storage")
> will ask the management for the online state of the metadata or storage
> services. This will be set to offline when the management service does not
> receive heartbeats from the meta/storage services.
Again, I would have to do this for all mount points, right?
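Same pattern there, I suppose (assuming this beegfs-ctl accepts --mount=<path>
to pick the right client config; otherwise pointing it at the per-FS client
config via --cfgFile should work):

  # Sketch: online state of all meta and storage targets, per mount point.
  while read -r mnt cfg; do
      [ -z "$mnt" ] && continue
      echo "=== $mnt ==="
      beegfs-ctl --mount="$mnt" --listtargets --nodetype=meta --state
      beegfs-ctl --mount="$mnt" --listtargets --nodetype=storage --state
  done < /etc/beegfs/beegfs-mounts.conf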
Anyway, this list of commands gives me some tools to pin down the right
service, and hopefully I can come up with more details about the weird state
soon (although I hope it won't happen again too soon, since that undermines my
credibility as an admin).
Thanks,
- S