Send BeeGFS daemon logs to local fs *and* rsyslog? Quick server status?


Steffen Grunewald

Jan 9, 2017, 9:21:05 AM
to BeeGFS user list
Hello,

I'm experiencing mysterious "hangs" of BeeGFS daemons (meta and/or storage).
(Load up beyond 120, not even ps would work any longer.)
A forced reboot removes the open log files - the machines are using XFS.
Is there a way to send a second copy to an rsyslogd?

Also, how do I quickly find which daemon doesn't respond anymore? I might
start a beegfs-fsck and wait for the "servers incomplete" complaint, but
to be able to check quickly might save a life. Thinking about something
similar to "lfs check servers". Any suggestions?

Thanks,
S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Sven Breuner

Jan 9, 2017, 8:31:44 PM
to fhgfs...@googlegroups.com, Steffen Grunewald
Hi Steffen,

Steffen Grunewald wrote on 09.01.2017 15:21:
> I'm experiencing mysterious "hangs" of BeeGFS daemons (meta and/or storage).
> (Load up beyond 120, not even ps would work any longer.)

You should check to make sure that you are not using a kernel on the servers
that is affected by this problem:
https://access.redhat.com/solutions/1386323

> A forced reboot removes the open log files - the machines are using XFS.
> Is there a way to send a second copy to an rsyslogd?

There is currently no way to have the servers send their messages through
rsyslog, although I can see that this would be interesting.
But what do you mean by "a forced reboot removes the open log files"? Are they
completely gone? Because usually, the BeeGFS services only rotate them to
"beegfs-...log.old-1" unless log rotation is explicitly disabled in the
config files.

Did you check the kernel log for messages that might be relevant?

> Also, how do I quickly find which daemon doesn't respond anymore? I might
> start a beegfs-fsck and wait for the "servers incomplete" complaint, but
> to be able to check quickly might save a life. Thinking about something
> similar to "lfs check servers". Any suggestions?

"beegfs-check-servers" will try to establish a connection to each service
(without transmitting any data over this connection), so it is the most
light-weight way to check if the servers are running.
"beegfs-df" needs to communicate with each metadata and storage service to
request free space, so it is also a way to check if each service is responsive.
"beegfs-ctl --listtargets --nodetype=meta --state" (or "--nodetype=storage")
will ask the management for the online state of the metadata or storage
services. This will be set to offline when the management service does not
receive heartbeats from the meta/storage services.
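
A minimal sketch of how these checks could be scripted, assuming Python 3.5 or
newer on the node where the BeeGFS tools are installed. The command lines are
the ones named above; the 15-second timeout and the status labels are
assumptions made for illustration. Whether the tools report an unreachable
service through their exit code is not established here, so the sketch keeps
the output for inspection and treats a timeout as the main sign of a hang.

```python
import subprocess

# Command lines as quoted above. The dictionary keys are hypothetical labels.
CHECKS = {
    "check-servers": ["beegfs-check-servers"],
    "df": ["beegfs-df"],
    "meta-state": ["beegfs-ctl", "--listtargets", "--nodetype=meta", "--state"],
    "storage-state": ["beegfs-ctl", "--listtargets", "--nodetype=storage", "--state"],
}


def run_check(cmd, timeout=15):
    """Run cmd and return (status, output).

    status is one of 'ok', 'failed', 'timeout' or 'missing'. 'failed' only
    means a non-zero exit code; what the BeeGFS tools return in which
    situation is an open question, so the output should be read as well.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,  # assumption: 15 s is long enough for a healthy system
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    except FileNotFoundError:
        return "missing", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


def run_all(checks=CHECKS, timeout=15):
    """Run every check and return {label: (status, output)}."""
    return {label: run_check(cmd, timeout) for label, cmd in checks.items()}
```

Calling `run_all()` from cron or a monitoring plugin would give one status per
check; a 'timeout' entry points at a command that could not get an answer from
some service within the limit.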

Best regards,
Sven

Steffen Grunewald

Jan 10, 2017, 3:32:02 AM
to Sven Breuner, fhgfs...@googlegroups.com
Good morning,

On Tue, 2017-01-10 at 02:31:52 +0100, Sven Breuner wrote:
> Hi Steffen,
>
> Steffen Grunewald wrote on 09.01.2017 15:21:
> >I'm experiencing mysterious "hangs" of BeeGFS daemons (meta and/or storage).
> >(Load up beyond 120, not even ps would work any longer.)
>
> You should check to make sure that you are not using a kernel on the servers
> that is affected by this problem:
> https://access.redhat.com/solutions/1386323

CentOS 7.2, (yet) un-upgraded kernel
storage02: Linux storage02 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
storage01: Linux storage01 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

so this one doesn't seem to apply.
I have to upgrade to a newer kernel soon (once Intel provides its OPA driver
for that one).

> >A forced reboot removes the open log files - the machines are using XFS.
> >Is there a way to send a second copy to an rsyslogd?
>
> There is currently no way to have the servers send their messages through
> rsyslog, although I can see that this would be interesting.

So my only chance to find out what exactly happened is to cat the right
local log file - if I can get this far at all.
Now, there's a big gap between the last rotated beegfs-*.log file and the
restart time, and all newer log files start at restart time (which seems
to indicate there was nothing to be rotated this time):

-rw-r--r-- 1 root root 5484 Jan 7 17:03 beegfs-scratch-meta.log.old-1
-rw-r--r-- 1 root root 89381 Jan 9 07:34 dmesg
-rw-r--r-- 1 root root 10742 Jan 9 07:34 boot.log

# head -n1 bee*log
==> beegfs-client.log <==
(1) Jan09 07:34:37 Main [App] >> BeeGFS Helper Daemon Version: 2015.03-r22

==> beegfs-home-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.

==> beegfs-home-storage.log <==
(3) Jan09 07:34:37 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8019

==> beegfs-scratch-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.

==> beegfs-scratch-storage.log <==
(3) Jan09 07:34:38 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8020

==> beegfs-work-meta.log <==
(3) Jan09 07:34:38 Main [App] >> Root directory loaded.

==> beegfs-work-storage.log <==
(3) Jan09 07:34:38 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8021

> But what do you mean by "a forced reboot removes the open log files"? Are
> they completely gone? Because usually, the BeeGFS services only rotate them
> to "beegfs-...log.old-1" unless log rotation is explicitly disabled in the
> config files.

Log files seem to be open during the restart - and they apparently get removed
during xfs mount. That's all I can say right now.

> Did you check the kernel log for messages that might be relevant?

Nothing close to the time when the FS became inaccessible for the first time
(around 3am on Jan 9), but there's a bunch of those 9 hours before:

Jan 8 17:24:35 controller fm0_sm[1734]: WARN [PmAsyncRcv]: PM: PmPrintFailPort: Unable to Get(PortStatus) storage02 hfi1_0 Guid 0x001175010163452c LID 0x24 Port 1

> >Also, how do I quickly find which daemon doesn't respond anymore? I might
> >start a beegfs-fsck and wait for the "servers incomplete" complaint, but
> >to be able to check quickly might save a life. Thinking about something
> >similar to "lfs check servers". Any suggestions?
>
> "beegfs-check-servers" will try to establish a connection to each service
> (without transmitting any data over this connection), so it is the most
> light-weight way to check if the servers are running.

I didn't know about this one. Thanks for the pointer. Does it set a useful
return code? (There is no manual page, and --help doesn't say.)

> "beegfs-df" needs to communicate with each metadata and storage service to
> request free space, so it is also a way to check if each service is
> responsive.

I would have to do this for all mount points, right?

> "beegfs-ctl --listtargets --nodetype=meta --state" (or "--nodetype=storage")
> will ask the management for the online state of the metadata or storage
> services. This will be set to offline when the management service does not
> receive heartbeats from the meta/storage services.

Again, I would have to do this for all mount points, right?

Anyway, this list of commands will provide some tools to address the right service,
and I hopefully can come up with more details about the weird state soon (although
I hope that won't happen too soon as it undermines my credibility as an admin)

Thanks,
- S

Jan Behrend

Jan 20, 2017, 3:35:19 AM
to fhgfs...@googlegroups.com
Hi Sven,

On Tue, 2017-01-10 at 02:31 +0100, Sven Breuner wrote:
> > Is there a way to send a second copy to an rsyslogd?
>
> There is currently no way to have the servers send their messages through
> rsyslog, although I can see that this would be interesting.

Just this morning I was thinking about the same subject:
Is there a good way to collect all log messages across a BeeGFS cluster and
present them in a single place, so that events can be analyzed more precisely
in their chronological order?
Isn't this part of the helperd's job (for clients only)? "logType = syslog"
seems to be available on clients only. Are there drawbacks to using "syslog"
instead of "helperd"? What about the server side (storage, meta, mgmt, ...)?
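
For the chronological-ordering part, a rough sketch of what merging already
collected log files could look like, assuming Python 3.5 or newer. The line
layout is taken from the log excerpts earlier in this thread; reading the
leading number in parentheses as a log level is an assumption, as is the
requirement that all nodes have synchronized clocks. The timestamps carry no
year, so the caller has to supply one. The function names are hypothetical.

```python
import heapq
import re
from datetime import datetime

# Layout seen in this thread: "(3) Jan09 07:34:38 Main [App] >> Root directory loaded."
LINE_RE = re.compile(r"^\((\d+)\) ([A-Z][a-z]{2})(\d{2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def parse_line(line, year):
    """Return (timestamp, level, rest) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None or m.group(2) not in MONTHS:
        return None
    level, mon, day, hh, mm, ss, rest = m.groups()
    ts = datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return ts, int(level), rest


def tagged(source, lines, year):
    """Yield (timestamp, source, level, rest) for every parsable line."""
    for line in lines:
        parsed = parse_line(line.rstrip("\n"), year)
        if parsed is not None:  # continuation or foreign lines are skipped
            ts, level, rest = parsed
            yield ts, source, level, rest


def merge_logs(sources, year):
    """Merge several logs by timestamp.

    sources maps a label (for example "host/file") to an iterable of lines.
    Each input must already be in time order, which heapq.merge relies on.
    """
    streams = [tagged(name, lines, year) for name, lines in sources.items()]
    return heapq.merge(*streams, key=lambda rec: rec[0])
```

Passing open file objects as the iterables keeps memory use low, since
`heapq.merge` only holds one pending record per input.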

One solution would be to have rsyslog pick up the messages from the flat files,
but this seems like an awkward workaround ...
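
To illustrate the flat-file route, here is a minimal sketch that follows one
log file and hands every new line to the local syslog, assuming Python 3 on a
Unix system. rsyslog's own imfile input module covers the same ground without
custom code; the sketch only shows the idea. The function names, the ident
string, and the choice of the daemon facility and info priority are
assumptions. Sending the messages on to a central host would still be a matter
of rsyslog forwarding configuration.

```python
import os
import syslog
import time


def follow(path, poll=1.0):
    """Yield complete lines from path as they appear, starting at the top.

    Reopens the file when it has been replaced (for example after rotation).
    Runs until the caller closes the generator.
    """
    fh = open(path, "r", errors="replace")
    inode = os.fstat(fh.fileno()).st_ino
    buf = ""
    try:
        while True:
            chunk = fh.readline()
            if chunk:
                buf += chunk
                if buf.endswith("\n"):  # only emit complete lines
                    yield buf.rstrip("\n")
                    buf = ""
                continue
            # End of file: check whether the path now points to a new file.
            try:
                if os.stat(path).st_ino != inode:
                    fh.close()
                    fh = open(path, "r", errors="replace")
                    inode = os.fstat(fh.fileno()).st_ino
                    buf = ""
                    continue
            except FileNotFoundError:
                pass  # file is briefly missing during rotation
            time.sleep(poll)
    finally:
        fh.close()


def forward(path, ident):
    """Send every line of path to the local syslog (blocks forever)."""
    # Facility and priority are assumptions, not BeeGFS conventions.
    syslog.openlog(ident=ident, facility=syslog.LOG_DAEMON)
    for line in follow(path):
        syslog.syslog(syslog.LOG_INFO, line)
```

One process per log file, for example `forward(path, "beegfs-meta")` with the
path of the metadata log, would be the simplest arrangement. Lines only reach
syslog after the daemon has written them to disk, so anything still buffered
at the time of a hang is lost either way.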

Thanks for the insight,
cheers Jan

--
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Rechenzentrum
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn
Tel: +49 (228) 525 359, Fax: +49 (228) 525 229
http://www.mpifr-bonn.mpg.de


Sven Breuner

Jan 24, 2017, 10:20:18 AM
to fhgfs...@googlegroups.com, Jan Behrend, Steffen Grunewald
Hi Steffen and Jan,

We have created an internal development ticket to add a syslog option also for
the mgmtd/meta/storage services. It's not assigned yet, so don't expect this to
be coming within the next few weeks, but we hear you, so sooner or later...

As you said, there are of course tools that could do this kind of log
gathering, but that is not as elegant as having the BeeGFS services send
their messages directly to syslog.

Best regards,
Sven