mgmtd dies on startup

John Hanks

Jun 17, 2017, 5:11:32 AM

to beegfs-user

Hi,

We have a test setup of BeeGFS that has been working quite well for a year and has been upgraded from 2015 to 6.x along the way. We were at 6.9 when the problem below started; upgrading to 6.11 didn't change anything. Recently we moved our clusters into a new subnet, which required IP address changes, after which things seemed to run normally for a week or so. We now get this when (re)starting beegfs-mgmtd:

(3) Jun17 11:54:02 Main [App] >> Loaded metadata nodes: 1
(3) Jun17 11:54:02 Main [App] >> Loaded storage nodes: 1
(3) Jun17 11:54:02 Main [App] >> Loaded client nodes: 742
(3) Jun17 11:54:02 Main [App] >> Loaded target mappings: 1
(3) Jun17 11:54:02 Main [App.cpp:675] >> Loaded storage target states. NodeType: Storage target
(3) Jun17 11:54:02 Main [App.cpp:675] >> Loaded storage target states. NodeType: Metadata node
(3) Jun17 11:54:02 Main [DGramLis] >> Listening for UDP datagrams: Port 8008
(3) Jun17 11:54:02 Main [StreamLis] >> Listening for TCP connections: Port 8008
(1) Jun17 11:54:02 Main [App] >> Version: 6.9
(2) Jun17 11:54:02 Main [App] >> LocalNode: beegfs-mgmtd storage-810-35 [ID: 1]
(2) Jun17 11:54:02 Main [App] >> Usable NICs: virbr0(TCP) ib2(TCP) enp5s0f0(TCP)
(3) Jun17 11:54:02 HBeatMgr [HBeatMgr] >> Notifying stored nodes...
(0) Jun17 11:54:02 HBeatMgr [App (component exception handler)] >> This component encountered an unrecoverable error. [SysErr: Invalid argument] Exception message: sendto(10.109.240.69:8004): Hard Disconnect from Listen(Port: 8008): Invalid argument
(2) Jun17 11:54:02 HBeatMgr [App (component exception handler)] >> Shutting down...
(0) Jun17 11:54:02 HBeatMgr [PThread::signalHandler] >> Received a SIGABRT. Trying to shut down...
(1) Jun17 11:54:02 HBeatMgr [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-mgmtd(_ZN7PThread13signalHandlerEi+0x63) [0x52af83]
2: /lib64/libc.so.6(+0x35250) [0x7efd407f9250]
3: /lib64/libc.so.6(gsignal+0x37) [0x7efd407f91d7]
4: /lib64/libc.so.6(abort+0x148) [0x7efd407fa8c8]
5: /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165) [0x7efd413199d5]
6: /lib64/libstdc++.so.6(+0x5e946) [0x7efd41317946]
7: /lib64/libstdc++.so.6(+0x5e973) [0x7efd41317973]
8: /lib64/libstdc++.so.6(__cxa_rethrow+0x49) [0x7efd41317be9]
9: /opt/beegfs/sbin/beegfs-mgmtd(_ZN24AbstractDatagramListener18sendDummyToSelfUDPEv+0x19e) [0x5124fe]
10: /opt/beegfs/sbin/beegfs-mgmtd(_ZN3App14stopComponentsEv+0x161) [0x452731]
11: /opt/beegfs/sbin/beegfs-mgmtd(_ZN3App24handleComponentExceptionERSt9exception+0x277) [0x453607]
12: /opt/beegfs/sbin/beegfs-mgmtd(_ZN16HeartbeatManager3runEv+0x2f1) [0x4b3d51]
13: /opt/beegfs/sbin/beegfs-mgmtd(_ZN7PThread9runStaticEPv+0x125) [0x449105]
14: /lib64/libpthread.so.0(+0x7dc5) [0x7efd40b8cdc5]
15: /lib64/libc.so.6(clone+0x6d) [0x7efd408bb73d]

Searching for substrings of these messages turned up nothing useful, and as I'm still learning my way around BeeGFS, I'm not sure which direction to head first for more clues. Any suggestions or pointers would be appreciated.

Thanks,

jbh

--

‘[A] talent for following the ways of yesterday, is not sufficient to improve the world of today.’

- King Wu-Ling, ruler of the Zhao state in northern China, 307 BC

Sven Breuner

Jun 17, 2017, 6:15:19 AM

to fhgfs...@googlegroups.com, John Hanks

Hi John,

looks like there is a problem with sending a UDP message from the management
host to IP 10.109.240.69 at port 8004. As this is UDP, a problem detected
immediately on send can only be caused on the sender side.

Since you said it was running fine for a week after the IP address changes, is
10.109.240.69 an old client (i.e. was running the whole week already) or has it
just recently been added when the problem started?

Is there any firewall or other auditing software running on the management host
that could actively reject packets? Or maybe a missing route (as in /sbin/route)
for the destination 10.109.240.69?

Does e.g. this command return without any error on the management host?
$ echo test | netcat -u -q 1 10.109.240.69 8004

(From the BeeGFS point of view, this kind of error message when trying to send
a packet to 10.109.240.69:8004 is indeed a symptom of something that is critical
and needs attention, so the shutdown in this case is intentional.)

The problem happens on startup, because on startup, the management service tries
to get in touch with all previously registered clients and servers to check if they
still exist. If the client 10.109.240.69 no longer exists and you cannot find a
problem with the firewall or the routing, you could also delete the stored
clients file. This file is called "clients.nodes" inside the management directory.
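
A minimal sketch of setting the stored client list aside rather than deleting it, so that it can be restored if needed. It assumes the management service has been stopped first and that the management directory is the one configured as storeMgmtdDirectory in beegfs-mgmtd.conf. The function name set_aside_clients_file and the example path are hypothetical.

```shell
#!/bin/sh
# Move the stored client list aside instead of deleting it.
# Assumption: $1 is the management directory of beegfs-mgmtd.
set_aside_clients_file() {
    mgmtd_dir=$1
    src="$mgmtd_dir/clients.nodes"
    if [ ! -f "$src" ]; then
        echo "no clients.nodes in $mgmtd_dir" >&2
        return 1
    fi
    # A timestamp suffix keeps earlier backups from being overwritten.
    dst="$src.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$src" "$dst" && echo "$dst"
}

# Example (hypothetical path; stop beegfs-mgmtd before, start it again after):
#   set_aside_clients_file /data/beegfs/mgmtd
```

On success the function prints the path of the backup copy, so the file can be moved back under its original name if the change needs to be undone.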

I also noticed that the management service is exporting virbr0 as first
interface in its list. If that is unintended, you can set a connInterfacesFile
in /etc/beegfs/beegfs-mgmtd.conf to configure the exported interfaces
(documentation of options is included in the config file).
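
A sketch of preparing such an interface list, assuming the file holds one interface name per line with the most preferred interface first. The helper name write_iface_file and the path /etc/beegfs/mgmtd-interfaces are hypothetical; ib2 and enp5s0f0 are the interface names from the log above.

```shell
#!/bin/sh
# Write an interface list to a file, one name per line, in the order given.
# Assumption: the first line is treated as the most preferred interface.
write_iface_file() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# Example: prefer ib2, fall back to enp5s0f0, leave out virbr0.
#   write_iface_file /etc/beegfs/mgmtd-interfaces ib2 enp5s0f0
# Then point the service at it in /etc/beegfs/beegfs-mgmtd.conf:
#   connInterfacesFile = /etc/beegfs/mgmtd-interfaces
```

The service has to be restarted for the new list to take effect, and the "Usable NICs" line in the log shows which interfaces it ended up with.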

Best regards,
Sven

John Hanks

Jun 17, 2017, 7:09:37 AM

to Sven Breuner, beegfs-user

Sven,

Thanks for your response, replying inline.

On Sat, Jun 17, 2017 at 1:15 PM Sven Breuner <sven.b...@thinkparq.com> wrote:

> Hi John,
>
> looks like there is a problem with sending a UDP message from the management
> host to IP 10.109.240.69 at port 8004. As this is UDP, a problem detected
> immediately on send can only be caused on the sender side.
>
> Since you said it was running fine for a week after the IP address changes, is
> 10.109.240.69 an old client (i.e. was running the whole week already) or has it
> just recently been added when the problem started?

Everything was moved to 10.109.*.* from a 192.168.*.* range, so any old clients would still be using the 192.168 address, for which there is no longer an interface.

> Is there any firewall or other auditing software running on the management host
> that could actively reject packets? Or maybe a missing route (as in /sbin/route)
> for the destination 10.109.240.69?

No firewalls, this is all taking place inside the trusted IB network (using IPoIB).

> Does e.g. this command return without any error on the management host?
> $ echo test | netcat -u -q 1 10.109.240.69 8004

Yes, that works fine.

> (From the BeeGFS point of view, this kind of error message when trying to send
> a packet to 10.109.240.69:8004 is indeed a symptom of something that is critical
> and needs attention, so the shutdown in this case is intentional.)
>
> The problem happens on startup, because on startup, the management service tries
> to get in touch with all previously registered clients and servers to check if they
> still exist. If the client 10.109.240.69 no longer exists and you cannot find a
> problem with the firewall or the routing, you could also delete the stored
> clients file. This file is called "clients.nodes" inside the management directory.

Moving clients.nodes out of the way got us going again, thank you.

> I also noticed that the management service is exporting virbr0 as first
> interface in its list. If that is unintended, you can set a connInterfacesFile
> in /etc/beegfs/beegfs-mgmtd.conf to configure the exported interfaces
> (documentation of options is included in the config file).

I just noticed that too :). libvirtd isn't needed, so I shut that down as part of my troubleshooting.

> Best regards,
> Sven

Thanks for your help!

jbh
