beegfs-meta does not register and shows up as dead


Tobias Jakobi

Aug 18, 2016, 4:54:12 AM
to beegfs-user
Hello everyone,

Our building was hit by a power outage yesterday; the UPS batteries kicked in, and I cleanly shut the servers down since we did not know how long the power would be out.

However, when power came back, I restarted all machines and BeeGFS daemons, and now one of our two meta nodes no longer seems to register with the mgmt server.

Here is the log from the non-registering meta server (10.13.37.1):
(3) 10:45:20 Main [App] >> Root directory loaded.
(1) 10:45:20 Main [App] >> I got root (by possession of root directory)
(4) 10:45:20 Main [App] >> Disposal directory loaded.
(4) 10:45:20 Main [App] >> Detaching process...
(4) 10:45:20 Main [App] >> Initializing components...
(3) 10:45:20 Main [DGramLis] >> Listening for UDP datagrams: Port 8005
(3) 10:45:20 Main [ConnAccept] >> Listening for RDMA connections: Port 8005
(3) 10:45:20 Main [ConnAccept] >> Listening for TCP connections: Port 8005
(4) 10:45:20 Main [App] >> Components initialized.
(4) 10:45:20 Main [SessionStore (load)] >> load sessions from file: /data/beegfs/meta/sessions
(3) 10:45:20 Main [App] >> 4 sessions restored.
(1) 10:45:20 Main [App] >> Version: 2015.03-r16
(2) 10:45:20 Main [App] >> LocalNode: beegfs-meta hydra [ID: 1]
(2) 10:45:20 Main [App] >> Usable NICs: ib0(RDMA) ib0(TCP) eth0(TCP) 
(4) 10:45:20 Main [App] >> Extended list of usable NICs: 
+ ib0[ip addr: 10.13.37.1; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 5; type: RDMA] 
+ ib0[ip addr: 10.13.37.1; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 5; type: TCP] 
+ eth0[ip addr: 129.206.148.153; hw addr: 0c.c4.7a.6d.38.04; metric: 0; bandwidth: 2; type: TCP] 
(4) 10:45:20 Main [App] >> Starting up components...
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Waiting for management node...
(2) 10:45:20 DGramLis [Heartbeat incoming] >> New node: beegfs-mgmtd luna [ID: 1]; 
(3) 10:45:20 HBeatMgr [HBeatMgr] >> Management node found. Downloading node groups...
(4) 10:45:20 DGramLis [Heartbeat incoming] >> Number of nodes: Meta: 1; Storage: 0
(4) 10:45:20 HBeatMgr [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegfs...@10.13.37.3:8008
(3) 10:45:20 HBeatMgr [NodeConn (acquire stream)] >> Connected: beegfs...@10.13.37.3:8008 (protocol: TCP)
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Nodes added (sync results): 1 (Type: beegfs-meta)
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Nodes added (sync results): 2 (Type: beegfs-storage)
(4) 10:45:20 HBeatMgr [HBeatMgr] >> Removing 4 client sessions. 
(3) 10:45:20 HBeatMgr [HBeatMgr] >> 503-57B45B79-atlas: Removing 12 file sessions. (0 are unremovable)

I am somehow missing the node-registration part here. The process is not dying; it shows up normally in ps. We are using 2015.03-r16 on all machines.


And here is the log from the mgmt server (10.13.37.3):
(3) Aug18 10:42:31 Main [App] >> Loaded metadata nodes: 2
(3) Aug18 10:42:31 Main [App] >> Loaded storage nodes: 2
(3) Aug18 10:42:31 Main [App] >> Loaded clients: 0
(4) Aug18 10:42:31 Main [App] >> Loaded target numeric ID mappings: 2
(3) Aug18 10:42:31 Main [App] >> Loaded target mappings: 2
(3) Aug18 10:42:31 Main [App] >> Loaded targets to resync list.
(4) Aug18 10:42:31 Main [App] >> Initializing components...
(3) Aug18 10:42:31 Main [DGramLis] >> Listening for UDP datagrams: Port 8008
(3) Aug18 10:42:31 Main [StreamLis] >> Listening for TCP connections: Port 8008
(4) Aug18 10:42:31 Main [App] >> Components initialized.
(1) Aug18 10:42:31 Main [App] >> Version: 2015.03-r17
(2) Aug18 10:42:31 Main [App] >> LocalNode: beegfs-mgmtd luna [ID: 1]
(2) Aug18 10:42:31 Main [App] >> Usable NICs: ib0(TCP) eth0(TCP) 
(4) Aug18 10:42:31 Main [App] >> Extended list of usable NICs: 
+ ib0[ip addr: 10.13.37.3; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 4; type: TCP] 
+ eth0[ip addr: 129.206.148.155; hw addr: 0c.c4.7a.6b.ce.32; metric: 0; bandwidth: 2; type: TCP] 
(4) Aug18 10:42:31 Main [App] >> Detaching process...
(4) Aug18 10:42:31 Main [App] >> Starting up components...
(3) Aug18 10:42:31 HBeatMgr [HBeatMgr] >> Notifying stored nodes...
(4) Aug18 10:42:31 Worker1 [Worker1] >> Ready (TID: 9357; WorkType: 1)
(4) Aug18 10:42:31 Worker2 [Worker2] >> Ready (TID: 9358; WorkType: 1)
(4) Aug18 10:42:31 Worker3 [Worker3] >> Ready (TID: 9359; WorkType: 1)
(4) Aug18 10:42:31 Main [App] >> Components running.
(4) Aug18 10:42:31 Main [App] >> Joining component threads...
(4) Aug18 10:42:31 Worker4 [Worker4] >> Ready (TID: 9360; WorkType: 1)
(4) Aug18 10:42:31 DirectWorker1 [DirectWorker1] >> Ready (TID: 9361; WorkType: 0)
(3) Aug18 10:42:31 HBeatMgr [HBeatMgr] >> Init complete.
(2) Aug18 10:42:36 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 1; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 2; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 1; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 2; Pool: Emergency; Reason: No capacity report received (yet).
(4) Aug18 10:42:39 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.2:59419 [SockFD: 12]
(4) Aug18 10:42:41 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.2:59420 [SockFD: 14]
(2) Aug18 10:42:41 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 2
(2) Aug18 10:42:42 Worker2 [Change consistency states] >> Metadata node is coming online. ID: 2
(4) Aug18 10:42:42 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.1:50143 [SockFD: 16]
(2) Aug18 10:42:42 Worker2 [Change consistency states] >> Storage target is coming online. ID: 1
(2) Aug18 10:44:11 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 2; Pool: Normal; Reason: Free capacity threshold
(2) Aug18 10:44:11 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 1; Pool: Normal; Reason: Free capacity threshold
(2) Aug18 10:44:11 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 2; Pool: Normal; Reason: Free capacity threshold
(4) Aug18 10:45:20 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.1:50144 [SockFD: 18]
(4) Aug18 10:45:20 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.3:43997 [SockFD: 20]

The meta server seems to connect and the connection is accepted, but it is not added as a metadata server. The non-added server is also the root metadata server. I ran filesystem checks on the metadata RAID array and everything looks fine. Is it possible to get even more debug output than the log level setting sketched below provides? It does not seem to be a network problem; the same issue shows up when using the non-IB network. The storage services run on the same machines as the metadata services, and they register fine.
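
For reference, here is how I understand more verbose logging can be enabled, as a sketch assuming the standard config file location (/etc/beegfs/beegfs-meta.conf) and a service restart afterwards; the numbers in parentheses at the start of the log lines above appear to be these message levels:

# /etc/beegfs/beegfs-meta.conf (assumed default path)
# logLevel ranges from 0 (errors only) to 5 (most verbose);
# the default is 3, as far as I know.
logLevel = 5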

Also, the beegfs-ctl output looks fine:
beegfs-ctl --listnodes --nodetype=meta --details
hydra [ID: 1]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
charon [ID: 2]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 

Number of nodes: 2
Root: 1


But when I try to run it with --route:

 beegfs-ctl --listnodes --nodetype=meta --details --route
hydra [ID: 1]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
^C

It gets stuck when trying to figure out the routing.

The same command works flawlessly for the storage servers:

hydra [ID: 1]
   Ports: UDP: 8003; TCP: 8003
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
   Route: 10.13.37.1:8003 (protocol: RDMA)
charon [ID: 2]
   Ports: UDP: 8003; TCP: 8003
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
   Route: 10.13.37.2:8003 (protocol: RDMA)


Any idea what could be the problem here? Any help would be greatly appreciated.

Cheers,
Tobias

Tobias Jakobi

Aug 18, 2016, 10:51:42 AM
to beegfs-user
Hello again,

just to record this in case other users run into the same situation:

I was able to trace the error back to the session-cleanup function within the metadata service.

The cleanup process got stuck and never actually finished, resulting in a kind of endless loop.

I was able to fix this by moving the "sessions" file out of the metadata directory (note that you thereby lose any session information stored in it); see the sketch below.

During the next startup, the server registered and showed up as before.
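
For anyone repeating this, a minimal sketch of the steps, assuming the metadata directory from the log above (/data/beegfs/meta) and the init scripts shipped with the 2015.03 release; adjust both to your setup:

# Stop the metadata service before touching its files.
/etc/init.d/beegfs-meta stop

# Move the stale sessions file aside rather than deleting it,
# so it can still be inspected or restored later.
mv /data/beegfs/meta/sessions /data/beegfs/meta/sessions.broken

# On restart, the service recreates its session state and should
# register with the mgmt daemon again.
/etc/init.d/beegfs-meta start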

Cheers,
Tobias