beegfs-meta does not register and shows up as dead


Tobias Jakobi

Aug 18, 2016, 4:54:12 AM
to beegfs-user
Hello everyone,

Our building was hit by a power outage yesterday; the UPS batteries kicked in, and I cleanly shut the servers down since we did not know how long the power would be out.

However, when power came back, I restarted all machines and BeeGFS daemons, and now one of our two meta nodes no longer seems to register with the mgmt server.

Here is the log from the non-registering meta server (10.13.37.1):
(3) 10:45:20 Main [App] >> Root directory loaded.
(1) 10:45:20 Main [App] >> I got root (by possession of root directory)
(4) 10:45:20 Main [App] >> Disposal directory loaded.
(4) 10:45:20 Main [App] >> Detaching process...
(4) 10:45:20 Main [App] >> Initializing components...
(3) 10:45:20 Main [DGramLis] >> Listening for UDP datagrams: Port 8005
(3) 10:45:20 Main [ConnAccept] >> Listening for RDMA connections: Port 8005
(3) 10:45:20 Main [ConnAccept] >> Listening for TCP connections: Port 8005
(4) 10:45:20 Main [App] >> Components initialized.
(4) 10:45:20 Main [SessionStore (load)] >> load sessions from file: /data/beegfs/meta/sessions
(3) 10:45:20 Main [App] >> 4 sessions restored.
(1) 10:45:20 Main [App] >> Version: 2015.03-r16
(2) 10:45:20 Main [App] >> LocalNode: beegfs-meta hydra [ID: 1]
(2) 10:45:20 Main [App] >> Usable NICs: ib0(RDMA) ib0(TCP) eth0(TCP) 
(4) 10:45:20 Main [App] >> Extended list of usable NICs: 
+ ib0[ip addr: 10.13.37.1; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 5; type: RDMA] 
+ ib0[ip addr: 10.13.37.1; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 5; type: TCP] 
+ eth0[ip addr: 129.206.148.153; hw addr: 0c.c4.7a.6d.38.04; metric: 0; bandwidth: 2; type: TCP] 
(4) 10:45:20 Main [App] >> Starting up components...
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Waiting for management node...
(2) 10:45:20 DGramLis [Heartbeat incoming] >> New node: beegfs-mgmtd luna [ID: 1]; 
(3) 10:45:20 HBeatMgr [HBeatMgr] >> Management node found. Downloading node groups...
(4) 10:45:20 DGramLis [Heartbeat incoming] >> Number of nodes: Meta: 1; Storage: 0
(4) 10:45:20 HBeatMgr [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegfs...@10.13.37.3:8008
(3) 10:45:20 HBeatMgr [NodeConn (acquire stream)] >> Connected: beegfs...@10.13.37.3:8008 (protocol: TCP)
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Nodes added (sync results): 1 (Type: beegfs-meta)
(2) 10:45:20 HBeatMgr [HBeatMgr] >> Nodes added (sync results): 2 (Type: beegfs-storage)
(4) 10:45:20 HBeatMgr [HBeatMgr] >> Removing 4 client sessions. 
(3) 10:45:20 HBeatMgr [HBeatMgr] >> 503-57B45B79-atlas: Removing 12 file sessions. (0 are unremovable)

I am somehow missing the node-registration part here. The process is not dying; it shows up normally in ps. We are using 2015.03-r16 on all machines.


And here is the log from the mgmt server (10.13.37.3):
(3) Aug18 10:42:31 Main [App] >> Loaded metadata nodes: 2
(3) Aug18 10:42:31 Main [App] >> Loaded storage nodes: 2
(3) Aug18 10:42:31 Main [App] >> Loaded clients: 0
(4) Aug18 10:42:31 Main [App] >> Loaded target numeric ID mappings: 2
(3) Aug18 10:42:31 Main [App] >> Loaded target mappings: 2
(3) Aug18 10:42:31 Main [App] >> Loaded targets to resync list.
(4) Aug18 10:42:31 Main [App] >> Initializing components...
(3) Aug18 10:42:31 Main [DGramLis] >> Listening for UDP datagrams: Port 8008
(3) Aug18 10:42:31 Main [StreamLis] >> Listening for TCP connections: Port 8008
(4) Aug18 10:42:31 Main [App] >> Components initialized.
(1) Aug18 10:42:31 Main [App] >> Version: 2015.03-r17
(2) Aug18 10:42:31 Main [App] >> LocalNode: beegfs-mgmtd luna [ID: 1]
(2) Aug18 10:42:31 Main [App] >> Usable NICs: ib0(TCP) eth0(TCP) 
(4) Aug18 10:42:31 Main [App] >> Extended list of usable NICs: 
+ ib0[ip addr: 10.13.37.3; hw addr: 80.00.00.48.fe.80; metric: 0; bandwidth: 4; type: TCP] 
+ eth0[ip addr: 129.206.148.155; hw addr: 0c.c4.7a.6b.ce.32; metric: 0; bandwidth: 2; type: TCP] 
(4) Aug18 10:42:31 Main [App] >> Detaching process...
(4) Aug18 10:42:31 Main [App] >> Starting up components...
(3) Aug18 10:42:31 HBeatMgr [HBeatMgr] >> Notifying stored nodes...
(4) Aug18 10:42:31 Worker1 [Worker1] >> Ready (TID: 9357; WorkType: 1)
(4) Aug18 10:42:31 Worker2 [Worker2] >> Ready (TID: 9358; WorkType: 1)
(4) Aug18 10:42:31 Worker3 [Worker3] >> Ready (TID: 9359; WorkType: 1)
(4) Aug18 10:42:31 Main [App] >> Components running.
(4) Aug18 10:42:31 Main [App] >> Joining component threads...
(4) Aug18 10:42:31 Worker4 [Worker4] >> Ready (TID: 9360; WorkType: 1)
(4) Aug18 10:42:31 DirectWorker1 [DirectWorker1] >> Ready (TID: 9361; WorkType: 0)
(3) Aug18 10:42:31 HBeatMgr [HBeatMgr] >> Init complete.
(2) Aug18 10:42:36 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 1; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 2; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 1; Pool: Emergency; Reason: No capacity report received (yet).
(2) Aug18 10:42:36 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 2; Pool: Emergency; Reason: No capacity report received (yet).
(4) Aug18 10:42:39 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.2:59419 [SockFD: 12]
(4) Aug18 10:42:41 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.2:59420 [SockFD: 14]
(2) Aug18 10:42:41 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 2
(2) Aug18 10:42:42 Worker2 [Change consistency states] >> Metadata node is coming online. ID: 2
(4) Aug18 10:42:42 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.1:50143 [SockFD: 16]
(2) Aug18 10:42:42 Worker2 [Change consistency states] >> Storage target is coming online. ID: 1
(2) Aug18 10:44:11 XNodeSync [Assign node to capacity pool] >> Metadata node capacity pool assignment updated. NodeID: 2; Pool: Normal; Reason: Free capacity threshold
(2) Aug18 10:44:11 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 1; Pool: Normal; Reason: Free capacity threshold
(2) Aug18 10:44:11 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 2; Pool: Normal; Reason: Free capacity threshold
(4) Aug18 10:45:20 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.1:50144 [SockFD: 18]
(4) Aug18 10:45:20 StreamLis [StreamLis] >> Accepted new connection from 10.13.37.3:43997 [SockFD: 20]

The meta server seems to connect and the connection is accepted, but it is not added as a metadata server. The non-added server is also the root metadata server. I ran filesystem checks on the metadata RAID array and everything looks fine. Is it possible to get even more debug output than the log level setting sketched below provides? It does not seem to be a network problem; the same issue shows up when using the non-IB network. The storage services run on the same machines as the metadata services, and they register fine.
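
For reference, here is how I understand more verbose logging can be enabled, as a sketch assuming the standard config file location (/etc/beegfs/beegfs-meta.conf) and a service restart afterwards; the numbers in parentheses at the start of the log lines above appear to be these message levels:

# /etc/beegfs/beegfs-meta.conf (assumed default path)
# logLevel ranges from 0 (errors only) to 5 (most verbose);
# the default is 3, as far as I know.
logLevel = 5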

Also, the beegfs-ctl output looks fine:
beegfs-ctl --listnodes --nodetype=meta --details
hydra [ID: 1]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
charon [ID: 2]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 

Number of nodes: 2
Root: 1


But when I try to run it with --route:

 beegfs-ctl --listnodes --nodetype=meta --details --route
hydra [ID: 1]
   Ports: UDP: 8005; TCP: 8005
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
^C

It gets stuck when trying to figure out the routing.

The same command works flawlessly for the storage servers:

hydra [ID: 1]
   Ports: UDP: 8003; TCP: 8003
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
   Route: 10.13.37.1:8003 (protocol: RDMA)
charon [ID: 2]
   Ports: UDP: 8003; TCP: 8003
   Interfaces: ib0(RDMA) ib0(TCP) eth0(TCP) 
   Route: 10.13.37.2:8003 (protocol: RDMA)


Any idea what could be the problem here? Any help would be greatly appreciated.

Cheers,
Tobias

Tobias Jakobi

Aug 18, 2016, 10:51:42 AM
to beegfs-user
Hello again,

just to record this in case other users run into the same situation:

I was able to trace the error back to the session-cleanup function within the metadata service.

The cleanup process got stuck and never actually finished, resulting in a kind of endless loop.

I was able to fix this by moving the "sessions" file out of the metadata directory (note that you thereby lose any session information stored in it); see the sketch below.

During the next startup, the server registered and showed up as before.
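
For anyone repeating this, a minimal sketch of the steps, assuming the metadata directory from the log above (/data/beegfs/meta) and the init scripts shipped with the 2015.03 release; adjust both to your setup:

# Stop the metadata service before touching its files.
/etc/init.d/beegfs-meta stop

# Move the stale sessions file aside rather than deleting it,
# so it can still be inspected or restored later.
mv /data/beegfs/meta/sessions /data/beegfs/meta/sessions.broken

# On restart, the service recreates its session state and should
# register with the mgmt daemon again.
/etc/init.d/beegfs-meta start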

Cheers,
Tobias