meta server hangs too long

安小默

Apr 24, 2023, 4:14:45 AM
to beegfs-user
Hi everyone, this is my BeeGFS cluster info:
mgmt_nodes
=============
master01 [ID: 1]

meta_nodes
=============
master01 [ID: 201]
master02 [ID: 202]

storage_nodes
=============
master01 [ID: 201]

master02 [ID: 202]
.....more (each node has only one storage target)

And I have configured a buddy mirror group:
     BuddyGroupID   PrimaryNodeID   SecondaryNodeID
     ============   =============   ===============
                1             201               202


A storage node in the cluster went down for some reason, and I have some questions:
1. What caused the meta service on node 201 to start a resync job?
2. While the resync job was running, all client nodes reported "*du(29474) [Messaging (RPC)] >> Receive timed out from beegfs-meta master01 [ID: 201]". This went on overnight for more than 10 hours; why did it take so long?

Here is more service info:
beegfs-mgmtd log:
(2) Apr12 20:09:05 HBeatMgr [HeartbeatManager.cpp:171] >> Node is not responding and will be removed. node: beegfs-client 1DA82-6414AEFE-node03 [ID: 96]; remaining nodes: 9
(2) Apr12 20:20:17 DGramLis [Node registration] >> New node: beegfs-client 1DA82-6414AEFE-node03 [ID: 96]; Source: x.x.x.x
(2) Apr12 21:57:24 HBeatMgr [HeartbeatManager.cpp:171] >> Node is not responding and will be removed. node: beegfs-client 6EF0-6414AEDF-node01 [ID: 94]; remaining nodes: 8
(2) Apr12 21:58:29 DGramLis [Node registration] >> New node: beegfs-client 6EF0-6414AEDF-node01 [ID: 94]; Source: x.x.x.x
.....more clients not responding and then re-registering

beegfs-meta on master01:
(2) Apr12 18:46:17 CommSlave47 [Stat chunk file work] >> Communication with storage target failed. TargetID: 2004; EntryID: 4-643683BB-C9
(2) Apr12 18:46:17 Worker20 [Stat Helper (refresh chunk files)] >> Problems occurred during file attribs refresh. entryID: 4-643683BB-C9
(2) Apr12 18:46:22 XNodeSync [BuddyCommTk.cpp:206] >> Resync job currently running. Buddy node ID: 202
(2) Apr12 18:46:52 XNodeSync [BuddyCommTk.cpp:206] >> Resync job currently running. Buddy node ID: 202


beegfs-meta on master02:
(2) Apr12 18:06:11 CommSlave9 [Trunc chunk file work] >> Communication with storage target failed. TargetID: 2004; EntryID: 522-61825DD6-C9
(2) Apr12 18:06:11 DirectWorker1 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 522-61825DD6-C9
(1) Apr13 10:11:42 Main [App::signalHandler] >> Received a SIGTERM. Shutting down...
(2) Apr13 10:11:44 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker13


Any response would be greatly appreciated.

安小默

Apr 24, 2023, 4:20:15 AM
to beegfs-user
Sorry, my English is not good; the original title said "too lang", which should be "too long".

Byron Shen

Apr 25, 2023, 1:18:23 AM
to beegfs-user
A resync job is started when the consistency state of the secondary meta is Bad or Need-resync. That state change is usually caused by a failure of the secondary meta, which stops synchronization between the primary and the secondary.
You can use beegfs-ctl to examine the state of the meta nodes. If neither the primary nor the secondary is Good, the meta service will not respond to clients' requests, and the clients keep retrying until they reach their timeout.
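For example, something along these lines (a rough sketch; the exact flags may vary a bit between versions):

    # list meta nodes with their reachability and consistency states
    beegfs-ctl --listtargets --nodetype=meta --state

    # show the meta buddy mirror group (primary/secondary)
    beegfs-ctl --listmirrorgroups --nodetype=meta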

xiaomo An

Apr 25, 2023, 7:09:53 AM
to beegfs-user
Thank you very much for the reply.

I used a command like "beegfs-ctl --listtargets --nodetype=meta --state" to check the status of the meta services, and it seems that the meta on master01 is Good:
TargetID   Reachability   Consistency   NodeID
========   ============   ===========   ======
     201   Online         Good             201
     202   Online         Need-resync      202
  
I have also checked with telnet/nc that TCP port 8005 is reachable between the two nodes in both directions.
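Roughly like this (paraphrased; 8005 is the meta TCP port in my configuration):

    # from master01, check the meta port on master02 (and the other way around)
    nc -zv master02 8005
    telnet master02 8005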

Additionally, I got some logs from the meta on master01 from before the resync job, like this:
(2) Apr12 18:14:19 CommSlave55 [MessagingTk.cpp:445] >> Unable to connect, is the node offline? node: beegfs-storage node01 [ID: 204]; Message type: GetChunkFileAttribs (2017)
(2) Apr12 18:14:19 CommSlave55 [Stat chunk file work] >> Communication with storage target failed. TargetID: 2004; EntryID: 17-639FDADD-C9
(2) Apr12 18:14:19 Worker32 [Stat Helper (refresh chunk files)] >> Problems occurred during file attribs refresh. entryID: 17-639FDADD-C9
(2) Apr12 18:14:19 XNodeSync [BuddyCommTk.cpp:206] >> Resync job currently running. Buddy node ID: 202
(2) Apr12 18:14:20 Worker29 [Close Helper (close chunk files S)] >> Communication with storage target failed: 2004; FileHandle: 60ECBDD2#4EB-61612DA7-C9; Error: Communication error
(2) Apr12 18:14:20 Worker29 [Close Helper (close chunk files S)] >> Problems occurred during close of chunk files. FileHandle: 60ECBDD2#4EB-61612DA7-C9
(2) Apr12 18:14:23 CommSlave23 [Trunc chunk file work] >> Communication with storage target failed. TargetID: 2004; EntryID: 0-6436846F-C9
(2) Apr12 18:14:23 Worker27 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 0-6436846F-C9
.........more problem chunk files
(2) Apr12 18:46:17 CommSlave47 [Stat chunk file work] >> Communication with storage target failed. TargetID: 2004; EntryID: 4-643683BB-C9
(2) Apr12 18:46:17 Worker20 [Stat Helper (refresh chunk files)] >> Problems occurred during file attribs refresh. entryID: 4-643683BB-C9                --------------- the management log shows the node01 storage reconnected at 18:44
(2) Apr12 18:46:22 XNodeSync [BuddyCommTk.cpp:206] >> Resync job currently running. Buddy node ID: 202
(2) Apr12 18:46:52 XNodeSync [BuddyCommTk.cpp:206] >> Resync job currently running. Buddy node ID: 202
.....until Apr 13 09:00 


In my opinion, even if metadata synchronization was required, the amount of data to synchronize is small and it should have completed quickly.
In fact, I disabled the services on the other nodes and then restarted the nodes, restarting master01 last, and after its meta service started I manually set its status to Good; the automatic synchronization then completed quickly.
Does this mean that the meta on master01 had also failed but did not update its status before I rebooted? But I did not see anything suspicious in "systemctl status beegfs-meta" or in /var/log/beegfs-meta.log.
If the meta service had crashed, I should not have kept receiving "Resync job currently running" log messages all night; if the meta service was normal, I should not have received "Receive timed out from beegfs-meta master01" on the other clients.
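For reference, the manual state change I mentioned above was done with a command roughly like this (a sketch; <nodeID> stands for the meta node that was not reported as Good, and --force is needed to override the state):

    beegfs-ctl --setstate --nodetype=meta --nodeid=<nodeID> --state=good --force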

Do you have any more ideas? Thanks.

Byron Shen

Apr 25, 2023, 10:41:18 AM
to beegfs-user
So the timeline was (if I understood it correctly): a storage node died -> meta 202 became Need-resync for some reason -> meta 202 resynced all night and did not respond -> meta 201 was restarted and its status set to Good -> resync done.

If the Need-resync state was set by the primary (not likely, though), it would be logged in the primary meta log, and the management service should log this state change too.

As for the long resync time, you should check whether the resync job failed and was restarted repeatedly, or whether it simply got stuck. The former would suggest some metadata corruption, which is probably not the case here because manually restarting worked. The latter is hard to troubleshoot. If it is reproducible, you could use beegfs-ctl --resyncstats to show progress, and use gdb to determine where the code hangs, maybe an unexpected infinite loop, a deadlock, or something similar. A deadlock in the resync job could make the meta unresponsive, because the resync thread locks the metadata against writes by the worker threads to prevent modification.
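Roughly along these lines (a sketch; I am assuming the meta buddy group ID is 1, as in your listing, and that gdb is installed on the meta server):

    # show resync progress for the meta buddy mirror group
    beegfs-ctl --resyncstats --nodetype=meta --mirrorgroupid=1

    # dump backtraces of all beegfs-meta threads to see where it hangs
    gdb -p $(pidof beegfs-meta) -batch -ex "thread apply all bt"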

By the way, what version are you using? Newer versions like 7.2.4 have resync bugfixes. If it is an old version, you should upgrade.
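If you are not sure which version is installed, the package manager should show it (adjust for your distribution; this assumes a standard package-based install):

    rpm -qa | grep beegfs      # RPM-based systems
    dpkg -l | grep beegfs      # DEB-based systems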

xiaomo An

Apr 26, 2023, 4:06:26 AM
to beegfs-user
Yeah, the timeline is correct. Using "beegfs-ctl --resyncstats" is a good idea; I will try it next time.
You mentioned in your reply that "newer versions like 7.2.4 have resync bugfixes".
Could you provide the merge or commit info for this? If it applies to my case, I will upgrade as soon as possible.

Byron Shen

Apr 26, 2023, 10:12:27 AM
to beegfs-user
Version 7.2.5 fixes a bug where, when a storage target is unreachable, the secondary meta fails to truncate the corresponding chunk files, which triggers a resync. (Release Notes, commit)

xiaomo An

Apr 27, 2023, 1:48:38 AM
to beegfs-user
Thank you very much. Best wishes.