Manual resync conditions following a crash?

William Scullin

Dec 12, 2024, 12:43:52 PM
to fhgfs...@googlegroups.com
Last night the server running beegfs-mgmtd in our two-node cluster
panicked, followed by a severe hang of its partner that led to a
forced reboot this morning. Based on monitoring, the file system had
become very full - 95% when monitoring cut out. All server file
systems and state appear normal at this time, but it appears the
crash, and potentially the double hang, have left the metadata in a
bad place. This is an ancient version (7.1.4), and we're trying to
figure out how to safely proceed from here.

Any advice or guidance would be appreciated.

- William

[root@hpcfast1 ~]# beegfs-df
METADATA SERVERS:
TargetID   Cap. Pool        Total         Free    %      ITotal       IFree    %
======== =========        =====         ====    =      ======       =====    =
    9000 emergency    3724.5GiB    3679.8GiB  99%     3726.0M     3669.5M  98%
   18000 emergency    3724.5GiB    3705.7GiB  99%     3726.0M     3686.0M  99%

STORAGE TARGETS:
TargetID   Cap. Pool        Total         Free    %      ITotal       IFree    %
======== =========        =====         ====    =      ======       =====    =
      11    normal   93132.4GiB    7766.4GiB   8%    15566.5M    15532.8M 100%
      31    normal   93132.4GiB    7756.1GiB   8%    15545.8M    15512.1M 100%

[root@hpcfast1 ~]# beegfs-ctl --listtargets --nodetype=meta --state
TargetID     Reachability  Consistency      NodeID
========     ============  ===========      ======
    9000           Online Needs-resync        9000
   18000           Online Needs-resync       18000
[root@lle-prod-hpcfast1 ~]# beegfs-ctl --listtargets --nodetype=storage --state
TargetID     Reachability  Consistency      NodeID
========     ============  ===========      ======
      11           Online         Good         100
      31           Online         Good         300

Management
==========
hpcfast1 [ID: 1]: reachable at 192.168.22.225:8008 (protocol: TCP)

Metadata
==========
hpcfast1 [ID: 9000]: reachable at 192.168.22.225:8005 (protocol: RDMA)
hpcfast2 [ID: 18000]: reachable at 192.168.22.226:8005 (protocol: TCP)

Storage
==========
fast1-inst01 [ID: 100]: reachable at 192.168.22.225:8003 (protocol: RDMA)
fast2-inst01 [ID: 300]: reachable at 192.168.22.226:8003 (protocol: TCP)

Joshua Baker-LePain

Dec 12, 2024, 3:26:25 PM
to fhgfs...@googlegroups.com
So it would seem that you're using metadata mirroring, and both
targets are in state "Needs-resync". To get back up and running,
you'll need to force one of those metadata targets into the "Good"
state. The question is, of course, which one. It should be the
target that was most recently the primary while data was flowing.
That *may* be the target that's currently the primary, or things may
have failed over during the mgmtd issues. So you'll need to look
through your logs and determine via timestamps whether any failovers
occurred and whether they were successful. Once you figure out which
target to mark good, you can do so via "beegfs-ctl --setstate
--force". Obviously, be very, very careful with that command. If
you've got a support contract, I'd definitely ask ThinkParq for help.
Have I mentioned please be careful?
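
Roughly, I'd expect the sequence to look something like this (untested
on my end, and option spellings may differ on 7.1.x, so check
'beegfs-ctl --help' first; target ID 9000 below is just an example
pulled from your listing):

  # which meta target is currently flagged as primary in the buddy group
  beegfs-ctl --listmirrorbuddygroups --nodetype=meta

  # look for failover/resync messages around the time of the crash
  grep -iE 'buddy|resync|switchover|primary' /var/log/beegfs-meta.log

  # only once you're confident which copy is the most recent one:
  beegfs-ctl --setstate --nodetype=meta --targetid=9000 --state=good --force

(Depending on the version, setstate may want --nodeid instead of
--targetid for metadata targets.)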

Good luck.



--
Joshua Baker-LePain

William Scullin

Dec 12, 2024, 4:04:41 PM
to fhgfs...@googlegroups.com
Is there any way to assess the level of risk if I choose poorly? Would the resync potentially only damage or lose files that were open, or would there be a risk of widespread damage?

Likewise, is it a good or bad idea to follow up with a beegfs-fsck? 

Joshua Baker-LePain

Dec 13, 2024, 2:23:04 PM
to fhgfs...@googlegroups.com
As I understand it, if you pick the "wrong" target to "setstate" to
good, then the only risk should be to files that were open and/or
modified while the actual primary was up and the secondary (which you
have, in this scenario, mistakenly made the primary) was down. But,
again, this isn't something I've tested. And I have mixed experience
with beegfs-fsck. It certainly wouldn't hurt to do a read-only run,
though.
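
Something along these lines should be non-destructive, assuming your
7.1.x fsck supports the read-only flag (check 'beegfs-fsck --help'):

  # read-only consistency check; reports problems without repairing anything
  beegfs-fsck --checkfs --readOnly

Then you can decide from the report whether attempting repairs is
worth the risk.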



--
Joshua Baker-LePain