Last night the server running beegfs-mgmtd in our two-node cluster panicked, followed by a severe hang of its partner that led to a forced reboot this morning. Based on monitoring, the file system had become very full - it was at 95% when monitoring cut out. All server file systems and state appear normal at this time, but it seems the double hang may have left the metadata in a bad place. This is an ancient version (7.1.4) and we're trying to figure out how to safely proceed from here.
Any advice or guidance would be appreciated.
- William
[root@hpcfast1 ~]# beegfs-df
METADATA SERVERS:
TargetID   Cap. Pool        Total         Free    %      ITotal       IFree    %
========   =========        =====         ====    =      ======       =====    =
    9000   emergency    3724.5GiB    3679.8GiB  99%     3726.0M     3669.5M  98%
   18000   emergency    3724.5GiB    3705.7GiB  99%     3726.0M     3686.0M  99%
STORAGE TARGETS:
TargetID   Cap. Pool        Total         Free    %      ITotal       IFree    %
========   =========        =====         ====    =      ======       =====    =
      11      normal   93132.4GiB    7766.4GiB   8%    15566.5M    15532.8M 100%
      31      normal   93132.4GiB    7756.1GiB   8%    15545.8M    15512.1M 100%
[root@hpcfast1 ~]# beegfs-ctl --listtargets --nodetype=meta --state
TargetID     Reachability  Consistency    NodeID
========     ============  ===========    ======
    9000           Online  Needs-resync     9000
   18000           Online  Needs-resync    18000
[root@lle-prod-hpcfast1 ~]# beegfs-ctl --listtargets --nodetype=storage --state
TargetID     Reachability  Consistency    NodeID
========     ============  ===========    ======
      11           Online          Good      100
      31           Online          Good      300
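
Both metadata targets report Needs-resync above. For context, this is roughly what we were planning to run to inspect the buddy mirror group and resync state before touching anything - just a sketch, and the mirror group ID of 1 below is an assumption, not confirmed from our setup:

# confirm the metadata buddy mirror group ID (assumed to be 1 in the next command)
beegfs-ctl --listmirrorgroups --nodetype=meta
# check resync status for the assumed group ID 1
beegfs-ctl --resyncstats --nodetype=meta --mirrorgroupid=1
# we have NOT started a manual resync; left commented out pending advice (target ID is only an example)
# beegfs-ctl --startresync --nodetype=meta --targetid=18000

Is it safe to kick off a resync with both meta targets flagged Needs-resync, or should we be checking something else first?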
Management
==========
hpcfast1 [ID: 1]: reachable at 192.168.22.225:8008 (protocol: TCP)
Metadata
==========
hpcfast1 [ID: 9000]: reachable at 192.168.22.225:8005 (protocol: RDMA)
hpcfast2 [ID: 18000]: reachable at 192.168.22.226:8005 (protocol: TCP)
Storage
==========
fast1-inst01 [ID: 100]: reachable at 192.168.22.225:8003 (protocol: RDMA)
fast2-inst01 [ID: 300]: reachable at 192.168.22.226:8003 (protocol: TCP)
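
We were also wondering whether a report-only consistency check is safe to run in this state before any resync, something along these lines (again only a sketch, assuming the read-only option is present in our 7.1.4 beegfs-utils):

# report-only check, no repairs attempted
beegfs-fsck --checkfs --readOnly

If that is a bad idea while the metadata is flagged Needs-resync, we would rather know before running it.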