BeeGFS meta service refuses to start

Jay Fink

Mar 25, 2024, 6:06:05 AM
to beegfs-user
Hello,

Yesterday our meta server wedged and had to be rebooted; it had gone offline and could not be logged into. After the reboot the meta service won't start. I tried increasing the FD limit to 100k, and I disconnected or powered down all clients to make sure a hanging connection wasn't the cause. Here is what the meta log says:

(3) Mar23 21:54:45 Main [App] >> Root directory loaded.
(1) Mar23 21:54:45 Main [App] >> Root metadata server (by possession of root directory): 1
(3) Mar23 21:54:45 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8005
(1) Mar23 21:54:45 Main [App] >> Waiting for beegfs-mgmtd@beegfs-meta1:8008...
(2) Mar23 21:54:45 RegDGramLis [Heartbeat incoming] >> New node: beegfs-mgmtd beegfs-meta1.dfci.harvard.edu [ID: 1]
(3) Mar23 21:54:45 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8005
(2) Mar23 21:54:45 Main [Register node] >> Node registration successful.
(3) Mar23 21:54:45 Main [NodeConn (acquire stream)] >> Connected: beegfs...@172.24.224.197:8008 (protocol: TCP)
(2) Mar23 21:54:45 Main [printSyncResults] >> Nodes added (sync results): 1 (Type: beegfs-meta)
(2) Mar23 21:54:45 Main [printSyncResults] >> Nodes added (sync results): 10 (Type: beegfs-storage)
(0) Mar23 21:54:45 Main [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Mar23 21:54:45 Main [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x755647]
2: /lib64/libc.so.6(+0x36280) [0x7f026de82280]
3: /opt/beegfs/sbin/beegfs-meta(_ZN18ExceededQuotaStore19updateExceededQuotaEPSt4listIjSaIjEE13QuotaDataType14QuotaLimitType+0x1e) [0x74be4e]
4: /opt/beegfs/sbin/beegfs-meta(_ZN15InternodeSyncer29downloadAllExceededQuotaListsESt10shared_ptrI11StoragePoolE+0x169) [0x4c3269]
5: /opt/beegfs/sbin/beegfs-meta(_ZN15InternodeSyncer29downloadAllExceededQuotaListsERKSt6vectorISt10shared_ptrI11StoragePoolESaIS3_EE+0xb2) [0x4c3c42]
6: /opt/beegfs/sbin/beegfs-meta(_ZN3App16downloadMgmtInfoER22TargetConsistencyState+0x1fa) [0x48771a]
7: /opt/beegfs/sbin/beegfs-meta(_ZN3App9runNormalEv+0x12f) [0x48c9ef]
8: /opt/beegfs/sbin/beegfs-meta(_ZN3App3runEv+0x52) [0x48cf72]
9: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0xfe) [0x481fee]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7Program4mainEiPPc+0x49) [0x47f169]
11: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f026de6e3d5]
12: /opt/beegfs/sbin/beegfs-meta() [0x4818e5]
(0) Mar23 21:54:45 Main [PThread.cpp:135] >> Received a SIGABRT. Trying to shut down...
(1) Mar23 21:54:45 Main [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x755647]
2: /lib64/libc.so.6(+0x36280) [0x7f026de82280]
3: /lib64/libc.so.6(gsignal+0x37) [0x7f026de82207]
4: /lib64/libc.so.6(abort+0x148) [0x7f026de838f8]
5: /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165) [0x7f026e9ad7d5]
6: /lib64/libstdc++.so.6(+0x5e746) [0x7f026e9ab746]
7: /lib64/libstdc++.so.6(+0x5d6f9) [0x7f026e9aa6f9]
8: /lib64/libstdc++.so.6(__gxx_personality_v0+0x564) [0x7f026e9ab364]
9: /lib64/libgcc_s.so.1(+0xf8a3) [0x7f026e4448a3]
10: /lib64/libgcc_s.so.1(_Unwind_RaiseException+0xfb) [0x7f026e444c3b]
11: /lib64/libstdc++.so.6(__cxa_throw+0x66) [0x7f026e9ab986]
12: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x296) [0x755896]
13: /lib64/libc.so.6(+0x36280) [0x7f026de82280]
14: /opt/beegfs/sbin/beegfs-meta(_ZN18ExceededQuotaStore19updateExceededQuotaEPSt4listIjSaIjEE13QuotaDataType14QuotaLimitType+0x1e) [0x74be4e]
15: /opt/beegfs/sbin/beegfs-meta(_ZN15InternodeSyncer29downloadAllExceededQuotaListsESt10shared_ptrI11StoragePoolE+0x169) [0x4c3269]
16: /opt/beegfs/sbin/beegfs-meta(_ZN15InternodeSyncer29downloadAllExceededQuotaListsERKSt6vectorISt10shared_ptrI11StoragePoolESaIS3_EE+0xb2) [0x4c3c42]
17: /opt/beegfs/sbin/beegfs-meta(_ZN3App16downloadMgmtInfoER22TargetConsistencyState+0x1fa) [0x48771a]
18: /opt/beegfs/sbin/beegfs-meta(_ZN3App9runNormalEv+0x12f) [0x48c9ef]
19: /opt/beegfs/sbin/beegfs-meta(_ZN3App3runEv+0x52) [0x48cf72]
20: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0xfe) [0x481fee]
21: /opt/beegfs/sbin/beegfs-meta(_ZN7Program4mainEiPPc+0x49) [0x47f169]
22: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f026de6e3d5]
23: /opt/beegfs/sbin/beegfs-meta() [0x4818e5]
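
For reference, the FD limit bump was along these lines (a sketch only, assuming the service runs under systemd; the override path and value are illustrative):

# /etc/systemd/system/beegfs-meta.service.d/limits.conf
[Service]
LimitNOFILE=100000

# then reload and retry
systemctl daemon-reload
systemctl restart beegfs-meta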

I contacted ThinkParQ (we have a support contract with them) but nothing back yet.

I suspect metadata corruption, but I am not sure how to fix it.

Waltar

Mar 25, 2024, 12:31:10 PM
to beegfs-user
Hello Jay,
What do "df" and "df -i" say on your metadata server?
I don't think you have corruption, as then you couldn't write metadata and therefore no data at all.

Jay Fink

Mar 25, 2024, 12:55:18 PM
to fhgfs...@googlegroups.com
Looks good:

[rcadmin@beegfs-meta1 ~]$ df
Filesystem                             1K-blocks      Used  Available Use% Mounted on
/dev/mapper/centos_beegfs--meta1-root   68775108   1133052   67642056   2% /
devtmpfs                               131920224         0  131920224   0% /dev
tmpfs                                  131932552         0  131932552   0% /dev/shm
tmpfs                                  131932552     10396  131922156   1% /run
tmpfs                                  131932552         0  131932552   0% /sys/fs/cgroup
/dev/md127                               1941504    188348    1753156  10% /boot
/dev/md125                                488120     11328     476792   3% /boot/efi
/dev/md0                              1464243216 213711848 1152852112  16% /mnt/mdt-1
/dev/mapper/centos_beegfs--meta1-home    9750528     33504    9717024   1% /home
/dev/mapper/centos_beegfs--meta1-var   121972544    703868  121268676   1% /var
tmpfs                                   26386512         0   26386512   0% /run/user/515
[rcadmin@beegfs-meta1 ~]$ df -i
Filesystem                               Inodes     IUsed     IFree IUse% Mounted on
/dev/mapper/centos_beegfs--meta1-root  34404352     32356  34371996    1% /
devtmpfs                               32980056       553  32979503    1% /dev
tmpfs                                  32983138         1  32983137    1% /dev/shm
tmpfs                                  32983138      1100  32982038    1% /run
tmpfs                                  32983138        16  32983122    1% /sys/fs/cgroup
/dev/md127                               975872        24    975848    1% /boot
/dev/md125                                    0         0         0     - /boot/efi
/dev/md0                              976650240 278055668 698594572   29% /mnt/mdt-1
/dev/mapper/centos_beegfs--meta1-home   4880384        53   4880331    1% /home
/dev/mapper/centos_beegfs--meta1-var   61016064      2328  61013736    1% /var
tmpfs                                  32983138         1  32983137    1% /run/user/515

Also no XFS errors or anything.
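
A quick way to double-check that, assuming /dev/md0 is the XFS metadata target (xfs_repair -n is read-only, but the filesystem has to be unmounted first):

dmesg | grep -i xfs
umount /mnt/mdt-1
xfs_repair -n /dev/md0    # no-modify check only
mount /mnt/mdt-1          # remount (assuming an fstab entry)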





--
Jay  Fink
Sr. HPC Engineer, Scientific Computing / Cloud Network Engineer
HPC Request Form and UserGuide/SLA Links:
https://informatics-analytics.dfci.harvard.edu/services

Waltar

Mar 26, 2024, 3:21:04 PM
to beegfs-user
Hello Jay,
Where is your metadata directory or filesystem, on /mnt/mdt-1?
What does "strace /opt/beegfs/sbin/beegfs-meta cfgFile=/etc/beegfs/beegfs-meta.conf runDaemonized=false" produce? (I assume you are using the default config path and file.)

Quentin Le Burel

Mar 26, 2024, 3:21:07 PM
to fhgfs...@googlegroups.com
I've seen some cases where beegfs-meta was crashing like that on startup, caused by a bug in the binary.
"yum update beegfs-meta" fixed it for me, but that means upgrading the BeeGFS packages in your cluster and carefully following the upgrade procedure described here: https://doc.beegfs.io/7.4.3/advanced_topics/upgrade.html#upgrade.
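
For example (a sketch only; check what the repo offers first and follow the upgrade guide before touching a production cluster):

yum --showduplicates list available beegfs-meta
yum update beegfs-meta
systemctl restart beegfs-meta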

Kind regards

Quentin

Jay Fink

Mar 26, 2024, 5:38:49 PM
to fhgfs...@googlegroups.com
It is an old version, 7.1.9.

First we lost all of our storagebuddymirrors info; I had to repopulate that by hand.

We managed to get it running again: we did a full stop of everything, then I repopulated the storagebuddymirrors file. Then we brought up mgmtd, waited about 10 minutes, and brought up storage (we are multi-mode too, so, great).
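
The bring-up order was roughly like this (a sketch, assuming the standard systemd unit names; our multi-mode instances use the @<instance> form):

# after a full stop of clients and services, and fixing the buddymirrors file:
systemctl start beegfs-mgmtd
# waited ~10 minutes
systemctl start beegfs-storage    # or beegfs-storage@<instance> for multi-mode
systemctl start beegfs-meta
systemctl start beegfs-client     # on the clients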

It looked okay; then we started getting these errors in what I thought were random locations:
ERROR: cannot read `somefile' (Remote I/O error)

Then I found these errors in the logs for the beegfs-meta@storage service:
(0) Mar26 14:26:29 CommSlave28 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 10
(0) Mar26 14:26:29 CommSlave47 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 12
(0) Mar26 14:26:29 CommSlave20 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 11
(0) Mar26 14:26:29 CommSlave25 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 10
(0) Mar26 14:26:29 CommSlave32 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 10

Closer examination started showing specific files:
(0) Mar26 15:09:34 CommSlave44 [Messaging (RPC target)] >> Invalid target mirror buddy group ID: 13
(2) Mar26 15:09:34 CommSlave45 [Close chunk file work] >> Communication with storage target failed. Mirror TargetID: 15; Session: 34; FileHandle: 577D8088#1D9-639A1304-B
(2) Mar26 15:09:34 CommSlave44 [Close chunk file work] >> Communication with storage target failed. Mirror TargetID: 13; Session: 34; FileHandle: 577D8088#1D9-639A1304-B
(2) Mar26 15:09:34 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 577D8088#1D9-639A1304-B
(3) Mar26 15:09:34 TimerWork/0 [Sync clients] >> closing file. ParentID: 41A-639A025C-B FileName: .bash_history

That is an example of one. So far, this is only affecting buddy groups 10-15, which we added last year.

I've contacted support but I won't hear from them until tomorrow morning. 

It looks like it could be one of a few problems:
- corruption, and there is nothing we can do but delete and replace files as we find them
- some kind of config issue we had just been getting away with
- something beegfs-fsck could fix?
- .... or maybe I need to force a resync on targets 10-15? (see the sketch below)
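
If it comes to a forced resync, I believe it would be something like this per group (assuming the beegfs-ctl resync options in 7.1.x; the IDs are illustrative):

beegfs-ctl --startresync --nodetype=storage --mirrorgroupid=10
beegfs-ctl --resyncstats --nodetype=storage --mirrorgroupid=10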

I was surprised that none of the storage buddy mirrors needed a resync when I brought it back. The meta server did - which makes sense - but not a single buddy mirror that I can see.

And I just looked .... it renumbered the MirrorGroupIDs
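
(Checked with something like this, assuming beegfs-ctl with the default client config:)

beegfs-ctl --listmirrorgroups --nodetype=storage
beegfs-ctl --listtargets --nodetype=storage --state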



Jay Fink

Mar 29, 2024, 11:21:50 AM
to fhgfs...@googlegroups.com
The buddymirrors file was incorrect; once we fixed it we are back up.

We have some leftover mirror group IDs in our storage pool, but they haven't been an issue.

We are planning on patching the binary with help from an engineer (they already did it; we just have to set up downtime in a few weeks), then we want to try to bump to 7.2.11 as we are still on CentOS 7.

After that, not sure, but we are all good now.
