Metadata service crashes weekly after upgrade from 7.2 -> 7.4.2

Sean Brisbane

Jun 11, 2024, 2:07:02 PM
to beegfs-user
Hi

We have recently experienced 2 crashes in the metadata service separated by 1 week after upgrading the estate to 7.4.2.
We run 2 metadata services on different servers, and each has crashed once.
The error message reports a similar but not identical file name each time. Neither file currently exists on the file system.

I suspected something to do with hard links. All 8000 files on the system with names similar to those in the error messages (kmers_raw*) have a hard link count of 1. We have not checked for or migrated "old style" hard links, in case that turns out to be relevant.
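
For anyone repeating the hard-link check, a minimal Python sketch follows. It assumes the file system is mounted on a client and walks it with plain `os.lstat`; the function name `find_multiply_linked` and the mount path in the usage note are illustrative and not part of BeeGFS. It only reports the link count that `stat` returns, so it cannot say whether "old style" hard links are present.

```python
import fnmatch
import os
import stat


def find_multiply_linked(root, pattern="kmers_raw*"):
    """Yield (path, link_count) for regular files under root whose
    name matches pattern and whose link count is greater than 1."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # The file vanished between the directory listing and the stat.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                yield path, st.st_nlink
```

Called as `list(find_multiply_linked("/mnt/beegfs"))` (the path is an assumption), an empty list would match the observation that every `kmers_raw*` file has a link count of 1.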

The error messages from the metadata servers are below my signature. Has this been seen before? Any hints on how we might debug this would be greatly appreciated.

Thanks
Sean



(2) May30 07:45:57 Worker7 [FileInode (store updated Inode)] >> Failed to write inlined inode: parentID: 44-665814BE-1 entryID: 1D-665816F4-1 fileName: kmers_raw_LH11Hc.0 Error: Internal error
(0) May30 14:07:16 Worker24 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) May30 14:07:16 Worker24 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x73fef7]
2: /lib64/libc.so.6(+0x54df0) [0x7f8cf6654df0]
3: /opt/beegfs/sbin/beegfs-meta(_ZN14MsgHelperClose9closeFileE9NumericIDIj12NumNodeIDTagERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEP9EntryInfoijPbSD_PSt6vectorI18DynamicFileAttribsSaISF_EEP18MirroredTimestamps+0x215) [0x58ad65]
4: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x2d4) [0x5cd174]
5: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x524) [0x5cfb64]
6: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x180) [0x74f900]
7: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x146) [0x749c16]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x58) [0x74a2d8]
9: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x11c) [0x4f6d8c]
10: /lib64/libc.so.6(+0x9f802) [0x7f8cf669f802]
11: /lib64/libc.so.6(+0x3f450) [0x7f8cf663f450]
(0) May30 14:07:16 Worker24 [App (component exception handler)] >> The component [Worker24] encountered an unrecoverable error. [SysErr: No such file or directory] Exception message: Segmentation fault
(2) May30 14:07:16 Worker24 [App (component exception handler)] >> Shutting down...
(3) May30 14:07:17 Main [App] >> Stored 2 sessions and 0 mirrored sessions


###############################################################


(2) Jun04 23:24:26 Worker18 [FileInode (store updated Inode)] >> Failed to write inlined inode: parentID: 0-665F92B0-1 entryID: F4-665F928A-2 fileName: kmers_raw_Y1YsA9.6 Error: Internal error
(0) Jun05 10:17:03 Worker7 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Jun05 10:17:03 Worker7 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x73fef7]
2: /lib64/libc.so.6(+0x54df0) [0x7f4351054df0]
3: /opt/beegfs/sbin/beegfs-meta(_ZN14MsgHelperClose24closeChunkFileSequentialE9NumericIDIj12NumNodeIDTagERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiR9FileInodeP9EntryInfojPSt6vectorI18DynamicFileAttribsSaISG_EE+0xb5) [0x587e85]
4: /opt/beegfs/sbin/beegfs-meta(_ZN14MsgHelperClose9closeFileE9NumericIDIj12NumNodeIDTagERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEP9EntryInfoijPbSD_PSt6vectorI18DynamicFileAttribsSaISF_EEP18MirroredTimestamps+0x240) [0x58ad90]
5: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x2d4) [0x5cd174]
6: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x524) [0x5cfb64]
7: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x180) [0x74f900]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x146) [0x749c16]
9: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x58) [0x74a2d8]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x11c) [0x4f6d8c]
11: /lib64/libc.so.6(+0x9f802) [0x7f435109f802]
12: /lib64/libc.so.6(+0x3f450) [0x7f435103f450]
(0) Jun05 10:17:03 Worker7 [App (component exception handler)] >> The component [Worker7] encountered an unrecoverable error. [SysErr: No such file or directory] Exception message: Segmentation fault
(2) Jun05 10:17:03 Worker7 [App (component exception handler)] >> Shutting down...
(3) Jun05 10:17:04 Main [App] >> Stored 7 sessions and 0 mirrored sessions
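
The two traces above share every frame from `_ZN14CloseFileMsgEx16closeFilePrimary...` (+0x2d4) downwards and differ only in the frames inside `MsgHelperClose`. A small Python sketch for lining up frames from two crashes is given here. It assumes each frame sits on a single line in the `module(symbol+offset) [address]` layout seen in the logs; the names `parse_frames` and `common_frames` are made up for this example.

```python
import os
import re

# Matches lines such as:
#   4: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x58) [0x74a2d8]
#   2: /lib64/libc.so.6(+0x54df0) [0x7f8cf6654df0]
FRAME_RE = re.compile(
    r"^\s*\d+:\s+(?P<module>\S+?)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s*\[0x[0-9a-fA-F]+\]"
)


def parse_frames(text):
    """Return a list of (module basename, mangled symbol, offset) tuples.

    Lines that are not backtrace frames are ignored. The symbol is an
    empty string for frames that only carry a module-relative offset.
    """
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            module = os.path.basename(m.group("module"))
            frames.append((module, m.group("symbol"), m.group("offset")))
    return frames


def common_frames(first, second):
    """Return the frames of `first` that also occur in `second`, in order."""
    shared = set(first) & set(second)
    return [frame for frame in first if frame in shared]
```

Absolute addresses are left out of the comparison on purpose, since the library load addresses differ between the two runs while the symbol and offset stay the same.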

Jure Pečar

Jun 11, 2024, 2:24:45 PM
to fhgfs...@googlegroups.com
On Tue, 11 Jun 2024 01:21:12 -0700 (PDT)
Sean Brisbane <seancb...@gmail.com> wrote:

> Hi
>
> We have recently experienced 2 crashes in the metadata service separated
> by 1 week after upgrading the estate to 7.4.2.

Upgrade to 7.4.3.


--

Jure Pečar
https://f5j.eu

mike g

Aug 14, 2024, 8:18:19 PM
to beegfs-user
Greetings,
I'm seeing similar crashes on 2 separate clusters that I recently installed, both running 7.4.3 (Ubuntu 22.04.04/5.15.0-107-generic). The underlying filesystem is ZFS in a RAIDZ2 configuration. Here's the info from a couple of crashes on one of the clusters.

Anyone have any suggestions?

Thanks!
-Mike

Episode from a week ago on ID 1:
(0) Aug08 15:40:27 Worker126 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Aug08 15:40:27 Worker126 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x55be16470c85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f30dbe38520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f30dbe90e24]
4: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x55be161f23b0]
5: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x55be1630f214]
6: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x55be16311820]
7: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x55be16441428]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x55be1644e386]
9: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x55be1644eadc]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x55be161edbce]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f30dbe8aac3]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f30dbf1c850]
(0) Aug08 15:40:27 Worker126 [PThread.cpp:135] >> Received a SIGABRT. Trying to shut down...
(1) Aug08 15:40:27 Worker126 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x55be16470c85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f30dbe38520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f30dbe8c9fc]
4: /lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f30dbe38476]
5: /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f30dbe1e7f3]
6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7f30dc0e3b9e]
7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f30dc0ef20c]
8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9) [0x7f30dc0ee1e9]
9: /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99) [0x7f30dc0ee959]
10: /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884) [0x7f30dc037884]
11: /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311) [0x7f30dc037f41]
12: /lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b) [0x7f30dc0ef4cb]
13: /opt/beegfs/sbin/beegfs-meta(+0xfcb0b) [0x55be161ccb0b]
14: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f30dbe38520]
15: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f30dbe90e24]
16: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x55be161f23b0]
17: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x55be1630f214]
18: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x55be16311820]
19: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x55be16441428]
20: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x55be1644e386]
21: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x55be1644eadc]
22: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x55be161edbce]
23: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f30dbe8aac3]
24: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f30dbf1c850]
(2) Aug08 15:40:29 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker126

Episode last night on ID 2:
(0) Aug13 21:54:10 Worker17 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Aug13 21:54:10 Worker17 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x557179ddec85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f1b02e49520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f1b02ea1e24]
4: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x557179b603b0]
5: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x557179c7d214]
6: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x557179c7f820]
7: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x557179daf428]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x557179dbc386]
9: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x557179dbcadc]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x557179b5bbce]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1b02e9bac3]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1b02f2d850]
(0) Aug13 21:54:10 Worker17 [PThread.cpp:135] >> Received a SIGABRT. Trying to shut down...
(1) Aug13 21:54:10 Worker17 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x557179ddec85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f1b02e49520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f1b02e9d9fc]
4: /lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f1b02e49476]
5: /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f1b02e2f7f3]
6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7f1b030f4b9e]
7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f1b0310020c]
8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9) [0x7f1b030ff1e9]
9: /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99) [0x7f1b030ff959]
10: /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884) [0x7f1b03048884]
11: /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311) [0x7f1b03048f41]
12: /lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b) [0x7f1b031004cb]
13: /opt/beegfs/sbin/beegfs-meta(+0xfcb0b) [0x557179b3ab0b]
14: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f1b02e49520]
15: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f1b02ea1e24]
16: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x557179b603b0]
17: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x557179c7d214]
18: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x557179c7f820]
19: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x557179daf428]
20: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x557179dbc386]
21: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x557179dbcadc]
22: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x557179b5bbce]
23: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1b02e9bac3]
24: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1b02f2d850]
(2) Aug13 21:54:12 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker17

mike g

Aug 20, 2024, 7:19:10 PM
to beegfs-user
I observed another episode of this on a cluster earlier today. These machines have been running 7.4.3 since they were deployed several months ago. If anybody has any suggestions, they're greatly appreciated.

Thanks!
-Mike
 
output from meta ID 1:

(0) Aug20 11:13:48 Worker82 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Aug20 11:13:48 Worker82 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x55b7cd993c85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f0e8e56d520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f0e8e5c5e24]
4: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x55b7cd7153b0]
5: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x55b7cd832214]
6: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x55b7cd834820]
7: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x55b7cd964428]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x55b7cd971386]
9: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x55b7cd971adc]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x55b7cd710bce]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f0e8e5bfac3]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f0e8e651850]
(0) Aug20 11:13:48 Worker82 [PThread.cpp:135] >> Received a SIGABRT. Trying to shut down...
(1) Aug20 11:13:48 Worker82 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x55b7cd993c85]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f0e8e56d520]
3: /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f0e8e5c19fc]
4: /lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f0e8e56d476]
5: /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f0e8e5537f3]
6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7f0e8e818b9e]
7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f0e8e82420c]
8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9) [0x7f0e8e8231e9]
9: /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99) [0x7f0e8e823959]
10: /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884) [0x7f0e8e76c884]
11: /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311) [0x7f0e8e76cf41]
12: /lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b) [0x7f0e8e8244cb]
13: /opt/beegfs/sbin/beegfs-meta(+0xfcb0b) [0x55b7cd6efb0b]
14: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f0e8e56d520]
15: /lib/x86_64-linux-gnu/libc.so.6(pthread_rwlock_tryrdlock+0x4) [0x7f0e8e5c5e24]
16: /opt/beegfs/sbin/beegfs-meta(_ZN6RWLock8readLockEv+0x30) [0x55b7cd7153b0]
17: /opt/beegfs/sbin/beegfs-meta(_ZN14CloseFileMsgEx16closeFilePrimaryERN10NetMessage15ResponseContextE+0x174) [0x55b7cd832214]
18: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI12CloseFileMsg10FileIDLockE15processIncomingERN10NetMessage15ResponseContextE+0x570) [0x55b7cd834820]
19: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x55b7cd964428]
20: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x55b7cd971386]
21: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x55b7cd971adc]
22: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x55b7cd710bce]
23: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f0e8e5bfac3]
24: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f0e8e651850]
(2) Aug20 11:13:50 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker82
