beegfs-meta crashes with SIGFPE


James Burton

Dec 5, 2021, 10:21:09 PM12/5/21
to fhgfs...@googlegroups.com
Greetings,

We have had a problem with the beegfs-meta service repeatedly crashing with a SIGFPE error. It crashes about every 6 hours, then keeps crashing on restart for the next 5 minutes or so; after that it restarts cleanly and the filesystem stays up until the next crash. There are no indications of hardware problems in any of the system logs or messages.

The relevant portion of the beegfs-meta.log file is below.

(4) Dec05 21:10:02 TimerWork/0 [Sync clients] >> Removing 1 client sessions.
(4) Dec05 21:10:02 DirectWorker1 [SessionStore (ref)] >> Creating a new session. SessionID: 0
(4) Dec05 21:10:05 XNodeSync [InternodeSyncer.cpp:296] >> Downloading capacity pools. Pool type: Meta
(4) Dec05 21:10:05 XNodeSync [InternodeSyncer.cpp:296] >> Downloading capacity pools. Pool type: Meta buddies
(4) Dec05 21:10:21 ConnAccept [ConnAccept] >> Ignoring an internal event on the listening RDMA socket
(4) Dec05 21:10:21 ConnAccept [ConnAccept] >> Accepted new RDMA connection from 10.128.21.15:60329 [SockFD: 655]
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:376] >> Starting state update.
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:401] >> Beginning target state update...
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:756] >> Downloading target states and buddy groups
(0) Dec05 21:10:45 Worker76 [PThread.cpp:108] >> Received a SIGFPE. Trying to shut down...
(1) Dec05 21:10:45 Worker76 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x75a3e7]
2: /lib64/libc.so.6(+0x37400) [0x7ffb4e08a400]
3: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode15initFileInfoVecEv+0x33a) [0x659cba]
4: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInodeC2ESsP18FileInodeStoreData12DirEntryTypej+0x492) [0x65a262]
5: /opt/beegfs/sbin/beegfs-meta(_ZN8DirEntry15createInodeByIDERKSsP9EntryInfo+0x12d) [0x62f4ed]
6: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode22createFromInlinedInodeEP9EntryInfo+0x197) [0x655c17]
7: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode19createFromEntryInfoEP9EntryInfo+0xf) [0x65b5bf]
8: /opt/beegfs/sbin/beegfs-meta(_ZN14InodeFileStore4statEP9EntryInfobR8StatData+0x1da) [0x671aba]
9: /opt/beegfs/sbin/beegfs-meta(_ZN9MetaStore4statEP9EntryInfobR8StatDataP9NumericIDIj12NumNodeIDTagEPSs+0x108) [0x6787d8]
10: /opt/beegfs/sbin/beegfs-meta(_ZN13MsgHelperStat4statEP9EntryInfobjR8StatDataP9NumericIDIj12NumNodeIDTagEPSs+0x3a) [0x605e2a]
11: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx14executeLocallyERN10NetMessage15ResponseContextEb+0x663) [0x5ca813]
12: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI15LookupIntentMsgSt5tupleIJ9DirIDLock14ParentNameLock10FileIDLockEEE15processIncomingERN10NetMessage15ResponseContextE+0x47a) [0x5cdb8a]
13: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx15processIncomingERN10NetMessage15ResponseContextE+0xa0) [0x5cb060]
14: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x17d) [0x6e945d]
15: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x162) [0x6ef022]
16: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x4c) [0x6efd2c]
17: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0xfe) [0x4828fe]
18: /lib64/libpthread.so.0(+0x814a) [0x7ffb4e42014a]
19: /lib64/libc.so.6(clone+0x43) [0x7ffb4e14fdc3]
(4) Dec05 21:10:45 Worker1 [Worker1] >> Component stopped.
(4) Dec05 21:10:45 Worker2 [Worker2] >> Component stopped.

Does anyone have any ideas about what is causing this or how to fix it?

We are running BeeGFS 7.1.5 on CentOS 8.4.

Thanks,

Jim Burton
--
James Burton
OS and Storage Architect
Advanced Computing Infrastructure
Clemson University Computing and Information Technology
340 Computer Court
Anderson, SC 29625

James Burton

Dec 8, 2021, 9:13:26 AM12/8/21
to fhgfs...@googlegroups.com
*** Possible solution ***

The stack trace indicated that the program was failing in FileInode::initFileInfoVec(), which is located in meta/source/storage/FileInode.cpp in the BeeGFS source.

Looking at the source, it appeared the SIGFPE might have been caused by garbage in the stripe pattern. So I added a debug statement to FileInode::initFileInfoVec() at line 97, just before the stripe set start is computed:

        LogContext(logContext).log(Log_DEBUG,
            "File: " + this->inodeDiskData.getEntryID() +
            " fileSize=" + std::to_string(fileSize) +
            " stripeSetSize=" + std::to_string(stripeSetSize) +
            " numTargets=" + std::to_string(numTargets));

I rebuilt the BeeGFS RPMs, deployed the debug build on the metadata servers that were failing, raised the log level to 5, and increased the log file size to accommodate the extra output.

Before the crash, the debug output showed that the file with ID 63-61A9E7D6-3E8 had a stripe set size of 0:

(4) Dec08 00:17:28 Worker30 [File (Init File Info Vec))] >> File: 422-61AEF883-3E8 fileSize=5389 stripeSetSize=2097152  numTargets=4
(4) Dec08 00:17:28 Worker102 [File (Init File Info Vec))] >> File: 63-61A9E7D6-3E8 fileSize=0 stripeSetSize=0  numTargets=20


A stripe set size of zero causes a SIGFPE (integer division by zero) when calculating lastStripeSetSize:

   if(MathTk::isPowerOfTwo(numTargets) )
   { // quick path => optimized without division/modulo
      lastStripeSetSize = fileSize & (stripeSetSize-1);
      stripeSetStart = fileSize - lastStripeSetSize;
      fullLengthPerTarget = stripeSetStart >> MathTk::log2Int32(numTargets);
   }
   else
   { // slow path => requires division/modulo
      lastStripeSetSize = fileSize % stripeSetSize; // FPE is here.
      stripeSetStart = fileSize - lastStripeSetSize;
      fullLengthPerTarget = stripeSetStart / numTargets;
   }


To find the offending file, I searched for the file ID on the metadata storage target, found the parent directory, and rebuilt the directory path. Fortunately, the file was only one directory below the root.

[root@beegfs01 ~]# find /beegfs/meta0/buddymir/dentries/ -name 63-61A9E7D6-3E8 -print
/beegfs/meta0/buddymir/dentries/5E/25/4D6-61117B14-3E8/#fSiDs#/63-61A9E7D6-3E8
[root@beegfs01 ~]# attr -g fhgfs /beegfs/meta0/buddymir/inodes/5E/25/4D6-61117B14-3E8
Attribute "fhgfs" had a 130 byte value for /beegfs/meta0/buddymir/inodes/5E/25/4D6-61117B14-3E8:
�A?{a?{a��a��a?'4D6-61117B14-3E8root�@
[root@beegfs01 ~]# ls -l /beegfs/meta1/buddymir/dentries/5E/25/4D6-61117B14-3E8
drwxr-xr-x 2 root root 4096 Dec  7 23:14 '#fSiDs#'
-rw-r--r-- 2 root root    0 Dec  7 23:14  badfile.dat

I was then able to use beegfs-ctl --getentryinfo to find the directory with that entry ID and check its stripe pattern.


[root@beegfs01 ~]# beegfs-ctl --getentryinfo /scratch1/bgfsuser
Entry type: directory
EntryID: 4D6-61117B14-3E8
Metadata buddy group: 1
Current primary metadata node: burstbuff02-meta1 [ID: 2001]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1G
+ Number of storage targets: desired: 20
+ Storage Pool: 1 (Default)

This is not an appropriate stripe pattern. It should look like this:

[root@beegfs01 ~]# beegfs-ctl --getentryinfo /scratch1/jburto2
Entry type: directory
EntryID: 0-61117A96-3E8
Metadata buddy group: 5
Current primary metadata node: burstbuff03-meta2 [ID: 3002]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 512K
+ Number of storage targets: desired: 4
+ Storage Pool: 1 (Default)

Attempting to ls /scratch1/bgfsuser crashed the metadata server with a SIGFPE, as did attempting to access /scratch1/bgfsuser/badfile.dat.

Moving the dentry of badfile.dat out of the way allowed ls to succeed on /scratch1/bgfsuser:

mv /beegfs/meta0/buddymir/dentries/5E/25/4D6-61117B14-3E8/badfile.dat /beegfs/meta0/garbage/

The striping on /scratch1/bgfsuser was then reset to the default.

[root@beegfs01 ~]# beegfs-ctl --setpattern --chunksize=512k --numtargets=4 /scratch1/bgfsuser
New chunksize: 524288
New number of storage targets: 4


The server has remained up and stable since the file was removed, almost 8 hours as of this writing.


James Burton

Dec 8, 2021, 9:19:01 AM12/8/21
to fhgfs...@googlegroups.com
attr confirmed that badfile.dat was the offending file, 63-61A9E7D6-3E8:

[root@beegfs01 ~]# attr -g fhgfs /beegfs/meta0/buddymir/dentries/5E/25/4D6-61117B14-3E8/badfile.dat
Attribute "fhgfs" had a 166 byte value for /beegfs/meta0/buddymir/dentries/5E/25/4D6-61117B14-3E8/badfile.dat:                  ����a��a��a��a?'63-61A9E7D6-3E8B@0�VW�W�N�NMO�OPyP�PAQ�Q RmR�R5S�S�SaT�U
