We are seeing the beegfs-meta service crash repeatedly with a SIGFPE. It crashes about every 6 hours, and for the next 5 minutes every restart attempt crashes again. After that it restarts successfully and the filesystem stays up until the next crash. There are no indications of hardware problems in any of the system logs or messages.
The relevant portion of the beegfs-meta.log file is below.
(4) Dec05 21:10:02 TimerWork/0 [Sync clients] >> Removing 1 client sessions.
(4) Dec05 21:10:02 DirectWorker1 [SessionStore (ref)] >> Creating a new session. SessionID: 0
(4) Dec05 21:10:05 XNodeSync [InternodeSyncer.cpp:296] >> Downloading capacity pools. Pool type: Meta
(4) Dec05 21:10:05 XNodeSync [InternodeSyncer.cpp:296] >> Downloading capacity pools. Pool type: Meta buddies
(4) Dec05 21:10:21 ConnAccept [ConnAccept] >> Ignoring an internal event on the listening RDMA socket
(4) Dec05 21:10:21 ConnAccept [ConnAccept] >> Accepted new RDMA connection from 10.128.21.15:60329 [SockFD: 655]
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:376] >> Starting state update.
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:401] >> Beginning target state update...
(4) Dec05 21:10:23 XNodeSync [InternodeSyncer.cpp:756] >> Downloading target states and buddy groups
(0) Dec05 21:10:45 Worker76 [PThread.cpp:108] >> Received a SIGFPE. Trying to shut down...
(1) Dec05 21:10:45 Worker76 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x75a3e7]
2: /lib64/libc.so.6(+0x37400) [0x7ffb4e08a400]
3: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode15initFileInfoVecEv+0x33a) [0x659cba]
4: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInodeC2ESsP18FileInodeStoreData12DirEntryTypej+0x492) [0x65a262]
5: /opt/beegfs/sbin/beegfs-meta(_ZN8DirEntry15createInodeByIDERKSsP9EntryInfo+0x12d) [0x62f4ed]
6: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode22createFromInlinedInodeEP9EntryInfo+0x197) [0x655c17]
7: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode19createFromEntryInfoEP9EntryInfo+0xf) [0x65b5bf]
8: /opt/beegfs/sbin/beegfs-meta(_ZN14InodeFileStore4statEP9EntryInfobR8StatData+0x1da) [0x671aba]
9: /opt/beegfs/sbin/beegfs-meta(_ZN9MetaStore4statEP9EntryInfobR8StatDataP9NumericIDIj12NumNodeIDTagEPSs+0x108) [0x6787d8]
10: /opt/beegfs/sbin/beegfs-meta(_ZN13MsgHelperStat4statEP9EntryInfobjR8StatDataP9NumericIDIj12NumNodeIDTagEPSs+0x3a) [0x605e2a]
11: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx14executeLocallyERN10NetMessage15ResponseContextEb+0x663) [0x5ca813]
12: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI15LookupIntentMsgSt5tupleIJ9DirIDLock14ParentNameLock10FileIDLockEEE15processIncomingERN10NetMessage15ResponseContextE+0x47a) [0x5cdb8a]
13: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx15processIncomingERN10NetMessage15ResponseContextE+0xa0) [0x5cb060]
14: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x17d) [0x6e945d]
15: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x162) [0x6ef022]
16: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x4c) [0x6efd2c]
17: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0xfe) [0x4828fe]
18: /lib64/libpthread.so.0(+0x814a) [0x7ffb4e42014a]
19: /lib64/libc.so.6(clone+0x43) [0x7ffb4e14fdc3]
(4) Dec05 21:10:45 Worker1 [Worker1] >> Component stopped.
(4) Dec05 21:10:45 Worker2 [Worker2] >> Component stopped.
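
For anyone reading the backtrace, here is a minimal Python sketch that splits the frames into their parts and picks out the first named frame after the signal handler, which is usually where the signal was raised. The function names (parse_frames, faulting_frame, demangle) are made up for this illustration and are not part of BeeGFS. The sketch assumes the frame layout shown in the log above and, for demangling, that binutils' c++filt is on the PATH; without it the mangled name is returned unchanged.

```python
import re
import shutil
import subprocess

# Matches frames of the form seen in the log above, e.g.
#   3: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode15initFileInfoVecEv+0x33a) [0x659cba]
# The symbol part may be empty, as in the libc frame "(+0x37400)".
FRAME_RE = re.compile(
    r"^\s*(\d+): (\S+?)\(([^)+]*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]\s*$"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append({
                "index": int(m.group(1)),
                "binary": m.group(2),
                "symbol": m.group(3),           # mangled name, may be ""
                "offset": int(m.group(4), 16),  # offset inside the symbol
                "address": int(m.group(5), 16),
            })
    return frames


def faulting_frame(frames):
    """Return the first named frame after the signal handler frame.

    Heuristic: the handler is followed by an unnamed libc frame and then
    by the function that was executing when the signal arrived.
    """
    seen_handler = False
    for frame in frames:
        if "signalHandler" in frame["symbol"]:
            seen_handler = True
            continue
        if seen_handler and frame["symbol"]:
            return frame
    return None


def demangle(symbol):
    """Demangle via c++filt if it is installed, else return symbol as-is."""
    tool = shutil.which("c++filt")
    if not symbol or tool is None:
        return symbol
    result = subprocess.run([tool, symbol], capture_output=True,
                            text=True, check=False)
    return result.stdout.strip() or symbol


if __name__ == "__main__":
    # First three frames copied from the log above.
    sample = (
        "1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x75a3e7]\n"
        "2: /lib64/libc.so.6(+0x37400) [0x7ffb4e08a400]\n"
        "3: /opt/beegfs/sbin/beegfs-meta(_ZN9FileInode15initFileInfoVecEv+0x33a) [0x659cba]\n"
    )
    frame = faulting_frame(parse_frames(sample))
    print(frame["index"], demangle(frame["symbol"]), hex(frame["offset"]))
```

Applied to the trace above, this points at frame 3, the mangled form of FileInode::initFileInfoVec(). It only identifies where the signal arrived; it does not say which operation inside that function faulted.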