beegfs-meta crashes with Bug: Refusing to release the directory, its fileStore still has references!

Dr. Thomas Orgis

Apr 14, 2023, 10:56:15 AM
to fhgfs...@googlegroups.com
Hi,

our BeeGFS instance on CentOS 7.x has chugged along since 2015 without
any really major outages, save for the time when we hit file descriptor
limits on the servers.

But now we got a metadata server crash that got triggered twice in
rather short succession. We are still trying to identify the workload
responsible for this. The errors in beegfs-meta.log look like this:

(0) Apr11 21:35:53 Worker11 [File (store updated xattr metadata)] >> Unable to write FileInode metadata update: inodes/55/51/1052-6435B674-7B5. SysErr: No such file or directory
(0) Apr11 21:35:53 Worker45 [make meta dir-entry] >> Failed to create link from: inodes/52/5B/106A-6435B674-7B5 To: dentries/35/5B/disposal/106A-6435B674-7B5 SysErr: File exists
(0) Apr11 21:35:53 Worker41 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 1064-6435B674-7B5
(0) Apr11 21:35:53 Worker27 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
[backtrace follows]
(0) Apr11 21:35:53 Worker27 [App (component exception handler)] >> The component [Worker27] encountered an unrecoverable error. [SysErr: No such file or directory] Exception message: Segmentation fault
(2) Apr11 21:35:53 Worker27 [App (component exception handler)] >> Shutting down...
(3) Apr11 21:35:54 Main [App] >> Stored 380 sessions and 0 mirrored sessions

The pattern looks very much the same for the second occurrence, just
with a different entryID. We simply restarted beegfs-meta and things
went along. Of course we are not sure whether the fs is damaged in some
way. Maybe the intentional seppuku of beegfs-meta happens just in time
to avoid that.

This is version 7.2.4. I looked at the release notes for later versions
and only found one entry that looked related:

https://doc.beegfs.io/7.3.2/release_notes.html

- Changed EntryID locking in the metadata lock store to resolve an
issue with rmdir. This change also leads to significantly improved
stat and read performance in our testing.

Would that fit? Or are we encountering something else? I am currently
trying to find out which workload the indicated file belongs to (just
checking plausible candidates among users active during both incidents
didn't turn up anything sufficiently suspicious in terms of write-heavy
fs access). I am not sure how many days I am supposed to wait for the
result of

# beegfs-ctl --find --entryid=106A-6435B674-7B5 --storagepoolid=1 /work

to appear. (For a start, I can locate inodes/52/5B/106A-6435B674-7B5 and
dentries/35/5B/disposal/106A-6435B674-7B5 in the underlying storage on
the metadata server, but the xattr blob doesn't tell me much.)
Shouldn't it be quicker to find out which user/path this refers to
without a reverse search through the whole tree?
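One small thing that does help narrow the search: as far as I can tell,
a BeeGFS entryID has the form counter-timestamp-nodeID, with the middle
field being the file's creation time as a hex Unix timestamp. (That
field layout is my reading of the IDs we see, not something I pulled
from the code, so treat it as an assumption.) A quick sketch:

```python
from datetime import datetime, timezone

def decode_entryid(entry_id: str):
    """Split a BeeGFS entryID into its three hex fields.

    Assumed layout (not taken from the BeeGFS sources):
    <counter>-<unix timestamp>-<node numeric ID>, all hexadecimal.
    """
    counter, ts_hex, node = entry_id.split("-")
    created = datetime.fromtimestamp(int(ts_hex, 16), tz=timezone.utc)
    return counter, created, node

# The entryID from our log:
counter, created, node = decode_entryid("106A-6435B674-7B5")
print(created)  # 2023-04-11 19:35:16+00:00
```

19:35:16 UTC is 21:35:16 CEST, i.e. the inode was created well under a
minute before the crash at 21:35:53, so whatever workload triggered this
was actively creating files at that moment.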

Anyone with pointers? Even if an upgrade could fix this (easily only to
7.2.7 on CentOS 7.9, on a system due to be decommissioned within the
year), I'd like to be sure that the issue is actually fixed, not just
hidden under different circumstances.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg