Hello all,
I've scanned all of the hardware: no issues were reported, and there are plenty of free inodes in the system.
What led us to run beegfs-fsck was an issue with quota reporting. Despite updating the quota and waiting a few minutes, any attempt to synchronize content resulted in a "disk quota exceeded" message. This was corrected by restarting the metadata service on the metadata node that was logging the message.
During the initial readOnly fsck, there didn't appear to be any messages out of the ordinary, so we went ahead and ran the fsck with --automatic while the system was online.
Once I received the first message from a user reporting an issue (some files named after their chunk IDs were present in their directory), I viewed the beegfs-fsck.log and saw messages similar to the following:
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F5B00-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F82F2-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F8779-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F87F6-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F8F91-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F9627-2
(1) Oct08 15:23:08 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 1-615F5B00-2
(1) Oct08 15:23:08 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 1-615F827A-2
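In case it helps anyone tallying the affected entries from a similar log, here is a minimal sketch in Python. It assumes the log lines look exactly like the ones above; the function name failed_entry_ids is my own and is not part of BeeGFS.

```python
import re

# Matches the "Failed to recreate dentry" lines shown above and captures
# the entry ID that follows "entryID: ". The format is assumed from this
# log excerpt only.
FAILED_DENTRY = re.compile(r"Failed to recreate dentry\. entryID: (\S+)")


def failed_entry_ids(lines):
    """Return the unique entry IDs from failed-dentry lines, in first-seen order."""
    ids = {}
    for line in lines:
        match = FAILED_DENTRY.search(line)
        if match:
            # A dict keeps insertion order and drops duplicates.
            ids[match.group(1)] = None
    return list(ids)
```

Passing an open file object for beegfs-fsck.log (or any iterable of lines) should work, since the function only iterates over its argument.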
I stopped the fsck to minimize any further corruption and/or damage to the system, given the sheer size and number of users with stored data.
There are now messages within the metadata logs:
(0) Oct11 09:02:10 Worker28 [Directory (remove contents dir)] >> Unable to delete contents directory: dentries/4/56/C-60DF651D-1. SysErr: Directory not empty
(0) Oct11 09:02:10 Worker68 [Directory (remove contents dir)] >> Unable to delete dirEntryID directory: dentries/5/5A/5EB-60CBD74C-1/#fSiDs#/. SysErr: No such file or directory
(0) Oct11 09:02:10 Worker68 [Directory (remove contents dir)] >> Unable to delete contents directory: dentries/5/5A/5EB-60CBD74C-1. SysErr: Directory not empty
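To collect the affected dentries paths from the metadata log, something like the following sketch might be useful. It assumes the message format shown in the three lines above; the name undeletable_dirs is hypothetical and nothing here calls into BeeGFS itself.

```python
import re

# Assumed format, taken from the metadata log lines above:
#   ... >> Unable to delete <kind> directory: <path>. SysErr: <error text>
UNDELETABLE = re.compile(
    r"Unable to delete (?P<kind>contents|dirEntryID) directory: "
    r"(?P<path>\S+)\. SysErr: (?P<err>.+)"
)


def undeletable_dirs(lines):
    """Return (kind, path, error) tuples for each 'Unable to delete' line."""
    results = []
    for line in lines:
        match = UNDELETABLE.search(line)
        if match:
            results.append(
                (match.group("kind"), match.group("path"), match.group("err").strip())
            )
    return results
```

Grouping the returned tuples by path would show which directories report both "Directory not empty" and "No such file or directory".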
Each of the logged directories contains a single zero-byte file whose name appears to be a hash string:
# ls -l dentries/5/5A/5EB-60CBD74C-1 dentries/4/56/C-60DF651D-1 dentries/4/48/68B-60CBD74C-1
dentries/4/48/68B-60CBD74C-1:
-rw-r--r-- 1 root root 0 Sep 15 22:56 4e3046bd66eb3ffcecee1104138b42c6d2c577
dentries/4/56/C-60DF651D-1:
-rw-r--r-- 1 root root 0 Jul 2 15:12 45f257cb36098ef25e5bbd4901e770ea3b29b0
dentries/5/5A/5EB-60CBD74C-1:
-rw-r--r-- 1 root root 0 Jul 2 15:12 86bfe5bccc5b9902b19d7ebd2f2de4db5cd79a
In some cases users were able to delete offending directories without errors, while others received remote I/O errors.
We're running BeeGFS 7.1.5.
Has anyone else experienced this recently? If so, were the errors corrected with an offline fsck? We're not sure how much damage has been done (we're scanning the file system now). We're not even sure if an offline fsck will fix the problems.
Thanks!
John DeSantis