Metadata issues after beegfs-fsck


John Desantis

Oct 11, 2021, 9:31:53 AM
to beegfs-user
Hello all,

We're seeing basically the same issue per the following thread:

https://groups.google.com/g/fhgfs-user/c/CJsTw8qrOd4/m/X4Oopl6XBAAJ

I've scanned all of the hardware and no issues were reported, and there are plenty of free inodes in the system.

What led us to running a beegfs-fsck was an issue with quota reporting.  Despite updating the quota and waiting a few minutes, any attempt to synchronize content resulted in a disk quota exceeded message.  This was corrected by restarting the metadata service on the metadata node logging the message.

During the initial readOnly fsck, there didn't appear to be any messages out of the ordinary so we went ahead and ran the fsck while the system was online using --automatic.

Once I received the first message from a user reporting an issue (some files named after their chunk IDs were present in their directory), I viewed the beegfs-fsck.log and saw messages similar to the following:

(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F5B00-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F82F2-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F8779-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F87F6-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F8F91-2
(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F9627-2
(1) Oct08 15:23:08 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 1-615F5B00-2
(1) Oct08 15:23:08 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 1-615F827A-2
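
For anyone who wants to pull the affected entry IDs out of a log like this, here is a minimal Python sketch. The function name and the regular expression are my own assumptions, derived only from the log lines quoted above; nothing here ships with BeeGFS.

```python
import re

# Assumed pattern, based only on the beegfs-fsck.log lines quoted above.
FAILED_DENTRY = re.compile(r"Failed to recreate dentry\. entryID: (\S+)")


def failed_entry_ids(log_lines):
    """Return entry IDs from 'Failed to recreate dentry' lines, in order, without duplicates."""
    seen = []
    for line in log_lines:
        match = FAILED_DENTRY.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "(1) Oct08 15:21:49 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 0-615F5B00-2",
    "(1) Oct08 15:23:08 Main [MsgHelperRepair (recreateDentries)] >> Failed to recreate dentry. entryID: 1-615F5B00-2",
]
print(failed_entry_ids(sample))
```

To run it against a real log, pass an open file handle instead of the sample list, since the function only iterates over lines.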

I stopped the fsck to minimize any further corruption and/or damage to the system, given the sheer size and number of users with stored data.

There are now messages within the metadata logs:

(0) Oct11 09:02:10 Worker28 [Directory (remove contents dir)] >> Unable to delete contents directory: dentries/4/56/C-60DF651D-1. SysErr: Directory not empty
(0) Oct11 09:02:10 Worker68 [Directory (remove contents dir)] >> Unable to delete dirEntryID directory: dentries/5/5A/5EB-60CBD74C-1/#fSiDs#/. SysErr: No such file or directory
(0) Oct11 09:02:10 Worker68 [Directory (remove contents dir)] >> Unable to delete contents directory: dentries/5/5A/5EB-60CBD74C-1. SysErr: Directory not empty

Each of the logged directories contains a single empty file whose name appears to be a hash string:

# ls -l dentries/5/5A/5EB-60CBD74C-1 dentries/4/56/C-60DF651D-1 dentries/4/48/68B-60CBD74C-1
dentries/4/48/68B-60CBD74C-1:
-rw-r--r-- 1 root root 0 Sep 15 22:56 4e3046bd66eb3ffcecee1104138b42c6d2c577
dentries/4/56/C-60DF651D-1:
-rw-r--r-- 1 root root 0 Jul  2 15:12 45f257cb36098ef25e5bbd4901e770ea3b29b0
dentries/5/5A/5EB-60CBD74C-1:
-rw-r--r-- 1 root root 0 Jul  2 15:12 86bfe5bccc5b9902b19d7ebd2f2de4db5cd79a

In some cases users were able to delete offending directories without errors, while others received remote I/O errors.

We're running BeeGFS 7.1.5. 

Has anyone else experienced this recently?  If so, were the errors corrected with an offline fsck?  We're not sure how much damage has been done (we're scanning the file system now).  We're not even sure if an offline fsck will fix the problems.

Thanks!
John DeSantis


John Desantis

Oct 11, 2021, 12:18:58 PM
to beegfs-user
Hello all,

Just to provide another update, the entire file system is littered with BeeGFS inode chunk files:

# md5sum ./*|sort -k1 -V
0a2554ccf1912ff4a743326e46565817  ./83-6116B882-2
0a2554ccf1912ff4a743326e46565817  ./tpb_sim0035.dat
0a83557685455b9e6a682eebe707326f  ./AE-6116B269-2
0a83557685455b9e6a682eebe707326f  ./tpb_sim0002.dat

If a chunk file is removed (users may perform this action), then the file it's pointing to (83-6116B882-2 -> tpb_sim0035.dat) becomes inaccessible and is listed with ?'s in place of its attributes.  The only way to correct this issue that we know of is to copy the offending file first, remove the chunk file reference, and then move the copied file to the original file name.  Please see below for an example session:

# md5sum 81-60AE57A2-1 esri
51d33a408268cf895ea0b75490480bc7  81-60AE57A2-1
51d33a408268cf895ea0b75490480bc7  esri
# stat 81-60AE57A2-1 esri
  File: ‘81-60AE57A2-1’
  Size: 453436        Blocks: 886        IO Block: 524288 regular file
Device: 25h/37d    Inode: 1385451496698350251  Links: 1
Access: (0640/-rw-r-----)  Uid: (663800172/ UNKNOWN)   Gid: (663800172/ UNKNOWN)
Access: 2021-05-26 10:14:23.000000000 -0400
Modify: 2021-05-26 10:14:23.000000000 -0400
Change: 2021-05-26 10:14:48.000000000 -0400
 Birth: -
  File: ‘esri’
  Size: 453436        Blocks: 886        IO Block: 524288 regular file
Device: 25h/37d    Inode: 1385451496698350251  Links: 1
Access: (0640/-rw-r-----)  Uid: (663800172/ UNKNOWN)   Gid: (663800172/ UNKNOWN)
Access: 2021-05-26 10:14:23.000000000 -0400
Modify: 2021-05-26 10:14:23.000000000 -0400
Change: 2021-05-26 10:14:48.000000000 -0400
 Birth: -

# cp -p esri esri.tmp
# ls -lhS
total 2.5M
-rw-r----- 1 663800172 663800172 443K May 26 10:14 81-60AE57A2-1
-rw-r----- 1 663800172 663800172 443K May 26 10:14 esri
-rw-r----- 1 663800172 663800172 443K May 26 10:14 esri.tmp

# rm 81-60AE57A2-1
rm: remove regular file ‘81-60AE57A2-1’? y
# ls -lhS
ls: cannot access esri: No such file or directory
total 1.6M
-????????? ? ?         ?            ?            ? esri

# mv esri.tmp esri
# ls -lhS
total 1.6M
-rw-r----- 1 663800172 663800172 443K May 26 10:14 esri
-rw-r----- 1 663800172 663800172 439K May 26 10:14 7E-60AE57A2-1
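
The same three steps can be sketched in Python. This is only an illustration of the session above, not a vetted repair tool: the function name and the `.tmp` suffix are hypothetical, and `shutil.copy2` preserves mode and timestamps but not ownership, so it is not a full equivalent of `cp -p` run as root.

```python
import os
import shutil


def replace_with_copy(stray_name, real_name):
    """Copy real_name aside, remove stray_name, then rename the copy over real_name."""
    tmp_name = real_name + ".tmp"      # hypothetical temporary name
    shutil.copy2(real_name, tmp_name)  # copies data, mode and timestamps (not ownership)
    os.remove(stray_name)              # on the affected file system this breaks real_name
    os.replace(tmp_name, real_name)    # same effect as `mv esri.tmp esri`


# Example call, matching the session above:
# replace_with_copy("81-60AE57A2-1", "esri")
```

The copy is made before anything is removed, so the data still exists under the temporary name if a later step fails.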

Can this corruption be fixed without further damage to users' files?  Will an offline beegfs-fsck correct this?

Thanks,
John DeSantis

Dr. Thomas Orgis

Nov 10, 2021, 3:16:21 AM
to fhgfs...@googlegroups.com
On Mon, 11 Oct 2021 09:18:57 -0700 (PDT),
John Desantis <desa...@mail.usf.edu> wrote:

> Just to provide another update, the entire file system is littered with
> BeeGFS inode chunk files:
>
> # md5sum ./*|sort -k1 -V
> 0a2554ccf1912ff4a743326e46565817 ./83-6116B882-2
> 0a2554ccf1912ff4a743326e46565817 ./tpb_sim0035.dat
> 0a83557685455b9e6a682eebe707326f ./AE-6116B269-2
> 0a83557685455b9e6a682eebe707326f ./tpb_sim0002.dat

Clarification: Does this mean the user-visible filesystem? You see
chunk files alongside user files? I did not see that yet.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg

John Desantis

Nov 10, 2021, 10:43:45 AM
to beegfs-user
Thomas,

> Clarification: Does this mean the user-visible filesystem? You see chunk files alongside user files? I did not see that yet.

Yes, that's correct.  And, because the users didn't recognize these files and didn't think to contact support personnel, they deleted the files, potentially causing irreparable damage.  Here's another example:

# ls -iln |sort -k1 -V |head
11654970278505774502 -rw-rw-r-- 1 1839611 10001  1920 Dec 13  2020 26-6155ACAE-2
11654970278505774502 -rw-rw-r-- 1 1839611 10001  1920 Dec 13  2020 panedwindow.tcl
12172555935262268081 -rw-rw-r-- 1 1839611 10001  4411 Dec 13  2020 defaults.tcl
13326373792836723737 -rw-rw-r-- 1 1839611 10001  5576 Dec 13  2020 6-6155ACAE-2
13326373792836723737 -rw-rw-r-- 1 1839611 10001  5576 Dec 13  2020 fonts.tcl
13582606475456525872 -rw-rw-r-- 1 1839611 10001  1089 Dec 13  2020 2C-6155ACAE-2
13582606475456525872 -rw-rw-r-- 1 1839611 10001  1089 Dec 13  2020 progress.tcl
14963159557428088649 -rw-rw-r-- 1 1839611 10001  4007 Dec 13  2020 cursors.tcl
15679023562517988415 -rw-rw-r-- 1 1839611 10001  3818 Dec 13  2020 aquaTheme.tcl
16421555084575193365 -rw-rw-r-- 1 1839611 10001  2781 Dec 13  2020 81-6155ACAE-2

Notice that each chunk file shares its inode number with a "real" file.  If you delete the chunk file, the inode is lost, and with it all of the metadata for the real file.  What is left of the real file can then be neither accessed nor deleted.
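
To find such pairs without touching anything, a read-only scan along these lines may help. It is a minimal sketch: the name pattern (three hex fields separated by dashes) is an assumption based on the names in this listing, so a legitimate user file that happens to match it would be reported too.

```python
import os
import re
from collections import defaultdict

# Assumed shape of a chunk-ID-style name, e.g. 26-6155ACAE-2 (hex-hex-hex).
ENTRY_ID_NAME = re.compile(r"^[0-9A-F]+-[0-9A-F]+-[0-9A-F]+$")


def stray_pairs(directory):
    """Return sorted (stray_name, real_name) pairs that share an inode number."""
    by_inode = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            inode = entry.stat(follow_symlinks=False).st_ino
            by_inode[inode].append(entry.name)
    pairs = []
    for names in by_inode.values():
        strays = [n for n in names if ENTRY_ID_NAME.match(n)]
        reals = [n for n in names if not ENTRY_ID_NAME.match(n)]
        pairs.extend((s, r) for s in strays for r in reals)
    return sorted(pairs)


# Example: print the suspicious pairs in the current directory.
# for stray, real in stray_pairs("."):
#     print(stray, "->", real)
```

It only lists names and removes nothing, so it should be safe to run as a first pass before deciding what to do with each pair.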

> # rm 26-6155ACAE-2
> rm: remove regular file ‘26-6155ACAE-2’? y
>
> # ls -l
> ls: cannot access panedwindow.tcl: No such file or directory
>
> -????????? ? ?        ?           ?            ? panedwindow.tcl
>
> # rm panedwindow.tcl
> rm: cannot remove ‘panedwindow.tcl’: No such file or directory

A "fix" would be to create a new file via `touch` and then to move it over the file that is damaged.

> # touch panedwindow.tcl.new
> # mv panedwindow.tcl.new panedwindow.tcl
> # ls -la
> -rw-r----- 1 0     10001     0 Nov 10 10:39 panedwindow.tcl

With "users being users" there is an extreme risk that corruption will occur, since it's easier to simply delete files that don't seem to belong than to reach out and ask about them first.

John DeSantis

