errors beegfs-fsck cannot fix


Steffen Grunewald

Apr 5, 2017, 8:57:32 AM
to BeeGFS user list
Hi,

I have run a beegfs-fsck on one of our filesystems, and ended up with the
following:

...
Step 3: Check for errors...

* Duplicated inode IDs ...
* Duplicated chunks ...
>>> Found 17 errors. Detailed information can also be found in check.out.

Found errors beegfs-fsck cannot fix. Please consult the log for more information.


In the output file, I find entries like

>>> Found duplicated Chunks for ID 5C-58DD2BB9-2
* Found on target 6 in path uA98/58D3/F/4FA9-58D3F5C1-2
* Found on target 6 in path uA98/58D3/F/4FA9-58D3F5C1-2

(Yes, the "Found on target" lines are pairwise identical.)
How can I find out which files have been affected, and how can I resolve this
situation manually?

Thanks,
Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Phone: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Steffen Grunewald

Apr 5, 2017, 9:52:27 AM
to BeeGFS user list
On Wed, 2017-04-05 at 14:57:30 +0200, Steffen Grunewald wrote:
> Hi,
>
> I have run a beegfs-fsck on one of our filesystems, and ended up with the
> following:
>
> ...
> Step 3: Check for errors...
>
> * Duplicated inode IDs ...
> * Duplicated chunks ...
> >>> Found 17 errors. Detailed information can also be found in check.out.
>
> Found errors beegfs-fsck cannot fix. Please consult the log for more information.
>
>
> In the output file, I find entries like
>
> >>> Found duplicated Chunks for ID 5C-58DD2BB9-2
> * Found on target 6 in path uA98/58D3/F/4FA9-58D3F5C1-2
> * Found on target 6 in path uA98/58D3/F/4FA9-58D3F5C1-2
>
> (Yes, the "Found on target" lines are pairwise identical.)

I have found that all these chunks (hint: chunks/$path/$ID below the mount point of
the storage target) were created within 3 minutes, at a time when there should have
been no issues with the cluster - and they all reside on the same storage target,
on the same server:

storage01: -rw-rw-rw- 1 user1 group2 66484 Apr 3 22:53 /mnt/storage3/chunks/uA98/58D3/F/4FA9-58D3F5C1-2/5C-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 124215 Apr 3 22:53 /mnt/storage3/chunks/uA98/58D3/F/17C7-58D3F5C1-2/1E7-58DD2B7B-2
storage01: -rw-rw-rw- 1 user1 group2 39219 Apr 3 22:52 /mnt/storage3/chunks/uA98/58D3/F/16C6-58D3F84D-1/277-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 28527 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/4F56-58D3F5C1-2/306-58DD2B7B-2
storage01: -rw-rw-rw- 1 user1 group2 23345 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/D7C3-58D3F598-1/376-58DD2B7B-2
storage01: -rw-rw-rw- 1 user1 group2 30704 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/39CB-58D3F84D-1/445-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 17908 Apr 3 22:52 /mnt/storage3/chunks/uA98/58D3/F/32F-58D3F876-2/46B-58DD2BB0-1
storage01: -rw-rw-rw- 1 user1 group2 22833 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3B0A-58D3F84D-1/480-58DD2BB0-1
storage01: -rw-rw-rw- 1 user1 group2 23407 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3BAA-58D3F84D-1/499-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 21043 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/8792-58D3F582-2/4A1-58DD2B71-1
storage01: -rw-rw-rw- 1 user1 group2 23407 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3BAA-58D3F84D-1/4DF-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 124214 Apr 3 22:53 /mnt/storage3/chunks/uA98/58D3/F/17C4-58D3F5C1-2/4FC-58DD2B71-1
storage01: -rw-rw-rw- 1 user1 group2 56686 Apr 3 22:52 /mnt/storage3/chunks/uA98/58D3/F/2FAD-58D3F837-2/525-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 17203 Apr 3 22:53 /mnt/storage3/chunks/uA98/58D3/F/19A6-58D3F5C1-2/542-58DD2B71-1
storage01: -rw-rw-rw- 1 user1 group2 27058 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3A91-58D3F84D-1/667-58DD2BB9-2
storage01: -rw-rw-rw- 1 user1 group2 13683 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3D4C-58D3F84D-1/70B-58DD2BB0-1
storage01: -rw-rw-rw- 1 user1 group2 13683 Apr 3 22:51 /mnt/storage3/chunks/uA98/58D3/F/3D4C-58D3F84D-1/727-58DD2BB0-1
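
For reference, a listing like the one above can be reproduced by searching a
storage target for chunk files named after the duplicated IDs - assuming the
17 IDs from check.out have been saved to a file, here called dup-ids.txt:

  # list every chunk file whose name matches one of the duplicated IDs
  while read -r id; do
      find /mnt/storage3/chunks -name "$id" -exec ls -l {} +
  done < dup-ids.txt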

I'm afraid there's an issue with the underlying XFS :( - should I migrate everything
off it, and clean it up?

> How can I find out which files have been affected, and how can I resolve this
> situation manually?
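
One way to map the duplicated IDs back to the affected files might be to walk
the client mount and compare each file's entry ID against the 17 IDs from
check.out - a rough (and slow) sketch, assuming a client mount at /beegfs and
the IDs saved to dup-ids.txt as above:

  # print every file whose EntryID matches one of the duplicated chunk IDs
  find /beegfs -type f | while read -r f; do
      id=$(beegfs-ctl --getentryinfo "$f" | awk '/EntryID:/ {print $2}')
      grep -qxF "$id" dup-ids.txt && echo "$f"
  done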

Also, "Duplicated chunks" usually isn't the last check to be performed. How do
I get beegfs-fsck to continue?

Would it be safe to just ignore the issues for now? Time for maintenance is running
out :(

- S

Ely de Oliveira

Apr 13, 2017, 5:16:10 AM
to fhgfs...@googlegroups.com
Hello Steffen,

I would like to inform the other users about our latest findings regarding this issue.

We discovered that the directory /mnt/storage3/chunks/uA98/58D3/F/4FA9-58D3F5C1-2 on server storage01 actually contains two entries with the same name: 5C-58DD2BB9-2. That is why beegfs-fsck printed the same path twice.

$ ls
...
5C5-58DA8594-2
5C-58DD2BB9-2
5C-58DD2BB9-2
5CE-58DBBC03-2
...
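
A quick way to spot such duplicated names in a chunk directory is to feed the
listing through uniq -d, which prints only repeated lines:

  $ ls /mnt/storage3/chunks/uA98/58D3/F/4FA9-58D3F5C1-2 | sort | uniq -d
  5C-58DD2BB9-2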


The XFS partition seems to be corrupt. In cases like this, beegfs-fsck stops checking the file system, because it is not capable of fixing errors of this sort. Using xfs_repair to repair the XFS partition seems to be the right thing to do at this point.
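
A typical workflow for that - the device name below is only a placeholder, and
the beegfs-storage daemon should be stopped first so the target can be
unmounted:

  $ umount /mnt/storage3
  # no-modify mode first: report problems without changing anything
  $ xfs_repair -n /dev/sdX1
  # then the actual repair, and remount
  $ xfs_repair /dev/sdX1
  $ mount /mnt/storage3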

Best regards,

Ely

Steffen Grunewald

Apr 24, 2017, 6:09:09 AM
to fhgfs...@googlegroups.com
Hello,

(just returned from a vacation, trying to get back in sync)

On Thu, 2017-04-13 at 11:22:15 +0200, Ely de Oliveira wrote:
> Hello Steffen,
>
> I would like to inform the other users about our latest findings regarding
> this issue.
>
> We discovered that the directory
> /mnt/storage3/chunks/uA98/58D3/F/4FA9-58D3F5C1-2 on server storage01
> actually contains two entries with the same name: 5C-58DD2BB9-2. That is why
> beegfs-fsck printed the same path twice.
>
> $ ls
> ...
> 5C5-58DA8594-2
> 5C-58DD2BB9-2
> 5C-58DD2BB9-2
> 5CE-58DBBC03-2
> ...
>
>
> The XFS partition seems to be corrupt. In cases like this, beegfs-fsck stops
> checking the file system, because it is not capable of fixing errors of this
> sort. Using xfs_repair to repair the XFS partition seems to be the right
> thing to do at this point.

Thanks Ely for your analysis.

While XFS is a very reliable filesystem, bad things still happen... in particular
if the server is (or has to be) mistreated.

I had set the storage target to read-only when I became aware of the issue (and
suspected it might be related to the underlying fs), and accessing the chunks
seems to work.

There are at least three options right now.
(1) Get rid of the chunks by removing the files (got to talk to the owner about
this), and see whether the dup entries will vanish - in the worst case, this
may crash the xfs :(
(2) Migrate everything off the storage target (see the sketch below), and do a
fresh init of the xfs
(3) Run xfs_repair
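
For (2), the migration could probably be driven with beegfs-ctl - a rough
sketch, with target ID 6 taken from the fsck output and the client mount point
assumed to be /beegfs:

  # re-stripe all files that have chunks on target 6 onto other targets
  $ beegfs-ctl --migrate --targetid=6 /beegfs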

Of course, the order of these actions has to be negotiated - each of them may
have nasty side-effects, but fortunately it's the "scratch" fs that has to be
fixed.
(What could possibly go wrong? How long would it take to migrate 27 TB off that
fs, and how long does it take to xfs_repair a 50 TB filesystem on a machine with
128 GB of RAM?)


Thanks,
Steffen