Crashes of beegfs-metad (7.1.5 and 7.2.4) after unclean shutdown of servers (SIGSEGV, Bug: Refusing to release the directory)


Dr. Thomas Orgis

Nov 1, 2021, 5:28:01 PM
to beegfs-user
Hi,

we had a rather calm time with BeeGFS as long as the system was under
support from both the hardware vendor and ThinkParQ … now we are in
between systems, where the old stuff just has to live long enough until
we get the new one, and of course we hit our first really serious
BeeGFS issue.

I'm putting this out to the community and hopefully into the
developers' view … maybe someone has experienced something similar?

It's a CentOS 7.9 cluster with hardware installed in 2015, now running
BeeGFS 7.2.4. It was on 7.1.5 during an unclean shutdown of the servers
(a kernel bug seemed to trigger a crash during unmount of the metadata
filesystem, possibly also for storage targets), and apparently that
left some state in the data underlying the BeeGFS that makes
beegfs-meta crash after some usage. This might take days or just
minutes.

I ran an fsck that found some minor issues and updated to 7.2.4 … but
we get lots of these in the beegfs-meta log:

(2) Nov01 20:37:20 CommSlave71 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16424; FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
(2) Nov01 20:37:20 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
(3) Nov01 20:37:20 TimerWork/0 [Sync clients] >> closing file. ParentID: 1929-6125EFC1-7B5 FileName: libselinux.so.1
(2) Nov01 20:37:20 CommSlave29 [Close chunk file work] >> Communication with storage target failed. TargetID: 40077; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
(2) Nov01 20:37:20 CommSlave21 [Close chunk file work] >> Communication with storage target failed. TargetID: 44346; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
(2) Nov01 20:37:20 CommSlave68 [Close chunk file work] >> Communication with storage target failed. TargetID: 51045; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
(2) Nov01 20:37:20 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
(3) Nov01 20:37:20 TimerWork/0 [Sync clients] >> closing file. ParentID: 1929-6125EFC1-7B5 FileName: libgcrypt.so.20.0.5
(2) Nov01 20:37:20 CommSlave19 [Close chunk file work] >> Communication with storage target failed. TargetID: 26846; Session: 16424; FileHandle: 2BAEE910#1B8E-612608A3-7B5
(2) Nov01 20:37:20 CommSlave19 [Close chunk file work] >> Communication with storage target failed. TargetID: 33731; Session: 16424; FileHandle: 2BAEE910#1B8E-612608A3-7B5

… this is with the storage targets being healthy by all other means.
And clients _seem_ to be happy. Are these already fatal errors? Things
seem to work fine for some time, as long as beegfs-metad does not
crash. When it does, it looks like this:

(2) Nov01 21:34:54 CommSlave84 [Trunc chunk file work] >> Truncation of chunk file on target failed. TargetID: 38759; EntryID: 14D4-61804DD4-7B5; Error: Internal error
(2) Nov01 21:34:54 Worker13 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 14D4-61804DD4-7B5
(0) Nov01 21:34:54 Worker3 [make meta dir-entry] >> Failed to create link from: inodes/7/2A/87E0-61804F4D-7B5 To: dentries/35/5B/disposal/87E0-61804F4D-7B5 SysErr: File exists
(0) Nov01 21:34:54 Worker6 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 87DD-61804F4D-7B5
(0) Nov01 21:34:54 Worker15 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Nov01 21:34:54 Worker15 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x6fd607]
2: /lib64/libc.so.6(+0x36400) [0x7fcf3a68d400]
3: [0x7fcec0d80398]
(0) Nov01 21:34:54 Worker15 [PThread.cpp:135] >> Received a SIGABRT. Trying to shut down...
(1) Nov01 21:34:54 Worker15 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x47) [0x6fd607]
2: /lib64/libc.so.6(+0x36400) [0x7fcf3a68d400]
3: /lib64/libc.so.6(gsignal+0x37) [0x7fcf3a68d387]
4: /lib64/libc.so.6(abort+0x148) [0x7fcf3a68ea78]
5: /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165) [0x7fcf3b1b9a95]
6: /lib64/libstdc++.so.6(+0x5ea06) [0x7fcf3b1b7a06]
7: /lib64/libstdc++.so.6(+0x5ea33) [0x7fcf3b1b7a33]
8: /lib64/libstdc++.so.6(+0x5ec53) [0x7fcf3b1b7c53]
9: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x296) [0x6fd856]
10: /lib64/libc.so.6(+0x36400) [0x7fcf3a68d400]
11: [0x7fcec0d80398]
(2) Nov01 21:34:56 Main [App (wait for component termination)] >> Still waiting for this component to stop: Worker15


I remember that error message about refusing to release the directory.
It did not change from 7.1.5. The dirID differs between crashes.

Apart from a clear sign that the code might do something other than
segfault there … any hints on how one could debug this? Delete some
data that is in a bad state? Should beegfs-fsck catch such a situation?
The root cause may well be data corruption due to the unclean server
shutdown. But what I'd expect in such a case is some loss of recent
data, not the whole filesystem crashing.

This experience really brings down the so far very nice track record
we had with BeeGFS. We carried the systems over the years, starting
with FhGFS in 2014 or 2015. We only lost the quota settings on one
update, but no data. I try to teach the users not to rely on data
safety on the BeeGFS and to keep their own decentralized backups. But
in reality, it was perfectly reliable for 6 years. Is it now time to
teach them about the reality of data loss and create a fresh BeeGFS
without our damaged data? As a programmer, I'd rather have the bug
fixed / an fsck that removes/fixes the damaged sections of the
underlying data.

I do have an instance of beegfs-fsck running right now, which will
take some hours. In case that doesn't find something definite, maybe
someone else has a suggestion on how to get the FS back into a happy
state.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg

Bernd Schubert

Nov 1, 2021, 6:20:24 PM
to fhgfs...@googlegroups.com, Dr. Thomas Orgis


On 11/1/21 22:27, Dr. Thomas Orgis wrote:
> Hi,
>

[...]

> (2) Nov01 20:37:20 CommSlave71 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16424; FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
> (2) Nov01 20:37:20 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
> (3) Nov01 20:37:20 TimerWork/0 [Sync clients] >> closing file. ParentID: 1929-6125EFC1-7B5 FileName: libselinux.so.1
> (2) Nov01 20:37:20 CommSlave29 [Close chunk file work] >> Communication with storage target failed. TargetID: 40077; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5

Does this happen for all storage targets? And is it reproducible on
closing and truncating files? If so, I would enable the debug log level.
Even better would be to recompile with BEEGFS_DEBUG and then set the
debug log level.


> (2) Nov01 20:37:20 CommSlave21 [Close chunk file work] >> Communication with storage target failed. TargetID: 44346; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
> (2) Nov01 20:37:20 CommSlave68 [Close chunk file work] >> Communication with storage target failed. TargetID: 51045; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
> (2) Nov01 20:37:20 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
> (3) Nov01 20:37:20 TimerWork/0 [Sync clients] >> closing file. ParentID: 1929-6125EFC1-7B5 FileName: libgcrypt.so.20.0.5
> (2) Nov01 20:37:20 CommSlave19 [Close chunk file work] >> Communication with storage target failed. TargetID: 26846; Session: 16424; FileHandle: 2BAEE910#1B8E-612608A3-7B5
> (2) Nov01 20:37:20 CommSlave19 [Close chunk file work] >> Communication with storage target failed. TargetID: 33731; Session: 16424; FileHandle: 2BAEE910#1B8E-612608A3-7B5
>
> … this is with the storage targets being healthy by all other means.
> And clients _seem_ to be happy. Are these already fatal errors? Things
> seem to work fine for some time, as long as beegfs-metad does not
> crash. When it does, it looks like this:
>
> (2) Nov01 21:34:54 CommSlave84 [Trunc chunk file work] >> Truncation of chunk file on target failed. TargetID: 38759; EntryID: 14D4-61804DD4-7B5; Error: Internal error
> (2) Nov01 21:34:54 Worker13 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 14D4-61804DD4-7B5
> (0) Nov01 21:34:54 Worker3 [make meta dir-entry] >> Failed to create link from: inodes/7/2A/87E0-61804F4D-7B5 To: dentries/35/5B/disposal/87E0-61804F4D-7B5 SysErr: File exists
> (0) Nov01 21:34:54 Worker6 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 87DD-61804F4D-7B5
> (0) Nov01 21:34:54 Worker15 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...

[...]

>
> I remember that error message about refusing to release the directory.
> It did not change from 7.1.5. The dirID differs between crashes.

This message has existed for a much, much longer time, since around
2012... What happens is that the in-memory object reference count for a
directory went to 0, but at the same time the directory still has
objects (directory entries).

>
> Apart from a clear sign that the code might do something other than
> segfault there … any hints on how one could debug this? Delete some
> data that is in a bad state? Should beegfs-fsck catch such a situation?
> The root cause may well be data corruption due to the unclean server
> shutdown. But what I'd expect in such a case is some loss of recent
> data, not the whole filesystem crashing.

I wonder if a simple reproducer would be to get the object IDs, in order
to find the right disposal path and then

- have a single file in a directory (client operation)
- manually create a file with the right name in the right disposal path
(server side operation)
- open the file on a client (client)
- unlink the file while it is held open (client)

==> The unlink of the opened file should link it to the disposal dir,
but that file already exists - should give the error message. If now
disposal and initial dir have only one file, either of both dir-inodes
might get the reference issue.
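
Roughly like this, as a sketch (paths are placeholders: I assume a
client mount at /beegfs and the meta store at /data/beegfs-meta, and
the hashed dentries/XX/YY prefix for the real entry ID has to be
looked up first):

# 1. client: a directory with a single file
mkdir /beegfs/reprodir
touch /beegfs/reprodir/victim

# 2. client: note the EntryID of that file
beegfs-ctl --getentryinfo /beegfs/reprodir/victim

# 3. meta server: pre-create a colliding name in the matching disposal
#    path (XX/YY stand for the hash directories of this EntryID)
touch /data/beegfs-meta/dentries/XX/YY/disposal/<EntryID>

# 4. client: open the file, then unlink it while it is still held open
exec 3< /beegfs/reprodir/victim
rm /beegfs/reprodir/victim   # should try to link into disposal and hit "File exists"
exec 3<&-                    # finally release the open handle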

My guess is that there is a dirInode object dereference, which should not
be done if creating the hard link failed. And given how rarely an issue
like that would happen, it _might_ be a really ancient issue.

[...]

> I do have an instance of beegfs-fsck running right now, which will
> take some hours. In case that doesn't find something definite, maybe
> someone else has a suggestion on how to get the FS back into a happy
> state.

Not sure if this is going to help - I would try to figure out what is
going on between the meta and storage targets - that is probably the
main trigger of issues.
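
A first pass could just be the usual connectivity checks from a client
and from the meta server, something like this (as far as I remember
the tool options):

beegfs-check-servers                  # basic reachability of mgmt/meta/storage
beegfs-ctl --listtargets --state      # per-target reachability/consistency state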


Hope it helps,
Bernd

Dr. Thomas Orgis

Nov 2, 2021, 11:14:26 AM
to Bernd Schubert, fhgfs...@googlegroups.com
Hi Bernd,

thanks for your detailed suggestions! They're a bit above my current
understanding of BeeGFS. I hope that can be changed quickly.

I just ran xfs_repair on all storage targets with BeeGFS offline and
there was nothing to fix. Somehow I don't think there's actual
corruption on that side. But the beegfs-fsck run last night did find a
'Chunk is saved in wrong path', still relating to a file created back
before the first crash, on the day of the shutdown. Clearly,
beegfs-fsck is not doing any consistent healing :-(
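
For the record, that was a plain check run; I believe the relevant
invocations are roughly these (with --readOnly for a non-modifying
pass):

beegfs-fsck --checkfs --readOnly   # report problems only
beegfs-fsck --checkfs              # check and offer repairs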

So my current state: verified-clean underlying filesystems, but some
bad BeeGFS state, presumably in the metadata.

On Mon, 1 Nov 2021 23:20:20 +0100,
Bernd Schubert <bernd.s...@fastmail.fm> wrote:

> > (2) Nov01 20:37:20 CommSlave29 [Close chunk file work] >> Communication with storage target failed. TargetID: 40077; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5

To distill the variants of this in one place, hand-picked:

(2) Oct20 13:32:04 CommSlave8 [Stat chunk file work] >> Communication with storage target failed. TargetID: 16662; EntryID: 0-616FFE15-7B5
(2) Oct20 13:33:15 CommSlave21 [Unlink chunk file work] >> Communication with storage target failed. TargetID: 44346; EntryID: 192-616E57B9-7B5
(2) Oct20 13:34:56 Worker3 [Close Helper (close chunk files S)] >> Communication with storage target failed: 37055; FileHandle: 7F4C04D7#2B-616FFE54-7B5; Error: Communication error
(2) Oct20 13:34:57 CommSlave2 [Trunc chunk file work] >> Communication with storage target failed. TargetID: 33731; EntryID: 77-616FD16F-7B5
(2) Nov01 20:37:20 CommSlave71 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16424; FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
(2) Nov01 20:37:21 Worker33 [Close Helper (close chunk files S)] >> Communication with storage target failed: 26846; FileHandle: 4C5DC753#178-617C2727-7B5; Error: Communication error

The crash on Oct20 was more colourful. 20211102 relates to the logs of
the current instance with beegfs 7.2.4, since boot.

$ grep 'Communication with storage target failed' 20211101-beegfs-meta/beegfs-meta.log* \
| perl -ne '++$op{$1} if / (\[.+\])/; END { for (keys %op){ print "$_\t$op{$_}\n" } }'
[Close Helper (close chunk files S)] 54
[Stat chunk file work] 931
[Unlink chunk file work] 353
[Trunc chunk file work] 35
$ grep 'Communication with storage target failed' 20211102-beegfs-meta/beegfs-meta.log* \
| perl -ne '++$op{$1} if / (\[.+\])/; END { for (keys %op){ print "$_\t$op{$_}\n" } }'
[Close Helper (close chunk files S)] 20
[Close chunk file work] 130448

> Does this happen for all storage targets?

I'm mostly limiting myself to the logs of the current instances with
7.2.4, started on Nov 1 evening. Things get confusing. But I might fish
in older logs for certain IDs.

$ grep 'Communication with storage target failed' 20211102-beegfs-meta/* \
| wc -l
130468

$ grep 'Communication with storage target failed' 20211102-beegfs-meta/* \
> | perl -ne '++$target{$1} if /TargetID: (\d+)/; END { for (sort {$a <=> $b} (keys %target)){ print "$_\t$target{$_}\n"; }}'
12219 7347
16795 6976
17364 6976
24796 7013
26846 7013
33731 7013
35466 7013
37055 7892
38841 7892
38900 7892
40077 7394
44346 7394
51045 7394
55273 6973
55481 6973
55501 6973
57849 6973
62448 7347

That is 18 distinct targets out of the 24 I have in total. Seems rather evenly distributed.

The IDs, EntryID and FileHandle, come to 53836 distinct values in total
… no easy match. But if I only focus on the part after the #, i.e. IDs
like 3C9-617CDE89-7B5, I get 234 before the update to 7.2.4 and 2674
after.

> And is it reproducible on
> closing and truncating files? If so, I would enable the debug log level.

Looking into that. I need to map those IDs to absolute paths to files …
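
The only approach I have so far is brute force with find, assuming the
meta store lives at /data/beegfs-meta and a storage target at
/data/storage1 (placeholder paths); that at least shows where an
entry's chunk and metadata files sit on disk, if not yet the
client-visible path:

# chunk file for one of the offending EntryIDs on a storage target
find /data/storage1/chunks -name '14D4-61804DD4-7B5'
# corresponding entries in the metadata store
find /data/beegfs-meta -name '14D4-61804DD4-7B5'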

Oh, well … I overlooked one thing about the storage targets:

(0) Oct30 07:50:28 ConnAccept [ConnAccept] >> Trying to continue after connection accept error: Error during socket accept(): Too many open files

Many of those at that time. That's odd.
$ for n in 1 2 3 4; do grep 'Too many open files' 20211102-beegfs$n/*|wc -l; done
244827
243016
0
239853

That, while /proc/sys/fs/file-max is 13 million, suggests a rather
heavy leak in file descriptors, no?
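
To see how close the daemons actually get to their own limit (the
per-process one, not fs/file-max), something like this on a storage
server should do:

pid=$(pidof beegfs-storage)
ls /proc/$pid/fd | wc -l                 # file descriptors currently open
grep 'Max open files' /proc/$pid/limits  # per-process limit in effect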

Then:

(2) Oct26 14:55:01 Worker1-3 [WriteChunkFileMsg incoming] >> Potential cache loss for open file handle. (Server crash detected.) No session for file available. FileHandleID: 3D1E4E8B#8D2-61263B7D-7B5

I got several of those on two storage servers, at several points in
time, correlating with at least most of the metadata server crashes.
Since I don't have all logs collected, I can only assume that these
errors correlate between storage and meta. However, it is only a small
number. Cause or effect?

$ grep 'Server crash' */*|perl -pe 's,Worker.*: ,... ,'
20211101-beegfs1/beegfs-storage.log:(2) Oct26 14:55:01 ... 3D1E4E8B#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:53:40 ... 3D1E8C03#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:57:19 ... 51E81C0D#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:58:34 ... 51E81D64#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct30 07:56:35 ... 3D22A067#3C9-617CDE89-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:39 ... 4D8A44F8#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:41 ... 7C77B9E6#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:42 ... 40C283CD#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:42 ... 66AF826B#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:39:49 ... 6A714EAE#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:40:37 ... 8130180#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:40:40 ... 6B865D99#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 62BE7F3A#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 9A67A51#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 2C8A4BDA#EF7-612DD121-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:10 ... 6E2B5EE0#7C-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:14 ... 4A9561BE#2D-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 2796CEE6#29-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 648E7F52#67-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 7F3A480#43-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:16 ... 6052460F#49-616FFED1-7B5

$ grep 'Server crash' 20211101-beegfs?/* \
| perl -ne 'print "$1\n" if /FileHandleID: (\S+#\S+)$/'

That gives me 22 FileHandleIDs. When only using the part after the #, I
can find some of those in my metadata logs. For the complete ID, I find
exactly one match.

But these messages did not occur with BeeGFS 7.2.4, anyway.

$ grep '64C8EBE0#EF7-612DD121-7B5' */*
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 Worker4-3 [WriteChunkFileMsg incoming] >> Potential cache loss for open file handle. (Server crash detected.) No session for file available. FileHandleID: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave73 [Close chunk file work] >> Communication with storage target failed. TargetID: 37055; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave83 [Close chunk file work] >> Communication with storage target failed. TargetID: 38900; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave81 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave73 [Close chunk file work] >> Communication with storage target failed. TargetID: 37055; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave83 [Close chunk file work] >> Communication with storage target failed. TargetID: 38900; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave81 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 64C8EBE0#EF7-612DD121-7B5

Hm. I botched up the log collection (logs are thrown away on shutdown,
not being configured to be passed to central logging at cluster setup).
But the logs from Nov 01 should be complete. Cache loss in the morning
and a crash 11 hours later? The crash itself seems unrelated, but this
indicates breakage anyway.

I don't see any errors at the same time matching the metadata errors, but this …

> Even better would be to recompile with BEEGFS_DEBUG and then set the
> debug log level.

Hm. We have logLevel=3 right now. First, I'd need a way to trigger the
crash anyway. I'm looking at converting those IDs to paths that I can
nudge on the client side.
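
For reference, this is roughly what bumping the verbosity would look
like here (I assume the plain logLevel option in
/etc/beegfs/beegfs-meta.conf; levels above 3 supposedly only get really
useful with a BEEGFS_DEBUG build):

# raise the metadata server log level, then restart the service
sed -i 's/^logLevel.*/logLevel = 5/' /etc/beegfs/beegfs-meta.conf
systemctl restart beegfs-meta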

> > … this is with the storage targets being healthy by all other means.

A caveat about them being out of file handles, though. I take back my
use of the word 'healthy'.

> > (2) Nov01 21:34:54 CommSlave84 [Trunc chunk file work] >> Truncation of chunk file on target failed. TargetID: 38759; EntryID: 14D4-61804DD4-7B5; Error: Internal error
> > (2) Nov01 21:34:54 Worker13 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 14D4-61804DD4-7B5
> > (0) Nov01 21:34:54 Worker3 [make meta dir-entry] >> Failed to create link from: inodes/7/2A/87E0-61804F4D-7B5 To: dentries/35/5B/disposal/87E0-61804F4D-7B5 SysErr: File exists
> > (0) Nov01 21:34:54 Worker6 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 87DD-61804F4D-7B5
> > (0) Nov01 21:34:54 Worker15 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...

I wonder if this is the only point I should focus on, or if the other
messages discussed above are also fatal enough, and more frequent, so
that I should try to debug those instead.

> This message has existed for a much, much longer time, since around
> 2012... What happens is that the in-memory object reference count for a
> directory went to 0, but at the same time the directory still has
> objects (directory entries).

I don't have a clear picture of which layers are involved here. This is
an inconsistency on the metadata server alone, right?

> I wonder if a simple reproducer would be to get the object IDs, in order
> to find the right disposal path and then
>
> - have a single file in a directory (client operation)
> - manually create a file with the right name in the right disposal path
> (server side operation)
> - open the file on a client (client)
> - unlink the file while it is held open (client)

I need to read up on those IDs and how they map to paths. Is the server
side the metadata one (/beegfs/meta/dentries)? Is there a document with
an overview of these architecture details?

> ==> The unlink of the opened file should link it to the disposal dir,
> but that file already exists - should give the error message. If now
> disposal and initial dir have only one file, either of both dir-inodes
> might get the reference issue.
>
> My guess is that there is a dirInode object dereference, which should not
> be done if creating the hard link failed. And given how rarely an issue
> like that would happen, it _might_ be a really ancient issue.

Yes. Do you think that this is the only issue, though? Those storage
nodes running out of file descriptors look bad, IMHO, too. How many
bugs do we have at hand here?

> Not sure if this is going to help - I would try to figure out what is
> going on between the meta and storage targets - that is probably the
> main trigger of issues.

Yeah … I have some figuring out to do ;-) But there is a trade-off to
make: I can invest a certain amount of time into recovering this and
hopefully fixing BeeGFS to avoid such failures in the future … but as
long as I do that, our cluster operation is stopped. I can't spend a
month of downtime on this and may have to set up a clean BeeGFS and get
rid of the old data.

If I could extract a minimal set of problematic data to debug later,
that would be nice, but a full backup of the BeeGFS is not happening.

Dr. Thomas Orgis

Nov 9, 2021, 11:25:17 AM
to Bernd Schubert, fhgfs...@googlegroups.com
Hi again,

to conclude this affair: Phillip pointed out to me that the errors like

(0) Oct30 07:50:28 ConnAccept [ConnAccept] >> Trying to continue after connection accept error: Error during socket accept(): Too many open files

are at the beginning of the chain and easily avoidable by raising
tuneProcessFDLimit in beegfs-storage.conf (a sketch of the change
follows below). This does not really solve the actual crash indicated by

(0) Nov01 21:34:54 Worker3 [make meta dir-entry] >> Failed to create link from: inodes/7/2A/87E0-61804F4D-7B5 To: dentries/35/5B/disposal/87E0-61804F4D-7B5 SysErr: File exists
(0) Nov01 21:34:54 Worker6 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 87DD-61804F4D-7B5
(0) Nov01 21:34:54 Worker15 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...

but it does avoid the practical problem for us. Our downfall was not a
hardware error or corruption of any data, but the sudden start of many
jobs of a certain mix after the planned downtime.
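
For anyone hitting the same wall, the change itself is tiny (the value
is just an example; pick something comfortably above the open-file peak
you observe):

# /etc/beegfs/beegfs-storage.conf
tuneProcessFDLimit = 800000

# then restart the storage service on each server
systemctl restart beegfs-storage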

There was one specific type of job that started a Python script which
loaded many libraries and also did some JIT compilation using Numba.
The first process needs some minutes for that. Then a worker pool
starts up the same script again and now does the same library loading
and compiling step repeatedly, on all CPU cores. After 5 minutes, we
managed to load a Python script with 16 processes and could try to do
some actual work. That's progress, I guess.

This works reasonably well for one node doing the dance. It already
touches thousands of file descriptors, apparently. When some hundred
nodes do this at the same time, our storage servers get pushed close to
their default limit of 50k open files. At the same time, things start
to slow down considerably, from 5 minutes towards 20 minutes for
starting up the script's processes.

This alone did not kill the system yet, but other jobs that also rather
mindlessly bugger the filesystem can then push it over the configured
file descriptor limit. Once that happens, communication also breaks
within the BeeGFS cluster and all sorts of bad things happen. Depending
on your luck, you get some errors for a few hours or even days, and
then eventually a fatal crash of the services. In my case it was
beegfs-meta, but I imagine that such a wounded state could trigger bugs
in other places, too.

Numba is advertised as being built for scientific computing. Well, its
usage patterns at least seem destined to stress HPC systems a lot with
repetitive work, compiling Python functions again and again to machine
code.

We got struck by the anaconda.

This case of loading libraries for running/compiling a Python script is
an example of how the missing effective read cache in BeeGFS can bring
the whole system down. But granted, BeeGFS was not written for hosting
large software installations. It was written for jobs reading and/or
producing large streams of binary data. This is less and less the
use case of the people arriving at our system. They may think they do
data processing, but it is often in ridiculously small random portions.
They scale up computations that ran just fine on their laptop because
the laptop basically has no I/O barrier, with everything happening in
the page cache in system RAM.

We need to get the users to be more conscious of I/O being a thing and
have to bring the load on the parallel filesystem down by doing writes
on local disks as much as possible, and by serving reads for things
like Anaconda installs from a single NFS server rather than from a
BeeGFS cluster. For read-only serving of many small files, it's hard to
beat an NFS server with some RAM for caching, plus the client caches.

Another job type I analyzed recently turned something like 50K of input
data and 5G of final output data into 200T read/written on the BeeGFS
over 3 days. It's just madness to do this scratch I/O on the parallel
filesystem. Sure, you could beef it up to take such a load from
hundreds of clients, but that is just a big waste of money compared to
a cheap disk (not even necessarily flash-based) per node and some
smarter handling.

While the market will provide NVMe arrays that give clients millions of
ops on tiny files in a parallel filesystem, there is at least some
academic value in showing the users that some kinds of usage are just
wasteful and that a lot of the network transfer is unnecessary.

It is of course rather sad that our mighty supercomputer (a cheap
cluster of home computers) is overwhelmed by simply _loading_ current
software, not even by executing the computations. To end on a lighter
note:

So maybe the BeeGFS client module could analyze the access patterns and
at some point throw ESTUPIDIO to make people think about what they do;-)


Alrighty then,

Thomas

PS: I like the term ESTUPIDIO. It should be a thing.