Hi Bernd,
thanks for your detailed suggestions! They're a bit above my current
understanding of BeeGFS. I hope that can be changed quickly.
I just did xfs_repair on all storage targets with BeeGFS offline and
there was nothing to fix. Somehow I don't think there's actual
corruption on that side. But the recent beegfs-fsck last night did find
a 'Chunk is saved in wrong path' still relating to a file created back
before the first crash, on the day of the shutdown. Clearly,
beegfs-fsck is not doing any consistent healing :-(
So my current state is: verified clean underlying filesystems, but some
bad BeeGFS state, presumably in the metadata.
On Mon, 1 Nov 2021 23:20:20 +0100, Bernd Schubert
<bernd.s...@fastmail.fm> wrote:
> > (2) Nov01 20:37:20 CommSlave29 [Close chunk file work] >> Communication with storage target failed. TargetID: 40077; Session: 16424; FileHandle: 2BAEE90D#1A06-6125EFC1-7B5
To collect the variants of this message in one place, a few hand-picked
examples:
(2) Oct20 13:32:04 CommSlave8 [Stat chunk file work] >> Communication with storage target failed. TargetID: 16662; EntryID: 0-616FFE15-7B5
(2) Oct20 13:33:15 CommSlave21 [Unlink chunk file work] >> Communication with storage target failed. TargetID: 44346; EntryID: 192-616E57B9-7B5
(2) Oct20 13:34:56 Worker3 [Close Helper (close chunk files S)] >> Communication with storage target failed: 37055; FileHandle: 7F4C04D7#2B-616FFE54-7B5; Error: Communication error
(2) Oct20 13:34:57 CommSlave2 [Trunc chunk file work] >> Communication with storage target failed. TargetID: 33731; EntryID: 77-616FD16F-7B5
(2) Nov01 20:37:20 CommSlave71 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16424; FileHandle: 2BAEE90B#1A93-6125EFC1-7B5
(2) Nov01 20:37:21 Worker33 [Close Helper (close chunk files S)] >> Communication with storage target failed: 26846; FileHandle: 4C5DC753#178-617C2727-7B5; Error: Communication error
The crash on Oct 20 was more colourful. The 20211102 directories hold
the logs of the current instance with BeeGFS 7.2.4, since boot.
$ grep 'Communication with storage target failed' 20211101-beegfs-meta/beegfs-meta.log* \
| perl -ne '++$op{$1} if / (\[.+\])/; END { for (keys %op){ print "$_\t$op{$_}\n" } }'
[Close Helper (close chunk files S)] 54
[Stat chunk file work] 931
[Unlink chunk file work] 353
[Trunc chunk file work] 35
$ grep 'Communication with storage target failed' 20211102-beegfs-meta/beegfs-meta.log* \
| perl -ne '++$op{$1} if / (\[.+\])/; END { for (keys %op){ print "$_\t$op{$_}\n" } }'
[Close Helper (close chunk files S)] 20
[Close chunk file work] 130448
> Does this happen for all storage targets?
I'm mostly limiting myself to the logs of the current instances with
7.2.4, started on Nov 1 evening. Things get confusing. But I might fish
in older logs for certain IDs.
$ grep 'Communication with storage target failed' 20211102-beegfs-meta/* \
| wc -l
130468
$ grep 'Communication with storage target failed' 20211102-beegfs-meta/* \
> | perl -ne '++$target{$1} if /TargetID: (\d+)/; END { for (sort {$a <=> $b} (keys %target)){ print "$_\t$target{$_}\n"; }}'
12219 7347
16795 6976
17364 6976
24796 7013
26846 7013
33731 7013
35466 7013
37055 7892
38841 7892
38900 7892
40077 7394
44346 7394
51045 7394
55273 6973
55481 6973
55501 6973
57849 6973
62448 7347
That is 18 distinct targets out of the 24 I have in total. Seems rather evenly distributed.
The IDs (EntryID and FileHandle) amount to 53836 distinct ones in total
… no easy match. But if I only focus on the part after the #, i.e. IDs
like 3C9-617CDE89-7B5, I get 234 distinct ones before the update to
7.2.4 and 2674 after.
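For reference, the distinct-ID counting was roughly along these lines (a
sketch from memory; needs GNU grep with PCRE for -P):
$ grep 'Communication with storage target failed' 20211101-beegfs-meta/beegfs-meta.log* \
| grep -oP '(?:FileHandle: \S+#|EntryID: )\K[^\s;]+' \
| sort -u | wc -l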
> And is it reproducible on
> closing and truncating files? If so, I would enable debug log level.
Looking into that. I need to map those IDs to absolute paths to files …
Oh, well … I overlooked one thing about the storage targets:
(0) Oct30 07:50:28 ConnAccept [ConnAccept] >> Trying to continue after connection accept error: Error during socket accept(): Too many open files
Many of those at that time. That's odd.
$ for n in 1 2 3 4; do grep 'Too many open files' 20211102-beegfs$n/*|wc -l; done
244827
243016
0
239853
That, while /proc/sys/fs/file-max is 13 million, suggests a rather
heavy leak in file descriptors, no?
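If I read the errno right, accept() failing with 'Too many open files'
is EMFILE, i.e. the per-process fd limit, not fs.file-max (that one
would say 'Too many open files in system'). So next time it happens
I'll check the daemon itself, something like (assuming a single
beegfs-storage process per box):
$ pid=$(pidof beegfs-storage)
$ ls /proc/$pid/fd | wc -l
$ grep 'Max open files' /proc/$pid/limits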
Then:
(2) Oct26 14:55:01 Worker1-3 [WriteChunkFileMsg incoming] >> Potential cache loss for open file handle. (Server crash detected.) No session for file available. FileHandleID: 3D1E4E8B#8D2-61263B7D-7B5
I got several of those on two storage servers, at several points in
time, correlating with at least most of the metadata server crashes.
Since I don't have all logs collected, I think it is safe to assume
that these errors correlate between storage and meta. However, it is
only a small number of them. Cause or effect?
$ grep 'Server crash' */*|perl -pe 's,Worker.*: ,... ,'
20211101-beegfs1/beegfs-storage.log:(2) Oct26 14:55:01 ... 3D1E4E8B#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:53:40 ... 3D1E8C03#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:57:19 ... 51E81C0D#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct26 15:58:34 ... 51E81D64#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct30 07:56:35 ... 3D22A067#3C9-617CDE89-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:39 ... 4D8A44F8#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:41 ... 7C77B9E6#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:42 ... 40C283CD#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:12:42 ... 66AF826B#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:39:49 ... 6A714EAE#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:40:37 ... 8130180#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Oct31 08:40:40 ... 6B865D99#8D2-61263B7D-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 62BE7F3A#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 9A67A51#EF7-612DD121-7B5
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 ... 2C8A4BDA#EF7-612DD121-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:10 ... 6E2B5EE0#7C-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:14 ... 4A9561BE#2D-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 2796CEE6#29-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 648E7F52#67-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:15 ... 7F3A480#43-616FFED1-7B5
20211101-beegfs2/beegfs-storage.log:(2) Oct20 13:35:16 ... 6052460F#49-616FFED1-7B5
$ grep 'Server crash' 20211101-beegfs?/* \
| perl -ne 'print "$1\n" if /FileHandleID: (\S+#\S+)$/'
That gives me 22 FileHandleIDs. When only using the part after the #, I
can find some of those in my metadata logs. For the complete ID, I find
exactly one match.
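For the suffix-based cross-check, something along these lines does the
job (sketch; directory names as in my collection):
$ grep 'Server crash' 20211101-beegfs?/* \
| perl -ne 'print "$1\n" if /FileHandleID: \S+#(\S+)$/' \
| sort -u > /tmp/crash-id-suffixes
$ grep -cFf /tmp/crash-id-suffixes 20211102-beegfs-meta/beegfs-meta.log*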
But these messages did not occur with BeeGFS 7.2.4, anyway.
$ grep '64C8EBE0#EF7-612DD121-7B5' */*
20211101-beegfs1/beegfs-storage.log:(2) Nov01 09:08:57 Worker4-3 [WriteChunkFileMsg incoming] >> Potential cache loss for open file handle. (Server crash detected.) No session for file available. FileHandleID: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave73 [Close chunk file work] >> Communication with storage target failed. TargetID: 37055; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave83 [Close chunk file work] >> Communication with storage target failed. TargetID: 38900; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 CommSlave81 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211101-beegfs-meta-7.2.4/beegfs-meta.log.old-4:(2) Nov01 20:37:05 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave73 [Close chunk file work] >> Communication with storage target failed. TargetID: 37055; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave83 [Close chunk file work] >> Communication with storage target failed. TargetID: 38900; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 CommSlave81 [Close chunk file work] >> Communication with storage target failed. TargetID: 38841; Session: 16200; FileHandle: 64C8EBE0#EF7-612DD121-7B5
20211102-beegfs-meta/beegfs-meta.log.old-5:(2) Nov01 20:37:05 TimerWork/0 [Close Helper (close chunk files)] >> Problems occurred during release of storage server file handles. FileHandle: 64C8EBE0#EF7-612DD121-7B5
Hm. I botched the log collection (the logs are thrown away on shutdown,
not having been configured to go to central logging at cluster setup).
But the logs from Nov 01 should be complete. Cache loss in the morning
and a crash 11 hours later? The crash itself seems unrelated, but this
indicates breakage anyway.
I don't see any errors at the same time matching the metadata errors,
but this …
> Even better would be to recompile with BEEGFS_DEBUG and then to set the
> debug log level.
Hm. We have LogLevel=3 right now. First, I'd need a way to trigger the
crash anyway. I'm looking at converting those IDs to paths that I can
then nudge on the client side.
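As a first, brute-force attempt I'll probably just search the metadata
store for one of the IDs, assuming they really show up as plain
directory-entry names below /beegfs/meta/dentries (a guess on my side,
not verified), e.g.:
$ find /beegfs/meta/dentries -name '14D4-61804DD4-7B5'
and then try to walk upwards via the parent-directory IDs in whatever
that returns.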
> > … this is with the storage targets being healthy by all other means.
Caveat: they were running out of file handles. I retract my use of the
word 'healthy'.
> > (2) Nov01 21:34:54 CommSlave84 [Trunc chunk file work] >> Truncation of chunk file on target failed. TargetID: 38759; EntryID: 14D4-61804DD4-7B5; Error: Internal error
> > (2) Nov01 21:34:54 Worker13 [Trunc chunk file helper] >> Problems occurred during truncation of storage server chunk files. File: 14D4-61804DD4-7B5
> > (0) Nov01 21:34:54 Worker3 [make meta dir-entry] >> Failed to create link from: inodes/7/2A/87E0-61804F4D-7B5 To: dentries/35/5B/disposal/87E0-61804F4D-7B5 SysErr: File exists
> > (0) Nov01 21:34:54 Worker6 [InodeDirStore.cpp:128] >> Bug: Refusing to release the directory, its fileStore still has references! dirID: 87DD-61804F4D-7B5
> > (0) Nov01 21:34:54 Worker15 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
I wonder if this is the only point I should focus on, or whether the
other messages discussed above are fatal enough, and frequent enough,
that I should try to debug those instead.
> This message exists for a much much longer time, since around 2012...
> What happens is that the in memory object reference count for a
> directory went to 0, but at the same time the directory still has
> objects (directory entries).
I don't have the full picture of which levels are involved here. This
is an inconsistency on the metadata server alone, right?
> I wonder if a simple reproducer would be to get the object IDs, in order
> to find the right disposal path and then
>
> - have a single file in a directory (client operation)
> - manually create a file with the right name in the right disposal path
> (server side operation)
> - open the file on a client (client)
> - unlink the file while it is held open (client)
I need to read up on those IDs and how they map to paths. Is the server
side the metadata one (/beegfs/meta/dentries)? Is there a document with
an overview of these architecture details?
> ==> The unlink of the opened file should link it to the disposal dir,
> but that file already exists - should give the error message. If now
> disposal and initial dir have only one file, either of both dir-inodes
> might get the reference issue.
>
> My guess is that there is a dirInode object dereference, which should not
> be done if creating the hard link failed. And given how rarely an issue
> like that would happen, it _might_ be a really ancient issue.
Yes. Do you think that this is the only issue, though? Those storage
nodes running out of file descriptors look bad, IMHO, too. How many
bugs do we have at hand here?
> Not sure if this is going to help - I would try to figure out what is
> going on between the meta and storage targets - that is probably the
> main trigger of issues.
Yeah … I have some figuring out to do ;-) But there is a trade-off to
make: I can invest a certain amount of time into recovering this and
hopefully fixing BeeGFS to avoid such failures in the future … but as
long as I do that, our cluster operation is stopped. I can't spend a
month of downtime on this and may have to set up a clean BeeGFS and get
rid of the old data.
If I could extract a minimal set of problematic data to debug later,
that would be nice, but a full backup of the BeeGFS is not happening.