Crash Consistency Bugs in BeeGFS

Tao Lyu

Feb 24, 2023, 10:25:21 AM
to beegfs-user
Hi,

I have found several issues with the persistence syscalls (fsync and fdatasync) in BeeGFS.

Environment settings:
1 management server, 2 metadata servers, 2 storage servers, and 1 client
The client is configured with "tuneRemoteFSync = true"

Issue 1. fsync returns before the data and metadata of the corresponding file have been flushed to the server disks, although the POSIX specification requires this. (https://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html)

For example, I run the following commands from bash:
echo "testing beegfs fsync on file0" > beegfs-client/file0;
strace sync beegfs-client;
strace sync beegfs-client/file0;

After the last command finishes (all commands complete successfully), I crash all 5 servers using "echo b > /proc/sysrq-trigger", then restart them and wait until the whole file system is back to normal. However, I can't list file0 from the client, and I can't find its data on the storage servers' disks.
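
For anyone who wants to reproduce this without the shell, here is a minimal Python sketch of the same sequence. It assumes Linux and that "sync PATH" (without options) opens the path and calls fsync on it; the function names fsync_path and write_then_sync are my own and not part of BeeGFS or coreutils.

```python
import os


def fsync_path(path: str) -> None:
    """Open `path` read-only and fsync it (assumed equivalent of `sync PATH`)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_then_sync(mount_point: str, name: str, data: bytes) -> str:
    """Write a file, then fsync the mount point directory and the file itself."""
    path = os.path.join(mount_point, name)
    # Equivalent of: echo "..." > mount_point/name
    with open(path, "wb") as f:
        f.write(data)
    # Equivalent of: sync mount_point; sync mount_point/name
    fsync_path(mount_point)
    fsync_path(path)
    return path
```

Calling write_then_sync("beegfs-client", "file0", b"testing beegfs fsync on file0\n") and crashing the servers right after it returns should correspond to the test above.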

Issue 2. fdatasync doesn't flush directory entries, which are the data of a directory. Here is the strace of our test script (beegfs-client is our client mount point):
openat(AT_FDCWD, "beegfs-client/file0", O_RDWR|O_CREAT, 0162100) = 3
close(3)                                = 0
openat(AT_FDCWD, "./beegfs-client", O_RDONLY) = 3
fdatasync(3)                            = 0
close(3)                                = 0

As in the previous test, we crash all servers after the last call. After recovering from the crash, there is no file0 under the directory beegfs-client. I ran the same commands with the same crash on a local Ext4 file system, and there file0 is present after recovery.
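
The syscall sequence in the trace can be sketched in Python as follows. This is only an approximation under stated assumptions: it assumes Linux (os.fdatasync is not available on every platform), the helper name create_then_fdatasync_dir is hypothetical, and the file mode 0o644 is my choice rather than the mode shown in the trace.

```python
import os


def create_then_fdatasync_dir(dir_path: str, name: str) -> str:
    """Create an empty file, then fdatasync its parent directory."""
    path = os.path.join(dir_path, name)
    # openat(..., O_RDWR|O_CREAT, ...) followed by close()
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.close(fd)
    # Open the directory read-only and fdatasync it, so that the
    # new directory entry is expected to be durable when this returns.
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fdatasync(dir_fd)
    finally:
        os.close(dir_fd)
    return path
```

After the crash and recovery, os.path.exists(path) on the returned path is the check that fails on BeeGFS and succeeds on Ext4 in my tests.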

I'm wondering whether any BeeGFS developers could take a look at these issues. If they are caused by a misconfiguration, please correct me.

Best,
Tao