Crash Consistency Bugs in BeeGFS

Tao Lyu

Feb 24, 2023, 10:25:21 AM

to beegfs-user

Hi,

I found several issues with the persistence syscalls (fsync and fdatasync) in BeeGFS.

Environment settings:

1 management server, 2 metadata servers, 2 storage servers, and 1 client

The client is configured with "tuneRemoteFSync = true".

Issue 1. fsync does not flush the data and metadata of the file to the servers' disks before it returns, although the POSIX specification requires this (https://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html).

For example, I issue the following commands from bash:

echo "testing beegfs fsync on file0" > beegfs-client/file0;

strace sync beegfs-client;

strace sync beegfs-client/file0;

After the last command finishes (all commands executed successfully), I crash all five servers with "echo b > /proc/sysrq-trigger", then restart them and wait until the whole file system is back to normal. However, file0 is not listed from the client, and its data cannot be found on the servers' storage disks.
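For reference, the same pre-crash workload can be expressed at the syscall level. The C sketch below is an approximation, not the actual test harness: the helper name `issue1_workload`, the file mode 0644, and the read-only opens used for the two fsync calls (roughly what coreutils `sync FILE` does) are assumptions. It covers only the steps before the crash; the crash and the check after restart are still manual.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open `path` read-only and fsync it. On Linux a read-only descriptor is
   enough for fsync, for both regular files and directories. */
static int fsync_path(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}

/* Hypothetical helper: write msg to <dir>/file0 (like `echo ... > file0`),
   then fsync the directory and the file (like the two `sync` commands).
   Returns 0 on success, -1 with errno set on failure. */
int issue1_workload(const char *dir, const char *msg) {
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/file0", dir);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644); /* mode is assumed */
    if (fd < 0)
        return -1;
    size_t len = strlen(msg);
    ssize_t w = write(fd, msg, len);
    if (w < 0 || (size_t)w != len) {
        int saved = (w < 0) ? errno : EIO; /* treat a short write as an I/O error */
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    /* Per POSIX, once fsync returns 0 the file's data and metadata should be
       on stable storage; this is the guarantee Issue 1 reports as broken. */
    if (fsync_path(dir) < 0)
        return -1;
    return fsync_path(path);
}
```

For example, `issue1_workload("beegfs-client", "testing beegfs fsync on file0\n")` mirrors the bash commands above (echo appends the trailing newline).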

Issue 2. fdatasync does not flush directory entries, which are the data of a directory. Here is the strace of our test script (beegfs-client is our client mount point):
openat(AT_FDCWD, "beegfs-client/file0", O_RDWR|O_CREAT, 0162100) = 3
close(3) = 0
openat(AT_FDCWD, "./beegfs-client", O_RDONLY) = 3
fdatasync(3) = 0
close(3) = 0

As in the previous test, we crash all servers after the last call. After recovery, there is no file0 in the beegfs-client directory. When I ran the same sequence with the same crash on a local ext4 file system, file0 was still present after recovery.
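The traced sequence can be replayed with a C sketch like the one below. The helper name `issue2_workload` and the file mode 0644 are assumptions (the trace shows the unusual mode 0162100, which is not reproduced here). As with Issue 1, the crash and the post-recovery check remain manual steps.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper replaying the traced calls: create <dir>/file0,
   close it, then fdatasync the parent directory.
   Returns 0 on success, -1 with errno set on failure. */
int issue2_workload(const char *dir) {
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/file0", dir);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* openat(AT_FDCWD, "<dir>/file0", O_RDWR|O_CREAT, ...) then close */
    int fd = open(path, O_RDWR | O_CREAT, 0644); /* mode is assumed */
    if (fd < 0)
        return -1;
    if (close(fd) < 0)
        return -1;

    /* openat(AT_FDCWD, "<dir>", O_RDONLY), fdatasync, close */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fdatasync(dfd); /* expected to persist the new directory entry */
    int saved = errno;
    close(dfd);
    errno = saved;
    return rc;
}
```

For example, `issue2_workload("./beegfs-client")` issues the same open/close/open/fdatasync/close sequence as the trace, after which the servers are crashed.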

I'm wondering whether any BeeGFS developers could take a look at these issues. If they are caused by a wrong configuration, please correct me.

Best,

Tao
