Support for avoiding fstatfs ZFS check in env/io_posix.cc::IsSyncFileRangeSupported

105 views
Skip to first unread message

Peter (Stig) Edwards

unread,
Sep 14, 2021, 9:08:22 AM9/14/21
to rocksdb
Hello and thanks for RocksDB,

  I am using RocksDB on a clustered filesystem, and noticed that on some nodes calls to fstatfs from RocksDB are faster than others.  There are many calls to fstatfs that check if the fs is ZFS, and as I am not running on ZFS, I was thinking of adding support to avoid making them.

  The fstatfs calls I noticed are from RocksDB during ((Optimistic)Transaction)DB::Open calls, from env/io_posix.cc::IsSyncFileRangeSupported by way of PosixWritableFile::PosixWritableFile(...).

https://github.com/facebook/rocksdb/blob/2e09a54c4fb82e88bcaa3e7cfa8ccbbbbf3635d5/env/io_posix.cc#L151 

The fstatfs for ZFS check was added in https://github.com/facebook/rocksdb/commit/8272a6de57ed701fb25bb660e074cab703ed3fe7 as part of https://github.com/facebook/rocksdb/pull/5183

The dynamic sync_file_range check was added in https://github.com/facebook/rocksdb/commit/2c9df9f9e5c757c8f368d0860e2da8adb63849a3 for https://github.com/facebook/rocksdb/pull/5416 , interestingly the code referenced https://github.com/postgres/postgres/commit/483520eca426fb1b428e8416d1d014ac5ad80ef4 uses a static bool not_implemented_by_kernel which is set to true when ENOSYS is returned, but I don't think this approach could be used with the RocksDB code here as the fd could be on different filesystems.  Anyway the sync_file_range call is always nice and fast.

Here is a simulated reproducer using RocksDB 6.20 on CentOS8 using /dev/shm and strace to add a delay:

rm -rf /dev/shm/db_bench ; ./db_bench -benchmarks updaterandom -num 1 -db /dev/shm/db_bench > /dev/null ; strace -y -T -e trace=fstatfs -e inject=fstatfs:delay_enter=100ms ./db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true 2>&1 | egrep 'fstatfs|OpenDb' | awk '{print $1,$2,$3,$16}'

fstatfs(6</dev/shm/db_bench/MANIFEST-000008>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.099964>

fstatfs(7</dev/shm/db_bench/000008.dbtmp>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100016>

fstatfs(8</dev/shm/db_bench/000009.sst>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100020>

fstatfs(9</dev/shm/db_bench/MANIFEST-000010>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100012>

fstatfs(6</dev/shm/db_bench/000010.dbtmp>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100015>

fstatfs(6</dev/shm/db_bench/000005.log>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100016>

fstatfs(6</dev/shm/db_bench/000011.log>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100018>

fstatfs(10</dev/shm/db_bench/OPTIONS-000012.dbtmp>, {f_type=TMPFS_MAGIC, f_bsize=4096, <0.100008>

OpenDb: 814.075 milliseconds

I was thinking a compile time define (DO_NOT_CHECK_ZFS_SYNC_FILE_RANGE) around the code in

https://github.com/facebook/rocksdb/blob/2e09a54c4fb82e88bcaa3e7cfa8ccbbbbf3635d5/env/io_posix.cc#L151

would be a simple fix for me as I would define it when building.  Would a patch like this be of interest as a PR?

Cheers,

Peter (Stig) Edwards

Mark Rambacher

unread,
Sep 14, 2021, 9:55:25 AM9/14/21
to Peter (Stig) Edwards, rocksdb
I think you can already disable this at build time by using ROCKSDB_DISABLE_SYNC_FILE_RANGE.  Is that not an option? (I am not sure if the flag can be easily disabled in CMAKE builds.)

Is it the statfs call that is slow, or the sync_file_range?  It might also be possible to improve the performance of the code by only doing the fstatfs once per directory.  If the io_posix code kept a static map of the directories to their FS type, the second call could be avoided.  I know of other PRs that were going to potentially add more fstatfs calls and could benefit from this map as well.  

/ Mark 


--
You received this message because you are subscribed to the Google Groups "rocksdb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rocksdb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rocksdb/b50c8d34-44ee-43f2-ba65-c5c9bbc599d0n%40googlegroups.com.

Peter (Stig) Edwards

unread,
Sep 14, 2021, 11:13:30 AM9/14/21
to rocksdb
Thanks for the reply.

wrt ROCKSDB_DISABLE_SYNC_FILE_RANGE 
 
sync_file_range is present and supported by the OS and filesystem being used, and not a performance problem, if ROCKSDB_DISABLE_SYNC_FILE_RANGE is used then ROCKSDB_RANGESYNC_PRESENT is not defined and the behaviour of PosixWritableFile::RangeSync changes to use FSWritableFile::RangeSync https://github.com/facebook/rocksdb/blob/2e09a54c4fb82e88bcaa3e7cfa8ccbbbbf3635d5/env/io_posix.cc#L1410 , which I don't think uses sync_file_range.

It is the fstatfs call that has variable performance (slower at times), and is not needed as I am not using ZFS.

ROCKSDB_FORCE_ENABLE_SYNC_FILE_RANGE could be implemented, and this could always return true for env/io_posix.cc::IsSyncFileRangeSupported and define ROCKSDB_RANGESYNC_PRESENT, then anyone not using ZFS or the Windows Subsystem for Linux could define it to avoid the 2 system calls.

I thought about an implementation that would call fstatfs once per RocksDB database, but am aware that some files such as the log files (db_log_dir) and WAL files (wal_dir), and also SST files (cf_paths) could be created outside the RocksDB database directory.  But I suppose a map/cache of directories to their FS type would work as long as files are not links to files on other filesystems, or a different filesystem is mounted on the path.

Stig

Mark Rambacher

unread,
Sep 14, 2021, 11:57:46 AM9/14/21
to Peter (Stig) Edwards, rocksdb
PR#8903 introduces another call to fstatfs, also to get the file system type (f_type).  My suggestion would be to add a static map of directory names to the f_type and avoid making the fstat call when the type is already known for the directory.  This needs to be done under a mutex (to protect the map) 

Care to submit a PR?  

/ Mark

Stig

unread,
Sep 14, 2021, 12:38:50 PM9/14/21
to Mark Rambacher, rocksdb
PR#8903 looks like it will result in a call to fstatfs when
NewDirectory is called, this code path is not a hot one for me, at
least compared to PosixWritableFile::PosixWritableFile(...).

I don't have the time for more involved* PR to use a dir path to
f_type cache at the moment or in the short term. For now I will
remove the ZFS fstatfs check when building RocksDB (because I am not
using ZFS I don't need a cache).

If there is interest in a PR for:
ROCKSDB_FORCE_ENABLE_SYNC_FILE_RANGE - always return true for
env/io_posix.cc::IsSyncFileRangeSupported and define
ROCKSDB_RANGESYNC_PRESENT.
or
ROCKSDB_DO_NOT_CHECK_ZFS_SYNC_FILE_RANGE - skip the ZFS fstat check
in env/io_posix.cc::IsSyncFileRangeSupported
then I can submit one.

Thanks,
Stig

* I think it could be more involved because the cache may need to be
bounded (to limit size) and also sharded (for multithreaded
performance), so maybe using LRUCacheShard.

Mark Rambacher

unread,
Sep 20, 2021, 10:39:52 AM9/20/21
to Stig, rocksdb
Out of curiosity, are you compiling with fallocate enabled?  The fallocate appears to do a "statfs" whenever a writable file is opened.  Wondered if you saw the same performance degradation there.  

/ Mark


Peter (Stig) Edwards

unread,
Sep 21, 2021, 4:48:02 AM9/21/21
to rocksdb
Yes, fallocate is enabled, ROCKSDB_FALLOCATE_PRESENT is being defined.

The statfs call from env/fs_posix.cc:SupportsFastAllocate is not a problem for our use case.  It is guarded by env/fs_posix.cc#:checkedDiskForMmap_


We also have the default allow_mmap_writes=false


so the guard at env/fs_posix.cc#L325 also avoids the call to SupportsFastAllocate.

Stig
Reply all
Reply to author
Forward
0 new messages