RocksDB - configuration tuning for high read IOPS


Luis Alves

Mar 4, 2024, 1:51:39 PM
to rocksdb
Hello everyone,

I'm doing a performance analysis on a system that uses RocksDB as its underlying storage. From my investigation, the bottleneck seems to be that RocksDB is exhausting the disk read IOPS. 

I suspect that this is data-related (lots of key/values and random reads, or maybe compactions) and that the solution is just to scale the system, but I would like to know if there is any obvious configuration change that could make sense for my scenario.

Here is some information about my setup.

RocksDB version: 8.10.2

Workload Description:
  1. The number of RocksDB writes (put or delete) is usually the same as the number of reads (point reads / random reads).
  2. Writes to RocksDB (put or delete) are performed using WriteBatch with at most 5k items.
  3. The application is in Java and uses RocksDBJNI.
  4. The application (process) has multiple RocksDB instances, and they are configured to share the same BlockCache and WriteBufferManager (see the sketch below).
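
For illustration, the sharing in point 4 looks roughly like this in RocksJava (a minimal sketch; the write-buffer budget, paths, and class name are made up for the example, not our actual code):

    import org.rocksdb.*;

    public class SharedCacheSetup {
      public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // One block cache and one write buffer manager, shared by every DB
        // instance in the process. The 18 GiB matches this setup; the 4 GiB
        // write-buffer budget is a made-up example value.
        Cache blockCache = new LRUCache(18L << 30);
        WriteBufferManager writeBufferManager = new WriteBufferManager(4L << 30, blockCache);

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockCache(blockCache)                 // same Cache object for all DBs
            .setCacheIndexAndFilterBlocks(true)
            .setPinL0FilterAndIndexBlocksInCache(true);

        Options options = new Options()
            .setCreateIfMissing(true)
            .setWriteBufferManager(writeBufferManager) // same manager for all DBs
            .setTableFormatConfig(tableConfig);

        try (RocksDB db1 = RocksDB.open(options, "/tmp/db1"); // hypothetical paths
             RocksDB db2 = RocksDB.open(options, "/tmp/db2");
             WriteBatch batch = new WriteBatch();
             WriteOptions writeOptions = new WriteOptions()) {
          // Writes go through WriteBatch, up to ~5k items per batch (point 2).
          batch.put("key-1".getBytes(), "value-1".getBytes());
          batch.delete("old-key".getBytes());
          db1.write(writeOptions, batch);
        }
      }
    }
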
Infrastructure:
  1. The application runs on AWS EKS (Kubernetes).
  2. It runs in a pod with 8 vCPUs and 44 GB memory.
  3. AWS gp3 volume with 250GB and 3k IOPS (which is the default IOPS).
  4. Using Jemalloc.
  5. The JVM is configured with a 16GB heap and the RocksDB BlockCache with 18GB. The remaining memory (roughly 44 - 16 - 18 = ~10GB) is left for the OS Page Cache.
RocksDB configurations:
Options.error_if_exists: 0
Options.create_if_missing: 1
Options.paranoid_checks: 0
Options.flush_verify_memtable_count: 1
Options.compaction_verify_record_count: 1
Options.track_and_verify_wals_in_manifest: 0
Options.verify_sst_unique_id_in_manifest: 1
Options.env: 0x7f80456e2700
Options.fs: PosixFileSystem
Options.info_log: 0x7f80455c4410
Options.max_file_opening_threads: 16
Options.statistics: 0
Options.use_fsync: 0
Options.max_log_file_size: 0
Options.max_manifest_file_size: 1073741824
Options.log_file_time_to_roll: 0
Options.keep_log_file_num: 1000
Options.recycle_log_file_num: 0
Options.allow_fallocate: 1
Options.allow_mmap_reads: 0
Options.allow_mmap_writes: 0
Options.use_direct_reads: 0
Options.use_direct_io_for_flush_and_compaction: 0
Options.create_missing_column_families: 0
Options.db_log_dir:
Options.wal_dir:
Options.table_cache_numshardbits: 9
Options.WAL_ttl_seconds: 0
Options.WAL_size_limit_MB: 0
Options.max_write_batch_group_size_bytes: 1048576
Options.manifest_preallocation_size: 4194304
Options.is_fd_close_on_exec: 1
Options.advise_random_on_open: 1
Options.db_write_buffer_size: 0
Options.write_buffer_manager: 0x7f80456fa550
Options.access_hint_on_compaction_start: 1
Options.random_access_max_buffer_size: 1048576
Options.use_adaptive_mutex: 0
Options.rate_limiter: 0
Options.sst_file_manager.rate_bytes_per_sec: 0
Options.wal_recovery_mode: 2
Options.enable_thread_tracking: 0
Options.enable_pipelined_write: 0
Options.unordered_write: 0
Options.allow_concurrent_memtable_write: 1
Options.enable_write_thread_adaptive_yield: 1
Options.write_thread_max_yield_usec: 100
Options.write_thread_slow_yield_usec: 3
Options.row_cache: None
Options.wal_filter: None
Options.avoid_flush_during_recovery: 1
Options.allow_ingest_behind: 0
Options.two_write_queues: 0
Options.manual_wal_flush: 0
Options.wal_compression: 0
Options.atomic_flush: 0
Options.avoid_unnecessary_blocking_io: 0
Options.persist_stats_to_disk: 0
Options.write_dbid_to_manifest: 0
Options.log_readahead_size: 0
Options.file_checksum_gen_factory: Unknown
Options.best_efforts_recovery: 0
Options.max_bgerror_resume_count: 2147483647
Options.bgerror_resume_retry_interval: 1000000
Options.allow_data_in_errors: 0
Options.db_host_id: __hostname__
Options.enforce_single_del_contracts: true
Options.max_background_jobs: 8
Options.max_background_compactions: -1
Options.max_subcompactions: 4
Options.avoid_flush_during_shutdown: 1
Options.writable_file_max_buffer_size: 1048576
Options.delayed_write_rate : 16777216
Options.max_total_wal_size: 0
Options.delete_obsolete_files_period_micros: 21600000000
Options.stats_dump_period_sec: 600
Options.stats_persist_period_sec: 600
Options.stats_history_buffer_size: 1048576
Options.max_open_files: -1
Options.bytes_per_sync: 0
Options.wal_bytes_per_sync: 0
Options.strict_bytes_per_sync: 0
Options.compaction_readahead_size: 2097152
Options.max_background_flushes: -1
Options.daily_offpeak_time_utc:
Compression algorithms supported:
kZSTDNotFinalCompression supported: 1
kZSTD supported: 1
kXpressCompression supported: 0
kLZ4HCCompression supported: 1
kLZ4Compression supported: 1
kBZip2Compression supported: 1
kZlibCompression supported: 1
kSnappyCompression supported: 1
Fast CRC32 supported: Not supported on x86
DMutex implementation: pthread_mutex_t
[/db_impl/db_impl_open.cc:325] Creating manifest 1
[/column_family.cc:618] --------------- Options for column family [default]:
Options.comparator: leveldb.BytewiseComparator
Options.merge_operator: None
Options.compaction_filter: None
Options.compaction_filter_factory: None
Options.sst_partitioner_factory: None
Options.memtable_factory: SkipListFactory
Options.table_factory: BlockBasedTable
table_factory options: flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f8046934420)
cache_index_and_filter_blocks: 1
cache_index_and_filter_blocks_with_high_priority: 1
pin_l0_filter_and_index_blocks_in_cache: 1
pin_top_level_index_and_filter: 1
index_type: 2
data_block_index_type: 1
index_shortening: 1
data_block_hash_table_util_ratio: 0.750000
checksum: 4
no_block_cache: 0
block_cache: 0x7f80456dde10
block_cache_name: LRUCache
block_cache_options:
capacity : 19327352832
num_shard_bits : 9
strict_capacity_limit : 0
memory_allocator : None
high_pri_pool_ratio: 0.000
low_pri_pool_ratio: 0.000
persistent_cache: 0
block_size: 4096
block_size_deviation: 10
block_restart_interval: 16
index_block_restart_interval: 1
metadata_block_size: 4096
partition_filters: 1
use_delta_encoding: 1
filter_policy: bloomfilter
whole_key_filtering: 1
verify_compression: 0
read_amp_bytes_per_bit: 0
format_version: 5
enable_index_compression: 1
block_align: 0
max_auto_readahead_size: 262144
prepopulate_block_cache: 0
initial_auto_readahead_size: 8192
num_file_reads_for_auto_readahead: 2
Options.write_buffer_size: 268435456
Options.max_write_buffer_number: 2
Options.compression: LZ4
Options.bottommost_compression: Disabled
Options.prefix_extractor: nullptr
Options.memtable_insert_with_hint_prefix_extractor: nullptr
Options.num_levels: 7
Options.min_write_buffer_number_to_merge: 1
Options.max_write_buffer_number_to_maintain: 0
Options.max_write_buffer_size_to_maintain: 0
Options.bottommost_compression_opts.window_bits: -14
Options.bottommost_compression_opts.level: 32767
Options.bottommost_compression_opts.strategy: 0
Options.bottommost_compression_opts.max_dict_bytes: 0
Options.bottommost_compression_opts.zstd_max_train_bytes: 0
Options.bottommost_compression_opts.parallel_threads: 1
Options.bottommost_compression_opts.enabled: false
Options.bottommost_compression_opts.max_dict_buffer_bytes: 0
Options.bottommost_compression_opts.use_zstd_dict_trainer: true
Options.compression_opts.window_bits: -14
Options.compression_opts.level: 32767
Options.compression_opts.strategy: 0
Options.compression_opts.max_dict_bytes: 0
Options.compression_opts.zstd_max_train_bytes: 0
Options.compression_opts.use_zstd_dict_trainer: true
Options.compression_opts.parallel_threads: 1
Options.compression_opts.enabled: true
Options.compression_opts.max_dict_buffer_bytes: 0
Options.level0_file_num_compaction_trigger: 4
Options.level0_slowdown_writes_trigger: 20
Options.level0_stop_writes_trigger: 36
Options.target_file_size_base: 67108864
Options.target_file_size_multiplier: 1
Options.max_bytes_for_level_base: 268435456
Options.level_compaction_dynamic_level_bytes: 1
Options.max_bytes_for_level_multiplier: 10.000000
Options.max_bytes_for_level_multiplier_addtl[0]: 1
Options.max_bytes_for_level_multiplier_addtl[1]: 1
Options.max_bytes_for_level_multiplier_addtl[2]: 1
Options.max_bytes_for_level_multiplier_addtl[3]: 1
Options.max_bytes_for_level_multiplier_addtl[4]: 1
Options.max_bytes_for_level_multiplier_addtl[5]: 1
Options.max_bytes_for_level_multiplier_addtl[6]: 1
Options.max_sequential_skip_in_iterations: 8
Options.max_compaction_bytes: 1677721600
Options.ignore_max_compaction_bytes_for_input: true
Options.arena_block_size: 1048576
Options.soft_pending_compaction_bytes_limit: 68719476736
Options.hard_pending_compaction_bytes_limit: 274877906944
Options.disable_auto_compactions: 0
Options.compaction_style: kCompactionStyleLevel
Options.compaction_pri: kMinOverlappingRatio
Options.compaction_options_universal.size_ratio: 1
Options.compaction_options_universal.min_merge_width: 2
Options.compaction_options_universal.max_merge_width: 4294967295
Options.compaction_options_universal.max_size_amplification_percent: 200
Options.compaction_options_universal.compression_size_percent: -1
Options.compaction_options_universal.stop_style: kCompactionStopStyleTotalSize
Options.compaction_options_fifo.max_table_files_size: 1073741824
Options.compaction_options_fifo.allow_compaction: 0
Options.table_properties_collectors:
Options.inplace_update_support: 0
Options.inplace_update_num_locks: 10000
Options.memtable_prefix_bloom_size_ratio: 0.000000
Options.memtable_whole_key_filtering: 0
Options.memtable_huge_page_size: 0
Options.bloom_locality: 0
Options.max_successive_merges: 0
Options.optimize_filters_for_hits: 0
Options.paranoid_file_checks: 0
Options.force_consistency_checks: 1
Options.report_bg_io_stats: 0
Options.ttl: 2592000
Options.periodic_compaction_seconds: 0
Options.default_temperature: kUnknown
Options.preclude_last_level_data_seconds: 0
Options.preserve_internal_time_seconds: 0
Options.enable_blob_files: false
Options.min_blob_size: 0
Options.blob_file_size: 268435456
Options.blob_compression_type: NoCompression
Options.enable_blob_garbage_collection: false
Options.blob_garbage_collection_age_cutoff: 0.250000
Options.blob_garbage_collection_force_threshold: 1.000000
Options.blob_compaction_readahead_size: 0
Options.blob_file_starting_level: 0
Options.experimental_mempurge_threshold: 0.000000
Options.memtable_max_range_deletions: 0

DB and Compaction stats sample (from one of the RocksDB instances):
One thing I don't understand here is that the compaction stats do not show stats for L1, L2, or L3.

** DB Stats **
Uptime(secs): 90719.4 total, 600.0 interval
Cumulative writes: 328K writes, 4596M keys, 328K commit groups, 1.0 writes per commit group, ingest: 390.42 GB, 4.41 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 1577 writes, 22M keys, 1577 commit groups, 1.0 writes per commit group, ingest: 2101.99 MB, 3.50 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Write Stall (count): write-buffer-manager-limit-stops: 0

** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 2/0 90.52 MB 0.5 0.0 0.0 0.0 83.4 83.4 0.0 1.0 0.0 21.6 3953.51 3675.82 1996 1.981 0 0 0.0 0.0
L4 5/0 314.17 MB 9.1 168.6 74.5 94.1 157.2 63.0 0.0 2.1 34.6 32.2 4994.20 8400.16 433 11.534 8773M 594M 0.0 0.0
L5 26/1 1.39 GB 9.6 172.8 67.5 105.3 157.1 51.8 3.8 2.3 13.6 12.4 13020.93 8277.02 1074 12.124 8922M 881M 0.0 0.0
L6 207/2 12.75 GB 0.0 329.2 54.5 274.6 287.4 12.7 0.0 5.3 7.8 6.8 43398.83 16110.04 958 45.301 14G 2290M 0.0 0.0
Sum 240/3 14.53 GB 0.0 670.5 196.5 474.1 685.1 211.0 3.8 8.2 10.5 10.7 65367.47 36463.04 4461 14.653 32G 3767M 0.0 0.0
Int 0/0 0.00 KB 0.0 3.9 1.1 2.8 3.9 1.1 0.0 8.9 6.2 6.4 635.56 215.95 24 26.482 170M 17M 0.0 0.0

** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Low 0/0 0.00 KB 0.0 670.5 196.5 474.1 601.7 127.6 0.0 0.0 11.2 10.0 61413.96 32787.22 2465 24.914 32G 3767M 0.0 0.0
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 83.4 83.4 0.0 0.0 0.0 21.6 3953.51 3675.82 1996 1.981 0 0 0.0 0.0

Blob file count: 0, total size: 0.0 GB, garbage size: 0.0 GB, space amp: 0.0

Uptime(secs): 90719.4 total, 600.0 interval
Flush(GB): cumulative 83.402, interval 0.441
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 685.06 GB write, 7.73 MB/s write, 670.54 GB read, 7.57 MB/s read, 65367.5 seconds
Interval compaction: 3.95 GB write, 6.74 MB/s write, 3.87 GB read, 6.60 MB/s read, 635.6 seconds
Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-
Block cache LRUCache@0x7f80456dde10#1 capacity: 18.00 GB seed: 982229212 usage: 17.99 GB table_size: 8388608 occupancy: 4435826 collections: 77 last_copies: 4 last_secs: 1.51524 secs_since: 600
Block cache entry stats(count,size,portion): DataBlock(3438468,13.42 GB,74.5779%) FilterBlock(488656,1.40 GB,7.79453%) FilterMetaBlock(700,52.52 MB,0.284952%) IndexBlock(502388,1.90 GB,10.5639%) WriteBuffer(3315,828.75 MB,4.49626%) Misc(3

Disk IOPS (negative values are read IOPS): [attached chart]

Disk volume usage: [attached chart]

Memory usage (the yellow line is the WSS): [attached chart]

Thanks in advance,
Luis Alves

Mark Callaghan

Mar 4, 2024, 7:42:45 PM
to rocksdb
1) Level numbering with dynamic leveled compaction


> One thing I don't understand here is that the compaction stats does not show stats for L1, L2, or L3.

You are using dynamic leveled compaction: Options.level_compaction_dynamic_level_bytes: 1
Comments for that option are here -> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/advanced_options.h#L460

The option is great, although it leads to confusion for level names. Most implementations of leveled compaction fill L0, then L1, then L2, ...
The problem with this is that you frequently end up with a state where sizeof(Lmax) <= sizeof(Lmax_prev) where Lmax is the max level and Lmax_prev is the next to max level.
The problem with that state is you can have a lot more space-amp than you expect from leveled. For example, if sizeof(Lmax) == sizeof(Lmax_prev) and the workload is update-only, then every KV pair in Lmax_prev can be a duplicate of a KV pair in Lmax, and in that case the space-amp is ~2X.

Let's call the standard approach top-down; with dynamic leveled, the approach is bottom-up. Assume you configured RocksDB to have 8 levels (L0 through L7). When L0 spills, it spills straight to Lmax, so at that point only L0 and Lmax (L0 and L7) exist. When Lmax becomes larger than it should be, given the expected size for the first level after L0, a new level is added between L0 and Lmax, so now you have L0, Lmax_prev, Lmax (L0, L6, L7). And this continues. It is confusing when you (or I) first see it, but I like the feature.
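
To make that concrete with your numbers (my arithmetic, assuming I'm reading the option comments correctly): with dynamic leveled, the target for Lmax is its actual size, ~12.75 GB for L6 here, and each level above it targets 1/max_bytes_for_level_multiplier of the level below, so L5 targets ~1.28 GB and L4 targets ~128 MB. A level whose target would drop below max_bytes_for_level_base / max_bytes_for_level_multiplier (256 MB / 10 = 25.6 MB) is not used, which is why your compaction stats show only L0, L4, L5, and L6.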

2) IO saturation

Are your reads mostly point queries, mostly range queries, or a mix of both?

From the compaction IO stats you shared, the database was 14.53G with L6 being 12.75G. 
All levels are compressed -> Options.compression: LZ4 and Options.bottommost_compression: Disabled
The RocksDB block cache is 18G, which is not large enough to cache the database: the database is ~15G compressed on disk, and the block cache stores uncompressed blocks.
So there can be reads from storage as you report.
From the config I think the bloom filter is enabled.

The challenge is to figure out what consumes the most IO -- compaction reads, compaction writes, user queries.
Well, it is easy to know whether writes are an issue via iostat or some other IO monitoring utility.
Figuring out whether reads from storage are more for compaction or more for user queries can be more of a challenge.

Can you share iostat data, or something that shows read IOPs, read MB/s (or average read size), write IOPs, write MB/s (or average write size)?
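
(Something like "iostat -t nvme1n1 10" from the sysstat package should do it: it prints a timestamp plus tps, kB_read/s, and kB_wrtn/s for the device every 10 seconds.)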

Luis Alves

Mar 5, 2024, 10:21:49 AM
to rocksdb
Hi Mark!

Thanks for the quick reply, and thanks for the explanation on dynamic leveled compaction and the compaction stats.

> From the config I think the bloom filter is enabled.
Yes, it is enabled.

> Are your reads mostly point queries, mostly range queries, or a mix of both?
All the reads are point queries. To give more context, this is an event processing system that, in the worst case, may have to perform around 300 point queries to RocksDB per event. We optimized the way we store data in RocksDB so that, most of the time, the first point query also fetches the data that the subsequent point queries for the same event will look up (i.e., it will already be in the block cache or the OS page cache).

> Can you share iostat data, or something that shows read IOPs, read MB/s (or average read size), write IOPs, write IOPs (or average write size)?
Here is the iostat data (the device in use is nvme1n1):

03/05/24 12:47:52
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.56    0.07    2.60   24.94    0.01   62.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3016.40     13584.80     14241.20     135848     142412

03/05/24 12:48:02
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.88    0.08    1.67   22.45    0.01   64.91

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3015.90     13620.00     15361.60     136200     153616

03/05/24 12:48:12
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.06    0.08    2.63   42.93    0.02   45.28

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3019.00     14242.40      5764.00     142424      57640

03/05/24 12:48:22
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.71    0.08    1.64   45.04    0.01   43.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3014.20     13465.60      8277.20     134656      82772

03/05/24 12:48:32
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.76    0.11    2.14   45.23    0.01   41.74

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3017.80     13759.60      4206.40     137596      42064

03/05/24 12:48:42
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.03    0.09    1.83   47.17    0.02   41.86

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3014.30     13042.00     33210.00     130420     332100

03/05/24 12:48:52
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.22    0.08    2.14   34.35    0.01   54.21

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3022.30     13113.60     23137.60     131136     231376

03/05/24 12:49:02
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.88    0.08    2.00   44.81    0.02   44.22

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3019.50     13692.40      3463.20     136924      34632

03/05/24 12:49:12
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.55    0.09    1.90   46.75    0.01   41.69

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1        3022.80     13831.20     21986.40     138312     219864

Here are the compactions for the same time interval:

2024/03/05-12:48:02.784015 140716976462640 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 338 Base level 5, inputs: [20273(41MB)], [19645(69MB) 19650(69MB) 19652(46MB) 19049(69MB)]
2024/03/05-12:48:02.784039 140716976462640 EVENT_LOG_v1 {"time_micros": 1709642882784026, "job": 345, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L5": [20273], "files_L6": [19645, 19650, 19652, 19049], "score": 9.21739, "input_data_size": 311438773, "oldest_snapshot_seqno": -1}
2024/03/05-12:48:02.917377 140708333902640 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 338 Base level 0, inputs: [20293(45MB) 20289(45MB) 20284(45MB) 20276(46MB)], [20265(54MB) 20268(30MB) 20262(75MB) 20267(24MB) 20266(36MB)]
2024/03/05-12:48:02.917438 140708333902640 EVENT_LOG_v1 {"time_micros": 1709642882917407, "job": 346, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [20293, 20289, 20284, 20276], "files_L4": [20265, 20268, 20262, 20267, 20266], "score": 1, "input_data_size": 424328918, "oldest_snapshot_seqno": -1}
2024/03/05-12:48:46.316195 140716976315184 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 339 Base level 4, inputs: [20300(28MB)], [20204(63MB)]
2024/03/05-12:48:46.316211 140716976315184 EVENT_LOG_v1 {"time_micros": 1709642926316202, "job": 347, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L4": [20300], "files_L5": [20204], "score": 15.0259, "input_data_size": 96436904, "oldest_snapshot_seqno": -1}
2024/03/05-12:48:46.316269 140708313172784 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 339 Base level 4, inputs: [20302(12MB)], [20222(40MB)]
2024/03/05-12:48:46.316283 140708313172784 EVENT_LOG_v1 {"time_micros": 1709642926316276, "job": 348, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L4": [20302], "files_L5": [20222], "score": 13.9134, "input_data_size": 54939459, "oldest_snapshot_seqno": -1}
2024/03/05-12:48:46.316341 140708313467696 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 339 Base level 4, inputs: [20299(32MB)], [20246(11MB) 20241(66MB) 20270(63MB)]
2024/03/05-12:48:46.316357 140708313467696 EVENT_LOG_v1 {"time_micros": 1709642926316349, "job": 349, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L4": [20299], "files_L5": [20246, 20241, 20270], "score": 13.4384, "input_data_size": 183428842, "oldest_snapshot_seqno": -1}
2024/03/05-12:48:46.316369 140708333902640 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 339 Base level 4, inputs: [20301(23MB)], [20206(68MB) 20212(38MB)]
2024/03/05-12:48:46.316380 140708333902640 EVENT_LOG_v1 {"time_micros": 1709642926316374, "job": 350, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L4": [20301], "files_L5": [20206, 20212], "score": 12.1597, "input_data_size": 137648589, "oldest_snapshot_seqno": -1}
2024/03/05-12:49:06.664642 140708313172784 [/compaction/compaction_job.cc:2106] [default]: Compaction start summary: Base version 340 Base level 4, inputs: [20296(71MB)], [20271(71MB) 20274(63MB) 20278(64MB) 20280(64MB) 20283(50MB) 20217(19MB) 20199(69MB)]
2024/03/05-12:49:06.664668 140708313172784 EVENT_LOG_v1 {"time_micros": 1709642946664653, "job": 351, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L4": [20296], "files_L5": [20271, 20274, 20278, 20280, 20283, 20217, 20199], "score": 11.2398, "input_data_size": 498990148, "oldest_snapshot_seqno": -1}

Also, for the same period:

1. Disk read IOPS: [attached chart]
2. Disk write IOPS: [attached chart]
3. Disk read throughput: [attached chart]
4. Disk write throughput: [attached chart]

Thanks,
Luis Alves

Mark Callaghan

Mar 6, 2024, 11:12:08 AM
to Luis Alves, rocksdb
Looking at the peaks in the graphs: when read IOPs approach 3000/s, read MB/s approaches 14.2.
At that point you saturated your IO budget doing reads of ~5KB each, assuming the budget was 3000 IOPs.

I assume but can't be certain that this is from user queries, not compaction. What is the user query workload at that point?

From the config you shared, all levels are compressed with lz4 and block_size=4096. The compressed blocks stored on disk are probably smaller than 4KB, but they are not aligned to file system (4KB) boundaries, so a read of a data block that is ~3KB might span two file system pages of 4KB each. This might explain why the read size per iostat was ~5KB, assuming most reads were for user queries.
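
To put rough numbers on that (my arithmetic): 14.2 MB/s divided by ~3000 reads/s is ~4.7KB per read. And a ~3KB block that is not page-aligned stays inside a single 4KB page only when it starts in the first ~1KB of the page; otherwise the read touches two pages (8KB instead of 4KB), which drags the average read size above 4KB.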

But I still have some doubt about the ratio of storage reads done for user queries versus compaction.

One thing you can try is block_align=true. That will avoid the problem of data blocks spanning file system page boundaries, but at the cost of using more space.
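
If you try it from RocksJava, it looks roughly like this (a sketch with assumed option names and a made-up path; note that, as far as I remember, RocksDB rejects block_align when compression is enabled, so the sketch also turns compression off):

    import org.rocksdb.*;

    public class BlockAlignSketch {
      public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockSize(4 * 1024)  // the 4KB block size from this thread
            .setBlockAlign(true);    // pad blocks to file system page boundaries

        try (Options options = new Options()
                .setCreateIfMissing(true)
                // block_align is rejected when compression is on (as far as I
                // remember), so this sketch also disables compression:
                .setCompressionType(CompressionType.NO_COMPRESSION)
                .setTableFormatConfig(tableConfig);
             RocksDB db = RocksDB.open(options, "/tmp/block-align-test")) {
          // open/close just to validate the configuration
        }
      }
    }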
--
Mark Callaghan
mdca...@gmail.com

Luis Alves

Mar 18, 2024, 1:42:16 PM
to rocksdb

Hi Mark!

Sorry for the late reply, I got a bit sidetracked on other things. I was able to do a new test where I increased the block_size from the default of 4KiB to 16KiB, 32KiB, and 64KiB. 

My theory was that, as I mentioned, for each event the system processes it needs to perform ~300 point queries. The records that these queries return are most of the time colocated (by design), which means that reading one of the records also brings the remaining ones from disk into the Block Cache and/or the OS Page Cache. I estimated the key/value size at around 200 bytes (decompressed), which means that each event reads ~60KiB. So, by increasing the block size to >60KiB, I should be able to read all the data records for an event with roughly a single read op.

What I observed was that the read IOPS decreased as I increased the block_size (as expected) - with a block size of 64KiB I managed to reduce the read IOPS to around 2k.

I will still try enabling block_align to see if it further decreases the read IOPS.

Thanks,
Luis Alves

Mark Callaghan

Mar 18, 2024, 1:50:31 PM
to Luis Alves, rocksdb
That is good news.

Back in the day we spent much time dealing with IOPs shortages. Those were demanding but fun problems.




--
Mark Callaghan
mdca...@gmail.com