Read performance regression when `cache_index_and_filter_blocks` enabled


Yun Tang

Sep 2, 2018, 11:55:53 AM
to rocksdb
Hi all

We've been using RocksDB-v4.2.0, embedded in Flink (a distributed stream processing system) through the Java API, for more than a year. Our RocksDB configuration is listed below:
 

    rocksdb.block.cache-index-filter: true
    rocksdb.block.block-size: 4 kb
    rocksdb.compaction.level1-max-size: 512 mb
    rocksdb.compaction.level0-files-num-trigger: 4
    rocksdb.compaction.level1-file-target-size: 64 mb
    rocksdb.ttl: 3 d
    rocksdb.block.cache-size: 256 mb
    rocksdb.writebuffer.size: 32 mb
    rocksdb.writebuffer.number: 4
    max_open_files: -1
 

As you can see, we set cache_index_and_filter_blocks to true, and the read performance is quite good. However, if we bump the RocksDB version to 4.11.2 or 5.7.5 with the same configuration, the read performance regresses severely under heavy reads. I referred to the RocksDB-Tuning-Guide and the Memory-usage-in-RocksDB doc; this behavior seems expected, since cached index and filter blocks occupy more space than cached data blocks. I have tried calling setPinL0FilterAndIndexBlocksInCache(true) and setOptimizeFiltersForHits(true) through the Java API according to the tuning guide, but the read performance still cannot match RocksDB-v4.2.0's.
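For context, the options above are applied through RocksJava roughly as follows. This is only a sketch: the setter names are taken from recent RocksJava releases, and some of them (e.g. setPinL0FilterAndIndexBlocksInCache) do not exist in v4.2.0.

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.Options;

// Sketch of the configuration listed above, expressed via RocksJava.
// Setter names follow recent RocksJava releases and may differ in 4.2.0.
BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
    .setBlockSize(4 * 1024)                     // rocksdb.block.block-size: 4 kb
    .setBlockCacheSize(256L * 1024 * 1024)      // rocksdb.block.cache-size: 256 mb
    .setCacheIndexAndFilterBlocks(true)         // rocksdb.block.cache-index-filter: true
    .setPinL0FilterAndIndexBlocksInCache(true)  // per the tuning guide (newer releases only)
    .setFilter(new BloomFilter(10, false));

Options options = new Options()
    .setTableFormatConfig(tableConfig)
    .setWriteBufferSize(32 * 1024 * 1024)       // rocksdb.writebuffer.size: 32 mb
    .setMaxWriteBufferNumber(4)                 // rocksdb.writebuffer.number: 4
    .setMaxOpenFiles(-1)                        // max_open_files: -1
    .setOptimizeFiltersForHits(true);           // per the tuning guide (newer releases only)
```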

Thus, I have two questions:
  1. Why could RocksDB-v4.2.0 achieve much better read performance even with cache_index_and_filter_blocks set to true? Did the option actually not take effect?
  2. How can we match (or at least approach) RocksDB-v4.2.0's read performance with cache_index_and_filter_blocks enabled, since we must control memory usage? I tried to use Partitioned-Index-Filters, but I did not find a Java API to set partition_filters to true.
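For what it's worth, what I was looking for would be something like the following, which later RocksJava releases appear to expose on BlockBasedTableConfig. This is a hedged sketch; I have not verified these setters in every version, and they are certainly absent from v4.2.0.

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.IndexType;

// Hedged sketch: enabling partitioned index/filters in later RocksJava
// releases. None of these setters exist in RocksJava 4.2.0.
BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
    .setIndexType(IndexType.kTwoLevelIndexSearch) // required by partitioned filters
    .setPartitionFilters(true)                    // partition_filters = true
    .setMetadataBlockSize(4096)                   // size of each index/filter partition
    .setCacheIndexAndFilterBlocks(true);
```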
Hoping for your answers; thanks in advance.

Best
Yun Tang

Siying Dong

Sep 4, 2018, 12:58:12 PM
to rocksdb
What's your index and filter block hit rate? I wonder whether the regression happens because of a lower hit rate, so that more index and filter blocks are read from the file system.

Yun Tang

Sep 5, 2018, 11:39:47 AM
to rocksdb
Hi Siying

Thanks for your reply. I added metrics for a simplified perf test; you can find the source code, rocksdb-perf, here. Although rocksdb-v4.2.0 has some problems reporting statistics (e.g. the count for num-keys-read is always zero, and the count for mem-table-hit is far larger than the number of keys read), we can still see that rocksdb-v5.7.5 does not use the memtable as efficiently as rocksdb-v4.2.0. The statistics below were generated by the get phase after a checkpoint was created.

 STATISTICS for get 15000000 key-value pairs randomly from rocksdb-v4.2.0, consumed 74.484 seconds:
        blockCache Hit: 45699058
        blockCache Miss: 158385
        blockCacheIndex Hit: 14792248
        blockCacheIndex Miss: 5
        blockCacheFilter Hit: 15623094
        blockCacheFilter Miss: 5
        blockCacheData Hit: 15283716
        blockCacheData Miss: 158375
        memTable Hit: 0
        memTable Miss: 15000000
        numKeys read: 15000000
        numKeys written: 0
        numKeys updated: 0

        
  STATISTICS for get 15000000 key-value pairs randomly from rocksdb-v5.7.5, consumed 63.964 seconds:
        blockCache Hit: 96870851
        blockCache Miss: 170525
        blockCacheIndex Hit: 15597897
        blockCacheIndex Miss: 5
        blockCacheFilter Hit: 65775939
        blockCacheFilter Miss: 5
        blockCacheData Hit: 15497015
        blockCacheData Miss: 170515
        memTable Hit: 448881433
        memTable Miss: 50178050
        numKeys read: 0
        numKeys written: 0
        numKeys updated: 0

I also tried not creating a checkpoint after the data was put; the memtable hit count of rocksdb-v5.7.5 still cannot compare with rocksdb-v4.2.0's.

How can I fix this problem?

I'll also run more complex performance tests, since I met a more serious performance regression in a real job.

BTW, how can I use the Java API to set partition_filters to true?


On Wednesday, September 5, 2018 at 12:58:12 AM UTC+8, Siying Dong wrote:

Yun Tang

Sep 6, 2018, 2:44:15 AM
to rocksdb
Sorry, I mistakenly swapped the labels of the two RocksDB versions below; the better one is rocksdb-v4.2.0. The README of my repo records the correct numbers.

On Wednesday, September 5, 2018 at 11:39:47 PM UTC+8, Yun Tang wrote:
Hi Siying

Thanks for your reply. I added metrics for a simplified perf test; you can find the source code, rocksdb-perf, here. Although rocksdb-v4.2.0 has some problems reporting statistics (e.g. the count for num-keys-read is always zero, and the count for mem-table-hit is far larger than the number of keys read), we can still see that rocksdb-v5.7.5 does not use the memtable as efficiently as rocksdb-v4.2.0. The statistics below were generated by the get phase after a checkpoint was created.

 STATISTICS for get 15000000 key-value pairs randomly from rocksdb-v5.7.5, consumed 74.484 seconds:
        blockCache Hit: 45699058
        blockCache Miss: 158385
        blockCacheIndex Hit: 14792248
        blockCacheIndex Miss: 5
        blockCacheFilter Hit: 15623094
        blockCacheFilter Miss: 5
        blockCacheData Hit: 15283716
        blockCacheData Miss: 158375
        memTable Hit: 0
        memTable Miss: 15000000
        numKeys read: 15000000
        numKeys written: 0
        numKeys updated: 0

        
  STATISTICS for get 15000000 key-value pairs randomly from rocksdb-v4.2.0, consumed 63.964 seconds:
        blockCache Hit: 96870851
        blockCache Miss: 170525
        blockCacheIndex Hit: 15597897
        blockCacheIndex Miss: 5
        blockCacheFilter Hit: 65775939
        blockCacheFilter Miss: 5
        blockCacheData Hit: 15497015
        blockCacheData Miss: 170515
        memTable Hit: 448881433

Yun Tang

Sep 11, 2018, 1:51:49 PM
to rocksdb
Since the Java API for getting metrics in RocksDB-v4.2.0 seems very buggy, I instead read the statistics from RocksDB's LOG, which is dumped every 60 seconds when compaction happens, and recorded the statistics below:
                                          RocksDB-v4.2.0    RocksDB-v5.7.5
  rocksdb.block.cache.miss COUNT              423,709           372,637
  rocksdb.block.cache.hit COUNT           148,501,931        57,916,857
  rocksdb.block.cache.add COUNT               204,892           192,240
  rocksdb.block.cache.index.miss COUNT             37                36
  rocksdb.block.cache.index.hit COUNT      30,280,691        27,141,812
  rocksdb.block.cache.filter.miss COUNT            37                36
  rocksdb.block.cache.filter.hit COUNT     88,142,795        27,141,812
  rocksdb.block.cache.data.miss COUNT         423,635           372,565
  rocksdb.block.cache.data.hit COUNT       30,078,445        29,975,024
  rocksdb.memtable.hit COUNT                  403,758           508,608
  rocksdb.memtable.miss COUNT              29,596,242        29,491,392
  rocksdb.number.keys.read COUNT           30,000,000        30,000,000
  rocksdb.number.keys.written COUNT        10,843,404        10,898,283

For overall performance, RocksDB-v4.2.0 performs about 15% better than RocksDB-v5.7.5 when index & filter blocks are stored in the block cache. The block cache is set to 256 MB in both cases, the write buffers to 4 * 32 MB, and both use a BuiltinBloomFilter(10, false) filter. I also checked the number of L0 and L6 files: both have 2 L0 files and 2 L6 files.
From the statistics you can see that the numbers of keys read and written are nearly the same, since the two runs share the same benchmark code. However, RocksDB-v4.2.0 has a much larger rocksdb.block.cache.filter.hit count than RocksDB-v5.7.5, which might affect overall performance. Why does this number differ so much?
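For completeness, in newer RocksJava versions the same tickers can also be read programmatically, roughly like this. This is a hedged sketch assuming the 5.x Statistics API (which v4.2.0 lacks); the database path is just a placeholder.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.Statistics;
import org.rocksdb.TickerType;

// Hedged sketch: reading block-cache and memtable tickers via the
// RocksJava 5.x Statistics API. The path below is a placeholder.
Statistics stats = new Statistics();
Options options = new Options()
    .setCreateIfMissing(true)
    .setStatistics(stats);

try (RocksDB db = RocksDB.open(options, "/tmp/perf-test-db")) {
    // ... run the random-get workload here ...
    System.out.println("filter hit:    " + stats.getTickerCount(TickerType.BLOCK_CACHE_FILTER_HIT));
    System.out.println("filter miss:   " + stats.getTickerCount(TickerType.BLOCK_CACHE_FILTER_MISS));
    System.out.println("memtable hit:  " + stats.getTickerCount(TickerType.MEMTABLE_HIT));
    System.out.println("memtable miss: " + stats.getTickerCount(TickerType.MEMTABLE_MISS));
}
```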

Best
Yun

On Thursday, September 6, 2018 at 2:44:15 PM UTC+8, Yun Tang wrote:

Yun Tang

Sep 17, 2018, 5:56:50 AM
to rocksdb
Hi all,

I used the RocksDB built-in benchmark (db_bench), as suggested, to compare the performance of rocksdb-v4.2.0 and v5.14.2 on the same machine, with a 24-core 2.2 GHz CPU and 94 GB of memory in total.
I used the two steps below to compare performance:
  1. fillseq 10 million keys
  2. readrandom the same 10 million keys
RocksDB-4.2.0 fill seq command:
./db_bench -benchmarks=fillseq -use_existing_db=0 -disable_auto_compactions=0 -sync=0 -db=<rocksdb-dir-1> -wal_dir=<wal-dir-1> -disable_data_sync=0 -num=10000000 -num_levels=6 -key_size=32 -value_size=10 -block_size=4096 -cache_size=268435456 -cache_numshardbits=-1 -compression_type=snappy -min_level_to_compress=3 -compression_ratio=0.5 -level_compaction_dynamic_level_bytes=true -bytes_per_sync=0 -cache_index_and_filter_blocks=1 -write_buffer_size=33554432 -max_write_buffer_number=4 -target_file_size_base=67108864 -max_bytes_for_level_base=536870912 -statistics=0 -stats_per_interval=1 -stats_interval_seconds=60 -histogram=1 -memtablerep=skip_list -bloom_bits=10 -open_files=-1 -max_background_compactions=2 -max_background_flushes=2 -level0_file_num_compaction_trigger=4 -threads=1 -disable_wal=1 -seed=1537156504 2>&1 

RocksDB-4.2.0 read random command:
./db_bench -benchmarks=readrandom -use_existing_db=1 -db=<rocksdb-dir-1> -wal_dir=<wal-dir-1> -disable_data_sync=0 -num=10000000 -num_levels=6 -key_size=32 -value_size=15 -block_size=4096 -cache_size=268435456 -cache_numshardbits=-1 -compression_type=snappy -min_level_to_compress=3 -compression_ratio=0.5 -level_compaction_dynamic_level_bytes=true -bytes_per_sync=0 -cache_index_and_filter_blocks=1 -write_buffer_size=33554432 -max_write_buffer_number=4 -target_file_size_base=67108864 -max_bytes_for_level_base=536870912 -statistics=1 -stats_per_interval=1 -stats_interval_seconds=60 -histogram=1 -memtablerep=skip_list -bloom_bits=10 -open_files=-1 -level0_file_num_compaction_trigger=4 -max_background_compactions=2 -max_background_flushes=2 -threads=1 -disable_wal=1 -seed=1537157278 2>&1

and get the performance result:
readrandom   :       6.994 micros/op 142982 ops/sec;    5.7 MB/s (10000000 of 10000000 found)


RocksDB-5.14.2 fill seq command:
./db_bench -benchmarks=fillseq -use_existing_db=0 -disable_auto_compactions=0 -sync=0 -db=<rocksdb-dir-2> -wal_dir=<wal-dir-2> -num=10000000 -num_levels=6 -key_size=32 -value_size=10 -block_size=4096 -cache_size=268435456 -cache_numshardbits=-1 -compression_type=snappy -min_level_to_compress=3 -compression_ratio=0.5 -level_compaction_dynamic_level_bytes=true -bytes_per_sync=0 -cache_index_and_filter_blocks=1 -write_buffer_size=33554432 -max_write_buffer_number=4 -target_file_size_base=67108864 -max_bytes_for_level_base=536870912 -statistics=0 -stats_per_interval=1 -stats_interval_seconds=60 -histogram=1 -memtablerep=skip_list -bloom_bits=10 -open_files=-1 -max_background_compactions=2 -max_background_flushes=2 -level0_file_num_compaction_trigger=4 -threads=1 -disable_wal=1 -seed=1537156504 2>&1
 
RocksDB-5.14.2 read random command:
./db_bench -benchmarks=readrandom -use_existing_db=1 -db=<rocksdb-dir-2> -wal_dir=<wal-dir-2> -num=10000000 -num_levels=6 -key_size=32 -value_size=15 -block_size=4096 -cache_size=268435456 -cache_numshardbits=-1 -compression_type=snappy -min_level_to_compress=3 -compression_ratio=0.5 -level_compaction_dynamic_level_bytes=true -bytes_per_sync=0 -cache_index_and_filter_blocks=1 -write_buffer_size=33554432 -max_write_buffer_number=4 -target_file_size_base=67108864 -max_bytes_for_level_base=536870912 -statistics=1 -stats_per_interval=1 -stats_interval_seconds=60 -histogram=1 -memtablerep=skip_list -bloom_bits=10 -open_files=-1 -level0_file_num_compaction_trigger=4 -max_background_compactions=2 -max_background_flushes=2 -threads=1 -disable_wal=1 -seed=1537157278 2>&1

and get the performance result:
readrandom   :       9.424 micros/op 106110 ops/sec;    4.3 MB/s (10000000 of 10000000 found)
If I instead read the files generated by rocksdb-4.2.0, I get this result:
readrandom   :       9.561 micros/op 104593 ops/sec;    4.2 MB/s (10000000 of 10000000 found)

You can see there is about a 33% performance regression; I hope you can help explain why.

Thanks.


On Wednesday, September 12, 2018 at 1:51:49 AM UTC+8, Yun Tang wrote:

Yun Tang

Sep 17, 2018, 6:16:25 AM
to rocksdb
I forgot to turn on pin_l0_filter_and_index_blocks_in_cache and optimize_filters_for_hits, and to set verify_checksum=0, for RocksDB-5.14.2. The new readrandom command:
./db_bench -benchmarks=readrandom -use_existing_db=1 -db=<rocksdb-dir-2> -wal_dir=<wal-dir-2> -num=10000000 -num_levels=6 -key_size=32 -value_size=15 -block_size=4096 -cache_size=268435456 -cache_numshardbits=-1 -compression_type=snappy -min_level_to_compress=3 -compression_ratio=0.5 -level_compaction_dynamic_level_bytes=true -bytes_per_sync=0 -cache_index_and_filter_blocks=1 -write_buffer_size=33554432 -max_write_buffer_number=4 -target_file_size_base=67108864 -max_bytes_for_level_base=536870912 -statistics=1 -stats_per_interval=1 -stats_interval_seconds=60 -histogram=1 -memtablerep=skip_list -bloom_bits=10 -open_files=-1 -level0_file_num_compaction_trigger=4 -max_background_compactions=2 -max_background_flushes=2 -threads=1 -pin_l0_filter_and_index_blocks_in_cache=1 -optimize_filters_for_hits=1 -disable_wal=1 -verify_checksum=0 -seed=1537157278 2>&1

and the read performance:
readrandom   :       8.178 micros/op 122275 ops/sec;    4.9 MB/s (10000000 of 10000000 found)

The performance regression is now only about 17%. Is that by design? I'll continue to look for worse cases.

On Monday, September 17, 2018 at 5:56:50 PM UTC+8, Yun Tang wrote:

MARK CALLAGHAN

Sep 18, 2018, 5:04:28 PM
to tang...@gmail.com, rocksdb
I am repeating a variant of your test and will have results in a few days.

I have done cross-release testing in the past and didn't find serious problems. My scripts are at https://github.com/mdcallag/mytools/tree/master/bench/rocksdb.db_bench
It is a bit of work to keep these scripts current given the changes to options.



--
Mark Callaghan
mdca...@gmail.com

MARK CALLAGHAN

Sep 21, 2018, 1:24:45 PM
to Yun Tang, rocksdb
Your tests inspired mine. I see that Get is between 10% and 30% slower when going from 4.2 to 5.14.3
https://github.com/facebook/rocksdb/issues/4417
--
Mark Callaghan
mdca...@gmail.com

Siying Dong

Oct 12, 2018, 6:30:31 PM
to rocksdb
Thank you for reporting this. I can reproduce some regression in this benchmark setting when comparing recent releases to earlier ones; the rate of regression is similar to what you saw.
I went ahead and compared some historic releases and tried to figure out which version contributed the most. It looks to me like the regression accumulated over time. The release with a relatively large regression was 4.13 => 5.0, where I saw around a 10% regression in a setting similar to yours. However, I can't find an obvious change that could account for it.

The perf context comparison:

4.13:  user_key_comparison_count = 144025604, block_cache_hit_count = 14988500, block_read_count = 11500, block_read_byte = 48375834, block_read_time = 26,934,492, block_checksum_time = 7,301,280, block_decompress_time = 30,230,259, get_snapshot_time = 293,031,382, get_post_process_time = 278,159,389, get_from_output_files_time = 13,133,321,848, read_index_block_nanos = 520,708,663, read_filter_block_nanos = 527,402,657, new_table_block_iter_nanos = 2,093,206,796, new_table_iterator_nanos = 0, block_seek_nanos = 6,088,759,671, 
5.0:   user_key_comparison_count = 144025604, block_cache_hit_count = 14988500, block_read_count = 11500, block_read_byte = 48375834, block_read_time = 27,934,970, block_checksum_time = 7,054,535, block_decompress_time = 34,332,405, get_snapshot_time = 313,551,561, get_post_process_time = 282,546,988, get_from_output_files_time = 14,092,548,791, read_index_block_nanos = 563,527,307, read_filter_block_nanos = 564,471,797, new_table_block_iter_nanos = 2,368,650,515, new_table_iterator_nanos = 0, block_seek_nanos = 6,379,372,483, 

From the counters we can see that the two releases did exactly the same number of key comparisons and the same block reads, but the newer one is simply slower when getting blocks from the cache and binary-searching within them. However, there is no code change to block caching or block reading between the two releases.

Your benchmark setting uses a very small working set (40 MB, which is likely to fit entirely in the CPU cache) and a single thread reading from it. I'm not surprised that RocksDB does worse in this specific setting. Over time we have added more features, which make the code more complicated and add more counters, while we have made few improvements targeting this scenario. When I run with a larger working set, the regression is much lower.

We do plan to add more scenarios to the daily regression tests, so perhaps a pure-memory, single-thread test like this can be added so that we can catch regressions like this earlier. For now we can't find anything obviously wrong yet; maybe some other team member has the bandwidth to dig a little further.

By the way, "-cache_numshardbits=-1" is a feature not yet supported in 4.2, so it's undefined behavior there. To be fair, you should probably use "4" in both releases; it doesn't change the benchmark result much, though. If the performance gap is not a blocker for you, I suggest upgrading to a newer release anyway for better support from the community.
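In RocksJava terms, fixing the shard count explicitly looks roughly like this (a sketch; the LRUCache constructor below is from newer RocksJava releases and is not available in 4.2):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;

// Hedged sketch: pin the block-cache shard count explicitly instead of
// passing -1. LRUCache(capacity, numShardBits) is a newer RocksJava API.
BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
    .setBlockCache(new LRUCache(256L * 1024 * 1024, 4)); // 256 MB, 2^4 shards
```

The db_bench equivalent is simply -cache_numshardbits=4 in both runs.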

Bests,

Siying

Siying Dong

Oct 12, 2018, 6:33:03 PM
to rocksdb
In my runs, I'm not able to see the big regression in the large --num case; results are within a few percent comparing 4.2 and more recent releases.
I do see regression similar to what you see with --num at 1M and 10M.