Thank you for reporting this. I can reproduce a regression in this benchmark setting when comparing the recent releases with the older one. The size of the regression is similar to what you saw.
I went ahead and compared some historic releases to figure out which version contributed the most. It looks to me like the regression accumulated over time. The step with a relatively large regression was 4.13 => 5.0, where I saw around a 10% regression in a setting similar to yours. However, I can't find an obvious change that would explain it.
The perf context comparison:
4.13: user_key_comparison_count = 144025604, block_cache_hit_count = 14988500, block_read_count = 11500, block_read_byte = 48375834, block_read_time = 26,934,492, block_checksum_time = 7,301,280, block_decompress_time = 30,230,259, get_snapshot_time = 293,031,382, get_post_process_time = 278,159,389, get_from_output_files_time = 13,133,321,848, read_index_block_nanos = 520,708,663, read_filter_block_nanos = 527,402,657, new_table_block_iter_nanos = 2,093,206,796, new_table_iterator_nanos = 0, block_seek_nanos = 6,088,759,671,
5.0: user_key_comparison_count = 144025604, block_cache_hit_count = 14988500, block_read_count = 11500, block_read_byte = 48375834, block_read_time = 27,934,970, block_checksum_time = 7,054,535, block_decompress_time = 34,332,405, get_snapshot_time = 313,551,561, get_post_process_time = 282,546,988, get_from_output_files_time = 14,092,548,791, read_index_block_nanos = 563,527,307, read_filter_block_nanos = 564,471,797, new_table_block_iter_nanos = 2,368,650,515, new_table_iterator_nanos = 0, block_seek_nanos = 6,379,372,483,
From the counters we can see that the two releases did exactly the same number of key comparisons and the same block reads; 5.0 is just slower when getting blocks from the cache and binary searching within them. However, there is no code change to block cache or block reading between the two releases.
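As a rough way to quantify the comparison above, here is a minimal Python sketch that parses a few of the quoted counters and prints the percent change from 4.13 to 5.0. It is not part of RocksDB; the helper names `parse_perf_context` and `percent_change` are made up for illustration, and only a subset of the counters is included to keep it short.

```python
import re

# Subset of the perf context counters quoted above (values copied verbatim).
PERF_4_13 = (
    "user_key_comparison_count = 144025604, block_read_count = 11500, "
    "block_checksum_time = 7,301,280, "
    "get_from_output_files_time = 13,133,321,848, "
    "new_table_block_iter_nanos = 2,093,206,796, "
    "new_table_iterator_nanos = 0, block_seek_nanos = 6,088,759,671,"
)
PERF_5_0 = (
    "user_key_comparison_count = 144025604, block_read_count = 11500, "
    "block_checksum_time = 7,054,535, "
    "get_from_output_files_time = 14,092,548,791, "
    "new_table_block_iter_nanos = 2,368,650,515, "
    "new_table_iterator_nanos = 0, block_seek_nanos = 6,379,372,483,"
)


def parse_perf_context(text):
    """Parse 'name = 1,234,' pairs into a {name: int} dict (hypothetical helper)."""
    return {
        name: int(value.replace(",", ""))
        for name, value in re.findall(r"(\w+) = ([\d,]+)", text)
    }


def percent_change(old, new):
    """Percent change per counter; None when the old value is zero."""
    result = {}
    for name, old_value in old.items():
        if old_value == 0:
            result[name] = None
        else:
            result[name] = (new[name] - old_value) / old_value * 100.0
    return result


deltas = percent_change(parse_perf_context(PERF_4_13), parse_perf_context(PERF_5_0))
for name, pct in deltas.items():
    print(f"{name:32s} {'n/a' if pct is None else f'{pct:+.1f}%'}")
```

With the numbers quoted above, the count-type counters come out unchanged, while get_from_output_files_time is up by about 7%, new_table_block_iter_nanos by about 13%, and block_seek_nanos by about 5%.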
Your benchmark setting uses a very small working set (40MB, which is likely to fit entirely in the CPU cache) and a single thread to read from it. I'm not surprised that RocksDB does worse in this specific setting. Over time we have been adding more features, which make the code more complicated, and more counters, while we have made few improvements targeting this scenario. When I run with a larger working set, the regression is much smaller.
We do plan to add more scenarios to the daily regression tests, so maybe a pure in-memory, single-threaded test like this can be added so that we can catch regressions like this earlier. For now, we haven't found anything obviously wrong. Maybe some other team member has the bandwidth to dig a little further.
By the way, "-cache_numshardbits=-1" is a feature not yet supported in 4.2, so it's undefined behavior there. You should probably use "4" in both releases to be fair. It doesn't change the benchmark result much, though. If the performance gap is not a blocker for you, I suggest you upgrade to a newer release anyway for better support from the community.
Best,
Siying