Caching index blocks and filter blocks in RocksDB

Dongchul Park

May 3, 2017, 8:29:57 PM
to rocksdb
Hello,

I have a quick question about index and filter block caching in RocksDB.

As you know, RocksDB has an option to cache both index blocks and filter blocks in the block cache (the cache_index_and_filter_blocks option). I found that its default value is false. Is there a reason RocksDB does not enable this feature by default?

I also noticed that the performance results on the RocksDB home page did NOT set this option to true, so those benchmarks ran with the default value (i.e., false). Intuitively, setting the flag to true should significantly reduce read amplification.

 

On the other hand, if we set it to true, couldn't most (or all) of the block cache fill up with index and filter blocks instead of data blocks? Would that eventually hurt overall read performance, since the block cache hit rate for data blocks would drop (i.e., the cache holds indexing data, leaving little or no room for data blocks)?


I really appreciate your answers.

 



MARK CALLAGHAN

May 5, 2017, 9:39:59 AM
to Dongchul Park, rocksdb
My memory is not perfect about this but cache_index_and_filter_blocks was added long after RocksDB became popular, so the default is false to avoid a behavior change. The perf results on the RocksDB page were published before that option was added, so those results don't mention it. I think we should remove or update the perf results on the wiki because they are so old.

Given a choice I prefer to dedicate RAM to filter/index blocks versus data blocks. Data blocks in L0 and maybe L1 are frequently accessed, but data blocks near the bottom of the LSM tree are much less frequently accessed compared to filter/index blocks. We have work in progress to support partitioned indexes to improve performance for workloads where the database is much larger than RAM and we can't cache all filter/index blocks. I am not sure we have a blog post about that yet. And partitioned indexes will also improve performance when cache_index_and_filter_blocks is true.

--
You received this message because you are subscribed to the Google Groups "rocksdb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rocksdb+unsubscribe@googlegroups.com.
To post to this group, send email to roc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rocksdb/e24f3ef5-eaf3-4913-9a78-4e43b743c062%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Mark Callaghan
mdca...@gmail.com

Siying Dong

May 5, 2017, 12:37:26 PM
to rocksdb
As Mark said, cache_index_and_filter_blocks used to introduce a non-trivial performance penalty. That penalty has since been significantly reduced by the pin_l0_filter_and_index_blocks_in_cache option and by the clock cache; with those, performance is much closer to the default configuration. We never updated the benchmark results.

Changing default RocksDB options is tricky, as it may have an intrusive impact on users who have carefully tuned the options in their systems. This is less of an issue for options that everyone tends to set, like write_buffer_size; cache_index_and_filter_blocks is not such an option, so we are cautious about changing its default value. In your use case, if you set the block cache size to be far larger than the total index and bloom filter size, I encourage you to set cache_index_and_filter_blocks = true and pin_l0_filter_and_index_blocks_in_cache = true. This should get you near-optimal performance while keeping memory usage under your control.
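A sketch of that configuration in C++, assuming a block cache comfortably larger than the total index/filter footprint (the 8 GiB capacity and 10-bits-per-key filter are placeholders, not recommendations):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;

  // Placeholder capacity: size this well above the total index + filter
  // size so data blocks still have room.
  table_options.block_cache = rocksdb::NewLRUCache(8ULL << 30);

  // Charge index/filter blocks to the block cache so their memory
  // use is bounded by the cache capacity...
  table_options.cache_index_and_filter_blocks = true;
  // ...but keep L0 metadata pinned, since every read touches it anyway.
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;

  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```

rocksdb::NewClockCache can be swapped in for NewLRUCache to reduce the cache-accounting overhead mentioned in this thread.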

Lucas Lersch

Jul 3, 2017, 11:56:40 AM
to rocksdb
I will just take the opportunity of this great discussion to make a more general comment. Using the block_cache to hold every kind of block (data, index, filter) seems like the better alternative, and I understand the current default is kept for backward compatibility.
However, the option pin_l0_filter_and_index_blocks_in_cache sounds unnecessary. A well-tuned cache policy should identify hot data and prefer to hold it in DRAM, without depending on hints from the user.
As an example, if a data block in L3 is being accessed frequently, the cache policy should give it priority over a colder index block in L0.

While enabling the user to customize the system's behavior is important, it adds complexity; more robust default behavior with fewer options would be more user-friendly :)

Just my two cents...

MARK CALLAGHAN

Jul 3, 2017, 12:13:35 PM
to Lucas Lersch, rocksdb
I agree with you.




--
Mark Callaghan
mdca...@gmail.com

Igor Canadi

Jul 3, 2017, 1:59:14 PM
to Lucas Lersch, rocksdb
Lucas, I also agree that the cache should be trusted to figure out cold vs. hot data. However, IIRC the problem was not that the cache was evicting L0 index and filter blocks. They were always present in the cache, since they had to be accessed on every single read (there is no way to avoid reading all L0 files, unless the key you're looking for is found in L0 first). The problem was that on every read we had to do some cache accounting (increase/decrease a refcount, a hashtable lookup, etc.). Since those blocks were always in the cache anyway, that work was wasted.

> As an example, if a data block in L3 is being accessed frequently, the cache policy should give it priority over a colder index block in L0.

Because of the way RocksDB reads work, there's no way an index block in L0 is colder than a data block in L3. L0 files always need to be read, and to read them you need to access their index and filter blocks. (Well, you could avoid reading an L0 index by using a full-table filter, but even then the occasional false positive needs to touch the index block.)

> While enabling the user to customize the system's behavior is important, it adds complexity; more robust default behavior with fewer options would be more user-friendly :)

Definitely. In fact, I think pin_l0_filter_and_index_blocks_in_cache should not be an option, but just always true.

Igor


Lucas Lersch

Jul 4, 2017, 7:16:03 AM
to rocksdb
Thanks for the answers Mark and Igor.

Yes, I totally get the point about the overhead of the cache component. Siying mentioned earlier that CLOCK should help reduce this overhead (hits are usually cheaper than with LRU). Maybe the overhead can be further reduced to the point where it does not impact performance much, but it will always be there.

And the insight that an index block in L0 is never colder was very helpful. I wonder if splitting the index block into smaller blocks (the same size as a data block, perhaps, resembling a small B+Tree) would help; in that case the scenario of an index block in L0 being colder than a data block in L3 could happen.

Anyway, I always find such discussions helpful. Thanks a lot :)

MARK CALLAGHAN

Jul 5, 2017, 10:35:48 AM
to Lucas Lersch, rocksdb
For splitting large index/filter blocks see http://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html
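For reference, the partitioned setup described in that post looks roughly like the fragment below; the BlockBasedTableOptions fields named here exist in RocksDB, but treat the specific values as illustrative:

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/table.h>

rocksdb::BlockBasedTableOptions table_options;
// Two-level (partitioned) index: a small top-level index points at
// per-partition index blocks that can be cached and evicted individually.
table_options.index_type =
    rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
// Partition the bloom filter as well; this requires full (not
// block-based) filters.
table_options.partition_filters = true;
table_options.filter_policy.reset(
    rocksdb::NewBloomFilterPolicy(10, /*use_block_based_builder=*/false));
// Target size of each index/filter partition, in bytes.
table_options.metadata_block_size = 4096;
table_options.cache_index_and_filter_blocks = true;
```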

--
Mark Callaghan
mdca...@gmail.com

Lucas Lersch

Jul 7, 2017, 6:14:37 AM
to rocksdb
Well, I'm feeling a bit awkward for having predicted the past. Nevertheless, some additional questions about partitioned indexes.

> The size of the index/filter varies based on the configuration but for a SST of size 256MB the index/filter block of size 0.5/5MB is typical, which is much larger than the typical data block size of 4-32KB.

I got a bit confused about an SST of size 256MB, coming from LevelDB where SSTs are fixed at 2MB by default. I took a look at the code and have some additional questions:
  • The default SST size is 64MB; is there any particular reason it is larger than the original 2MB? To handle larger values, maybe?
  • There is an option to increase the size of SSTs from one level to the next (target_file_size_multiplier). What are the benefits of having SSTs of variable sizes?
  • There is an option to increase the maximum size of levels in a non-monotonic way (max_bytes_for_level_multiplier_additional). What is the reasoning behind that?
My final question is about the partitioned indexes. It seems they are partitioned a single level deep, which alleviates the problem I mentioned in my previous post. Is there anything stopping RocksDB from partitioning the index block into as many fixed-size blocks as necessary, forming multiple layers?

Thanks for the clarifications and sorry for so many questions.

Igor Canadi

Jul 7, 2017, 1:51:35 PM
to Lucas Lersch, rocksdb
I'll answer the questions I know the answers to.

> The default SST size is 64MB, any particular reason for larger than the original 2MB? Handle larger values maybe?

To handle large databases. If you have a 1TB database, 2MB files would leave you with half a million files, which is a lot of file descriptors. :) LevelDB was optimized for smaller databases, while RocksDB tries to support big ones, too.

> There is an option to increase the size of SSTs from one level to the next (target_file_size_multiplier). What are the benefits of having SSTs of variable sizes?

This was an experimental option when we did some benchmarking early on. I don't think this option has shown any merit and I don't think anybody's using it.

> There is an option to increase the maximum size of levels in a non-monotonic way (max_bytes_for_level_multiplier_additional). What would be the reasoning for that?

See answer above :)

Igor
