Suggestion: An option to make bloom_filter_fp_chance be a function of sstable size (or sstable cardinality)


Edward Rozycki <rozycki.edward@gmail.com>
Jan 27, 2021, 4:14:30 PM
to ScyllaDB users
Suppose you have x GB of memory available for bloom filters, and your goal is to minimize the average number of disk read IOs for each read query. Suppose also that your sstables vary greatly in size and cardinality (as is common under STCS or TWCS). Then, there is a large benefit to constructing bloom filters with very low bloom_filter_fp_chance for small sstables, and bloom filters with a higher bloom_filter_fp_chance for larger sstables. A massive sstable (as is common under TWCS or STCS) can have a massive bloom filter; some of that memory could be reallocated to reduce bloom_filter_fp_chance for dozens of small sstables nearly to 0.
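
To make the memory tradeoff concrete, here is a rough back-of-the-envelope sketch in Python, using the standard sizing formula for an optimally built bloom filter (m = -n·ln(p)/(ln 2)² bits for n partitions at false-positive chance p). The sstable counts and cardinalities are made up purely for illustration:

```python
import math

def bloom_filter_bytes(n_partitions: float, fp_chance: float) -> float:
    """Optimal bloom filter size in bytes for n keys at false-positive
    probability p: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_partitions * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

MiB = 2 ** 20

# Made-up layout: one huge sstable plus 50 small ones, all at 1% fp chance.
huge = bloom_filter_bytes(1e9, 0.01)
small = 50 * bloom_filter_bytes(1e6, 0.01)
print(f"huge sstable at 1% fp:      {huge / MiB:7.1f} MiB")   # ~1143 MiB
print(f"50 small sstables at 1% fp: {small / MiB:7.1f} MiB")  # ~57 MiB

# Raising the huge sstable's fp chance frees far more memory than it costs
# to push the small sstables' fp chance nearly to zero.
print(f"huge at 10% fp:             {bloom_filter_bytes(1e9, 0.10) / MiB:7.1f} MiB")
print(f"50 small at 0.01% fp:       {50 * bloom_filter_bytes(1e6, 0.0001) / MiB:7.1f} MiB")
```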

There are some additional considerations in favor of dynamic bloom_filter_fp_chance:
- During periods of high sustained write load, a backlog of small- or medium-sized uncompacted sstables can accumulate. Unless bloom_filter_fp_chance is very low for these sstables, the presence of these sstables can significantly increase the volume of false positive disk read IOs. So setting bloom_filter_fp_chance for each sstable dynamically is particularly beneficial when the cluster is under the heaviest load, which (at least personally) is the worst-case scenario that I try to optimize for.
- Generally, large sstables are relatively likely to contain the key/partition that is being queried. This further reduces the value and memory-efficiency of bloom filters for large sstables.

At least for my use case (TWCS with 2 day windows, a very large number of small partitions, a high data-to-memory ratio, periods of high write volume, EBS (rather than local SSD) for disk so reads are relatively expensive), dynamic bloom_filter_fp_chance would provide a large efficiency benefit.

Nadav Har'El <nyh@scylladb.com>
Jan 27, 2021, 7:22:57 PM
to ScyllaDB users
Hi,

On Wed, Jan 27, 2021 at 11:14 PM Edward Rozycki <rozycki...@gmail.com> wrote:
Suppose you have x GB of memory available for bloom filters, and your goal is to minimize the average number of disk read IOs for each read query. Suppose also that your sstables vary greatly in size and cardinality (as is common under STCS or TWCS). Then, there is a large benefit to constructing bloom filters with very low bloom_filter_fp_chance for small sstables, and bloom filters with a higher bloom_filter_fp_chance for larger sstables. A massive sstable (as is common under TWCS or STCS) can have a massive bloom filter; some of that memory could be reallocated to reduce bloom_filter_fp_chance for dozens of small sstables nearly to 0.

There are some additional considerations in favor of dynamic bloom_filter_fp_chance:
- During periods of high sustained write load, a backlog of small- or medium-sized uncompacted sstables can accumulate. Unless bloom_filter_fp_chance is very low for these sstables, the presence of these sstables can significantly increase the volume of false positive disk read IOs. So setting bloom_filter_fp_chance for each sstable dynamically is particularly beneficial when the cluster is under the heaviest load, which (at least personally) is the worst-case scenario that I try to optimize for.
- Generally, large sstables are relatively likely to contain the key/partition that is being queried. This further reduces the value and memory-efficiency of bloom filters for large sstables.

We have had in the past some thoughts about how to improve bloom filter sizing, and have several open issues on the topic with various ideas, such as https://github.com/scylladb/scylla/issues/2024, https://github.com/scylladb/scylla/issues/2130, https://github.com/scylladb/scylla/issues/1946, https://github.com/scylladb/scylla/issues/2465 and https://github.com/scylladb/scylla/issues/2440.
But I don't recall this specific idea you're raising now having come up before.

Let's see if I understand your idea:

You're saying that if we have one huge sstable and a thousand tiny ones, we don't need to waste a lot of memory on holding a Bloom Filter for the huge sstable with, say, a 1% false positive chance, because most of the data which really exists is in that huge sstable - it's even fine if we didn't have a Bloom Filter for it at all and tried to find all partitions there - they usually will be. On the other hand, for the thousand tiny tables, a 1% false positive chance leaves every read finding a false match in 10 out of the 1000 tables. We should rather spend more memory on those thousand bloom filters than on the one big one.

This makes a lot of sense, but I'm not convinced that the scenario it assumes is typical. Why would you have one huge sstable and 1000 tiny ones? Aren't you worried that consulting 1000 different bloom filters would, in itself, be slow, even if you don't reach reading from disk? Or maybe I'm just misrepresenting the scenario you thought about? I see you mentioned TWCS below, but I'm not sure how exactly that applies (more on this below).
 

At least for my use case (TWCS with 2 day windows, a very large number of small partitions,

A very large number of small partitions causes the current Bloom Filter to use too much memory (see https://github.com/scylladb/scylla/issues/2024), right? Is this the problem that worries you?
Maybe that's the first thing we should fix for your workload?
 
a high data-to-memory ratio, periods of high write volume, EBS (rather than local SSD) for disk so reads are relatively expensive), dynamic bloom_filter_fp_chance would provide a large efficiency benefit.

I didn't understand why in this scenario you end up with the problem you described. With TWCS, don't you end up with sstables of roughly the same size - each one spanning about two days?
Of course you also have additional smaller sstables from the last two days, but their number is logarithmic (because STCS is used on the last time window). You probably shouldn't have anything like 100 such small sstables - so why does a 1% false positive rate cause problems? Or is it the other way around - you are trying to increase the false-positive rate on the big 2-day sstables, to save memory?
 


Edward Rozycki <rozycki.edward@gmail.com>
Jan 27, 2021, 8:25:26 PM
to ScyllaDB users
On Wednesday, January 27, 2021 at 4:22:57 PM UTC-8 Nadav Har'El wrote:
Hi,

On Wed, Jan 27, 2021 at 11:14 PM Edward Rozycki <rozycki...@gmail.com> wrote:
Suppose you have x GB of memory available for bloom filters, and your goal is to minimize the average number of disk read IOs for each read query. Suppose also that your sstables vary greatly in size and cardinality (as is common under STCS or TWCS). Then, there is a large benefit to constructing bloom filters with very low bloom_filter_fp_chance for small sstables, and bloom filters with a higher bloom_filter_fp_chance for larger sstables. A massive sstable (as is common under TWCS or STCS) can have a massive bloom filter; some of that memory could be reallocated to reduce bloom_filter_fp_chance for dozens of small sstables nearly to 0.

There are some additional considerations in favor of dynamic bloom_filter_fp_chance:
- During periods of high sustained write load, a backlog of small- or medium-sized uncompacted sstables can accumulate. Unless bloom_filter_fp_chance is very low for these sstables, the presence of these sstables can significantly increase the volume of false positive disk read IOs. So setting bloom_filter_fp_chance for each sstable dynamically is particularly beneficial when the cluster is under the heaviest load, which (at least personally) is the worst-case scenario that I try to optimize for.
- Generally, large sstables are relatively likely to contain the key/partition that is being queried. This further reduces the value and memory-efficiency of bloom filters for large sstables.

We have had in the past some thoughts about how to improve bloom filter sizing, and have several open issues on the topic with various ideas, such as https://github.com/scylladb/scylla/issues/2024, https://github.com/scylladb/scylla/issues/2130, https://github.com/scylladb/scylla/issues/1946, https://github.com/scylladb/scylla/issues/2465 and https://github.com/scylladb/scylla/issues/2440.
But I don't recall this specific idea you're raising now having come up before.

Let's see if I understand your idea:

You're saying that if we have one huge sstable and a thousand tiny ones, we don't need to waste a lot of memory on holding a Bloom Filter for the huge sstable with, say, a 1% false positive chance, because most of the data which really exists is in that huge sstable - it's even fine if we didn't have a Bloom Filter for it at all and tried to find all partitions there - they usually will be.

The way I see it, this is not the core idea, but nevertheless a nice "bonus" for use cases with the property that larger sstables are more likely to contain the data that is being queried. (I'd imagine that virtually all use cases exhibit this property to some extent, though it's possible to construct counterexamples if one tries to be clever.)
 
On the other hand, for the thousand tiny tables, a 1% false positive chance leaves every read finding a false match in 10 out of the 1000 tables. We should rather spend more memory on those thousand bloom filters than on the one big one.

This is the core idea. Holding memory usage constant, the average number of false positive disk read IOs is minimized by making bloom_filter_fp_chance an increasing function of sstable cardinality.

Here's a concrete example: Suppose you have two sstables, one with 1G partitions and the other with 10M partitions.
- Scenario 1: We create both sstables with a bloom_filter_fp_chance of 0.1. For a single partition read, the average number of false positive disk read IOs is 0.1 * 2 = 0.2. Optimally constructed bloom filters for these two sstables will have sizes 571MiB + 5.7MiB => ~576.7MiB total (see https://hur.st/bloomfilter/).
- Scenario 2: We create the large sstable with bloom_filter_fp_chance=0.19, and the small one with bloom_filter_fp_chance=0.01. The average number of false positive disk read IOs is 0.2, as before. Optimally constructed bloom filters for these two sstables will have sizes 412MiB + 11.4MiB => ~423.4MiB total for bloom filters. So we achieve the same false positive rate but with a smaller bloom filter footprint. The improvement is far larger if we have one large sstable and many small sstables, but to keep the example simple I only had one sstable of each size.
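
For anyone who wants to check the arithmetic, here is a short Python snippet that reproduces the numbers above from the standard sizing formula (the same formula the linked calculator uses):

```python
import math

def bloom_bytes(n, p):
    """Bytes for an optimally sized bloom filter over n keys with fp chance p."""
    return -n * math.log(p) / (math.log(2) ** 2) / 8

MiB = 2 ** 20

# Scenario 1: both sstables built with bloom_filter_fp_chance = 0.1
scenario_1 = bloom_bytes(1e9, 0.1) + bloom_bytes(1e7, 0.1)
# Scenario 2: large sstable at 0.19, small one at 0.01 (same 0.2 expected
# false-positive reads per single-partition query)
scenario_2 = bloom_bytes(1e9, 0.19) + bloom_bytes(1e7, 0.01)

print(f"scenario 1: {scenario_1 / MiB:6.1f} MiB")  # ~577 MiB
print(f"scenario 2: {scenario_2 / MiB:6.1f} MiB")  # ~424 MiB
```

Same expected number of false-positive disk reads per query, roughly 150 MiB less bloom filter memory.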

This makes a lot of sense, but I'm not convinced that the scenario it assumes is typical. Why would you have one huge sstable and 1000 tiny ones?

I agree that it's unlikely to have one huge sstable and 1000 tiny ones. But we do not need anything as extreme as "one huge + 1000 tiny" for my proposed optimization to be beneficial. In fact, under some reasonable assumptions, my optimization is beneficial in any scenario where sstable cardinalities are non-uniform. The optimization is beneficial if you have 1 large sstable and 4 medium ones. It is beneficial even if you have one sstable with 1M partitions and another with 1.1M partitions, though in this case the benefit is small.
 
Aren't you worried that consulting 1000 different bloom filters would, in itself, be slow, even if you don't reach reading from disk? Or maybe I'm just misrepresenting the scenario you thought about? I see you mentioned TWCS below, but I'm not sure how exactly that applies (more on this below).

Consulting the 1000 bloom filters would be slow. But we're going to consult 1000 bloom filters regardless of whether my optimization is used. At least with my optimization, there are fewer disk read IOs after the bloom filters have been consulted. So yes, we really want to avoid 1000 bloom filter checks, but I think that this is orthogonal to the optimization I'm suggesting. The benefit is still there even in more realistic scenarios (one large table and tens of smaller ones).
 

At least for my use case (TWCS with 2 day windows, a very large number of small partitions,

A very large number of small partitions causes the current Bloom Filter to use too much memory (see https://github.com/scylladb/scylla/issues/2024), right? Is this the problem that worries you?
Maybe that's the first thing we should fix for your workload?

Yes, for my TWCS use case Scylla uses (wastes) a very large amount of memory for bloom filters for the fully compacted sstables from previous time windows. Avi's proposal in #2024 is similar and would address my issue.

 
a high data-to-memory ratio, periods of high write volume, EBS (rather than local SSD) for disk so reads are relatively expensive), dynamic bloom_filter_fp_chance would provide a large efficiency benefit.

I didn't understand why in this scenario you end up with the problem you described. With TWCS, don't you end up with sstables of roughly the same size - each one spanning about two days?
Of course you also have additional smaller sstables from the last two days, but their number is logarithmic (because STCS is used on the last time window). You probably shouldn't have anything like 100 such small sstables - so why does a 1% false positive rate cause problems? Or is it the other way around - you are trying to increase the false-positive rate on the big 2-day sstables, to save memory?

Yes, it is "the other way around." In my use case, I would like to have a 1% bloom_filter_fp_chance, but I cannot because then I would have ridiculously large bloom filters for the massive sstables from completed/compacted time windows. The next best thing is to have a ~1% bloom_filter_fp_chance for all the small sstables in the most recent time window, and maybe something like ~30% for the massive sstables from completed time windows. Also, in practice the number of small sstables can be greater than logarithmic - when I hammer Scylla 4.2.1 with a very heavy write load, I can end up with 50 small/medium sstables per shard. Though perhaps the 50 small sstables per shard can be thought of as an issue with compaction falling behind. (At the same time, personally I actually don't mind if compactions fall behind since this increases write throughput. For, me it's fine as long as bloom_filter_fp_chance is low so that the large number of small uncompacted sstables doesn't translate to a large number of false positive disk reads).

My suggestion was motivated by an intense data load I did recently, where Scylla began to OOM/bad_alloc, I believe due to high bloom filter memory usage (which I see reported under Non-LSA memory). However, I believe that the optimization I propose is more general and can benefit any workload with sstables of non-uniform cardinality.

Avi Kivity <avi@scylladb.com>
Jan 28, 2021, 7:00:28 AM
to scylladb-users@googlegroups.com, Edward Rozycki

We can start by enforcing a limit (now that we have sstable_manager we can do it easily) without singling out sstables. Later, we can choose which bloom filter to downsize by looking at a cost/benefit ratio: if a filter has few false positives, then it is a candidate for downsizing. We can calculate a metric based on the memory saved and the extra reads seen for a proposed downscaling of the filter.
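
For illustration only, here is a rough Python sketch of what such a cost/benefit score could look like. The names, stats and numbers are hypothetical and do not correspond to Scylla internals; the point is just that "memory saved per extra false-positive read" ranks big, rarely-hit filters ahead of small, hot ones:

```python
import math
from dataclasses import dataclass

@dataclass
class FilterStats:
    """Hypothetical per-sstable stats (not Scylla's actual bookkeeping)."""
    name: str
    filter_bytes: int           # current bloom filter size
    reads_per_sec: float        # reads that consult this sstable
    fp_chance: float            # current false-positive chance
    proposed_fp_chance: float   # fp chance after the proposed downsizing

def downsizing_score(s: FilterStats) -> float:
    """Memory saved per extra false-positive read caused by the downsizing.
    Higher is better."""
    # Filter size scales with -ln(p), so the shrunken size is proportional.
    new_bytes = s.filter_bytes * math.log(s.proposed_fp_chance) / math.log(s.fp_chance)
    memory_saved = s.filter_bytes - new_bytes
    extra_fp_reads = s.reads_per_sec * (s.proposed_fp_chance - s.fp_chance)
    return memory_saved / max(extra_fp_reads, 1e-9)

candidates = [
    FilterStats("big-old-window", 600 << 20, reads_per_sec=5.0,
                fp_chance=0.01, proposed_fp_chance=0.10),
    FilterStats("small-recent", 6 << 20, reads_per_sec=500.0,
                fp_chance=0.01, proposed_fp_chance=0.10),
]

# Downsize the best-scoring filters first, until total memory is under the limit.
for s in sorted(candidates, key=downsizing_score, reverse=True):
    print(f"{s.name}: {downsizing_score(s):.3e} bytes saved per extra read/s")
```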


Nadav Har'El <nyh@scylladb.com>
Jan 28, 2021, 7:56:17 AM
to ScyllaDB users
On Thu, Jan 28, 2021 at 3:25 AM Edward Rozycki <rozycki...@gmail.com> wrote:


On Wednesday, January 27, 2021 at 4:22:57 PM UTC-8 Nadav Har'El wrote:
Hi,

On Wed, Jan 27, 2021 at 11:14 PM Edward Rozycki <rozycki...@gmail.com> wrote:
Suppose you have x GB of memory available for bloom filters, and your goal is to minimize the average number of disk read IOs for each read query. Suppose also that your sstables vary greatly in size and cardinality (as is common under STCS or TWCS). Then, there is a large benefit to constructing bloom filters with very low bloom_filter_fp_chance for small sstables, and bloom filters with a higher bloom_filter_fp_chance for larger sstables. A massive sstable (as is common under TWCS or STCS) can have a massive bloom filter; some of that memory could be reallocated to reduce bloom_filter_fp_chance for dozens of small sstables nearly to 0.

There are some additional considerations in favor of dynamic bloom_filter_fp_chance:
- During periods of high sustained write load, a backlog of small- or medium-sized uncompacted sstables can accumulate. Unless bloom_filter_fp_chance is very low for these sstables, the presence of these sstables can significantly increase the volume of false positive disk read IOs. So setting bloom_filter_fp_chance for each sstable dynamically is particularly beneficial when the cluster is under the heaviest load, which (at least personally) is the worst-case scenario that I try to optimize for.
- Generally, large sstables are relatively likely to contain the key/partition that is being queried. This further reduces the value and memory-efficiency of bloom filters for large sstables.

We have had in the past some thoughts about how to improve bloom filter sizing, and have several open issues on the topic with various ideas, such as https://github.com/scylladb/scylla/issues/2024, https://github.com/scylladb/scylla/issues/2130, https://github.com/scylladb/scylla/issues/1946, https://github.com/scylladb/scylla/issues/2465 and https://github.com/scylladb/scylla/issues/2440.
But I don't recall this specific idea which you are raising now being raised in the past.

Let's see if I understand your idea:

You're saying that if we have one huge sstable and a thousand tiny ones, we don't need to waste a lot of memory on holding a Bloom Filter for the huge sstable with, say, a 1% false positive chance, because most of the data which really exists is in that huge sstable - it's even fine if we didn't have a Bloom Filter for it at all and tried to find all partitions there - they usually will be.

The way I see it, this is not the core idea, but nevertheless a nice "bonus" for use cases with the property that larger sstables are more likely to contain the data that is being queried. (I'd imagine that virtually all use cases exhibit this property to some extent, though it's possible to construct counterexamples if one tries to be clever.)
 
On the other hand, for the thousand tiny tables, a 1% false positive chance leaves every read finding a false match in 10 out of the 1000 tables. We should rather spend more memory on those thousand bloom filters than on the one big one.

This is the core idea. Holding memory usage constant, the average number of false positive disk read IOs is minimized by making bloom_filter_fp_chance an increasing function of sstable cardinality.

Here's a concrete example: Suppose you have two sstables, one with 1G partitions and the other with 10M partitions.
- Scenario 1: We create both sstables with a bloom_filter_fp_chance of 0.1. For a single partition read, the average number of false positive disk read IOs is 0.1 * 2 = 0.2. Optimally constructed bloom filters for these two sstables will have sizes 571MiB + 5.7MiB => ~576.7MiB total (see https://hur.st/bloomfilter/).
- Scenario 2: We create the large sstable with bloom_filter_fp_chance=0.19, and the small one with bloom_filter_fp_chance=0.01. The average number of false positive disk read IOs is 0.2, as before. Optimally constructed bloom filters for these two sstables will have sizes 412MiB + 11.4MiB => ~423.4MiB total for bloom filters. So we achieve the same false positive rate but with a smaller bloom filter footprint. The improvement is far larger if we have one large sstable and many small sstables, but to keep the example simple I only had one sstable of each size.

What I still don't understand is the part of your concrete example's setup where one sstable has larger partitions and the other has smaller partitions. This sounds like a separate issue from one sstable being larger and the other being smaller - which I thought was what your suggestion described. I am guessing (?) that in your time-series example what happens is that the larger tables have the same number of partitions - just longer partitions for each. Of course, this is not a universal thing.

Anyway, we're already *supposed* to make the size of the bloom filter proportional to the number of partitions, not the size in bytes of the sstable.
A huge sstable which has a relatively small number of partitions should already, in the current code, have a relatively small Bloom filter. You don't need to increase its bloom_filter_fp_chance.

If this isn't happening, perhaps we have a bug with partition count estimation? We had, and have, several issues about that.
The reason why partition count estimation is important is that we start preparing the Bloom filter during compaction, before we know how many partitions the resulting compacted table will have. So we need to estimate it - using interesting techniques which I can explain if you're interested. Perhaps we have a bug in this, or some cases where it doesn't work properly?
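
To illustrate why the estimate matters (this toy snippet is not Scylla's actual estimator): when the input sstables of a compaction overlap, simply summing their partition counts over-estimates the merged output, so the output filter gets over-sized; a mergeable cardinality sketch kept per sstable (HyperLogLog-style) can give a much better union estimate before the merge is written:

```python
import random

# Toy model: three input sstables from the same time window share many keys
# (overwrites), so summing their partition counts over-estimates the output.
random.seed(1)
universe = [f"pk-{i}" for i in range(200_000)]
inputs = [set(random.sample(universe, 80_000)) for _ in range(3)]

naive_estimate = sum(len(s) for s in inputs)   # 240,000
true_merged = len(set().union(*inputs))        # ~157,000 once duplicates merge

print("sum of input partition counts:", naive_estimate)
print("actual merged partition count:", true_merged)

# Sizing the output bloom filter for the naive sum wastes memory; sizing it
# for an under-estimate inflates the false-positive rate. A mergeable sketch
# avoids keeping the exact key sets used here for demonstration.
```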

Note that https://github.com/scylladb/scylla/issues/2024 is apparently about the opposite case from what you care about. It's about a huge sstable with tiny partitions, which genuinely needs a huge Bloom filter because it has a huge number of partitions, but you can't afford such a huge Bloom filter. Your case is the opposite: If you have huge tables with only a (relatively) small number of partitions, you don't need a huge Bloom filter to get a low false-positive rate - a small one should have been enough.


Yes, it is "the other way around." In my use case, I would like to have a 1% bloom_filter_fp_chance, but I cannot because then I would have ridiculously large bloom filters for the massive sstables from completed/compacted time windows. The next best thing is to have a ~1% bloom_filter_fp_chance for all the small sstables in the most recent time window, and maybe something like ~30% for the massive sstables from completed time windows.

If we have a 30% false positive rate, and 30 of these tables (say, spanning 60 days with 2-day windows), each read will wrongly consult about 9 of them on average. This doesn't sound like a good idea.
I think you really need a 1% false positive rate for these tables as well, no?
 
Also, in practice the number of small sstables can be greater than logarithmic - when I hammer Scylla 4.2.1 with a very heavy write load, I can end up with 50 small/medium sstables per shard. Though perhaps the 50 small sstables per shard can be thought of as an issue with compaction falling behind. (At the same time, personally I actually don't mind if compactions fall behind since this increases write throughput. For me, it's fine as long as bloom_filter_fp_chance is low so that the large number of small uncompacted sstables doesn't translate to a large number of false positive disk reads.)

Yes, I understand why a low bloom_filter_fp (say, 1%) is good for the small tables.
I still don't understand why you need a higher one for the bigger table.


My suggestion was motivated by an intense data load I did recently, where Scylla began to OOM/bad_alloc, I believe due to high bloom filter memory usage (which I see reported under Non-LSA memory). However, I believe that the optimization I propose is more general and can benefit any workload with sstables of non-uniform cardinality.


Edward Rozycki <rozycki.edward@gmail.com>
Jan 28, 2021, 2:56:39 PM
to ScyllaDB users
Thanks so much Nadav and Avi for the responses. Happy to see that this sort of optimization (e.g. https://github.com/scylladb/scylla/issues/2024) is in the issue tracker.

On Thursday, January 28, 2021 at 4:56:17 AM UTC-8 Nadav Har'El wrote:
What I still don't understand is the part of your concrete example's setup where one sstable has larger partitions and the other has smaller partitions. This sounds like a separate issue from one sstable being larger and the other being smaller - which I thought was what your suggestion described. I am guessing (?) that in your time-series example what happens is that the larger tables have the same number of partitions - just longer partitions for each. Of course, this is not a universal thing.

Sorry, my wording was ambiguous in some places. In my posts, I am talking about (and concerned about) sstables with very high cardinality (i.e. number of partitions). I'm not concerned about sstable size in bytes per se, except to the extent that sstable size in bytes is correlated with sstable cardinality. When I say "very large sstables," I mean "sstables with a very large number of partitions" (which also have a proportionally large size in bytes, but this fact is less relevant).
 
Anyway, we're already *supposed* to make the size of the bloom filter proportional to the number of partitions, not the size in bytes of the sstable.
A huge sstable which has a relatively small number of partitions should already, in the current code, have a relatively small Bloom filter. You don't need to increase its bloom_filter_fp_chance.

If this isn't happening, perhaps we have a bug with partition count estimation? We had, and have, several issues about that.
The reason why partition count estimation is important is that we start preparing the Bloom filter during compaction, before we know how many partitions the resulting compacted table will have. So we need to estimate it - using interesting techniques which I can explain if you're interested. Perhaps we have a bug in this, or some cases where it doesn't work properly?

Note that https://github.com/scylladb/scylla/issues/2024 is apparently about the opposite case from what you care about. It's about a huge sstable with tiny partitions, which genuinely needs a huge Bloom filter because it has a huge number of partitions, but you can't afford such a huge Bloom filter. Your case is the opposite: If you have huge tables with only a (relatively) small number of partitions, you don't need a huge Bloom filter to get a low false-positive rate - a small one should have been enough.

In my use case, average partition size is very small and approximately independent of sstable size in bytes. I didn't mean to imply that my large-in-bytes sstables have large partitions; sorry for the confusion. I agree that large-in-bytes sstables with very large partitions wouldn't be concerning (except if there's a bug in Scylla), and I have had no issues with bloom filter size for sstables that have large partitions.
 
If we have a 30% false positive rate, and 30 of these tables (say, spanning 60 days with 2-day windows), each read will wrongly consult about 9 of them on average. This doesn't sound like a good idea.
I think you really need a 1% false positive rate for these tables as well, no?
 
Well, it's not ideal. But if we have a limit on total bloom filter size/memory usage and we run the risk of exceeding that limit, the most efficient use of memory is to increase the fp rate for higher-cardinality sstables. I'm largely concerned about the case where bloom filters are very large and we need to reduce their size somehow (either the user has to pre-emptively construct/ALTER the table with a higher fp chance, or Scylla has to reactively reduce the size of bloom filters when it sees that total bloom filter usage is unreasonable). However, even when Scylla is not hitting a soft/hard limit on total bloom filter size, it seems optimal (with respect to metrics like num_false_positive_disk_reads_per_read_query / total_bloom_filter_memory_usage) to have bloom_filter_fp_chance vary across sstables based on sstable cardinality or other cost/benefit metrics. Though I agree that increasing the fp rate for high-cardinality sstables to as high as 30% would often be excessive and undesirable, unless memory limitations force the fp rate that high.
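
To make "increase the fp rate for higher-cardinality sstables" precise, here is a rough sketch (the function name and numbers are mine, purely for illustration). Minimizing the expected number of false-positive reads per query under a fixed memory budget, with the standard sizing formula, makes the optimal fp chance proportional to each sstable's partition count (capped at some maximum):

```python
import math

LN2_SQ = math.log(2) ** 2

def optimal_fp_chances(cardinalities, memory_budget_bytes, cap=0.5):
    """Split a bloom filter memory budget across sstables so that the expected
    number of false-positive reads per query is minimized.

    Minimizing sum(p_i) subject to sum(-n_i * ln(p_i) / ln(2)^2 / 8) = budget
    gives p_i proportional to n_i; solve for the constant k by bisection."""
    def memory_used(k):
        return sum(-n * math.log(min(k * n, cap)) / LN2_SQ / 8
                   for n in cardinalities)

    lo, hi = 1e-30, 1.0   # assumes the budget lies between these extremes
    for _ in range(200):  # geometric bisection on k
        mid = math.sqrt(lo * hi)
        if memory_used(mid) > memory_budget_bytes:
            lo = mid      # using too much memory -> raise fp chances
        else:
            hi = mid
    k = math.sqrt(lo * hi)
    return [min(k * n, cap) for n in cardinalities]

# One 1G-partition sstable plus ten 10M-partition ones, 600 MiB total budget.
ns = [1e9] + [1e7] * 10
for n, p in zip(ns, optimal_fp_chances(ns, 600 * 2 ** 20)):
    print(f"{n:>12.0f} partitions -> fp_chance {p:.4f}")
```

With these (made-up) inputs the 1G-partition sstable lands at roughly a 17% fp chance and the 10M-partition ones at roughly 0.2%, which is the kind of split I'm describing above.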