High IO for large rows


hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 6:47:05 AM
to ScyllaDB users
Hi,


does the max_cached_partition_size_in_bytes setting (mentioned at https://www.scylladb.com/2018/07/26/how-scylla-data-cache-works/) still exist in Scylla? I cannot find it in the source code and wonder whether it perhaps got removed. If so, is there any alternative to it?

My issue is that I have large partitions that are not cached any more. But I have plenty of memory for them to be cached.

regards,
Christian


Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 7:07:53 AM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 1:47 PM, hor...@gmail.com wrote:
Hi,


does the max_cached_partition_size_in_bytes setting (mentioned at https://www.scylladb.com/2018/07/26/how-scylla-data-cache-works/) still exist in Scylla? I cannot find it in the source code and wonder whether it perhaps got removed. If so, is there any alternative to it?


It was removed, and partitions of any size should be cached. Note that caching happens at row granularity: some rows in a partition can be cached while others are not.


My issue is that I have large partitions that are not cached any more. But I have plenty of memory for them to be cached.


Please provide more details. Any repeated read of a row should hit the cache, unless enough time has passed between repetitions for the row to be aged out.


You should be able to see whether the cache is hit or not by using tracing.
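For example, something like this from a shell (a sketch; ks.mytable and the key values are placeholders for your schema):

    # With tracing on, cqlsh prints the trace for the query, which shows
    # whether the read was answered from the row cache or had to go to sstables.
    cqlsh -e "TRACING ON; SELECT * FROM ks.mytable WHERE pk = 1 AND ck = 5;"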



regards,
Christian




hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 7:27:54 AM
to ScyllaDB users
Hi Avi,

perhaps my read pattern is the problem: I am reading random rows from these large partitions, but rarely the same row twice, so there are no cache hits. With a Cassandra-style block cache I would probably get cache hits, because blocks contain multiple rows.

Is there any way to make Scylla's caching more aggressive? E.g., to make it cache all the rows it loaded from disk, not just the one that was requested. I should have plenty of memory for all the data, but it seems I first have to read everything once to get it cached. And since the application is doing single reads, this takes time.

Another issue might be that a lot of queries are for missing rows. I assume missing rows are not cached?

Is the row cache also populated by writes?

I assume there is no way to enable Linux block caching in Scylla?

I will keep an eye on scylla_cache_row_hits/misses...
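Assuming the default Prometheus endpoint on port 9180, something like this should show those counters per node:

    # Assumption: metrics are exposed on the default Prometheus port 9180.
    curl -s http://localhost:9180/metrics | grep -E 'scylla_cache_row_(hits|misses)'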

regards,
Ch


Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 7:42:58 AM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 2:27 PM, hor...@gmail.com wrote:
Hi Avi,

perhaps my read pattern is the problem: I am reading random rows from these large partitions, but rarely the same row twice, so there are no cache hits. With a Cassandra-style block cache I would probably get cache hits, because blocks contain multiple rows.


Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).


Is there any way to make Scylla's caching more aggressive? E.g., to make it cache all the rows it loaded from disk, not just the one that was requested.


Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.
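For example, something like this (keyspace and table names are placeholders):

    # A sketch; ks.mytable is a placeholder. CL=ALL makes every replica
    # perform the read, so each node's cache gets populated along the way.
    cqlsh -e "CONSISTENCY ALL; SELECT * FROM ks.mytable;"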



I should have plenty of memory for all the data, but it seems I first have to read everything once to get it cached. And since the application is doing single reads, this takes time.

Another issue might be that a lot of queries are for missing rows. I assume missing rows are not cached?


If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).
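For example (a sketch with placeholder names):

    # ks.mytable, pk and ck are placeholders. The range read populates the
    # cache for the whole slice, including the gaps between existing rows,
    # so the later point read can be answered from cache even if the row
    # does not exist.
    cqlsh -e "SELECT * FROM ks.mytable WHERE pk = 1 AND ck >= 3 AND ck <= 7;"
    cqlsh -e "SELECT * FROM ks.mytable WHERE pk = 1 AND ck = 5;"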


Is the row cache also populated by writes?


Yes, but with limitations. If a row is present in the cache, it can be updated. If a row is not present in the cache but is present in sstables, it cannot be updated by a write. The reason is that we might need to merge the row with data from the sstables, and we don't want to issue an sstable read just for that.


I assume there is no way to enable Linux block caching in Scylla?


No.


I will keep an eye on scylla_cache_row_hits/misses...


How large is your data? Total data size, average row size, average rows per partition?


Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


regards,
Ch





Tomasz Grabiec <tgrabiec@scylladb.com>
Feb 3, 2021, 8:19:40 AM
to ScyllaDB users, hor...@gmail.com
On Wed, Feb 3, 2021 at 1:42 PM Avi Kivity <a...@scylladb.com> wrote:
Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


We could handle such workloads better by using the spare memory for sstable block caches. That way the cache would warm up faster when the workload fits in memory.

hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 8:30:53 AM
to ScyllaDB users
On Wednesday, 3 February 2021 at 13:42:58 UTC+1 Avi Kivity wrote:
Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).
In my special case it would be good if it could be configured to also parse & cache those over-read rows (more aggressive caching). Right now it is reading data from disk like crazy but throwing most of it away. But it might be a very special case :-)

Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.

yes :-)
 

If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).

If at least the miss-information from the over-read were available, that would help a lot. Something like this:
- User requests CK=5
- A block of data is read, which contains CK=1,5,9
- Data for 5 is cached, and misses for 2-4 and 6-8 are cached
If this were the case, then any updates to the other keys would be cached on write.
 
How large is your data? Total data size, average row size, average rows per partition?
It's quite small.

    Table: boe3
    SSTable count: 21
    SSTables in each level: [21/4]
    Space used (live): 3308337317
    Space used (total): 3308337317
    Space used by snapshots (total): 0
    Off heap memory used (total): 748563584
    SSTable Compression Ratio: 0.238434
    Number of partitions (estimate): 1160
    Memtable cell count: 113
    Memtable data size: 712089697
    Memtable off heap memory used: 748552192
    Memtable switch count: 9
    Local read count: 184100
    Local read latency: 74.225 ms
    Local write count: 5156189
    Local write latency: 0.011 ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bloom filter false positives: 33
    Bloom filter false ratio: 0.00006
    Bloom filter space used: 240
    Bloom filter off heap memory used: 224
    Index summary off heap memory used: 11168
    Compression metadata off heap memory used: 0
    Compacted partition minimum bytes: 30
    Compacted partition maximum bytes: 74975550
    Compacted partition mean bytes: 11820385
    Average live cells per slice (last five minutes): 0.0
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0 

Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 12:34:05 PM
to scylladb-users@googlegroups.com, Tomasz Grabiec, hor...@gmail.com
On 2/3/21 3:19 PM, Tomasz Grabiec wrote:

How large is your data? Total data size, average row size, average rows per partition?


Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


We could handle such workloads better by using the spare memory for sstable block caches. That way the cache would warm up faster when the workload fits in memory.



https://github.com/scylladb/scylla/issues/363


For such a small table, we can also scan it during startup, based on a table setting (prewarm cache).


But if it's so small, then I expect it will be brought into cache after a short while. Maybe it's the lack of negative entries in some circumstances that prevents the cache from being effective.  Maybe we can convert a cache miss for a row or range into a dummy range/row.

Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 12:41:01 PM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 3:30 PM, hor...@gmail.com wrote:



On Wednesday, 3 February 2021 at 13:42:58 UTC+1 Avi Kivity wrote:
Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).
In my special case it would be good if it could be configured to also parse & cache those over-read rows (more aggressive caching). Right now it is reading data from disk like crazy but throwing most of it away. But it might be a very special case :-)

Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.

yes :-)


Do try it, it's an interesting experiment. Also please share cache hit/miss statistics before and after.


 

If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).

If at least the miss-information from the over-read were available, that would help a lot. Something like this:
- User requests CK=5
- A block of data is read, which contains CK=1,5,9
- Data for 5 is cached, and misses for 2-4 and 6-8 are cached
If this were the case, then any updates to the other keys would be cached on write.


Unfortunately this is counterproductive for other workloads. Because of the way data is spread across many sstables, a lot of work is needed to gather that information.


We might opportunistically over-read and cache, but for other workloads that would just evict stuff from the cache. We'd need some way for the user to describe what kind of locality to expect. For your case, do you have anything more specific than "cache everything"?


 
How large is your data? Total data size, average row size, average rows per partition?
It's quite small.

    Table: boe3
    SSTable count: 21
    SSTables in each level: [21/4]
    Space used (live): 3308337317
    Space used (total): 3308337317


How much memory do you have? After decompression this is about 14 GB (3,308,337,317 bytes at a compression ratio of 0.238434 ≈ 13.9 GB uncompressed), and the in-memory representation has significant overhead.


btw, Scylla Enterprise has an in-memory feature which keeps sstables mirrored in memory. This isn't a cache: the sstables are permanently memory-resident (as well as stored on disk).

