High IO for large rows


hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 6:47:05 AM
to ScyllaDB users
Hi,


does the max_cached_partition_size_in_bytes setting (mentioned at https://www.scylladb.com/2018/07/26/how-scylla-data-cache-works/) still exist in Scylla? I cannot find it in the source code and wonder whether it perhaps got removed. If so, is there any alternative to it?

My issue is that I have large partitions that are not cached any more. But I have plenty of memory for them to be cached.

regards,
Christian


Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 7:07:53 AM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 1:47 PM, hor...@gmail.com wrote:
Hi,


does the max_cached_partition_size_in_bytes setting (mentioned at https://www.scylladb.com/2018/07/26/how-scylla-data-cache-works/) still exist in Scylla? I cannot find it in the source code and wonder whether it perhaps got removed. If so, is there any alternative to it?


It was removed, and partitions of any size should be cached. Note that caching happens at row granularity: some rows in a partition can be cached while others are not.


My issue is that I have large partitions that are not cached any more. But I have plenty of memory for them to be cached.


Please provide more details. Any repeated read of a row should hit the cache, unless enough time has passed between repetitions for the row to be aged out.


You should be able to see whether the cache is hit or not by using tracing.
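For example, something like this from a shell (a sketch; ks.mytable and the key values are placeholders for your schema):

    # With tracing on, cqlsh prints the trace for the query, which shows
    # whether the read was answered from the row cache or had to go to sstables.
    cqlsh -e "TRACING ON; SELECT * FROM ks.mytable WHERE pk = 1 AND ck = 5;"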



regards,
Christian




hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 7:27:54 AM
to ScyllaDB users
Hi Avi,

perhaps my read pattern is the problem: I am reading random rows from these large partitions, but rarely the same row twice, so there are no cache hits. With a Cassandra-style block cache I would probably get cache hits, because blocks contain multiple rows.

Is there any way to make Scylla's caching more aggressive? E.g., to make it cache all the rows it loaded from disk, not just the one that was requested. I should have plenty of memory for all the data, but it seems I first have to read everything once to get it cached. And since the application is doing single reads, this takes time.

Another issue might be that a lot of queries are for missing rows. I assume missing rows are not cached?

Is the row cache also populated by writes?

I assume there is no way to enable Linux block caching in Scylla?

I will keep an eye on scylla_cache_row_hits/misses...
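Assuming the default Prometheus endpoint on port 9180, something like this should show those counters per node:

    # Assumption: metrics are exposed on the default Prometheus port 9180.
    curl -s http://localhost:9180/metrics | grep -E 'scylla_cache_row_(hits|misses)'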

regards,
Ch


Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 7:42:58 AM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 2:27 PM, hor...@gmail.com wrote:
Hi Avi,

perhaps my read pattern is the problem: I am reading random rows from these large partitions, but rarely the same row twice, so there are no cache hits. With a Cassandra-style block cache I would probably get cache hits, because blocks contain multiple rows.


Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).


Is there any way to make Scylla's caching more aggressive? E.g., to make it cache all the rows it loaded from disk, not just the one that was requested.


Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.
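For example, something like this (keyspace and table names are placeholders):

    # A sketch; ks.mytable is a placeholder. CL=ALL makes every replica
    # perform the read, so each node's cache gets populated along the way.
    cqlsh -e "CONSISTENCY ALL; SELECT * FROM ks.mytable;"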



I should have plenty of memory for all the data, but it seems I first have to read everything once to get it cached. And since the application is doing single reads, this takes time.

Another issue might be that a lot of queries are for missing rows. I assume missing rows are not cached?


If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).
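For example (a sketch with placeholder names):

    # ks.mytable, pk and ck are placeholders. The range read populates the
    # cache for the whole slice, including the gaps between existing rows,
    # so the later point read can be answered from cache even if the row
    # does not exist.
    cqlsh -e "SELECT * FROM ks.mytable WHERE pk = 1 AND ck >= 3 AND ck <= 7;"
    cqlsh -e "SELECT * FROM ks.mytable WHERE pk = 1 AND ck = 5;"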


Is the row cache also populated by writes?


Yes, but with limitations. If a row is present in the cache, it can be updated. If a row is not present in the cache but is present in sstables, it cannot be updated by a write. The reason is that we might need to merge the row with data from the sstables, and we don't want to issue an sstable read just for that.


I assume there is no way to enable Linux block caching in Scylla?


No.


I will keep an eye on scylla_cache_row_hits/misses...


How large is your data? Total data size, average row size, average rows per partition?


Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


regards,
Ch





Tomasz Grabiec <tgrabiec@scylladb.com>
Feb 3, 2021, 8:19:40 AM
to ScyllaDB users, hor...@gmail.com
On Wed, Feb 3, 2021 at 1:42 PM Avi Kivity <a...@scylladb.com> wrote:
Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


We could handle such workloads better by using the spare memory for sstable block caches. That way the cache would warm up faster when the workload fits in memory.

hor...@gmail.com <horschi@gmail.com>
Feb 3, 2021, 8:30:53 AM
to ScyllaDB users
On Wednesday, 3 February 2021 at 13:42:58 UTC+1 Avi Kivity wrote:
Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).
In my special case it would be good if it could be configured to also parse & cache those over-read rows (more aggressive caching). Right now it is reading data from disk like crazy but throwing most of it away. But it might be a very special case :-)

Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.

yes :-)
 

If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).

If at least the miss-information from the over-read were available, that would help a lot. Something like this:
- User requests CK=5
- A block of data is read, which contains CK=1,5,9
- Data for 5 is cached, and misses for 2-4 and 6-8 are cached
If this were the case, then any updates to the other keys would be cached on write.
 
How large is your data? Total data size, average row size, average rows per partition?
It's quite small.

    Table: boe3
    SSTable count: 21
    SSTables in each level: [21/4]
    Space used (live): 3308337317
    Space used (total): 3308337317
    Space used by snapshots (total): 0
    Off heap memory used (total): 748563584
    SSTable Compression Ratio: 0.238434
    Number of partitions (estimate): 1160
    Memtable cell count: 113
    Memtable data size: 712089697
    Memtable off heap memory used: 748552192
    Memtable switch count: 9
    Local read count: 184100
    Local read latency: 74.225 ms
    Local write count: 5156189
    Local write latency: 0.011 ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bloom filter false positives: 33
    Bloom filter false ratio: 0.00006
    Bloom filter space used: 240
    Bloom filter off heap memory used: 224
    Index summary off heap memory used: 11168
    Compression metadata off heap memory used: 0
    Compacted partition minimum bytes: 30
    Compacted partition maximum bytes: 74975550
    Compacted partition mean bytes: 11820385
    Average live cells per slice (last five minutes): 0.0
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0 

Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 12:34:05 PM
to scylladb-users@googlegroups.com, Tomasz Grabiec, hor...@gmail.com
On 2/3/21 3:19 PM, Tomasz Grabiec wrote:

How large is your data? Total data size, average row size, average rows per partition?


Scylla was designed for workloads where the data is much larger than memory, and so page caching isn't effective.


We could handle such workloads better by using the spare memory for sstable block caches. That way the cache would warm up faster when the workload fits in memory.



https://github.com/scylladb/scylla/issues/363


For such a small table, we can also scan it during startup, based on a table setting (prewarm cache).


But if it's so small, then I expect it will be brought into cache after a short while. Maybe it's the lack of negative entries in some circumstances that prevents the cache from being effective.  Maybe we can convert a cache miss for a row or range into a dummy range/row.

Avi Kivity <avi@scylladb.com>
Feb 3, 2021, 12:41:01 PM
to scylladb-users@googlegroups.com, hor...@gmail.com
On 2/3/21 3:30 PM, hor...@gmail.com wrote:



On Wednesday, 3 February 2021 at 13:42:58 UTC+1 Avi Kivity wrote:
Scylla only reads the rows that were asked for (sometimes it has to over-read to align with sector boundaries, but it doesn't over-parse the sstables, so it never sees those extra rows).
In my special case it would be good if it could be configured to also parse & cache those over-read rows (more aggressive caching). Right now it is reading data from disk like crazy but throwing most of it away. But it might be a very special case :-)

Well, you can read all the data with a partition scan or a full scan at CL=ALL. Of course that's not a good solution.

yes :-)


Do try it, it's an interesting experiment. Also please share cache hit/miss statistics before and after.


 

If you read single rows (ck=5), a miss isn't cached. If you read a range (ck>=3 AND ck<=7) and later read a single row, the missing row will be detected in cache (if the cache has entries for ck=4 and ck=6, it also knows that there is nothing between them).

If at least the miss-information from the over-read were available, that would help a lot. Something like this:
- User requests CK=5
- A block of data is read, which contains CK=1,5,9
- Data for 5 is cached, and misses for 2-4 and 6-8 are cached
If this were the case, then any updates to the other keys would be cached on write.


Unfortunately this is counterproductive for other workloads. Because of the way data is spread across many sstables, a lot of work is needed to gather that information.


We might opportunistically over-read and cache, but for other workloads that would just evict stuff from the cache. We'd need some way for the user to describe what kind of locality to expect. For your case, do you have anything more specific than "cache everything"?


 
How large is your data? Total data size, average row size, average rows per partition?
It's quite small.

    Table: boe3
    SSTable count: 21
    SSTables in each level: [21/4]
    Space used (live): 3308337317
    Space used (total): 3308337317


How much memory do you have? After decompression this is about 14 GB (3,308,337,317 bytes at a compression ratio of 0.238434 ≈ 13.9 GB uncompressed), and the in-memory representation has significant overhead.


btw, Scylla Enterprise has an in-memory feature which keeps sstables mirrored in memory. This isn't a cache: the sstables are permanently memory-resident (as well as stored on disk).

