Scylla schema and query advice (200mil inserts/ttl's)

Jethro van Ginkel <jethrovanginkel@gmail.com>
Feb 24, 2021, 4:59:30 PM
to scylladb-users@googlegroups.com
Hi Guys,

I started a thread on Slack but got the advice to start a conversation here. 

Currently we have the following in our application that we are re-building:
CREATE TABLE IF NOT EXISTS t1 (
    i bigint,
    g text,
    h int,
    m text,
    PRIMARY KEY ((g, h), i)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': '1',
    'compaction_window_unit': 'DAYS',
    'tombstone_threshold': '0.2'
}
AND read_repair_chance = 0
AND dclocal_read_repair_chance = 0;

CREATE INDEX ON t1 (m);
i: in combination with g, this is a unique number
g: there are around 200,000 unique values for g, and each g value is on average 24 bytes
h: we created this in the past for better data distribution across nodes; if i is 0-999 then h is 0, if i is 1000-1999 then h is 1. This way we have more unique partition keys.

Queries:
# Insert with TTL, ttl is different based on the value of g (software gives the TTL value, TTL values are in the range of 300000-315360000)
INSERT INTO t1 (i, g, h, m) values (1, 'gvalue', 0, 'mvalue') USING TTL 300000

# In this case we know g and the i range and h we can calculate in the software
SELECT i, m FROM t1 WHERE g = ? AND h = ? AND i >= ? AND i <= ?

# same as above, as we know g and i and calculate h we can select m (should answer with one result or 0)
SELECT m FROM t1 WHERE g = ? AND h = ? AND i = ?

# We are creating the index for m for this query, in this situation we only know `m` and need to know i,g
SELECT i, g FROM t1 WHERE m = ? limit 1

# With this query we want to search for the current lowest value of i (as there is an expiry TTL the lowest value will change in time)
SELECT i FROM t1 WHERE g=? AND h=? limit 1
What is very important is that we need to be as space efficient as possible. We are inserting more than 200,000,000 rows a day, so that's a lot of data.

My main questions are:
1. Is the h solution a smart thing to do? Or do you see a better way?
2. I read that mixing different TTL values in the same table is not good practice? (TTL values are in the range of 300,000-315,360,000)
2.1 We have 200,000 unique g values, so creating a table for every g value isn't going to work I think?
2.2 If we can't mix different TTLs in one table we might be able to create a handful of tables like lowttl, midttl, longttl, or do you know a better way?
3. Is there a better / smarter way than we do right now?
4. Is the index on m the best solution (also disk space wise)? Or could we better create a separate table with m as the primary key and a `g`+`i` row so we can do lookups?

Thanks,

Jethro

Kane Wilson <k@raft.so>
Feb 25, 2021, 12:44:28 AM
to scylladb-users@googlegroups.com
Hi Jethro,
 
1. Is the h solution a smart thing to do? Or do you see a better way?
The bucketing of partitions via h is fine in theory, as long as you are prepared to do some work in the event the bucket size changes.
 
2. I read that mixing different TTL values in the same table is not good practice? (TTL values are in the range from 300,000-315,360,000)
TWCS with mixed TTLs will likely cause storage issues, as SSTables are aged off rather than compacted away, and an SSTable with mixed TTLs won't be aged off until every row in that SSTable has expired. This could mean that an SSTable containing a row with a TTL of 300,000 and another row with a TTL of 315,360,000 will hang around until the row with the longer TTL expires.

2.1 We have 200000 unique g values so creating a table for every g value isn’t going to work I think?
No, creating many tables is generally not a good idea, nor easy to manage and maintain. Each table has a memory overhead as well, so you'll be wasting a lot of memory unnecessarily. Plus I imagine it would make development incredibly painful. Not to mention you'll still need to bucket your partitions anyway.

2.2 If we can't mix different TTLs in one table we might be able to create a handful of tables like lowttl, midttl, longttl, or do you know a better way?
If you can split your TTLs into separate tables, that'll make TWCS a bit more effective; however, if your data isn't a time series you're still best off avoiding TWCS. Updates to old rows and deletions will be very storage inefficient, as SSTables containing the same row may never be compacted together because of the nature of the windows. If you're not going to do updates/deletions, then a table per TTL and TWCS may work well for this model. Some updates are also OK, as long as you can withstand the performance impact on the read side in situations where updates occur across many time windows (more updates across time windows = more SSTables that need to be merged to read the row).

3. Is there a better / smarter way than we do right now?
Generally it will depend on your data, but it's hard to come up with the best method without experimentation and a lot of thought. The obvious options would be either TWCS with multiple tables and a TTL per table (taking into account my caveats around updates/deletions), or a mixed TTL table with STCS. Given your write load and space constraints I'd explore the TWCS option first.

4. Is the index on m the best solution (also disk space wise)? Or can we better create a separated table with m as the primary key and a `g`+`i` row so we can do lookups? 
I'm not terribly familiar with the ins and outs of indexes in Scylla, but you definitely want to be careful with indexes. You'll want to consider how many unique values m will have, as you'll have the same partitioning issues in regards to data distribution.
Again, experimentation with the index would be necessary to see if it can work and will be performant enough.

Cheers,
Kane

raft.so - Cassandra consulting, support, and managed services



Avi Kivity <avi@scylladb.com>
Feb 25, 2021, 9:23:49 AM
to scylladb-users@googlegroups.com, Jethro van Ginkel


On 24/02/2021 23.59, Jethro van Ginkel wrote:
Hi Guys,

I started a thread on Slack but got the advice to start a conversation here. 

Currently we have the following in our application that we are re-building:
CREATE TABLE IF NOT EXISTS t1 (
    i bigint,
    g text,
    h int,
    m text,
    PRIMARY KEY ((g, h), i)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': '1',
    'compaction_window_unit': 'DAYS',
    'tombstone_threshold': '0.2'
}
AND read_repair_chance = 0
AND dclocal_read_repair_chance = 0;

CREATE INDEX ON t1 (m);
i: in combination with g, this is a unique number
g: there are around 200,000 unique values for g, and each g value is on average 24 bytes
h: we created this in the past for better data distribution across nodes; if i is 0-999 then h is 0, if i is 1000-1999 then h is 1. This way we have more unique partition keys.

Queries:
# Insert with TTL, ttl is different based on the value of g (software gives the TTL value, TTL values are in the range of 300000-315360000)
INSERT INTO t1 (i, g, h, m) values (1, 'gvalue', 0, 'mvalue') USING TTL 300000


Variable TTL is really a poor fit for Time Window compaction. Moreover, the upper bound of your TTL range is 10 years, that gives 3650 1-day windows, far too much.


# In this case we know g and the i range and h we can calculate in the software 
SELECT i, m FROM t1 WHERE g = ? AND h = ? AND i >= ? AND i <= ?

# same as above, as we know g and i and calculate h we can select m (should answer with one result or 0)
SELECT m FROM t1 WHERE g = ? AND h = ? AND i = ?

# We are creating the index for m for this query, in this situation we only know `m` and need to know i,g 
SELECT i, g FROM t1 WHERE m = ? limit 1


Hopefully, there are many m values (few collisions), otherwise this will not distribute well.


# With this query we want to search for the current lowest value of i (as there is an expiry TTL the lowest value will change in time)
SELECT i FROM t1 WHERE g=? AND h=? limit 1
What is very important is that we need to be as space efficient as possible. We are inserting more than 200,000,000 rows a day, so that's a lot of data.

My main questions are:
1. Is the h solution a smart thing to do? Or do you see a better way?


If the number of g values won't grow beyond 200,000, keep it. If the number of unique g values grows, I recommend dropping it for simplicity.


200M rows/day is roughly 2,300 rows/sec on average. That's not a lot, and even if the insert job takes just two hours out of the entire day, that's still under 28,000 rows/sec.


2. I read that mixing different TTL values in the same table is not good practice? (TTL values are in the range from 300,000-315,360,000)


It's fine for Size Tiered or Leveled, but not TWCS.


2.1 We have 200000 unique g values so creating a table for every g value isn’t going to work I think?


No, it will blow up very quickly.


2.2 If we can’t mix different TTL’s in one table we might be able to create a handful of tables like lowttl, midttl, longttl or do you know a better way?


I recommend not splitting tables artificially (it will mean you need to query each table/index separately). Just avoid TWCS with mixed TTLs.


3. Is there a better / smarter way than we do right now?
4. Is the index on m the best solution (also disk space wise)? Or could we better create a separate table with m as the primary key and a `g`+`i` row so we can do lookups?


Under the hood Scylla creates a table with m as the primary key and (g, h, i) as the clustering key, so you won't get much out of it.


Just make sure m has high cardinality.
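For comparison, a hand-rolled lookup table would look roughly like the sketch below; the table name t1_by_m is purely illustrative, and this is essentially the structure the index already maintains for you:

-- Hypothetical manual equivalent of the index on m (name is illustrative):
CREATE TABLE IF NOT EXISTS t1_by_m (
    m text,
    g text,
    h int,
    i bigint,
    PRIMARY KEY (m, g, h, i)
);

-- Lookup when only m is known:
SELECT i, g FROM t1_by_m WHERE m = ? LIMIT 1;

Note that with a manual table you would also have to apply the matching TTL on every write yourself, which the index handles automatically.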

Jethro van Ginkel <jethrovanginkel@gmail.com>
Mar 1, 2021, 4:31:15 AM
to Avi Kivity, scylladb-users@googlegroups.com
First of all thank you both for your detailed answers!



On 25 Feb 2021, at 15:23, Avi Kivity <a...@scylladb.com> wrote:

Variable TTL is really a poor fit for Time Window compaction. Moreover, the upper bound of your TTL range is 10 years, that gives 3650 1-day windows, far too much.

Could this be the reason why my current ScyllaDB cluster is overloaded with compactions? It has been running for a few years now with TWCS instead of Size Tiered or Leveled, and for the last 6-8 months it has been really overloaded.

And is it possible to change it to Size Tiered or Leveled in an existing environment? And what would the impact be? (As the cluster is already overloaded.)

Hopefully, there are many m values (few collisions), otherwise this will not distribute well.

Yes, all `m` values are unique!


# With this query we want to search for the current lowest value of i (as there is an expiry TTL the lowest value will change in time)
SELECT i FROM t1 WHERE g=? AND h=? limit 1
What is very important is that we need to be as space efficient as possible. We are inserting more than 200,000,000 rows a day, so that's a lot of data.

My main questions are:
1. Is the h solution a smart thing to do? Or do you see a better way?


If the number of g values won't grow beyond 200,000, keep it. If the number of unique g values grows, I recommend dropping it for simplicity.


All g values are unique and the count is not really growing, so keeping `h` is a good thing. What do you think about the h value range? Is it better to have a larger or smaller value for h? I can set the bucket size (for calculating h) in the software. In our current cluster we have this at 1,000, which will create:

(('Gvalue1', 0), 1-999)      (so we have the combination of 'Gvalue1', 0 up to 1000 times, then it goes to)
(('Gvalue1', 1), 1000-1999)  (1000 of these)
(('Gvalue1', 2), 2000-2999)  (1000 of these)
And so on.

If we use 10,000 we get:
(('Gvalue1', 0), 1-9999)       (so we have the combination of 'Gvalue1', 0 up to 10000 times, then it goes to)
(('Gvalue1', 1), 10000-19999)  (10000 of these)
(('Gvalue1', 2), 20000-29999)  (10000 of these)
And so on.


Once again thank you for the help!

Jethro

Avi Kivity <avi@scylladb.com>
Mar 1, 2021, 5:17:08 AM
to Jethro van Ginkel, Raphael Carvalho, scylladb-users@googlegroups.com


On 01/03/2021 11.31, Jethro van Ginkel wrote:
First of all thank you both for your detailed answers!



On 25 Feb 2021, at 15:23, Avi Kivity <a...@scylladb.com> wrote:

Variable TTL is really a poor fit for Time Window compaction. Moreover, the upper bound of your TTL range is 10 years, that gives 3650 1-day windows, far too much.

Could this be the reason why my current ScyllaDB cluster is overloaded with compactions? It has been running for a few years now with TWCS instead of Size Tiered or Leveled, and for the last 6-8 months it has been really overloaded.


The 3650 windows are bad, but don't explain the compaction overload. Maybe variable TTL does. Raphael?


And is it possible to change it to Size Tiered or Leveled in an existing environment? And what would the impact be? (As the cluster is already overloaded.)


It's possible (with ALTER TABLE). The cluster will start compacting to reshape the data to fit size-tiered; while this is happening, reads can be slower. Scylla will also require more free space.
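For reference, the change itself is a single statement per table; a minimal sketch, assuming the t1 table from this thread and Size Tiered as the target:

ALTER TABLE t1 WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };

-- or, if you go for Leveled instead:
ALTER TABLE t1 WITH compaction = { 'class': 'LeveledCompactionStrategy' };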


Hopefully, there are many m values (few collisions), otherwise this will not distribute well.

Yes, all `m` values are unique!


# With this query we want to search for the current lowest value of i (as there is an expiry TTL the lowest value will change in time)
SELECT i FROM t1 WHERE g=? AND h=? limit 1
What is very important is that we need to be as space efficient as possible. We are inserting more than 200,000,000 rows a day, so that's a lot of data.

My main questions are:
1. Is the h solution a smart thing to do? Or do you see a better way?


If the number of g values won't grow beyond 200,000, keep it. If the number of unique g values grows, I recommend dropping it for simplicity.


All g values are unique and the count is not really growing, so keeping `h` is a good thing. What do you think about the h value range? Is it better to have a larger or smaller value for h?


Smaller ranges (more 'h') lead to better distribution, but if you want to read all rows for a given 'g' (do you?) then it leads to more queries. So it depends on the ratio of reads and writes. See more below.


I can set the bucket size (for calculating h) in the software. In our current cluster we have this at 1,000, which will create:

(('Gvalue1', 0), 1-999)      (so we have the combination of 'Gvalue1', 0 up to 1000 times, then it goes to)
(('Gvalue1', 1), 1000-1999)  (1000 of these)
(('Gvalue1', 2), 2000-2999)  (1000 of these)
And so on.

If we use 10,000 we get:
(('Gvalue1', 0), 1-9999)       (so we have the combination of 'Gvalue1', 0 up to 10000 times, then it goes to)
(('Gvalue1', 1), 10000-19999)  (10000 of these)
(('Gvalue1', 2), 20000-29999)  (10000 of these)
And so on.


How many rows do you have overall?


Guessing from 200M rows/day and assuming an average TTL of 2000 days, that's about 400G rows. That's more than enough for any sized cluster.


200,000 partitions (no 'h') is okay for a small-to-medium cluster (1,000 vcpus total) but will start showing bad distribution at larger cluster sizes. So if you plan to keep the cluster size below ~10 nodes, even removing h may work. In general I dislike introducing artificial bucketing, but in your case I think it's fine and you can keep it.

Raphael S. Carvalho <raphaelsc@scylladb.com>
Mar 1, 2021, 12:05:50 PM
to Avi Kivity, Jethro van Ginkel, ScyllaDB users
On Mon, Mar 1, 2021 at 7:17 AM Avi Kivity <a...@scylladb.com> wrote:
>
>
> On 01/03/2021 11.31, Jethro van Ginkel wrote:
>
> First of all thank you both for your detailed answers!
>
>
>
> On 25 Feb 2021, at 15:23, Avi Kivity <a...@scylladb.com> wrote:
>
> Variable TTL is really a poor fit for Time Window compaction. Moreover, the upper bound of your TTL range is 10 years, that gives 3650 1-day windows, far too much.
>
> Can this be the reason why my current ScyllaDB cluster is overloaded with compactions? As it’s running now for a few years with TWCS instead of Size Tiered or Leveled and since 6-8 months it’s really overloaded?
>
>
> The 3650 windows are bad, but don't explain compaction. Maybe variable TTL does. Raphael?

That's correct. The TTL range in a given window is too big and
tombstone compaction with a threshold of 20% was enabled. That means
whenever a window has at least 20% of expired data, that same window
will be recompacted, leading to terrible write amplification.

TTLs can be mixed only if the TTL range within a given window is not big, so that you can disable tombstone compaction and can afford to keep the data with the slightly lower TTL around until all data in the window has expired. Remember that TWCS's good performance comes from being able to easily purge a window once all of its data has expired. If you have to incrementally purge a window, as with tombstone compaction, you'll end up with bad write performance.

Perhaps it's impossible for you to create one table for each possible TTL value, but you may be able to divide your TTL range into N subranges, where each subrange is assigned its own table.
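As an illustration only (the table names, split point, and window sizes below are hypothetical, not tuned recommendations), such a split could look like:

-- Hypothetical split of the TTL range into two subranges:
CREATE TABLE IF NOT EXISTS t1_short_ttl (    -- e.g. TTLs up to ~30 days
    i bigint,
    g text,
    h int,
    m text,
    PRIMARY KEY ((g, h), i)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': '1',
    'compaction_window_unit': 'DAYS'
};

CREATE TABLE IF NOT EXISTS t1_long_ttl (     -- e.g. TTLs of months to years
    i bigint,
    g text,
    h int,
    m text,
    PRIMARY KEY ((g, h), i)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': '30',
    'compaction_window_unit': 'DAYS'
};

The application would then route each write to the table matching its TTL, keeping the TTL spread within any one window small.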

As a temporary workaround for your write amplification issue, I'd
recommend increasing your tombstone_threshold to 0.5 at least, as 0.2
is too low and definitely will hurt your cluster.
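That can be changed in place; note that ALTER TABLE replaces the whole compaction map, so the existing TWCS options need to be repeated. A sketch, assuming the current 1-day windows:

ALTER TABLE t1 WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': '1',
    'compaction_window_unit': 'DAYS',
    'tombstone_threshold': '0.5'
};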