Compaction Strategy


samuel benhamou

Jan 18, 2017, 8:30:02 AM
to KairosDB
Hi,

From what I saw on the Cassandra side, the compaction strategy used is STCS. Given the type of data Kairos is handling, has anyone moved to TWCS? Would this eventually be configurable on the Kairos side when the SSTables are created, or would I have to modify it afterwards with cqlsh/cassandra-cli?

Sam

Brian Hawkins

Jan 18, 2017, 11:09:08 AM
to KairosDB
From a conceptual viewpoint TWCS is ideal.  I've not yet had a chance to put it into a production system.  The only problem with TWCS is that it is not part of the C* distribution, so you have to install it manually (unless something has changed that I'm not aware of).

The compaction strategy can be set when the schema is created, but Kairos currently does not do so.  For now you will need to set it using cqlsh.
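For reference, such a cqlsh statement might look like this (a sketch only; the keyspace name and window settings here are placeholders for your own schema):

```sql
ALTER TABLE kairosdb.data_points
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1' };
```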

Brian

Daniel Hopkins

Jan 18, 2017, 1:01:56 PM
to KairosDB
Brian, this is no longer the case; TWCS is now included by default (replacing DTCS). I don't recall at what point they started including it.

Brian Hawkins

Jan 18, 2017, 2:37:28 PM
to KairosDB
Holy Crap!  You are right, I just found it in C* 3.9 :)  Oh happy day.  And it is documented here: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfigureCompaction.html

Brian

samuel benhamou

Jan 19, 2017, 11:49:28 AM
to KairosDB
Yup,

it was apparently introduced in 3.0.8 and 3.8: https://issues.apache.org/jira/browse/CASSANDRA-9666
Would you then plan to set it at creation time or let users handle it via cqlsh?

Regards

Brian Hawkins

Jan 19, 2017, 11:30:00 PM
to KairosDB
Unless someone has an argument against it, I'll add an issue on GitHub to make TWCS the default compaction when creating the schema.

Brian

Brian Hawkins

Jan 19, 2017, 11:31:41 PM
to KairosDB
On second thought, some people may still be installing this on 2.2, which would cause a problem at startup if I specify the compaction as TWCS.

Brian

Daniel Hopkins

Jan 23, 2017, 8:29:46 PM
to KairosDB
Perhaps add a 'config' flag for first init, or even better, maybe there is a way to determine the version of Cassandra when first connecting to the cluster :)

Complete side note:
Does it make sense to add (perhaps configurable) a TTL value for the Row Index/String Index columns if the user has configured a TTL on the data points table? I.e., if my data cannot exist for more than 2 months, would TTL'ing out the index make sense? Would that get rid of unused tags/values after the TTL?

Brian Hawkins

Jan 24, 2017, 6:34:59 PM
to KairosDB
I do TTL the row key index.  The TTL is bumped out so that the row key doesn't disappear until after all the values do.
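A minimal sketch of that bumping logic (illustrative only; the 3-week figure matches KairosDB's default row width, but the function and arithmetic here are an assumption, not the actual Kairos code):

```python
# Illustrative sketch: make the row key index entry outlive every data
# point in its row. KairosDB partitions data points into rows that each
# cover 3 weeks of time (the default row width).
ROW_WIDTH_SECONDS = 3 * 7 * 24 * 60 * 60  # 1,814,400 s

def row_key_ttl(datapoint_ttl_seconds: int) -> int:
    """TTL to apply to the row key index entry (hypothetical helper).

    The last data point in a row can be written up to ROW_WIDTH_SECONDS
    after the row starts, so the key must survive at least that much
    longer than any single data point's TTL.
    """
    return datapoint_ttl_seconds + ROW_WIDTH_SECONDS

# Example: a 30-day data point TTL keeps the row key for ~51 days.
print(row_key_ttl(30 * 24 * 60 * 60))  # 2592000 + 1814400 = 4406400
```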

Brian

Riley Zimmerman

Jan 27, 2017, 10:58:15 AM
to KairosDB
I'm starting to experiment with TWCS.  Just wanted to confirm this should be all I need to do:

ALTER TABLE metricdb.data_points WITH compaction = { 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1', 'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy' };
ALTER TABLE metricdb.row_key_index WITH compaction = { 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1', 'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy' };
ALTER TABLE metricdb.string_index WITH compaction = { 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1', 'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy' };

I've picked the default 1 day for now since I'm keeping 1 month of raw data.  I'm not sure yet how summarized data will work with TWCS when I add that to our KairosDB though.  Still reading up on that.  

Brian Hawkins

Jan 27, 2017, 9:07:12 PM
to KairosDB
The only one you need to change to TWCS is data_points, I'd set the others to leveled compaction.
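Applied to the tables above, that split might look like this (a sketch; table names taken from Riley's metricdb keyspace):

```sql
-- data_points keeps TWCS (time-ordered, TTL'd data);
-- the comparatively small index tables get leveled compaction instead.
ALTER TABLE metricdb.row_key_index
WITH compaction = { 'class': 'LeveledCompactionStrategy' };

ALTER TABLE metricdb.string_index
WITH compaction = { 'class': 'LeveledCompactionStrategy' };
```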

Brian

Riley Zimmerman

Jan 30, 2017, 10:17:50 AM
to KairosDB
Thanks Brian, I'll just do metricdb.data_points. 

I've done some more investigation on KairosDB Roll-Ups and I am concerned about how they would work with TWCS.  
What I don't understand is: if you have roll-ups with different TTLs than the raw data, will the longer-lived roll-ups prevent the pruning of the oldest SSTables with TWCS?

For example, I insert raw data with a specified TTL of 30 days.  I have kairosdb.datastore.cassandra.datapoint_ttl set to one year since for now roll-ups cannot have their own TTL (https://github.com/kairosdb/kairosdb/issues/332).  I set my TWCS to 1 day.  However, won't the 1 year TTL roll-up metrics be in the same SSTables and prevent the pruning of the 30+ day data?  

Could/should the roll-ups be stored in something other than metricdb.data_points which could then have a different TWCS setting?  

Brian Hawkins

Jan 30, 2017, 10:58:18 AM
to KairosDB
Good points, and I've been meaning to do some investigation.  I know that with Leveled Compaction, SSTables are added as candidates for compaction based on stats about the table (i.e. how much of the table will disappear because of TTL).  I'm assuming that the same is true for TWCS: basically, files older than 30 days will become candidates for compaction because of the expired data.  Like I said, it needs to be tested.

Multiple needs have arisen for writing data to multiple column families.  Rollups is one of them, so it is on the radar.

Brian

Riley Zimmerman

Jan 30, 2017, 11:19:41 AM
to KairosDB
https://groups.google.com/forum/#!topic/nosql-databases/KiKVqD0Oe98
""never compacted" has an implicit asterisk referencing tombstone_compaction_interval and tombstone_threshold, sounds like.  More of a "never compacted" via strategy selection, but eligible for tombstone-triggered compaction."
"With the caveat that tombstone compactions are disabled by default in TWCS (and DTCS)"

I think they could get deleted by tombstone compaction if it were re-enabled properly for TWCS?  Not positive.  But it is definitely safer to break up the roll-up data so it can have different TWCS settings and TTLs.
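For reference, those single-SSTable tombstone compactions can be re-enabled via the unchecked_tombstone_compaction subproperty (a sketch; whether this is advisable for a given workload is exactly the open question above):

```sql
ALTER TABLE metricdb.data_points
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1',
    'unchecked_tombstone_compaction': 'true' };
```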

Riley Zimmerman

Feb 1, 2017, 12:48:38 PM
to KairosDB
Has anyone else tried to switch KairosDB to TWCS?  I altered the compaction to 5 minute windows and started watching my metricdb.data_points Data.db files.  It is still doing STCS with a min_threshold of 4.  I checked the tables with DESCRIBE SCHEMA and I can see that it is set to TWCS, but it has the max and min threshold variables in the compaction line even though I didn't specify them in the ALTER command.  

    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '5', 'compaction_window_unit': 'MINUTES', 'max_threshold': '32', 'min_threshold': '4'}

I stopped KairosDB and dropped metricdb.data_points.  I then recreated the table specifying TWCS, but it still added back the min/max threshold variables and is using STCS.  I am using cassandra 3.9.  I feel like I must be missing something basic here.  

ALTER TABLE metricdb.data_points WITH  compaction = { 'compaction_window_unit': 'MINUTES', 'compaction_window_size': '5', 'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy' };

Daniel Hopkins

Feb 1, 2017, 7:51:18 PM
to KairosDB
I'm not sure if this will answer your question, but from my understanding STCS will continue to be used for the 'current window' (I would guess this means the min of 4 minutes), but after that it should start using TWCS.  Are you saying that after the threshold it continues to use STCS?  Perhaps our understanding of this min_threshold is wrong.

Brian Hawkins

Feb 1, 2017, 10:29:54 PM
to KairosDB
I don't have enough experience with it to answer, sorry.

Riley Zimmerman

Feb 2, 2017, 8:23:56 AM
to KairosDB
Thanks Daniel, but unfortunately I was already expecting the STCS behavior during the current window.  
I was originally doing 1 day windows, so I wasn't surprised when most of the SSTable .db files were moving around based on STCS.  After 3 days I had 3 big files, although they were spaced by ~18 hours instead of 24.  I figured this was just part of the varying time to do compaction and get one of the two compaction threads.  But then it started to merge the "~daily" files and I knew something was wrong.
I reset my env, deleted the data_points table and recreated it with TWCS with 10 minute windows.  I also set my TTL to 1 hour.  I see absolutely nothing based on 10 minute windows.  It's still just grouping by sets of 4.  I'll look around on the TWCS forums for more info.  

My disk util is 20~30%, CPU < 50%.  The reads are 3~4MB/sec, which is well below the 16MB/sec Cassandra compaction throughput limit.  I'll try more than the default 2 compaction threads, but I don't always see 2 threads running with `nodetool tpstats | grep CompactionExecutor`, so I don't think that's the issue.

Jan Mussler

Feb 6, 2017, 7:41:09 AM
to KairosDB
As pointed out in another thread:

https://groups.google.com/forum/?hl=en-GB#!searchin/kairosdb-group/twcs|sort:relevance/kairosdb-group/w54mg_FperI/iBsiakJ3DAAJ

We run quite happily with the TWCS like this:

    AND compaction = {'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '1440', 'compaction_window_unit': 'MINUTES', 'max_threshold': '32', 'min_threshold': '4'}

You might have different needs, but right now it works quite well, giving us one file per day, which also makes it convenient to clean up old data manually if needed ;) This also holds true for new nodes: after streaming you get files on a per-day basis. The only issue is that you cannot really work with the last modified / created times at the file level.

All other compaction strategies ended up behaving badly, doing unnecessary compactions that touch lots of data.

Riley Zimmerman

Feb 7, 2017, 7:58:41 AM
to KairosDB
Thanks Jan!  That's reassuring that it can work for others, so I'll push ahead with trying to get it working for me.  

I'm using the TimeWindowCompactionStrategy.java that was included in the main Cassandra 3.8 and later builds (I have 3.9).  I doubt something could have broken between the jeffjirsa version and the main build, but I'll look into that possibility.  I've confirmed in the cassandra debug.logs that TWCS is running.  I see messages in the logs like these samples I've pulled out:

DEBUG [CompactionExecutor:70] 2017-02-07 04:34:49,439 TimeWindowCompactionStrategy.java:300 - No compaction necessary for bucket size 3 , key 1483200000, now 1483200000
DEBUG [CompactionExecutor:76] 2017-02-07 07:51:06,102 TimeWindowCompactionStrategy.java:111 - TWCS skipping check for fully expired SSTables
DEBUG [CompactionExecutor:76] 2017-02-07 07:51:06,102 TimeWindowCompactionStrategy.java:286 - Using STCS compaction for first window of bucket: data files...
DEBUG [CompactionExecutor:72] 2017-02-07 07:54:31,173 TimeWindowCompactionStrategy.java:105 - TWCS expired check sufficiently far in the past, checking for fully expired SSTables

else if (bucket.size() >= 2 && key < now)
{
logger.debug("bucket size {} >= 2 and not in current bucket, compacting what's here: {}", bucket.size(), bucket);
return trimToThreshold(bucket, maxThreshold);
}

Here is my DESCRIBE SCHEMA:

CREATE TABLE metricdb.data_points (
    key blob,
    column1 blob,
    value blob,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (column1 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '60', 'compaction_window_unit': 'MINUTES', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = 'NONE';

And here is what my data files look like.  Note, my compaction_window_size is 60 minutes.  After running for a day, I expected one file for each hour, but instead it's still merging them into STCS groups.  

[root@itm3650c data_points-53f5ef70ecae11e6a39505e3aa874c40]# ls -laht *-Data.*
-rw-r--r-- 1 root root  23M Feb  7 07:41 mc-1396-big-Data.db
-rw-r--r-- 1 root root  25M Feb  7 07:40 mc-1395-big-Data.db
-rw-r--r-- 1 root root  66M Feb  7 07:40 mc-1394-big-Data.db
-rw-r--r-- 1 root root 750M Feb  7 07:37 mc-1379-big-Data.db
-rw-r--r-- 1 root root  65M Feb  7 07:33 mc-1385-big-Data.db
-rw-r--r-- 1 root root 758M Feb  7 06:05 mc-1242-big-Data.db
-rw-r--r-- 1 root root 3.0G Feb  7 05:01 mc-1117-big-Data.db
-rw-r--r-- 1 root root 3.0G Feb  6 22:49 mc-568-big-Data.db


I think I've narrowed down my issue to this:
DEBUG [CompactionExecutor:2] 2017-02-06 15:55:28,283 TimeWindowCompactionStrategy.java:300 - No compaction necessary for bucket size 2 , key 1483200000, now 1483200000
...
DEBUG [CompactionExecutor:76] 2017-02-07 07:33:31,169 TimeWindowCompactionStrategy.java:300 - No compaction necessary for bucket size 3 , key 1483200000, now 1483200000

Whatever I'm doing, it isn't incrementing the key and now variables.  And as shown in the code snippet above, key < now is needed to trigger the trimToThreshold(bucket, maxThreshold)
I'll keep looking at the TimeWindowCompactionStrategy.java and see if it gives me any clues as to what is going wrong for me.  

I'm inserting 0.5 million records per minute through KairosDB.  I'm watching my compaction threads with  `/opt/ibm/cassandra/bin/nodetool tpstats | grep CompactionExecutor` and they are not always busy.  I'm very good on resource usage.  

Riley Zimmerman

Feb 7, 2017, 8:59:50 AM
to KairosDB
TWCS uses the "Maximum timestamp" of the SSTable Data files.  I took a look at mine:

for file in `ls -tr *Data.db`; do echo "`ls -ltr $file` `/opt/cassandra/tools/bin/sstablemetadata $file | grep Max`" ;done
-rw-r--r-- 1 root root 3180125875 Feb  6 22:49 mc-568-big-Data.db Maximum timestamp: 1486436960093
-rw-r--r-- 1 root root 3162960688 Feb  7 05:01 mc-1117-big-Data.db Maximum timestamp: 1486459296087
-rw-r--r-- 1 root root 794093458 Feb  7 06:05 mc-1242-big-Data.db Maximum timestamp: 1486464860517
-rw-r--r-- 1 root root 785767503 Feb  7 07:37 mc-1379-big-Data.db Maximum timestamp: 1486470415256
-rw-r--r-- 1 root root 203838215 Feb  7 07:52 mc-1409-big-Data.db Maximum timestamp: 1486471757321
-rw-r--r-- 1 root root 226126507 Feb  7 08:17 mc-1446-big-Data.db Maximum timestamp: 1486473265038
-rw-r--r-- 1 root root 205173246 Feb  7 08:36 mc-1476-big-Data.db Maximum timestamp: 1486474381351
-rw-r--r-- 1 root root 58930627 Feb  7 08:42 mc-1501-big-Data.db Maximum timestamp: 1486474870376
-rw-r--r-- 1 root root 153191477 Feb  7 08:42 mc-1497-big-Data.db Maximum timestamp: 1486474777129
-rw-r--r-- 1 root root 64450900 Feb  7 08:44 mc-1507-big-Data.db Maximum timestamp: 1486474983024
-rw-r--r-- 1 root root 65247089 Feb  7 08:46 mc-1512-big-Data.db Maximum timestamp: 1486475097291
-rw-r--r-- 1 root root 17261855 Feb  7 08:47 mc-1518-big-Data.db Maximum timestamp: 1486475223302
-rw-r--r-- 1 root root 30864815 Feb  7 08:47 mc-1519-big-Data.db Maximum timestamp: 1486475261049
-rw-r--r-- 1 root root 64050149 Feb  7 08:47 mc-1517-big-Data.db Maximum timestamp: 1486475205672
-rw-r--r-- 1 root root 22058612 Feb  7 08:48 mc-1521-big-Data.db Maximum timestamp: 1486475284387
-rw-r--r-- 1 root root 467132107 Feb  7 08:49 mc-1505-big-Data.db Maximum timestamp: 9223372036854775807
-rw-r--r-- 1 root root 67389513 Feb  7 08:49 mc-1520-big-Data.db Maximum timestamp: 9223372036854775807

It's interesting to me that here the timestamps have millisecond granularity (ignoring the last two, which I think were still being written during the query).  In my Cassandra debug.log, the TWCS messages point to a key and now value of 1483200000.  I wonder if it is confused by the difference in granularity and, if so, why that is happening for me and not others.
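To see why the granularity would matter, here is a rough Python model of the bucketing (an assumption about TWCS internals based on this thread, not Cassandra's actual code): TWCS converts each SSTable's max timestamp to milliseconds using the configured timestamp_resolution, then floors it to the window boundary.

```python
def window_start_ms(max_timestamp: int, resolution: str, window_minutes: int) -> int:
    """Rough model of TWCS bucketing: convert the SSTable's max timestamp
    to milliseconds under the assumed resolution, then floor to the window."""
    if resolution == "MICROSECONDS":      # TWCS's default assumption
        ts_ms = max_timestamp // 1000
    elif resolution == "MILLISECONDS":
        ts_ms = max_timestamp
    else:
        raise ValueError(resolution)
    window_ms = window_minutes * 60 * 1000
    return ts_ms - (ts_ms % window_ms)

# Two SSTables whose max timestamps (actually in *milliseconds*) are ~10 hours apart:
a, b = 1486436960093, 1486473265038

# Interpreted as microseconds (the default), both collapse into one bucket
# near the epoch, so key == now and STCS runs forever:
print(window_start_ms(a, "MICROSECONDS", 10) == window_start_ms(b, "MICROSECONDS", 10))  # True

# Interpreted correctly as milliseconds, they land in different 10-minute windows:
print(window_start_ms(a, "MILLISECONDS", 10) != window_start_ms(b, "MILLISECONDS", 10))  # True
```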

If anyone else using TWCS with KairosDB would be able to check what their timestamps look like I would greatly appreciate it.  I'll keep investigating!  

Brian Hawkins

Feb 7, 2017, 3:10:46 PM
to KairosDB
This may be the problem.  In the develop branch, CassandraDatastore.java on line 292 there is this: long now = System.currentTimeMillis();

That is the timestamp I'm passing to C* when inserting data.  As far as I knew it was only used for conflict resolution when reading data.  I wonder if TWCS uses that timestamp as a means to know what time bucket it places the data in.  Some of you may recognize that what C* wants is nanos not millis.  So changing that line to long now = System.nanoTime(); Should resolve your problem if this is indeed the case.

If that fixes the problem then we need to look at that code again as this brings up a bigger issue.  When you insert older data should it go in the current compaction window or the one appropriate for the time on the data?  Rollups for example, where should they go?

Keep us posted on your progress.

Brian

Riley Zimmerman

Feb 7, 2017, 5:49:12 PM
to KairosDB
Thanks Brian, that got me to the answer!  By default, TWCS is actually looking for MICROSECONDS, not milli or nano. Thankfully there's an option to change what TWCS is looking for, so no KairosDB code change needed.  My test system is compacting to 10 minute windows now (I'll experiment with bigger now that it's working).  

ALTER TABLE metricdb.data_points WITH  compaction = { 'class':'TimeWindowCompactionStrategy', 'timestamp_resolution': 'MILLISECONDS', 'compaction_window_unit': 'MINUTES', 'compaction_window_size': '10', 'max_threshold': '32', 'min_threshold': '4' };

For your question about old data, TWCS's docs state that it does not work well at all with "old" data.  I've read that for adding old data you'd need to stop incoming real time data, manually compact, insert your old data, manually compact, then start back up the live data.  That would isolate the old data to its own table(s).  

https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataMaintain.html

Pros: Used for time series data, stored in tables that use the default TTL for all data. Simpler configuration than that of DTCS.

Cons: Not appropriate if out-of-sequence time data is required, since SSTables will not compact as well. Also, not appropriate for data without a TTL, as storage will grow without bound. Less fine-tuned configuration is possible than with DTCS.


Roll-ups are the bigger concern IMHO.  I'll see what it does when you start mixing TTLs.  I think the best option would be to move roll-ups outside of the data_points so they could have a different compaction_window_size, but I'll see if that's required or just an optimization.  

Jan Mussler

Feb 8, 2017, 1:48:53 PM
to KairosDB
TWCS is not only simpler to configure; in my experience it is the only compaction strategy where Cassandra will reliably stop touching data, which is what you ultimately want to achieve. With the other compaction strategies, including DTCS, there was always another merge happening, albeit later. (And they all share the problem that Cassandra's write timestamp is not related to the data's timestamp; overall it is a waste of space that every data point carries two timestamps ;-))

Going back to the original compaction idea, that you should get as few SSTable files as possible for your reads, IMHO setting it to one file per hour as a final bucket size sounds too optimistic. For us, users tend to max out queries from Grafana at the last 24h, so this means we hit yesterday's single file plus today's intra-day files. But I am curious about numbers ;-)

We did not run into the described issues, as we have basically been running our fork with CQL for quite a while, and we also write an additional row key index for improved lookup speed on selected tags. Sadly we have by now diverged quite a lot in the internals (not the API) and will maybe take a look once KairosDB itself gets a CQL implementation. We also use a different bucket size: three days instead of three weeks, to prune non-relevant tags faster.

Brian Hawkins

Feb 8, 2017, 3:00:24 PM
to KairosDB
Yes, before I finish what I'm doing I need to look at your branch and compare notes.  I am curious about your extra index and how you went about doing it.  I've had a similar idea for indexing on particularly noisy tags.

Brian

Riley Zimmerman

Feb 17, 2017, 9:21:58 AM
to KairosDB
Hi,

I'm not sure how best to scientifically prove that TWCS is better here than STCS, due to the semi-random nature of STCS compaction over very long runs.  For example, the numbers could be very different if you end right before versus right after a huge STCS compaction, and vice versa.  But I think logically there is no reason not to go with TWCS.  Most importantly, you know what your compaction load will look like, compared to STCS.

I've been experimenting with the window sizes.  I'm mostly concerned about queries for all of the data for a single metric, since if those are acceptable then queries for recent data will be too.  I had read the 30-window suggestion, along with some comments that you could perhaps have more than 30 without a major performance hit.  That's true, unless your query hits the entire timeframe and therefore all the files.  For fun I tried 365 windows at scale, which did NOT perform well in query response time.  Not a big surprise, because queries were getting data from all 365 SSTables, but good to confirm.  Dropping down to 52 windows cut the response times in half.  I'm still measuring how even fewer windows compare to that; I'll let you know.  The tradeoff of larger windows should simply be the extra cost of the larger end-of-window compaction.
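A back-of-the-envelope sketch of that tradeoff (an assumption for illustration: one fully compacted SSTable per closed window):

```python
import math

def sstables_touched(retention_days: float, window_days: float) -> int:
    """Approximate SSTable count a full-retention query must read,
    assuming one fully compacted SSTable per closed window."""
    return math.ceil(retention_days / window_days)

print(sstables_touched(365, 1))  # 365 files with daily windows
print(sstables_touched(365, 7))  # 53 files with weekly windows
```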

Also, I still need to confirm exactly what happens with TWCS if you have different TTLs in your data, since roll-ups would be mixed into it.