Migration guide from Cassandra to Scylla for lambda architecture

Denis Bolshakov

<bolshakov.denis@gmail.com>
Jun 4, 2020, 3:25:03 AM
to ScyllaDB users
Hello! Do you have a migration guide from Cassandra to Scylla for a big data workload?
Currently we have a system built on a lambda architecture, so there are two main parts:
  • scheduled offline analytics and ETL processes, which are based on Hadoop
  • and operational storage, which keeps the latest snapshots of our analytics datasets
Currently our operational storage is based on Cassandra and it performs well: in burst scenarios it handles up to 1,000,000 read requests per second with reasonable latency, and it has another valuable feature - it provides a technique to upload prepared SSTables and avoid any compaction. But we are looking for alternatives, and Scylla is one of the favorites. What we don't like in Cassandra:
  • the cluster has to contain a lot of nodes, and such a bunch of servers carries extra operational costs
  • it's quite complicated to decommission a node
  • we have to scale out our cluster from time to time, and if the nodes were larger we would have to do it less frequently
So far we have not checked any available recommendations for migration and have tried to do it by ourselves, but without significant results, at least not positive ones.
So our current main challenge is around data uploading: we could not find any solution to perform a bulk upload, so we went with the most obvious approach - uploading record by record. And here we have no positive results:
  • increased data uploading time (compared to Cassandra)
  • compaction takes a lot of resources and the Scylla cluster becomes very slow; as a result, some uploads fail and have to be retried (so a significant waste of time) and, much worse, some DDL operations fail (every dataset upload creates a new table, so if Scylla is busy with a few other uploads this becomes impossible). And we have not checked read performance while uploading data yet.
So, maybe we are doing something wrong and there is a migration guide which covers the details we are missing? Or does Scylla not fit our architecture, and we should drop the idea of migrating? Is there a way to perform a bulk upload and avoid compaction?

Dan Yasny

<dyasny@scylladb.com>
Jun 4, 2020, 3:00:57 PM
to ScyllaDB users


On Thursday, June 4, 2020 at 3:25:03 AM UTC-4, Denis Bolshakov wrote:
Hello! Do you have a migration guide from Cassandra to Scylla for a big data workload?

 
Currently we have a system built on a lambda architecture, so there are two main parts:
  • scheduled offline analytics and ETL processes, which are based on Hadoop
  • and operational storage, which keeps the latest snapshots of our analytics datasets
Currently our operational storage is based on Cassandra and it performs well: in burst scenarios it handles up to 1,000,000 read requests per second with reasonable latency, and it has another valuable feature - it provides a technique to upload prepared SSTables and avoid any compaction.


What technique are you referring to, running sstableloader/bulk-loader?

 
But we are looking for alternatives, and Scylla is one of the favorites. What we don't like in Cassandra:
  • the cluster has to contain a lot of nodes, and such a bunch of servers carries extra operational costs
  • it's quite complicated to decommission a node
  • we have to scale out our cluster from time to time, and if the nodes were larger we would have to do it less frequently
So far we have not checked any available recommendations for migration and have tried to do it by ourselves, but without significant results, at least not positive ones.

It would be good to know what you tried and what didn't work for you.
 
So our current main challenge is around data uploading: we could not find any solution to perform a bulk upload, so we went with the most obvious approach - uploading record by record. And here we have no positive results:

Why not use batched uploads, optimized for parallelism? You will definitely squeeze more performance out of Scylla if you do it in batches and from multiple loaders.
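Something along these lines, roughly - just a sketch, not anything from this thread: it keeps a bounded window of asynchronous writes in flight per loader process, and the contact point, keyspace, table, columns and the window size of 128 are all invented for illustration.

```scala
import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.CqlSession

object ParallelLoader {
  def main(args: Array[String]): Unit = {
    // Hypothetical contact point and datacenter; adjust to your cluster.
    val session = CqlSession.builder()
      .addContactPoint(new InetSocketAddress("scylla-node1", 9042))
      .withLocalDatacenter("DC1")
      .build()

    // Prepared statement for a hypothetical target table.
    val insert = session.prepare(
      "INSERT INTO analytics.dataset_v42 (pk, metric_a, metric_b) VALUES (?, ?, ?)")

    // Stand-in for rows that would really come from HDFS.
    val rows = (1 to 100000).iterator.map(i => (s"key-$i", i.toDouble, i * 2.0))

    // Keep a window of async writes in flight; several such loader processes
    // can run in parallel to spread load across nodes and shards.
    rows.grouped(128).foreach { window =>
      val inFlight = window.map { case (pk, a, b) =>
        session.executeAsync(insert.bind(pk, Double.box(a), Double.box(b))).toCompletableFuture
      }
      inFlight.foreach(_.join()) // wait for this window before sending the next
    }

    session.close()
  }
}
```

"Batches" here means many small, concurrent, single-partition writes rather than large multi-partition CQL BATCH statements, which tend to hurt rather than help.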
 
  • increased data uploading time (compared to Cassandra)

How much data is it, and how long does it take with Cassandra today?
 
  • compaction takes a lot of resources and the Scylla cluster becomes very slow; as a result, some uploads fail and have to be retried (so a significant waste of time) and, much worse, some DDL operations fail (every dataset upload creates a new table, so if Scylla is busy with a few other uploads this becomes impossible). And we have not checked read performance while uploading data yet.
Running with that many tables isn't optimal either; there might be better approaches, but we'd have to take a look at your schema.
 
So, maybe we are doing something wrong and there is a migration guide which covers the details we are missing? Or does Scylla not fit our architecture, and we should drop the idea of migrating? Is there a way to perform a bulk upload and avoid compaction?

Compaction can be stopped with https://docs.scylladb.com/operating-scylla/nodetool-commands/stop/ if it's getting in the way. But again, this doesn't sound like the main issue here. 


I think I discussed something very similar with someone who might be a colleague of yours (the case sounds similar at least) on Telegram last week. I was also asking for details - data sizes, schema, examples, but never got them. 

Denis Bolshakov

<bolshakov.denis@gmail.com>
Jun 9, 2020, 12:40:04 PM
to scylladb-users@googlegroups.com, toroptsev@gmail.com
Hello Dan,

Thank you for your reply and sorry for the late reply from me.

I will try to cover the whole picture. For Cassandra uploads we use a self-written tool; it has an implementation similar to this
The main file there is SparkCassandraBulkWriter.scala; it leverages org.apache.cassandra.io.sstable.SSTableLoader.

This solution works too well and too fast: so fast that the SSD drives of the Cassandra nodes get overloaded and there are problems with read performance (I have to note that we don't run compaction at all while uploading data). To solve the problem we limited the data transfer speed; we've reached a stable situation at 150 Mbps, and with these settings we don't see any degradation of read performance. Of course, the uploading time is significant, but it's a trade-off we accept.
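For readers following along: the usual shape of such a tool is to build SSTables offline with Cassandra's CQLSSTableWriter and then stream them to the cluster with SSTableLoader (or the sstableloader CLI). A minimal sketch of the writing half, with an invented keyspace, table and columns (not the actual tool described above):

```scala
import org.apache.cassandra.io.sstable.CQLSSTableWriter

object OfflineSstableWriter {
  def main(args: Array[String]): Unit = {
    // Invented schema; the real tables have roughly 100 mostly-double columns.
    val schema =
      """CREATE TABLE analytics.dataset_v42 (
        |  pk text PRIMARY KEY,
        |  metric_a double,
        |  metric_b double
        |)""".stripMargin
    val insert =
      "INSERT INTO analytics.dataset_v42 (pk, metric_a, metric_b) VALUES (?, ?, ?)"

    // SSTables are written to a local directory and later streamed to the
    // cluster, skipping the memtable/flush path entirely.
    val writer = CQLSSTableWriter.builder()
      .inDirectory("/tmp/sstables/analytics/dataset_v42")
      .forTable(schema)
      .using(insert)
      .build()

    (1 to 1000).foreach { i =>
      writer.addRow(s"key-$i", Double.box(i.toDouble), Double.box(i * 2.0))
    }
    writer.close()
  }
}
```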

Also, to be honest, we are not looking for a solution to migrate our data from Cassandra to Scylla; instead, we would like to upload data from HDFS directly to Scylla (so we have no plans to keep both Cassandra and Scylla).

Let's dive into our trial and error experience of uploading data to Scylla.
First, we tried to use sstableloader from Scylla; our main source of knowledge was cassandra_to_scylla_migration_process.
That approach was too slow, and we decided to try the Spark connector for Cassandra.
We use the spark-connector with default settings (so maybe you have some tips here), and this approach promises some batch uploads, but looking at the Scylla metrics (see the attached screenshot) it looks like there are no bulk operations (or too few).
Anyway, it works (and it's a bit faster than our current uploads to Cassandra, which is already a win), but we see that it impacts read performance and we also have to limit the uploading speed, so we still hope to find a better way to upload our data faster or with less impact on read performance (or, better, both).
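For context, these are the write-side knobs of the spark-cassandra-connector that the defaults leave untouched and that usually matter for this kind of load. The values, keyspace and table below are illustrative only (not from this thread), and exact property names differ slightly between connector versions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object TunedConnectorWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tuned-scylla-write")
      .config("spark.cassandra.connection.host", "scylla-node1,scylla-node2")
      // Group rows of the same partition into one unlogged batch.
      .config("spark.cassandra.output.batch.grouping.key", "partition")
      .config("spark.cassandra.output.batch.size.rows", "64")
      // How many batches each executor core keeps in flight.
      .config("spark.cassandra.output.concurrent.writes", "16")
      // Per-core throttle, useful to protect read latency during uploads.
      .config("spark.cassandra.output.throughput_mb_per_sec", "20")
      .getOrCreate()

    // Stand-in DataFrame; in the real job this would be read from HDFS.
    import spark.implicits._
    val df = (1 to 100000).map(i => (s"key-$i", i.toDouble, i * 2.0))
      .toDF("pk", "metric_a", "metric_b")

    df.write
      .cassandraFormat("dataset_v42", "analytics")
      .mode("append")
      .save()
  }
}
```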

About your suggestion to stop compaction: we would consider it if it helped, but according to the documentation it looks like a troubleshooting measure and should not be used on a regular basis. Can you shed more light on this? Our tables are immutable, so it should not affect our data, but is there a risk that there will be a lot of small files, leading to performance degradation? Also, it's not clear whether this operation stops compaction forever or whether compaction will run again on the next upload. Is there a way to increase the memtable size used for flushing, so that the flushed files are bigger?

About our data and how we work with it.
Usually we have around 20 "unique" tables; these unique tables have "duplicates" which hold the results of the previous uploads (so 40 tables in total), giving us a chance to roll back to the old data if newly arrived data has problems. When a new upload is initiated we create a new table for each dataset, so when all the uploads have finished there are 60 tables, and "garbage collection" (which starts after uploading) deletes the oldest 20 tables. On average each table has 100 columns; the most common type is double, but a few of them contain strings. As the partition key for all tables we use a hashed string, to distribute our data evenly.
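To make that concrete, an invented illustration of the lifecycle described above (table name, version numbers and columns are all hypothetical):

```scala
// Every upload creates a fresh versioned table; garbage collection later
// drops the oldest copy, keeping roughly two generations per dataset.
object DatasetVersioning {
  def createStatement(version: Int): String =
    s"""CREATE TABLE analytics.dataset_v$version (
       |  pk        text PRIMARY KEY,  -- hashed string, used only to spread data evenly
       |  metric_01 double,
       |  metric_02 double,
       |  /* ... roughly 100 columns in total, mostly double, a few text ... */
       |  label     text
       |)""".stripMargin

  def dropStatement(version: Int): String =
    s"DROP TABLE analytics.dataset_v$version"
}
```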

I've added my colleague Nikolay in CC.

With kind regards,
Denis



--
//with Best Regards
--Denis Bolshakov
e-mail: bolshak...@gmail.com
image.png

Dor Laor

<dor@scylladb.com>
Jun 9, 2020, 2:50:24 PM
to ScyllaDB users, toroptsev@gmail.com
On Tue, Jun 9, 2020 at 9:40 AM Denis Bolshakov <bolshak...@gmail.com> wrote:
Hello Dan,

Thank you for your reply and sorry for the late reply from me.

I will try to cover the whole picture. For Cassandra uploads we use a self-written tool; it has an implementation similar to this
The main file there is SparkCassandraBulkWriter.scala; it leverages org.apache.cassandra.io.sstable.SSTableLoader.

This solution works too well and too fast: so fast that the SSD drives of the Cassandra nodes get overloaded and there are problems with read performance (I have to note that we don't run compaction at all while uploading data). To solve the problem we limited the data transfer speed; we've reached a stable situation at 150 Mbps, and with these settings we don't see any degradation of read performance. Of course, the uploading time is significant, but it's a trade-off we accept.

Also, to be honest, we are not looking for a solution to migrate our data from Cassandra to Scylla; instead, we would like to upload data from HDFS directly to Scylla (so we have no plans to keep both Cassandra and Scylla).

The Spark migrator project now supports Parquet -> CQL; if you keep the data in this format, you can use it as well.
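For reference, a minimal sketch of that Parquet -> CQL path using plain Spark and the connector (the scylla-migrator project wraps essentially the same flow behind a config file); the HDFS path, keyspace and table names are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object ParquetToScylla {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-scylla")
      .config("spark.cassandra.connection.host", "scylla-node1")
      .getOrCreate()

    // Read a dataset produced by the Hadoop/ETL side directly from HDFS...
    val df = spark.read.parquet("hdfs:///analytics/output/dataset_v42")

    // ...and write it to the matching, pre-created Scylla table.
    df.write
      .cassandraFormat("dataset_v42", "analytics")
      .mode("append")
      .save()
  }
}
```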
 