Hello Dan,
Thank you for your reply, and sorry for my late response.
Let me fill in a gap first: for Cassandra uploads we use a self-written tool, with an implementation similar to this
This solution works too well and too fast: so fast that the SSD drives of the Cassandra nodes get overloaded and read performance suffers (I should note that we don't run compaction at all while uploading data). To solve the problem we limited the data transfer rate; at 150 Mbps the situation is stable and we see no degradation of read performance. Of course, the upload time is significant, but that is a trade-off we accept.
Also, to be honest, we are not looking for a solution to migrate our data from Cassandra to Scylla; instead, we would like to upload data from HDFS directly to Scylla (so we have no plans to keep both Cassandra and Scylla).
Let's dive into our trial-and-error experience of uploading data to Scylla.
We use the spark-cassandra-connector with default settings (so maybe you have some tips here). This approach promises batched uploads, but looking at the Scylla metrics (see the attached screenshot) it seems there are no bulk operations at all, or very few.
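For reference, here is a minimal sketch of the write settings we could override instead of the defaults. The option names come from the DataStax spark-cassandra-connector documentation; the values are only illustrative starting points, not tested recommendations (and `throughput_mb_per_sec` was renamed to `throughputMBPerSec` in connector 3.x, so the exact name depends on the version):

```python
# Sketch: spark-cassandra-connector write tuning (illustrative values only).
# Option names follow the connector's reference configuration; check the
# version you run, since some names changed between 2.x and 3.x.
write_options = {
    "spark.cassandra.output.batch.size.rows": "auto",          # rows per unlogged batch
    "spark.cassandra.output.batch.grouping.key": "partition",  # batch only same-partition rows
    "spark.cassandra.output.concurrent.writes": "5",           # in-flight batches per task
    "spark.cassandra.output.throughput_mb_per_sec": "20",      # per-core write throttle
}

# Usage (requires pyspark and the connector on the classpath; `df`,
# `my_keyspace`, and `my_table` are placeholders):
# (df.write
#    .format("org.apache.spark.sql.cassandra")
#    .options(keyspace="my_keyspace", table="my_table", **write_options)
#    .mode("append")
#    .save())
```

Grouping batches by partition key is what should turn row-by-row writes into the bulk operations we were expecting to see in the metrics, and the throughput throttle would replace the manual rate limiting we do today.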
Either way, it works (and it's a bit faster than our current uploads to Cassandra, which is already a win), but we see that it impacts read performance, so we have to limit the upload speed here as well. We still hope to find a way to upload our data faster, or with less impact on read performance, or ideally both.
Regarding your suggestion to stop compaction: we would consider it if it helped, but according to the documentation it looks like a troubleshooting measure that should not be used on a regular basis. Can you shed more light on this? Our tables are immutable, so it should not affect our data, but is there a risk of accumulating many small SSTables, leading to read performance degradation? It is also unclear whether this operation stops compaction permanently or whether compaction will run again during the next upload. Is there a way to increase the memtable size used for flushing, so that larger files are produced?
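To make sure we are talking about the same thing, this is what we assume the compaction toggle would look like; a sketch using the standard nodetool commands (which Scylla's nodetool also provides), with `my_keyspace` and `my_table` as placeholders:

```shell
# Disable automatic (minor) compaction for one table before the bulk upload;
# already-written SSTables stay as they are.
nodetool disableautocompaction my_keyspace my_table

# ... run the bulk upload ...

# Re-enable automatic compaction afterwards; accumulated SSTables are then
# compacted on the usual schedule.
nodetool enableautocompaction my_keyspace my_table
```

If that is what you had in mind, the open question for us is what happens to reads in the window between the upload and the re-enabled compaction catching up.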
About our data and how we work with it.
Usually we have around 20 "unique" tables; each unique table has a "duplicate" holding the result of the previous upload (40 tables in total), so there is a chance to roll back to the old data if newly arrived data has problems. When a new upload is initiated we create a new table for each upload, so when all uploads have finished there are 60 tables, and a "garbage collection" step (which starts after uploading) deletes the oldest 20 tables. On average each table has 100 columns; the most common type is double, but a few columns contain strings. As the partition key for all tables we use a hashed string to distribute our data evenly.
I've added my colleague Nikolay in CC.
With kind regards,
Denis