Large amount of data, read-only, scheduled deletion of old data: is Scylla a fit for this?


Yatong Zhang

<blueflycn@gmail.com>
unread,
Dec 29, 2016, 10:32:57 PM12/29/16
to ScyllaDB users
Hi there
1. We write about 300 million rows of data every day, at roughly 160-200 bytes per row.
2. Once written, the data is never modified, only deleted.
3. The read load is relatively small, and reads are only by primary key with a filter on time (e.g., batch reads for indexing).
4. We plan to keep the data for 6 months to 1 year, and will delete data older than that range.

So my questions:
1. Is Scylla good for our use case?
2. if yes, what about the hardware requirements?

Thanks

Dor Laor

<dor@scylladb.com>
unread,
Dec 29, 2016, 11:21:24 PM12/29/16
to ScyllaDB users
On Thu, Dec 29, 2016 at 7:32 PM, Yatong Zhang <blue...@gmail.com> wrote:
Hi there
1. We write about 300 million rows of data every day, at roughly 160-200 bytes per row.
2. Once written, the data is never modified, only deleted.
3. The read load is relatively small, and reads are only by primary key with a filter on time (e.g., batch reads for indexing).
4. We plan to keep the data for 6 months to 1 year, and will delete data older than that range.

So my questions:
1. Is Scylla good for our use case?

I'm biased but it's the best.
 
2. if yes, what about the hardware requirements?


You have 300M rows * 200B * 365 days = ~22TB.
Assuming a replication factor of 3, you'll have ~66TB.
I'd use 9 good beefy nodes, each storing 1/9th of the above data.
Use SSD/NVMe drives and 10GbE networking.
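The arithmetic above can be checked with a short script (the daily row count, row size, and retention are taken from the thread; the RF=3 and 9-node split are Dor's suggestions):

```python
# Back-of-the-envelope capacity estimate from the thread.
ROWS_PER_DAY = 300_000_000   # ~300M rows/day
BYTES_PER_ROW = 200          # upper end of 160-200 bytes/row
RETENTION_DAYS = 365         # 1-year retention
REPLICATION_FACTOR = 3
NODES = 9

raw_tb = ROWS_PER_DAY * BYTES_PER_ROW * RETENTION_DAYS / 1e12
total_tb = raw_tb * REPLICATION_FACTOR
per_node_tb = total_tb / NODES

print(f"raw:        {raw_tb:.1f} TB")      # ~21.9 TB
print(f"replicated: {total_tb:.1f} TB")    # ~65.7 TB
print(f"per node:   {per_node_tb:.1f} TB") # ~7.3 TB
```

Note this counts raw row payload only; actual on-disk usage will be higher once per-cell overhead, indexes, and compaction headroom are added.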

 
Thanks

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-users+unsubscribe@googlegroups.com.
To post to this group, send email to scylladb-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/caab8ff4-d50f-46c0-b7be-3a380f13e0c6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yatong Zhang

<blueflycn@gmail.com>
unread,
Jan 7, 2017, 7:17:36 PM1/7/17
to ScyllaDB users
Hi Dor,
Thanks for the reply. We're going to prepare 10 boxes to test Scylla. Based on our use case:
1. What are the memory and CPU requirements? Are 128 GB of RAM and 8 cores (16 hyper-threads) sufficient?
2. What compaction strategy should we use: date-tiered, or disabling compaction altogether? Since our data is read-only, is disabling compaction better?

Thank you very much



Dor Laor

<dor@scylladb.com>
unread,
Jan 9, 2017, 1:05:51 AM1/9/17
to ScyllaDB users
On Sat, Jan 7, 2017 at 4:17 PM, Yatong Zhang <blue...@gmail.com> wrote:
Hi Dor,
Thanks for the reply. We're going to prepare 10 boxes to test Scylla. Based on our use case:
1. What are the memory and CPU requirements? Are 128 GB of RAM and 8 cores (16 hyper-threads) sufficient?

It depends. What will the read access pattern be? Will the active working set be small enough to fit in RAM?
Do you need it to fit in RAM? Consider two distinct cases: in one, fitting in RAM is a must in order
to provide < 1ms latency. In the other, the working set is so big that it won't fit in RAM anyway, and
you may get away with even less than 128G.
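To get a feel for Dor's point, a quick check of how small the hot fraction would have to be to fit entirely in RAM (assumed figures: 128 GB RAM per node, and the ~7.3 TB per node implied by the 66 TB / 9 node estimate earlier in the thread):

```python
# What fraction of each node's data could be served entirely from RAM?
RAM_GB_PER_NODE = 128
DATA_TB_PER_NODE = 7.3  # ~66 TB replicated / 9 nodes

hot_fraction = RAM_GB_PER_NODE / (DATA_TB_PER_NODE * 1000)
print(f"{hot_fraction:.1%}")  # ~1.8%
```

So only if reads concentrate on under ~2% of the data can the working set be served fully from memory; otherwise reads will hit disk and SSD/NVMe latency dominates.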
 
2. What compaction strategy should we use: date-tiered, or disabling compaction altogether? Since our data is read-only, is disabling compaction better?


Our usual recommendation is STCS; I'd start with it. If you mostly read, there's no need to disable compaction (which is also needed for repairs).
 

ddorian43

<dorian.hoxha@gmail.com>
unread,
Jan 9, 2017, 5:00:45 AM1/9/17
to ScyllaDB users
If you always "filter by time" (i.e., time is part of the primary key), it would be better to put the date in the table name. That way you get faster reads (less data, since the date isn't duplicated in each row) and faster deletes (dropping a whole table is fast).
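The table-per-time-bucket idea above can be sketched as follows (a hypothetical naming scheme with one table per month, e.g. "events_2017_01"; the table name and bucket granularity are illustrative, not from the thread):

```python
from datetime import date

def table_for(day: date) -> str:
    """Name of the month-bucketed table a given day's rows go into."""
    return f"events_{day.year:04d}_{day.month:02d}"

def tables_to_drop(existing: list[str], today: date, keep_months: int) -> list[str]:
    """Month-bucketed tables that fall outside the retention window."""
    # Build the set of month buckets to keep, counting back from today.
    year, month = today.year, today.month
    keep = set()
    for _ in range(keep_months):
        keep.add(f"events_{year:04d}_{month:02d}")
        month -= 1
        if month == 0:
            month, year = 12, year - 1
    return sorted(t for t in existing if t not in keep)
```

A scheduled job would then issue one DROP TABLE per expired bucket instead of per-row deletes, which also avoids accumulating tombstones.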