Storing messages as chunked objects in Cassandra


Michael Andrews

Aug 29, 2014, 12:27:40 PM
to elasti...@googlegroups.com
I was wondering whether there was a reason this project elected not to store messages as chunked objects in Cassandra instead of using an outside object storage provider. We have implemented chunked object storage on top of Cassandra internally for a project, and I know the Astyanax client from Netflix has built-in chunking, so it seems it's not that difficult to do. Would it be possible to request an option for blob storage to be a chunked object store directly in the Cassandra cluster? I would love to see the requirement for an external storage platform become optional and do everything from one storage mechanism.

Rustam Aliyev

Aug 29, 2014, 1:10:16 PM
to elasti...@googlegroups.com
It does. You can specify database_blob_max_size, which sets the threshold below which objects are chunked and stored in C*. Larger objects are stored on the blob store. The current limit is 128K, but it wouldn't be hard to fix that and support larger blobs. See https://github.com/elasticinbox/elasticinbox/wiki/Blob-Storage#hybrid-storage
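For reference, the setting would look something like this in the configuration file (a hedged sketch: only database_blob_max_size is confirmed above; the value shown is just the 128K ceiling expressed in bytes):

```yaml
# Messages up to this size (bytes) are chunked and stored in Cassandra;
# larger messages fall through to the external blob store.
# 128K (131072) is the current ceiling mentioned in this thread.
database_blob_max_size: 131072
```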


Michael Andrews

Aug 29, 2014, 1:44:35 PM
to elasti...@googlegroups.com
Interesting, so in theory I could set an infinitely large blob size for hybrid storage and have all my data in Cassandra? Is there some kind of performance issue with doing this? In our testing we could stream chunks in parallel from Cassandra to compose the blob before sending it out to our client, which was really fast.

Rustam Aliyev

Aug 29, 2014, 1:56:26 PM
to elasti...@googlegroups.com
Just to reiterate, right now the limit is 128K, so you can't set it above that or use it to store everything in C*. This is because chunking isn't fully implemented internally: the actual chunk (block) size is 128K, so it just stores the first chunk. We chose a 128K chunk because we found that ~98-99% of the emails we had were below 128K (after compression).
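To illustrate, the chunking described above amounts to something like this (a sketch only; the function name is made up, not from the EI codebase):

```python
CHUNK_SIZE = 128 * 1024  # 128K block size, as described above

def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split a blob into fixed-size chunks.

    Since ~98-99% of (compressed) emails are below 128K, the vast
    majority of messages end up as a single chunk.
    """
    return [payload[i:i + chunk_size]
            for i in range(0, len(payload), chunk_size)]

# A typical compressed email under 128K is a single chunk:
assert len(split_into_chunks(b"x" * 100_000)) == 1
# A 1 MB message would need 8 chunks:
assert len(split_into_chunks(b"x" * 1024 * 1024)) == 8
```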

In theory, we could do the same as you described, reading multiple chunks from different partitions asynchronously. However, I think that would put memory pressure on EI. For EI's use case, I think streaming chunks sequentially would be the better option.

Michael Andrews

Aug 29, 2014, 2:23:41 PM
to elasti...@googlegroups.com
That sounds reasonable. We also saw memory pressure when retrieving large files (we tested up to 1 GB). Our chunk size was ~100 KB, and we ended up grabbing only smaller subsets (~2 MB) of chunks from the entire file and streaming those portions to the client while queuing up the rest. That saved memory, and performance was still pretty zippy.
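A minimal sketch of that windowed approach (illustrative only; fetch_chunk stands in for the actual per-chunk Cassandra read, and the names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 100 * 1024   # ~100 KB chunks, as in our setup
WINDOW_CHUNKS = 20        # ~2 MB prefetch window

def fetch_chunk(blob_id, index):
    """Placeholder for reading one chunk partition from Cassandra."""
    raise NotImplementedError

def stream_blob(blob_id, total_chunks, fetch=fetch_chunk):
    """Yield chunks in order, fetching one ~2 MB window in parallel
    at a time, so memory stays bounded by the window size rather than
    the full blob."""
    with ThreadPoolExecutor(max_workers=WINDOW_CHUNKS) as pool:
        for start in range(0, total_chunks, WINDOW_CHUNKS):
            window = range(start, min(start + WINDOW_CHUNKS, total_chunks))
            # Parallel reads within the window; map() preserves order.
            for chunk in pool.map(lambda i: fetch(blob_id, i), window):
                yield chunk
```

The idea is that the client receives chunks in order while the next window is being assembled, so throughput stays close to the fully-parallel case without holding a 1 GB blob in memory.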

Any thoughts on when you might have a C* backed blob storage for all files?