Automatic batching

Andrew Swan

Jan 17, 2012, 11:42:08 PM

to Scale 7 - Libraries and systems for scalable computing

I'm new to Pelops and Cassandra in general, so please forgive me if
this question is naive. I've Googled a little and nothing relevant
popped up.

I was wondering if there are any plans to implement automatic batch
flushing in Pelops. Under the current API, you:
1. obtain a Mutator
2. register one or more operations (e.g. puts and deletions) with it
3. execute it

For large numbers of operations (e.g. during bulk loads), postponing
all operations until execute is called leads to problems such as
running out of memory. This could be solved by the Mutator being able
to automatically flush any outstanding operations upon certain
criteria being reached, for example:
* a given time period has elapsed
* a given number of operations are outstanding
* a given number of bytes are waiting to be written
These criteria would be encapsulated into, say, a FlushCriteria class
that would be passed to the factory method that creates the Mutator.
The client's workflow would then change to:
1. obtain a Mutator, passing the desired FlushCriteria
2. register one or more operations (e.g. puts and deletions) with it
3. flush it explicitly on completion (to execute any operations that
weren't automatically flushed during step 2)

The most trivial FlushCriteria would be FlushCriteria.NEVER, which
replicates the current behaviour of not sending any operations to
Cassandra until the batch has been fully loaded into memory.

This would be a relatively simple enhancement; has anything similar
already been considered?
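
For illustration, the proposal above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not Pelops code: FlushCriteria is the name suggested in the post, but its method signature, the factory methods, and the AutoFlushingBuffer class (which stands in for a Mutator, with the sink standing in for execute) are all hypothetical. The time-based criterion is only evaluated when an operation is added, so an idle buffer would not flush by itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical policy deciding when buffered operations should be flushed. */
interface FlushCriteria {
    boolean shouldFlush(int pendingOps, long pendingBytes, long millisSinceLastFlush);

    /** Mirrors the current behaviour: nothing is sent until an explicit flush. */
    FlushCriteria NEVER = (ops, bytes, millis) -> false;

    static FlushCriteria maxOperations(int limit) {
        return (ops, bytes, millis) -> ops >= limit;
    }

    static FlushCriteria maxBytes(long limit) {
        return (ops, bytes, millis) -> bytes >= limit;
    }

    static FlushCriteria maxAgeMillis(long limit) {
        return (ops, bytes, millis) -> millis >= limit;
    }
}

/** Hypothetical stand-in for a Mutator; the sink plays the role of execute(). */
final class AutoFlushingBuffer<T> {
    private final FlushCriteria criteria;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();
    private long pendingBytes;
    private long lastFlushNanos = System.nanoTime();

    AutoFlushingBuffer(FlushCriteria criteria, Consumer<List<T>> sink) {
        this.criteria = criteria;
        this.sink = sink;
    }

    /** Registers one operation and flushes if the criteria are now met. */
    void add(T operation, long sizeInBytes) {
        pending.add(operation);
        pendingBytes += sizeInBytes;
        long ageMillis = (System.nanoTime() - lastFlushNanos) / 1_000_000;
        if (criteria.shouldFlush(pending.size(), pendingBytes, ageMillis)) {
            flush();
        }
    }

    /** Sends any outstanding operations to the sink and resets the counters. */
    void flush() {
        if (!pending.isEmpty()) {
            List<T> batch = pending;
            pending = new ArrayList<>();
            pendingBytes = 0;
            sink.accept(batch);
        }
        lastFlushNanos = System.nanoTime();
    }
}
```

With FlushCriteria.maxOperations(3), adding seven operations would trigger two automatic flushes of three operations each, and the final explicit flush would send the remaining one.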

Dan Washusen

Jan 17, 2012, 11:57:28 PM

to sca...@googlegroups.com

Interesting idea, Andrew. I currently control this externally to Pelops by keeping track of the number of rows I've added and calling execute every once in a while. That's relatively easy to implement, so I wonder whether we'd really be adding much by including this in Pelops… Having said that, I'd be more than willing to look over a thoroughly tested patch and (assuming Dominic is OK with it) include it in Pelops if you're offering…
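
As a rough illustration of that external approach, the row counting could look something like the sketch below. It deliberately avoids the Pelops API: ChunkedLoader and its load method are hypothetical names, and the executeBatch callback is an assumed placeholder for whatever code registers the rows with a Mutator and calls execute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper that executes rows in fixed-size chunks. */
final class ChunkedLoader {
    /**
     * Passes rows to executeBatch in chunks of at most chunkSize rows
     * and returns the number of chunks that were sent.
     */
    static <T> int load(Iterable<T> rows, int chunkSize, Consumer<List<T>> executeBatch) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<T> chunk = new ArrayList<>(chunkSize);
        int sent = 0;
        for (T row : rows) {
            chunk.add(row);
            if (chunk.size() == chunkSize) {
                // Chunk is full: execute it and start a fresh one.
                executeBatch.accept(chunk);
                chunk = new ArrayList<>(chunkSize);
                sent++;
            }
        }
        if (!chunk.isEmpty()) {
            // Execute whatever is left over after the last full chunk.
            executeBatch.accept(chunk);
            sent++;
        }
        return sent;
    }
}
```

Only one chunk is held in memory at a time, which is the point of executing periodically rather than once at the end.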

--
Dan Washusen

Make big files fly

visit digitalpigeon.com

Alex Araujo

Jan 18, 2012, 9:12:16 AM

to sca...@googlegroups.com

If you have a batch-oriented workload, you should consider using C*'s built-in bulk loading:

http://www.datastax.com/dev/blog/bulk-loading

The RPC overhead is pretty significant when dealing with this type of workload over Thrift.

Andrew Swan

Jan 18, 2012, 5:08:47 PM

to Scale 7 - Libraries and systems for scalable computing

Thanks for the suggestion Alex, but for my use case I need an RPC-
based solution. I'll whip up something for Dan to look at.

Cheers,

Andrew

