On supporting batch_mutate()

Jonathan Hseu

Aug 18, 2010, 4:23:42 AM

to pycassa-devel

So I recently pulled David King's ( http://github.com/ketralnis )
batch_insert() changes. I'm wondering whether you guys think it's the
right approach for supporting batch_mutate.

Originally, I had planned to have something like this:
batch = cf.batch()
batch.insert(...)
batch.insert(...)
batch.remove(...)
...
batch.send()

But I think batch_insert() already handles almost all of the use cases
where someone would want to support batch operations. It's also
simpler.

What do you guys think?

Jonathan Hseu

Daniel Lundin

Aug 18, 2010, 5:16:45 AM

to pycass...@googlegroups.com

It looks good enough for the common case to me.

I guess an ideal batch mutation interface would also be a bounded
queue of mutations, for safety and convenience.

Something like:

import itertools

# Endless stream of single-column dicts
endless = ({'thecol': str(i)} for i in itertools.count())
batch = cf.batch_mutate(buf=100)
for col in endless:
    batch.insert(col)
    # Every 100 iterations this will automatically cause a round trip
batch.do()  # Explicit

As a context manager it could look like:

with cf.batch_mutate(buf=100) as batch:
    for col in endless:
        batch.insert(col)
# Implicit `do` when the block exits

I think this pattern of "buffered/chunked" traversal and mutation
provides good and safe defaults. There's no risk a new user will ruin
their life (drama much?) by using giant amounts of memory on either
or both sides of the transport, and the interface affords tunable
optimizations (adjust/disable buffering etc).
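To make the bounded-queue idea concrete, here is a minimal, self-contained sketch. It is not pycassa code: the class name `BufferedBatch`, the `send` callable (standing in for one round trip to the server), and the tuple format of the queued mutations are all assumptions made for illustration.

```python
class BufferedBatch:
    """Queue mutations and flush them in chunks of at most `buf` (hypothetical)."""

    def __init__(self, send, buf=100):
        if buf < 1:
            raise ValueError("buf must be at least 1")
        self._send = send    # assumed callable: performs one round trip
        self._buf = buf
        self._pending = []

    def insert(self, key, columns):
        self._pending.append(("insert", key, columns))
        if len(self._pending) >= self._buf:
            self.do()        # buffer is full: flush automatically

    def do(self):
        """Send whatever is queued; a no-op when the queue is empty."""
        if self._pending:
            chunk, self._pending = self._pending, []
            self._send(chunk)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Flush the remainder only on a clean exit, so an exception in the
        # loop body does not push a half-built batch to the server.
        if exc_type is None:
            self.do()
        return False


# Usage: record each "round trip" in a list instead of talking to a server.
round_trips = []
with BufferedBatch(round_trips.append, buf=100) as batch:
    for i in range(250):
        batch.insert("row%d" % i, {"thecol": str(i)})
```

With 250 inserts and `buf=100`, the sketch performs two automatic flushes of 100 mutations and one implicit flush of the remaining 50 on exit, so memory use stays bounded by `buf` regardless of how long the input stream is.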

I think slicing should work the same way. Iterating over a lot of (or
all) columns for a key should be as pythonic and "natural" as
possible.

Just some thoughts...

/d

Daniel Lundin

Aug 18, 2010, 6:10:55 AM

to pycass...@googlegroups.com

I guess I have a few things to point out re: this implementation:

* No removal
* Using a dict as "rows" means all mutations for a cf key must be
constructed up front. "Streaming" mutations will be clunky.
* batch_mutate allows mutations across multiple CFs in a single
round trip. This is less important in the common case, but worth
pondering.

It'd also be more DRY to encapsulate mutations (and, in batch mode,
their state) in a composable object, especially if we add
retry-on-error strategies (à la Hector) and such later on.
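As a rough illustration of what a composable mutation object could look like, the sketch below accumulates mutations in a nested key -> column family -> list mapping, which is the general shape a multi-CF batch needs. Everything here is hypothetical: `MutationSet`, its methods, and the plain tuples used in place of real Thrift mutation structs are assumptions, not existing pycassa or Thrift API.

```python
from collections import defaultdict


class MutationSet:
    """Composable container of mutations, grouped by row key and CF (hypothetical)."""

    def __init__(self):
        # key -> column family name -> list of mutations
        self._map = defaultdict(lambda: defaultdict(list))

    def insert(self, cf, key, columns):
        self._map[key][cf].append(("insert", columns))
        return self          # returning self allows chaining

    def remove(self, cf, key, columns=None):
        self._map[key][cf].append(("remove", columns))
        return self

    def merge(self, other):
        """Fold another MutationSet into this one, preserving order per key/CF."""
        for key, by_cf in other._map.items():
            for cf, mutations in by_cf.items():
                self._map[key][cf].extend(mutations)
        return self

    def as_map(self):
        """Return a plain nested dict suitable for handing to a sender."""
        return {key: {cf: list(m) for cf, m in by_cf.items()}
                for key, by_cf in self._map.items()}


# Usage: build two independent sets, then combine them into one batch.
users = MutationSet().insert("Users", "alice", {"email": "a@example.com"})
index = (MutationSet()
         .insert("UsersByEmail", "a@example.com", {"alice": ""})
         .remove("Users", "alice", ["old_email"]))
combined = MutationSet().merge(users).merge(index).as_map()
```

Because mutations can be added one at a time and sets can be merged, nothing has to be constructed up front, removals sit alongside inserts, and a retry strategy could later resend the same object unchanged.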

All said though, I'm still in favor of a simple "batch_insert" method.
The innards we can change later.

/d

Daniel Lundin

Aug 18, 2010, 6:18:08 AM

to pycass...@googlegroups.com

A quick sketch on my API ideas:

https://gist.github.com/5a69f7c5a1f1a0e25830

... Working code in a bit.

/d

Eric Evans

Aug 18, 2010, 11:00:01 AM

to pycass...@googlegroups.com

On Wed, Aug 18, 2010 at 3:23 AM, Jonathan Hseu <vom...@vomjom.net> wrote:
> So I recently pulled David King's ( http://github.com/ketralnis )
> batch_insert() changes. I'm wondering whether you guys think it's the
> right approach for supporting batch_mutate.
>
> Originally, I had planned to have something like this:
> batch = cf.batch()
> batch.insert(...)
> batch.insert(...)
> batch.remove(...)
> ...
> batch.send()

I'm kind of leaning towards the way David implemented it. My
reasoning here is that batch_mutate() (the Thrift method) is kind of
deceptive. In the case where rows are on different nodes, it's more
of a convenience than an efficiency (or at least it's not the
efficiency that people tend to expect).

Rows are really the discrete unit for writes, and I fear that what you
have (and what Daniel talks about elsewhere in the thread) would give
people a false impression of how things work, and might encourage bad
practices.

> But I think batch_insert() already handles almost all of the use cases
> where someone would want to support batch operations. It's also
> simpler.

Yeah, true, and simple is Good.

--
Eric Evans
john.er...@gmail.com

Daniel Lundin

Aug 19, 2010, 5:38:12 PM

to pycassa-devel

I implemented a mutator-based design, and also made a method for the
`batch_insert` interface per David's design.
This will provide for both kinds of usage, with quite a lot of
flexibility.

http://github.com/dln/pycassa/tree/batch

The mutators are really useful when streaming data into a column
family, especially when you vary TTL or clocks for different columns
(my actual use case).

Please see the docs included in the patch for details:

http://github.com/dln/pycassa/commit/9698981c8bd7cb44af3fab4f9b4c9e9e9dc318dd

/d

Tyler Hobbs

Aug 25, 2010, 12:28:01 PM

to pycass...@googlegroups.com

I've had a chance to look this over a bit and would like to merge it into
vomjom/pycassa. Does anybody else want to look it over before I do this?

Also, if this is added, I think it would be a good time for a version
bump. 0.5.0 for now, or should we jump to pycassa 0.7a?

- Tyler
