I guess an ideal batch-mutation interface would also be a bounded
queue of mutations, for safety and convenience.
Something like:
endless = ({'thecol': str(i)} for i in itertools.count())
batch = cf.batch_mutate(buf=100)
for col in endless:
    batch.insert(col)
    # Every 100 iterations this will automatically cause a round trip
batch.do()  # Explicit
As a context manager it could look like:
with cf.batch_mutate(buf=100) as batch:
    for col in endless:
        batch.insert(col)
# Implicit `do`
I think this pattern of "buffered/chunked" traversal and mutation
provides good, safe defaults. There's no risk a new user will ruin
their life (drama much?) by using giant amounts of memory on either
or both sides of the transport, and the interface affords tunable
optimizations (adjusting or disabling buffering, etc).
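To make the idea concrete, here's a minimal sketch of what such a
buffered batch object could look like. Everything here is hypothetical:
`BatchMutator` and its `send` callable are stand-ins, not anything in
the current client.

```python
class BatchMutator:
    """Sketch of a bounded mutation queue. `send` is a stand-in for
    whatever actually performs the round trip to the server."""

    def __init__(self, send, buf=100):
        self.send = send    # callable that ships a list of mutations
        self.buf = buf      # flush threshold
        self.queue = []

    def insert(self, col):
        self.queue.append(col)
        if len(self.queue) >= self.buf:
            self.do()       # automatic flush every `buf` inserts

    def do(self):
        """Explicit flush: send whatever is queued, then reset."""
        if self.queue:
            self.send(self.queue)
            self.queue = []

    # Context-manager support gives the implicit `do` on exit
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:  # only flush on a clean exit
            self.do()
        return False
```

With `buf=2`, inserting five columns would result in three round
trips: two full buffers and one partial flush on exit.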
I think slicing should work the same way. Iterating over a lot of, or
all, columns for a key should be as pythonic and "natural" as
possible.
Just some thoughts...
/d
* No removal
* Using a dict as "rows" means all mutations for a CF key must be
constructed up front; "streaming" mutations will be clunky.
* batch_mutate allows mutations across multiple CFs in a single
round trip. This is less important in the common case, but worth
pondering.
It'd also be more DRY to encapsulate mutations (and, in batch mode,
their state) in a composable object, especially if we later add
retry-on-error strategies (à la Hector) and such.
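For illustration, such a composable object might carry its payload
together with a retry policy, something like the hypothetical class
below (the `Mutation` name, `send` callable, and parameters are all
assumptions, not the client's actual API):

```python
import time

class Mutation:
    """Hypothetical composable mutation: bundles the target CF, key,
    and columns with a simple retry-on-error strategy, so batch state
    lives in one object instead of a dict built up front."""

    def __init__(self, cf, key, columns, retries=3, backoff=0.1):
        self.cf, self.key, self.columns = cf, key, columns
        self.retries = retries
        self.backoff = backoff

    def apply(self, send):
        """Run `send(cf, key, columns)`, retrying transient failures
        with linear backoff before giving up."""
        for attempt in range(self.retries):
            try:
                return send(self.cf, self.key, self.columns)
            except Exception:
                if attempt == self.retries - 1:
                    raise
                time.sleep(self.backoff * (attempt + 1))
```

The point is composability: a batch becomes a list of such objects,
and the retry policy travels with each mutation rather than living in
the caller's loop.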
All said though, I'm still in favor of a simple "batch_insert" method.
The innards we can change later.
/d
https://gist.github.com/5a69f7c5a1f1a0e25830
... Working code in a bit.
/d
I'm kind of leaning towards the way David implemented it. My
reasoning here is that batch_mutate() (the Thrift method) is kind of
deceptive. In the case where rows are on different nodes, it's more
of a convenience than an efficiency (or at least it's not the
efficiency that people tend to expect).
Rows are really the discrete unit for writes, and I fear that what you
have (and what Daniel talks about elsewhere in the thread) would give
people a false impression of how things work, and might encourage bad
practices.
> But I think batch_insert() already handles almost all of the use cases
> where someone would want to support batch operations. It's also
> simpler.
Yeah, true, and simple is Good.
--
Eric Evans
john.er...@gmail.com