feature request to help figure out the bad row of a batch; question about batches


Karl Lehenbauer

Apr 13, 2015, 9:48:28 AM4/13/15
to cpp-dri...@lists.datastax.com
I'm making batches containing a lot of upserts, and I've generated something that breaks the batch. Like anyone in this circumstance, I'd like to know which statement broke it.

The error I get back is: Invalid query: Missing mandatory PRIMARY KEY part fp

OK, fine, I understand that, and there's a bug in my code upstream because it's supposed to prevent that.

It would be really helpful, though, if there were a way to obtain the statement that caused the batch to fail.

I'm invoking the batch asynchronously and keeping the batch object around until the callback tells me whether it succeeded. Presumably I could retry the batch; of course a retry won't help if there's an error in the batch itself, but for something like "no hosts available" it could be retried successfully.
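For context, here's a minimal sketch of the pattern I'm using, assuming the driver's C API (exact signatures vary a bit between driver versions) and a made-up table with columns fp and value:

#include <stdio.h>
#include <cassandra.h>

/* Called when the batch future completes (success or failure). */
static void on_batch_done(CassFuture* future, void* data) {
  CassBatch* batch = (CassBatch*)data;
  if (cass_future_error_code(future) != CASS_OK) {
    const char* message;
    size_t message_length;
    cass_future_error_message(future, &message, &message_length);
    fprintf(stderr, "Batch failed: %.*s\n", (int)message_length, message);
    /* The batch is still alive here, so a transient failure such as
       "no hosts available" could be retried before freeing it. */
  }
  cass_batch_free(batch);
}

static void submit_batch(CassSession* session) {
  CassBatch* batch = cass_batch_new(CASS_BATCH_TYPE_UNLOGGED);

  /* One of many upserts added to the batch; the table is hypothetical. */
  CassStatement* statement = cass_statement_new(
      "INSERT INTO mykeyspace.mytable (fp, value) VALUES (?, ?)", 2);
  cass_statement_bind_string(statement, 0, "some-fp");
  cass_statement_bind_int64(statement, 1, 42);
  cass_batch_add_statement(batch, statement);
  cass_statement_free(statement); /* the batch keeps its own reference */

  CassFuture* future = cass_session_execute_batch(session, batch);
  cass_future_set_callback(future, on_batch_done, batch);
  cass_future_free(future); /* the callback still fires */
}

Passing the batch pointer as the callback's user data is what lets me hang onto it for a possible retry.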

Anyway, keeping the batch around after an error isn't that useful as things stand, but methods to introspect into the batch, such as obtaining its statements by numeric index, plus an error call that gave you the item number within the batch, would help. In general, more introspection of some kind would be nice.

Finally, I've read in the docs that you're kind of thinking wrong if you're using batches to speed up inserts, but by batching inserts a thousand or more rows at a time you can get far more rows inserted per unit time from a single data source, which in my case is an API presenting historical and realtime rows of data. Can someone explain why this approach is frowned upon?

Michael Penick

Apr 14, 2015, 6:44:40 PM4/14/15
to cpp-dri...@lists.datastax.com
Agreed. That would be extremely useful. Unfortunately, that information isn't returned in the error (https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v2.spec#L915) in any version of the native protocol. I think that would be a good addition. I do see how introspection into the batch (and its statements) might let your program identify the invalid query, or at least make debugging easier. I've created a JIRA issue to address that: https://datastax-oss.atlassian.net/browse/CPP-255. Feedback welcome.

Regarding the performance of batches I would say, "try it". Logged batches are almost certainly going to perform worse because of the extra writes involved, but you might get some performance benefit out of unlogged batches, especially if all the writes go to the same partition. However, if all the writes are destined for the same host anyway, you'll get the benefit of the automatic write batching done in the driver by issuing a group of concurrent asynchronous requests instead: the driver groups requests destined for the same connection and writes them in a single write() call. Here's an example using asynchronous request batches: https://github.com/datastax/cpp-driver/blob/1.0/examples/perf/perf.c
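To make that concrete, here's a rough sketch of the concurrent asynchronous request pattern (the table and column names are placeholders, and perf.c additionally caps the number of requests in flight rather than firing them all at once):

#include <stdio.h>
#include <stdlib.h>
#include <cassandra.h>

/* Issue many single-row INSERTs concurrently instead of one large batch.
   Requests headed for the same connection are coalesced by the driver
   into a single write(), which is the automatic write batching above. */
static void insert_rows_concurrently(CassSession* session, size_t num_rows) {
  const char* query =
      "INSERT INTO mykeyspace.mytable (fp, value) VALUES (?, ?)";
  CassFuture** futures = (CassFuture**)malloc(num_rows * sizeof(CassFuture*));

  for (size_t i = 0; i < num_rows; ++i) {
    CassStatement* statement = cass_statement_new(query, 2);
    cass_statement_bind_string(statement, 0, "some-fp");
    cass_statement_bind_int64(statement, 1, (cass_int64_t)i);
    futures[i] = cass_session_execute(session, statement);
    cass_statement_free(statement);
  }

  /* Wait for everything in flight; a real application would also bound
     the number of outstanding requests, as perf.c does. */
  for (size_t i = 0; i < num_rows; ++i) {
    CassError rc = cass_future_error_code(futures[i]); /* blocks until done */
    if (rc != CASS_OK) {
      fprintf(stderr, "Insert %zu failed: %s\n", i, cass_error_desc(rc));
    }
    cass_future_free(futures[i]);
  }
  free(futures);
}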

Mike


Karl Lehenbauer

Apr 15, 2015, 11:35:21 AM4/15/15
to cpp-dri...@lists.datastax.com
On Tuesday, April 14, 2015 at 5:44:40 PM UTC-5, Michael Penick wrote:
> Agreed. That would be extremely useful. .... I think that would be a good addition. I do see how introspection into the batch (and its statements) might let your program identify the invalid query, or at least make debugging easier. I've created a JIRA issue to address that: https://datastax-oss.atlassian.net/browse/CPP-255. Feedback welcome.

That's fantastic, Michael. I'm glad you agree it'll be valuable. Looks good to me.

> Regarding the performance of batches I would say, "try it". Logged batches are almost certainly going to perform worse because of the extra writes involved, but you might get some performance benefit out of unlogged batches, especially if all the writes go to the same partition. However, if all the writes are destined for the same host anyway, you'll get the benefit of the automatic write batching done in the driver by issuing a group of concurrent asynchronous requests instead: the driver groups requests destined for the same connection and writes them in a single write() call. Here's an example using asynchronous request batches: https://github.com/datastax/cpp-driver/blob/1.0/examples/perf/perf.c

Good info, thanks. Yeah, casstcl is based largely on studying the cpp-driver examples, perf.c in particular. In casstcl we support:

$::cass async -batch $batch -callback "callback $batch"

The callback is called with an additional argument, a casstcl future object, which can be inspected with various methods for status and errors and can iterate through results with a "foreach" method. We also rolled in paged (fractional) results using set_paging_size... https://github.com/flightaware/casstcl/blob/master/generic/casstcl.c
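The paging piece maps onto the driver's cass_statement_set_paging_size / cass_statement_set_paging_state calls; roughly like this (a sketch of the underlying C API, not casstcl's actual code):

#include <cassandra.h>

/* Fetch a result set a page at a time instead of all at once. */
static void fetch_all_pages(CassSession* session, const char* query) {
  CassStatement* statement = cass_statement_new(query, 0);
  cass_statement_set_paging_size(statement, 100); /* rows per page */

  cass_bool_t has_more = cass_true;
  while (has_more) {
    CassFuture* future = cass_session_execute(session, statement);
    const CassResult* result = cass_future_get_result(future);
    if (result == NULL) {            /* error; inspect the future */
      cass_future_free(future);
      break;
    }

    CassIterator* rows = cass_iterator_from_result(result);
    while (cass_iterator_next(rows)) {
      /* const CassRow* row = cass_iterator_get_row(rows); ... */
    }
    cass_iterator_free(rows);

    has_more = cass_result_has_more_pages(result);
    if (has_more) {
      cass_statement_set_paging_state(statement, result);
    }
    cass_result_free(result);
    cass_future_free(future);
  }
  cass_statement_free(statement);
}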