Hello everyone. After some data went missing, I discovered in my client logs that some of my queries (batches of prepared insert statements) were timing out. Looking at the server logs, I noticed complaints about the batches being too big (sometimes > 1MB), and after some further reading on the internet I learned that "I'm using batches all wrong"™.
I've now set out to fix all of the above, and it looks like in order to use batches properly you essentially need to group inserts that share the same partition key. Unfortunately my schema doesn't really let me do that, as it looks like this:
CREATE TABLE dev.data (
identifier text,
time timestamp,
value text,
PRIMARY KEY (identifier, time)
) WITH CLUSTERING ORDER BY (time ASC);
In my case, each insert statement in the batch is for a unique identifier but the same time, so the batch looks like this:
insert into dev.data (identifier, time, value) values ('id1', t1, 'foo');
insert into dev.data (identifier, time, value) values ('id2', t1, 'bar');
insert into dev.data (identifier, time, value) values ('id3', t1, 'bob');
etc...
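To make that concrete, the pattern on the client side is roughly the following (a simplified sketch, written against the DataStax Java driver 3.x for illustration; the identifiers and values are made up):

import com.datastax.driver.core.*;

import java.util.Arrays;
import java.util.Date;
import java.util.List;

public class BatchedInserts {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try (Session session = cluster.connect()) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO dev.data (identifier, time, value) VALUES (?, ?, ?)");

            // Every identifier is a different partition key.
            List<String> ids = Arrays.asList("id1", "id2", "id3");
            Date t1 = new Date();

            BatchStatement batch = new BatchStatement(); // defaults to LOGGED
            for (String id : ids) {
                batch.add(insert.bind(id, t1, "value-" + id));
            }
            // One round trip, but the batch spans many partitions, so the
            // coordinator has to fan the writes out across the cluster.
            session.execute(batch);
        } finally {
            cluster.close();
        }
    }
}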
Had I known about this earlier, I could probably have split identifier into two parts, where one part varies only a little between identifiers, and used that part as the partition key, but it's a bit too late for that now. My current idea is that if I hash the identifiers locally and mod the result by the number of nodes in the cluster, I can still batch the inserts in a meaningful way. That feels like a hack, though, so I wanted to get some more opinions; any help is appreciated. It feels like I need some batching to save myself the round-trip times. Obviously I will also make sure batches don't go over a certain size.
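For reference, this is the sort of grouping I was imagining, except leaning on the driver's token metadata instead of hashing and modding by node count myself (an untested sketch, again against the DataStax Java driver 3.x; the batch size cap is an arbitrary number I picked):

import com.datastax.driver.core.*;

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupedInserts {
    // Keep each batch comfortably under the server's size thresholds.
    private static final int MAX_STATEMENTS_PER_BATCH = 20; // arbitrary cap

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try (Session session = cluster.connect()) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO dev.data (identifier, time, value) VALUES (?, ?, ?)");

            List<String> ids = Arrays.asList("id1", "id2", "id3");
            Date t1 = new Date();

            // Group statements by the replica set that owns each partition key,
            // so a batch only contains writes headed to the same nodes anyway.
            Metadata meta = cluster.getMetadata();
            Map<Set<Host>, BatchStatement> byReplicas = new HashMap<>();
            for (String id : ids) {
                // Serialize the partition key the way the driver would;
                // text encoding does not depend on the protocol version.
                ByteBuffer pk = TypeCodec.varchar()
                        .serialize(id, ProtocolVersion.NEWEST_SUPPORTED);
                Set<Host> replicas = meta.getReplicas("dev", pk);
                BatchStatement batch = byReplicas.computeIfAbsent(
                        replicas, r -> new BatchStatement(BatchStatement.Type.UNLOGGED));
                batch.add(insert.bind(id, t1, "value-" + id));
                if (batch.size() >= MAX_STATEMENTS_PER_BATCH) {
                    session.execute(batch);   // flush a full batch early
                    byReplicas.remove(replicas);
                }
            }
            for (BatchStatement batch : byReplicas.values()) {
                session.execute(batch);       // flush the remainders
            }
        } finally {
            cluster.close();
        }
    }
}

The replica-set grouping is just my guess at what "meaningful" batching would look like here; I haven't measured whether it actually beats sending the inserts individually.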
Thanks!
Kostas