I'm experimenting with some of the connector's tuning parameters and I see that spark.cassandra.output.batch.grouping.key defaults to "partition".
It says in the documentation (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md) that the "partition" value means "a batch may contain only statements for rows sharing the same partition key value".
Does that mean if I batch ten thousand inserts, each with a unique partition key, each will go into its own single-row batch? Or does it mean that they'll be batched according to the Cassandra nodes their keys partition to (assuming a 50-node cluster, approximately 50 batches of 200 rows)?
thanks!
--Matt
The first behavior you described is what "partition" mode does. The second corresponds to "replica_set" mode. If you move away from the default, be sure to benchmark against a large dataset, because multikey batches can cause long-term stability issues. A powerful cluster most likely won't show any problems, but a weaker cluster with a high replication factor may see a large buildup of hints and long GC pauses when using larger multikey batches.
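For anyone following along, a minimal sketch of switching the grouping key is below. This assumes the DataStax Spark Cassandra Connector is on the classpath; the contact point and app name are placeholders, and "replica_set" is the alternative grouping mode discussed above (the default is "partition").

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes the spark-cassandra-connector package is available
// and that a Cassandra contact point is reachable at the given host.
val spark = SparkSession.builder()
  .appName("batch-grouping-demo") // hypothetical app name
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
  // Default is "partition": a batch only holds rows sharing a partition key.
  // "replica_set" groups rows whose partition keys map to the same replica set,
  // producing larger multikey batches -- benchmark before using in production.
  .config("spark.cassandra.output.batch.grouping.key", "replica_set")
  .getOrCreate()
```

As noted above, treat this as something to benchmark rather than a drop-in improvement; the multikey batches it produces can stress weaker clusters.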