Performance of Spark and Cassandra is bad


Lu Niu

Oct 29, 2015, 3:26:45 PM
to DataStax Spark Connector for Apache Cassandra
Hi, I built a test cluster of 5 servers, each with 24 cores, 128 GB of memory, and 16 TB of disk. Spark and Cassandra are deployed on every server. Then I ran a simple load test that loads data from HDFS and dumps it into the Cassandra cluster. The schema is trivial: just two Long columns. It took more than 12 hours to process about 5 TB of data, and the performance on the Cassandra side looks pretty bad.
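The table is something like this (a reconstruction from the keyspace, table, and column names used in the loading code later in the thread; the choice of primary key is an assumption):

  CREATE TABLE test.ids (
    test_id bigint PRIMARY KEY,  -- assumed partition key
    id bigint
  );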

nodetool cfstats
Keyspace: test
Read Count: 284531
Read Latency: 0.03688115881924993 ms.
Write Count: 1196785102
Write Latency: 0.019032578134482826 ms.
Pending Flushes: 1
Table: ids
SSTable count: 93
Space used (live): 58463321906
Space used (total): 58463321906
Space used by snapshots (total): 105961066960
Off heap memory used (total): 1923037648
SSTable Compression Ratio: 0.347303758301237
Number of keys (estimate): 1105882554
Memtable cell count: 231770
Memtable data size: 5099996
Memtable off heap memory used: 0
Memtable switch count: 745
Local read count: 0
Local read latency: NaN ms
Local write count: 1127927764
Local write latency: 0.021 ms
Pending flushes: 1
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1730434848
Bloom filter off heap memory used: 1730434104
Index summary off heap memory used: 180376240
Compression metadata off heap memory used: 12227304
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 86
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0


nodetool tpstats
Pool Name              Active  Pending   Completed  Blocked  All time blocked
ReadStage                   0        0      296721        0                 0
MutationStage               1        0  1252088926        0                 0
CounterMutationStage        0        0           0        0                 0
GossipStage                 0        0      241701        0                 0
RequestResponseStage        0        0         552        0                 0
AntiEntropyStage            0        0           0        0                 0
MigrationStage              0        0           0        0                 0
MiscStage                   0        0           0        0                 0
InternalResponseStage       0        0         112        0                 0
ReadRepairStage             0        0           0        0                 0

The MutationStage count is 1252088926! That means 1252088926 tasks arrived faster than they could be processed, right?

Why is the write latency this high, about 0.02 ms?! According to this post, http://stackoverflow.com/questions/8401226/low-write-performance-of-cassandra, even a Windows box with 4 GB of RAM can achieve 0.003 ms write latency, which is over 6x better than my 5-machine cluster, and that poster was still complaining about the performance.

Does anyone have a clue what's going on here, or any ideas for tuning the performance of this cluster? Thank you very much!

Best,
Lu

Russell Spitzer

Oct 29, 2015, 3:42:47 PM
to DataStax Spark Connector for Apache Cassandra
1) The MutationStage shows 1.2 billion *completed* operations; this has nothing to do with task arrival.
2) The latency is high because the Spark connector batches writes together (by primary key) while inserting, which means you aren't inserting a single record at a time.
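If you want to see or adjust how that batching happens, the connector exposes write settings through SparkConf. A minimal sketch (the keys are from the connector docs; the values shown are illustrative assumptions, not tuning advice):

  import org.apache.spark.SparkConf

  // Connector write-path knobs; values here are illustrative only.
  val conf = new SparkConf()
    .set("spark.cassandra.output.batch.grouping.key", "partition") // group rows into batches by partition key
    .set("spark.cassandra.output.batch.size.rows", "auto")         // "auto" sizes batches by bytes instead of a fixed row count
    .set("spark.cassandra.output.concurrent.writes", "5")          // max batches in flight per task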

Perhaps you could give more information about your table schema and loading code. 


Lu Niu

Oct 29, 2015, 5:11:14 PM
to spark-conn...@lists.datastax.com, rus...@datastax.com
Thank you for the reply!

Even though the writes come in batches, the latency should not be that high, because it's an average, right? Suppose no writes come in during the first 5 s and 1000 writes come in during the second 5 s; then the average is 1000 writes / 10 s = 100 writes per second. Right? It shouldn't matter how the inserts come in.


The code is very simple:

  import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
  import org.apache.hadoop.io.NullWritable
  import com.datastax.spark.connector._

  // Read Avro records from HDFS: the key is the wrapped record,
  // the value is NullWritable.
  val loads = sc.hadoopFile(inputPath,
    classOf[AvroInputFormat[TestData]],
    classOf[AvroWrapper[TestData]],
    classOf[NullWritable])

  // Pull the id out of each record; both columns get the same value.
  val testDataRDD = loads.map { case (wrapper, _) =>
    val record = wrapper.datum()
    (record.getId, record.getId)
  }

  testDataRDD.saveToCassandra("test", "ids", SomeColumns("test_id", "id"))

BTW, is there a way to check how many writes happened in the last minute, or the writes per second?

Thank you!

Best,
Lu

Russell Spitzer

Oct 29, 2015, 5:15:41 PM
to Lu Niu, spark-conn...@lists.datastax.com
Latency measures the response time of a query. Since the inserts are performed as batches (an actual C* construct, not just a timing artifact), it is basically measuring the latency of inserting all of the rows in the batch. It has nothing to do with the order of arrival in this case.
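Concretely, what the connector sends is roughly equivalent to a CQL batch like this (values illustrative; grouping is by partition key by default):

  BEGIN UNLOGGED BATCH
    INSERT INTO test.ids (test_id, id) VALUES (1, 1);
    INSERT INTO test.ids (test_id, id) VALUES (2, 2);
  APPLY BATCH;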

You should be able to see the rate of data in the Spark driver UI; it should list both the number of bytes serialized and the number of rows.
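If you'd rather measure it from the Cassandra side, one rough approach is to sample the table's write count twice with nodetool and divide by the interval. A sketch, run on one of the cluster nodes (the awk field position assumes the cfstats output format you pasted above, and the result is node-local writes only):

  before=$(nodetool cfstats test.ids | awk '/Local write count/ {print $4}')
  sleep 60
  after=$(nodetool cfstats test.ids | awk '/Local write count/ {print $4}')
  echo "writes/sec on this node over the last minute: $(( (after - before) / 60 ))"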



Russell Spitzer

Oct 29, 2015, 5:19:14 PM
to Lu Niu, spark-conn...@lists.datastax.com
You are also writing at approximately 115.7 MB/s to the cluster (5 TB / 12 hours ≈ 5×10^12 bytes / 43200 s). I would agree that with 5 machines this is most likely lower than I would expect, unless there is a network bottleneck. I would expect you could do on the order of about 80K CQL rows per second per machine (with RF = 1), so maybe 400K rows per second give or take?


Harshit Mathur

Oct 30, 2015, 7:28:09 AM
to spark-conn...@lists.datastax.com
Which Cassandra version are you using?

I was also seeing slow inserts with version 2.1.8, but after upgrading to 2.1.9 the issue was resolved: I was able to insert more than 5M records (about 1.2 KB each) in 5 minutes, i.e. roughly 6 GB at about 20 MB/s and ~16.7K rows/s, on a single 30 GB RAM machine, with Spark running on a separate single-node 30 GB RAM machine.

Regards,
Harshit

Lu Niu

Oct 30, 2015, 12:25:03 PM
to spark-conn...@lists.datastax.com
Hi, Harshit

Is your disk an SSD?

Best,
Lu