Hi,
Following are the details of my setup, with the related clarifications at the bottom.
Cluster and Topology :
C* Cluster Size : 4 Nodes (single keyspace, single table, RF = 3, WCL = QUORUM)
C* Node details : Each of the 4 nodes is an Ubuntu *VM* - with 10 vCPUs and 24GB RAM + 20GB disk
Cluster Topology : 2 C* Ubuntu VMs run on each of 2 baremetal host servers (config of each baremetal server : 32 cores + 120GB RAM + 1 TB *HDD* (yes, we are trying to get SSDs but that will take some time))
We distributed 2 VMs on each of the 2 baremetals mainly in the hope that separate disk controllers would be better than cramming all VMs into a single host
Clients are run from 3 nodes (in addition to the C* nodes) in the same network
Cassandra Version Details :
server version is 3.7.0
datastax driver version 3.1.0
JDK Environment on server and client:
Oracle JDK 1.8
Server side config changes :
cassandra.yaml is left with default values. Before I start optimizing the server, I wanted to have all the client-side aspects addressed
Datastax Driver Connection details are as follows (a code sketch of this wiring follows the list):
datastax.contact.points = all 4 node IPs have been added. I understand this is not the right thing to do as we scale
datastax.max.reqs = 4096
datastax.min.conn = 8
datastax.max.conn = 16
read request timeout is 5 mins
TokenAwarePolicy used : Yes
Retry Policy for errors : No
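For reference, a minimal sketch of how the above settings map onto the 3.1.0 driver API (the IPs and keyspace name here are placeholders, not our real ones):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SocketOptions;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    class ClusterFactory {
        static Session connect() {
            PoolingOptions pooling = new PoolingOptions()
                    .setConnectionsPerHost(HostDistance.LOCAL, 8, 16)        // datastax.min.conn / datastax.max.conn
                    .setMaxRequestsPerConnection(HostDistance.LOCAL, 4096);  // datastax.max.reqs

            Cluster cluster = Cluster.builder()
                    .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4") // placeholder IPs
                    .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .withPoolingOptions(pooling)
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(300000)) // 5 minutes
                    .build();
            return cluster.connect("my_keyspace"); // placeholder keyspace
        }
    }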
Client :
Using PreparedStatement : Yes
async execution approach : Yes
Using batching : No. This is by choice, as the use case does not permit it
Inserts are executed in parallel - i.e. we fire as many async executions as the driver can take, without any sequencing (no waiting for the response from a previous insert)
Using only a FutureCallback on the ListenableFuture and not doing any blocking get()s on the future returned by the async execution (see the sketch after this list)
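For clarity, a minimal sketch of the insert path described above (the table, columns and class name are illustrative, not our actual schema):

    import java.util.concurrent.atomic.AtomicLong;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    class AsyncInserter {
        private final Session session;
        private final PreparedStatement ps;
        final AtomicLong successCount = new AtomicLong();

        AsyncInserter(Session session) {
            this.session = session;
            this.ps = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)"); // illustrative table
        }

        // Fired without any sequencing; we never block on the returned future.
        void insert(String id, String payload) {
            ResultSetFuture future = session.executeAsync(ps.bind(id, payload));
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { successCount.incrementAndGet(); }
                public void onFailure(Throwable t) { System.err.println("Failed to insert: " + t); }
            });
        }
    }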
Test Scenarios :
Scenario 1 : Pumping 2M inserts from a single client node (a separate node connected to the cluster)
Scenario 2 : Pumping 2M inserts from each of the 3 client nodes (separate client processes from separate VMs again)
How throughput is computed :
Measuring the rate from within the onSuccess callback of the ListenableFuture returned by the async execution call (roughly as sketched below)
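A sketch (the class and counter names are ours, and the reporting interval is arbitrary):

    import java.util.concurrent.atomic.AtomicLong;

    class ThroughputMeter {
        private final AtomicLong completed = new AtomicLong();
        private final long startNanos = System.nanoTime();

        // Called from within onSuccess() of each FutureCallback.
        void onInsertComplete() {
            long done = completed.incrementAndGet();
            if (done % 100000 == 0) {
                double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
                System.out.printf("%d inserts, %.0f writes/sec%n", done, done / elapsedSec);
            }
        }
    }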
Observations :
1) For scenario 1 above, I was able to see up to 11K writes/second
2) For scenario 2 above, I am surprised to observe only up to 7-8K writes/second
3) In both scenarios, we noticed that the client connection count towards the cluster is not stable. When we monitor netstat on port 9042, we see the connection count fluctuating, and after a heavy load of inserts some connections are lost across the nodes, sometimes dropping even below the minimum connection count (8 in this case)
4) We also notice the following exception on the client side for about 4-5% of inserts, which indicates the server's inability to get a QUORUM ack for some inserts -
Failed to insert:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
at com.datastax.driver.core.Responses$Error.asException(Responses.java:122)[327:com.datastax.driver.core:3.1.0]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:500)[327:com.datastax.driver.core:3.1.0]
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1012)[327:com.datastax.driver.core:3.1.0]
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:935)[327:com.datastax.driver.core:3.1.0]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)[143:io.netty.transport:4.0.37.Final]
.......
Clarifications :
1) Setting aside the errors observed, what could be the probable cause of throughput not scaling (in fact, degrading) when we add more clients - i.e. pumping traffic from multiple nodes as in scenario 2 ?
2) Would a retry policy configuration on the client side help in addressing the above error, or is it indicative of some serious resource issue on the server nodes ? (The kind of retry configuration we have in mind is sketched after this list.)
3) Is the way in which we measure throughput from the client's perspective acceptable ?
4) Nodetool does not seem to provide any specific throughput metric, or perhaps we are missing something in properly reading the nodetool output - are there any other means of determining the server-side execution throughput ?
Are we missing something very basic on the throughput-scaling part ? The degraded writes/second is more worrisome, even if the other issues are addressable.
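For question 2 above, this is the kind of client-side change we have in mind but have not yet applied (a LoggingRetryPolicy wrapping the driver's default policy, so that retry decisions show up in the client logs; the contact point is a placeholder):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DefaultRetryPolicy;
    import com.datastax.driver.core.policies.LoggingRetryPolicy;

    class RetryPolicySketch {
        static Cluster build() {
            return Cluster.builder()
                    .addContactPoint("10.0.0.1") // placeholder
                    // Log every retry decision taken by the default policy.
                    .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
                    .build();
        }
    }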
Thanks in advance
Regards
Muthu