Single multithreaded Java client with DataStax Java Driver for Apache Cassandra not utilizing system resources.


Aidan

unread,
Apr 25, 2016, 11:38:22 AM
to DataStax Java Driver for Apache Cassandra User Mailing List

Hi,

 

I’d appreciate any guidance on the optimal setup for a multi-threaded, high-throughput, low-latency Java client using the DataStax Java Driver for Apache Cassandra. I appreciate that ‘roll-your-own’ benchmarking is not recommended, but this task also serves as a proof of concept for a real-world application that needs to achieve high TPS.

 

Setup:

-------

Client side: Java 8 client with a configurable number of executor threads (facilitated by the LMAX Disruptor), cassandra-driver-core-3.0.0.jar, running on Red Hat 6.3 on a 24-core machine (DL360)

Server side: 3-node Cassandra cluster (apache-cassandra-2.2.4 on Red Hat 6 with Java 8), replication factor = 3, running on Red Hat 6.3 on 24-core machines (DL360s)

 

 

Testing

--------

 

With CL=LOCAL_QUORUM, tests have been in the region of 3.5K INSERTs and 6.5K READs per second against a relatively simple schema, with latencies of circa 6 and 2 milliseconds respectively and CPU usage of circa 20% across the box.

 

However, the problem I cannot solve is that when I create multiple separate instances of my load client application, I achieve significantly higher TPS summed across the instances, and greater CPU usage. This suggests that my Java client application is neither IO- nor CPU-bound, nor is the server-side Cassandra cluster the bottleneck. Likewise, when I stub out the Cassandra call I achieve much higher TPS, which gives me confidence that the application is not suffering from any internal contention.

 

So my question is: is this a common problem, i.e. that a single Java client using the DataStax Java Driver for Apache Cassandra is somehow limited in its throughput? And assuming not, can anyone point me in the right direction to investigate?

 

I have tested multiple sequences (READs and WRITEs), and also both execute and executeAsync, with a variable number of concurrent threads. As you’d expect I see higher numbers with executeAsync, but still the same limitation within my app.
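
Not part of the original post: a minimal sketch of the two call styles being compared, using the driver 3.0 API. The session, prepared statement and bind values are assumptions standing in for the real test harness.

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    // Blocking: the calling thread waits for the coordinator's response,
    // so throughput per thread is bounded by round-trip latency.
    static void writeBlocking(Session session, PreparedStatement insert, String id) {
        ResultSet rs = session.execute(insert.bind(id));
    }

    // Non-blocking: the request is written to the connection and the callback
    // fires when the response arrives, so one thread can keep many requests in flight.
    static void writeAsync(Session session, PreparedStatement insert, String id) {
        ResultSetFuture future = session.executeAsync(insert.bind(id));
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet result) { /* record success / latency */ }
            public void onFailure(Throwable t)      { /* record failure */ }
        });
    }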

 

I have tested with multiple connection pooling settings, and have tried creating/building one Cluster instance per client application as well as multiple Cluster instances per application, varying the CoreConnections, maxRequestsPerConnection and newConnectionThreshold values, but thus far with no success.

 

My current best results were with 50 executor threads, 5 instances; MaxRequestsPerConnection(L) = 1024; NewConnectionThreshold(L) = 800; CoreConnectionsPerHost(L) = 20.

 

This yielded ~4K TPS but only using 18% of the CPU, and when I start a separate application instance I achieve 7.5K TPS across both, using 30% CPU, but I cannot achieve this 7.5K within the same JVM.

 

Thanks

Aidan

Kevin Gallardo

unread,
Apr 25, 2016, 12:47:54 PM
to java-dri...@lists.datastax.com
Hi Aidan,

Thanks for providing that information.

Is this a common problem, i.e. that a single Java client using the DataStax Java Driver for Apache Cassandra is somehow limited in its throughput? And assuming not, can anyone point me in the right direction to investigate?

I do not see in this description a clear anti-pattern that could lead to bottlenecks when using the driver. According to our internal load tests, you should be able to get a rather higher throughput against a 3-node C* 2.2 cluster. To investigate a bit more, would you be able to monitor the number of in-flight queries that you observe for each node when you run your tests? That would help us better determine whether the bottleneck is server-side or client-side.
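
Not part of the original reply: one way to sample this on the client side is the driver's Session.State API; a minimal sketch, assuming a shared Session named session.

    import com.datastax.driver.core.Host;
    import com.datastax.driver.core.Session;

    // Periodically print, per connected host, the number of open connections
    // and the number of in-flight (pending) requests.
    static void printPoolState(Session session) {
        Session.State state = session.getState();
        for (Host host : state.getConnectedHosts()) {
            System.out.printf("Host=%s connections=%d inFlight=%d%n",
                    host.getAddress(),
                    state.getOpenConnections(host),
                    state.getInFlightQueries(host));
        }
    }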

have tried creating/building one Cluster instance per client application as well as multiple Cluster instances per application,

You should also be aware that a best practice with the Java driver is to create only one Cluster and one Session instance, shared by all your application threads. The Session object should be able to distribute queries to all the nodes in your DC efficiently, and having multiple Sessions/Clusters can actually cause more problems than it solves. The default settings for the connection pools are also supposed to give optimal throughput against a cluster.
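
Not part of the original reply: a minimal sketch of that pattern, assuming the contact points from the test setup and an illustrative keyspace name; the Cluster and Session are built once and the same Session is shared by every worker thread.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CassandraConnector {

        // One Cluster and one Session for the whole JVM; Session is thread-safe
        // and multiplexes requests over its per-host connection pools.
        private static final Cluster CLUSTER = Cluster.builder()
                .addContactPoints("10.20.53.48", "10.20.53.183", "10.20.53.184")
                .build();
        private static final Session SESSION = CLUSTER.connect("test_keyspace"); // illustrative keyspace

        public static Session session() {
            return SESSION;
        }
    }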

Would you also be able to run a cassandra-stress load test from your application server, to see whether you observe better throughput with that tool? This would also help us narrow down the cause of the low throughput.

CoreConnectionsPerHost(L) = 20

This seems a little high; it sounds strange that you would have to set this that high. How many connections do you see opened for each host during the tests? Did you find that having 20 connections per host improved the throughput significantly?

Cheers.





--
Kévin Gallardo.
Software Engineer in Drivers and Tools Team, at DataStax.

Aidan

unread,
Apr 26, 2016, 12:39:03 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
Thank you for your reply Kevin - your help is greatly appreciated.

Firstly, I may have been under-reporting my TPS, as I was quoting a figure for adding an entity (let's say a test_user) which comprised 6 CQL operations, so my earlier figures can be multiplied by 6. Even with this I still have the same issue: I am not scaling out within my application. I have reverted to one Cluster instance per app as suggested, and defaulted back to 1 connection, as 20 didn't improve things significantly.

Below is a set of data per number of threads, for both execute and executeAsync. I have simplified the test sequence so that each Txn in the TPS figures equates to exactly ONE Cassandra INSERT operation, and I have made the table trivial (previously it had a blob field). I hope the data is useful.

Thanks in advance for any insight you can give here.
Aidan


TPS and approx. Latency Data
(each case: INSERT to 1 table, just 5 cols)

Threads      execASYNC             exec (blocking)
1            65K TPS / 13ms        too few threads
2            64K TPS / 9ms         -
5            66K TPS / 6ms         7K TPS / 70ms
10           56K TPS / 11ms        11K TPS / 44ms
20           62K TPS / 6ms         19K TPS / 22ms
50           65K TPS / 8ms         35K TPS / 11ms
100          55K TPS / 1ms         47K TPS / 10ms
250          -                     27K TPS / 8ms


So here we see that execASYNC maxes out at about 65K, and execute maxes out at about 47K CQL operations per second.
If I understand the stress tool output correctly (below), these figures fall short of the stress tool's, so I am keen to find out why.
It is very possible the issue is with my setup or my app, but your Cassandra expertise is very useful.

Note: I ran the stress tool on one of the cluster nodes, while my Java app was on its own machine on the same subnet.

In-Flight Query Data

 

execASYNC (50 threads)

Load per node appears to jump around – which is probably as we’d expect with async requests.

1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=0, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=1024, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=0, max load=1024

--

1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=1023, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=40, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=41, max load=1024

--

1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=1024, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=0, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=0, max load=1024

--

 

Exec blocking (100 threads) 

More stable and evenly distributed.


1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=58, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=22, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=20, max load=1024

--

1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=36, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=27, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=34, max load=1024

--

1 cluster per-app: Host=/10.20.53.48:9042 connections=1, current load=34, max load=1024

1 cluster per-app: Host=/10.20.53.183:9042 connections=1, current load=35, max load=1024

1 cluster per-app: Host=/10.20.53.184:9042 connections=1, current load=31, max load=1024

 

Output from stress tool 

 

# ./cassandra-stress write n=1000000  -node 10.20.53.48

INFO  16:18:26 Did not find Netty's native epoll transport in the classpath, defaulting to NIO.

INFO  16:18:27 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)

INFO  16:18:27 New Cassandra host /10.20.53.48:9042 added

INFO  16:18:27 New Cassandra host /10.20.53.183:9042 added

INFO  16:18:27 New Cassandra host /10.20.53.184:9042 added

Connected to cluster: TestClusteraidan224

Datatacenter: datacenter1; Host: /10.20.53.48; Rack: rack1

Datatacenter: datacenter1; Host: /10.20.53.183; Rack: rack1

Datatacenter: datacenter1; Host: /10.20.53.184; Rack: rack1

Created keyspaces. Sleeping 1s for propagation.

Sleeping 2s...

Warming up WRITE with 50000 iterations...

Failed to connect over JMX; not collecting these stats

Running WRITE with 200 threads for 1000000 iteration

Failed to connect over JMX; not collecting these stats

type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb

total,        105837,  105816,  105816,  105816,     2.0,     1.6,     3.4,     7.0,    66.0,   106.5,    1.0,  0.00000,      0,      0,       0,       0,       0,       0

total,        260081,  138394,  138394,  138394,     1.4,     1.3,     2.4,     3.8,    48.8,    58.5,    2.1,  0.09433,      0,      0,       0,       0,       0,       0

total,        404236,  119122,  119122,  119122,     1.7,     1.1,     2.0,     2.9,   161.5,   165.2,    3.3,  0.06809,      0,      0,       0,       0,       0,       0

total,        564912,  150632,  150632,  150632,     1.3,     1.1,     2.0,     3.1,   106.6,   111.7,    4.4,  0.07768,      0,      0,       0,       0,       0,       0

total,        735872,  165532,  165532,  165532,     1.1,     1.1,     2.0,     2.5,     4.9,    52.5,    5.4,  0.06977,      0,      0,       0,       0,       0,       0

total,        918927,  175529,  175529,  175529,     1.2,     1.1,     1.9,     2.6,    51.8,    56.3,    6.5,  0.06550,      0,      0,       0,       0,       0,       0

total,       1000000,  114363,  114363,  114363,     1.7,     1.1,     2.0,     2.8,   117.2,   118.4,    7.2,  0.06581,      0,      0,       0,       0,       0,       0

 

 

Results:

op rate                   : 139351 [WRITE:139351]

partition rate            : 139351 [WRITE:139351]

row rate                  : 139351 [WRITE:139351]

latency mean              : 1.4 [WRITE:1.4]

latency median            : 1.1 [WRITE:1.1]

latency 95th percentile   : 2.2 [WRITE:2.2]

latency 99th percentile   : 3.2 [WRITE:3.2]

latency 99.9th percentile : 68.3 [WRITE:68.3]

latency max               : 165.2 [WRITE:165.2]

Total partitions          : 1000000 [WRITE:1000000]

Total errors              : 0 [WRITE:0]

total gc count            : 0

total gc mb               : 0

total gc time (s)         : 0

avg gc time(ms)           : NaN

stdev gc time(ms)         : 0

Total operation time      : 00:00:07

END

 

 

Kevin Gallardo

unread,
Apr 27, 2016, 11:52:48 AM
to java-dri...@lists.datastax.com
Good information, thank you.

My interpretation of this is: from the number of in-flight queries reported here for the executeAsync() test run, it looks like one of the Cassandra nodes is from time to time slower to respond. That can happen considering we're in a distributed system, there's a lot of synchronisation going on because of the replication factor = 3, etc. On the driver side this fills the connection to that node up to the maximum number of in-flight requests ("current load=1024, max load=1024"), and in that case the driver - even if you use executeAsync() - will by default block because it can't send any more requests to that node. This behaviour is explained here.

There are a few ways to work around that.

First, you could set the poolTimeout to a low value (or 0), so that whenever all connections to a host are full, the driver automatically sends the request to the next host instead of waiting.
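
Not part of the original reply: a minimal sketch of that setting with driver 3.x's PoolingOptions; the contact point is taken from the earlier output.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PoolingOptions;

    // poolTimeoutMillis = 0 makes the driver fail over to the next host in the
    // query plan immediately when a host's connection pool is exhausted,
    // instead of waiting for an in-flight slot to free up.
    PoolingOptions poolingOptions = new PoolingOptions()
            .setPoolTimeoutMillis(0);

    Cluster cluster = Cluster.builder()
            .addContactPoint("10.20.53.48")
            .withPoolingOptions(poolingOptions)
            .build();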

A second option would be to use the LatencyAware load balancing policy, which detects when a node is slower to respond and automatically gives that host less priority in subsequent query plans.
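
Not part of the original reply: a minimal sketch of wrapping a DC-aware child policy in LatencyAwarePolicy; the child policy and contact point are assumptions.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.LatencyAwarePolicy;

    // LatencyAwarePolicy tracks per-host latencies and de-prioritises hosts that
    // are currently responding significantly slower than the best-performing host.
    Cluster cluster = Cluster.builder()
            .addContactPoint("10.20.53.48")
            .withLoadBalancingPolicy(
                    LatencyAwarePolicy.builder(DCAwareRoundRobinPolicy.builder().build())
                            .build())
            .build();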

Also, it seems like right now you have coreConnections = maxConnections = 1, is that correct? If so, you may want to try core=1 and max=4 or 8. This way, when a host becomes slower to respond and the driver fills the initial connection with pending requests, if the number of pending requests goes above the newConnectionThreshold the driver will open a new connection and can continue to "queue" further requests to that node on the newly opened connections.
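
Not part of the original reply: a sketch of those pool settings (core=1, max=8) combined with the thresholds mentioned earlier in the thread; the exact values are illustrative.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    // Start with 1 connection per host and let the driver grow the pool to 8
    // connections once the number of in-flight requests on a host exceeds
    // newConnectionThreshold.
    PoolingOptions poolingOptions = new PoolingOptions()
            .setCoreConnectionsPerHost(HostDistance.LOCAL, 1)
            .setMaxConnectionsPerHost(HostDistance.LOCAL, 8)
            .setNewConnectionThreshold(HostDistance.LOCAL, 800)
            .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);

    Cluster cluster = Cluster.builder()
            .addContactPoint("10.20.53.48")
            .withPoolingOptions(poolingOptions)
            .build();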

Indeed, with the cassandra-stress test I wanted to make sure the problem wasn't server-side, thanks for that. We can see that cassandra-stress achieves better throughput than your test, but let's not forget that by default the replication factor with cassandra-stress is 1, which implies less coordination server-side. However, it might be worth finding out which kinds of queries are causing some nodes to be slower to respond. For that, on the driver side, you can use the slow query logger. Using it, you would be able to check which queries are slower than the others.
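
Not part of the original reply: a minimal sketch of registering the slow query logger, assuming driver 3.0's QueryLogger builder API and an illustrative 50 ms threshold; slow queries are then reported through the driver's SLF4J loggers.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryLogger;

    Cluster cluster = Cluster.builder()
            .addContactPoint("10.20.53.48")
            .build();

    // Queries slower than the threshold are logged as "slow" by the QueryLogger.
    QueryLogger queryLogger = QueryLogger.builder()
            .withConstantThreshold(50)   // milliseconds, illustrative value
            .build();
    cluster.register(queryLogger);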

That's all I can think of for now. Please don't hesitate to share more information/test results once you have been able to look at the suggested solutions above.

Cheers.




Aidan McPeake

unread,
Apr 28, 2016, 6:48:11 PM
to java-dri...@lists.datastax.com
Hi Kevin, again thank you very much for this. I shall try it out as soon as I can, but in the meantime thanks for your support here. It is greatly appreciated.

 Aidan

Aidan

unread,
Apr 29, 2016, 7:35:53 AM
to DataStax Java Driver for Apache Cassandra User Mailing List
Hi Kevin,

Thanks again for this. Focusing in on the executeAsync case, I do see an improvement with maxConnections=8 and reducing the poolTimeout value to 0 or 1.

This takes me from ~65K to ~83K WRITEs with RF=3 and CL=LOCAL_QUORUM, and I see that current load never hits max load, as below.

1 cluster per-app: Host=/10.20.53.48:9042 connections=8, current load=18, max load=8192
1 cluster per-app: Host=/10.20.53.183:9042 connections=8, current load=11, max load=8192
1 cluster per-app: Host=/10.20.53.184:9042 connections=8, current load=13, max load=8192

CPU utilization server-side is now closer to 40%, and I'd hope there are further improvements I could make. I think this gives me a much better foundation to work on, and I'd like to sincerely thank you for your quick, clear and very helpful information. It's much appreciated.

Thanks again,
Aidan

