cassandra-driver vs pycassa performance


Dave Brueck

May 14, 2014, 12:58:25 PM
to python-dr...@lists.datastax.com
Hi, I'm starting a new project with Cassandra and thought I'd use the recommended cassandra-driver, but the performance seemed a bit low. I then repeated the testing with pycassa and found much higher performance. I understand that the cassandra-driver docs specifically mention that some work is still underway (e.g. "C extension for encoding/decoding messages" is on the todo list), so my question is mostly to see whether the results I'm seeing are expected or whether I'm just doing something wrong.

For a single-connection, single-threaded test that just queries an object by its primary key over and over, I get:

== pycassa 1.11.0 get test (20000 gets) ==
5179 reqs per second
== cassandra-driver 1.1.2 get test (20000 gets) ==
767 reqs per second

To be clear, I saw similar performance differences when trying various levels of concurrency (though the gap was not as wide - maybe 2x-3x in favor of pycassa). But since the gap also exists in the single-worker model, I thought I'd post that version for the sake of simplicity.

My test is as follows:

import timeit
count = 20000
key = '12341234'

import pycassa
print '== pycassa %s get test (%d gets) ==' % (pycassa.__version__, count)

from pycassa.pool import ConnectionPool
conn = ConnectionPool('t0')
cf = pycassa.ColumnFamily(conn, 'sessions1')
cf.insert(key, {'age':55})
t = timeit.timeit('cf.get(key)', 'from __main__ import cf, key', number=count)
print '%d reqs per second' % (count/t)

import cassandra
print '== cassandra-driver %s get test (%d gets) ==' % (cassandra.__version__, count)
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
session.set_keyspace('t0')
put = session.prepare('update sessions1 set age=? where user_id=?')
session.execute(put, [55, key])
get = session.prepare('select * from sessions1 where user_id=? limit 1')
t = timeit.timeit('list(session.execute(get, [key]))[0]', 'from __main__ import session, get, key', number=count)
print '%d reqs per second' % (count/t)

I'm running it on a fairly idle Ubuntu 13.04 server, and the magnitude of the performance difference is consistent across many runs. Cassandra is running as a single node, locally, with a pretty vanilla configuration (DataStax Community version 2.0.7).

I know the test is somewhat contrived, but here's some background: we're looking to replace our existing k/v store with Cassandra, and out of the gate we'll need to deploy a cluster with capacity for 300k rps or so. Before we get all the way to production our testing will move to something much more similar to our existing production traffic, but at the outset I'm just trying to evaluate feasibility and get a kind of best-case performance baseline.

Using pycassa and multiple clients I can easily get tens of thousands of requests per second from this local node, so even though more realistic usage will result in lower performance, it looks like we're probably in good shape and can likely reach our performance needs with a cluster of a few dozen nodes.

On the other hand, with cassandra-driver and multiple clients I can rarely get more than a few thousand requests per second. Even if we see no worse performance with real traffic patterns (which is pretty unlikely), I might need a hundred or more nodes in our cluster to achieve our performance needs.

So, what I'm looking for is some guidance: if I'm using the newer driver improperly or there's a better way to go about it, pointers would be greatly appreciated. Or, if this type of performance is typical for the newer driver for now, that's totally OK - I can stick with pycassa for now and then re-evaluate the newer driver as it progresses.

FWIW, nearly all our traffic is in the form of simple read/write by ID (we have some background process that will occasionally scan the data for reporting purposes, but the live traffic is all very simple k/v access) and our real world object sizes are < 10KB, with 2KB being the norm.

Thanks for any guidance anyone can provide!
-Dave

Stan Hu

May 14, 2014, 1:52:13 PM
to python-dr...@lists.datastax.com
Thanks for the benchmark. I ran your test code and found the same thing, with and without LZ4 compression turned on. I then profiled a run with cProfile and gprof2dot; the resulting call graph shows that most of the time is spent just waiting for data from the server.

I think the issue here has less to do with the DataStax Python driver and more to do with Cassandra's native protocol vs. the Thrift protocol (https://issues.apache.org/jira/browse/CASSANDRA-6235).

Switching to async requests increased performance about 2x, but it still was not nearly as fast as the Thrift-based pycassa.
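By "async requests" I mean something along these lines (a rough sketch rather than my exact test code, reusing the session, get, and key names from Dave's script; the window size of 100 is arbitrary):

def run_async(session, get, key, count, window=100):
    # Keep up to 'window' requests in flight, then block on the batch.
    pending = []
    for _ in xrange(count):
        pending.append(session.execute_async(get, [key]))
        if len(pending) >= window:
            for future in pending:
                future.result()  # blocks; raises if the query failed
            pending = []
    for future in pending:
        future.result()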

For those playing at home: the 'sessions1' column family needs to be created with the COMPACT STORAGE option so that it's also accessible to Thrift clients like pycassa.
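For example (my guess at a schema matching Dave's script - the column types are assumptions, the important part is the final clause):

session.execute('''
    CREATE TABLE sessions1 (
        user_id text PRIMARY KEY,
        age int
    ) WITH COMPACT STORAGE
''')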



Alex Popescu

May 14, 2014, 2:20:55 PM
to python-dr...@lists.datastax.com
Dave,

I'm not sure whether you've read through the performance page that Tyler put up for the python-driver docs.





--

:- a)


Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru

Dave Brueck

May 14, 2014, 2:24:41 PM
to python-dr...@lists.datastax.com
Hi Stan, thanks so much for taking the time - I really appreciate it. I'll keep an eye on that issue you cited and repeat my tests if anything changes on that front. In the meantime we'll stick with pycassa (and the Thrift protocol) since it fits our simple use cases well enough. I think it won't be too painful to migrate later on anyway.

Thanks again!
-Dave

Dave Brueck

May 14, 2014, 2:41:03 PM
to python-dr...@lists.datastax.com
Thank you, Alex. I did read through that, yes. With some of the simpler examples on that page I could close the gap somewhat, but I haven't yet come up with a good way to adapt our use case to something like callback chaining. I'll fiddle around with it some more, though.
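For anyone else following along, the callback-chaining pattern I'm referring to looks roughly like this (a sketch of the idea from the performance page, not code from our app; the chain count and total are arbitrary, and session, get, and key are from my earlier script):

from threading import Event

finished = Event()
num_chains = 100
remaining = [20000]  # queries left, shared by all chains (fine for a sketch,
                     # since callbacks run on the driver's event loop thread)

def start_chain():
    def on_done(rows):
        remaining[0] -= 1
        if remaining[0] <= 0:
            finished.set()
        else:
            session.execute_async(get, [key]).add_callbacks(on_done, on_error)
    def on_error(exc):
        print 'query failed: %s' % exc
        finished.set()
    session.execute_async(get, [key]).add_callbacks(on_done, on_error)

for _ in xrange(num_chains):
    start_chain()
finished.wait()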

-Dave

Tyler Hobbs

May 14, 2014, 4:16:33 PM
to python-dr...@lists.datastax.com
Hi Dave,

I haven't updated the Performance page yet, but you may want to check out the relatively new cassandra.concurrent module: http://datastax.github.io/python-driver/api/cassandra/concurrent.html.  That implements the callback chaining pattern for you.  Of course, that pattern still may not fit your needs.
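Usage is something like this (a sketch against the documented API, reusing the session, get, and key names from your script):

from cassandra.concurrent import execute_concurrent

statements = [(get, [key])] * 20000
results = execute_concurrent(session, statements, concurrency=100,
                             raise_on_first_error=False)
for success, result in results:
    if not success:
        print 'query failed: %s' % result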

Tyler Hobbs
DataStax

Tyler Hobbs

May 14, 2014, 4:26:51 PM
to python-dr...@lists.datastax.com
Also, I should mention that unless you're planning to deploy your application on the same machine as a Cassandra node, benchmarking a single-threaded application against localhost will give you very skewed results.  With a remote node, your throughput will tend to be limited by network latency.  As you've noticed, the python driver (currently) has higher per-operation latency than pycassa, but that small amount of latency tends to be dominated by network latency when querying a remote host.  To overcome this you need concurrency, which is where the python driver tends to beat pycassa.  (Of course, this depends on the usage pattern, but I expect the gap to widen for all usages as some improvements are made to the python driver and the native protocol itself.)
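To put rough numbers on it: with a 0.5 ms network round trip, a single synchronous client caps out around 2,000 requests per second no matter how fast the driver is, since throughput is roughly concurrency divided by per-request latency; with 50 requests in flight, the same latency allows on the order of 100,000 requests per second.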
--
Tyler Hobbs
DataStax

Dave Brueck

May 14, 2014, 5:06:13 PM
to python-dr...@lists.datastax.com
Hi Tyler, thanks for reaching out. As I mentioned in my post, I did test with concurrency, but seeing the same performance gap there, I whittled the test down to a single-connection model for the sake of posting to the newsgroup.

I've since set up a 3-node cluster and a non-local concurrency test that randomly reads and then writes, and I'm getting the performance I mentioned previously - the gap at least narrows to 2x-3x.

Take care,
-Dave
