SurveyMonkey (my employer) has recently adopted Cassandra for our data analytics engine, and we've found that Python serialization performance limits us from doing anything larger than about 100,000 rows with 5 or so columns. We've contracted Mark Florisson (core contributor to Cython and Numba) to develop a Cython deserialization mechanism that should improve performance tremendously. Although we are sponsoring the work, we would like to develop it publicly with the community's input so that we handle as many use cases as possible.
There are a few existing attempts at this that have already shown huge improvements:

Implementation in C:
https://github.com/roncohen/python-driver/tree/rons-test

Incomplete implementation in Cython:
https://github.com/ostefano/python-driver/tree/fastuuid

But each of those has issues, such as forcing the use of a single deserialization mechanism and not being Python 3 compatible. So what I am proposing is development toward pluggable deserializers, so that different implementations can be used depending on needs and platform. The API we have in mind would look like this:
cluster = Cluster(['127.0.0.1'], parser=NumPyParser)
and then when you execute a query it would use the NumPyParser written in Cython instead of the current pure-Python implementation.
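To make the proposal concrete, here is a minimal sketch of what a pluggable deserializer interface might look like. All names here (RowParser, PurePythonParser, parse_rows) are hypothetical illustrations, not the driver's actual API:

```python
import struct

class RowParser:
    """Hypothetical plugin interface: turn raw column bytes into values."""
    def parse_rows(self, raw_rows, column_types):
        raise NotImplementedError

class PurePythonParser(RowParser):
    """Sketch of the default behavior: one Python object per cell.
    A Cython or NumPy parser would implement the same interface."""
    def parse_rows(self, raw_rows, column_types):
        return [
            tuple(self._parse_cell(cell, ctype)
                  for cell, ctype in zip(row, column_types))
            for row in raw_rows
        ]

    def _parse_cell(self, cell, ctype):
        if ctype == 'int':
            # CQL ints are 4-byte big-endian
            return struct.unpack('>i', cell)[0]
        if ctype == 'text':
            return cell.decode('utf-8')
        raise ValueError('unsupported type: %s' % ctype)

# A Cluster-like constructor would then just take the class:
# cluster = Cluster(['127.0.0.1'], parser=PurePythonParser)
```

The point of the indirection is that swapping implementations becomes a one-line change at Cluster construction time rather than a fork of the driver.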
The core components we plan to develop are:
- Cython deserializer that returns Python objects
- Cython deserializer that returns NumPy arrays
- Full Python3 support
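For the NumPy-returning deserializer, the win comes from producing one contiguous buffer per column instead of a Python object per cell. A stdlib sketch of that columnar idea, using `array` where the real implementation would use NumPy (the function name is made up for illustration):

```python
import struct
from array import array

def rows_to_columns(raw_rows):
    """Hypothetical columnar deserializer for CQL int columns: pack each
    column into a single typed buffer rather than building row tuples.
    NumPy's frombuffer could wrap the same bytes without copying."""
    ncols = len(raw_rows[0])
    columns = [array('i') for _ in range(ncols)]
    for row in raw_rows:
        for i, cell in enumerate(row):
            columns[i].append(struct.unpack('>i', cell)[0])
    return columns
```

Avoiding the per-cell Python object allocation is exactly where we expect Cython to buy the most.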
Does anyone have suggestions for other things we should tackle?
Are there any good cassandra benchmark datasets around that we should use?
Are there any attempts at improving performance that I didn't list above that we should be aware of?
We will be using yappi for all our profiling and will publish reports of our findings. We would also love any profiling data others have already collected to help us in the process, so please send it along!
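If anyone wants to gather comparable numbers before our reports land, the stdlib cProfile gives per-function stats similar to what yappi reports (the function below is a stand-in for the deserialization hot path, not the driver's actual code):

```python
import cProfile
import pstats
import struct

def deserialize_rows(n):
    """Stand-in hot loop: unpack n big-endian CQL ints."""
    packed = struct.pack('>i', 7)
    return [struct.unpack('>i', packed)[0] for _ in range(n)]

# Profile the stand-in the same way you'd profile a driver query.
profiler = cProfile.Profile()
profiler.enable()
deserialize_rows(100000)
profiler.disable()

# Top 5 functions by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(5)
```

Profiles sorted the same way (cumulative time, per function) will be the easiest for us to compare against our yappi runs.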
Thanks,
John