Improving driver deserialization performance with Cython


John Anderson

May 14, 2015, 12:51:07 PM5/14/15
to python-dr...@lists.datastax.com, Adam.H...@datastax.com, Mark Florisson
SurveyMonkey (my employer) has recently adopted Cassandra for our data analytics engine, and we've found that the Python driver's deserialization performance limits us from doing anything larger than about 100,000 rows with 5 or so columns. We've decided to contract Mark Florisson (core contributor to Cython and Numba) to develop a Cython deserialization mechanism that should improve performance tremendously. Although we are sponsoring the work, we would like to develop this publicly with the community's input so that we handle as many use cases as possible.

There are a few existing attempts at this that have already shown huge improvements:

Implementation in C:

https://github.com/roncohen/python-driver/tree/rons-test

Incomplete implementation in Cython:

https://github.com/ostefano/python-driver/tree/fastuuid

But each of those has small issues, like forcing the use of a single deserialization mechanism or not being Python 3 compatible. So what I am proposing is development towards pluggable deserializers, so that many implementations can be used based on needs and platform. The API we have in mind would look like this:

cluster = Cluster(['127.0.0.1'], parser=NumPyParser)

and then when you execute a query it would use the NumPyParser written in Cython instead of the current pure python implementation.
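To make the pluggable idea concrete, here is a minimal sketch of what such a parser interface might look like. The `Parser`/`PurePythonParser` class names and the method-per-CQL-type dispatch are purely illustrative assumptions, not the driver's actual API:

```python
# Hypothetical sketch of a pluggable deserializer interface; class and
# method names are illustrative, not part of the real driver.
import struct


class Parser:
    """Base class: dispatches a CQL type name to a parse_* method."""

    def deserialize(self, cql_type, data):
        handler = getattr(self, "parse_" + cql_type, None)
        if handler is None:
            raise NotImplementedError(cql_type)
        return handler(data)


class PurePythonParser(Parser):
    def parse_int(self, data):
        # CQL 'int' is a big-endian signed 32-bit value
        return struct.unpack(">i", data)[0]

    def parse_text(self, data):
        return data.decode("utf-8")


parser = PurePythonParser()
print(parser.deserialize("int", b"\x00\x00\x00\x2a"))  # 42
print(parser.deserialize("text", b"hello"))            # hello
```

A Cython or NumPy-backed implementation would then just be another subclass plugged in at `Cluster` construction time.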

The core components we will develop are:

- Cython deserializer that returns Python objects
- Cython deserializer that returns NumPy arrays
- Full Python3 support

Does anyone have suggestions for other things we should tackle?
Are there any good Cassandra benchmark datasets around that we should use?
Are there any attempts at improving performance that I didn't list above that we should be aware of?

We will be using yappi to do all our profiling and will publish reports of our findings but would also love to have any profiling data others have already completed to help us in the process, so please send them!

Thanks,
John

Adam Holmberg

May 14, 2015, 1:09:06 PM5/14/15
to python-dr...@lists.datastax.com, Mark Florisson
John,

Thanks for bringing these resources together.

I'm presently looking at refactoring SerDes in the driver to expose flexible type deserialization. Proposal and review will be tracked on this ticket:

Input and discussion are welcome here or in JIRA.

Adam


Mark Florisson

May 15, 2015, 11:34:04 AM5/15/15
to Adam Holmberg, python-dr...@lists.datastax.com
Hi guys,

Excited to be working on this. Adam, making type deserialization flexible would be great.

One issue I see with NumPy is that it does not support variable-length strings. However, DyND does.


How comfortable are people with dependencies? DyND itself is written in C++, though included in Anaconda by default.

Mark

John Anderson

May 15, 2015, 12:05:53 PM5/15/15
to python-dr...@lists.datastax.com, Adam Holmberg
I think we need a Cython version of the current deserializer that returns standard Python objects and then a hook/flag within that deserializer that allows us to enable the optional dependency on something like NumPy or libdynd.  There will be many people using the driver who aren't familiar with the scientific computing libraries and wouldn't be comfortable using them as the returned result. So we need to achieve a nice balance between the two types of developers.
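The hook/flag idea could work like the usual optional-dependency pattern: try to import the fast library, and fall back to plain Python objects when it isn't installed. This is only a sketch of that pattern; the `make_int_column` helper is hypothetical:

```python
# Optional-dependency fallback sketch: return NumPy arrays when NumPy
# is available, plain Python lists otherwise. Helper name is made up.
try:
    import numpy as np
    HAVE_NUMPY = True
except ImportError:
    HAVE_NUMPY = False


def make_int_column(raw_values):
    """Decode big-endian int32 cells into an array or a list."""
    ints = [int.from_bytes(v, "big", signed=True) for v in raw_values]
    if HAVE_NUMPY:
        return np.asarray(ints, dtype=np.int32)
    return ints


col = make_int_column([b"\x00\x00\x00\x01", b"\x00\x00\x00\x02"])
print(list(col))  # [1, 2]
```

Either way the caller gets something iterable, so developers unfamiliar with the scientific stack are never forced to depend on it.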

I'm not familiar with dynd-python, but from a quick look it seems to lack good documentation and is still marked as alpha on PyPI. I don't know if it would be a good choice as the primary dependency for the scientific part of the extension; it also seems to lack popularity, so I don't think many people would expect to get its arrays back, and if they did, there wouldn't be good documentation for how to use them.

Could we load it directly into a pandas dataframe which does support these custom dtypes?

- John

Mark Florisson

May 15, 2015, 12:15:34 PM5/15/15
to python-dr...@lists.datastax.com, Adam Holmberg
On 15 May 2015 at 17:05, John Anderson <son...@gmail.com> wrote:

Could we load it directly into a pandas dataframe which does support these custom dtypes?

Yes, though an issue is that Pandas stores strings as regular Python objects. Are (small) strings a common problem for performance?
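Mark's point is easy to verify: an integer column gets a native dtype, while a string column is stored as generic Python objects (dtype `object`), so string-heavy results wouldn't benefit much from the columnar layout. A quick check:

```python
# pandas gives ints a native dtype but stores strings as Python objects.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "bb", "ccc"]})
print(df["id"].dtype)    # int64
print(df["name"].dtype)  # object
```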

Adam Holmberg

May 15, 2015, 12:27:26 PM5/15/15
to Mark Florisson, python-dr...@lists.datastax.com
We're likely not going to be able to accept parts that implement specialized types. Ideally we will develop a pattern that supports these extensions, integrate a standard Python version, and have the specialized NumPy/libdynd option as a standalone project.

Adam

Adam Holmberg

Jul 8, 2015, 12:37:26 PM7/8/15
to Mark Florisson, python-dr...@lists.datastax.com
Reviving this thread after a stint getting ready for Cassandra 2.2. It also sounds like John and Mark are ready to proceed.

I have a refactored protocol/serialization design proposed here: https://datastax-oss.atlassian.net/browse/PYTHON-313
It basically provides a way to make protocol handling more composable. I believe it would be possible to integrate both solutions shown as examples above without changing the core driver. Hopefully this pattern can be used to provide optimized implementations that fall back gracefully when dependencies are unavailable.
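One way to picture the composable-handler idea (this is an illustrative sketch, not the design in PYTHON-313): a handler is little more than a mapping from protocol opcode to message class, so a subclass can override one entry, for example swapping in an optimized RESULT decoder, without touching anything else:

```python
# Illustrative sketch of composable protocol handling. The handler and
# message class names are hypothetical; only the RESULT opcode value
# (0x08 in the native protocol) is taken from the protocol spec.
RESULT_OPCODE = 0x08


class DefaultResultMessage:
    @staticmethod
    def decode(body):
        return ("default", body)


class FastResultMessage:
    @staticmethod
    def decode(body):
        return ("fast", body)


class ProtocolHandler:
    message_types = {RESULT_OPCODE: DefaultResultMessage}

    @classmethod
    def decode_message(cls, opcode, body):
        return cls.message_types[opcode].decode(body)


class FastProtocolHandler(ProtocolHandler):
    # Copy the mapping and replace only the RESULT entry.
    message_types = dict(ProtocolHandler.message_types)
    message_types[RESULT_OPCODE] = FastResultMessage


print(ProtocolHandler.decode_message(RESULT_OPCODE, b"x")[0])      # default
print(FastProtocolHandler.decode_message(RESULT_OPCODE, b"x")[0])  # fast
```

Under a scheme like this, the NumPy/libdynd variants would live in a standalone project and simply ship their own handler subclass.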

I welcome any feedback on that PR, and I'm looking forward to discussing further ideas around Mark's optimization work.

Adam


John Anderson

Jul 8, 2015, 5:09:12 PM7/8/15
to python-dr...@lists.datastax.com, Mark Florisson
PYTHON-313 looks great, I'll make my small comments directly in JIRA instead of here.

Here is a simple script I created for Mark to use for basic benchmarking:

https://gist.github.com/sontek/af82d281e618074ca8a6

This focuses specifically on int deserialization, but I will update it as we go to cover each individual type, so we can test the performance of each type separately and document the improvement for each.
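For anyone who wants a rough idea of the measurement without running the gist against a cluster, here is a minimal timing harness in the same spirit: decode a pile of big-endian int32 cells and time it. The numbers will vary by machine; this only illustrates the approach, not the gist's actual code:

```python
# Minimal deserialization micro-benchmark sketch: time the pure-Python
# decoding of N big-endian int32 values, as a CQL 'int' column would be.
import struct
import time

N = 100_000
raw = [struct.pack(">i", i) for i in range(N)]

start = time.perf_counter()
values = [struct.unpack(">i", b)[0] for b in raw]
elapsed = time.perf_counter() - start

print(f"deserialized {len(values)} ints in {elapsed:.3f}s")
```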

With the defaults on a single-node cluster (on a local box), querying 381,971 rows takes 7.3 seconds; Ron's C implementation of the deserializer does the same in 1.8 seconds.