What is the proper and fastest way to read Cassandra data into pandas?

Csaba Ragany

Dec 21, 2016, 6:01:47 AM
to DataStax Python Driver for Apache Cassandra User Mailing List
Dear Community,

Now I use the following code but it's very slow:

import pandas as pd

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT, auth_provider=auth_provider)

session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

df = pd.DataFrame()

for row in session.execute(sql_query):
    df = df.append(pd.DataFrame(row, index=[0]))

df = df.reset_index(drop=True).fillna(pd.np.nan)

Reading 1000 rows takes 1 minute, and I have a "bit more"... If I run the same query e.g. in DBeaver, I get the whole result set (~40k rows) within a minute.
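[Editor's note: much of that minute comes from calling `DataFrame.append` inside the loop, which copies the entire frame on every iteration (quadratic cost). Building the frame once from all rows is usually far faster. A minimal sketch with stand-in data, since no live cluster is available here:

```python
import pandas as pd

# Stand-in for the rows a dict_factory query would yield; in the code above
# this would be list(session.execute(sql_query)).
rows = [
    {"datetime": 1, "value": 10.0},
    {"datetime": 2, "value": None},
]

# Build the DataFrame once instead of appending row by row.
df = pd.DataFrame(rows)
```

The column names here are made up; the point is the single-construction pattern.]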


If I use the official faster deserialization method:


from cassandra.protocol import NumpyProtocolHandler, LazyProtocolHandler
from cassandra.query import tuple_factory

session.client_protocol_handler = LazyProtocolHandler
session.row_factory = tuple_factory
session.client_protocol_handler = NumpyProtocolHandler

then I get the following error message:


NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: my_host>: AttributeError("'NoneType' object has no attribute 'encode_message'",), <Host: my_host>: AttributeError("'NoneType' object has no attribute 'encode_message'",), <Host: my_host>: AttributeError("'NoneType' object has no attribute 'encode_message'",)})
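[Editor's note: that `'NoneType' object has no attribute 'encode_message'` usually means `NumpyProtocolHandler` imported as `None` — the driver only defines it when its optional Cython/numpy extensions were compiled at install time. A quick sanity check (wrapped so it also runs where the driver isn't installed):

```python
try:
    from cassandra.protocol import NumpyProtocolHandler
    # None here means cassandra-driver was installed without its compiled
    # (Cython + numpy) extensions; assigning it to
    # session.client_protocol_handler then fails exactly as above.
    numpy_handler_available = NumpyProtocolHandler is not None
except ImportError:
    numpy_handler_available = False

print(numpy_handler_available)
```

If it prints `False`, reinstalling the driver with the extensions built should make the numpy handler usable.]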


What am I doing wrong?

Thank you!!!


Michael Weber

Dec 21, 2016, 6:39:36 AM
to DataStax Python Driver for Apache Cassandra User Mailing List
Hi,

try to define your own pandas row factory:


    def pandas_factory(colnames, rows):
        return pd.DataFrame(rows, columns=colnames)

    session.row_factory = pandas_factory
    session.default_fetch_size = None

    query = "SELECT ..."
    rslt = session.execute(query)
    df = rslt._current_rows

That's the way I do it, and it should be faster...

If you find a faster method, I'm interested in it :)
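[Editor's note: the factory itself can be exercised without a live cluster — the driver simply calls it with the column names and the raw rows of one result page. A plain-data simulation, with made-up column names:

```python
import pandas as pd

def pandas_factory(colnames, rows):
    # The driver passes a list of column names plus the decoded row tuples
    # of one page; returning a DataFrame makes the whole page one frame.
    return pd.DataFrame(rows, columns=colnames)

# Simulate one page of results.
df = pandas_factory(["datetime", "value"], [(1, 10.0), (2, 20.5)])
```

Note that `default_fetch_size = None` disables driver paging, so `_current_rows` holds the entire result in one request.]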

Michael

Csaba Ragany

Dec 21, 2016, 10:11:21 AM
to DataStax Python Driver for Apache Cassandra User Mailing List
Thank you!

It almost works perfectly and it's very fast.
The only problem is that I can `SELECT` all my rows except one. I don't know why.

I run the `SELECT` queries in iterations. In the current iteration I have exactly 18815 rows, so I want to read all of them (in the next iteration I can have more or fewer rows). If I run the `SELECT` with `LIMIT 100`, `LIMIT 10000` or `LIMIT 18814`, then everything works perfectly. I do get a warning message, but I also get the records from Cassandra. The warning message is:
WARNING:cassandra.protocol:Server warning: Read 18814 live rows and 99999 tombstone cells for query SELECT * FROM my_table WHERE datetime <= my_ts_uuid LIMIT 18814 (see tombstone_warn_threshold)

But if I have no `LIMIT` or I have `LIMIT 18815`, then I get the following warning and error message:
WARNING:cassandra.protocol:Server warning: Read 18815 live rows and 100001 tombstone cells for query SELECT * FROM my_table WHERE datetime <= my_ts_uuid LIMIT 18816 (see tombstone_warn_threshold)

ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'failures': 1, 'required_responses': 1, 'consistency': 'LOCAL_ONE', 'received_responses': 0}

It's very crazy... I don't really know what `tombstone`s are in Cassandra or why their limit (100000) is reached. If I use one of the default `row_factory`s, e.g. `session.row_factory = dict_factory`, then I get no error messages (running exactly the same query)...
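[Editor's note: tombstones are the markers Cassandra writes for deleted cells; a single read that scans more than `tombstone_failure_threshold` of them (100000 by default) is aborted server-side, which is what the `ReadFailure` above reports. Setting `default_fetch_size = None` disables driver paging, so the whole result becomes one such read; with `dict_factory` the driver pages by default and each request likely stays under the threshold. A sketch that keeps paging on and still returns one DataFrame, assuming a connected `session` and using `fetch_next_page`/`_current_rows` as in this thread:

```python
import pandas as pd

def pandas_factory(colnames, rows):
    return pd.DataFrame(rows, columns=colnames)

def read_paged(session, query, fetch_size=5000):
    # Keep server-side paging enabled so no single read has to scan the
    # whole tombstone run; 5000 is the driver's default page size.
    session.row_factory = pandas_factory
    session.default_fetch_size = fetch_size
    result = session.execute(query)
    frames = [result._current_rows]
    while result.has_more_pages:
        result.fetch_next_page()  # synchronously fetch the next page
        frames.append(result._current_rows)
    return pd.concat(frames, ignore_index=True)
```

This is a sketch, not a tested recipe; the real fix for the warning is reducing the tombstones themselves (compaction, `gc_grace_seconds`, or avoiding wide deletes).]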