Help for data modelling to support heavy data

Jubin Juneja

May 13, 2017, 3:02:44 AM
to DataStax Java Driver for Apache Cassandra User Mailing List

Hi,

 

We want to use Apache Cassandra to store large volumes of real-time sensor data. We have developed an IoT platform capable of handling 1 million events per second, and we want to persist those events in Cassandra.

 

Our table looks like :

 

Table: Sensor_data_by_date

Column              Type        Key
------------------  ----------  -----------------
Realm               text        K
Bucket              int         K
dateTimeReceived    timestamp   Clustering column
sensor_id           text
Message_id          text
Sensor_name         text

 

Query we are interested in is :

 

Give me all sensor data for “realm-a” within a dateTime range, say “5th May” to “12th May”, ordered by “dateTimeReceived”.

 

Solution :

Since our platform can handle up to 1 million events per second, even if I include DATE + HOUR in the partition key, a partition would still exceed the maximum size recommended by Cassandra. So we decided to use bucket together with realm as the partition key.

 

Problem :

Now, when we query a wide date range such as the one mentioned (5th May to 13th May), we have multiple buckets to look up. We also need to support ordering.

With this in place, I need to use an “IN clause” for the buckets, say:

             ……….. where realm='realm-a' and bucket in (1,2,3,4) and dateTimeReceived>… and dateTimeReceived <… order by dateTimeReceived

Cassandra rejects this, complaining that an IN clause and ORDER BY cannot be used together with pagination.

I need pagination as well.

 

Can you please help me achieve this?

 

Help will be much appreciated.

 

Regards

Kevin Gallardo

May 15, 2017, 11:25:13 AM
to java-dri...@lists.datastax.com
Hi,

I am no data modelling expert, but:

1) Keeping "realm" and "bucket" in the partition key sounds like a good idea.

2) You may want to use the "WITH CLUSTERING ORDER BY" option when defining the table. It tells Cassandra to store the rows of each partition on disk in ascending or descending order of "dateTimeReceived". Because the data is already ordered on disk, you no longer need "ORDER BY" in your query; it returns the rows in the order you want. There's an example over here.
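For illustration only, here is a minimal sketch of what such a table definition could look like. The column names and types are taken from the table in the question; the lowercase table name, the DESC ordering, the bucket value, and the year in the date literals are assumptions made for the example.

```cql
-- Sketch only, not a vetted schema: columns come from the question,
-- the DESC ordering is an assumption (use ASC for oldest-first).
CREATE TABLE sensor_data_by_date (
    realm            text,
    bucket           int,
    dateTimeReceived timestamp,
    sensor_id        text,
    message_id       text,
    sensor_name      text,
    PRIMARY KEY ((realm, bucket), dateTimeReceived)  -- (realm, bucket) is the partition key
) WITH CLUSTERING ORDER BY (dateTimeReceived DESC);

-- Query against a single bucket with no ORDER BY: rows come back in the
-- clustering order defined above. Bucket 1 and the 2017 dates are assumed values.
SELECT * FROM sensor_data_by_date
WHERE realm = 'realm-a'
  AND bucket = 1
  AND dateTimeReceived >= '2017-05-05'
  AND dateTimeReceived <  '2017-05-13';
```

Note that this ordering applies within one partition. When several buckets are queried, each bucket's rows are ordered among themselves, not across buckets.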

Please note, to discuss with more data modelling experts you may want to redirect your question to the Apache Cassandra mailing list, or the DataStax Academy Slack channels, as this is the mailing list for the DataStax Java driver.

Hope that helps!

--
Kévin Gallardo.
Software Developer in Drivers and Tools Team,
DataStax.
