Connecting to Cassandra?


Todd Gruben

Aug 21, 2012, 10:17:08 PM
to spark...@googlegroups.com
How would you go about connecting a Spark cluster to a Cassandra data store?

-Todd

Matei Zaharia

Aug 22, 2012, 3:19:28 PM
to spark...@googlegroups.com
Hi Todd,

You should be able to read it by creating a Hadoop JobConf object configured for reading from Cassandra. I haven't used Cassandra yet, but you can look at the word count example that ships with it to see how they read from it in MapReduce. Then create the same JobConf object in Scala and use the SparkContext.hadoopRDD method, which takes the JobConf object, key class, value class, etc., and reads the data.

I'll look into this when I have a moment to install Cassandra, but if you want to see this method used elsewhere, here's some code we used to read from Hypertable:

import org.hypertable.hadoop.mapred.TextTableInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

// Point the JobConf at the Hypertable table to scan
val conf = new JobConf
conf.set("hypertable.mapreduce.input.table", "your_table")

// Keys and values both come back as Hadoop Text
val data = sc.hadoopRDD(conf, classOf[TextTableInputFormat], classOf[Text], classOf[Text])
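
For Cassandra itself, here's an untested sketch along the same lines, assuming the ColumnFamilyInputFormat and ConfigHelper classes from Cassandra's Hadoop support are on the classpath (the keyspace and column family names below are placeholders). Since ColumnFamilyInputFormat uses the new mapreduce API, this uses newAPIHadoopRDD rather than hadoopRDD:

// Untested sketch: "MyKeyspace" and "MyCF" are placeholders
import java.nio.ByteBuffer
import java.util.SortedMap

import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
import org.apache.hadoop.mapreduce.Job

val job = new Job()
ConfigHelper.setInputInitialAddress(job.getConfiguration, "localhost")
ConfigHelper.setInputRpcPort(job.getConfiguration, "9160")
ConfigHelper.setInputColumnFamily(job.getConfiguration, "MyKeyspace", "MyCF")
ConfigHelper.setInputPartitioner(job.getConfiguration, "RandomPartitioner")

// Ask for all columns in each row
val predicate = new SlicePredicate()
val range = new SliceRange()
range.setStart(Array.empty[Byte])
range.setFinish(Array.empty[Byte])
predicate.setSlice_range(range)
ConfigHelper.setInputSlicePredicate(job.getConfiguration, predicate)

// Rows come back keyed by row key, with a sorted map of that row's columns
val casData = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])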

Matei

Erich Nachbar

Aug 22, 2012, 4:32:30 PM
to spark...@googlegroups.com
Todd,

There is a Cassandra Input Format (http://wiki.apache.org/cassandra/HadoopSupport) that should work.

I personally haven't tested it yet, because we create file dumps using parallel fetches (i.e., first fetch a batch of row keys only, then distribute those keys across your worker processes/threads and have them retrieve the actual row data), roughly like the sketch below.
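
In rough code, the pattern looks something like this (fetchRowKeys() and fetchRow() are hypothetical stand-ins for whatever Cassandra client you use, e.g. Thrift or Hector, not real library APIs):

import java.io.PrintWriter

val keys: Seq[String] = fetchRowKeys()            // 1. fetch row keys only
val rows = keys.par.map(key => fetchRow(key))     // 2. fetch row data on a thread pool
val out = new PrintWriter("cassandra-dump.tsv")   // 3. write the snapshot dump
rows.seq.foreach(row => out.println(row))
out.close()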

That works reasonably well if your data size isn't too big or if you do incremental dumps (e.g., by partitioning rows by time).

If you do have a lot of data, the InputFormat should help by taking data locality into consideration. Of course, you would also need to install Spark on the nodes running Cassandra.

We found that most tools for exploring the data in Cassandra are rather awkward, so having file dumps makes it easier for people to look at the data and provides a nice "snapshot in time" for the rest of our pipeline on Spark.

-Erich
--
Erich Nachbar


sipi...@gmail.com

Mar 3, 2013, 11:40:23 AM
to spark...@googlegroups.com
Hi Todd or others,

Does the integration work? I'm going to work on this, but I'm not sure how tricky it is. Looking forward to your comments.

Regards,
Siping

Prashant Sharma

Mar 4, 2013, 12:12:29 AM
to spark...@googlegroups.com
Who is Todd? And what integration are you talking about?


--
s

sipi...@gmail.com

Mar 4, 2013, 2:34:41 AM
to spark...@googlegroups.com
Sorry, part of my message got lost. I'm looking into integrating Spark & Shark with Cassandra. For example, suppose I issue a SQL query in Shark like the following:

select sum(col1), count(col2), col3, col4 from myCF
group by col3, col4

Can Shark/Spark do something like the following:

1. read data locally on each Cassandra node
2. aggregate by keys locally
3. reduce the results (meaning aggregate by keys again across nodes)
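
I imagine the raw Spark equivalent would be roughly the following (hypothetical: assumes rows is an RDD of (col1, col2, col3, col4) tuples already read from Cassandra, with col1 numeric; reduceByKey does a map-side combine on each node before shuffling, which would cover steps 2 and 3):

val result = rows
  .map { case (c1, c2, c3, c4) => ((c3, c4), (c1, 1L)) }            // key by (col3, col4)
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // local, then global aggregation
// => ((col3, col4), (sum(col1), count))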

Regards,
Siping

Reynold Xin

Mar 5, 2013, 1:21:04 AM
to spark...@googlegroups.com
Hi Siping,

We haven't tried it with Cassandra yet. My guess is the predicate and aggregation push-down won't work right now without an extra Cassandra-specific operator.


sent from mobile device. please excuse the brevity.



sipi...@gmail.com

Mar 6, 2013, 12:53:12 AM
to spark...@googlegroups.com
Thanks for your timely reply. I'm trying to build a solution for real-time monitoring/dashboards. There will be 5K-10K events per second (0.5KB-1KB per event) to monitor, and historical data has to be kept for around 1-2 months. A typical use case is listing all events relating to me (perhaps 1K-100K of them), with dozens of columns per event, and ordering and filtering need to happen on the fly.

From your perspective, what datastore would you recommend? Originally I wanted to use Cassandra. BTW, I cannot use anything licensed under LGPL, GPL, AGPL, etc. in my organization; Apache, BSD, MIT, and EPL are OK.
