Reading from Cassandra as a data stream


Lishu Liu

Feb 25, 2015, 5:13:06 PM2/25/15
to spark-conn...@lists.datastax.com
I'm new to Spark Streaming, so I might just not be getting the point.

Here is my use case. I have a daily job that keeps writing results to a Cassandra table. Now I want to read the data back out to extract the information I want. I can do it via batch processing, but I really don't want to process the whole table every day. It seems better to set up a stream, so that as new data is appended to the Cassandra table, it gets processed.

I tried val rdd = ssc.cassandraTable("streaming_test", "key_value").select("key", "value"), but it doesn't come back as a DStream; it is still a CassandraRDD[CassandraRow].

The other workaround I can think of is to change my daily task to also post the results via a TCP socket, so Spark can pick up from there instead of Cassandra. Or is it better to integrate Kafka, Akka, etc. to solve my case?
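
Something like this minimal sketch is what I'm imagining for the socket route (the host and port are made up, and it assumes an existing StreamingContext ssc):

import org.apache.spark.streaming.dstream.DStream

// Assumes an existing StreamingContext ssc; host and port are made up.
val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)

// Each record my daily task writes to the socket would show up here.
lines.print()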

Hugo Ferreira

Feb 26, 2015, 4:15:53 AM2/26/15
to spark-conn...@lists.datastax.com
Hi,

Take a look at the thread:

https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/Q9E4TqIXGEM

Helena Edelson has posted links to Kafka examples.

HTHS.


Lishu Liu

Feb 26, 2015, 11:23:59 AM2/26/15
to spark-conn...@lists.datastax.com
Yes, I saw that thread. Thanks, Hugo. Do you mean I should build a Kafka cluster to read from Cassandra as a stream?

Hugo Ferreira

Feb 27, 2015, 3:43:23 AM2/27/15
to spark-conn...@lists.datastax.com
Hi,

On Thursday, 26 February 2015 16:23:59 UTC, Lishu Liu wrote:
> Yes, I saw that thread. Thanks, Hugo. Do you mean I should build a Kafka cluster to read from Cassandra as a stream?
>

No. I was suggesting looking at the Kafka consumer to figure out how to
deal with streams.

Not that I know anything about streaming, but I figure that when you read from the Cassandra DB it will always be batch processing. I think what you want is to (rough sketch after the list):
1. Read your daily data as a stream (Kafka or whatever you are using now)
2. Process the stream as you want
3. Store the results into Cassandra
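
Something along these lines, as a rough sketch only (the topic name, ZooKeeper address, and consumer group below are made up; it assumes an existing StreamingContext ssc, the spark-streaming-kafka artifact, and records arriving as "key,value" lines):

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams
import org.apache.spark.streaming.kafka.KafkaUtils

// 1. Read the daily data as a stream from Kafka
//    (names below are made up; ssc is an existing StreamingContext).
val kafkaStream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "spark-consumer-group", Map("daily_results" -> 1))

// 2. Process each record; a "key,value" line format is assumed here.
val processed = kafkaStream.map { case (_, line) =>
  val Array(key, value) = line.split(",", 2)
  (key, value)
}

// 3. Store the processed stream into the table from the original post.
processed.saveToCassandra("streaming_test", "key_value", SomeColumns("key", "value"))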

Spark supports streaming via a StreamingContext (see [1]), so use that for your
stream processing. For common use cases, see [2]. You could get fancier and
think of multiple streams: maybe one for pre-processing data before it goes to Cassandra and another for real-time statistics and monitoring.
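
For reference, a minimal sketch of wiring up the StreamingContext itself (the app name, batch interval, and connection host are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("cassandra-streaming") // placeholder name
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

// A batch is formed every 10 seconds; pick an interval that fits your load.
val ssc = new StreamingContext(conf, Seconds(10))

// ... define input streams and transformations here ...

ssc.start()            // start receiving and processing
ssc.awaitTermination() // block until stopped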

If, however, you want the Cassandra DB to be the source of a stream, then page 37
of the slides in [3] is what you are looking for.

HTHs
HF

[1] https://spark.apache.org/streaming/
[2] http://www.slideshare.net/helenaedelson/streaming-bigdata-helenawebinarv3
[3] http://www.slideshare.net/helenaedelson?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview