On Wednesday, August 31, 2016 at 1:04:55 PM UTC-7, kant kodali wrote:
> Thanks!
>
>
> On Wed, Aug 31, 2016 at 1:02 PM, Russell Spitzer <
russell...@gmail.com> wrote:
>
> That's about it. Good luck!
>
>
>
>
> On Wed, Aug 31, 2016 at 12:58 PM kant kodali <
kant...@gmail.com> wrote:
>
> Hi Russell,
>
>
> I think your responses are worth more than 100 pages in a book (not that I don't want to read; I generally prefer books that cover things in depth rather than giving a high-level overview, and I already have one of the books you mentioned). That said, I want to value your time, so I am trying hard to ask as few questions as possible, and I think this will be the last one :) You have already given me a lot of your thoughts for free, so I can't thank you enough.
>
>
> Regarding implementing a custom DStream: it looks like I have to implement a bunch of things, which is fine, but I want to make sure I am headed in the right direction. So, to implement a custom DStream:
> 1) I need to extend InputDStream and override the start() and stop() methods.
> 2) Since InputDStream extends DStream, I should also override compute() to create RDDs based on offsets.
>
> 3) I should also implement my own RDD, just like KafkaRDD (from the links you pointed out), and potentially create other classes similar to the ones referenced in DirectKafkaInputDStream.scala. A rough skeleton of what I am picturing is below.
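>
> Roughly, this is the skeleton I have in mind (NSQRDD, the offset handling, and the client calls are placeholders I made up, not a real NSQ API):
>
> import org.apache.spark.{Partition, SparkContext, TaskContext}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.streaming.{StreamingContext, Time}
> import org.apache.spark.streaming.dstream.InputDStream
>
> // Placeholder RDD that would read one offset range, analogous to KafkaRDD.
> class NSQRDD(sc: SparkContext, topic: String, from: Long, until: Long)
>   extends RDD[String](sc, Nil) {
>   override protected def getPartitions: Array[Partition] =
>     Array(new Partition { override def index: Int = 0 })
>   override def compute(split: Partition, context: TaskContext): Iterator[String] =
>     Iterator.empty // placeholder: fetch records in [from, until) from the source here
> }
>
> // Direct-style input stream: one RDD per batch interval, driven by offsets.
> class NSQInputDStream(streamingContext: StreamingContext, topic: String)
>   extends InputDStream[String](streamingContext) {
>
>   private var currentOffset: Long = 0L // last offset handed out to a batch
>
>   override def start(): Unit = {} // open the connection to the source
>   override def stop(): Unit = {}  // close the connection
>
>   // Called once per batch interval: pick the offset range for this batch
>   // and return an RDD that knows how to read exactly that range.
>   override def compute(validTime: Time): Option[RDD[String]] = {
>     val untilOffset = currentOffset + 1000L // placeholder: ask the source for its latest offset
>     val rdd = new NSQRDD(streamingContext.sparkContext, topic, currentOffset, untilOffset)
>     currentOffset = untilOffset
>     Some(rdd)
>   }
> }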
>
>
> Does this sound about right?
>
>
> Thanks,
> Kant
>
> On Tue, Aug 30, 2016 4:32 PM, Russell Spitzer
russell...@gmail.com
> wrote:
>
http://spark.apache.org/docs/latest/streaming-programming-guide.html#when-to-enable-checkpointing
> Checkpointing details ^^
>
> You need HDFS (or something compatible) for a couple of specific streaming features: the WAL (if you need it) and checkpointing.
>
> As for your use case
>
> It's a pretty common use case for folks to run a stream through Spark and save aggregates to C*. Structured Streaming is not really ready for
> production use yet, as it lacks the necessary sinks and sources. If you are writing your own receiver for NSQ (I don't know what that is), you'll have
> to look into the docs to figure out how to get Spark to behave correctly on failure. Usually the issue is that you don't want to update your offset in
> the queue until the data has been completely processed, and of course: where do you save that offset? :)
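>
> Very roughly, the write-then-record pattern looks like this with the connector (the keyspace, tables, and offset handling are placeholders, not a prescription):
>
> import com.datastax.spark.connector._
> import com.datastax.spark.connector.cql.CassandraConnector
> import org.apache.spark.streaming.dstream.DStream
>
> case class Event(id: String, value: Double) // made-up schema
>
> def saveBatches(stream: DStream[Event]): Unit =
>   stream.foreachRDD { rdd =>
>     // 1. Write the batch itself first.
>     rdd.saveToCassandra("analytics", "events")
>
>     // 2. Only after the write succeeds, record how far we got,
>     //    e.g. in a small offsets table (names here are made up).
>     val lastOffset: java.lang.Long = 0L // placeholder for the batch's last offset
>     CassandraConnector(rdd.sparkContext.getConf).withSessionDo { session =>
>       session.execute(
>         "UPDATE analytics.offsets SET last_offset = ? WHERE topic = ?",
>         lastOffset, "my_topic")
>     }
>   }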
>
>
> On Tue, Aug 30, 2016 at 3:36 PM kant kodali <
kant...@gmail.com> wrote:
>
> I heard in a spark-user thread that checkpointing is required for windowing (I was trying to see how far I can go without a distributed storage to build a real-time analytics backend). I wasn't too sure if that was the case, but thanks for clarifying it.
>
>
> Some people here are a bit hesitant to use HDFS, but if we have to, we will. The majority of them seem to prefer Cassandra as a distributed storage.
>
>
> A couple of our use cases include the following:
> 1) We want the results of the computation to be stored somewhere such that when we issue a Spark SQL query we get the results back instead of recomputing them again (a sketch of what we have in mind is below).
> 2) Checkpointing whatever Spark Streaming wants to checkpoint. If we lose messages it should be ok, because we use a queueing system called NSQ which will resend them if we do.
> Our architecture is like this: we have an NSQ consumer (which will also hold the Spark context) that pushes data to the Spark cluster to perform the necessary computation and sends the results to a dashboard server, and maybe also stores them in Cassandra so we can query them at a later time.
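>
> For the Spark SQL part, something like this is roughly what we have in mind (the keyspace and table names are only placeholders):
>
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
>
> // Sketch only: read the already-stored aggregates back so a SQL query
> // hits Cassandra instead of recomputing the stream.
> def queryStoredResults(sc: SparkContext): Unit = {
>   val sqlContext = new SQLContext(sc)
>   val results = sqlContext.read
>     .format("org.apache.spark.sql.cassandra")
>     .options(Map("keyspace" -> "analytics", "table" -> "events"))
>     .load()
>
>   results.registerTempTable("events")
>   sqlContext.sql("SELECT id, sum(value) FROM events GROUP BY id").show()
> }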
> Any ideas or thoughts will be greatly appreciated.
>
>
> Thanks Again!
>
>
> On Tue, Aug 30, 2016 3:06 PM, Russell Spitzer
russell...@gmail.com
> wrote:
>
>
>
>
> Checkpointing is not required for windowing. Perhaps you mean stateful transformations? Or fault tolerance? For Kafka fault tolerance, all that's really required is an offset storage location, which can be Cassandra.
>
> Cassandra out of the box does not provide an HDFS-compatible system for generic checkpointing. Due to the size and nature of checkpoint files, I would recommend against checkpointing anything too large to Cassandra anyway.
>
> That said
>
> DataStax Enterprise does include an HDFS replacement, DSEFS, which runs using Cassandra as the NameNode service (avoiding the large-checkpoints-in-C* problem), as well as an older system, CFS, which stores data directly in C*.
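>
> With either of those, pointing the streaming context at it works the same way as with HDFS; a rough spark-shell-style sketch (the host, port, and path are only placeholders):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> // Sketch: checkpointing to any HDFS-compatible URI; with DSEFS the scheme
> // is dsefs:// (host, port, and path below are placeholders).
> val conf = new SparkConf().setAppName("streaming-with-checkpointing")
> val ssc = new StreamingContext(conf, Seconds(10))
> ssc.checkpoint("dsefs://127.0.0.1:5598/checkpoints")
> // ... define the input streams and transformations, then:
> ssc.start()
> ssc.awaitTermination()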
>
>
>
> On Mon, Aug 29, 2016 at 1:51 PM kant kodali <
kant...@gmail.com> wrote:
>
> I understand that I cannot use the Spark Streaming window operation without checkpointing to HDFS, but without window operations I don't think we can do much with Spark Streaming. Since it is essential, can I use Cassandra as the distributed storage? If so, can I see an example of how to tell the Spark cluster to use Cassandra for checkpointing?
Were you able to get this working? There is another feature, DSEFS (the distributed file system in DataStax Enterprise 5.0); we are trying to use it for the same use case and are seeing some issues. I would like to know if anyone has had success using DSEFS for checkpointing.