Feel free to use both the Flink and Tez work as a starting point.
I’m happy to offer suggestions on the project as well.
Just a couple of things to consider…
Spark is imperative, in the sense that jobs fire as side effects of building the workload. For example, calling an action like saveAsHadoopFile (I may not have the exact signature right) will immediately fire off the job. That's great for a REPL, but challenging when building a declarative layer over the imperative one.
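To make that concrete, here's a minimal Scala sketch of the lazy/eager split (the app name and the input/output paths are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object EagerActionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("eager-action").setMaster("local[*]"))

    // Transformations are lazy; nothing has run yet, we've only built the graph.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // The action fires the job immediately, as a side effect of the call.
    // A declarative planner layered on top must control *when* such calls
    // happen, not just what the resulting graph looks like.
    counts.saveAsTextFile("output")

    sc.stop()
  }
}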
Consequently, the order in which those imperative calls are made can affect performance. The Cascading planner will allow for rules that attach job/step scheduling order as metadata for the Cascading scheduler/orchestrator (we might need a patch for this in Cascading; no other platform needs it).
Also, because of this sensitivity to ordering, there will need to be rules that mark intermediate RDDs or Stages as cached (see the sketch below). Note that much of Spark is really about overcoming the limitations of being both imperative and restricted to directed in-tree graphs for jobs, that is, not supporting forks within a job (I haven't looked at Spark 2.0; hopefully they spent their time lifting this limitation instead of providing yet another API).
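For example, here's a minimal sketch of what such a caching rule would have to produce at a fork point; the Event parser and the paths are hypothetical, only the cache() placement matters:

import org.apache.spark.{SparkConf, SparkContext}

object ForkCachingExample {
  case class Event(user: String, host: String)

  // Hypothetical tab-separated parser, for illustration only.
  def parseEvent(line: String): Event = {
    val fields = line.split('\t')
    Event(fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("fork-cache").setMaster("local[*]"))

    // The logical plan forks at `parsed`; caching it keeps the second
    // action from re-reading and re-parsing the entire input.
    val parsed = sc.textFile("events.log").map(parseEvent).cache()

    val byUser = parsed.map(e => (e.user, 1L)).reduceByKey(_ + _)
    val byHost = parsed.map(e => (e.host, 1L)).reduceByKey(_ + _)

    byUser.saveAsTextFile("by-user") // first action materializes the cache
    byHost.saveAsTextFile("by-host") // second action reuses it

    sc.stop()
  }
}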
The good news is you probably only need to create two custom RDDs, one that parallelizes on splits and one on hash partitions, to run the associated pipeline (once you dig into the planner, this will make a bit more sense). And then probably RDDs to wrap Tap instances (this is where I'm still a bit fuzzy), or calls out to equivalent existing RDDs.
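As a rough illustration of the shape of such a custom RDD, here's a sketch of a source RDD that parallelizes on splits. The `runPipelineOnSplit` callback is a stand-in for whatever the planner hands each task; it is not a real Cascading or Spark API:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// One Spark partition per input split.
class SplitPartition(val index: Int) extends Partition

// Hypothetical RDD that runs a planned pipeline over each split.
class PipelineRDD[T: ClassTag](
    sc: SparkContext,
    numSplits: Int,
    runPipelineOnSplit: (Int, TaskContext) => Iterator[T] // assumed callback
) extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSplits)(i => new SplitPartition(i))

  // Each task executes the pipeline over its own split and streams results.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    runPipelineOnSplit(split.index, context)
}

The hash-partition variant would look similar, but would take its parent RDD's partitioner-driven partitions instead of synthesizing them from split counts.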
These might be of help:
https://github.com/cwensel/notebook/blob/public/cluster-computing.adoc
https://github.com/cwensel/notebook/blob/public/cluster-technologies.adoc
I still need to update the table for Spark 1.x and 2.x — unless I get a PR for it.
ckw