update on Cascading over Spark

duob...@homeaway.com

May 17, 2017, 11:45:09 AM

to cascading-user

Hi folks, congrats to the core team on the December acquisition. Wondering if Cascading for Spark is still on any roadmap?

Thanks,

Dusty

Chris K Wensel

May 17, 2017, 12:14:50 PM

to cascadi...@googlegroups.com

I can’t speak for Xplenty, but I’ve no plans to port to Spark.

Aside from not having the time, I don’t believe the underlying model is suitable for the types of workloads Cascading was designed to support.

https://github.com/cwensel/notebook/blob/public/cluster-computing.adoc

I strongly recommend looking at Tez, Flink, or Hazelcast Jet.

https://hazelcast.com/products/jet/

(though it looks like it’s going through a re-org, as I don’t see the Cascading sub-project from Google)

Sounds like Twitter is having luck with Tez at scale.

ckw

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/5b5807de-fbf4-45c2-ad35-7c3b6b1b5390%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dusty O'Brien

May 17, 2017, 12:41:33 PM

to cascadi...@googlegroups.com

Thanks Chris – we’ve been thinking about using Cascading over Tez on our local cluster. Are you suggesting running Cascading over Flink or Jet also? Spark is being encouraged by our data teams (and their pre-processing pipeline build-outs) as we move more deeply into AWS.

Thanks,

Dusty

Chris K Wensel

May 17, 2017, 1:24:11 PM

to cascadi...@googlegroups.com

I can only speak from experience with Tez; I do strongly recommend it as complementary to MR. I wouldn’t switch over entirely, but port one app at a time as you gain experience with it. There are many differences.

Spark just won’t be efficient for complex workloads (pipelines with branches). See the aforementioned doc.

The irony is that Tez is primarily used underneath Hive, but it is a DAG model (not a directed in-tree like Spark, which is suitable for SQL workloads), so branches can run in parallel on the cluster (vs. sequentially).
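
To make the in-tree versus DAG distinction concrete, here is a minimal Python sketch. It is not Spark or Tez code; the step names and the `is_in_tree` helper are made up for illustration. It treats a pipeline as a list of producer-to-consumer edges and checks whether the shape is a directed in-tree, meaning every step feeds exactly one consumer and everything converges on a single sink.

```python
def is_in_tree(edges):
    """Return True if the (producer, consumer) edges form a directed in-tree."""
    nodes = {node for edge in edges for node in edge}
    consumer_of = {}
    for producer, consumer in edges:
        if producer in consumer_of:
            # A step feeding two consumers is a branch, so not an in-tree.
            return False
        consumer_of[producer] = consumer

    # Exactly one step may have no consumer: the single sink.
    sinks = nodes - set(consumer_of)
    if len(sinks) != 1:
        return False
    sink = next(iter(sinks))

    # Every step must reach the sink without looping.
    for start in nodes:
        seen = set()
        node = start
        while node != sink:
            if node in seen:
                return False
            seen.add(node)
            node = consumer_of[node]
    return True


# Hypothetical shapes: a two-input join, and one input fanning out to two outputs.
JOIN = [("in1", "join"), ("in2", "join"), ("join", "out")]
BRANCHING = [("in", "parse"), ("parse", "out_a"), ("parse", "out_b")]

print(is_in_tree(JOIN))       # True
print(is_in_tree(BRANCHING))  # False
```

A join converging on one output passes the check; a pipeline that fans out to several outputs does not, and that second shape is the one under discussion here.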

Also, I’ve had no issues running Tez in AWS/EMR.

ckw

Cyrille Chépélov

May 17, 2017, 1:29:34 PM

to cascadi...@googlegroups.com

Still happily doing Scalding on Tez ;)

-- Cyrille

Dusty O'Brien

May 17, 2017, 1:43:42 PM

to cascadi...@googlegroups.com

Thanks Chris, do you have a summary of which platforms map to which feature sets in your link? E.g. that Spark does directed in-tree.

We do have pipelines that take N inputs and create M outputs. So I guess these are the areas where you say Spark would be inefficient and might require a redesign.

It remains to be seen how much external force will be applied to encourage us to migrate our solution into Spark. We’re happy with our current processes using the Cascading over MapReduce solution right now, and are about to give Cascading over Tez a try.

Chris K Wensel

May 17, 2017, 3:34:28 PM

to cascadi...@googlegroups.com

> Thanks Chris, do you have a summary of which platforms map to which feature sets in your link? E.g. that Spark does directed in-tree.

https://github.com/cwensel/notebook/blob/public/cluster-technologies.adoc

I haven’t filled out the other platforms, and would actually prefer someone else to do so (PR please!). I know Tez and MR very well, Spark fairly well (yet didn’t add it), and Flink not so much (the Flink guys declined to contribute). Jet is a new beast; they based it on the cluster notes I sent earlier, but I don’t know what the end result looks like.

> We do have pipelines that take N inputs and create M outputs. So I guess these are the areas where you say Spark would be inefficient and might require a redesign.

In Spark, every time you call ‘write to hdfs’ for a pipeline, the job executes immediately. If you have three branches that each write to hdfs, you get three jobs, with the risk of the subsequent jobs back-tracking the work of the previous ones (yes, there is caching, but this is the root of many problems I hear about).
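
The back-tracking risk can be sketched with a toy model in Python. This is not the Spark API; the pipeline, the step names, and the `job_per_write` and `single_dag` helpers are all hypothetical, and real behaviour depends on caching and other details. The model counts how often each step runs when every write launches its own job and re-walks its lineage, compared with planning all three writes as one DAG.

```python
from collections import Counter

# Hypothetical pipeline: each step maps to the steps it reads from.
PIPELINE = {
    "read": [],
    "parse": ["read"],
    "branch_a": ["parse"],
    "branch_b": ["parse"],
    "branch_c": ["parse"],
    "write_a": ["branch_a"],
    "write_b": ["branch_b"],
    "write_c": ["branch_c"],
}
SINKS = ["write_a", "write_b", "write_c"]


def lineage(step, deps, materialized=frozenset()):
    """Steps needed to produce `step`, upstream first, stopping at materialized ones."""
    ordered, seen = [], set()

    def visit(node):
        if node in seen or node in materialized:
            return
        seen.add(node)
        for parent in deps[node]:
            visit(parent)
        ordered.append(node)

    visit(step)
    return ordered


def job_per_write(deps, sinks, cached=frozenset()):
    """One job per write; only steps listed in `cached` survive between jobs."""
    runs = Counter()
    materialized = set()
    for sink in sinks:
        for step in lineage(sink, deps, materialized):
            runs[step] += 1
            if step in cached:
                materialized.add(step)
    return runs


def single_dag(deps, sinks):
    """All writes planned together: same walk, but every result is reused."""
    return job_per_write(deps, sinks, cached=set(deps))


print(job_per_write(PIPELINE, SINKS)["parse"])                    # 3
print(job_per_write(PIPELINE, SINKS, cached={"parse"})["parse"])  # 1
print(single_dag(PIPELINE, SINKS)["parse"])                       # 1
```

In this model the shared `read` and `parse` steps run three times unless something is explicitly cached, which is the kind of repeated work described above.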

Cascading (and any DAG model) will run the writes simultaneously if they are independent — thanks to topological traversal of the step workloads in Tez/MR, vs depth-first in Spark.

For a small cluster it’s probably irrelevant (all work is serialized anyway), but a large Spark cluster will tend towards being underutilized as workload complexity goes up (not data size).
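
A rough back-of-the-envelope model of that utilization point, again in Python and under loudly simplifying assumptions: every branch is a bag of independent tasks that each take one time unit, workers are interchangeable, and scheduling overhead is ignored. The function names and the numbers are made up for illustration and are not measurements of any real cluster.

```python
import math


def sequential_makespan(branch_tasks, workers):
    """Branches run one after another; each gets the whole cluster in turn."""
    return sum(math.ceil(tasks / workers) for tasks in branch_tasks)


def concurrent_makespan(branch_tasks, workers):
    """Independent branches share the cluster at the same time."""
    return math.ceil(sum(branch_tasks) / workers)


def utilization(branch_tasks, workers, makespan):
    """Fraction of available worker time spent doing task work."""
    return sum(branch_tasks) / (workers * makespan)


branches = [4, 4, 4]  # three independent branches, four unit tasks each

for workers in (1, 12):
    seq = sequential_makespan(branches, workers)
    con = concurrent_makespan(branches, workers)
    print(workers, seq, con,
          round(utilization(branches, workers, seq), 2),
          round(utilization(branches, workers, con), 2))
```

With one worker both orderings take 12 time units, so nothing is lost. With 12 workers the sequential ordering takes 3 units against 1 for the concurrent one, leaving the cluster about one third utilized, and adding more branches of the same size widens that gap without any increase in per-branch data.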

> It remains to be seen how much external force will be applied to encourage us to migrate our solution into Spark. We’re happy with our current processes using the Cascading over MapReduce solution right now, and are about to give Cascading over Tez a try.

I heard a rumor of a company using Cascading throwing out 50k lines of Spark code in an attempt to port. They are trying again. I hear they are unable to use Tez for non-technical reasons.

I’m not trying to throw FUD at Spark; every system has its strengths and weaknesses.

E.g., I believe Flink only supports ‘simultaneous staging’; that is (because it is really a streaming platform), you have to instantiate enough workers to satisfy your workload end to end. This can be problematic for complex and large workloads (that is, you have to increase your cluster size proportionally to the workload/data increase, but I could have been misinformed on this point).

Tez is not a multi-DAG, so there are some inefficiencies with self-joins when compared to MR (it’s not a multi-DAG, but a Mapper can have multiple/branched data pipelines into the Reducer).

ckw

Chris K Wensel

May 17, 2017, 3:35:52 PM

to cascadi...@googlegroups.com

Maybe you can push Twitter to release their new (Scalding specific) planner rule optimizations!

ckw
