There are a lot of ‘depends’ here to work through.
First, I’d see what the gap is in having the existing Cascading app write to the data lake directly, and rule the value of that in or out.
If the lake is really just a function of using Apache Iceberg or a vendor-specific format, we can discuss how to get that support. I plan to add Iceberg this year, but I need to add native Parquet support beforehand (which I’ve been sitting on for months).
As for Spark vs Cascading, that depends on the actual workloads.
If you have complex flows with a lot of forks in the Flow pipeline, you will likely see a slowdown if you ‘lift and shift’ the app over to Spark. If I remember correctly, Spark can’t plan a fork as a single parallel job: by the time the first branch’s action fires, the second branch hasn’t happened yet in your code, so each branch becomes its own job and the shared upstream work gets recomputed unless you cache it yourself. Cascading is declarative, Spark is imperative; you will write a lot of code to manage the difference.
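To make that concrete, here’s a rough sketch of the fork workaround on the Spark side (the paths and column names are made up): you cache the shared branch yourself, and if you want the two writes to actually run in parallel, you have to drive the blocking actions from your own threads.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForkSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("fork-sketch").getOrCreate();

    // the shared upstream work -- in Cascading, the pipe ahead of the fork
    Dataset<Row> events = spark.read().parquet("s3://bucket/events")  // hypothetical input
        .filter(col("ts").isNotNull());

    // without this, each branch below re-reads and re-filters the input from scratch
    events.cache();

    // each write is a blocking action, so the two branches run as separate,
    // sequential jobs unless you submit them from your own threads
    Thread errors = new Thread(() -> events
        .filter(col("level").equalTo("ERROR"))
        .write().parquet("s3://bucket/errors"));
    Thread audits = new Thread(() -> events
        .filter(col("type").equalTo("audit"))
        .write().parquet("s3://bucket/audits"));

    errors.start();
    audits.start();
    errors.join();
    audits.join();

    spark.stop();
  }
}
```

In Cascading the planner sees the whole Flow up front, so the fork is just part of the plan; on Spark, all of that bookkeeping is on you.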
If the Flow is trivial, there’s no harm in porting, I guess. But I’d also look at Flink or other tools, since they have stronger messaging and time semantics; then again, you might lose your ‘data lake’ format.
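To show what I mean by time semantics, here’s a minimal Flink sketch (inline data stands in for a real source like Kafka, just to keep it self-contained): you declare event time and tolerated out-of-orderness once, and windows close on the events’ own timestamps rather than arrival time.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // (key, epochMillis) pairs standing in for a real event source
    DataStream<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("a", 1_000L), Tuple2.of("a", 4_000L), Tuple2.of("b", 2_000L));

    events
        // declare event time and tolerated lateness once, up front
        .assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((e, ts) -> e.f1))
        .keyBy(e -> e.f0)
        // windows are computed on the events' own clock, not arrival time
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1)
        .print();

    env.execute("event-time-sketch");
  }
}
```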