Cascading 3.1.0 on Flink

38 views
Skip to first unread message

Ron Gonzalez

unread,
May 21, 2016, 2:55:36 AM5/21/16
to cascading-user
Hi,
I wanted to find out if Cascading on Flink automatically lets users create a data flow that becomes streaming enabled. Any experience using Cascading with Flink for both streaming and batch? What kind of performance difference has one seen for the same flow but different fabric (tez vs flink)?

Thanks,
Ron

Ken Krugler

unread,
May 21, 2016, 9:44:07 AM5/21/16
to cascadi...@googlegroups.com
Hi Ron,

On May 20, 2016, at 11:55pm, Ron Gonzalez <zlgon...@gmail.com> wrote:

Hi,
 I wanted to find out if Cascading on Flink automatically lets users create a data flow that becomes streaming enabled. Any experience using Cascading with Flink for both streaming and batch?

No - the streaming model is very different from the batch model, once you start talking about groups/joins.

E.g. in streaming, when do you decide to trigger the grouping?

It’s likely possible to figure out a way to re-use Cascading map/filter operations in a Flink streaming environment, but I haven’t dug into that.

What kind of performance difference has one seen for the same flow but different fabric (tez vs flink)?

I haven’t gotten my test workflow to run on Tez in Elastic Mapreduce.

But I can tell you that performance differences between “classic” Map-Reduce and Flink will be hard to reduce down to a typical factor, as this depends on so many factors - the nature of the workflow, the nature of the data being pumped through the workflow, and amount of data relative to the amount of memory, how Flink & MR have been configured, the type of hardware, etc., etc. I imagine the same will be true of Flink vs. Tez.

I got a 50% speedup going from MR to Flink for one specific workflow, with moderate tuning. See the slides from my ApacheCon talk

Regards,

— Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



zlgonzalez

unread,
May 22, 2016, 12:35:58 PM5/22/16
to cascadi...@googlegroups.com
Thanks Ken. Agreed that particular operators need some extra configuration for defining the window, but like you said, the map/filter operations could be interesting.

I'm surprised to hear that you only a 50% speed up from MR to Flink. We saw a huge improvement when moving to Tez just from the reduction of the submitted jobs though since we were only 0.6.2, the reducers are alow slower in Tez.

Thanks,
Ron



Sent via the Samsung Galaxy S7 edge, an AT&T 4G LTE smartphone

Ken Krugler

unread,
May 22, 2016, 12:57:26 PM5/22/16
to cascadi...@googlegroups.com
On May 22, 2016, at 9:35am, zlgonzalez <zlgon...@gmail.com> wrote:

Thanks Ken. Agreed that particular operators need some extra configuration for defining the window, but like you said, the map/filter operations could be interesting.

I'm surprised to hear that you only a 50% speed up from MR to Flink. We saw a huge improvement when moving to Tez just from the reduction of the submitted jobs though since we were only 0.6.2, the reducers are alow slower in Tez.

If you take a look at my slides, you’ll see that the baseline (MR) runtime was 148 minutes, for 12 jobs. So the overhead of firing up a job as a percentage of total run time wasn’t significant.

And a key factor for speedup is the net reduction in spilling to disk/reading from disk.

For the workflow I was running, almost all phases had significantly more memory than could fit in memory, so there wasn’t as much of a win for Flink.

Estimating the performance win from switching to Flink is hard to do, but I agree that 50% was on the lower end of what I was expecting.

— Ken

Chris K Wensel

unread,
May 22, 2016, 6:31:56 PM5/22/16
to cascadi...@googlegroups.com
If you drop the Driven plugin into your apps and run them on different platforms, all other things being equal, you an compare performance from a number of angles via the rollup of counter values etc.

CPU_MILLIS is a very handy metric available on many platforms (unsure about Flink)

ckw

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/5B58E715-258F-46D6-A5C7-0A2E920486A3%40transpac.com.
For more options, visit https://groups.google.com/d/optout.

Chris K Wensel




Reply all
Reply to author
Forward
0 new messages