Cascading 3.0.1 with checkpoints

Ron Gonzalez

Aug 20, 2015, 8:42:17 PM
to cascading-user
Hi,
  I had a Cascading job with checkpoints all working before migrating to Tez. After finally getting the job to work on Tez, I now see that the checkpoints aren't working: the job completes and creates the final resulting table, but none of the checkpoints get created.
  Any suggestions?

Thanks,
Ron

Chris K Wensel

Aug 21, 2015, 1:33:28 AM
to cascadi...@googlegroups.com

There was some mention of not supporting restartable Checkpoints here

but we should update the user guide to make it clearer that they aren’t supported, because Tez doesn’t allow for writing to HDFS between nodes (last I checked).

That by itself doesn’t provide a proper solution, which would entail us cutting a DAG into two or more smaller DAGs to provide that restartability while preserving as much efficiency as possible.

This in itself is complex, but I’m happy to entertain a patch.

ckw 

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/94d91708-2ba0-4ed0-858b-28c735d5ae09%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ron Gonzalez

Aug 21, 2015, 1:52:15 AM
to cascadi...@googlegroups.com
Thanks Chris. This is extremely unfortunate though, but I guess we'll have to figure something out. Checkpointing is one of those features of Cascading that really make it quite compelling.

On that note, given that Tez is 0.7.0 at best right now, would you recommend moving to Tez? I do see huge performance gains, but was wondering your thoughts on how safe it is to migrate...

Thanks,
Ron

Chris K Wensel

Aug 21, 2015, 12:23:46 PM
to cascadi...@googlegroups.com
> Thanks Chris. This is extremely unfortunate though, but I guess we'll have to figure something out. Checkpointing is one of those features of Cascading that really make it quite compelling.

In theory, all we need is a planner rule to identify and cut the assembly into steps (Tez DAGs, much as we do to get MR jobs). It will just take time to work through the edge cases.

> On that note, given that Tez is 0.7.0 at best right now, would you recommend moving to Tez? I do see huge performance gains, but was wondering your thoughts on how safe it is to migrate...

As of 3.0.2 (out today or so), Cascading is pinned to Tez 0.6.2. We don’t _recommend_ any other release at this time, but do try others and report issues if you can. It’s how things improve.

That said, many Apache projects have parallel branching semantics. That is, it’s not always clear whether, say, 0.7.0 is the logical successor of 0.6.2. It might be in theory, but it will likely have bits missing.

Due to the multi-commit patching policy, it is near impossible to determine lineage. That is, the same patch will be committed uniquely to one or more branches, with no merges to watch the flow.

So, give 0.7.0 a shot, and let us know how it goes. We aren’t running regressions on it yet.

ckw
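
For readers following along, the "cut the assembly into steps" idea above can be sketched in plain Java. This is only an illustration under simplifying assumptions: the pipeline is a linear chain rather than a general DAG, and the class `CheckpointCutSketch`, the method `cutAtCheckpoints`, and the node names are all hypothetical. None of it is Cascading's planner API, and it ignores the edge cases mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not part of Cascading or Tez.
final class CheckpointCutSketch {

    /**
     * Splits a linear pipeline into steps at checkpoint nodes.
     * Each checkpoint ends one step (acting as its sink) and starts
     * the next step (acting as its source), so a failed run could
     * in principle resume from the last completed checkpoint.
     */
    static List<List<String>> cutAtCheckpoints(List<String> pipeline, Set<String> checkpoints) {
        List<List<String>> steps = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < pipeline.size(); i++) {
            String node = pipeline.get(i);
            current.add(node);
            boolean isLast = i == pipeline.size() - 1;
            // Cut only when the checkpoint has something before and after it.
            if (checkpoints.contains(node) && !isLast && current.size() > 1) {
                steps.add(current);
                current = new ArrayList<>();
                current.add(node); // the checkpoint is re-read as the next step's source
            }
        }
        if (!current.isEmpty()) {
            steps.add(current);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> pipeline =
            Arrays.asList("source", "each", "checkpoint-1", "groupBy", "every", "sink");
        Set<String> checkpoints = new HashSet<>(Arrays.asList("checkpoint-1"));
        // Prints one list per step; with one checkpoint there are two steps.
        System.out.println(cutAtCheckpoints(pipeline, checkpoints));
    }
}
```

In this sketch each resulting step would correspond to one Tez DAG, with the checkpoint written to storage between them. A real planner rule would have to handle joins, splits, and multiple sinks as well.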


Ron Gonzalez

Aug 21, 2015, 1:20:26 PM
to cascadi...@googlegroups.com
Sounds good. Will the Spark fabric also not support checkpointing? We've migrated to 3.0 in preparation for the Spark fabric...:)

I'll give 0.7.0 a try and ping the list for any discovered issues.

Thanks,
Ron

Chris K Wensel

Aug 21, 2015, 2:20:55 PM
to cascadi...@googlegroups.com
Unlike Tez, which supports a full DAG per ‘job’, Spark only supports a reversed tree: multiple sources and a single sink per job.

In MapReduce (where each MR is a job), we drop in a temp file to chain two MR jobs together. This is in large part why MR is slow.

To map Cascading to Spark, we will have multiple jobs (steps in Cascading terms) per Flow if there are multiple sinks. The question I haven’t answered is what will link them together most reliably; hopefully not an intermediate file committed to disk.

In Tez, we _can’t_ write to an intermediate file within a job, and since Tez is a true DAG (multiple sources and sinks) per job, we don’t need steps, nor do we need to chain them together with intermediate files.

A further slowdown in MR is that you can’t do MRR; you can only do MR->MR, which forces data to disk and then uses a split-based parallelization (an identity Mapper) to forward the data to the next Reducer.

Spark has MRR, and so does Tez.

But Tez can fan out (split) arbitrarily after any M or R. Spark cannot within a job (from what I can tell); it will require one or more new jobs to manage the split. I don’t know yet, but the new job may require a split-based parallelization rather than a partitioned one. Most likely not, if memory serves, but there may be caveats.

Thus ultimately, for complex loads, Tez is likely better positioned than Spark.

So yes, I believe we can more easily do Checkpoints in Spark, by virtue of already having to partition Flows into multiple Steps (jobs) in order to execute the load.

ckw
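
To make the "reversed tree, one sink per job" point concrete, here is a rough plain-Java sketch. It is one reading of the description above, not how any Cascading Spark planner works: the class `SinkPartitionSketch`, the method `jobsPerSink`, and the node names are hypothetical. Given a flow graph, it produces one job per sink, each containing every node upstream of that sink.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: not part of Cascading or Spark.
final class SinkPartitionSketch {

    /**
     * edges maps each node to its downstream nodes.
     * Returns, for every sink (a node with no outgoing edges),
     * the set of nodes that feed it, including the sink itself.
     */
    static Map<String, Set<String>> jobsPerSink(Map<String, List<String>> edges) {
        // Build the reverse adjacency (child -> parents) and collect all nodes.
        Map<String, List<String>> parents = new HashMap<>();
        Set<String> nodes = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            nodes.add(e.getKey());
            for (String child : e.getValue()) {
                nodes.add(child);
                parents.computeIfAbsent(child, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Map<String, Set<String>> jobs = new TreeMap<>();
        for (String node : nodes) {
            List<String> out = edges.get(node);
            if (out != null && !out.isEmpty()) {
                continue; // not a sink
            }
            // Walk upstream from the sink to gather its reversed tree.
            Set<String> upstream = new TreeSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(node);
            while (!stack.isEmpty()) {
                String n = stack.pop();
                if (upstream.add(n)) {
                    for (String p : parents.getOrDefault(n, Collections.<String>emptyList())) {
                        stack.push(p);
                    }
                }
            }
            jobs.put(node, upstream);
        }
        return jobs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("source", Arrays.asList("filter"));
        edges.put("filter", Arrays.asList("sinkA", "sinkB"));
        // Two sinks, so two jobs; both jobs contain "source" and "filter".
        System.out.println(jobsPerSink(edges));
    }
}
```

Nodes shared by several jobs (here `source` and `filter`) are exactly where the jobs would need to be linked together or the work repeated, which is the open question raised above.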


Andre Kelpe

Aug 21, 2015, 2:40:24 PM
to cascading-user
On Fri, Aug 21, 2015 at 7:20 PM, Ron Gonzalez <zlgon...@gmail.com> wrote:
> Sounds good. Will the Spark fabric also not support checkpointing? We've migrated to 3.0 in preparation for the Spark fabric...:)
>
> I'll give 0.7.0 a try and ping the list for any discovered issues.

Tez 0.7 is mostly UI work, which I guess you don't really care about. Since 0.7.0 is a lot older than 0.6.2, it still has some nasty bugs (hanging threads in test suites, among others) that 0.6.2 does not suffer from. I'd say using 0.6.2 is your best bet right now from a stability point of view.

- André
 

Ron Gonzalez

Aug 21, 2015, 5:42:15 PM
to cascadi...@googlegroups.com
Hmm, OK. I guess I clearly don't understand the versioning strategy Tez is using, since it seems none of the stability fixes available in 0.6.2 are in 0.7.0. Good to know.

Thanks for the heads up. Will give 0.6.2 a try. From your tests, is it safe to say that Cascading on Tez is functionally equivalent in terms of created rows? Aside from checkpoints, are there any operators that would not create the same rows as their HadoopFlowConnector equivalent?

Thanks,
Ron

Chris K Wensel

Aug 22, 2015, 12:07:44 AM
to cascadi...@googlegroups.com
After all the directly supported platform tests are run, we compare the results across all of them (for the tests that don’t generate random, non-deterministic output).

Order is not part of the criteria. This should be obvious, but I’m stating it anyway.

ckw
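
An order-insensitive comparison of this kind can be sketched as a multiset check. This is not the Cascading test suite's actual code: the class `ResultComparisonSketch`, its methods, and the sample rows are hypothetical, and rows are assumed to be plain strings.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not taken from the Cascading platform tests.
final class ResultComparisonSketch {

    /** Counts how many times each row occurs, so duplicates are respected. */
    static Map<String, Integer> countRows(Collection<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    /** True when both outputs hold the same rows the same number of times, in any order. */
    static boolean sameRowsIgnoringOrder(Collection<String> left, Collection<String> right) {
        return countRows(left).equals(countRows(right));
    }

    public static void main(String[] args) {
        List<String> fromMapReduce = Arrays.asList("a\t1", "b\t2", "a\t1");
        List<String> fromTez = Arrays.asList("b\t2", "a\t1", "a\t1");
        // Same rows in a different order, so the outputs are considered equivalent.
        System.out.println(sameRowsIgnoringOrder(fromMapReduce, fromTez));
    }
}
```

Comparing counts rather than sets matters here: two outputs that differ only in how many times a duplicate row appears should not be treated as equal.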


Ron Gonzalez

Feb 6, 2016, 4:43:22 PM
to cascading-user
Hi,
  To summarize this old thread, I was told that Cascading 3.0.2 isn't guaranteed to work well with Tez 0.7.0. HDP 2.3 ships with 0.7.0, and since that was in August of last year, I'm just checking whether your regressions have started testing against the Tez 0.7.0 release.
  Would you say it would be OK to use Tez 0.7.0 with 3.0.2? Are there newer releases that would allow us to move to 0.7.0 with confidence?

Thanks,
Ron

Chris K Wensel

Feb 6, 2016, 5:22:33 PM
to cascadi...@googlegroups.com
In regard to 0.7.0, the earlier notes still hold.

That said, 0.8.2 seems to work great on regressions. My plan is to make it the default for 3.1, but I haven’t had time to set up long-running tests and build public Tez artifacts so others can share in the success.

ckw

Ron Gonzalez

Feb 7, 2016, 1:07:45 PM
to cascadi...@googlegroups.com
Cool, thanks Chris.


Cyrille Chépélov

Feb 8, 2016, 4:11:09 AM
to cascadi...@googlegroups.com
For what it's worth, things look fine here so far using Cascading 3.1.0-wip-52 and evicting tez-0.6.2 with 0.8.2 in the build.sbt file.

    -- Cyrille