Is a retry policy available for individual Hadoop jobs?


ranjan banerjee

Jun 5, 2018, 3:05:05 PM
to cascading-user
Hello Cascading Users,
   Is there any retry policy available for individual hadoop jobs when they fail instead of failing the whole topology? Is there a configuration which can support this setting today? I am using scalding and the behaviour I notice is that if an individual hadoop job fails, the entire topology fails. 

Thanks for any insights!
Ranjan

Chris K Wensel

Jun 5, 2018, 11:05:29 PM
to cascadi...@googlegroups.com
No, there is no retry policy per job.

The reasoning is that transient failures should be handled by the task retry settings; in Hadoop the default is 3 retries. After a task fails its final attempt, the job is failed.

If failures happen at the job level, they are not transient (since the failure happened at least 3 times) and would repeat on the next attempt, so retrying is pointless.

That said, if you want the Flow to survive failures thrown by operations (because of bad data), register a trap for the branch in the pipe assembly. The Flow won't fail, and the bad data will be captured, all without triggering a task retry.
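
For reference, here is a minimal sketch of registering a trap (my example, not from the user guide; the paths, fields, and regex are hypothetical placeholders, and it assumes the FlowDef API and RegexParser's default fail-on-no-match behaviour):

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class TrapExample {
  public static void main(String[] args) {
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:/example/input");
    Tap sink = new Hfs(new TextDelimited(Fields.ALL, "\t"), "hdfs:/example/output");
    Tap trap = new Hfs(new TextDelimited(Fields.ALL, "\t"), "hdfs:/example/trap");

    Pipe pipe = new Pipe("parse");
    // assumption: RegexParser fails on lines that do not match the pattern;
    // without a trap, that failed task (after its Hadoop retries) would fail the whole Flow
    pipe = new Each(pipe, new Fields("line"),
        new RegexParser(new Fields("id", "value"), "^(\\d+)\\t(.*)$", new int[]{1, 2}));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("trap-example")
        .addSource(pipe, source)
        .addTailSink(pipe, sink)
        .addTrap(pipe, trap); // offending tuples are diverted here instead of failing the job

    new HadoopFlowConnector().connect(flowDef).complete();
  }
}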

The user guide covers this in detail.

ckw



ranjan banerjee

Jun 6, 2018, 12:12:37 AM
to cascadi...@googlegroups.com
Hi Chris,
    Thanks for your response. The situation I ran into is that we have a Scalding topology that consists of around 20 Hadoop jobs and takes around 3.5 hours to run. The jobs at the very end of the topology are essentially sink jobs (writing to HDFS), which I believe make more metadata calls to the namenode. Sometimes, under increased namenode pressure, these calls fail, the final Hadoop job fails, and the entire topology is restarted. Historically these job failures have been transient (the pressure on the namenode varies depending on cluster usage).

Do you think this kind of error warrants a job-level retry?

Thanks
Ranjan


Chris K Wensel

Jun 6, 2018, 12:20:37 AM
to cascadi...@googlegroups.com
I’m not opposed to it; feel free to provide a pull request with a test and a draft implementation branched off wip-3.3.

But you may find shoring up the namenode a better long-term option, or simply sorting out ways to reduce namenode pressure, say by using fewer reducers on the final job.
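
As an illustration of that last point (my sketch, not something from this thread), the reducer count can be capped by handing the standard Hadoop property to the connector; note this applies to every step in the Flow, not just the final job. If memory serves, Scalding also lets you set reducers per groupBy, which would limit the change to the final step.

import java.util.Properties;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;

public class FewerReducers {
  public static void main(String[] args) {
    Properties properties = new Properties();
    // fewer reducers => fewer output files => fewer namenode metadata calls;
    // the value 10 is only a placeholder (on older MR1 clusters the key is mapred.reduce.tasks)
    properties.setProperty("mapreduce.job.reduces", "10");

    FlowConnector connector = new HadoopFlowConnector(properties);
    // build and connect the Flow with this connector as usual
  }
}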

ckw


ranjan banerjee

Jun 6, 2018, 9:39:52 AM
to cascadi...@googlegroups.com
Thanks Chris.
I would very much like to contribute to this with a PR. As I am very new to the project, could you kindly point me to the class(es) responsible for stitching multiple Hadoop jobs together? Searching around, I found HadoopFlow (https://github.com/Cascading/cascading/blob/3.3/cascading-hadoop/src/main/shared-mr1/cascading/flow/hadoop/HadoopFlow.java#L122), which seems to set individual Hadoop params for a particular job.

Ranjan


Ken Krugler

Jun 6, 2018, 10:52:57 AM
to cascadi...@googlegroups.com
Hi Ranjan,

One thing we’ve done here is to break the one Flow up into multiple pieces.

Then you can use a Cascade to re-run only the portions that didn’t complete before the failure.

This does introduce some latency and extra storage in HDFS that you’ll need to clean up.
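
A minimal sketch of that wiring (mine, not Ken's code), assuming two already-built Flows that hand off through an intermediate HDFS Tap; if I remember right, the default skip strategy leaves alone any Flow whose sinks are already newer than its sources, which is what lets a re-run resume after the last completed piece:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class CascadeExample {
  // firstHalf sinks to an intermediate HDFS path; secondHalf reads that path
  // and writes the final output
  public static void runAsCascade(Flow<?> firstHalf, Flow<?> secondHalf) {
    Cascade cascade = new CascadeConnector().connect(firstHalf, secondHalf);
    cascade.complete(); // completed flows with up-to-date sinks are skipped on a re-run
  }
}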

— Ken


ranjan banerjee

Jun 6, 2018, 3:49:16 PM
to cascadi...@googlegroups.com
Hi Ken,
   That makes sense. I will dig into how Scalding stitches multiple flows together in Cascading to determine the rerun policy. Thanks for the pointers.

Ranjan


Chris K Wensel

Jun 6, 2018, 6:58:08 PM
to cascadi...@googlegroups.com
See FlowStepJob#blockOnJob().

That should get you going in the right direction.

But keep in mind this logic ranges from 3 to 9 years old,

and is deeply integrated with the internal monitoring framework, allowing folks to properly quantify resource consumption from flows.

Consequently there are some complex issues, like: do counters roll back and increment anew on the retry? And for apps that are monitored, do we add a new construct called Step_Attempt to track the retries (so downstream systems can alert on too many retries vs. a single failure)?

So I would first exhaust the possibilities of your application becoming a good citizen and reducing pressure on the namenode, instead of continuously crushing the namenode via retries until it works, assuming that's even possible.
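
For what it's worth, the retry loop itself is the easy part. Purely as an illustration (none of this is existing Cascading code, and submitStep is a hypothetical stand-in for whatever blockOnJob() would end up wrapping), the shape would be roughly:

import java.util.concurrent.Callable;

public class StepRetry {
  // Illustrative only: counters and step stats would need the per-attempt
  // handling described above, which this sketch deliberately ignores.
  static <T> T retryStep(Callable<T> submitStep, int maxAttempts) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return submitStep.call();
      } catch (Exception failure) {
        last = failure; // e.g. a transient namenode error gets another attempt
      }
    }
    throw last; // attempts exhausted: fail the step, and hence the Flow
  }
}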

ckw
