I had a quick question. We are using lemur to launch our cascalog jobs on EMR. One of the workflows uses checkpointed code that can re-run from the point of failure; the only condition is that it must be re-run on the same cluster. Is there a provision to relaunch or retry a job flow step on failure, via a hook or any other means, on the same cluster?
--
You received this message because you are subscribed to the Google Groups "Lemur User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lemur-user+...@googlegroups.com.
To post to this group, send email to lemur...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/lemur-user/-/MtQeaqnnSVMJ.
For more options, visit https://groups.google.com/groups/opt_out.
(defn run-as-workflow [options]
  (workflow [(:staging options)]
    step-1 ([:tmp-dirs stage-step-1]
            (stage-listings-with-content options stage-step-1))
    step-2 ([:deps step-1 :tmp-dirs stage-step-2]
            (stage-listing-content options stage-step-1 stage-step-2))
    step-3 ([:deps step-1 :tmp-dirs stage-step-3]
            (stage-public-id-content options stage-step-3))
    step-4 ([:deps step-1 :tmp-dirs stage-step-4]
            (stage-normalized-content options stage-step-1 stage-step-4))
    step-5 ([:deps :all]
            (aggregate-and-sort options stage-step-2 stage-step-3 stage-step-4))))
This is launched as an EMR job via lemur, and sometimes there is an error in submitting some of the steps. In that case I want to re-fire the same workflow on the same EMR cluster, because the checkpoint information resides on that cluster's HDFS.
As you suggested, I will look into catching the exception in the application code and re-running it.
Last night I was working on some retry code, adding it to the lemur job conf as below.
(defn fire-n-wait-status [i]
  (prn "Retrying the job, attempt" (inc i))
  (fire! cluster job-flow)
  (let [retry-result (wait-on-step job-flow 43200)]
    (:success retry-result)))

;; Returns the first truthy :success value, or nil if all n attempts fail.
(defn retry [n]
  (some fire-n-wait-status (range n)))

(let [result (wait-on-step job-flow 43200)
      success (:success result)]
  (if (true? success)
    (prn (format "job status: %s" result))
    (let [retry-result (retry 1)]
      (prn "The result of retrying" 1 "times:" retry-result))))
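For the "catch the exception and re-run" idea, here is a rough sketch of the retry loop on its own, in plain Clojure with no lemur calls. The names retry-until-success and attempt-fn are made up for illustration; in the job conf, attempt-fn would presumably wrap the fire-and-wait logic above. It treats a thrown exception the same as a failed attempt, which may or may not be what you want.

```clojure
(defn retry-until-success
  "Calls (attempt-fn i) for i = 0 .. max-attempts-1 and stops at the first
  truthy result. An exception thrown by attempt-fn counts as a failed
  attempt. Returns the truthy result, or nil if every attempt fails."
  [max-attempts attempt-fn]
  (loop [i 0]
    (when (< i max-attempts)
      (or (try
            (attempt-fn i)
            (catch Exception e
              (println "Attempt" (inc i) "threw:" (.getMessage e))
              nil))
          ;; recur sits outside the try form, so it is in tail position
          (recur (inc i))))))

;; Hypothetical stand-in for "fire the step and wait on it":
;; throws on the first attempt, reports failure on the second,
;; and succeeds on the third.
(defn flaky-step [i]
  (case i
    0 (throw (ex-info "submit failed" {:attempt i}))
    1 false
    true))

(retry-until-success 5 flaky-step)
```

Because the loop stops at the first truthy result, later attempts are never fired once one succeeds, which matters when each attempt submits a real step to the cluster.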