Retry

Vikram Kadi

Jan 28, 2013, 4:17:41 AM
to lemur...@googlegroups.com
I have a quick question. We are using lemur to launch our cascalog jobs on EMR. One of the workflows uses checkpointed code that can re-run from the point of failure; the only condition is that it must be re-run on the same cluster.

Is there a provision to relaunch/retry a job flow step on failure, via a hook or any other means on the same cluster?

Marc Limotte

Jan 28, 2013, 10:48:30 AM
to lemur...@googlegroups.com
Hi Vikram,

I haven't tried this out, so I'm not sure exactly what you would need. Could you give a little more information? Are you using https://github.com/nathanmarz/cascalog-contrib/tree/master/cascalog.checkpoint for the checkpointing, or some other solution?

You would need some code to recognize failure on your side.  What would that look like?

Marc



Vikram Kadi

Jan 28, 2013, 12:48:00 PM
to lemur...@googlegroups.com
Yes Marc, we are using cascalog.checkpoint.

Our code looks like this:

(defn run-as-workflow [options]
  ;; The checkpoint directory is passed as a one-element vector.
  (workflow [(:staging options)]
            step-1 ([:tmp-dirs stage-step-1]
                    (stage-listings-with-content options stage-step-1))
            step-2 ([:deps step-1 :tmp-dirs stage-step-2]
                    (stage-listing-content options stage-step-1 stage-step-2))
            step-3 ([:deps step-1 :tmp-dirs stage-step-3]
                    (stage-public-id-content options stage-step-3))
            step-4 ([:deps step-1 :tmp-dirs stage-step-4]
                    (stage-normalized-content options stage-step-1 stage-step-4))
            ;; The final step runs only after all earlier steps have finished.
            step-5 ([:deps :all]
                    (aggregate-and-sort options stage-step-2 stage-step-3 stage-step-4))))

This is launched as an EMR job via lemur, and sometimes submitting some of the steps fails with an error. In that case I want to re-fire the same workflow on the same EMR cluster, since the checkpoint information resides on that cluster's HDFS.

As you suggested, I will look into catching the exception in the application code and re-running the workflow from there.

Last night I was working on some retry code by adding it to the lemur job conf, as below.

;; Re-fires the job flow, waits on the step, and returns its :success flag.
(defn fire-n-wait-status [attempt]
  (prn "Retrying the job, attempt" (inc attempt))
  (fire! cluster job-flow)
  (:success (wait-on-step job-flow 43200)))

;; Tries up to n times; stops at the first attempt whose :success is truthy.
(defn retry [n]
  (some fire-n-wait-status (range n)))

(let [result  (wait-on-step job-flow 43200)
      success (:success result)]
  (if success
    (prn (format "job status: %s" result))
    (prn "The result of retrying" 1 "times:" (retry 1))))

Marc Limotte

Jan 28, 2013, 1:16:00 PM
to lemur...@googlegroups.com
OK, I see better what you're trying to do now.

I'd first look to put the retry logic in your code, adjacent to the (workflow ...) form. If you want to do this externally (i.e. in the jobdef) instead, you will need the 'submit' functionality. That feature is still in testing; you can find it at https://github.com/mlimotte/lemur/pull/1. Federico Brubacher is helping test and fix some issues with it.

Marc
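
For reference, here is a minimal sketch of the first option (retry logic placed next to the (workflow ...) form). It is not taken from lemur or cascalog.checkpoint: with-retries is a hypothetical helper, and the sketch assumes that run-as-workflow throws an exception when a step fails and that calling it again with the same staging directory resumes from the checkpoint.

```clojure
;; Hypothetical helper, not part of lemur or cascalog.checkpoint.
(defn with-retries
  "Calls the zero-argument function f up to max-attempts times.
  Returns f's value from the first call that does not throw, and
  rethrows the last exception once the attempts are used up."
  [max-attempts f]
  (loop [attempt 1]
    ;; recur is not allowed inside try, so the catch returns a marker
    ;; value and the decision to loop is made outside the try form.
    (let [outcome (try
                    {:value (f)}
                    (catch Exception e
                      (if (< attempt max-attempts)
                        (do (println "Attempt" attempt "failed:" (.getMessage e))
                            ::retry)
                        (throw e))))]
      (if (= outcome ::retry)
        (recur (inc attempt))
        (:value outcome)))))

;; Assumed usage: run-as-workflow throws on a failed step.
;; (with-retries 3 #(run-as-workflow options))
```

Because the retry happens inside the same process on the same cluster, the checkpoint data on HDFS stays available to the next attempt. Whether a failed step actually surfaces as an exception would need to be checked against your setup.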


