Azkaban Errors


Michael Chang

Mar 6, 2012, 2:56:17 PM
to azkab...@googlegroups.com
Hi guys,

We've just implemented Azkaban here, and we've gotten it working pretty well for our needs, but we've seen a couple of strange errors that we were hoping you guys could help with.

  1. Job workflows sometimes just stall with an error in the main log such as: 2012/03/01 00:10:13.719 +0000 ERROR [ExecutionModelImpl] [Job-Control] No workflow model exists with id: 38.  This doesn't seem to happen consistently.  The job id 38 is the currently executing workflow.
  2. Similarly, we'll see messages such as: 2012/02/29 11:52:54.619 +0000 ERROR [IndexServlet] [Job-Control] Error cancelling job java.lang.IllegalArgumentException: '38' is not currently running. when trying to cancel jobs.
  3. We're having a little trouble getting job permits to work.  I have the following job folder hierarchy under my Azkaban jobs folder.  Our use case is simply limiting the number of simultaneous jobs being launched (so each job file has job.permits=1).

jobs/nightly/base.properties
jobs/nightly/base/job1.job
jobs/nightly/base/job2.job

Where base.properties contains the line "total.job.permits=6" and job1.job and job2.job contain the line "job.permits=1". 

Thanks!

Richard Park

Mar 6, 2012, 3:04:02 PM
to azkab...@googlegroups.com
We've hit this problem before, and the common symptom seemed to be that Azkaban 'ran out of memory'. This ended up being a red herring because the real cause was that our machines had the process ulimit set really low. On our Linux machines, with this low limit, we'd constantly use up all the threads, and the default error in that case is an OOM error.
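
One way to check for the low process limit described above is to read it directly. The sketch below is an illustration under assumptions, not something posted in the thread: it assumes a Linux host with Python available and that it is run as the same user that runs Azkaban. The helper names `nproc_limits` and `describe` are made up for this example. On Linux, threads count against this per-user limit, and when the JVM cannot start another thread it typically reports `java.lang.OutOfMemoryError: unable to create new native thread`, which would explain an OOM error while heap is still available.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) per-user limit on processes/threads."""
    # RLIMIT_NPROC is the limit that `ulimit -u` reports in a shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard


def describe(limit):
    """Render a limit value, mapping the 'no limit' sentinel to text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"max user processes: soft={describe(soft)} hard={describe(hard)}")
```

The soft value should match what `ulimit -u` prints for the same user. If it is small compared with the number of threads Azkaban and its jobs create, raising it is a reasonable first thing to try.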

There's another issue which I'm looking into that has to do with the scheduler, though I haven't looked into it too deeply yet.

Michael Chang

Mar 6, 2012, 3:57:16 PM
to azkab...@googlegroups.com
Sorry, so both #1 and #2 have been known to be caused by the OOM error?

Richard Park

Mar 6, 2012, 4:38:01 PM
to azkab...@googlegroups.com
Sorry, didn't read your whole email.
We had issues with stuck jobs that were giving us OOM errors. However, this was due to running out of threads, not running out of memory. This freezing also caused #2, which is weird.
This whole piece of code needs to be re-examined and probably refactored.

Michael Chang

Mar 7, 2012, 1:10:00 PM
to azkab...@googlegroups.com
Cool, thanks.  Do you recommend any temporary measures in the meantime?

Also, do you think our configuration for the permits seems reasonable in terms of trying to prevent jobs from running simultaneously?

Thanks,
Michael

Michael Chang

May 17, 2012, 1:47:04 PM
to azkab...@googlegroups.com
Revisiting this...Are there any known bugs with using permits?  Still can't get them to work correctly.

Michael

Richard Park

May 23, 2012, 9:54:56 AM
to azkab...@googlegroups.com
There are no known bugs, but I'm not sure permits are a widely used feature, so they may be broken.

Are you using it to throttle your jobs?

Michael Chang

May 23, 2012, 1:21:39 PM
to azkab...@googlegroups.com
Yeah, I'm trying to use it to throttle jobs.  Is the example I used above (in the first message of this thread) supposed to work?

Thanks,
Michael

xav...@squareup.com

Jul 18, 2012, 6:17:54 PM
to azkab...@googlegroups.com
I'm trying to get permits working right now and am also failing. It's not clear from the documentation what is supposed to go where.

Can anyone post a working example?

Xav

xav...@squareup.com

Jul 18, 2012, 6:32:50 PM
to azkab...@googlegroups.com


On Wednesday, July 18, 2012 3:22:25 PM UTC-7, Michael Chang wrote:
> I actually had to put my permits one level up.  In my example, the base.properties file with the total.job.permits property needs to actually be in the top level jobs directory (e.g., in /jobs, not in /jobs/nightly/)

Got it. The other thing that was confusing me is that the jobs show up in the "currently running" tab, even when they are blocked waiting for a permit. I guess that makes sense now that I understand it. I can see it is working from the log though.

Xav
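
The behaviour described above (a shared pool of permits, each job taking one, and blocked jobs still counting as started) can be modelled with a counting semaphore. This is only a toy model under assumptions, not Azkaban's actual implementation: `run_jobs` is a hypothetical helper, `total_permits` stands in for total.job.permits, and each toy job takes exactly one permit, like job.permits=1.

```python
import threading
import time


def run_jobs(job_names, total_permits):
    """Run toy jobs that each need one permit; return the peak number executing at once."""
    permits = threading.Semaphore(total_permits)  # stands in for total.job.permits
    lock = threading.Lock()
    state = {"executing": 0, "peak": 0}

    def job(name):
        # The thread has already started at this point, so anything that lists
        # started jobs would show it as running even while it blocks below.
        with permits:  # each job takes one permit, like job.permits=1
            with lock:
                state["executing"] += 1
                state["peak"] = max(state["peak"], state["executing"])
            time.sleep(0.05)  # placeholder for the real work
            with lock:
                state["executing"] -= 1

    threads = [threading.Thread(target=job, args=(n,)) for n in job_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    peak = run_jobs(["job1", "job2", "job3", "job4"], total_permits=2)
    print("peak concurrent jobs:", peak)
```

In this model all four jobs are started immediately, but no more than two ever do work at the same time, which is consistent with jobs appearing under "currently running" while they wait for a permit.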