Is Azkaban good for non-hadoop jobs (bash/ruby)?

Huy Nguyen

May 21, 2013, 11:23:55 PM
to azkab...@googlegroups.com
We have an in-house analytics pipeline that we're building with just PostgreSQL and Ruby (our transforms happen at the Postgres level). We don't need to farm out Hadoop jobs; we just need job-dependency support for basic bash scripts, like:

05 8 * * * thor db:dump --from live --to data && thor db:transform users && thor db:aggregate user_metrics

Would Azkaban be a suitable solution for this?

Thank you,
Huy

Richard Park

May 22, 2013, 12:28:19 AM
to azkab...@googlegroups.com
Indeed, you don't need to run Hadoop jobs. Azkaban in its base form just runs command-line jobs.
In fact, we run non-Hadoop jobs often. You can do this through the command job type, and there are built-in ruby/python job types. Just realized we didn't document it. We'll get on that.
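
As a rough sketch of what the command job type looks like for the cron chain in the question: each step becomes its own .job file, and the `dependencies` line replaces the `&&`. The job names and the temporary project directory below are made up for illustration; only the `thor` commands come from the original post.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical project directory (override with PROJECT_DIR).
project_dir="${PROJECT_DIR:-$(mktemp -d)}"

# First step: no dependencies, so it is a start node of the flow.
cat > "$project_dir/db_dump.job" <<'EOF'
type=command
command=thor db:dump --from live --to data
EOF

# Second step: runs only after db_dump has succeeded.
cat > "$project_dir/db_transform_users.job" <<'EOF'
type=command
command=thor db:transform users
dependencies=db_dump
EOF

# Third step: runs only after db_transform_users has succeeded.
cat > "$project_dir/db_aggregate_user_metrics.job" <<'EOF'
type=command
command=thor db:aggregate user_metrics
dependencies=db_transform_users
EOF

# Bundle the job files for upload; skipped if zip is not installed.
if command -v zip >/dev/null 2>&1; then
  (cd "$project_dir" && zip -q pipeline.zip ./*.job)
fi

echo "Job files written to $project_dir"
```

A job's name is its file name without the .job extension, which is what the `dependencies` lines refer to.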

However, the jobs will run as the user that Azkaban is running as. We are contemplating switching to different users before running jobs, but for now you can always run ssh to switch users.



--
You received this message because you are subscribed to the Google Groups "azkaban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to azkaban-dev...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Huy Nguyen

May 22, 2013, 12:50:20 AM
to azkab...@googlegroups.com
Thanks for your answer Richard. We successfully deployed a test environment and ran a few test commands. Questions:

- We find zipping the jobs and uploading the zip file troublesome (not as straightforward as using crontab). Is there a plan to support web-based adding/editing of jobs?

- If a job is running and we push a zip file that overwrites that job, does it break the running job? Does it cause any side effects?





Richard Park

May 22, 2013, 2:15:28 AM
to azkab...@googlegroups.com
Jobs can be edited in the UI, at least in 2.1, but there is no ability to add new jobs. However, it does seem like a good idea.
Also, there is a way to upload without using the UI, via curl or any HTTP POST: http://azkaban.github.io/azkaban2/documents/2.1/ajaxapi.html. It does require an authentication step. I plan on providing a shell script in the future, although if one is useful now, we can provide a preliminary version with limited testing.
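
A minimal sketch of such a script, assuming the login and upload requests work as described on the AJAX API page linked above (a login POST that returns a `session.id`, then a multipart upload POST to `/manager`). The server address, function names, and the `sed`-based JSON parsing are assumptions for illustration; check the parameter names against the documentation for your version.

```shell
#!/usr/bin/env bash

# Assumed server address; change to match your installation.
AZKABAN_URL="${AZKABAN_URL:-https://localhost:8443}"

# Pull the value of "session.id" out of the login response JSON on stdin.
extract_session_id() {
  sed -n 's/.*"session\.id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# usage: azkaban_login USER PASSWORD   (prints the session id)
# -k skips certificate verification, which is only acceptable for a
# test setup with a self-signed certificate.
azkaban_login() {
  curl -k -s -X POST \
    --data-urlencode "action=login" \
    --data-urlencode "username=$1" \
    --data-urlencode "password=$2" \
    "$AZKABAN_URL" | extract_session_id
}

# usage: azkaban_upload SESSION_ID PROJECT ZIPFILE
azkaban_upload() {
  curl -k -s -X POST \
    --form "session.id=$1" \
    --form "ajax=upload" \
    --form "project=$2" \
    --form "file=@$3;type=application/zip" \
    "$AZKABAN_URL/manager"
}
```

Typical use would be `sid=$(azkaban_login myuser mypassword)` followed by `azkaban_upload "$sid" MyProject pipeline.zip`, where the user, project, and file names are placeholders. The project has to exist on the server already.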

For the next version of Azkaban, we're also going to have the ability to pull job archives. We're looking at implementing an HTTP GET, and perhaps an svn/git solution tied to a schedule or HTTP trigger.

Uploading new archives is safe. Azkaban tracks versions of the uploaded project files, so no file is overwritten while an upload is occurring. Only on the next execution of the job will the newly uploaded files be taken into account. However, job edits through the UI will be reflected in a currently executing flow if the job hasn't started to run yet.

We are considering a manual hotswap feature that would allow users, under a disclaimer, to hotswap the execution files while a flow is still running. The disclaimer is that users must make sure any executing jobs properly handle this inconsistent state, since hotswapping is inherently dangerous.

-Richard

Huy Nguyen

May 22, 2013, 4:11:47 AM
to azkab...@googlegroups.com
Thanks again for your response Richard, we really like this and will give it a trial run. We have a few questions/suggestions:

We have created a trial flow for our pipeline: http://i.imgur.com/nSKddUw.png

1/ We have to add an empty 'end' job to merge the flows together, otherwise they show up as different flows. Is there a way to add a start node instead, one that starts the whole flow?

2/ I have this in my system.properties but it doesn't seem to schedule a daily run. What am I missing?

azkaban.flow.start.hour=8
azkaban.flow.start.minute=5

3/ Can we start jobs at the top level at different times? Starting all 4 copy jobs (see image) at once will overload our server.

4/ The manager view (manager?project=x&flow=y) and the executor view (executor?...) look the same but are different. It's also hard to remember how to navigate to the executor view (which we prefer, since it has more real-time information).

Assuming that it's unlikely for 2 executions of the same flow to happen at the same time, would it be better to merge the two views?

Richard Park

May 22, 2013, 1:19:10 PM
to azkab...@googlegroups.com
1. With dependency chains, there could be multiple start nodes. The idea is that dependencies are resolved before a job runs. We'll be adding a more robust feature that will allow you to define a flow in one file which may be a solution for you.

2. Schedules are only set manually. We found it problematic to have flows automatically schedule themselves, due to the sheer number of flows and the common practice of copy and paste. If this feature is desirable, it is fairly trivial to add, although we'd have a switch to turn it off.

Those azkaban.flow.* values are not properties that you pass to Azkaban. They are properties that Azkaban will pass to your job when it runs.
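
To illustrate the direction of travel, here is a hedged sketch of a job that reads those values rather than setting them. It assumes Azkaban's `${...}` property substitution applies to runtime properties on a job's command line, which has not been verified against 2.1. The job name and output directory are made up.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical output directory (override with JOB_DIR).
job_dir="${JOB_DIR:-$(mktemp -d)}"

# The quoted 'EOF' stops the shell from expanding ${...} itself, so the
# placeholders reach the .job file intact for Azkaban to fill in at run time.
cat > "$job_dir/show_start_time.job" <<'EOF'
type=command
command=echo flow started at hour ${azkaban.flow.start.hour} minute ${azkaban.flow.start.minute}
EOF

cat "$job_dir/show_start_time.job"
```

Putting the same two keys in a properties file does not create a schedule; the schedule itself still has to be set manually.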

3. There are a few ways to throttle. One is to chain your jobs. The second is to bake a simple sleep/wait into the command. If you have your own plugin execution type, the throttle can be baked into that. It would also be fairly trivial for us to add a native feature for delaying the execution of a job.
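
The first two options could be sketched as follows. The four copy job names and the `thor db:copy` command are hypothetical stand-ins for the jobs in the screenshot. The script relies on the command job type running `command` first and then `command.1` in order, which is worth confirming for your version.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical output directory and job names.
out_dir="${OUT_DIR:-$(mktemp -d)}"
copy_jobs="copy_users copy_videos copy_sessions copy_events"

# MODE=chain: each copy job depends on the previous one, so they run one by one.
# MODE=sleep: jobs stay independent but each waits 60 seconds longer than the last.
mode="${MODE:-chain}"

delay=0
previous=""
for job in $copy_jobs; do
  file="$out_dir/$job.job"
  echo "type=command" > "$file"
  if [ "$mode" = "sleep" ]; then
    echo "command=sleep $delay" >> "$file"
    echo "command.1=thor db:copy $job" >> "$file"
    delay=$((delay + 60))
  else
    echo "command=thor db:copy $job" >> "$file"
    if [ -n "$previous" ]; then
      echo "dependencies=$previous" >> "$file"
    fi
  fi
  previous="$job"
done

echo "Wrote $mode-style job files to $out_dir"
```

Chaining guarantees that only one copy runs at a time. The sleep variant only staggers the start times, so copies can still overlap if one takes longer than the gap.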

4. Navigation issues have come up before. We'll most likely get a UED expert to look at it. We do need a page to display the flow, and we need a page to display the execution. Unfortunately they look similar, and a lot of clicking around is necessary to get from one to the other.




Huy Nguyen

May 29, 2013, 1:53:42 AM
to azkab...@googlegroups.com
Thanks. This is cool. We've been using it for a week and are pretty happy with it. It serves our purpose well.

The executor node seems to fork a lot of sub-processes. Any thoughts?


Richard Park

May 29, 2013, 3:24:09 AM
to azkab...@googlegroups.com
That's the nature of Java making a system exec call. It's inherently safer for cleaning up leaky tasks to have each job in a different process rather than a thread. You can cap the number of executing jobs by changing the number of jobs a flow can execute, and you can cap the number of flows you can execute as well. I believe the config setting is documented.

If you think your jobs could be thread safe, you can create an executor job type that just spawns a thread instead of making a system exec call. We found this to be unsafe and not much of a benefit in resources.

There is a plan in the works to have multiple executor servers running simultaneously on different machines. Actually, the work to get this done should be fairly trivial, but it has been a lower priority for us since Hadoop jobs tend to use fewer resources on the client.

-Richard

Huy Nguyen

May 29, 2013, 3:28:48 AM
to azkab...@googlegroups.com
I see, I see. That's not much of a concern to us either, because it doesn't seem to consume that many resources.

Also, is there a way to freely arrange (drag and drop) nodes in the visual chart? Our flow graph just expands horizontally and takes up a lot of screen space.

Richard Park

May 29, 2013, 7:12:02 AM
to azkab...@googlegroups.com
The layout is all in javascript, so that's easily changeable. There is currently no persistence with the layout, so saving of custom positions would have to be added. 
We'd probably have to change the layout code a bit due to the way the svg path is rendered.

If that's desired, please add an issue on Azkaban's GitHub so we don't forget about it.

Thanks,
-Richard