Re: Can Azkaban fit with this workflow? ec2 server to emr?

Richard

May 21, 2013, 2:48:20 PM

to azkab...@googlegroups.com

At LinkedIn, we don't use EC2 for our Hadoop clusters, so I haven't yet tried to set up Azkaban on EC2.

However, the use case you gave is very similar to what we do at LinkedIn. We have a dedicated server running Azkaban, and users submit their project files, including Pig scripts, through various means (the web UI, curl, HTTP POST). They then schedule those scripts to run at a regular interval.
We have plugins for the different types of jobs that need to be run: Pig, Python, Ruby, Java Hadoop, Hive, and so on. If an execution type doesn't exist for the kind of job you want, it should be easy to build one and drop it into Azkaban.

On Tuesday, May 21, 2013 11:35:13 AM UTC-7, Jim wrote:

I've been reading the Azkaban docs and this part isn't entirely clear to me. We're in the AWS environment and we want to run Pig jobs on scheduled intervals. Once those Pig jobs run, they put the results into S3 via the store command. We then have a Python script that picks up the S3 results and dumps them into a database and/or emails out the results.

We have one EMR cluster up in interactive mode, so it's always running. We have one EC2 server up that we'll run Azkaban from.

What I'd be looking to do is:

Upload a job to the EC2 Azkaban server; that job contains the Pig job and the Python script. I want Azkaban to submit the Pig job to the EMR cluster, get notified when the job is complete, and then execute the Python script on the EC2 Azkaban server to process the results of the Pig job.

Is that the typical Azkaban workflow, and if so, are there more docs, articles, or discussions that might be useful in that space?

thanks!
Jim

Richard

May 21, 2013, 7:03:39 PM

to azkab...@googlegroups.com

Yes.
The way we set it up is that we use the Pig execution plugin, which is available in the azkaban-plugins package. We create a job file with type=pig and specify the Pig script to run. No other code is necessary.

We then upload the archive of these files to Azkaban from our desktop/laptop (or from other sources) and can get Azkaban to execute the job, either by manually triggering it, or on a given schedule. Azkaban's pig plugin just needs to know the location of the Hadoop cluster.
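
As a rough illustration of the setup described above, here is a minimal Python sketch that renders two job files and packs them into a zip archive ready for upload. Everything specific in it is an assumption rather than something confirmed in this thread: the file names (run_report.job, process_results.job, report.pig, process_results.py) are hypothetical, the pig.script key is how I understand the Pig plugin is pointed at a script, and running the Python step through a type=command job with a dependencies entry is just one plausible way to chain it after the Pig job. Check the property names against the docs for your Azkaban and azkaban-plugins versions.

```python
import zipfile


def job_file(props):
    """Render a dict as key=value lines, the layout used by .job files."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def build_project_files():
    """Return a mapping of archive member name -> file content.

    All names here are hypothetical placeholders for illustration.
    """
    return {
        # Pig step: type=pig comes from the thread; pig.script is assumed.
        "run_report.job": job_file({
            "type": "pig",
            "pig.script": "report.pig",
        }),
        # Follow-up step: assumed to run as a shell command on the Azkaban
        # host once the Pig job has finished.
        "process_results.job": job_file({
            "type": "command",
            "command": "python process_results.py",
            "dependencies": "run_report",
        }),
        # Placeholder script bodies; real projects would ship real scripts.
        "report.pig": "-- Pig script goes here\n",
        "process_results.py": "# result-processing script goes here\n",
    }


def write_archive(path, files):
    """Write the given files into a zip archive at `path`."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)


if __name__ == "__main__":
    write_archive("project.zip", build_project_files())
```

Running the script writes project.zip to the current directory; that archive is the kind of file you would then upload through the web UI or an HTTP POST.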

-Richard

On Tuesday, May 21, 2013 3:44:14 PM UTC-7, Jim wrote:

thanks Richard,
So, assuming you're running a Pig job: you submit your job file to the Azkaban server, which sees that one of the steps is a Pig job and then uses the Hadoop Job API to submit that job and its JAR file to the cluster, which is on a different set of nodes?

So:
local laptop -> Azkaban server -> Hadoop NameNode?

ald...@leal.eng.br

May 30, 2013, 4:01:57 AM

to azkab...@googlegroups.com

It's working just fine on EMR. In fact, I've just started an azkaban-maven-plugin to upload my projects directly into it from my CI/build environment.