At LinkedIn, we don't use EC2 for our Hadoop clusters, so I haven't yet tried to set up Azkaban on EC2.
However, the use case you describe is very similar to what we do at LinkedIn. We have a dedicated server running Azkaban, and users submit their project files, including Pig scripts, through various means (the web UI, curl, HTTP POST). They then schedule those scripts to run at a regular interval.
We have various plugins available to accommodate the different types of jobs that need to be run: Pig, Python, Ruby, Java Hadoop, Hive, and so on. If no execution type exists for the kind of job you want, it should be easy to build one and drop it into Azkaban.
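As a rough sketch of how the two-step flow from your question could be packaged for upload: the Python below writes two `.job` files, with the second depending on the first, into a project zip. It assumes the generic `command` job type rather than a dedicated Pig plugin. The job names (`run_pig`, `process_results`), the script names, and the commands themselves are hypothetical placeholders, and how the Pig command actually reaches your EMR cluster is not addressed here.

```python
import io
import zipfile


def job_file(props):
    """Render a dict as .job text (one key=value line per property)."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def build_project_zip(pig_command, python_command):
    """Return the bytes of a zip holding two jobs; the second waits on the first."""
    jobs = {
        # Hypothetical job names; the file name minus ".job" is the job name.
        "run_pig.job": {"type": "command", "command": pig_command},
        "process_results.job": {
            "type": "command",
            "command": python_command,
            # Runs only after run_pig has completed successfully.
            "dependencies": "run_pig",
        },
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, props in jobs.items():
            archive.writestr(name, job_file(props))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder commands; substitute whatever launches your Pig job and script.
    data = build_project_zip("pig -f report.pig", "python process_results.py")
    with open("project.zip", "wb") as out:
        out.write(data)
```

The resulting `project.zip` would then be uploaded through the web UI or one of the other submission routes, and the flow scheduled from there.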
On Tuesday, May 21, 2013 11:35:13 AM UTC-7, Jim wrote:
I've been reading the Azkaban docs and this part isn't entirely clear to me. We're in the AWS environment and we want to run Pig jobs at scheduled intervals. Once those Pig jobs run, they put the results into S3 via the store command. We then have a Python script that picks up the S3 results and dumps them into a database and/or emails out the results.
We have 1 EMR cluster up in interactive mode, so it's always running. We have 1 EC2 server up that we'll run Azkaban from.
What I'd be looking to do is:
Upload a job to the EC2 Azkaban server. That job contains the Pig job and the Python script. I want Azkaban to submit the Pig job to the EMR cluster, get notified when the job is complete, and then execute the Python script on the EC2 Azkaban server to process the results of the Pig job.
Is that the typical Azkaban workflow, and if so, are there more docs, articles, or discussions that might be useful in that space?
Thanks!
Jim