run a sequence of MR jobs iteratively until convergence, on EMR

Jasper Chen

Nov 11, 2012, 1:46:05 PM
to mr...@googlegroups.com
Hi all,

I've been stuck for about a day now because I can't find an easy way to run iterative MR jobs on EMR.

I've already defined all the map-reduce jobs and written a Python script to control my program logic, including file operations and checking whether the computation has converged. But on EMR it seems we have to squeeze everything into one job flow to avoid the time-consuming initialization phase of a job flow, which usually takes several minutes. As far as I know, I can run "python myscript.py -r emr input output" to run a job on EMR once access is set up. That does work, but it starts a new job flow every time I invoke the command from the shell. Does anyone know how to pack everything into one job flow? Thanks!

Sudarshan Gaikaiwari

Nov 11, 2012, 2:23:50 PM
to mr...@googlegroups.com
Hi Jasper

Please see the information on reusing job flows here


--
Sudarshan Gaikaiwari
www.sudarshan.org
suda...@acm.org


Jimmy Retzlaff

Nov 11, 2012, 7:50:29 PM
to mr...@googlegroups.com
In addition to the link Sudarshan suggested, also see this:

This shows how to write a second Python script that runs locally and controls the running of each individual job on EMR. It can run the job repeatedly on the same job flow until you've reached convergence. It's not quite as nice as being able to do it completely on the EMR cluster, but it's probably the best we have for now.
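The driver loop Jimmy describes can be sketched in a few lines. This is a minimal sketch, not the exact script from the linked example: the job class name, the `j-XXXXXXX` job flow ID, the S3 paths, and the convergence test are all hypothetical placeholders, and the commented mrjob invocation assumes a persistent job flow already exists.

```python
# Sketch of a local driver that reruns an MR job until convergence.

def iterate_until_converged(run_one_pass, converged, max_passes=50):
    """Call run_one_pass(i) repeatedly until converged(i) is true.

    run_one_pass(i) would typically launch one MR job on the same
    job flow, e.g. with mrjob (placeholders, not real IDs/paths):

        job = MRIterativeJob(args=['-r', 'emr',
                                   '--emr-job-flow-id', 'j-XXXXXXX',
                                   's3://bucket/pass-%d/' % i])
        with job.make_runner() as runner:
            runner.run()
    """
    for i in range(max_passes):
        run_one_pass(i)          # one full MR pass
        if converged(i):
            return i + 1         # number of passes actually run
    raise RuntimeError("did not converge within %d passes" % max_passes)
```

The convergence callback would download whatever small summary the job wrote to S3 and inspect it, as described later in this thread.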

Jimmy

Jasper Chen

Nov 12, 2012, 7:18:06 PM
to mr...@googlegroups.com
Hi Jimmy, thanks for your reply.  I've sort of found a way to do it, but I have two questions:

1) First I create a job flow with the "--alive" flag using the CLI, and then in my Python script I run an MR job with the "-r emr --emr-job-flow-id XXXXX" args. It raises an exception saying "operation expects job flow to terminate, but it never does...".  Is this the right way to add a step?
2) Each job flow has a hard limit of 256 steps.  Can mrjob work around that, maybe with specific args?  I can't find any working examples online.

Do you have any idea about them?

Thanks,
Jasper

Jimmy Retzlaff

Nov 12, 2012, 7:38:19 PM
to mr...@googlegroups.com
On Mon, Nov 12, 2012 at 4:18 PM, Jasper Chen <xiub...@gmail.com> wrote:
1) First I create a job flow with the "--alive" flag using the CLI, and then in my Python script I run an MR job with the "-r emr --emr-job-flow-id XXXXX" args. It raises an exception saying "operation expects job flow to terminate, but it never does...".  Is this the right way to add a step?
 
I don't know what's happening here and unfortunately don't have the time to look into it right now - hopefully someone else can jump in.

2) Each job flow has a hard limit of 256 steps.  Can mrjob work around that, maybe with specific args?  I can't find any working examples online.

This is an AWS EMR limit. There are workarounds, but they can't be done through the EMR API (which is what mrjob uses):


Jimmy

Jasper Chen

Nov 12, 2012, 8:31:57 PM
to mr...@googlegroups.com
Thanks, that's really helpful!!

Jasper

mike bowles

Nov 30, 2012, 4:24:49 AM
to mr...@googlegroups.com
Sorry for the late response; I've been on the road.  The iteration I've done is along the lines Jimmy Retzlaff describes.

I run a Python program locally that runs the MR job in a loop.  The local program uses s3cmd to pull down a relatively small data set from S3 in order to check convergence, and then either stops or repeats.  The algorithms I'm running (k-means clustering, for example) have some relatively simple data structure that characterizes both the final answer and the progress of the iteration; that's stored on S3.  The MrJob mappers and reducers also use s3cmd to initialize at the beginning of each pass through the iteration.
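For k-means, the convergence check on that small pulled-down data set can be tiny: just compare the centroids from two successive passes. A minimal sketch, assuming centroids arrive as sequences of coordinate tuples and picking an arbitrary tolerance:

```python
import math

def centroids_moved(old, new, tol=1e-4):
    """True if any centroid shifted by more than tol between passes."""
    return any(
        math.dist(a, b) > tol   # Euclidean distance (Python 3.8+)
        for a, b in zip(old, new)
    )
```

The driver loop stops as soon as this returns False, i.e. once no centroid moved appreciably.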

This approach forces me to have two different versions of the MrJob mappers and reducers: a local development version and a version that runs on AWS.  The local one initializes from local files while the other initializes from S3.  All in all, it's not too bad, and there's a long list of algorithms that fit into this framework.
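One way to avoid maintaining two copies of the mappers and reducers is to branch only at initialization time, on the path itself. A sketch under assumptions: the s3cmd invocation runs as a subprocess, the function name is made up, and error handling is omitted.

```python
import os
import subprocess
import tempfile

def fetch_state(path):
    """Return a local filename for the iteration state, pulling it
    down with s3cmd first when it lives on S3."""
    if path.startswith("s3://"):
        local = os.path.join(tempfile.gettempdir(), os.path.basename(path))
        subprocess.check_call(["s3cmd", "get", "--force", path, local])
        return local
    return path  # development: already a local file
```

The mapper's init step then calls this with whichever path it was configured with, so the same mapper code runs both locally and on AWS.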

Mike