run a sequence of MR jobs iteratively until convergence, on EMR

Jasper Chen

Nov 11, 2012, 1:46:05 PM
to mr...@googlegroups.com
Hi all,

I've been stuck for about a day now because I can't find an easy way to run iterative MR jobs on EMR.

I've already defined all the map-reduce jobs and written a Python script to control my program logic, including file operations and checking whether the computation has converged. But on EMR it seems we have to squeeze everything into one job flow to avoid the time-consuming initialization phase of a job flow, which usually takes several minutes. As far as I know, I can run "python myscript.py -r emr input output" to run a job on EMR once access is set up. That does work, but it starts a new job flow every time I invoke the command from the shell. Does anyone know how to pack everything into one job flow? Thanks!

Sudarshan Gaikaiwari

Nov 11, 2012, 2:23:50 PM
to mr...@googlegroups.com
Hi Jasper

Please see the information on reusing job flows here


--
Sudarshan Gaikaiwari
www.sudarshan.org
suda...@acm.org


Jimmy Retzlaff

Nov 11, 2012, 7:50:29 PM
to mr...@googlegroups.com
In addition to the link Sudarshan suggested, also see this:

This shows how to write a second Python script that runs locally and controls the running of each individual job on EMR. It can run the job repeatedly on the same job flow until you've reached convergence. It's not quite as nice as being able to do it completely on the EMR cluster, but it's probably the best we have for now.
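The driver loop Jimmy describes can be sketched in a few lines. This is a minimal sketch, not the exact script from the linked example: the job class name, the `j-XXXXXXX` job flow ID, the S3 paths, and the convergence test are all hypothetical placeholders, and the commented mrjob invocation assumes a persistent job flow already exists.

```python
# Sketch of a local driver that reruns an MR job until convergence.

def iterate_until_converged(run_one_pass, converged, max_passes=50):
    """Call run_one_pass(i) repeatedly until converged(i) is true.

    run_one_pass(i) would typically launch one MR job on the same
    job flow, e.g. with mrjob (placeholders, not real IDs/paths):

        job = MRIterativeJob(args=['-r', 'emr',
                                   '--emr-job-flow-id', 'j-XXXXXXX',
                                   's3://bucket/pass-%d/' % i])
        with job.make_runner() as runner:
            runner.run()
    """
    for i in range(max_passes):
        run_one_pass(i)          # one full MR pass
        if converged(i):
            return i + 1         # number of passes actually run
    raise RuntimeError("did not converge within %d passes" % max_passes)
```

The convergence callback would download whatever small summary the job wrote to S3 and inspect it, as described later in this thread.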

Jimmy

Jasper Chen

Nov 12, 2012, 7:18:06 PM
to mr...@googlegroups.com
Hi Jimmy, thanks for your reply.  I've sort of found a way to do it, but I have two questions:

1) First I create a job flow with the "--alive" flag using the CLI, and then in my Python script I run an MR job with the "-r emr --emr-job-flow-id XXXXX" args. It raises an exception saying "operation expects job flow to terminate, but it never does...".  Is this the right way to add a step?
2) Each job flow has a hard limit of 256 steps.  Can mrjob work around that, maybe with specific args?  I can't find any working examples online.

Do you have any idea about them?

Thanks,
Jasper

Jimmy Retzlaff

Nov 12, 2012, 7:38:19 PM
to mr...@googlegroups.com
On Mon, Nov 12, 2012 at 4:18 PM, Jasper Chen <xiub...@gmail.com> wrote:
1) First I create a job flow with the "--alive" flag using the CLI, and then in my Python script I run an MR job with the "-r emr --emr-job-flow-id XXXXX" args. It raises an exception saying "operation expects job flow to terminate, but it never does...".  Is this the right way to add a step?
 
I don't know what's happening here and unfortunately don't have the time to look into it right now - hopefully someone else can jump in.

2) Each job flow has a hard limit of 256 steps.  Can mrjob work around that, maybe with specific args?  I can't find any working examples online.

This is an AWS EMR limit. There are workarounds, but they can't be done through the EMR API (which is what mrjob uses):


Jimmy

Jasper Chen

Nov 12, 2012, 8:31:57 PM
to mr...@googlegroups.com
Thanks, that's really helpful!!

Jasper

mike bowles

Nov 30, 2012, 4:24:49 AM
to mr...@googlegroups.com
Sorry for the late response; I've been on the road.  The iteration I've done is along the lines Jimmy Retzlaff describes.

I run a Python program locally that runs the MR job in a loop.  The local program uses s3cmd to pull down a relatively small data set from S3 in order to check convergence, and then either stops or repeats.  The algorithms I'm running (k-means clustering, for example) have some relatively simple data structure that characterizes both the final answer and the progress of the iteration; that's stored on S3.  The MrJob mappers and reducers also use s3cmd to initialize at the beginning of each pass through the iteration.
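For k-means, the convergence check on that small pulled-down data set can be tiny: just compare the centroids from two successive passes. A minimal sketch, assuming centroids arrive as sequences of coordinate tuples and picking an arbitrary tolerance:

```python
import math

def centroids_moved(old, new, tol=1e-4):
    """True if any centroid shifted by more than tol between passes."""
    return any(
        math.dist(a, b) > tol   # Euclidean distance (Python 3.8+)
        for a, b in zip(old, new)
    )
```

The driver loop stops as soon as this returns False, i.e. once no centroid moved appreciably.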

This approach forces me to have two different versions of the MrJob mappers and reducers: a local development version and a version that runs on AWS.  The local one initializes from local files while the other initializes from S3.  All in all, it's not too bad, and there's a long list of algorithms that fit into this framework.
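One way to avoid maintaining two copies of the mappers and reducers is to branch only at initialization time, on the path itself. A sketch under assumptions: the s3cmd invocation runs as a subprocess, the function name is made up, and error handling is omitted.

```python
import os
import subprocess
import tempfile

def fetch_state(path):
    """Return a local filename for the iteration state, pulling it
    down with s3cmd first when it lives on S3."""
    if path.startswith("s3://"):
        local = os.path.join(tempfile.gettempdir(), os.path.basename(path))
        subprocess.check_call(["s3cmd", "get", "--force", path, local])
        return local
    return path  # development: already a local file
```

The mapper's init step then calls this with whichever path it was configured with, so the same mapper code runs both locally and on AWS.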

Mike