running mrjob from a script


Cpt Caveman

Apr 22, 2011, 10:00:35 PM
to mrjob
I was wondering how to run an mrjob from a Python script. If I were
to do something like the example in the mrjob documentation (below),
how would I specify that standard input should be directed to this
runner?

if __name__ == '__main__':
    mr_job = MRWordCounter()
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            key, value = mr_job.parse_output_line(line)

Thanks for the help.

Dave Marin

Apr 25, 2011, 1:49:17 PM
to mr...@googlegroups.com
Oh, the example below will automatically read from stdin;
make_runner() automatically passes stdin to the runner.

By the way, normally you'd pass some command-line arguments to your
job's constructor to feed in input files and configuration. In this
particular example, the job will ALWAYS read from stdin, and will
always run in local mode.

You can also initialize your job with args=sys.argv[1:] to just read
the standard options from the command line.
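For example, a minimal driver script along those lines might look like
this sketch (MRWordCounter and its module name are placeholders for
your own MRJob subclass, not a real package):

```python
import sys


def main(argv):
    # Hypothetical job class -- substitute your own MRJob subclass
    # and module name here.
    from mr_word_counter import MRWordCounter

    # args=argv lets mrjob parse the standard command-line options
    # (input paths, -r/--runner, etc.) exactly as if the job had
    # been launched directly from the shell.
    mr_job = MRWordCounter(args=argv)
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            key, value = mr_job.parse_output_line(line)
            print(key, value)
```

Your entry point would then call main(sys.argv[1:]), so something like
`python driver.py input.txt -r local` (filenames made up) hands the
usual mrjob options straight through to the job.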

I apologize for not having better examples of running an MRJob from a
separate script. At Yelp we have a (crufty, old) framework for running
batch jobs that has its own command-line option parsing. A typical
batch job flow is something like:

- pick which log files to read based on a date range specified on
the command line
- run an MRJob on these log files, passing through relevant options
from the command line
- write the output of the MRJob to a database

Hope this helps!

-Dave

--

Yelp is looking to hire great engineers! See http://www.yelp.com/careers.

Cpt Caveman

Apr 25, 2011, 2:18:31 PM
to mrjob
Yes, that helps.
I will use Python and the subprocess module to orchestrate multiple
jobs and their dependencies.


Dave Marin

Apr 25, 2011, 2:24:30 PM
to mr...@googlegroups.com
Cool, good luck!

-Dave

vs

Jun 5, 2011, 10:47:18 AM
to mrjob
Hi,

I am stuck with a similar problem here. I am trying to port an
existing application to use MapReduce. I thought of using
mrjob since it's simpler than building my own framework to define map
and reduce jobs.

The first script does some initial processing on the stdin file, and I
want its output fed to my mapper function. Can someone here please
tell me how I could alter the runner to read this output instead of the
original stdin file?

Thanks
VS

Cpt Caveman

Jun 5, 2011, 1:25:02 PM
to mr...@googlegroups.com
VS,
You can run mrjob jobs from within Python. It takes only a few lines of code, which I have wrapped in a function called runJob(). The code is on GitHub here: https://github.com/pbharrin/maureen/blob/master/maureen/runJob.py

You can use this to chain together multiple MapReduce jobs and mix and match inputs. Here is an example:

from runJob import runJob
from InitProcessing import InitProcessing
from OtherJob import OtherJob

runJob(InitProcessing, ['/temp/inputData.txt', '--output-dir=/temp/InitProcessingResults'], 'emr')
runJob(OtherJob, ['/temp/InitProcessingResults/part*', '/temp/2ndInputDir/stuff.txt', '--output-dir=/temp/Final'], 'emr')

In this example there are two MapReduce jobs: InitProcessing and OtherJob. InitProcessing is run on inputData.txt. In the second step, OtherJob is run with the output from InitProcessing plus new input from 2ndInputDir/stuff.txt.

Just understand the big picture of what this is doing when you run it on EMR: for each job it uploads the files, launches the EC2 instances, bootstraps them, and configures the cluster. You can reduce some of the upload time by using S3 for that temp directory.

An alternative would be to write everything in one big mrjob class and pass around unneeded data until it is needed. I found this to be worse than the overhead of launching multiple jobs. Using the method shown above, you can also choose to launch jobs on EMR, locally, or on your own Hadoop cluster. This was helpful for me because my initial jobs in the chain needed multiple machines, but later the data would get reduced to a small amount that could be handled on one machine. If a job can be handled on one machine, it's more efficient to NOT use MapReduce.

vs

Jun 5, 2011, 3:29:47 PM
to mrjob

Hi,

Thanks a ton for your reply. This would definitely help solve one
of my problems. I will use this code and run some experiments now.

However, I was wondering: what if the input to the MR job is not
actually a flat file? If the first Python file is actually yielding a
record (say, a list or a dictionary), how do I pass it as an argument
to the second mapper?

Is the use of flat files absolutely necessary for using mrjob?

Thanks
VS

Cpt Caveman

Jun 5, 2011, 5:19:55 PM
to mr...@googlegroups.com
Hey,
Everything going into and out of mrjob is serialized; it's serialized between MapReduce steps as well. This serialization depends on the protocol you set. See the documentation here:
So if your first Python file is not an mrjob class, then you need to make sure the records are serialized so that the mrjob class can properly understand the inputs.

The two formats I'm familiar with (JSON and pickle) both provide a dumps() function to serialize things.
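As a small sketch of that idea, assuming the downstream job expects one
JSON value per input line (the record fields here are made up), the
preprocessing script could serialize each record onto its own line:

```python
import json

# Made-up records yielded by the preprocessing script.
records = [
    {'user': 'alice', 'clicks': 3},
    {'user': 'bob', 'clicks': 7},
]

# One JSON document per line: each line becomes one input record for
# the mapper, which can json.loads() it back into a dict.
serialized_input = '\n'.join(json.dumps(r) for r in records)

# Round-trip check: the first line decodes back to the first record.
first = json.loads(serialized_input.splitlines()[0])
```

You'd then write serialized_input to whatever file or stream the mrjob
job reads as its input.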


