IPython notebook and mrjob

ms180276

Oct 24, 2013, 2:33:48 PM

to mr...@googlegroups.com

Is it possible to execute mrjob mapper/reducer code inside an IPython notebook?

Are there any examples that show how?

Steve Johnson

Oct 24, 2013, 2:37:45 PM

to mr...@googlegroups.com

This should work. You probably want to use the inline runner.

http://mrjob.readthedocs.org/en/latest/guides/runners.html#running-your-job-programmatically


ms180276

Oct 24, 2013, 3:59:20 PM

to mr...@googlegroups.com

Thanks for your response.

I have installed mrjob alongside IPython notebook.

When I try to run the code below, I get an error.

Code:

import mrjob
with mrjob.make_runner() as runner:
    runner.run()

Error:

AttributeError                            Traceback (most recent call last)
<ipython-input-32-b930bfc1ea9b> in <module>()
      1 import mrjob
----> 2 with mrjob.make_runner() as runner:
      3     runner.run()

AttributeError: 'module' object has no attribute 'make_runner'

Steve Johnson

Oct 24, 2013, 4:02:03 PM

to mr...@googlegroups.com

Re-read the documentation. make_runner() is a method of MRJob, not a module-level function; you call it on an instance of your job class.

ms180276

Oct 24, 2013, 4:20:47 PM

to mr...@googlegroups.com

Is there any sample code that I can actually run? I am new to this.


Taro Sato

Oct 25, 2013, 1:35:01 PM

to mr...@googlegroups.com

There is some example code here:

https://github.com/Yelp/mrjob/tree/master/mrjob/examples

but none of it actually provides the prototype you are looking for. Basically, you can launch an mrjob programmatically by doing something like this (using mr_wc.py as a template):

#### CODE BEGINS ####

from mrjob.job import MRJob



class MRWordCountUtility(MRJob):

    def __init__(self, *args, **kwargs):
        super(MRWordCountUtility, self).__init__(*args, **kwargs)
        self.chars = 0
        self.words = 0
        self.lines = 0

    def mapper(self, _, line):
        # Don't actually yield anything for each line. Instead, collect them
        # and yield the sums when all lines have been processed. The results
        # will be collected by the reducer.
        self.chars += len(line) + 1  # +1 for newline
        self.words += sum(1 for word in line.split() if word.strip())
        self.lines += 1

    def mapper_final(self):
        yield('chars', self.chars)
        yield('words', self.words)
        yield('lines', self.lines)

    def reducer(self, key, values):
        yield(key, sum(values))


if __name__ == '__main__':
    job = MRWordCountUtility(args=['-r', 'emr'])
    with job.make_runner() as runner:
        runner.run()

#### CODE ENDS ####

Note the lines after "if __name__ == '__main__':". If this doesn't make sense to you, you probably should read up on the basics of object-oriented programming in Python before you simply copy & paste stuff just to find things that don't magically work...
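To make the mapper / mapper_final / reducer flow in the job above concrete, here is a plain-Python sketch that mimics it without mrjob or Hadoop. The helper names (map_split, reduce_pairs) and the sample input are made up for illustration; the real framework also handles serialization, sorting, and distribution, which this ignores.

```python
from collections import defaultdict


def map_split(lines):
    """Mimic one mapper task: accumulate per-split totals, then emit them
    once at the end (the role mapper_final plays in the job above)."""
    chars = words = count = 0
    for line in lines:
        chars += len(line) + 1  # +1 for the newline stripped from each line
        words += sum(1 for word in line.split() if word.strip())
        count += 1
    yield 'chars', chars
    yield 'words', words
    yield 'lines', count


def reduce_pairs(pairs):
    """Mimic shuffle + reduce: group values by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


# Two hypothetical input splits, as if handled by two separate mapper tasks.
split_a = ['hello world', 'foo']
split_b = ['a b c']

pairs = list(map_split(split_a)) + list(map_split(split_b))
totals = reduce_pairs(pairs)
print(totals)  # {'chars': 22, 'words': 6, 'lines': 3}
```

Each mapper task emits only three pairs no matter how many lines it reads, which is why the job keeps running totals on self instead of yielding once per line.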

Steve Johnson

Oct 25, 2013, 1:48:41 PM

to mr...@googlegroups.com

Taro, you cannot invoke a mrjob that way. I'll try to explain later.


Ewen Cheslack-Postava

Oct 25, 2013, 1:54:22 PM

to mr...@googlegroups.com

The snippet of code for running the job is fine, it just can't be in
main because mrjob needs to be able to ship the py file containing the
job to EMR and run it. You must have

if __name__ == '__main__':
    MRWordCountUtility.run()

for it to run on other servers properly (or via the local runner since
that spawns a new process).

However, if you use the word count py file as is but spawn it using
that snippet of code, e.g. paste that into your ipython session, that
should work. That's exactly how I programmatically launch a bunch of
mrjobs from the same python driver script.

-Ewen


Taro Sato

Oct 25, 2013, 10:33:53 PM

to mr...@googlegroups.com, m...@ewencp.org

Yeah, sorry, I may have only added to the confusion, since the original question was specifically about running a job in IPython. I was only trying to provide an example of the make_runner method and hadn't thought deeply about the implications of using the "-r emr" switch.

Cheers,
Taro

pl h

Apr 1, 2014, 1:05:07 AM

to mr...@googlegroups.com, m...@ewencp.org

Any news on this problem so far?

It looks like mrjob relies on an existing script file. So do we need to create the file in IPython Notebook and then launch it programmatically?

It would be cool if mrjob could directly take the job class, serialize it, and send it to the Hadoop workers. Then the integration with IPython Notebook would be seamless.

pl h

Apr 1, 2014, 9:08:33 AM

to mr...@googlegroups.com, m...@ewencp.org

I gave a tutorial tonight on IPython Notebook and Hadoop bindings. Here is how I run mrjob in ipynb. Basically, I use the "%%file" magic to write the script file and then use the programmatic execution documented in mrjob.

http://nbviewer.ipython.org/urls/course.ie.cuhk.edu.hk/~engg4030/tutorial/tutorial11/Python-Hadoop.ipynb#mrjob

This does not look ideal, but it at least retains the rationale for using ipynb: write and run everything in just one notebook.

Still looking forward to object serialization so that the integration could be more beautiful, i.e. so you don't need to generate intermediate script files.
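For readers who want the shape of that workflow without opening the notebook, here is a rough sketch, not the tutorial's actual code. It writes a job script to disk (in a notebook, a "%%file mr_line_count.py" cell would do this part), imports it, and runs it with the inline runner. The job MRLineCount, the file names, and the helper run_inline are invented for illustration. The output handling assumes a recent mrjob (0.6 or later) where parse_output() and cat_output() exist; older releases used stream_output() and parse_output_line() instead. If mrjob cannot be imported, the sketch only writes the files.

```python
import importlib.util
import os
import sys
import tempfile

# Source of a hypothetical job. The __main__ block calls run() so that
# non-inline runners could also execute this file as a script.
JOB_SOURCE = '''\
from mrjob.job import MRJob


class MRLineCount(MRJob):

    def mapper(self, _, line):
        yield 'lines', 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRLineCount.run()
'''

workdir = tempfile.mkdtemp()
job_path = os.path.join(workdir, 'mr_line_count.py')
input_path = os.path.join(workdir, 'input.txt')

with open(job_path, 'w') as f:
    f.write(JOB_SOURCE)
with open(input_path, 'w') as f:
    f.write('one\ntwo\nthree\n')


def run_inline(job_path, input_path):
    """Import the job class from its file and run it in this process."""
    spec = importlib.util.spec_from_file_location('mr_line_count', job_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules['mr_line_count'] = module
    spec.loader.exec_module(module)

    job = module.MRLineCount(args=['-r', 'inline', input_path])
    with job.make_runner() as runner:
        runner.run()
        # Read the results before the runner cleans up its temp files.
        return dict(job.parse_output(runner.cat_output()))


try:
    from mrjob.job import MRJob  # only checks that mrjob is importable
except ImportError:
    results = None  # mrjob not available; the files were still written
else:
    results = run_inline(job_path, input_path)

print(results)
```

The job class lives in a real file rather than in a notebook cell because, as discussed in this thread, mrjob needs a script it can locate and, for non-inline runners, ship and execute elsewhere.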

Steve Johnson

Apr 1, 2014, 1:43:15 PM

to mr...@googlegroups.com, m...@ewencp.org

I'm pretty sure that's the only approach that will work. There needs to be an executable script for Hadoop to run, and you can't just generate that from the class (afaik).

I suppose it's possible to turn the AST back into code (which you can do, I think?), but then you'd need to prepend the imports (which you would get...how?) and append the run() block. And even if you got it to work, it would be, like, the biggest, ugliest hack ever.

I agree that it would be pretty cool to run this in IPython, but inline mode is probably the best you're going to get on that front.
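As a rough illustration of the AST idea and of where it gets awkward, here is a standard-library-only sketch. It assumes Python 3.9 or later (for ast.unparse) and assumes you still have the full source text of the notebook cell, which is a stronger starting point than having only the class object. The cell contents and the job name MRWordFreq are invented, and nothing here is something mrjob does itself.

```python
import ast

# Hypothetical notebook cell containing imports plus a job class.
NOTEBOOK_CELL = '''\
import re
from mrjob.job import MRJob

class MRWordFreq(MRJob):
    def mapper(self, _, line):
        for word in re.findall('[a-z]+', line.lower()):
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)
'''

tree = ast.parse(NOTEBOOK_CELL)

# Turning the class node back into code works, but the result has lost
# the imports it depends on.
class_node = next(n for n in tree.body if isinstance(n, ast.ClassDef))
class_only = ast.unparse(class_node)

# The imports can be recovered here only because the whole cell source
# is available; from the class object alone they would be unknown.
imports = [
    ast.unparse(n)
    for n in tree.body
    if isinstance(n, (ast.Import, ast.ImportFrom))
]

# Stitch together a standalone script: imports, class, then the run() block.
script = (
    '\n'.join(imports)
    + '\n\n\n'
    + class_only
    + "\n\n\nif __name__ == '__main__':\n    "
    + class_node.name
    + '.run()\n'
)
print(script)
```

Even in this small case the generated script only covers imports that appear in the same cell; helpers or imports defined in earlier cells would be missing, which is the kind of fragility described above.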
