IPython notebook and mrjob


ms180276

Oct 24, 2013, 2:33:48 PM
To: mr...@googlegroups.com
Is it possible to execute mrjob mapper/reducer code inside an IPython notebook?
Are there any examples?

Steve Johnson

Oct 24, 2013, 2:37:45 PM
To: mr...@googlegroups.com
This should work. You probably want to use the inline runner.
 
 
 
On Thu, Oct 24, 2013, at 11:33 AM, ms180276 wrote:
Is it possible to execute mrjob mapper/reducer code inside IPython notebook ?
Any examples to support.



ms180276

Oct 24, 2013, 3:59:20 PM
To: mr...@googlegroups.com
Thanks for your response.
I have installed mrjob for use with IPython notebook.
When I try to execute the code below, I get the following error:

Code :
import mrjob
with mrjob.make_runner() as runner:
    runner.run()

Error :
AttributeError                            Traceback (most recent call last)
<ipython-input-32-b930bfc1ea9b> in <module>()
      1 import mrjob
----> 2 with mrjob.make_runner() as runner:
      3     runner.run()

AttributeError: 'module' object has no attribute 'make_runner'

Steve Johnson

Oct 24, 2013, 4:02:03 PM
To: mr...@googlegroups.com
Re-read the documentation. make_runner() is a method of the MRJob class, not a module-level function.

ms180276

Oct 24, 2013, 4:20:47 PM
To: mr...@googlegroups.com
Is there any sample code I can look at that can actually be executed? I am new to this.



ms180276

Oct 24, 2013, 4:21:52 PM
To: mr...@googlegroups.com
Is there any sample code I can look at that can actually be executed? I am new to this.

Taro Sato

Oct 25, 2013, 1:35:01 PM
To: mr...@googlegroups.com
There is some example code:


but none of it actually provides the prototype you are looking for. Basically, you can launch an mrjob programmatically by doing something like this (using mr_wc.py as a template):

#### CODE BEGINS ####

from mrjob.job import MRJob


class MRWordCountUtility(MRJob):

    def __init__(self, *args, **kwargs):
        super(MRWordCountUtility, self).__init__(*args, **kwargs)
        self.chars = 0
        self.words = 0
        self.lines = 0

    def mapper(self, _, line):
        # Don't actually yield anything for each line. Instead, collect them
        # and yield the sums when all lines have been processed. The results
        # will be collected by the reducer.
        self.chars += len(line) + 1 # +1 for newline
        self.words += sum(1 for word in line.split() if word.strip())
        self.lines += 1

    def mapper_final(self):
        yield('chars', self.chars)
        yield('words', self.words)
        yield('lines', self.lines)

    def reducer(self, key, values):
        yield(key, sum(values))


if __name__ == '__main__':
    job = MRWordCountUtility(args=['-r', 'emr'])
    with job.make_runner() as runner:
        runner.run()


#### CODE ENDS ####

Note the lines after "if __name__ == '__main__':".  If this doesn't make sense to you, you probably should read up on the basics of object-oriented programming in Python before you simply copy & paste stuff just to find things that don't magically work...

Steve Johnson

Oct 25, 2013, 1:48:41 PM
To: mr...@googlegroups.com
Taro, you cannot invoke a mrjob that way. I'll try to explain later.

Ewen Cheslack-Postava

Oct 25, 2013, 1:54:22 PM
To: mr...@googlegroups.com
The snippet of code for running the job is fine; it just can't be in
the __main__ block, because mrjob needs to be able to ship the .py file
containing the job to EMR and run it. You must have

if __name__ == '__main__':
    MRWordCountUtility.run()

for it to run on other servers properly (or via the local runner, since
that spawns a new process).

However, if you use the word count py file as is but spawn it using
that snippet of code, e.g. paste that into your ipython session, that
should work. That's exactly how I programmatically launch a bunch of
mrjobs from the same python driver script.
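
As an illustration of the distinction above (not code from this thread), here is a minimal, stdlib-only sketch of why the guard matters. The file fake_job.py and its run() function are hypothetical stand-ins for a real job file ending in MRWordCountUtility.run(); no mrjob is involved.

```python
import importlib.util
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a job file. A real mrjob file would end with
# MRWordCountUtility.run() inside the same guard.
SCRIPT = '''\
def run():
    print("task executed")

if __name__ == '__main__':
    run()
'''

workdir = Path(tempfile.mkdtemp())
script_path = workdir / "fake_job.py"
script_path.write_text(SCRIPT)

# Importing the file (what a driver script or notebook does) skips the
# guarded block, because __name__ is "fake_job" rather than "__main__".
spec = importlib.util.spec_from_file_location("fake_job", script_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Spawning the file as its own process (what a runner does on a worker)
# executes the guarded block.
result = subprocess.run(
    [sys.executable, str(script_path)],
    capture_output=True, text=True, check=True,
)
spawned_output = result.stdout.strip()
print(spawned_output)
```

The same file therefore serves both roles: the driver imports the class from it, and the spawned process runs it as a script.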

-Ewen


Taro Sato

Oct 25, 2013, 10:33:53 PM
To: mr...@googlegroups.com, m...@ewencp.org
Yeah, sorry, I may have only added to the confusion, since the original question was specifically about running a job in IPython. I was only trying to provide an example of the make_runner method and hadn't thought carefully about the implications of using the "-r emr" switch.

Cheers,
Taro

pl h

Apr 1, 2014, 1:05:07 AM
To: mr...@googlegroups.com, m...@ewencp.org
Any news on this problem so far?

It looks like mrjob relies on an existing script file. So we need to create the file in IPython Notebook and then launch it programmatically?

It would be cool if mrjob could directly take the job class, serialize it, and send it to the Hadoop workers. Then the integration with IPython Notebook would be seamless.

pl h

Apr 1, 2014, 9:08:33 AM
To: mr...@googlegroups.com, m...@ewencp.org
I gave a tutorial tonight on IPython Notebook and Hadoop bindings. Here is how I run mrjob in ipynb. Basically, I use the "%%file" magic to write the script file and then use the programmatic execution documented in mrjob.
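
A minimal sketch of that workflow, not the notebook from the tutorial. It assumes mrjob 0.6 or later (where runner output is read with cat_output() and parse_output(); older releases used different calls). The file name mr_wc.py, the simplified job body, and the launch() helper are all illustrative. pathlib stands in for the %%file cell so the snippet also runs outside a notebook.

```python
import sys
import tempfile
from pathlib import Path

# Contents of the job file. In a notebook, a cell beginning with
# "%%file mr_wc.py" followed by this text would write the same file.
JOB_SOURCE = '''\
from mrjob.job import MRJob


class MRWordCountUtility(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line) + 1  # +1 for the newline
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == "__main__":
    MRWordCountUtility.run()
'''

workdir = Path(tempfile.mkdtemp())
job_path = workdir / "mr_wc.py"
job_path.write_text(JOB_SOURCE)
compile(JOB_SOURCE, str(job_path), "exec")  # syntax check only


def launch(input_path, runner="inline"):
    """Hypothetical helper: import the job from the file and run it.

    Requires mrjob to be installed; nothing here runs until it is called.
    """
    sys.path.insert(0, str(workdir))
    from mr_wc import MRWordCountUtility  # the file written above

    job = MRWordCountUtility(args=["-r", runner, str(input_path)])
    with job.make_runner() as mr_runner:
        mr_runner.run()
        return dict(job.parse_output(mr_runner.cat_output()))
```

Calling launch() on a text file from a later cell should return a dict keyed by "chars", "words", and "lines". Because the class lives in a real file, the same helper could in principle be pointed at another runner by changing the runner argument.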


This does not look ideal, but it at least retains the rationale for using ipynb: write and run everything in a single notebook.

Still looking forward to object serialization so that the integration could be more elegant, i.e. you wouldn't need to generate intermediate script files.

Steve Johnson

Apr 1, 2014, 1:43:15 PM
To: mr...@googlegroups.com, m...@ewencp.org
I'm pretty sure that's the only approach that will work. There needs to be an executable script for Hadoop to run, and you can't just generate that from the class (afaik).
 
I suppose it's possible to turn the AST back into code (which you can do, I think?), but then you'd need to prepend the imports (which you would get...how?) and append the run() block. And even if you got it to work, it would be, like, the biggest, ugliest hack ever.
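
To make the missing-imports problem concrete, here is a rough sketch using only the standard library. It assumes Python 3.9 or later for ast.unparse (which postdates this thread), starts from source text rather than a live class object, and uses a hypothetical WordCounter class instead of a real MRJob.

```python
import ast

# Hypothetical notebook contents: an import in one place, a class in another.
CELL_SOURCE = '''\
from collections import Counter

class WordCounter:
    def count(self, line):
        return Counter(line.split())
'''

# Pull out only the class definition and turn its AST back into code.
tree = ast.parse(CELL_SOURCE)
class_node = next(n for n in tree.body if isinstance(n, ast.ClassDef))
regenerated = ast.unparse(class_node)
print(regenerated)

# The regenerated text defines the class, but the import it depends on
# was never part of the class node, so using the class fails.
namespace = {}
exec(regenerated, namespace)
try:
    namespace["WordCounter"]().count("a b a")
    import_was_lost = False
except NameError:
    import_was_lost = True
print("import lost:", import_was_lost)
```

Recovering which imports a class needs, and appending the run() block, is the part that has no clean general solution.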
 
I agree that it would be pretty cool to run this in IPython, but inline mode is probably the best you're going to get on that front.