Running MrJob on hadoop cluster - load data from hdfs

Vlad Monteal

Aug 23, 2012, 1:42:38 PM8/23/12
to mr...@googlegroups.com
I have been searching for almost a day for an answer on how to write a script that loads data from HDFS using mrjob without going through the command line. I used the command line first and it works, but I intend to implement scripts that can take the Hadoop cluster as an option and, more importantly, that allow me to explicitly set the input data to a location in HDFS.
After looking at several questions and answers in the group, it seems that loading data from HDFS is not very clearly documented. For example, I tried "Cpt Caveman"'s reply to:
And yet I can't get things to work. This is what I have:

#----------------------------------------------
# simple_mrjob.py
import json

from mrjob.job import MRJob

class PlatformCounter(MRJob):
    def mapper(self, key, line):
        line = json.loads(line.strip())
        yield line['platform'], 1

    def reducer(self, word, occurrences):
        print 'in reducer {0}  occurrences'.format(occurrences)
        yield word, sum(occurrences)
#----------------------------------------------
# simple_runner.py
from simple_mrjob import PlatformCounter
from runJob import runJob

runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
#----------------------------------------------
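(As an aside, the mapper/reducer logic itself can be sanity-checked locally in plain Python, with no Hadoop or mrjob involved; a quick sketch with made-up log lines whose `platform` field matches the job above:)

```python
import json
from itertools import groupby

# Made-up log lines for illustration; the real input would come from HDFS.
lines = [
    '{"platform": "ios"}',
    '{"platform": "android"}',
    '{"platform": "ios"}',
]

# mapper: emit (platform, 1) for each JSON record
pairs = [(json.loads(line.strip())["platform"], 1) for line in lines]

# shuffle + reducer: group by platform and sum the occurrences
results = {platform: sum(n for _, n in group)
           for platform, group in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(results)  # {'android': 1, 'ios': 2}
```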

And next is the output of executing simple_runner.py:

hadoop@ip-10-34-139-215:~/vlad/pythonmr/test_mrjob$ python simple_runner.py
starting PlatformCounter job on hadoop
Traceback (most recent call last):
  File "simple_runner.py", line 10, in <module>
    runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
  File "/home/hadoop/vlad/pythonmr/test_mrjob/runJob.py", line 29, in runJob
    runner.run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 487, in run
    self._run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 336, in _run_job_in_hadoop
    steps = self._get_steps()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 985, in _get_steps
    'error getting step information: %s', stderr)
Exception: ('error getting step information: %s', 'Traceback (most recent call last):\n  File "/home/hadoop/vlad/pythonmr/test_mrjob/simple_mrjob.py", line 22, in <module>\n    with mr_pc.make_runner() as runner:\n  File "/usr/local/lib/python2.6/dist-packages/mrjob/job.py", line 575, in make_runner\n    " __main__, which doesn\'t work." % w)\nmrjob.job.UsageError: make_runner() was called with --steps. This probably means you tried to use it from __main__, which doesn\'t work.\n')


Can anyone give me a clue as to what is going on?

Thanks!

Steve Johnson

Aug 23, 2012, 2:07:27 PM8/23/12
to mr...@googlegroups.com
You didn't put the 'if __name__ == "__main__": PlatformCounter().run()' block at the bottom of your script, so the job won't work at all, regardless of where the input files are.
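(For context on why the guard matters: as the traceback's `_get_steps` call suggests, mrjob learns a job's steps by re-invoking the job file as a subprocess, roughly `python simple_mrjob.py --steps`, so a file with no `__main__` block produces nothing when invoked that way. A minimal sketch of that mechanism, using a hypothetical stand-in class rather than a real MRJob:)

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A stand-in job file; the guard at the bottom is what the runner relies
# on when it re-invokes the file to ask for step information.
job_file = textwrap.dedent("""\
    class PlatformCounter(object):
        @classmethod
        def run(cls):
            # a real MRJob would print its step description here
            print("steps: mapper,reducer")

    if __name__ == "__main__":
        PlatformCounter.run()
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(job_file)
    path = f.name

try:
    # Simulates the runner's subprocess call; with the guard removed,
    # this would print nothing and step detection would fail as above.
    out = subprocess.check_output([sys.executable, path]).decode()
    print(out.strip())
finally:
    os.remove(path)
```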

Vlad Monteal

Aug 23, 2012, 5:45:39 PM8/23/12
to mr...@googlegroups.com
Thanks a lot Steve,
This works now.
I should have taken the first lines in the documentation more seriously, but forgive me, I only started using MRJob yesterday.
Other questions arose after running this; I posted them in a new topic:

It seems that you have done work on metabolic networks; I did work in computational biology myself, and my PhD research was in complex networks :)

Thanks again!