Running MrJob on hadoop cluster - load data from hdfs

Vlad Monteal

Aug 23, 2012, 1:42:38 PM8/23/12
to mr...@googlegroups.com
I have been searching for almost a day for an answer on how to write a script that loads data from HDFS using mrjob without going through the command line. I used the command line first and it works, but I intend to implement scripts that can take the Hadoop cluster as an option and, more importantly, that allow me to explicitly set the input data to a location in HDFS.
After looking at several questions and answers in the group, it seems that loading data from HDFS is not very clearly documented. For example, I tried "Cpt Caveman"'s reply to:
And yet I can't get things to work. This is what I have:

#----------------------------------------------
# simple_mrjob.py
import json

from mrjob.job import MRJob

class PlatformCounter(MRJob):
    def mapper(self, key, line):
        line = json.loads(line.strip())
        yield line['platform'], 1

    def reducer(self, word, occurrences):
        print 'in reducer {0}  occurrences'.format(occurrences)
        yield word, sum(occurrences)
#----------------------------------------------
# simple_runner.py
from simple_mrjob import PlatformCounter
from runJob import runJob

runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
#----------------------------------------------
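(As an aside, the mapper/reducer logic itself can be sanity-checked locally in plain Python, with no Hadoop or mrjob involved; a quick sketch with made-up log lines whose `platform` field matches the job above:)

```python
import json
from itertools import groupby

# Made-up log lines for illustration; the real input would come from HDFS.
lines = [
    '{"platform": "ios"}',
    '{"platform": "android"}',
    '{"platform": "ios"}',
]

# mapper: emit (platform, 1) for each JSON record
pairs = [(json.loads(line.strip())["platform"], 1) for line in lines]

# shuffle + reducer: group by platform and sum the occurrences
results = {platform: sum(n for _, n in group)
           for platform, group in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(results)  # {'android': 1, 'ios': 2}
```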

And next is the output of executing simple_runner.py:

hadoop@ip-10-34-139-215:~/vlad/pythonmr/test_mrjob$ python simple_runner.py
starting PlatformCounter job on hadoop
Traceback (most recent call last):
  File "simple_runner.py", line 10, in <module>
    runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
  File "/home/hadoop/vlad/pythonmr/test_mrjob/runJob.py", line 29, in runJob
    runner.run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 487, in run
    self._run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 336, in _run_job_in_hadoop
    steps = self._get_steps()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 985, in _get_steps
    'error getting step information: %s', stderr)
Exception: ('error getting step information: %s', 'Traceback (most recent call last):\n  File "/home/hadoop/vlad/pythonmr/test_mrjob/simple_mrjob.py", line 22, in <module>\n    with mr_pc.make_runner() as runner:\n  File "/usr/local/lib/python2.6/dist-packages/mrjob/job.py", line 575, in make_runner\n    " __main__, which doesn\'t work." % w)\nmrjob.job.UsageError: make_runner() was called with --steps. This probably means you tried to use it from __main__, which doesn\'t work.\n')


Can anyone give me a clue as to what is going on?

Thanks!

Steve Johnson

Aug 23, 2012, 2:07:27 PM8/23/12
to mr...@googlegroups.com
You didn't put the 'if __name__ == "__main__": PlatformCounter().run()' block at the bottom of your script, so the job won't work at all, regardless of where the input files are.
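(For context on why the guard matters: as the traceback's `_get_steps` call suggests, mrjob learns a job's steps by re-invoking the job file as a subprocess, roughly `python simple_mrjob.py --steps`, so a file with no `__main__` block produces nothing when invoked that way. A minimal sketch of that mechanism, using a hypothetical stand-in class rather than a real MRJob:)

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A stand-in job file; the guard at the bottom is what the runner relies
# on when it re-invokes the file to ask for step information.
job_file = textwrap.dedent("""\
    class PlatformCounter(object):
        @classmethod
        def run(cls):
            # a real MRJob would print its step description here
            print("steps: mapper,reducer")

    if __name__ == "__main__":
        PlatformCounter.run()
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(job_file)
    path = f.name

try:
    # Simulates the runner's subprocess call; with the guard removed,
    # this would print nothing and step detection would fail as above.
    out = subprocess.check_output([sys.executable, path]).decode()
    print(out.strip())
finally:
    os.remove(path)
```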

Vlad Monteal

Aug 23, 2012, 5:45:39 PM8/23/12
to mr...@googlegroups.com
Thanks a lot Steve,
This works now.
I should have taken the first lines in the documentation more seriously, but forgive me, I only started using MRJob yesterday.
Other questions arose after running this; I posted them in a new topic:

It seems that you have done work on metabolic networks; I did work in computational biology myself, and my PhD research was in complex networks :)

Thanks again!