I have spent almost a full day looking for how to write a script that loads data from HDFS using mrjob without going through the command line. I started with the command line, and that works, but I want to write scripts that can take the Hadoop cluster as an option and, more importantly, let me explicitly set the input data to a location in HDFS.
After looking at several questions and answers in the group, it seems that how to load data from HDFS is not very clear. For example, I tried "Cpt Caveman"'s reply to:
And yet I can't get things to work. This is what I have:
#----------------------------------------------
# simple_mrjob.py
import json
from mrjob.job import MRJob

class PlatformCounter(MRJob):
    def mapper(self, key, line):
        line = json.loads(line.strip())
        yield line['platform'], 1

    def reducer(self, word, occurrences):
        print 'in reducer {0} occurrences'.format(occurrences)
        yield word, sum(occurrences)
#----------------------------------------------
# simple_runner.py
from simple_mrjob import PlatformCounter
from runJob import runJob
runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
#----------------------------------------------
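For reference, what I expect PlatformCounter to compute is roughly equivalent to the plain-Python sketch below. It does not use mrjob or Hadoop at all; the function name `count_platforms` and the sample records are made up for illustration, and I am assuming each input line is a JSON object with a 'platform' field, as in the mapper above.

```python
import json
from collections import defaultdict


def count_platforms(lines):
    """Count records per 'platform', mirroring the mapper and reducer above."""
    counts = defaultdict(int)
    for line in lines:
        record = json.loads(line.strip())  # same parsing as the mapper
        counts[record['platform']] += 1    # same effect as summing the 1s in the reducer
    return dict(counts)


# Made-up sample records, one JSON object per line (hypothetical data).
sample = [
    '{"platform": "ios"}',
    '{"platform": "android"}',
    '{"platform": "ios"}',
]
result = count_platforms(sample)
```

Running this on a small local sample gives me something to compare the Hadoop output against once the job actually runs.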
And next is the output of executing simple_runner.py:
hadoop@ip-10-34-139-215:~/vlad/pythonmr/test_mrjob$ python simple_runner.py
starting PlatformCounter job on hadoop
Traceback (most recent call last):
  File "simple_runner.py", line 10, in <module>
    runJob(PlatformCounter, ['hdfs:///user/hadoop/logs', '--output-dir=result-mrjob-args'], 'hadoop')
  File "/home/hadoop/vlad/pythonmr/test_mrjob/runJob.py", line 29, in runJob
    runner.run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 487, in run
    self._run()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/hadoop.py", line 336, in _run_job_in_hadoop
    steps = self._get_steps()
  File "/usr/local/lib/python2.6/dist-packages/mrjob/runner.py", line 985, in _get_steps
    'error getting step information: %s', stderr)
Exception: ('error getting step information: %s', 'Traceback (most recent call last):\n File "/home/hadoop/vlad/pythonmr/test_mrjob/simple_mrjob.py", line 22, in <module>\n with mr_pc.make_runner() as runner:\n File "/usr/local/lib/python2.6/dist-packages/mrjob/job.py", line 575, in make_runner\n " __main__\
, which doesn\'t work." % w)\nmrjob.job.UsageError: make_runner() was called with --steps. This probably means you tried to use it from __main__, which doesn\'t work.\n')
Can anyone give me a clue as to what is going on?
Thanks!