Hi,
Following the mrjob guide, I created a sample job:
[cloudera@quickstart mrjob]$ cat mr_first_job.py
from mrjob.job import MRJob
class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
    def reducer(self, key, values):
        yield key, sum(values)
if __name__ == '__main__':
    MRWordFrequencyCount.run()
Next, I uploaded a file to HDFS:
[cloudera@quickstart mrjob]$ hadoop fs -ls /user/cloudera/ngrams
Found 1 items
-rw-r--r-- 1 cloudera cloudera 9175040 2014-08-21 10:23 /user/cloudera/ngrams/googlebooks-eng-all-5gram-20090715-199.csv
But when I run the job with the hadoop runner, I get the error below:
[cloudera@quickstart mrjob]$ python mr_first_job.py -r hadoop --hadoop-bin /usr/bin/hadoop --jobconf mapred.reduce.tasks=1 -o hdfs:///user/cloudera/output-mrjob hdfs:///user/cloudera/ngrams
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mr_first_job.cloudera.20140821.180047.687023
writing wrapper script to /tmp/mr_first_job.cloudera.20140821.180047.687023/setup-wrapper.sh
STDERR: mkdir: `hdfs:///user/cloudera/tmp/mrjob/mr_first_job.cloudera.20140821.180047.687023/files/': No such file or directory
Traceback (most recent call last):
  File "mr_first_job.py", line 10, in <module>
    MRWordFrequencyCount.run()
  File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/lib/python2.6/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 238, in _run
    self._upload_local_files_to_hdfs()
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 265, in _upload_local_files_to_hdfs
    self._mkdir_on_hdfs(self._upload_mgr.prefix)
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 273, in _mkdir_on_hdfs
    self.invoke_hadoop(['fs', '-mkdir', path])
  File "/usr/lib/python2.6/site-packages/mrjob/fs/hadoop.py", line 109, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'fs', '-mkdir', 'hdfs:///user/cloudera/tmp/mrjob/mr_first_job.cloudera.20140821.180047.687023/files/']' returned non-zero exit status 1
[cloudera@quickstart mrjob]$
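Reading the traceback, the call that fails is a plain `hadoop fs -mkdir` on a nested path whose parent directories do not exist yet. My guess (an assumption, not something I have confirmed on this cluster) is that this Hadoop version only creates missing parent directories when `-mkdir` is given `-p`. The sketch below shows the two command lines side by side. `build_mkdir_args` and `mkdir_on_hdfs` are hypothetical helpers written for illustration; they are not part of mrjob.

```python
import subprocess


def build_mkdir_args(hadoop_bin, path, create_parents=False):
    """Build the argv list for `hadoop fs -mkdir` (hypothetical helper).

    With create_parents=True the `-p` flag is added, which asks Hadoop
    to create any missing parent directories as well.
    """
    args = [hadoop_bin, 'fs', '-mkdir']
    if create_parents:
        args.append('-p')
    args.append(path)
    return args


def mkdir_on_hdfs(hadoop_bin, path, create_parents=True):
    """Run the mkdir command; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_mkdir_args(hadoop_bin, path, create_parents))


if __name__ == '__main__':
    # Path taken from the error message above.
    target = ('hdfs:///user/cloudera/tmp/mrjob/'
              'mr_first_job.cloudera.20140821.180047.687023/files/')
    # The form seen in the traceback (no -p):
    print(' '.join(build_mkdir_args('/usr/bin/hadoop', target)))
    # The form that would also create the missing parents:
    print(' '.join(build_mkdir_args('/usr/bin/hadoop', target, True)))
```

Running the second printed command by hand would be one way to check whether the missing `-p` is really the cause.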
The Python code itself is correct, as it runs fine in local mode:
[cloudera@quickstart mrjob]$ python mr_first_job.py mr_first_job.py
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mr_first_job.cloudera.20140821.180158.291763
writing to /tmp/mr_first_job.cloudera.20140821.180158.291763/step-0-mapper_part-00000
Counters from step 1:
(no counters found)
writing to /tmp/mr_first_job.cloudera.20140821.180158.291763/step-0-mapper-sorted
> sort /tmp/mr_first_job.cloudera.20140821.180158.291763/step-0-mapper_part-00000
writing to /tmp/mr_first_job.cloudera.20140821.180158.291763/step-0-reducer_part-00000
Counters from step 1:
(no counters found)
Moving /tmp/mr_first_job.cloudera.20140821.180158.291763/step-0-reducer_part-00000 -> /tmp/mr_first_job.cloudera.20140821.180158.291763/output/part-00000
Streaming final output from /tmp/mr_first_job.cloudera.20140821.180158.291763/output
"chars" 308
"lines" 12
"words" 31
removing tmp directory /tmp/mr_first_job.cloudera.20140821.180158.291763
[cloudera@quickstart mrjob]$
I also tried passing --check-input-paths=false:
[cloudera@quickstart mrjob]$ python mr_first_job.py -r hadoop hdfs:///user/cloudera/ngrams/googlebooks-eng-all-5gram-20090715-199.csv --check-input-paths=false
Usage: mr_first_job.py [options] [input files]
mr_first_job.py: error: --check-input-paths option does not take a value
[cloudera@quickstart mrjob]$
Thanks for your help.
Manish