I'm trying to run the [mrjob example](https://github.com/MinerKasch/HadoopWithPython/blob/master/python/MapReduce/mrjob/top_salary.py) from the book Hadoop with Python, using mrjob 0.56. The input file `salaries.csv` can be found [here](https://github.com/MinerKasch/HadoopWithPython/blob/master/resources/salaries.csv).
I can start the namenode and the datanode. Running:

    start-dfs.sh

returns:

    Starting namenodes on [localhost]
    localhost: starting namenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-namenode-me-Notebook-PC.out
    localhost: starting datanode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-datanode-me-Notebook-PC.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-secondarynamenode-me-Notebook-PC.out

I also have no problem creating the input directory structure and copying `salaries.csv` to HDFS:

    hdfs dfs -mkdir /user/
    hdfs dfs -mkdir /user/me/
    hdfs dfs -mkdir /user/me/input/
    hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
    hdfs dfs -ls /user/me/input/

The last command returns:

    Found 1 items
    -rw-r--r-- 3 me supergroup 1771685 2016-12-24 15:57 /user/me/input/salaries.csv

I also make `top_salaries.py` executable:

    sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py

Launching `top_salaries.py` in local mode also works:

    python2 top_salaries.py -r local salaries.csv > answer.csv

returns:

    No configs found; falling back on auto-configuration
    Creating temp directory /tmp/top_salaries.me.20161224.195052.762894
    Running step 1 of 1...
    Counters: 1
    warn
    missing gross=3223
    Counters: 1
    warn
    missing gross=3223
    Streaming final output from /tmp/top_salaries.me.20161224.195052.762894/output...
    Removing temp directory /tmp/top_salaries.me.20161224.195052.762894...

However, putting things together and running the same job on Hadoop:

    python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv

returns:

    No configs found; falling back on auto-configuration
    Looking for hadoop binary in $PATH...
    Found hadoop binary: /home/me/hadoop-2.7.3/bin/hadoop
    Using Hadoop version 2.7.3
    Looking for Hadoop streaming jar in /home/me/hadoop-2.7.3...
    Found Hadoop streaming jar: /home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
    Creating temp directory /tmp/top_salaries.me.20161224.195201.967990
    Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/...
    Running step 1 of 1...
    session.id is deprecated. Instead, use dfs.metrics.session-id
    Initializing JVM Metrics with processName=JobTracker, sessionId=
    Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
    Error launching job , bad input path : File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip
    Streaming Command Failed!
    Attempting to fetch counters from logs...
    Can't fetch history log; missing job ID
    No counters found
    Scanning logs for probable cause of failure...
    Can't fetch history log; missing job ID
    Can't fetch task logs; missing application ID
    Step 1 of 1 failed: Command '['/home/me/hadoop-2.7.3/bin/hadoop', 'jar', '/home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/top_salaries.py#top_salaries.py', '-input', 'hdfs:///user/me/input/salaries.csv', '-output', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/output', '-mapper', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --mapper', '-combiner', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --reducer']' returned non-zero exit status 512

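
One detail in this output that may be relevant: the job's files are copied to an `hdfs:///` location, but the staging area that gets cleaned up is a `file:/` path, and the job ID starts with `job_local`. The sketch below only pulls those pieces out of two log lines copied from above, using nothing but the standard library. The helper names `uri_scheme` and `job_id` are made up for this example, and reading the result as a sign that the job was handed to a local runner rather than to the cluster is an assumption, not something the log states outright.

```python
import re

# Two lines copied verbatim from the output above.
copy_line = ("Copying local files to hdfs:///user/me/tmp/mrjob/"
             "top_salaries.me.20161224.195201.967990/files/...")
cleanup_line = ("Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/"
                "me553683497/.staging/job_local553683497_0001")


def uri_scheme(text):
    """Return the scheme of the first 'scheme:/' URI found in text, or None."""
    match = re.search(r"\b([a-z][a-z0-9+.-]*):/", text)
    return match.group(1) if match else None


def job_id(text):
    """Return the first job ID of the form job_<name>_<number> in text, or None."""
    match = re.search(r"\bjob_\w+_\d+\b", text)
    return match.group(0) if match else None


upload_scheme = uri_scheme(copy_line)      # where mrjob uploaded the job files
staging_scheme = uri_scheme(cleanup_line)  # where the job was staged
staged_job = job_id(cleanup_line)

print("uploaded to scheme:", upload_scheme)
print("staged under scheme:", staging_scheme)
print("job ID:", staged_job)
```

The two schemes differ, which matches the "File does not exist" error being reported for a path under `/tmp/hadoop-me/...` on the local filesystem rather than under `hdfs:///user/me/tmp/mrjob/...`.
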
I'm a complete beginner with mrjob and don't understand what could be causing this problem. I'm sure I've set something up incorrectly somewhere, but I don't know what or where.

For the record, this is my `core-site.xml`:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

and this is my `hdfs-site.xml`:

    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
      </property>
    </configuration>

(I have not changed any of the other files from their default values).
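
To make it easy to see at a glance what these two files actually set, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML is pasted inline from above, and the helper name `read_hadoop_properties` is made up for this example. It only lists properties that are explicitly present in the files; it knows nothing about Hadoop's built-in defaults.

```python
import xml.etree.ElementTree as ET

# Contents of core-site.xml and hdfs-site.xml, pasted from above.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
  </property>
</configuration>
"""


def read_hadoop_properties(xml_text):
    """Map each <property>'s <name> to its <value> (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


# Merge the explicitly configured properties from both files.
settings = {}
for xml_text in (CORE_SITE, HDFS_SITE):
    settings.update(read_hadoop_properties(xml_text))

for name, value in sorted(settings.items()):
    print(name, "=", value)
```
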

For the record, here is the Python script (same as at the GitHub link above):

    from mrjob.job import MRJob
    from mrjob.step import MRStep
    import csv

    cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

    class salarymax(MRJob):

        def mapper(self, _, line):
            # Convert each line into a dictionary
            row = dict(zip(cols, [ a.strip() for a in csv.reader([line]).next()]))

            # Yield the salary
            yield 'salary', (float(row['AnnualSalary'][1:]), line)

            # Yield the gross pay
            try:
                yield 'gross', (float(row['GrossPay'][1:]), line)
            except ValueError:
                self.increment_counter('warn', 'missing gross', 1)

        def reducer(self, key, values):
            topten = []

            # For 'salary' and 'gross' compute the top 10
            for p in values:
                topten.append(p)
                topten.sort()
                topten = topten[-10:]

            for p in topten:
                yield key, p

        combiner = reducer

    if __name__ == '__main__':
        salarymax.run()
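
For anyone who wants to follow the script's logic without a Hadoop setup, here is a small sketch of the same parse-and-keep-the-top-N idea in plain Python, with no mrjob involved. The sample rows are invented for illustration (they are not taken from `salaries.csv`), the helper names `parse_row` and `top_n` are made up, and it uses `next(...)` instead of `.next()` so that it also runs on Python 3.

```python
import csv

COLS = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

# Invented sample lines in the same column layout (not real data).
SAMPLE_LINES = [
    'Doe Jane,ANALYST,A01,Finance,01/02/2010,$50000.00,$48000.50',
    'Roe Richard,CLERK,A02,Courts,03/04/2012,$30000.00,',
    'Poe Edgar,"MANAGER, SENIOR",A03,Library,05/06/2008,$70000.00,$71000.25',
]


def parse_row(line):
    """Parse one CSV line into a dict keyed by column name, as the mapper does."""
    fields = next(csv.reader([line]))
    return dict(zip(COLS, [f.strip() for f in fields]))


def top_n(pairs, n=10):
    """Return the n largest (amount, line) pairs in ascending order."""
    return sorted(pairs)[-n:]


salaries = []
gross = []
missing_gross = 0  # plays the role of the 'missing gross' counter

for line in SAMPLE_LINES:
    row = parse_row(line)
    # [1:] drops the leading '$' before converting to float
    salaries.append((float(row['AnnualSalary'][1:]), line))
    try:
        gross.append((float(row['GrossPay'][1:]), line))
    except ValueError:
        # An empty GrossPay field cannot be converted, so it is only counted
        missing_gross += 1

top_salaries = top_n(salaries, n=2)
top_gross = top_n(gross, n=2)

print(top_salaries)
print(top_gross)
print('missing gross:', missing_gross)
```

Rows with an empty `GrossPay` field end up in the counter rather than in the output, which is consistent with the `missing gross=3223` counter reported by the local run.
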