Reproducing mrjob example from github's 'Hadoop with Python' in hadoop mode: Error launching job , bad input path : File does not exist


Kaveh Vakili

Dec 25, 2016, 12:13:03 PM
to mrjob
I'm trying to run the [mrjob example](https://github.com/MinerKasch/HadoopWithPython/blob/master/python/MapReduce/mrjob/top_salary.py) from the book Hadoop with Python. I'm using mrjob 0.56.

(the file salaries.csv can be found [here](https://github.com/MinerKasch/HadoopWithPython/blob/master/resources/salaries.csv))

I can start the namenode and the datanode. Running:

    start-dfs.sh

returns:

    Starting namenodes on [localhost]
    localhost: starting namenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-namenode-me-Notebook-PC.out
    localhost: starting datanode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-datanode-me-Notebook-PC.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-secondarynamenode-me-Notebook-PC.out

I also have no problem creating the input directory structure and copying
`salaries.csv` onto HDFS:

    hdfs dfs -mkdir /user/
    hdfs dfs -mkdir /user/me/
    hdfs dfs -mkdir /user/me/input/
    hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
    hdfs dfs -ls /user/me/input/

returns:

    Found 1 items
    -rw-r--r--   3 me supergroup    1771685 2016-12-24 15:57 /user/me/input/salaries.csv


I also make `top_salaries.py` executable:

    sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py

Launching `top_salaries.py` in local mode also works:

    python2 top_salaries.py -r local salaries.csv > answer.csv

returns:

    No configs found; falling back on auto-configuration
    Creating temp directory /tmp/top_salaries.me.20161224.195052.762894
    Running step 1 of 1...
    Counters: 1
        warn
            missing gross=3223
    Counters: 1
        warn
            missing gross=3223
    Streaming final output from /tmp/top_salaries.me.20161224.195052.762894/output...
    Removing temp directory /tmp/top_salaries.me.20161224.195052.762894...

However, running the same job on Hadoop with `python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv` returns:

    No configs found; falling back on auto-configuration
    Looking for hadoop binary in $PATH...
    Found hadoop binary: /home/me/hadoop-2.7.3/bin/hadoop
    Using Hadoop version 2.7.3
    Looking for Hadoop streaming jar in /home/me/hadoop-2.7.3...
    Found Hadoop streaming jar: /home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
    Creating temp directory /tmp/top_salaries.me.20161224.195201.967990
    Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/...
    Running step 1 of 1...
      session.id is deprecated. Instead, use dfs.metrics.session-id
      Initializing JVM Metrics with processName=JobTracker, sessionId=
      Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
      Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
      Error launching job , bad input path : File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip
      Streaming Command Failed!
    Attempting to fetch counters from logs...
    Can't fetch history log; missing job ID
    No counters found
    Scanning logs for probable cause of failure...
    Can't fetch history log; missing job ID
    Can't fetch task logs; missing application ID
    Step 1 of 1 failed: Command '['/home/me/hadoop-2.7.3/bin/hadoop', 'jar', '/home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/top_salaries.py#top_salaries.py', '-input', 'hdfs:///user/me/input/salaries.csv', '-output', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/output', '-mapper', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --mapper', '-combiner', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --reducer']' returned non-zero exit status 512

I'm a complete beginner with mrjob and don't understand what could be causing this problem. I'm sure I've set something incorrectly somewhere, but I don't know what or where.

For the record this is my core-site.xml:

    <configuration>
     <property>        
        <name>fs.defaultFS</name>        
        <value>hdfs://localhost:9000</value>   
     </property>
    </configuration>

and this is my hdfs-site.xml:

    <configuration>
        <property>
           <name>dfs.namenode.name.dir</name>
           <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
        </property>
        <property>
           <name>dfs.datanode.data.dir</name>
           <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
        </property>
    </configuration>

(I have not changed any of the other files from their default values).
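One thing that can be checked from these files: the job ID in the log above (`job_local553683497_0001`) and the `file:/tmp/hadoop-me/mapred/staging` path suggest the step ran under Hadoop's local job runner, which is what Hadoop 2.x uses when `mapreduce.framework.name` is not set in mapred-site.xml. Below is a minimal Python sketch for listing the properties a `*-site.xml` file sets. The helper name `read_hadoop_properties` is hypothetical, the first sample is the core-site.xml shown above, and the empty mapred-site.xml is an assumption standing in for an unmodified install.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# core-site.xml as posted above.
CORE_SITE = """
<configuration>
 <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
 </property>
</configuration>
"""

# Assumption: an untouched mapred-site.xml that sets no properties.
MAPRED_SITE = "<configuration></configuration>"

core = read_hadoop_properties(CORE_SITE)
mapred = read_hadoop_properties(MAPRED_SITE)

# When the property is absent, Hadoop 2.x falls back to its default, "local".
framework = mapred.get("mapreduce.framework.name", "local")

print("fs.defaultFS:", core.get("fs.defaultFS"))
print("mapreduce.framework.name:", framework)
```

If this reports `local`, setting `mapreduce.framework.name` to `yarn` in mapred-site.xml and starting YARN with `start-yarn.sh` is the usual way to make streaming jobs run on the cluster rather than in the local runner. Whether that resolves this particular error is not confirmed in this thread.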

For the record, here is the Python script (the same as at the GitHub link above):

    from mrjob.job import MRJob
    from mrjob.step import MRStep
    import csv
   
    cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')
   
    class salarymax(MRJob):
   
        def mapper(self, _, line):
            # Convert each line into a dictionary
            row = dict(zip(cols, [a.strip() for a in next(csv.reader([line]))]))
   
            # Yield the salary
            yield 'salary', (float(row['AnnualSalary'][1:]), line)
           
            # Yield the gross pay
            try:
                yield 'gross', (float(row['GrossPay'][1:]), line)
            except ValueError:
                self.increment_counter('warn', 'missing gross', 1)
   
        def reducer(self, key, values):
            topten = []
   
            # For 'salary' and 'gross' compute the top 10
            for p in values:
                topten.append(p)
                topten.sort()
                topten = topten[-10:]
   
            for p in topten:
                yield key, p
   
        combiner = reducer
   
    if __name__ == '__main__':
        salarymax.run()
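Since the job works in local mode, the script's logic is probably not the cause. For reference, the parsing and top-ten steps can be exercised without mrjob or Hadoop at all. The following is a rough sketch, not the book's code: the helper names `parse_line`, `money`, and `top_n` are hypothetical, the sample rows are made up rather than taken from salaries.csv, and `heapq.nlargest` is swapped in for the repeated sort-and-slice in the reducer.

```python
import csv
import heapq

COLS = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')


def parse_line(line):
    """Parse one CSV line into a dict keyed by column name, as the mapper does."""
    fields = next(csv.reader([line]))
    return dict(zip(COLS, [f.strip() for f in fields]))


def money(text):
    """Drop the leading currency symbol and convert to float.

    Raises ValueError for an empty field, which the mapper counts as
    'missing gross'.
    """
    return float(text[1:])


def top_n(pairs, n=10):
    """Return the n largest (amount, line) pairs in ascending order."""
    return sorted(heapq.nlargest(n, pairs))


# Made-up rows for illustration only; the second one has no GrossPay.
SAMPLE = [
    '"Doe,Jane",CLERK,A01,Finance,01/02/2010,$50000.00,$51000.50',
    '"Roe,Richard",ANALYST,A02,Police,03/04/2012,$62000.00,',
]

salaries = []
missing_gross = 0
for line in SAMPLE:
    row = parse_line(line)
    salaries.append((money(row['AnnualSalary']), line))
    try:
        money(row['GrossPay'])
    except ValueError:
        missing_gross += 1

print(top_n(salaries, n=1))
print("missing gross:", missing_gross)
```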

Traiano Welcome

Mar 11, 2018, 9:01:22 PM
to mrjob
Hi Kaveh

Did you ever find an answer to this question? I've started on the examples in the same book, but I find the book of very little value, given that the most basic examples don't work (and I am bound to assume mrjob is of little utility at this point).

I've posted almost the same question to StackOverflow: https://stackoverflow.com/questions/49102796/using-mrjob-for-hadoop-streaming-error-launching-job-bad-input-path-file-d

I'd be very thankful if you could share any suggestions you have on debugging this!

Traiano