Hi,
I am new to mrjob, and am taking the class "Machine Learning on Big Data w. Map Reduce"
( http://www.meetup.com/HandsOnProgrammingEvents/events/96046502/ ), taught by
Mike Bowles.
I am using my own 3-node Hadoop cluster, running Ubuntu 12.04.1 and Cloudera CDH 4.1.2.
I installed mrjob and have gotten some basics working, but I have some questions about how to debug problems with MapReduce jobs written in mrjob.
From these slides,
http://machinelearningbigdata.pbworks.com/w/file/50030744/Machine%20Learning%20on%20Big%20Data%20-%20ClassIntro.pdf ,
I took the example, which I have attached as mrjob_test1.py.
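For anyone who doesn't want to open the attachment: going by the tracebacks below, the heart of the job is roughly the following (a simplified sketch with the MRJob boilerplate stripped out; the function bodies are my best reconstruction and the attached file may differ in details).

```python
import json

def mapper(line):
    # Each input line is expected to be a JSON-encoded number
    # (this is the json.loads() call at line 9 of mrjob_test1.py
    # that shows up in the traceback below).
    num = json.loads(line)
    yield 1, num

def reducer(key, values):
    # Accumulate count, sum, and sum of squares in one pass,
    # then emit the mean and (population) variance.
    n, total, total_sq = 0, 0.0, 0.0
    for num in values:
        n += 1
        total += num
        total_sq += num * num
    mean = total / n
    var = total_sq / n - mean * mean
    yield key, (mean, var)
```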
In my Hadoop cluster, I ran:
python mrjob_test1.py -r hadoop < good_data.txt
and it worked fine.
When I ran this:
python mrjob_test1.py -r hadoop < bad_data.txt
I got a Python exception:
STDOUT: packageJobJar: [/tmp/hadoop-hadoop1/hadoop-unjar5518420872309545823/] [] /tmp/streamjob4286063445422626248.jar tmpDir=null
Job failed with return code 1: ['/usr/lib/hadoop-0.20-mapreduce/bin/hadoop', 'jar', '/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar', '-files', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/mrjob_test1.py#mrjob_test1.py', '-archives', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/STDIN', '-output', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/output', '-mapper', 'python mrjob_test1.py --step-num=0 --mapper', '-reducer', 'python mrjob_test1.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
  File "mrjob_test1.py", line 28, in <module>
    mrMeanVar.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 448, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 232, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 334, in _run_job_in_hadoop
    raise Exception(msg)
Exception: Job failed with return code 1: ['/usr/lib/hadoop-0.20-mapreduce/bin/hadoop', 'jar', '/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar', '-files', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/mrjob_test1.py#mrjob_test1.py', '-archives', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/files/STDIN', '-output', 'hdfs:///user/hadoop1/tmp/mrjob/mrjob_test1.hadoop1.20130120.235407.486390/output', '-mapper', 'python mrjob_test1.py --step-num=0 --mapper', '-reducer', 'python mrjob_test1.py --step-num=0 --reducer']
I couldn't figure out the source of the problem from
this exception. By looking in the MapReduce logs on my Hadoop server,
I found this:
2013-01-20 23:54:58,853 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301191948_0008_m_000000_3: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
I still couldn't figure out the source of the problem.
However, when I ran the job without Hadoop, using mrjob's default inline runner
(python mrjob_test1.py < bad_data.txt), I got this error:
Traceback (most recent call last):
  File "mrjob_test1.py", line 28, in <module>
    mrMeanVar.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 448, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/inline.py", line 161, in _run
    'mapper')
  File "/usr/local/lib/python2.7/dist-packages/mrjob/inline.py", line 216, in _invoke_inline_mrjob
    child_instance.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 492, in execute
    self.run_mapper(self.options.step_num)
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 557, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "mrjob_test1.py", line 9, in mapper
    num = json.loads(line)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
At this point, I could finally tell that the input data was bad.
Is there a way to get mrjob to display this exception
information when running with the Hadoop runner (-r hadoop)?
Without this stack trace, problems are very difficult to diagnose.
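In the meantime, I suppose I could guard the mapper against bad lines myself, something like this (an untested sketch, assuming the mapper just calls json.loads on each line):

```python
import json
import sys

def safe_mapper(line):
    # Skip lines that aren't valid JSON instead of crashing the whole
    # task. Bad lines are logged to stderr, which Hadoop streaming
    # captures in the per-task logs, so they can be tracked down later.
    try:
        num = json.loads(line)
    except ValueError:
        sys.stderr.write("skipping bad input line: %r\n" % line)
        return
    yield 1, num
```

But that only masks the underlying problem; what I really want is to see the stack trace from the failing task.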
Thanks.
--
Craig Rodrigues
rod...@crodrigues.org