I'm playing around with dumbo. I tried running the wordcount example
with input being a directory containing four bzip2 files, each of them
being around 110MiB in size.
I'm running Cloudera hadoop 0.18.3 which has the backported bzip2
support, but without splitting support. So, when I run normal streaming
jobs against this directory, the job is split into 4 maps.
With dumbo and the wordcount example, the job was split into 8 maps.
Very confusing. The output was also doubled, so each file was analysed
in full twice.
Adding -jobconf 'mapred.min.split.size=922337203685775807' to the dumbo
cmdline fixed the problem.
Any ideas on why this problem occurs? I can't reproduce it with small
files; it seems the input files have to be large, probably larger than
the dfs.block.size (set to 64MiB in my case).
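For reference, the "larger than dfs.block.size" observation matches how
FileInputFormat sizes splits in Hadoop 0.18: split size is
max(minSplitSize, min(goalSize, blockSize)), and a file is then cut into
roughly fileSize/splitSize pieces. A simplified Python sketch (ignoring
the SPLIT_SLOP fudge factor, and assuming splitability is being ignored,
which seems to be what happens here):

```python
# Simplified sketch of Hadoop 0.18 FileInputFormat split sizing:
# split_size = max(min_split_size, min(goal_size, block_size)).

def compute_split_size(goal_size, min_split_size, block_size):
    return max(min_split_size, min(goal_size, block_size))

def num_splits(file_size, split_size):
    # ceil(file_size / split_size), ignoring the SPLIT_SLOP slack.
    return -(-file_size // split_size)

block = 64 * 1024 * 1024   # dfs.block.size = 64 MiB
file_size = 118392538      # ~113 MiB, like file1.log.bz2

# Default min split size (tiny): split size is capped at the block size,
# so a >64 MiB file yields 2 splits if splitability is not honoured.
print(num_splits(file_size, compute_split_size(file_size, 1, block)))  # 2

# With mapred.min.split.size set huge, one split covers the whole file.
huge = 922337203685775807
print(num_splits(file_size, compute_split_size(file_size, huge, block)))  # 1
```

That would explain both the doubled map count (2 splits per ~113 MiB
file) and why the huge mapred.min.split.size value works around it.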
Thanks,
\EF
--
Erik Forsberg <fors...@opera.com>
Developer, Opera Software - http://www.opera.com/
>
> Hey Erik,
>
> Never tried the bzip2 stuff (since we use LZO), but I can't think of
> any reason why "normal" Streaming would behave differently from Dumbo.
Me neither. I was actually fully prepared to see the problem gone when
trying again after a good night's sleep, but unfortunately, it's still
there.
> I could try to look for something suspicious if you can show me the
> first few lines Dumbo outputs (in particular the line that begins with
> "EXEC:") though.
bin/dumbo start experiments/dumbo/wordcount.py -input dumbo/input \
    -output dumbo/output -hadoop /usr/lib/hadoop

EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop \
    jar /usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-6cloudera0.3.0-streaming.jar \
    -input 'dumbo/input' -output 'dumbo/output' \
    -mapper 'python -m wordcount map 0 262144000' \
    -reducer 'python -m wordcount red 0 262144000' \
    -jobconf 'stream.map.input=typedbytes' \
    -jobconf 'stream.reduce.input=typedbytes' \
    -jobconf 'stream.map.output=typedbytes' \
    -jobconf 'stream.reduce.output=typedbytes' \
    -jobconf 'mapred.job.name=wordcount.py (1/1)' \
    -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' \
    -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' \
    -cmdenv 'PYTHONPATH=dumbo-0.21.21-py2.5.egg:typedbytes-0.3.6-py2.5.egg' \
    -file '/home/forsberg/dev/dumbotest/experiments/dumbo/wordcount.py' \
    -file '/home/forsberg/dev/dumbotest/lib/dumbo-0.21.21-py2.5.egg' \
    -file '/home/forsberg/dev/dumbotest/lib/typedbytes-0.3.6-py2.5.egg'
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/home/forsberg/dev/dumbotest/experiments/dumbo/wordcount.py, /home/forsberg/dev/dumbotest/lib/dumbo-0.21.21-py2.5.egg, /home/forsberg/dev/dumbotest/lib/typedbytes-0.3.6-py2.5.egg, /var/lib/hadoop/cache/forsberg/hadoop-unjar14949/] [] /tmp/streamjob14950.jar tmpDir=null
09/09/29 08:21:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/09/29 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 2
09/09/29 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 2
09/09/29 08:21:46 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/cache/forsberg/mapred/local]
09/09/29 08:21:46 INFO streaming.StreamJob: Running job: job_200909281311_0025
09/09/29 08:21:46 INFO streaming.StreamJob: To kill this job, run:
09/09/29 08:21:46 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_200909281311_0025
09/09/29 08:21:46 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200909281311_0025
09/09/29 08:21:47 INFO streaming.StreamJob: map 0% reduce 0%
...
09/09/29 08:46:55 INFO streaming.StreamJob: map 100% reduce 100%
09/09/29 08:46:55 INFO streaming.StreamJob: Job complete: job_200909281311_0025
09/09/29 08:46:55 INFO streaming.StreamJob: Output: dumbo/output
dumbo/input in my HDFS has the following contents:
$ hadoop dfs -ls dumbo/input
Found 2 items
-rw-r--r--   1 forsberg supergroup  118392538 2009-09-29 08:16 /user/forsberg/dumbo/input/file1.log.bz2
-rw-r--r--   1 forsberg supergroup  119518613 2009-09-29 08:17 /user/forsberg/dumbo/input/file2.log.bz2
Yet, there are 4 mappers started. Weird.
My mapper.py:
import os
import sys

libpath = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0]),
                                       "../../lib"))
sys.path = [libpath] + sys.path

from pkg_resources import require
require("dumbo")
import dumbo

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    dumbo.run(mapper, reducer)
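For what it's worth, mapper and reducer like the ones above can be
sanity-checked locally without Hadoop at all. A hypothetical little
harness (not part of dumbo) that mimics the map/shuffle/reduce phases:

```python
# Minimal local harness for a word-count mapper/reducer (no Hadoop).
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(lines):
    # Map phase: feed each input line through the mapper.
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    # Shuffle phase: sort and group by key, like Hadoop's shuffle.
    mapped.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return dict(out)

print(run_local(["hello world", "hello dumbo"]))
# {'dumbo': 1, 'hello': 2, 'world': 1}
```

Handy for ruling out logic bugs before blaming the framework, which in
this case points the finger firmly at the input handling.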
I also ran the same wordcount program against an input directory where
the same files were uncompressed, and the result is the same as
yesterday: a word that is counted 84 times against the compressed files
is counted 42 times against the uncompressed ones.
Weird. Could be some hadoop streaming bug, I guess.
> Weird. Could be some hadoop streaming bug, I guess.
An additional note: I have not applied MAPREDUCE-764 on this cluster.
Don't know if it's related, just thought I should mention it.
>
> Not having applied MAPREDUCE-764 shouldn't be related I think.
>
> I wonder if it has to do with AutoInputFormat. Could you try to run
> the Dumbo program with the option "-inputformat text" and see if that
> helps?
That helps.
> Another way of checking if it's related to AutoInputFormat would be to
> run the "normal" Streaming program with the option "-inputformat
> org.apache.hadoop.streaming.AutoInputFormat" and see if you also get
> the weirdness then.
That gives me the weirdness.
So, some kind of problem with AutoInputFormat and bzip2-compressed
files, then. I'll create a bug report in the Hadoop Jira.
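As I understand it, AutoInputFormat decides between SequenceFile and
text handling by peeking at the leading bytes of each split, which may
interact badly with a compressed, non-splittable stream. A rough
illustration of that kind of magic-byte sniffing (the `SEQ` and `BZh`
prefixes are the real magic numbers for Hadoop SequenceFiles and bzip2
streams; the classifier function itself is a hypothetical sketch, not
the actual AutoInputFormat code):

```python
# Rough sketch of magic-byte sniffing, AutoInputFormat-style.
# 'SEQ' and 'BZh' are real magic prefixes; sniff() is hypothetical.
import bz2

SEQFILE_MAGIC = b"SEQ"   # Hadoop SequenceFile header
BZIP2_MAGIC = b"BZh"     # bzip2 stream header

def sniff(header):
    if header.startswith(SEQFILE_MAGIC):
        return "sequencefile"
    if header.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "text"

print(sniff(bz2.compress(b"hello world")[:4]))  # bzip2
print(sniff(b"SEQ\x06"))                        # sequencefile
```

So a bzip2 file is clearly distinguishable up front; the bug is
presumably in how the split is then read, not in the detection itself.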
Thanks for your help!
For future reference: https://issues.apache.org/jira/browse/HADOOP-6290