I'm playing around with dumbo. I tried running the wordcount example
with input being a directory containing four bzip2 files, each of them
being around 110MiB in size.
I'm running Cloudera hadoop 0.18.3 which has the backported bzip2
support, but without splitting support. So, when I run normal streaming
jobs against this directory, the job is split into 4 maps.
With dumbo and the wordcount example, the job was split into 8 maps.
Very confusing. The output was also doubled, so each file was analysed
in full twice.
Adding -jobconf 'mapred.min.split.size=922337203685775807' to the dumbo
cmdline fixed the problem.
Any ideas on why this problem occurs? I can't reproduce it with small
files; it seems the input files have to be large, probably larger than
the dfs.block.size (set to 64MiB in my case).
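For reference, the "larger than dfs.block.size" observation matches how
FileInputFormat sizes splits in Hadoop 0.18: split size is
max(minSplitSize, min(goalSize, blockSize)), and a file is then cut into
roughly fileSize/splitSize pieces. A simplified Python sketch (ignoring
the SPLIT_SLOP fudge factor, and assuming splitability is being ignored,
which seems to be what happens here):

```python
# Simplified sketch of Hadoop 0.18 FileInputFormat split sizing:
# split_size = max(min_split_size, min(goal_size, block_size)).

def compute_split_size(goal_size, min_split_size, block_size):
    return max(min_split_size, min(goal_size, block_size))

def num_splits(file_size, split_size):
    # ceil(file_size / split_size), ignoring the SPLIT_SLOP slack.
    return -(-file_size // split_size)

block = 64 * 1024 * 1024   # dfs.block.size = 64 MiB
file_size = 118392538      # ~113 MiB, like file1.log.bz2

# Default min split size (tiny): split size is capped at the block size,
# so a >64 MiB file yields 2 splits if splitability is not honoured.
print(num_splits(file_size, compute_split_size(file_size, 1, block)))  # 2

# With mapred.min.split.size set huge, one split covers the whole file.
huge = 922337203685775807
print(num_splits(file_size, compute_split_size(file_size, huge, block)))  # 1
```

That would explain both the doubled map count (2 splits per ~113 MiB
file) and why the huge mapred.min.split.size value works around it.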
Thanks,
\EF
--
Erik Forsberg <fors...@opera.com>
Developer, Opera Software - http://www.opera.com/
>
> Hey Erik,
>
> Never tried the bzip2 stuff (since we use LZO), but I can't think of
> any reason why "normal" Streaming would behave differently from Dumbo.
Me neither. I was actually fully prepared to see the problem gone when
trying again after a good night's sleep, but unfortunately, it's still
there.
> I could try to look for something suspicious if you can show me the
> first few lines Dumbo outputs (in particular the line that begins with
> "EXEC:") though.
bin/dumbo start experiments/dumbo/wordcount.py -input dumbo/input \
    -output dumbo/output -hadoop /usr/lib/hadoop

EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop \
    jar /usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-6cloudera0.3.0-streaming.jar \
    -input 'dumbo/input' -output 'dumbo/output' \
    -mapper 'python -m wordcount map 0 262144000' \
    -reducer 'python -m wordcount red 0 262144000' \
    -jobconf 'stream.map.input=typedbytes' \
    -jobconf 'stream.reduce.input=typedbytes' \
    -jobconf 'stream.map.output=typedbytes' \
    -jobconf 'stream.reduce.output=typedbytes' \
    -jobconf 'mapred.job.name=wordcount.py (1/1)' \
    -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' \
    -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' \
    -cmdenv 'PYTHONPATH=dumbo-0.21.21-py2.5.egg:typedbytes-0.3.6-py2.5.egg' \
    -file '/home/forsberg/dev/dumbotest/experiments/dumbo/wordcount.py' \
    -file '/home/forsberg/dev/dumbotest/lib/dumbo-0.21.21-py2.5.egg' \
    -file '/home/forsberg/dev/dumbotest/lib/typedbytes-0.3.6-py2.5.egg'
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/home/forsberg/dev/dumbotest/experiments/dumbo/wordcount.py, /home/forsberg/dev/dumbotest/lib/dumbo-0.21.21-py2.5.egg, /home/forsberg/dev/dumbotest/lib/typedbytes-0.3.6-py2.5.egg, /var/lib/hadoop/cache/forsberg/hadoop-unjar14949/] [] /tmp/streamjob14950.jar tmpDir=null
09/09/29 08:21:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/09/29 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 2
09/09/29 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 2
09/09/29 08:21:46 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/cache/forsberg/mapred/local]
09/09/29 08:21:46 INFO streaming.StreamJob: Running job: job_200909281311_0025
09/09/29 08:21:46 INFO streaming.StreamJob: To kill this job, run:
09/09/29 08:21:46 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_200909281311_0025
09/09/29 08:21:46 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200909281311_0025
09/09/29 08:21:47 INFO streaming.StreamJob: map 0% reduce 0%
...
09/09/29 08:46:55 INFO streaming.StreamJob: map 100% reduce 100%
09/09/29 08:46:55 INFO streaming.StreamJob: Job complete: job_200909281311_0025
09/09/29 08:46:55 INFO streaming.StreamJob: Output: dumbo/output
dumbo/input in my HDFS has the following contents:
$ hadoop dfs -ls dumbo/input
Found 2 items
-rw-r--r--   1 forsberg supergroup  118392538 2009-09-29 08:16 /user/forsberg/dumbo/input/file1.log.bz2
-rw-r--r--   1 forsberg supergroup  119518613 2009-09-29 08:17 /user/forsberg/dumbo/input/file2.log.bz2
Yet, there are 4 mappers started. Weird.
My mapper.py:
import os
import sys

libpath = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0]),
                                       "../../lib"))
sys.path = [libpath] + sys.path

from pkg_resources import require
require("dumbo")
import dumbo

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    dumbo.run(mapper, reducer)
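For what it's worth, mapper and reducer like the ones above can be
sanity-checked locally without Hadoop at all. A hypothetical little
harness (not part of dumbo) that mimics the map/shuffle/reduce phases:

```python
# Minimal local harness for a word-count mapper/reducer (no Hadoop).
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(lines):
    # Map phase: feed each input line through the mapper.
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    # Shuffle phase: sort and group by key, like Hadoop's shuffle.
    mapped.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return dict(out)

print(run_local(["hello world", "hello dumbo"]))
# {'dumbo': 1, 'hello': 2, 'world': 1}
```

Handy for ruling out logic bugs before blaming the framework, which in
this case points the finger firmly at the input handling.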
I also ran the same wordcount program against an input directory where
the same files were uncompressed, and the result is the same as
yesterday: a word that is counted 84 times against the compressed files
is counted 42 times against the uncompressed ones.
Weird. Could be some hadoop streaming bug, I guess.
> Weird. Could be some hadoop streaming bug, I guess.
An additional note: I have not applied MAPREDUCE-764 on this cluster.
Don't know if it's related, just thought I should mention it.
>
> Not having applied MAPREDUCE-764 shouldn't be related I think.
>
> I wonder if it has to do with AutoInputFormat. Could you try to run
> the Dumbo program with the option "-inputformat text" and see if that
> helps?
That helps.
> Another way of checking if it's related to AutoInputFormat would be to
> run the "normal" Streaming program with the option "-inputformat
> org.apache.hadoop.streaming.AutoInputFormat" and see if you also get
> the weirdness then.
That gives me the weirdness.
So, some kind of problem with AutoInputFormat and bzip2-compressed
files, then. I'll create a bug report in the Hadoop Jira.
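As I understand it, AutoInputFormat decides between SequenceFile and
text handling by peeking at the leading bytes of each split, which may
interact badly with a compressed, non-splittable stream. A rough
illustration of that kind of magic-byte sniffing (the `SEQ` and `BZh`
prefixes are the real magic numbers for Hadoop SequenceFiles and bzip2
streams; the classifier function itself is a hypothetical sketch, not
the actual AutoInputFormat code):

```python
# Rough sketch of magic-byte sniffing, AutoInputFormat-style.
# 'SEQ' and 'BZh' are real magic prefixes; sniff() is hypothetical.
import bz2

SEQFILE_MAGIC = b"SEQ"   # Hadoop SequenceFile header
BZIP2_MAGIC = b"BZh"     # bzip2 stream header

def sniff(header):
    if header.startswith(SEQFILE_MAGIC):
        return "sequencefile"
    if header.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "text"

print(sniff(bz2.compress(b"hello world")[:4]))  # bzip2
print(sniff(b"SEQ\x06"))                        # sequencefile
```

So a bzip2 file is clearly distinguishable up front; the bug is
presumably in how the split is then read, not in the detection itself.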
Thanks for your help!
For future reference: https://issues.apache.org/jira/browse/HADOOP-6290