Dumbo on EMR / S3


Tims

Feb 17, 2012, 11:48:07 AM
to dumbo-user
I know people here are using dumbo with elastic mapreduce, but I can't
get it to work.

I start my instances like so:
./elastic-mapreduce --create --num-instances 3 --alive --ami-version 2.0.4

Then I ssh to the master and install dumbo like so:
wget -O ez_setup.py http://bit.ly/ezsetup
sudo python ez_setup.py dumbo

I copy over my script and run it with this:
dumbo start count.py -hadoop /home/hadoop \
-input s3://<bucket>/logs/weblogs/2012/02/17/ \
-output s3://<bucket>/output/count

but my tasks fail with this:
FAILED
java.lang.IllegalArgumentException: This file system object (hdfs://10.55.79.18:9000) does not support access to the request path `s3://songkick/logs/weblogs/2012/02/17/file-20120217-001349578+0000.708358639498509.00000043` You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
etc..

Any ideas?

Gilles

Feb 19, 2012, 11:50:39 AM
to dumbo-user
Hello,

This is a known problem, see:
http://nubetech.co/amazon-elastic-map-reduce-lessons-learnt

I don't think this is a dumbo issue but more related to the way hdfs
works on emr/s3.

I hope this helps,

-Gilles


Gilles

Feb 21, 2012, 12:39:52 PM
to dumbo-user
Sorry for the previous post, it was a "too fast" answer...
In fact I have the same problem when running a multiple-item job. The
problem is in org.apache.hadoop.streaming.AutoInputFormat, which does
not seem to support S3 correctly.
If you force it to be
org.apache.hadoop.mapred.SequenceFileAsTextInputFormat, it works fine
(at least for me).

You need to override the 'code' input format in /etc/dumbo.conf:

[inputformats]
code: org.apache.hadoop.mapred.SequenceFileAsTextInputFormat

-Gilles
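
The override above can also be applied from a script. This is only a minimal sketch, assuming dumbo parses its config file with Python's standard INI parser (which accepts both `code: ...` and `code = ...`). The helper name `set_code_inputformat` is made up for illustration and is not part of dumbo.

```python
import configparser

# Class name taken from the post above.
SEQ_AS_TEXT = "org.apache.hadoop.mapred.SequenceFileAsTextInputFormat"


def set_code_inputformat(conf_path, classname=SEQ_AS_TEXT):
    """Set the 'code' entry of [inputformats] in an INI-style config file.

    Hypothetical helper: existing sections and options are kept, but
    comments in the file are lost when it is rewritten.
    """
    parser = configparser.RawConfigParser()
    parser.read(conf_path)  # a missing file is silently skipped
    if not parser.has_section("inputformats"):
        parser.add_section("inputformats")
    parser.set("inputformats", "code", classname)
    with open(conf_path, "w") as handle:
        parser.write(handle)  # written as "code = <classname>"
    return parser.get("inputformats", "code")
```

Calling `set_code_inputformat("/etc/dumbo.conf")` would need root on the EMR master. Trying it on a copy of the file first is probably the safer route.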

Klaas Bosteels

Feb 21, 2012, 4:22:31 PM
to dumbo...@googlegroups.com
Interesting. Maybe we should file a bug about this for Hadoop MapReduce?

-K

--
You received this message because you are subscribed to the Google Groups "dumbo-user" group.
To post to this group, send email to dumbo...@googlegroups.com.
To unsubscribe from this group, send email to dumbo-user+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dumbo-user?hl=en.


Gilles

Mar 2, 2012, 8:21:10 AM
to dumbo-user
This is a basic case just using hadoop streaming on EMR:

This call crashes:
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
-input s3://bom-test/test/querylist -output test \
-inputformat org.apache.hadoop.streaming.AutoInputFormat \
-mapper '/bin/cat' -reducer '/bin/cat'


2012-02-23 14:22:38,351 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 36 on 9001): Error from attempt_201202231342_0001_m_000000_0: java.lang.IllegalArgumentException: This file system object (hdfs://10.228.77.162:9000) does not support access to the request path 's3://bom-test/test/querylist' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:372)
    at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:417)
    at org.apache.hadoop.streaming.AutoInputFormat.getRecordReader(AutoInputFormat.java:56)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:199)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:423)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

This call works:
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
-input /test/querylist -output test \
-inputformat org.apache.hadoop.streaming.AutoInputFormat \
-mapper '/bin/cat' -reducer '/bin/cat'

I can file a bug on this. Which list do you recommend?

-Gilles
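
To illustrate why the s3:// input fails while the plain /test/querylist path works, here is a toy Python model of the check named in the trace (FileSystem.checkPath). It is a simplified sketch, not Hadoop's actual logic. The function names `supports_path` and `filesystem_for` are invented for this example, and the real check has more cases than are modelled here.

```python
from urllib.parse import urlsplit


def supports_path(fs_uri, path):
    """Toy model of the check: may a filesystem rooted at fs_uri open path?

    Assumption for this sketch: a path without a scheme is always
    accepted, otherwise scheme and authority must both match.
    """
    fs = urlsplit(fs_uri)
    target = urlsplit(path)
    if not target.scheme:
        return True
    return (target.scheme.lower(), target.netloc.lower()) == (
        fs.scheme.lower(),
        fs.netloc.lower(),
    )


def filesystem_for(path, default_fs_uri):
    """Toy model of FileSystem.get(uri, conf): pick the filesystem from the path.

    Falls back to the default filesystem when the path has no scheme,
    which is roughly what FileSystem.get(conf) always returns.
    """
    target = urlsplit(path)
    if not target.scheme:
        return default_fs_uri
    return "%s://%s" % (target.scheme, target.netloc)


default_fs = "hdfs://10.228.77.162:9000"
print(supports_path(default_fs, "/test/querylist"))
print(supports_path(default_fs, "s3://bom-test/test/querylist"))
print(filesystem_for("s3://bom-test/test/querylist", default_fs))
```

In this model the default HDFS filesystem accepts the scheme-less path and rejects the s3:// one, which matches the two calls above. Choosing the filesystem from the path itself, as the error message suggests, would avoid the mismatch.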

Klaas Bosteels

Mar 20, 2012, 8:19:40 AM
to dumbo...@googlegroups.com
If the problem really is that AutoInputFormat doesn't support S3 then you could simply file a bug for this in the Hadoop MapReduce JIRA I guess...

-K

David Gleich

May 10, 2012, 4:20:08 PM
to dumbo...@googlegroups.com
Was there ever a bug posted about this?  I'm running into the same issue...

The error that comes up reminds me of this https://issues.apache.org/jira/browse/MAPREDUCE-1293 but I suppose that's long since fixed.

David

Klaas Bosteels

May 11, 2012, 2:56:47 AM
to dumbo...@googlegroups.com
Hey David,

That seems to be the bug that people in this thread were hitting, yeah. So apparently it's fixed in 0.21, but presumably you're still on 0.20 (which is now called 1.0) like most of us? CDH3 is based on 0.20.2, for instance.

-K  


Igor Gatis

Nov 10, 2013, 11:10:59 PM
to dumbo...@googlegroups.com
Is there any workaround? I tried setting inputformat to SequenceFileInput, but Hadoop seems to stream the file content as if it were a text file (e.g. SEQ....\n).

Klaas Bosteels

Nov 13, 2013, 2:15:06 AM
to dumbo...@googlegroups.com
How did you set it? Did you provide the full classname, i.e.

-inputformat org.apache.hadoop.mapred.SequenceFileInputFormat

?

-K
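
A mapper can check for the symptom described above (lines starting with SEQ) itself. The sketch below assumes only that SequenceFiles begin with the three bytes "SEQ" followed by a version byte. The function name `looks_like_raw_sequencefile` is made up for this example.

```python
# Magic bytes at the start of a Hadoop SequenceFile ("SEQ" + one version byte).
SEQ_MAGIC = b"SEQ"


def looks_like_raw_sequencefile(first_bytes):
    """Return True if the bytes look like an undecoded SequenceFile header.

    If a streaming mapper sees this at the start of its input, the input
    format is probably passing raw file bytes instead of decoded records.
    """
    return len(first_bytes) >= 4 and first_bytes[:3] == SEQ_MAGIC
```

For debugging, a mapper could test the first few bytes read from standard input and write a warning to stderr when the check returns True.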

