From: Klaas Bosteels <klaas.boste...@gmail.com>
Date: Tue, 24 Nov 2009 19:24:49 +0100
Local: Tues, Nov 24 2009 1:24 pm
Subject: Fwd: Getting a nullPointerException when using dumbo with hadoop
FYI
> Hi Klaas.
On Nov 24, 2009, at 7:28, "Klaas Bosteels" <klaas.boste...@gmail.com> wrote:
> Just a quick note, we had to roll back the AMI that had dumbo support.
> It should be back on there before Monday, though.
> Sorry for the inconvenience, I should have emailed you sooner.
> Regards,
> Andrew
Nitin,
Now you should be able to run Dumbo jobs on Elastic MapReduce. To
elastic-mapreduce --create --alive
SSH into the cluster using your EC2 keypair as user hadoop and install
wget http://peak.telecommunity.com/dist/ez_setup.py
Then you can run your Dumbo scripts. I was able to run the ipcount.py
dumbo start ipcount.py -hadoop /home/hadoop -input
The -hadoop option is important. At this point I haven't created an
I'll try to blog about this on http://dumbotics.com once the automatic
On 24 Nov 2009, at 15:34, Nitin Madnani wrote:
Klaas,
Thanks for getting back to me! Yeah, I think the Python version may be
BTW, I am trying to do all this for my PyCon 2010 talk which is on
BTW, just to clarify, the regular streaming command (non-module
If nothing works on this cluster, I will have to use EC2, I guess.
Nitin
On Tue, Nov 24, 2009 at 5:04 AM, Klaas Bosteels
Here are the answers to your questions:
(a) Why is dumbo doing the python -m thing instead of just specifying
the filename like I did in the streaming command above (and as the
streaming page on apache's site does)?
Because dumbo also allows you to run python modules instead of .py
files:
http://dumbo.assembla.com/spaces/dumbo/tickets/50
(b) What does the word 'map' refer to after 'wordcount'?
It specifies that the mapper (and not the reducer) has to be executed.
(c) What are the two numbers 0 and 262144000?
The first number is the iteration number and the second one is the
memory limit.
(d) Where is a file copied when used with the -file option? Maybe
python cannot find the module because it's copied somewhere weird?
It should be put in the current working directory.
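
As a rough illustration of answers (b) and (c), the arguments that follow the module name can be pulled apart as below. This is only a sketch: parse_task_command and the TaskArgs field names are hypothetical and not part of Dumbo, and reading the memory limit as a byte count is an assumption (the value 262144000 happens to equal 250 * 1024 * 1024).

```python
import shlex
from collections import namedtuple

# Hypothetical container; the field names follow the answers above,
# not any actual Dumbo data structure.
TaskArgs = namedtuple("TaskArgs", ["module", "phase", "iteration", "memlimit"])


def parse_task_command(command):
    """Split a 'python -m <module> <phase> <iteration> <memlimit>' string."""
    tokens = shlex.split(command)
    if len(tokens) != 6 or tokens[:2] != ["python", "-m"]:
        raise ValueError("unexpected command: %r" % (command,))
    module, phase, iteration, memlimit = tokens[2:]
    return TaskArgs(module, phase, int(iteration), int(memlimit))


mapper = parse_task_command("python -m wordcount map 0 262144000")
reducer = parse_task_command("python -m wordcount red 0 262144000")
print(mapper)
print(reducer.phase)
# Assumption: if the limit is a byte count, this corresponds to 250 MiB.
print(mapper.memlimit == 250 * 1024 * 1024)
```
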
-Klaas
On 23 Nov 2009, at 21:40, Nitin Madnani wrote:
So, I downloaded and used the 0.20.1+152 Cloudera hadoop distribution
and the error still persists.
I tried a toy python program directly using the streaming interface as
follows and it works fine:
bin/hadoop jar /Users/nmadnani/hadoop-0.20.1+152/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
  -input /tmp/nmadnani/test.txt \
  -output /tmp/nmadnani/demo2 \
  -mapper ~/dumbo-workspace/test.py \
  -file ~/dumbo-workspace/test.py \
  -jobconf mapred.reduce.tasks=0
where test.py is just:
#!/usr/bin/python
import sys

def main(argv):
    # Emit the length of each input line (including its trailing newline).
    for line in sys.stdin:
        print(len(line))

if __name__ == "__main__":
    main(sys.argv)
So, I looked at how dumbo is invoking the mapper etc. and it uses the
following: -mapper 'python -m wordcount map 0 262144000'.
So, I changed my simple streaming command above to use "-mapper
'python -m test'" instead of "-mapper ~/dumbo-workspace/test.py" and
my "module not found" error reappeared. So, the problem lies with the
way that the mapper is invoked (using the 'python -m' invocation)
rather than just specifying the file.
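
The difference can be reproduced locally with a small sketch. 'python -m' looks the module up on sys.path, which starts with the directory the interpreter is launched from, so the lookup fails whenever the .py file is neither in the working directory nor on PYTHONPATH. The module name toymapper_demo and the temporary directories are made up for the illustration; this only mimics the lookup and does not involve Hadoop at all.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical module name, chosen to avoid clashing with anything installed.
MODULE = "toymapper_demo"
SOURCE = "import sys\nfor line in sys.stdin:\n    print(len(line))\n"


def run_module(cwd):
    """Run 'python -m MODULE' from cwd and return (exit code, stdout, stderr)."""
    env = dict(os.environ)
    # Drop settings that would change how sys.path is built.
    env.pop("PYTHONPATH", None)
    env.pop("PYTHONSAFEPATH", None)
    proc = subprocess.run(
        [sys.executable, "-m", MODULE],
        cwd=cwd, env=env, input="hello\n",
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


with tempfile.TemporaryDirectory() as with_file, \
        tempfile.TemporaryDirectory() as without_file:
    with open(os.path.join(with_file, MODULE + ".py"), "w") as handle:
        handle.write(SOURCE)
    found = run_module(with_file)       # file is in the working directory
    missing = run_module(without_file)  # same command, file not present

print(found)
print(missing[0], missing[2].strip())
```

The second run fails with a "No module named" message even though the command line is identical, which matches the symptom when the file shipped with -file does not land where the task starts.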
This leads me to the following questions:
(a) Why is dumbo doing the python -m thing instead of just specifying
the filename like I did in the streaming command above (and as the
streaming page on apache's site does)?
(b) What does the word 'map' refer to after 'wordcount'?
(c) What are the two numbers 0 and 262144000?
(d) Where is a file copied when used with the -file option? Maybe
python cannot find the module because it's copied somewhere weird?
Thanks!
Nitin
On Nov 23, 9:25 am, Klaas Bosteels <klaas.boste...@gmail.com> wrote:
Upstream 0.20.1 should work fine as far as I know. I guess it must be
some kind of configuration issue that keeps Streaming from properly
sending along files with jobs then, but I'm afraid I can't immediately
think of a specific cause anymore if it's not related to
MAPREDUCE-967.. :/
-Klaas
On Mon, Nov 23, 2009 at 2:53 PM, Nitin Madnani <nmadn...@gmail.com>
wrote:
Klaas,
I was using that 0.20.1+152 from Cloudera but then something wasn't
working right. So, I downloaded 0.20.1 from Apache and applied the
four patches you mention on the wiki. Is there a reason why that
won't work?
Nitin
On Mon, Nov 23, 2009 at 3:08 AM, Klaas Bosteels
then you should upgrade to +152 in which they reverted the patch for
MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality
on which Dumbo relies.
-Klaas
On 23 Nov 2009, at 03:42, Nitin Madnani wrote:
Here's what stderr says:
/usr/bin/python: module wordcount not found
I am not sure what's going on. Here's the actual java streaming
command line that's generated when I run the python command:
/Users/nmadnani/hadoop-0.20.1/bin/hadoop jar /Users/nmadnani/hadoop-0.20.1/build/contrib/streaming/hadoop-0.20.1-streaming.jar \
  -input '/tmp/nmadnani/bible+shakes.nopunc' \
  -output '/tmp/nmadnani/demo2' \
  -mapper 'python -m wordcount map 0 262144000' \
  -reducer 'python -m wordcount red 0 262144000' \
  -jobconf 'stream.map.input=typedbytes' \
  -jobconf 'stream.reduce.input=typedbytes' \
  -jobconf 'stream.map.output=typedbytes' \
  -jobconf 'stream.reduce.output=typedbytes' \
  -jobconf 'mapred.job.name=wordcount.py (1/1)' \
  -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' \
  -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' \
  -cmdenv 'PYTHONPATH=dumbo-0.21.21-py2.5.egg:typedbytes-0.3.6-py2.5.egg' \
  -file '/Users/nmadnani/dumbo-workspace/wordcount.py' \
  -file '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/dumbo-0.21.21-py2.5.egg' \
  -file '/Users/nmadnani/typedbytes-0.3.6-py2.5.egg'
I see that '-file /Users/nmadnani/dumbo-workspace/wordcount.py' is
there so why can't it find it on the server?
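
One way to check where the shipped files end up is to run a throwaway mapper that reports what the task actually sees. A minimal sketch follows; passthrough_with_diagnostics is a hypothetical helper, not something Dumbo or Hadoop Streaming provides, and the report goes to stderr so that it lands in the task logs without polluting the job output. The in-memory dry run at the bottom assumes Python 3.

```python
import io
import os
import sys


def passthrough_with_diagnostics(stdin, stdout, stderr):
    """Report the working directory and module search path, then echo input."""
    stderr.write("cwd: %s\n" % os.getcwd())
    stderr.write("files in cwd: %s\n" % sorted(os.listdir(".")))
    stderr.write("PYTHONPATH: %s\n" % os.environ.get("PYTHONPATH", "(unset)"))
    stderr.write("sys.path: %s\n" % sys.path)
    for line in stdin:
        stdout.write(line)


# Local dry run with in-memory streams instead of a real task.
out, err = io.StringIO(), io.StringIO()
passthrough_with_diagnostics(io.StringIO("a\nb\n"), out, err)
print(err.getvalue())
```

In a real streaming script the call would be passthrough_with_diagnostics(sys.stdin, sys.stdout, sys.stderr). If wordcount.py does not show up in the "files in cwd" line of the task's stderr, the file was probably never shipped to the task's working directory.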
Nitin
On Nov 22, 5:38 am, Tim Sell <trs...@gmail.com> wrote:
From the web interface you can click through to get the stdout/err for
each failed map. What does that look like? Errors in the python code
often show as NPE in java.
2009/11/22 Nitin Madnani <nmadn...@gmail.com>:
Hi,
I am trying to use dumbo with an academic cluster that our university
has access to. I have downloaded hadoop 0.20.1, patched it as
explained in the dumbo installation instructions and it all works
fine. I also ran the streaming unit tests and they all pass.
However, when I use dumbo to run the equivalent job (with the same
input and output), it doesn't work. My logs show the following:
---
java.lang.NullPointerException
  at org.apache.hadoop.io.BytesWritable.<init>(BytesWritable.java:54)
  at org.apache.hadoop.typedbytes.TypedBytesWritable.<init>(TypedBytesWritable.java:41)
  at org.apache.hadoop.streaming.io.TypedBytesOutputReader.getLastOutput(TypedBytesOutputReader.java:73)
  at org.apache.hadoop.streaming.PipeMapRed.getContext(PipeMapRed.java:607)
  at org.apache.hadoop.streaming.PipeMapRed.logFailure(PipeMapRed.java:634)
  at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:122)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)
---
Of course, the exact same python program runs just fine in standalone
mode on the same input. Any help would be greatly appreciated!
Thanks!
Nitin
--
You received this message because you are subscribed to the Google
Groups "dumbo-user" group.
To post to this group, send email to dumbo-user@googlegroups.com.
To unsubscribe from this group, send email to
.
For more options, visit this group
.
--
Got Blog?
http://greenideas.blogspot.com