Getting a NullPointerException when using dumbo with hadoop

Nitin Madnani

Nov 21, 2009, 10:29:46 PM

to dumbo-user

Hi,

I am trying to use dumbo with an academic cluster that our university
has access to. I have downloaded hadoop 0.20.1, patched it as
explained in the dumbo installation instructions and it all works
fine. I also ran the streaming unit tests and they all pass.

However, when I use dumbo to run the equivalent job (with the same
input and output), it doesn't work. My logs show the following:

---
java.lang.NullPointerException
    at org.apache.hadoop.io.BytesWritable.<init>(BytesWritable.java:54)
    at org.apache.hadoop.typedbytes.TypedBytesWritable.<init>(TypedBytesWritable.java:41)
    at org.apache.hadoop.streaming.io.TypedBytesOutputReader.getLastOutput(TypedBytesOutputReader.java:73)
    at org.apache.hadoop.streaming.PipeMapRed.getContext(PipeMapRed.java:607)
    at org.apache.hadoop.streaming.PipeMapRed.logFailure(PipeMapRed.java:634)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:122)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
---

Of course, the exact same python program runs just fine in standalone
mode on the same input. Any help would be greatly appreciated!

Thanks!
Nitin

Tim Sell

Nov 22, 2009, 5:38:53 AM

to dumbo...@googlegroups.com

From the web interface you can click through to get the stdout/err for
each failed map. What does that look like? Errors in the python code
often show as NPE in java.

2009/11/22 Nitin Madnani <nmad...@gmail.com>:

> --
>
> You received this message because you are subscribed to the Google Groups "dumbo-user" group.
> To post to this group, send email to dumbo...@googlegroups.com.
> To unsubscribe from this group, send email to dumbo-user+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/dumbo-user?hl=.
>
>
>

Nitin Madnani

Nov 22, 2009, 9:42:37 PM

to dumbo-user

Here's what stderr says:

/usr/bin/python: module wordcount not found

I am not sure what's going on. Here's the actual java streaming
command line that's generated when I run the python command:

/Users/nmadnani/hadoop-0.20.1/bin/hadoop jar \
  /Users/nmadnani/hadoop-0.20.1/build/contrib/streaming/hadoop-0.20.1-streaming.jar \
  -input '/tmp/nmadnani/bible+shakes.nopunc' \
  -output '/tmp/nmadnani/demo2' \
  -mapper 'python -m wordcount map 0 262144000' \
  -reducer 'python -m wordcount red 0 262144000' \
  -jobconf 'stream.map.input=typedbytes' \
  -jobconf 'stream.reduce.input=typedbytes' \
  -jobconf 'stream.map.output=typedbytes' \
  -jobconf 'stream.reduce.output=typedbytes' \
  -jobconf 'mapred.job.name=wordcount.py (1/1)' \
  -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' \
  -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' \
  -cmdenv 'PYTHONPATH=dumbo-0.21.21-py2.5.egg:typedbytes-0.3.6-py2.5.egg' \
  -file '/Users/nmadnani/dumbo-workspace/wordcount.py' \
  -file '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/dumbo-0.21.21-py2.5.egg' \
  -file '/Users/nmadnani/typedbytes-0.3.6-py2.5.egg'

I see that '-file /Users/nmadnani/dumbo-workspace/wordcount.py' is
there so why can't it find it on the server?

Nitin


Klaas Bosteels

Nov 23, 2009, 3:08:28 AM

to dumbo...@googlegroups.com

Are you using Cloudera's hadoop-0.20.1+133 by any chance? If so, then
you should upgrade to +152 in which they reverted the patch for
MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality
on which Dumbo relies.

-Klaas

Nitin Madnani

Nov 23, 2009, 8:53:11 AM

to dumbo...@googlegroups.com

Klaas,

I was using that 0.20.1+152 from Cloudera but then something wasn't
working right. So, I downloaded 0.20.1 from Apache and applied the
four patches you mention on the wiki. Is there a reason why that won't
work?

Nitin

--
Got Blog?
http://greenideas.blogspot.com

Tim Sell

Nov 23, 2009, 9:01:59 AM

to dumbo...@googlegroups.com

That is weird :/
if it couldn't find the file it wouldn't even start, surely.


Klaas Bosteels

Nov 23, 2009, 9:25:20 AM

to dumbo...@googlegroups.com

Upstream 0.20.1 should work fine as far as I know. I guess it must be
some kind of configuration issue that keeps Streaming from properly
sending along files with jobs then, but I'm afraid I can't immediately
think of a specific cause anymore if it's not related to
MAPREDUCE-967.. :/

-Klaas

Nitin Madnani

Nov 23, 2009, 3:40:49 PM

to dumbo-user

So, I downloaded and used the 0.20.1+152 Cloudera hadoop distribution
and the error still persists.

I tried a toy python program directly using the streaming interface as
follows and it works fine:

bin/hadoop jar /Users/nmadnani/hadoop-0.20.1+152/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
  -input /tmp/nmadnani/test.txt \
  -output /tmp/nmadnani/demo2 \
  -mapper ~/dumbo-workspace/test.py \
  -file ~/dumbo-workspace/test.py \
  -jobconf mapred.reduce.tasks=0

where test.py is just:

#!/usr/bin/python

import sys

def main(argv):
    for line in sys.stdin:
        print len(line)

if __name__ == "__main__":
    main(sys.argv)

So, I looked at how dumbo is invoking the mapper etc. and it uses the
following: -mapper 'python -m wordcount map 0 262144000'.

So, I changed my simple streaming command above to use "-mapper
'python -m test'" instead of "-mapper ~/dumbo-workspace/test.py" and
my "module not found" error reappeared. So, the problem lies with the
way that the mapper is invoked (using the 'python -m' invocation)
rather than just specifying the file.

This leads me to the following questions:

(a) Why is dumbo doing the python -m thing instead of just specifying
the filename like I did in the streaming command above (and as the
streaming page on apache's site does)?
(b) What does the word 'map' refer to after 'wordcount'?
(c) What are the two numbers 0 and 262144000?
(d) Where is a file copied when used with the -file option? Maybe
python cannot find the module because it's copied somewhere weird?

Thanks!
Nitin
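
One way to investigate question (d) is to ship a small diagnostic script as the mapper and read its stderr from the web interface. The following is only a sketch under assumptions: the file name check_env.py and the function describe_environment are hypothetical and not part of Dumbo or Hadoop, and the script sticks to old-style syntax so that it can also run on an old interpreter.

```python
#!/usr/bin/env python
# Hypothetical diagnostic mapper (check_env.py), not part of Dumbo or Hadoop.
# It reports where the task runs and whether a shipped module is importable.
import os
import sys


def describe_environment(module_name):
    """Return lines describing the interpreter, working directory and import result."""
    names = os.listdir(".")
    names.sort()
    lines = [
        "python version: %s" % sys.version.split()[0],
        "working directory: %s" % os.getcwd(),
        "files here: %s" % ", ".join(names),
        "sys.path: %s" % os.pathsep.join(sys.path),
    ]
    try:
        __import__(module_name)
        lines.append("import %s: ok" % module_name)
    except Exception:
        # sys.exc_info() keeps this syntax valid on very old interpreters too.
        error = sys.exc_info()[1]
        lines.append("import %s: FAILED (%s)" % (module_name, error))
    return lines


def main(argv):
    # Module to look for; "wordcount" is the script name used in this thread.
    module_name = "wordcount"
    if len(argv) > 1:
        module_name = argv[1]
    for entry in describe_environment(module_name):
        sys.stderr.write(entry + "\n")
    # Drain stdin so the streaming framework does not hit a closed pipe.
    for _ in sys.stdin:
        pass


if __name__ == "__main__":
    main(sys.argv)
```

Shipped with -file and run as -mapper 'python check_env.py wordcount', the task's stderr should show whether wordcount.py landed in the working directory and whether that directory is on sys.path.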


Klaas Bosteels

Nov 24, 2009, 5:04:40 AM

to dumbo...@googlegroups.com

What version of python are you using, Nitin?

Here are the answers to your questions:

> (a) Why is dumbo doing the python -m thing instead of just specifying
> the filename like I did in the streaming command above (and as the
> streaming page on apache's site does)?

Because dumbo also allows you to run python modules instead of .py
files:

http://dumbo.assembla.com/spaces/dumbo/tickets/50

> (b) What does the word 'map' refer to after 'wordcount'?

It specifies that the mapper (and not the reducer) has to be executed.

> (c) What are the two numbers 0 and 262144000?

The first number is the iteration number and the second one is the
memory limit.

> (d) Where is a file copied when used with the -file option? Maybe
> python cannot find the module because it's copied somewhere weird?

It should be put in the current working directory.

-Klaas
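
For reference, 262144000 equals 250 * 1024 * 1024, so if the memory limit is expressed in bytes (an assumption; the thread does not state the unit) it comes to 250 MiB. The sketch below splits a generated mapper command into the parts described above. The helper parse_task_command is hypothetical and is not Dumbo's own code.

```python
# Hypothetical helper, not Dumbo's own code: splits a generated command such as
# "python -m wordcount map 0 262144000" into the parts explained in this thread.
def parse_task_command(command):
    parts = command.split()
    # Expected layout: python -m <module> <map|red> <iteration> <memory limit>
    if len(parts) != 6 or parts[1] != "-m":
        raise ValueError("unexpected command layout: %r" % command)
    return {
        "module": parts[2],
        "phase": parts[3],
        "iteration": int(parts[4]),
        "memory_limit": int(parts[5]),
    }


task = parse_task_command("python -m wordcount map 0 262144000")
# Assumption: the limit is given in bytes, so this converts it to MiB.
limit_mib = task["memory_limit"] / (1024.0 * 1024.0)
print("%s phase of %s, iteration %d, limit %.0f MiB"
      % (task["phase"], task["module"], task["iteration"], limit_mib))
# prints "map phase of wordcount, iteration 0, limit 250 MiB"
```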


Nitin Madnani

Nov 24, 2009, 9:34:58 AM

to dumbo...@googlegroups.com

Klaas,

Thanks for getting back to me! Yeah, I think the Python version may be
the kicker here. It's Python 2.3! Do you think that's the problem?

BTW, I am trying to do all this for my PyCon 2010 talk which is on
doing large scale natural language processing using NLTK and Dumbo.

BTW, just to clarify, the regular streaming command (non-module
specification) without using Dumbo seems to have worked just fine. I
guess I will try taking the command line generated by Dumbo and
modifying it to use the other semantics and see what happens.

If nothing works on this cluster, I will have to use EC2, I guess.

Nitin


Klaas Bosteels

Nov 24, 2009, 10:28:01 AM

to dumbo...@googlegroups.com, an...@amazon.com

Nitin,

It's great to hear that you'll be using dumbo for your pycon talk!

The python version is probably the problem yeah. You could try changing the generated commands manually, but I think you might also run into some other issues if you don't use python 2.5 or newer.
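
A quick way to confirm which interpreter the task nodes actually run is a small version check shipped as a streaming mapper. This is a minimal sketch: the names MINIMUM, meets_minimum and report are illustrative, and the (2, 5) threshold simply encodes the "python 2.5 or newer" advice above rather than any documented Dumbo requirement.

```python
import sys

# Assumption: (2, 5) mirrors the "python 2.5 or newer" advice in this thread.
MINIMUM = (2, 5)


def meets_minimum(version, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


def report(version=None):
    """Build a one-line verdict for the given version (default: this interpreter)."""
    if version is None:
        version = sys.version_info
    label = "%d.%d" % (version[0], version[1])
    if meets_minimum(version):
        return "python %s: ok" % label
    return "python %s: too old, need %d.%d or newer" % (label, MINIMUM[0], MINIMUM[1])


if __name__ == "__main__":
    # Streaming exposes stderr per task in the web interface.
    sys.stderr.write(report() + "\n")
```

Run on a node with Python 2.3, report() would return "python 2.3: too old, need 2.5 or newer".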

Btw, you could also use amazon EMR instead of raw EC2 instances. Here's part of an email I got from one of the amazon guys recently:

Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so:

elastic-mapreduce --create --alive

SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:

wget http://peak.telecommunity.com/dist/ez_setup.py
sudo python ez_setup.py dumbo

Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command.

dumbo start ipcount.py -hadoop /home/hadoop -input s3://anhi-test-data/wordcount/input/ -output s3://anhi-test-data/output/dumbo/wc/

The -hadoop option is important. At this point I haven't created an automatic Dumbo install script, so you'll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.

I'll try to blog about this on http://dumbotics.com once the automatic install script is ready.

-Klaas

Nitin Madnani

Nov 24, 2009, 11:02:36 AM

to dumbo...@googlegroups.com

Klaas

Thanks so much for that info! I will try and give that a whirl!

- Nitin

Klaas Bosteels

Nov 24, 2009, 1:24:49 PM

to dumbo...@googlegroups.com

FYI

> Hi Klaas.
> Just a quick note, we had to roll back the AMI that had dumbo support.
> It should be back on there before Monday, though.
> Sorry for the inconvenience, I should have emailed you sooner.
> Regards,
> Andrew

Klaas

Dec 23, 2009, 4:29:26 AM

to dumbo-user

http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr/

