Message from discussion Fwd: Getting a nullPointerException when using dumbo with hadoop
From: Klaas Bosteels <klaas.boste...@gmail.com>
Date: Tue, 24 Nov 2009 19:24:49 +0100
Local: Tues, Nov 24 2009 1:24 pm
Subject: Fwd: Getting a nullPointerException when using dumbo with hadoop
FYI

> Hi Klaas.
> Just a quick note, we had to roll back the AMI that had dumbo support.
> It should be back on there before Monday, though.
> Sorry for the inconvenience, I should have emailed you sooner.
> Regards,
> Andrew

On Nov 24, 2009, at 7:28, "Klaas Bosteels" <klaas.boste...@gmail.com> wrote:

Nitin,
It's great to hear that you'll be using dumbo for your pycon talk!
The python version is probably the problem yeah. You could try
changing the generated commands manually, but I think you might also
run into some other issues if you don't use python 2.5 or newer.
Btw, you could also use amazon EMR instead of raw EC2 instances.
Here's part of an email I got from one of the amazon guys recently:

Now you should be able to run Dumbo jobs on Elastic MapReduce. To
start a cluster, you can use the Ruby client like so:

elastic-mapreduce --create --alive

SSH into the cluster using your EC2 keypair as user hadoop and install
Dumbo with the following two commands:

wget http://peak.telecommunity.com/dist/ez_setup.py
sudo python ez_setup.py dumbo

Then you can run your Dumbo scripts. I was able to run the ipcount.py
demo with the following command.

dumbo start ipcount.py -hadoop /home/hadoop \
    -input s3://anhi-test-data/wordcount/input/ \
    -output s3://anhi-test-data/output/dumbo/wc/

The -hadoop option is important. At this point I haven't created an
automatic Dumbo install script, so you'll have to install Dumbo by
hand each time you launch the cluster. Fortunately installation is
easy.

I'll try to blog about this on http://dumbotics.com once the automatic
install script is ready.
-Klaas

On 24 Nov 2009, at 15:34, Nitin Madnani wrote:

Klaas,

Thanks for getting back to me! Yeah, I think the Python version may be
the kicker here. It's Python 2.3! Do you think that's the problem?

BTW, I am trying to do all this for my PyCon 2010 talk which is on
doing large scale natural language processing using NLTK and Dumbo.

BTW, just to clarify, the regular streaming command (non-module
specification) without using Dumbo seems to have worked just fine. I
guess I will try taking the command line generated by Dumbo and
modifying it to use the other semantics and see what happens.

If nothing works on this cluster, I will have to use EC2, I guess.

Nitin

On Tue, Nov 24, 2009 at 5:04 AM, Klaas Bosteels <klaas.boste...@gmail.com> wrote:

What version of python are you using, Nitin?

Here are the answers to your questions:

> (a) Why is dumbo doing the python -m thing instead of just specifying
> the filename like I did in the streaming command above (and as the
> streaming page on apache's site does)?

Because dumbo also allows you to run python modules instead of .py
files:
http://dumbo.assembla.com/spaces/dumbo/tickets/50

> (b) What does the word 'map' refer to after 'wordcount'?

It specifies that the mapper (and not the reducer) has to be executed.

> (c) What are the two numbers 0 and 262144000?

The first number is the iteration number and the second one is the
memory limit.

> (d) Where is a file copied when used with the -file option? Maybe
> python cannot find the module because it's copied somewhere weird?

It should be put in the current working directory.
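
To make answers (b) and (c) concrete, here is a minimal sketch of how the arguments that follow `python -m` in the generated command could be read. It is not Dumbo's actual code: `parse_dumbo_task_args` is a hypothetical helper, the accepted phases are only the two that appear in this thread (`map` and `red`), and treating the memory limit as a byte count is an assumption.

```python
def parse_dumbo_task_args(argv):
    """Split e.g. ['wordcount', 'map', '0', '262144000'] into named fields."""
    module, phase, iteration, memlimit = argv
    # Only the phases seen in the generated commands in this thread.
    if phase not in ("map", "red"):
        raise ValueError("unexpected phase: %r" % phase)
    return {
        "module": module,                 # name passed to `python -m`
        "phase": phase,                   # run the mapper or the reducer
        "iteration": int(iteration),      # iteration number
        "memlimit_bytes": int(memlimit),  # memory limit (assumed to be bytes)
    }


task = parse_dumbo_task_args("wordcount map 0 262144000".split())
# Show the limit in MiB, under the bytes assumption.
print(task["phase"], task["iteration"], task["memlimit_bytes"] // (1024 * 1024))
# prints: map 0 250
```

If the limit really is in bytes, 262144000 works out to exactly 250 MiB.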

-Klaas

On 23 Nov 2009, at 21:40, Nitin Madnani wrote:

So, I downloaded and used the 0.20.1+152 Cloudera hadoop distribution
and the error still persists.

I tried a toy python program directly using the streaming interface as
follows and it works fine:

bin/hadoop jar /Users/nmadnani/hadoop-0.20.1+152/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
    -input /tmp/nmadnani/test.txt \
    -output /tmp/nmadnani/demo2 \
    -mapper ~/dumbo-workspace/test.py \
    -file ~/dumbo-workspace/test.py \
    -jobconf mapred.reduce.tasks=0

where test.py is just:

#!/usr/bin/python

import sys

def main(argv):
    for line in sys.stdin:
        print len(line)

if __name__ == "__main__":
    main(sys.argv)

So, I looked at how dumbo is invoking the mapper etc. and it uses the
following: -mapper 'python -m wordcount map 0 262144000'.

So, I changed my simple streaming command above to use "-mapper
'python -m test'" instead of "-mapper ~/dumbo-workspace/test.py" and
my "module not found" error reappeared. So, the problem lies with the
way that the mapper is invoked (using the 'python -m' invocation)
rather than just specifying the file.
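
The difference between the two invocations can be reproduced without Hadoop. With `python -m`, the interpreter looks the module up on `sys.path`, which starts with the current working directory, so the lookup only succeeds if the file is present where the task runs. The sketch below is written for a current Python 3 interpreter, not the Python versions discussed in the thread, and the module name `dumbo_thread_demo_mapper` is a made-up placeholder.

```python
import os
import subprocess
import sys
import tempfile


def run_as_module(module_name, cwd):
    """Run `python -m module_name` from cwd and return the exit status."""
    result = subprocess.run(
        [sys.executable, "-m", module_name],
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode


def demo():
    # Hypothetical module name, chosen to avoid clashing with real modules.
    module_name = "dumbo_thread_demo_mapper"
    with tempfile.TemporaryDirectory() as with_file, \
            tempfile.TemporaryDirectory() as without_file:
        # Only the first directory receives the module file.
        with open(os.path.join(with_file, module_name + ".py"), "w") as handle:
            handle.write("print('mapper module found')\n")
        found = run_as_module(module_name, with_file)
        missing = run_as_module(module_name, without_file)
    return found, missing


found, missing = demo()
print("exit status with the file in the working directory:", found)
print("exit status without it:", missing)
```

A non-zero status in the second case corresponds to the "module not found" failure: the interpreter itself works, but the file is not where the lookup expects it.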

This leads me to the following questions:

(a) Why is dumbo doing the python -m thing instead of just specifying
the filename like I did in the streaming command above (and as the
streaming page on apache's site does)?

(b) What does the word 'map' refer to after 'wordcount'?

(c) What are the two numbers 0 and 262144000?

(d) Where is a file copied when used with the -file option? Maybe
python cannot find the module because it's copied somewhere weird?

Thanks!

Nitin

On Nov 23, 9:25 am, Klaas Bosteels <klaas.boste...@gmail.com> wrote:

Upstream 0.20.1 should work fine as far as I know. I guess it must be
some kind of configuration issue that keeps Streaming from properly
sending along files with jobs then, but I'm afraid I can't immediately
think of a specific cause anymore if it's not related to
MAPREDUCE-967.. :/

-Klaas

On Mon, Nov 23, 2009 at 2:53 PM, Nitin Madnani <nmadn...@gmail.com> wrote:

Klaas,

I was using that 0.20.1+152 from Cloudera but then something wasn't
working right. So, I downloaded 0.20.1 from Apache and applied the
four patches you mention on the wiki. Is there a reason why that won't
work?

Nitin

On Mon, Nov 23, 2009 at 3:08 AM, Klaas Bosteels <klaas.boste...@gmail.com> wrote:

Are you using Cloudera's hadoop-0.20.1+133 by any chance? If so, then
you should upgrade to +152 in which they reverted the patch for
MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality
on which Dumbo relies.

-Klaas

On 23 Nov 2009, at 03:42, Nitin Madnani wrote:

Here's what stderr says:

/usr/bin/python: module wordcount not found

I am not sure what's going on. Here's the actual java streaming
command line that's generated when I run the python command:

/Users/nmadnani/hadoop-0.20.1/bin/hadoop jar /Users/nmadnani/hadoop-0.20.1/build/contrib/streaming/hadoop-0.20.1-streaming.jar \
    -input '/tmp/nmadnani/bible+shakes.nopunc' \
    -output '/tmp/nmadnani/demo2' \
    -mapper 'python -m wordcount map 0 262144000' \
    -reducer 'python -m wordcount red 0 262144000' \
    -jobconf 'stream.map.input=typedbytes' \
    -jobconf 'stream.reduce.input=typedbytes' \
    -jobconf 'stream.map.output=typedbytes' \
    -jobconf 'stream.reduce.output=typedbytes' \
    -jobconf 'mapred.job.name=wordcount.py (1/1)' \
    -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' \
    -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' \
    -cmdenv 'PYTHONPATH=dumbo-0.21.21-py2.5.egg:typedbytes-0.3.6-py2.5.egg' \
    -file '/Users/nmadnani/dumbo-workspace/wordcount.py' \
    -file '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/dumbo-0.21.21-py2.5.egg' \
    -file '/Users/nmadnani/typedbytes-0.3.6-py2.5.egg'

I see that '-file /Users/nmadnani/dumbo-workspace/wordcount.py' is
there so why can't it find it on the server?

Nitin

On Nov 22, 5:38 am, Tim Sell <trs...@gmail.com> wrote:

From the web interface you can click through to get the stdout/err for
each failed map. What does that look like? Errors in the python code
often show as NPE in java.
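
Since the Java side only reports the NullPointerException, it can help to make a Python mapper describe its own failures on stderr, which is what ends up in the per-task logs mentioned above. The following is a rough sketch for a plain line-based streaming mapper, not for Dumbo's typed-bytes protocol; `run_mapper` is a hypothetical helper.

```python
import io
import sys
import traceback


def run_mapper(map_line, lines, out=None, err=None):
    """Apply map_line to each input line; describe any failure on err."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    for number, line in enumerate(lines, 1):
        try:
            out.write("%s\n" % map_line(line))
        except Exception:
            # Record which input triggered the error, then the traceback.
            err.write("mapper failed on input line %d: %r\n" % (number, line))
            traceback.print_exc(file=err)
            raise  # still fail the task so the job does not silently continue


# Small demonstration: the second line cannot be parsed as an integer.
captured = io.StringIO()
try:
    run_mapper(int, ["12\n", "oops\n"], err=captured)
except ValueError:
    pass
print(captured.getvalue().splitlines()[0])
```

In a real streaming mapper the call would be something like `run_mapper(len, sys.stdin)`, leaving `out` and `err` at their defaults.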

2009/11/22 Nitin Madnani <nmadn...@gmail.com>:

Hi,

I am trying to use dumbo with an academic cluster that our university
has access to. I have downloaded hadoop 0.20.1, patched it as
explained in the dumbo installation instructions and it all works
fine. I also ran the streaming unit tests and they all pass.

However, when I use dumbo to run the equivalent job (with the same
input and output), it doesn't work. My logs show the following:

---
java.lang.NullPointerException
       at org.apache.hadoop.io.BytesWritable.<init>(BytesWritable.java:54)
       at org.apache.hadoop.typedbytes.TypedBytesWritable.<init>(TypedBytesWritable.java:41)
       at org.apache.hadoop.streaming.io.TypedBytesOutputReader.getLastOutput(TypedBytesOutputReader.java:73)
       at org.apache.hadoop.streaming.PipeMapRed.getContext(PipeMapRed.java:607)
       at org.apache.hadoop.streaming.PipeMapRed.logFailure(PipeMapRed.java:634)
       at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:122)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
       at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
       at org.apache.hadoop.mapred.Child.main(Child.java:170)
---

Of course, the exact same python program runs just fine in standalone
mode on the same input. Any help would be greatly appreciated!
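
The program itself is never shown in the thread. For orientation, a Dumbo word count usually consists of a generator-style mapper and reducer along the lines below; that the wordcount.py discussed here looks like this is an assumption. `run_locally` is a hypothetical helper that imitates the map, group, and reduce steps in memory so the two functions can be checked without Hadoop.

```python
def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every word in it.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values iterates over all the counts emitted for this word.
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper: emulate map, grouping by key, and reduce."""
    grouped = {}
    for offset, line in enumerate(lines):
        for word, count in mapper(offset, line):
            grouped.setdefault(word, []).append(count)
    results = []
    for word in sorted(grouped):
        results.extend(reducer(word, grouped[word]))
    return results


print(run_locally(["to be or not to be"]))
# prints: [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

When launched through Dumbo, such a script would normally end with `import dumbo` and `dumbo.run(mapper, reducer)` under the usual `if __name__ == "__main__":` guard instead of the local test call.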

Thanks!

Nitin

--
You received this message because you are subscribed to the Google
Groups "dumbo-user" group.
To post to this group, send email to dumbo-user@googlegroups.com.
To unsubscribe from this group, send email to
dumbo-user+unsubscribe@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/dumbo-user?hl=en.

--
Got Blog?
http://greenideas.blogspot.com
