debugging code in dumbo

Mohit Singh

Jul 23, 2013, 2:01:32 PM

to dumbo...@googlegroups.com

Hi,

I am trying to use the dumbo library and am just a beginner.

One of the things I like about Hadoop streaming is the ease of debugging. For example, I can do:

cat input | python mapper.py | sort -k1,1 | python reducer.py

If something is wrong in my code, such as a syntax error, I find out before deploying, and I can print as much as I like to check that everything works.

Then I deploy it on the cluster.

I am trying to do the same thing with dumbo, but I am not sure how.

Also, can you help me understand how input is received in dumbo?

For example:

def mapper(data):
    for key, value in data:
        for word in value.split():
            yield word, 1

How do you get a key, value pair here?
I mean, let's say data is the input file.
Shouldn't it be like
for line in data:
    # do something?

I would love to just print out what is in each data structure. If you can help me with my first question, I can take it from there :)
Thanks

Klaas Bosteels

Aug 7, 2013, 4:06:05 PM

to dumbo...@googlegroups.com

Sorry for the late reply, you got buried under a pile of emails, I'm afraid. Answers are inline.

On Tue, Jul 23, 2013 at 8:01 PM, Mohit Singh <mohi...@gmail.com> wrote:

Hi,
I am trying to use the dumbo library and am just a beginner.

One of the things I like about Hadoop streaming is the ease of debugging. For example, I can do:
cat input | python mapper.py | sort -k1,1 | python reducer.py
If something is wrong in my code, such as a syntax error, I find out before deploying, and I can print as much as I like to check that everything works.

Then I deploy it on the cluster.
I am trying to do the same thing with dumbo, but I am not sure how.

https://github.com/klbostee/dumbo/wiki/Running-programs#locally-on-unix
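While getting started, it can also help to exercise the map and reduce functions in a single Python process, so that syntax errors and print output show up immediately, just as with the cat | sort pipeline. The following is only a rough sketch and does not use dumbo itself: run_locally is a hypothetical helper, the reducer(key, values) signature is an assumption for illustration, and the line index stands in for the real input key.

```python
from itertools import groupby
from operator import itemgetter


def mapper(data):
    # Same shape as the mapper in the question: data yields (key, value) pairs.
    for key, value in data:
        for word in value.split():
            yield word, 1


def reducer(key, values):
    # Hypothetical reducer for illustration: sum the counts for one word.
    yield key, sum(values)


def run_locally(lines):
    """Mimic map -> sort -> reduce in one process (hypothetical helper)."""
    # Stand-in input pairs: (line index, line). The real key may differ.
    pairs = enumerate(lines)
    # Sorting by key plays the role of `sort` in the shell pipeline.
    mapped = sorted(mapper(pairs), key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


if __name__ == "__main__":
    for word, count in run_locally(["a b a", "b c"]):
        print(word, count)
```

Any print call placed inside mapper or reducer appears directly in the terminal, and a syntax error is reported before anything is submitted to a cluster.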

Also, can you help me understand how input is received in dumbo?

For example:

def mapper(data):
    for key, value in data:
        for word in value.split():
            yield word, 1

How do you get a key, value pair here?

I mean, let's say data is the input file.
Shouldn't it be like
for line in data:
    # do something?

The key is the offset position of the line in the file and the value is the line itself.
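To see what that looks like, here is a small sketch that builds (offset, line) pairs by hand and prints each one as the mapper receives it. It does not go through dumbo: offset_line_pairs is a hypothetical helper, and it assumes the offset is counted in bytes from the start of the file.

```python
import io


def offset_line_pairs(stream):
    """Yield (byte offset, line) pairs from a binary stream (hypothetical helper)."""
    offset = 0
    for raw in stream:
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw.rstrip(b"\n").decode("utf-8")
        offset += len(raw)


def mapper(data):
    for key, value in data:
        # Print each incoming pair to inspect what the mapper is given.
        print("key=%r value=%r" % (key, value))
        for word in value.split():
            yield word, 1


# In-memory stand-in for an input file with two lines.
sample = io.BytesIO(b"hello world\nfoo bar\n")
pairs = list(offset_line_pairs(sample))
words = list(mapper(iter(pairs)))
```

Here the first line starts at offset 0 and the second at offset 12, because "hello world" plus its newline takes up 12 bytes.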
