debugging code in dumbo


Mohit Singh

Jul 23, 2013, 2:01:32 PM
to dumbo...@googlegroups.com
Hi,
   I am trying to use the dumbo library and am just a beginner.
One of the things I like about Hadoop streaming is the ease of debugging. For example, I can run:
cat input | python mapper.py | sort -k1,1 | python reducer.py
If something is wrong in my code, like a syntax error, I find out beforehand, and I can print the heck out of the code to see whether everything is working.
Then I deploy it on the cluster.
I am trying to do the same thing but am not sure how to do it with the dumbo module.

Also, can you help me understand how input is received in dumbo?

For example:
def mapper(data):
    for key, value in data:
        for word in value.split(): yield word,1

How do you get a key, value pair here?
I mean, let's say data is the input file.
Shouldn't it be like
for line in data:
    # do something?

I mean, I would love to just print out what is in each data structure. So if you can help me with my first question, I can take it from there :)
Thanks

Klaas Bosteels

Aug 7, 2013, 4:06:05 PM
to dumbo...@googlegroups.com
Sorry for the late reply, you got buried under a pile of emails, I'm afraid. Answers are inline.

On Tue, Jul 23, 2013 at 8:01 PM, Mohit Singh <mohi...@gmail.com> wrote:
Hi,
   I am trying to use the dumbo library and am just a beginner.
One of the things I like about Hadoop streaming is the ease of debugging. For example, I can run:
cat input | python mapper.py | sort -k1,1 | python reducer.py
If something is wrong in my code, like a syntax error, I find out beforehand, and I can print the heck out of the code to see whether everything is working.
Then I deploy it on the cluster.
I am trying to do the same thing but am not sure how to do it with the dumbo module.
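
One way to get the same kind of local check is to mimic the cat | mapper | sort | reducer pipeline in plain Python, without a cluster. The following is only a sketch under assumptions: it does not import dumbo, run_local is a hypothetical helper, the reducer is a hypothetical word-count reducer that receives (key, values) pairs by analogy with the mapper in this thread, and the line number stands in for the real key.

```python
from itertools import groupby
from operator import itemgetter


def mapper(data):
    # Same shape as the mapper in this thread: data yields (key, value) pairs.
    for key, value in data:
        for word in value.split():
            yield word, 1


def reducer(data):
    # Hypothetical reducer: data yields (key, iterator_of_values) pairs.
    for key, values in data:
        yield key, sum(values)


def run_local(lines, mapper, reducer):
    """Hypothetical helper that imitates map -> sort -> reduce in-process."""
    records = enumerate(lines)  # line number used as a stand-in key
    mapped = sorted(mapper(records), key=itemgetter(0))  # the "sort" step
    grouped = (
        (key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    )
    return list(reducer(grouped))


counts = run_local(["hello world", "hello dumbo"], mapper, reducer)
print(counts)  # [('dumbo', 1), ('hello', 2), ('world', 1)]
```

Because everything runs in one process, syntax errors show up immediately and print statements inside mapper or reducer go straight to the terminal.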

 

Also, can you help me understand how input is received in dumbo?

For example:
def mapper(data):
    for key, value in data:
        for word in value.split(): yield word,1

How do you get a key, value pair here?
I mean, let's say data is the input file.
Shouldn't it be like
for line in data:
    # do something?

The key is the offset position of the line in the file and the value is the line itself.
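
To see what that looks like, here is a small sketch that builds (offset, line) pairs by hand and feeds them to the mapper from the question. It does not use dumbo itself; records is a hypothetical helper, and it assumes the offset is counted in bytes from the start of the file.

```python
def records(text):
    """Hypothetical helper: yield (byte_offset, line) pairs for a text."""
    offset = 0
    for line in text.splitlines(True):  # True keeps the line endings
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def mapper(data):
    # The mapper from the question: each item in data is a (key, value) pair.
    for key, value in data:
        for word in value.split():
            yield word, 1


sample = "hello world\nhello dumbo\n"
pairs = list(records(sample))
print(pairs)  # [(0, 'hello world'), (12, 'hello dumbo')]
print(list(mapper(pairs)))
```

So "for key, value in data" unpacks each pair, and a word count mapper can simply ignore the key.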

I mean, I would love to just print out what is in each data structure. So if you can help me with my first question, I can take it from there :)
Thanks


-K 
