Hi,
I am trying to use the dumbo library and am just a beginner.
One of the things I like about Hadoop streaming is the ease of debugging. For example, I can run the whole job locally:
cat input | python mapper.py | sort -k1,1 | python reducer.py
If something is wrong in my code, such as a syntax error, I find out before anything reaches the cluster, and I can add print statements everywhere to check that each step is working.
Only then do I deploy it on the cluster.
I would like to do the same kind of local debugging with dumbo, but I am not sure how.
Also, can you help me understand how input is received in dumbo?
For example:
def mapper(data):
    for key, value in data:
        for word in value.split():
            yield word, 1
How do you get a key, value pair here?
I mean, let's say data is the input file.
Shouldn't it be something like:
for line in data:
    # do something?
Really, I would love to just print out what is in each data structure. So if you can help me with my first question, I can take it from there :)
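To make it concrete, here is a minimal sketch of the kind of local check I have in mind: calling the mapper directly on hand-made pairs and printing whatever it yields. It does not use dumbo at all. The (offset, line) shape of the pairs is only my assumption about what dumbo passes in, and fake_records is a made-up helper, not part of any library.

```python
def mapper(data):
    # Same mapper as above: emit (word, 1) for every word in every value.
    for key, value in data:
        for word in value.split():
            yield word, 1


def fake_records(lines):
    # Hypothetical helper: builds (key, value) pairs by hand.
    # ASSUMPTION: key is the offset of the line, value is the line text.
    offset = 0
    for line in lines:
        yield offset, line
        offset += len(line) + 1  # +1 for the newline that was stripped


lines = ["a b a", "b c"]

# Look at what goes into the mapper...
records = list(fake_records(lines))
for record in records:
    print("input:", record)

# ...and what comes out of it.
pairs = list(mapper(iter(records)))
for pair in pairs:
    print("output:", pair)
```

If the real pairs look different, only fake_records would need to change.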
Thanks