Pierre-Francois Laquerre
Mar 20, 2013, 6:00:49 PM
to dumbo...@googlegroups.com
Is there a way to run a dumbo job on a hadoop cluster, but save its output locally as a pickled dict? I have a few jobs whose output I only ever use outside of the hadoop ecosystem. For example, one of these jobs processes a large text corpus down to a small word frequency dict, which I then use locally for spell checking. My current workflow looks like this:
1) run the dumbo job: dumbo start word_freq.py -hadoop $HADOOP_PREFIX -input my_corpus -output word_freq
2) get the data off hdfs and convert it to a pickled dict: dumbo cat word_freq | convert_to_pickled_dict.py my_dict.bin (this script just splits on \t, builds a dict and pickles it)
3) use my_dict.bin in local scripts
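For reference, the conversion in step 2 amounts to something like the sketch below. It is a minimal reconstruction, not the actual convert_to_pickled_dict.py: the function names are made up for illustration, and it assumes every line of `dumbo cat` output has the form word<TAB>count with an integer count.

```python
import pickle


def lines_to_dict(lines):
    """Build a {word: count} dict from 'word<TAB>count' lines.

    Assumption: each non-blank line holds exactly one key, a tab,
    and an integer value.
    """
    freqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        freqs[word] = int(count)
    return freqs


def save_pickled_dict(freqs, path):
    """Pickle the dict to path (binary mode, as pickle requires)."""
    with open(path, "wb") as f:
        pickle.dump(freqs, f)


# To use this as the pipeline script from step 2, the entry point would be:
#     import sys
#     save_pickled_dict(lines_to_dict(sys.stdin), sys.argv[1])
```

Keeping the parsing separate from the file writing makes it possible to check the parsing on a few hand-written lines without touching HDFS.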
I would like to merge steps 1 and 2 to get something along the lines of:
#!/usr/bin/env python
# encoding: utf-8

def mapper(key, line):
    for word in line.split():
        yield word, 1

def reducer(word, freqs):
    yield word, sum(freqs)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
    # my usual script would end here, but I would like to add this,
    # where reducer_output() returns a dict that contains the reducers' outputs:
    word_freqs = dumbo.reducer_output()
    import pickle
    with open('my_dict.bin', 'wb') as f:
        pickle.dump(word_freqs, f)
What should reducer_output() be here? A quick look at the source tells me that I could probably hack something up based on StreamingFileSystem's cat method, but I wanted to ask in case something already exists for this.
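A fallback that avoids dumbo internals entirely would be a driver script that shells out to the two commands from steps 1 and 2. The following is only a sketch under assumptions: command_output_to_dict and run_and_pickle are hypothetical helpers, not part of dumbo, and the output is again assumed to be key<TAB>integer lines.

```python
import pickle
import subprocess


def command_output_to_dict(cmd):
    """Run cmd and parse its stdout as 'key<TAB>value' lines into a dict.

    Assumption: values are integers; lines without a tab are ignored.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True)
    result = {}
    for line in proc.stdout:
        key, sep, value = line.rstrip("\n").partition("\t")
        if sep:
            result[key] = int(value)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("command failed: %r" % (cmd,))
    return result


def run_and_pickle(start_cmd, cat_cmd, pickle_path):
    """Run the job, read its output back, and pickle it locally."""
    subprocess.check_call(start_cmd)  # raises if the job fails
    freqs = command_output_to_dict(cat_cmd)
    with open(pickle_path, "wb") as f:
        pickle.dump(freqs, f)
    return freqs


# Intended call, reusing the commands from steps 1 and 2:
#     import os
#     run_and_pickle(
#         ["dumbo", "start", "word_freq.py",
#          "-hadoop", os.environ["HADOOP_PREFIX"],
#          "-input", "my_corpus", "-output", "word_freq"],
#         ["dumbo", "cat", "word_freq"],
#         "my_dict.bin")
```

This still launches two processes, so it merges the steps only from the user's point of view; a helper inside dumbo would avoid the extra round trip through the command line.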
Thanks,
Pierre