Saving a hadoop job's output locally


Pierre-Francois Laquerre

Mar 20, 2013, 6:00:49 PM
to dumbo...@googlegroups.com
Is there a way to run a dumbo job on a hadoop cluster, but save its output locally as a pickled dict? I have a few jobs whose output I only ever use outside of the hadoop ecosystem. For example, one of these jobs processes a large text corpus down to a small word frequency dict, which I then use locally for spell checking. My current workflow looks like this:

1) run the dumbo job: dumbo start word_freq.py -hadoop $HADOOP_PREFIX -input my_corpus -output word_freq
2) get the data off hdfs and convert it to a pickled dict: dumbo cat word_freq | convert_to_pickled_dict.py my_dict.bin   (this script just splits on \t, builds a dict and pickles it)
3) use my_dict.bin in local scripts
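For reference, the conversion in step 2 could look roughly like the sketch below. This is not the actual convert_to_pickled_dict.py; the names build_dict and convert are made up for illustration, and it assumes every line has the form word<TAB>count with an integer count.

```python
import pickle


def build_dict(lines):
    """Turn an iterable of 'word<TAB>count' lines into a {word: count} dict.

    Assumption: counts are integers; blank lines are skipped.
    """
    freqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so any later tabs stay in the value.
        word, count = line.split("\t", 1)
        freqs[word] = int(count)
    return freqs


def convert(lines, out_path):
    """Build the dict from the lines and pickle it to out_path."""
    freqs = build_dict(lines)
    with open(out_path, "wb") as f:
        pickle.dump(freqs, f)
    return freqs


# Usage as a script (hypothetical), reading the piped output on stdin:
#     import sys
#     convert(sys.stdin, sys.argv[1])
```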

I would like to merge steps 1 and 2 to get something along the lines of:

#!/usr/bin/env python
# encoding: utf-8

def mapper(key, line):
    for word in line.split():
        yield word, 1

def reducer(word, freqs):
    yield word, sum(freqs)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)

    # my usual script would end here, but I would like to add this:
    word_freqs = dumbo.reducer_output()  # where reducer_output() returns a dict that contains the reducers' outputs
    import pickle
    with open('my_dict.bin', 'wb') as f:
        pickle.dump(word_freqs, f)

What should reducer_output() be here? A quick look at the source suggests that I could probably hack something together based on StreamingFileSystem's cat method, but I wanted to ask in case something already exists for this.

Thanks,

Pierre

Klaas Bosteels

Mar 28, 2013, 6:50:13 PM
to dumbo...@googlegroups.com
You'd want to do this in a starter, not simply after dumbo.run() (since that will mess things up when running mapper/reducers on the cluster). You can do something like:

def starter(prog):
    prog.start()
    # get output out of HDFS here

For some inspiration on how to conveniently/efficiently get the output from HDFS and into a local file, you might want to have a look at the code for "dumbo cat" in the dumbo codebase...
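As a rough illustration of the "get output out of HDFS" step, one low-tech option is to shell out to the same "dumbo cat" command used in the original workflow and parse what it prints. The sketch below is only that: command_output_to_dict and save_pickled are hypothetical helper names, the output is assumed to be key<TAB>integer lines, and the starter wiring is shown in comments because it needs dumbo and a cluster to run. Check the dumbo documentation for how a starter is registered and which options "dumbo cat" needs in your setup.

```python
import pickle
import subprocess


def command_output_to_dict(cmd):
    """Run cmd (a list of arguments) and parse its stdout into a dict.

    Assumption: each non-blank output line is 'key<TAB>integer'.
    The whole output is held in memory, which is fine for a small dict.
    """
    out = subprocess.check_output(cmd).decode("utf-8")
    result = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        key, value = line.split("\t", 1)
        result[key] = int(value)
    return result


def save_pickled(obj, path):
    """Pickle obj to a local file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Hypothetical wiring in the job script (not executed here):
#
# def starter(prog):
#     prog.start()
#     freqs = command_output_to_dict(["dumbo", "cat", "word_freq"])
#     save_pickled(freqs, "my_dict.bin")
```

Because the starter runs on the machine that launches the job rather than inside the map/reduce tasks, writing my_dict.bin there produces a local file.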

-K



