Hi Andreas,
On Monday, July 23, 2012 09:57:38 AM Andreas Weller wrote:
> I plan to use Jug to parallelize data-mining tasks on a SGE cluster.
Ok, great!
> Ideally, I would write a script in a map/reduce fashion where part of the
> script is to iterate over several massive files in parallel and collect
> relevant information into one dictionary per file (the map step), then
> combine those dictionaries (the reduce) and do other work on the result
> dictionary, all within one script.
Ok.
> I am confused now how to accomplish the collection and merging of results.
> Is the jug.mapreduce module the right one for me?
It is one possibility, yes.
> If yes, how do I implement the merging of dicts in the reduce step?
> Or do I need a reduce function decorated with @TaskGenerator?
No; jug.mapreduce takes plain, undecorated functions and creates the Tasks for you behind the scenes.
> global_results = {}
> targets = [target_file1, target_file2, target_file3]
>
> def map(filename):
>     local_results = {}
>     with open(filename):
>         # do iteration, collect relevant info
>     return local_results
>
> def reduce(local_results, global_results):
>     for key in local_results:
>         global_results[key] = local_results[key]
> run = jug.mapreduce(map, reduce, targets)
This would not actually work.
Imagine you were not using jug at all, but just called the builtin map and reduce directly:
run = reduce(reduce, map(map, targets))
Even that would fail: your reduce mutates global_results in place and returns None, so after the first reduction step the running value is None and the next call has nothing to merge into.
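To make that concrete, here is a minimal, self-contained sketch (plain Python, no jug; the three small dictionaries are made-up stand-ins for your per-file results):

from functools import reduce

def bad_reduce(local_results, global_results):
    # mutates its second argument and returns None
    for key in local_results:
        global_results[key] = local_results[key]

per_file = [{'a': 1}, {'b': 2}, {'c': 3}]

# first step: bad_reduce({'a': 1}, {'b': 2}) returns None
# second step: bad_reduce(None, {'c': 3}) raises TypeError (None is not iterable)
try:
    reduce(bad_reduce, per_file)
except TypeError as exc:
    print("reduce failed:", exc)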
With jug, you should not use any global variables: tasks can run in separate processes (or, on your SGE cluster, on separate machines), so changes to a global made in one task are never seen by the others. Your functions should be pure (i.e., depend only on their inputs and communicate through their return values). You are looking for something like:
def reduce(local0, local1):
    combined = local0.copy()
    combined.update(local1)
    return combined
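Putting it together, a jugfile along these lines should do what you described. This is only a sketch: the file names and the line-counting in the mapper are placeholders for your actual analysis, and it assumes the mapreduce(reducer, mapper, inputs) argument order from jug.mapreduce (note: reducer comes first), so do check it against the docs.

import jug.mapreduce

def mapper(filename):
    # the "map" step: scan one file and return a dict of results
    # (placeholder analysis: number of lines per file; keys are unique
    # per file, so the plain dict merge below is exactly right)
    with open(filename) as f:
        n_lines = sum(1 for _ in f)
    return {filename: n_lines}

def reducer(local0, local1):
    # the "reduce" step: merge two result dicts without mutating either
    combined = local0.copy()
    combined.update(local1)
    return combined

targets = ['file1.txt', 'file2.txt', 'file3.txt']

# returns a Task whose result is the fully merged dictionary
final = jug.mapreduce.mapreduce(reducer, mapper, targets)

You would save that as a jugfile and run `jug execute jugfile.py` in as many SGE jobs as you like; any further work on the merged dictionary goes into additional Tasks that take `final` as input.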
HTH
Luis
--
Luis Pedro Coelho | Institute for Molecular Medicine |
http://luispedro.org
LxMLS 2012: Lisbon Machine Learning School
http://lxmls.it.pt