Hello everyone,
I am trying to adapt TASSEL to a distributed architecture so that it can process data and run plugins on multiple machines simultaneously. I have run into some issues along the way, and I am sharing them here to get feedback and suggestions from the maintainers of the package as well as other interested users.
As far as I've seen, TASSEL uses Java parallel streams and Java threads to achieve single-machine parallelism. That might be enough for someone with a very powerful machine (say, hundreds of cores and many gigabytes of RAM), but the application does not scale out; that is, it is not built to work on a cluster of machines. I'm trying to modify TASSEL to work with existing Big Data frameworks (such as Spark and Hadoop), but I've encountered some problems.
I should mention that before posting here, I searched the literature for earlier attempts at a distributed implementation of TASSEL, without much success. So I am posting here to get feedback from the maintainers on my proposed solution and on the problems I am encountering. I would much appreciate any ideas or suggestions about how to parallelize TASSEL plugins across multiple machines while avoiding these problems. I would also welcome pointers to any shortcomings in my approach.
./run_pipeline.pl -fork1 -h /tassel-5-source/data/mdp_genotype.hmp.txt -filterAlign -filterAlignMinFreq 0.05 -fork2 -r /tassel-5-source/data/mdp_traits.txt -fork3 -q /tassel-5-source/data/mdp_population_structure.txt -excludeLastTrait -fork4 -k /tassel-5-source/data/mdp_kinship.txt -combine5 -input1 -input2 -input3 -intersect -combine6 -input5 -input4 -mlm -export /tassel-5-source/mlm_output_tutorial -runfork1 -runfork2 -runfork3 -runfork4
[pool-17-thread-3] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.analysis.data.FileLoadPlugin: time: Oct 26, 2017 14:00:58
[pool-17-thread-3] INFO net.maizegenetics.pipeline.TasselPipeline - net.maizegenetics.analysis.data.FileLoadPlugin: time: Oct 26, 2017 14:00:58: progress: 100%
[Thread-29] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.analysis.association.WeightedMLMPlugin: time: Oct 26, 2017 14:00:58
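// Submit each thread in myThreads to the pool and keep its Future.
List<Future<?>> futures = new ArrayList<>();
myThreads.stream().forEach((current) -> {
    futures.add(pool.submit(current));
});
// Block until every submitted task has finished.
for (Future<?> future : futures) {
    future.get();
}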
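
For anyone who wants to try this pattern outside TASSEL, here is a minimal self-contained sketch of it. It is not TASSEL code: the class name ForkSubmitSketch, the method runForks, and the string fork names are made up for illustration, and I am assuming that pool is an ordinary java.util.concurrent.ExecutorService. What it tries to show is that every task is handed to a thread pool inside one JVM and the caller blocks on future.get() until all of them are done, so nothing in this pattern can move work to another machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ForkSubmitSketch {

    // Hypothetical stand-in for the pipeline's threads: each task only
    // records its own name when it runs.
    static List<String> runForks(List<String> forkNames, int poolSize)
            throws InterruptedException, ExecutionException {
        List<String> completed = Collections.synchronizedList(new ArrayList<>());
        // Assumption: a fixed-size local pool; all workers live in this JVM.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (String name : forkNames) {
                Runnable task = () -> completed.add(name);
                futures.add(pool.submit(task));
            }
            // Same wait loop as in the snippet above: block on each Future.
            for (Future<?> future : futures) {
                future.get();
            }
        } finally {
            pool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        List<String> done = runForks(Arrays.asList("fork1", "fork2", "fork3", "fork4"), 4);
        System.out.println("Completed " + done.size() + " forks on one JVM");
    }
}
```

The order of the names in the returned list depends on thread scheduling, so only the set of completed names is meaningful. A distributed version would have to replace the local pool with something that ships each task and its input data to a remote worker.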