I have data from two Illumina MiSeq runs on 72 soil samples: one run was 16S (bacterial) data and one was ITS (fungal) data. I soon found out, with the help of some people on this forum, that I could not do some of the heavy-lifting steps in the pipeline, such as OTU picking, on my own machine. I am therefore working with my university's high-performance computing services to do the heavy lifting. This requires an estimate of the number of "service hours" any job I plan to run will take. Service hours essentially amount to wall-clock time x the number of processors used to run the job. Here are the specs of the nodes of the supercomputer I will be using (or refer to http://www.hpc.lsu.edu/resources/hpc/system.php?system=Philip for more detail):
3 compute nodes, each with two 2.93 GHz quad-core Nehalem Xeon 64-bit processors / 96 GB 1066 MHz RAM
32 compute nodes, each with two 2.93 GHz quad-core Nehalem Xeon 64-bit processors / 24 GB 1333 MHz RAM
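To show the service-hour arithmetic I am working with (wall-clock hours x processors used), here is a small sketch. It assumes each Philip node contributes 8 cores (two quad-core Xeons); the wall-clock times are placeholders, not real benchmarks, which is exactly what I am asking for help estimating:

```python
# Service hours = wall-clock hours * total cores allocated.
def service_hours(wall_hours, cores):
    return wall_hours * cores

# Each Philip node: 2 CPUs x 4 cores = 8 cores (from the specs above).
CORES_PER_NODE = 8

# Hypothetical example: a job on 4 nodes that runs for 24 wall-clock hours.
nodes = 4
wall_hours = 24  # placeholder, not a real OTU-picking benchmark
print(service_hours(wall_hours, nodes * CORES_PER_NODE))  # 768
```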
I’m hoping someone with experience can help me make accurate time estimations. Here are my questions:
How long should it take to run the pick_open_reference_otus.py script with any given number of nodes?
Does OTU picking take roughly the same amount of time for 16S data against Greengenes as for ITS data against UNITE?
Will setting "pick_otus:enable_rev_strand_match True" in the parameters file literally double the amount of memory used in this step?
Are there significant differences in run time between different OTU picking methods (e.g., uclust vs. usearch)?
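For context, this is roughly how I plan to set up the run. The parameters-file line is the one from my question above; the file name, input/output paths, and job count are placeholders, and I am assuming the standard QIIME 1 parallel flags (-a to enable parallelism, -O for the number of jobs):

```shell
# Write a QIIME 1 parameters file containing the option in question
# (params.txt is a hypothetical name).
cat > params.txt <<'EOF'
pick_otus:enable_rev_strand_match True
EOF

# Sketch of the invocation on the cluster: 8 parallel jobs would match
# one 8-core Philip node. seqs.fna and otus/ are placeholder paths.
pick_open_reference_otus.py -i seqs.fna -o otus/ -p params.txt -a -O 8
```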