Estimating pick_open_reference_OTUs run time

Adam Bigott

Mar 30, 2016, 11:03:30 AM

to Qiime 1 Forum

I have data from two Illumina MiSeq runs on 72 soil samples. One run was 16S (bacterial) data and one was ITS data. I soon found out, with the help of some people on this forum, that I could not run some of the heavy-lifting steps in the pipeline, such as OTU picking. Thus, I am working with my university’s high-performance computing services to do the heavy lifting. This requires an estimate of the number of “service hours” that any jobs I plan to run will take. Service hours are essentially the run time multiplied by the number of processors used. Here are the specs of the nodes of the supercomputer I will be using (or refer to http://www.hpc.lsu.edu/resources/hpc/system.php?system=Philip for more detail):

3 Compute Nodes, each with Two 2.93 GHz Quad Core Nehalem Xeon 64-bit Processors / 96 GB 1066 MHz RAM

32 Compute Nodes, each with Two 2.93 GHz Quad Core Nehalem Xeon 64-bit Processors / 24 GB 1333 MHz RAM
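
To make the service-hour arithmetic concrete, here is a minimal Python sketch. It assumes the cluster charges for every core on each allocated node (8 cores per node, from the two quad-core processors listed above); that billing rule is my assumption and should be checked with the HPC staff. The function name and the 12-hour job are hypothetical.

```python
# Two quad-core processors per node, per the node specs quoted above.
CORES_PER_NODE = 8


def service_hours(wall_clock_hours, nodes, cores_per_node=CORES_PER_NODE):
    """Service hours = run time x number of processor cores charged.

    Assumption: every core on an allocated node is billed, whether or
    not the job keeps it busy.
    """
    if wall_clock_hours < 0 or nodes < 1 or cores_per_node < 1:
        raise ValueError("run time must be >= 0; nodes and cores must be >= 1")
    return wall_clock_hours * nodes * cores_per_node


# Hypothetical job: 12 hours of wall-clock time on 2 nodes.
print(service_hours(12, 2))
```

Under that assumption, a 12-hour job on 2 nodes would cost 12 x 2 x 8 = 192 service hours.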

I’m hoping someone with experience can help me make accurate time estimations. Here are my questions:

How long should it take to run the pick_open_reference_otus.py script when using any given number of nodes?

Does OTU picking take roughly the same amount of time for 16S using Greengenes as for ITS using UNITE?

Will passing “pick_otus:enable_rev_strand_match True” in the parameter file literally double the amount of memory used in this step?

Are there significant differences in run time between different OTU picking methods (i.e., uclust vs. usearch)?

Kate Blackwell

Mar 30, 2016, 1:13:06 PM

to Qiime 1 Forum

I was struggling with a similar issue in trying to estimate run times for my sample set.

I found the best way to answer this question for my own samples was to select a single sample and then a set of three samples, and run each grouping under each of the different parameter settings I was considering for my pipeline. This gives you baseline time estimates, and comparing the 1-sample and 3-sample runs gives you an idea of whether run time grows in simple proportion to the number of samples or increases faster than that.
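
As a rough illustration of how those two timings could be turned into an estimate for the full data set, here is a small Python sketch. It assumes run time grows linearly with the number of samples, which two data points cannot confirm, so timing one more, larger subset would be a sensible check. The function name and the timings are hypothetical, not measurements.

```python
def extrapolate_runtime(t1_hours, t3_hours, n_samples):
    """Linearly extrapolate run time from timings on 1 and 3 samples.

    Assumption: run time = fixed overhead + per-sample cost * n_samples.
    """
    per_sample = (t3_hours - t1_hours) / (3 - 1)  # hours added per extra sample
    fixed = t1_hours - per_sample                 # overhead independent of sample count
    return fixed + per_sample * n_samples


# Hypothetical timings: 1.0 h for one sample, 2.0 h for three samples.
estimate = extrapolate_runtime(1.0, 2.0, 72)
print(f"Estimated run time for 72 samples: {estimate} hours")
```

If the run time turns out to grow faster than linearly, this estimate will be too low, so it is worth padding the request.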

Just quickly, from my own observations: setting enable_rev_strand_match to True will roughly double the run time. Make sure you actually need it by checking whether your reads include reverse-strand sequences. If they don't, there is no point in adding this parameter. Run times also differ between the OTU picking methods. I would suggest picking the one you feel is best suited to your research question, or selecting 2-3 methods to try on the 1-sample and 3-sample sets to get an idea of the time difference.

Jenya Kopylov

Mar 30, 2016, 1:37:26 PM

to Qiime 1 Forum

Hi Adam,

The runtime of pick_open_reference_otus.py depends on many parameters (size of the input file, length of reads, size of the reference database, number of final OTUs, etc.), so it is difficult to suggest any estimate without looking at the data and extrapolating from the time complexities of the different software involved (usearch or uclust, pynast, FastTree, and the other tools used in pick_open_reference_otus.py), or going the route Kate suggested.

Passing “pick_otus:enable_rev_strand_match True” in the parameter file will double the memory used for loading the reference database since it will be reverse complemented.
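
To show why the reference roughly doubles, here is a toy Python sketch that keeps each reference sequence together with its reverse complement. It only illustrates the idea and is not how uclust or usearch store the database internally; the function names, the "_rc" suffix, and the two short sequences are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_strands(reference):
    """Return a copy of the reference that also holds every reverse complement."""
    both = dict(reference)
    for seq_id, seq in reference.items():
        both[seq_id + "_rc"] = reverse_complement(seq)  # hypothetical naming
    return both


# Toy reference database with two made-up sequences.
reference = {"ref1": "AAGC", "ref2": "GGGTAC"}
doubled = with_reverse_strands(reference)

bases_before = sum(len(s) for s in reference.values())
bases_after = sum(len(s) for s in doubled.values())
print(bases_before, bases_after)
```

The number of bases held in memory is exactly twice the original here, which matches the doubling described above for the reference database.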

Our recent paper comparing OTU clustering algorithms can give you insight into the runtime performance of various tools (see Figure 5) and perhaps help you choose the right tool for your analysis.

Jenya
