chao1 and estimate_observation_richness.py

Devin Leopold

Jan 4, 2014, 3:07:06 PM
to qiime...@googlegroups.com
I was wondering what the QIIME community thinks about the Chao1 diversity estimators for next-gen microbial data. I know this is an option for many QIIME analyses, and it seems to be the foundation behind the new estimate_observation_richness.py script. What is the justification for using a metric that relies on single and double observations within samples to estimate diversity, when most OTU pipelines suggest removing all global singletons to avoid spurious OTUs? I realize that removing global singletons from a multiplexed sequencing run does not mean that all singletons within a sample will be removed (especially if the data are rarefied before analysis), but it seems like this would introduce significant bias in diversity estimation with Chao1-based metrics, particularly if diversity varies substantially between samples (i.e., samples with more singletons will likely be affected to a greater extent by global singleton removal).
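
To make the concern concrete, here is a minimal Python sketch of the textbook Chao1 formulas (classic and bias-corrected). It is not taken from estimate_observation_richness.py, which may use a different variant; the function name `chao1` and the example counts are made up for illustration.

```python
from collections import Counter


def chao1(counts, bias_corrected=True):
    """Chao1 richness estimate from one sample's per-OTU counts.

    Textbook forms only (an assumption, not QIIME's implementation):
      bias-corrected: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))
      classic:        S_obs + F1**2 / (2 * F2), undefined when F2 == 0
    where F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]
    if bias_corrected:
        return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))
    if f2 == 0:
        return None  # classic form needs at least one doubleton
    return s_obs + f1 ** 2 / (2.0 * f2)


# Hypothetical sample: 8 observed OTUs, 4 singletons, 2 doubletons.
sample = [10, 5, 2, 2, 1, 1, 1, 1]
# The same sample after two of its singletons were dropped as global singletons.
filtered = [10, 5, 2, 2, 1, 1]

print(chao1(sample), chao1(filtered))
```

In this toy case the estimate falls from 10 to about 6.3 after losing only two OTUs, because both the observed richness and the singleton term shrink together.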

-Devin

P.S. I have been testing out the estimate_observation_richness.py script and have found that it requires much more time and memory than I expected. I ran the script with -n 4 on an OTU table with 32 samples and ~400 OTUs and found that it maxed out the memory on my laptop (8 GB). So I switched to a c3.2xlarge EC2 instance (15 GB RAM), which also ran out of memory. It did work on an m2.4xlarge EC2 instance (68 GB) but took ~24 hours. Is this the expected behavior for this script?

Jai Ram Rideout

Jan 6, 2014, 5:41:02 PM
to qiime...@googlegroups.com
Hi Devin,

Good questions! You're correct that we typically recommend removal of global singletons (i.e., observations that only occur once across all samples in the study) as these could be attributed to noise in the data. estimate_observation_richness.py will report N/A's in the output file if a sample doesn't have any (sample-specific) singletons/doubletons.

Your reasoning about bias makes sense, and this type of bias (along with a whole slew of others) plagues microbial community studies in general. However, we can often still draw conclusions by comparing results across samples (e.g., "is sample A more diverse than sample B?"). If you're concerned about global singleton removal greatly impacting some of your samples' diversity estimates, you might look into exactly what samples are losing global singletons (and how many), and use that information when drawing conclusions from the Chao1 estimates.
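
One way to check which samples lose global singletons, and how many, is sketched below in plain Python. The nested-dict layout of the table and the name `global_singleton_losses` are assumptions made for the example; this is not a QIIME function.

```python
def global_singleton_losses(table):
    """Count, per sample, the OTUs that global singleton removal would drop.

    `table` maps OTU id -> {sample id -> count} (a hypothetical layout).
    A global singleton is an OTU whose counts sum to 1 across all samples.
    """
    losses = {}
    for per_sample in table.values():
        # Make sure every sample appears in the result, even with zero losses.
        for sample_id in per_sample:
            losses.setdefault(sample_id, 0)
        if sum(per_sample.values()) == 1:
            for sample_id, count in per_sample.items():
                if count == 1:
                    losses[sample_id] += 1
    return losses


# Made-up table: otu_a, otu_c and otu_d are global singletons.
example = {
    "otu_a": {"S1": 1, "S2": 0},
    "otu_b": {"S1": 1, "S2": 1},
    "otu_c": {"S1": 0, "S2": 1},
    "otu_d": {"S1": 1, "S2": 0},
    "otu_e": {"S1": 5, "S2": 3},
}

print(global_singleton_losses(example))
```

Samples with large loss counts relative to their observed richness are the ones whose Chao1 estimates deserve the most caution.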

I'm not sure what the general consensus is regarding Chao1 within the microbial ecology community, but you might find these papers interesting:



Both seem to indicate that Chao1 "seems promising", especially with higher sampling depths, but that more investigation is necessary. An alternative is ACE, but this is not available in estimate_observation_richness.py. You might take a look at Robert Colwell's EstimateS program; I think it has Chao1 and ACE implemented, and this was what estimate_observation_richness.py was based off of:


I'm really interested in hearing what others think about this; I'm definitely not the expert here! :)

For the estimate_observation_richness.py performance issues: that's very interesting, and I agree that the time and memory requirements aren't even close to ideal. When I wrote this script, I remember testing it out with tables that were much larger than the one that you're describing, and the performance was okay (i.e. ran in a few seconds on my laptop, didn't eat up all my RAM). If possible, can you please send me (jai.r...@gmail.com) your OTU table so that I can take a look?

Thanks,
Jai

Jai Ram Rideout

Jan 7, 2014, 12:02:52 PM
to qiime...@googlegroups.com
Hi again,

I was able to reproduce the performance issues on my laptop. Definitely not ideal! Unfortunately, there isn't an easy fix/workaround to speed things up or cut down the memory requirements (performance is affected by the number of sequences in your samples). I created an issue on our tracker with a detailed description of the problem:


Until this gets fixed, you might try converting your OTU table to tab-delimited format using the "biom convert" command (http://biom-format.org/documentation/biom_conversion.html) and try it out with Robert Colwell's EstimateS program (http://viceroy.eeb.uconn.edu/estimates/) or with iNEXT (http://glimmer.rstudio.com/tchsieh/inext/). EstimateS will likely have performance issues similar to those of estimate_observation_richness.py, and I'm not sure about iNEXT.
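
Once the table is tab-delimited, a short Python sketch like the following could pull out each sample's nonzero OTU counts, which is roughly the kind of abundance data these tools work from. It assumes a header row starting with "#OTU ID", other "#" lines as comments, and no trailing taxonomy column; the function name `read_abundances` and the inline example data are hypothetical.

```python
import csv
import io


def read_abundances(handle):
    """Return {sample id: [nonzero OTU counts]} from a tab-delimited OTU table.

    Assumed layout: a header row whose first field starts with "#OTU ID",
    other "#" lines treated as comments, and no metadata/taxonomy column.
    """
    samples = None
    abundances = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if row[0].startswith("#"):
            if row[0].startswith("#OTU ID"):
                samples = row[1:]
                abundances = {sample_id: [] for sample_id in samples}
            continue
        if samples is None:
            raise ValueError("no '#OTU ID' header row found before data rows")
        for sample_id, value in zip(samples, row[1:]):
            count = int(float(value))  # counts may be written as e.g. "3.0"
            if count > 0:
                abundances[sample_id].append(count)
    return abundances


# Made-up table standing in for a converted file.
text = (
    "# Constructed from biom file\n"
    "#OTU ID\tS1\tS2\n"
    "otu_a\t1.0\t0.0\n"
    "otu_b\t4.0\t2.0\n"
    "otu_c\t0.0\t1.0\n"
)

print(read_abundances(io.StringIO(text)))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, and adjust the parsing if your table carries a taxonomy column.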

If you end up trying out these alternatives, it'd be great if you posted back here with what you find. Sorry for the inconvenience, and thanks for catching this!

-Jai

Jenna Morgan Lang

Feb 26, 2014, 12:57:00 PM
to qiime...@googlegroups.com
Hi Jai,

Just curious about this performance issue. I am having the same problem with this script running slowly and eating up all 16 GB of my RAM (using QIIME 1.8.0). I'll look into the workaround you've recommended, but wanted to check first to see if a fix is coming soon.

Thank you!
Jenna

Devin Leopold

Feb 26, 2014, 2:13:59 PM
to qiime...@googlegroups.com
Jenna - You might want to check out the R package iNEXT. It implements similar methods for estimating richness and is more flexible than the QIIME script. Here is a link. -Devin

Jai Ram Rideout

Feb 26, 2014, 4:30:49 PM
to qiime...@googlegroups.com
Thanks Devin!

Jenna, I likely won't have time to fix this in the near future, though hopefully for the next release of QIIME (1.9.0). To see when this is fixed, I recommend checking the following issue (it'll be closed when fixed):


-Jai