Hi Devin,
Good questions! You're correct that we typically recommend removal of global singletons (i.e., observations that only occur once across all samples in the study) as these could be attributed to noise in the data. estimate_observation_richness.py will report N/A's in the output file if a sample doesn't have any (sample-specific) singletons/doubletons.
Your reasoning about bias makes sense, and this type of bias (along with a whole slew of others) plagues microbial community studies in general. However, we can often still draw conclusions by comparing results across samples (e.g., "is sample A more diverse than sample B?"). If you're concerned about global singleton removal greatly impacting some of your samples' diversity estimates, you might look into exactly what samples are losing global singletons (and how many), and use that information when drawing conclusions from the Chao1 estimates.
I'm not sure what the general consensus is regarding Chao1 within the microbial ecology community, but you might find these papers interesting:
Both seem to indicate that Chao1 "seems promising", especially with higher sampling depths, but that more investigation is necessary. An alternative is ACE, but this is not available in estimate_observation_richness.py. You might take a look at Robert Colwell's EstimateS program; I think it has Chao1 and ACE implemented, and this was what estimate_observation_richness.py was based off of:
I'm really interested in hearing what others think about this- I'm definitely not the expert here! :)
For the estimate_observation_richness.py performance issues: that's very interesting, and I agree that the time and memory requirements aren't even close to ideal. When I wrote this script, I remember testing it out with tables that were much larger than the one that you're describing, and the performance was okay (i.e. ran in a few seconds on my laptop, didn't eat up all my RAM). If possible, can you please send me (
jai.r...@gmail.com) your OTU table so that I can take a look?
Thanks,
Jai