Remove OTUs present at <1% relative abundance in a given sample

Alexandra Willcox

Dec 8, 2016, 4:07:55 PM
to Qiime 1 Forum
Hi all,

I am attempting to remove OTUs that are present at less than 1% of the total read count on a per-sample basis, not across the whole dataset. (In other words, going through each sample one at a time and zeroing out any OTU that accounts for <1% of the reads in that sample.) I'm trying to find a way to get summarize_taxa.py to calculate relative abundances of each OTU instead of each taxon, so that I can then open the table in Excel and use a function to replace all numbers less than 0.01 with 0. I think I should be able to get this to work using the --md_identifier option, but I am running into an error.
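For concreteness, the per-sample zeroing described above could also be done with pandas instead of Excel. A rough sketch, not a QIIME script; it assumes the table has first been exported to TSV (biom convert -i otu_table_no_singletons.biom -o otu_table.txt --to-tsv) and has no extra metadata columns such as taxonomy:

import pandas as pd

# Rows are OTUs, columns are samples; skip the "# Constructed from biom file" comment line.
df = pd.read_csv('otu_table.txt', sep='\t', skiprows=1, index_col=0)
rel = df.div(df.sum(axis=0), axis=1)  # relative abundance within each sample
df = df.where(rel >= 0.01, 0)         # zero any entry under 1% of its sample's reads
df = df.loc[df.sum(axis=1) > 0]       # drop OTUs now absent from every sample
df.to_csv('otu_table_filtered.txt', sep='\t')

The filtered TSV could then be converted back with biom convert --to-hdf5 --table-type="OTU table".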

[awillcox@kure-login1 december_8]$ bsub summarize_taxa.py -i ../otu_table_no_singletons.biom --md_identifier 'OTU ID' -o relative_abundance_otu_table


The output (if any) follows:

Traceback (most recent call last):
  File "/nas02/apps/qiime-1.9.0/python/bin/summarize_taxa.py", line 259, in <module>
    main()
  File "/nas02/apps/qiime-1.9.0/python/bin/summarize_taxa.py", line 235, in main
    md_identifier)
  File "/nas02/apps/qiime-1.9.0/python/lib/python2.7/site-packages/qiime/summarize_taxa.py", line 43, in make_summary
    md_identifier)
  File "/nas02/apps/qiime-1.9.0/python/lib/python2.7/site-packages/qiime/summarize_taxa.py", line 101, in sum_counts_by_consensus
    "identifier?" % (md_identifier, otu_id))
KeyError: u"Metadata category 'OTU ID' not in OTU denovo84068. Can't continue. Did you pass the correct metadata identifier?"


I've also tried passing '#OTU ID' and get the same error. Any suggestions for how to fix this, or a better way to go about removing these OTUs?

Thanks,
Alexandra

Alexandra Willcox

Dec 11, 2016, 2:55:53 PM
to Qiime 1 Forum
Any ideas?

Antonio González Peña

Dec 12, 2016, 8:23:53 AM
to Qiime 1 Forum
That would be a really strange way to handle the data: basically, you would be biasing your results and treating each sample as a separate dataset. Imagine "treating" each sample differently in the wet lab and then trying to compare them.

Anyway, if you really want to do this, I think the easiest approach would be to (see the sketch below):
1. split the biom table per sample, perhaps with split_otu_table.py
2. summarize each one to see which OTUs are below that level, summarize_taxa.py?
3. filter each table with filter_otus_from_otu_table.py
4. merge the results with merge_otu_tables.py
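A rough command-line sketch of those steps (untested; it assumes a mapping file column, here called SampleCopy, that holds a unique value per sample so the split yields one table each, and all file names are illustrative):

split_otu_table.py -i otu_table.biom -m map.txt -f SampleCopy -o split_tables/
# for each per-sample table, drop OTUs under 1% of that sample's total count
filter_otus_from_otu_table.py -i split_tables/otu_table_sampleA.biom --min_count_fraction 0.01 -o split_tables/sampleA_filtered.biom
# ...repeat for the remaining samples, then merge the filtered tables
merge_otu_tables.py -i split_tables/sampleA_filtered.biom,split_tables/sampleB_filtered.biom -o otu_table_merged.biom

Note that in this per-sample layout the summarize step should become optional: on a single-sample table, --min_count_fraction already applies the 1% threshold directly.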

Alexandra Willcox

Dec 12, 2016, 9:12:56 AM
to Qiime 1 Forum
Thanks for the response. Basically, I want to do this as a quality-control step. I have some samples that are infected and some uninfected, but all of the "uninfected" samples are showing very low levels of the infecting bacteria (<100 reads in each uninfected sample, compared to tens or hundreds of thousands in the infected ones). We believe this is sequencing error, some kind of contamination, or chimera formation, and we want to ignore these very low-abundance OTUs. Can you suggest a better way to do this?


Antonio González Peña

Dec 12, 2016, 9:16:04 AM
to Qiime 1 Forum
I'm not sure what the "best" computational method to deal with this would be; I'm not even sure one exists. My suggestion would be to re-sequence and pool fewer samples in the same run, so you get a higher number of sequences per sample; hopefully that will solve your issue.

Alexandra Willcox

Dec 12, 2016, 10:04:20 AM
to Qiime 1 Forum
We have relatively high numbers of sequences per sample; the smallest is 16,000, but most samples have over 100,000. I do see how reducing the number of samples in each pool would reduce the chance of contamination between samples.

I will see about the possibility of re-sequencing, but short of that, perhaps a less biased approach to cleaning up the data would be to remove an OTU's reads from a sample when they total fewer than 50 there? (So the cutoff would be an absolute read count rather than a percentage.)
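In terms of the pandas sketch from my first message, that would just change the mask; a rough sketch, using the 50-read cutoff as an example:

df = df.where(df >= 50, 0)       # zero any OTU with fewer than 50 reads in a sample
df = df.loc[df.sum(axis=1) > 0]  # drop OTUs now absent from every sample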

Thanks again!

Antonio González Peña

Dec 12, 2016, 10:49:22 AM
to Qiime 1 Forum
Wow, that's a cool number of sequences per sample!

Contamination is really hard to control because it can come from many sources, from the person taking the samples to splashes from well to well. What I have seen in the past is either removing the sequences/OTUs found in blanks from all samples, or filtering out all OTUs below a given threshold, like you said.
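A rough sketch of the blank-based filtering in QIIME 1 (it assumes a text file, here called otus_in_blanks.txt, listing one OTU id per line for every OTU observed in your blanks):

# exclude every OTU that showed up in the blank samples
filter_otus_from_otu_table.py -i otu_table.biom -e otus_in_blanks.txt -o otu_table_no_blank_otus.biom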

Now, if you are concerned about read errors, etc., I would suggest taking a look at DADA2/QIIME 2 and Deblur, which are newer tools that deal with this.