How to eliminate short sequences?


Laura

Nov 30, 2017, 8:04:46 AM
to Qiime 1 Forum
Hi everyone! 

I'm working with Qiime 1.8.0.

I need to remove short sequences (<100 bp or <200 bp) from a set of Illumina sequencing data. I think they are giving me unassigned results because they are not long enough to be assigned a taxonomy, or at least an accurate one.

Could you help me with this?

Thank you very much in advance, 
Laura

Greg Caporaso

Dec 1, 2017, 9:26:45 AM
to Qiime 1 Forum
Hi Laura, QIIME 1.8.0 isn't supported anymore - you should really be using QIIME 1.9.1. My suggestion here should work for both versions, but if not you should first upgrade your version (or switch to QIIME 2, as QIIME 1.9.1 won't be supported anymore as of January 1, 2018).

We don't have a direct way to filter sequences based on their length for Illumina data, but during quality filtering with split_libraries_fastq.py this can be achieved by setting the --min_per_read_length_fraction parameter. This sets how long a read must be, as a fraction of its starting length, to be retained after quality trimming.
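To make the fraction concrete, here is a small Python sketch of the arithmetic. It is only an illustration of the rule as described above, not QIIME's actual code; the function names are made up for this example, and the 150 bp read length and 0.75 fraction are just example values.

```python
import math


def min_retained_length(read_length, min_fraction):
    """Smallest post-trimming length (in bases) that still meets the fraction."""
    return int(math.ceil(read_length * min_fraction))


def fraction_for_min_length(min_length, read_length):
    """Fraction to request so that reads trimmed below min_length are dropped."""
    return float(min_length) / read_length


def is_retained(trimmed_length, read_length, min_fraction):
    """True if a read trimmed to trimmed_length meets the fraction threshold."""
    return trimmed_length >= read_length * min_fraction


# Example values only: 150 bp input reads and a fraction of 0.75.
needed = min_retained_length(150, 0.75)
fraction = fraction_for_min_length(100, 150)
print("bases needed at 0.75:", needed)
print("fraction for a 100 bp minimum: %.3f" % fraction)
```

So with 150 bp reads, a 100 bp minimum corresponds to a fraction of roughly 0.67. Note that the threshold is relative to each read's starting length, so it only maps onto a fixed bp cutoff when all input reads have the same length.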

A more direct way to achieve your goal though is probably just to filter OTUs that are unassigned using filter_taxa_from_otu_table.py. This would let you still retain OTUs with shorter representative sequences that achieved a good taxonomic assignment, and also remove OTUs with longer representative sequences that don't achieve a good taxonomy assignment. You'll need to define what it means for a result to be assigned - for example passing -p k__Bacteria,k__Archaea would allow you to retain only OTUs that were at least assigned to the kingdom level (if you've assigned against the Greengenes taxonomy).
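To show what that kind of filter does conceptually, here is a minimal Python sketch on a plain dictionary. It is not how filter_taxa_from_otu_table.py is implemented; the function name, the OTU IDs, and the lineages are hypothetical, and it assumes Greengenes-style lineages separated by semicolons.

```python
def filter_by_positive_taxa(otu_taxonomy, positive_taxa):
    """Keep only OTUs whose lineage contains at least one of positive_taxa.

    otu_taxonomy maps OTU ID -> lineage string such as
    "k__Bacteria; p__Firmicutes" (assumed format).
    """
    positive = set(positive_taxa)
    kept = {}
    for otu_id, lineage in otu_taxonomy.items():
        # Split the lineage into its individual levels and trim whitespace.
        levels = [level.strip() for level in lineage.split(";")]
        if positive.intersection(levels):
            kept[otu_id] = lineage
    return kept


# Hypothetical example data.
example = {
    "OTU_1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "OTU_2": "Unassigned",
    "OTU_3": "k__Archaea; p__Euryarchaeota",
}
kept = filter_by_positive_taxa(example, ["k__Bacteria", "k__Archaea"])
print(sorted(kept))
```

Here OTU_2 is dropped regardless of how long its representative sequence is, which is the point: the filter acts on the assignment result rather than on length.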

Hope this helps! 
Greg


TonyWalters

Dec 1, 2017, 9:39:32 AM
to Qiime 1 Forum
Hello Laura,

One other suggestion on top of what Greg is pointing out: you may want to find some of the reads that are giving you poor assignments and examine them for sequencing artifacts (e.g. barcodes), as these could interfere with taxonomic assignment. In the taxonomic assignment output folder there is a .txt file with the assignments. There you can find the OTU IDs associated with taxonomy strings that are not well defined, and then use those OTU IDs to find the corresponding sequences in the rep_set.fna file. You may want to BLAST a few of them on the NCBI site to see what they hit, and whether they have overhangs at the ends that don't match, as this can also indicate non-16S data in your reads.
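If you'd rather not do that lookup by hand, something along these lines could work as a rough Python sketch. It makes several assumptions: that the assignments file is tab-separated with the OTU ID in the first column and the taxonomy string in the second, that each FASTA header in rep_set.fna starts with the OTU ID, and that "poorly assigned" means "Unassigned" or fewer than two informative levels. The function names and that cutoff are made up for illustration, so adjust them to your data.

```python
def read_assignments(lines):
    """Map OTU ID -> taxonomy string from tab-separated lines (assumed layout)."""
    assignments = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            assignments[fields[0]] = fields[1]
    return assignments


def poorly_assigned(assignments, min_levels=2):
    """Return OTU IDs that are Unassigned or have fewer than min_levels ranks."""
    poor = set()
    for otu_id, lineage in assignments.items():
        levels = [level.strip() for level in lineage.split(";") if level.strip()]
        # Empty Greengenes-style ranks such as "p__" carry no information.
        informative = [level for level in levels if not level.endswith("__")]
        if lineage.strip() == "Unassigned" or len(informative) < min_levels:
            poor.add(otu_id)
    return poor


def extract_sequences(fasta_lines, wanted_ids):
    """Return {otu_id: sequence} for FASTA records whose ID is in wanted_ids."""
    chunks = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            otu_id = parts[0] if parts else ""
            current = otu_id if otu_id in wanted_ids else None
            if current is not None:
                chunks[current] = []
        elif current is not None and line:
            chunks[current].append(line)
    return dict((otu_id, "".join(parts)) for otu_id, parts in chunks.items())


# Hypothetical in-memory example; with real data, pass open file handles instead.
tax_lines = [
    "denovo0\tk__Bacteria; p__Firmicutes; c__Bacilli\t0.98",
    "denovo1\tUnassigned\t1.00",
    "denovo2\tk__Bacteria; p__\t0.60",
]
fasta_lines = [">denovo0 read_1", "ACGTACGT", ">denovo1 read_7", "ACGT", "TTGA",
               ">denovo2 read_9", "GGCC"]
poor = poorly_assigned(read_assignments(tax_lines))
for otu_id, seq in sorted(extract_sequences(fasta_lines, poor).items()):
    print(otu_id, len(seq), seq)
```

Printing the length next to each sequence also gives a quick check on whether the poorly assigned OTUs really are the short ones.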

-Tony