ITS pick_reference_otus_through_otu_table.py and updating unite database


sp20

Apr 7, 2014, 3:13:09 AM
to qiime...@googlegroups.com
Hi Qiime users,

We sequenced two samples targeting ITS 1-2 on a 454 sequencer, and I'm trying to analyze the data with the QIIME pipeline. After processing, I noticed that about 80% of the reads were unidentified at the phylum level for both samples, as shown below.
Taxon                       s1        s2
k__Fungi;p__unidentified    85.31%    80.47%


I used the following commands to process the data:

pick_reference_otus_through_otu_table.py -p $PWD/ITS_params.txt -i $PWD/split_libarary/seqs.fna -r /home/qiime/qiime_software/its_12_11_otus/rep_set/99_otus.fasta -o $PWD/ref_otus
make_otu_table.py -i $PWD/ref_otus/uclust_ref_picked_otus/seqs_otus.txt -o out_new.biom -t ~/qiime_software/its_12_11_otus/taxonomy/otu_taxonomy.txt
summarize_taxa_through_plots.py -i out_new.biom -o taxa_plot -p ~/ITS_params.txt -m $PWD/ITS_mapping.txt -s

I'm not sure why such a large percentage of the data is unidentified. Is there a way to identify these reads?


Moreover, I'm trying to update the QIIME-compatible UNITE ITS database. I downloaded the latest release from "http://unite.ut.ee/sh_files/sh_qiime_release_09.02.2014.zip".

Looking at the contents, the largest file is 13M ("sh_refs_qiime_ver6_99_09.02.2014.fasta"), whereas in the previous its_12_11 release the file sizes are as follows:

36M 97_otus.fasta  59M 99_otus.fasta

I assumed that an updated database would be larger, not smaller. Do you think I just have to merge the two fasta files (the old and the new one)?


Many thanks for your time.

Sp

Kyle Bittinger

Apr 7, 2014, 9:20:59 AM
to qiime...@googlegroups.com
I was just talking with Tony Walters about these issues. He says:

"Just a heads up on issues with using the new UNITE ITS database releases: the files have been trimmed to just the ITS region using the ITSx software, so if you have overhangs in the SSU or LSU, which you will for most ITS targeting primers, then you'll have issues with clustering/taxonomic assignment."

In this thread, Tony makes the suggestion of using the UNITE reference files from the "developer" folder:

In a different thread, the ITS OTUs are improved by trimming the input sequences to just the ITS region:

Best,
Kyle



sp20

Apr 8, 2014, 12:09:19 AM
to qiime...@googlegroups.com
Hi Best,

Thanks for your kind reply. 

For the time being, I'm sticking with the older "its_12_11_otus" database for data analysis.

I used ITSx_1.0.6 to trim the data. It output an ITS1.fasta file and an empty ITS2.fasta. Here is an excerpt from the positions log file:


sample.A_1       216 bp. SSU: 1-45       ITS1: 46-186    5.8S: No end    ITS2: Not found LSU: Not found  Broken or partial sequence, only partial 5.8S!
sample.A_2       252 bp. SSU: 1-46       ITS1: 47-222    5.8S: No end    ITS2: Not found LSU: Not found  Broken or partial sequence, only partial 5.8S!
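
To make the coordinates in a log like this concrete, here is a minimal Python sketch, separate from ITSx itself, that pulls the located regions out of one positions line and slices the ITS1 span from a read. The function names `parse_positions` and `extract_region` are hypothetical, and the regular expression assumes the "REGION: start-end" layout shown in the two lines above.

```python
import re

# Matches located regions such as "ITS1: 46-186". Entries reported as
# "Not found" or "No end" carry no coordinates and are skipped.
REGION_RE = re.compile(r"(SSU|ITS1|5\.8S|ITS2|LSU): (\d+)-(\d+)")


def parse_positions(line):
    """Return (read_id, {region: (start, end)}) with 1-based inclusive coordinates."""
    read_id = line.split()[0]
    regions = {
        name: (int(start), int(end))
        for name, start, end in REGION_RE.findall(line)
    }
    return read_id, regions


def extract_region(seq, regions, name="ITS1"):
    """Slice one region out of a read, or return None if it was not located."""
    if name not in regions:
        return None
    start, end = regions[name]
    return seq[start - 1:end]  # convert 1-based inclusive to a Python slice


line = (
    "sample.A_1       216 bp. SSU: 1-45       ITS1: 46-186    5.8S: No end    "
    "ITS2: Not found LSU: Not found  Broken or partial sequence, only partial 5.8S!"
)
read_id, regions = parse_positions(line)
print(read_id, regions)
```

For the first read, only SSU and ITS1 have coordinates, so ITS1 spans 141 of the 216 bases and nothing can be extracted for ITS2. That matches the empty ITS2.fasta.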



I used the ITS1 fasta file for downstream processing, but the unidentified percentage is still very high, as shown below.

k__Fungi;p__unidentified 74.10% 82.40%


I would appreciate it if you could suggest a possible method to analyze these samples.


Many Thanks
Sp

Tony Walters

Apr 8, 2014, 12:17:05 AM
to qiime...@googlegroups.com
Hello,

There are two approaches I can recommend to try to improve the taxonomic assignments. You first want to make sure there aren't reverse primers (and, more importantly, the subsequent non-ITS data) in the reads. You can truncate these using the truncate_reverse_primer.py script (http://qiime.org/scripts/truncate_reverse_primer.html). Note that the ReversePrimer values are written in the 5'-3' direction, so if you inspect your sequences manually, the primer will appear in the reads as the reverse complement of that value.
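
For the manual inspection mentioned above, a small Python sketch (not part of QIIME) can produce the reverse complement to search for in the reads. The helper names are hypothetical, and the ITS4 primer is used purely as an illustration, since the thread does not say which primers were used. The search is an exact match only, so it does not handle mismatches or degenerate bases in the primer.

```python
# IUPAC complement table, including the degenerate codes that show up in primers.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer):
    """Return the reverse complement of a primer written 5'-3'."""
    return primer.upper().translate(_COMPLEMENT)[::-1]


def find_reverse_primer(read, primer):
    """Return the 0-based index where the primer's reverse complement starts
    in the read, or -1 if there is no exact match."""
    return read.upper().find(reverse_complement(primer))


# ITS4 is only an example primer; substitute the ReversePrimer value
# from your own mapping file.
its4 = "TCCTCCGCTTATTGATATGC"
print(reverse_complement(its4))
```

A result of -1 from `find_reverse_primer` only means no exact match was found. The read may still contain the primer with a sequencing error.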

In addition to testing for reverse primers, you could try a newer version of UNITE (http://unite.ut.ee/repository.php), and I'd recommend trying the /developer/ version of the reference sequences.

sp20

Apr 8, 2014, 5:40:09 AM
to qiime...@googlegroups.com
Hi Tony,
I appreciate your quick and valuable response. I always trim both forward and reverse primers prior to processing the data. Even so, I checked once again for primers and the data looks good.

This time, as per your recommendation, I used the /developer/ version of the reference sequences and it worked very well, as shown below.
k__Fungi;p__unidentified 4.70% 6.10%

I'm curious why the its_12_11 dataset leaves so many reads unidentified compared to the newer developer version.

The file in the developer version is just 15MB (sh_refs_qiime_ver6_dynamic_09.02.2014.fasta), whereas the file in its_12_11 is 59M (99_otus.fasta).

1. Since this is a reference alignment, do you think most of the reads are being discarded, and that is why the unidentified fraction is so small?
2. Are there any commands I can use to retrieve all the discarded reads that are unaligned?

Many Thanks
Sp

Tony Walters

Apr 8, 2014, 11:18:07 AM
to qiime...@googlegroups.com
Hello,

1. I think there are more annotated reads in the newer databases, and that is leading to better assignments for you.
2. I don't think the reads are aligned (because the ITS region is not conserved, you can generally only align closely related taxa), but I take it you mean unassigned? You could build the OTU table, use filter_taxa_from_otu_table.py (http://qiime.org/scripts/filter_taxa_from_otu_table.html) to create an OTU table containing only the k__Fungi;p__unidentified taxa, and then run filter_fasta.py with the representative sequence fasta as the -f parameter and the filtered OTU table as the -b parameter.
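
The logic of the two filtering steps in point 2 can be sketched in plain Python. This is a stand-in to illustrate the idea, not how the QIIME scripts are implemented. It assumes a tab-separated "otu_id, taxonomy" listing in place of the BIOM table, and the function names and toy records are made up for the example.

```python
def unassigned_otu_ids(taxonomy_lines, prefix="k__Fungi;p__unidentified"):
    """Collect OTU IDs whose taxonomy string starts with the given prefix.

    Each line is assumed to look like 'otu_id<TAB>taxonomy'.
    """
    ids = set()
    for line in taxonomy_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        # Ignore spaces so "k__Fungi; p__unidentified" is treated the same way.
        if fields[1].replace(" ", "").startswith(prefix):
            ids.add(fields[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield (header, sequence) for records whose ID, the first word of the
    header, is in keep_ids."""
    def wanted(header):
        words = header.split()
        return bool(words) and words[0] in keep_ids

    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None and wanted(header):
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None and wanted(header):
        yield header, "".join(chunks)


# Toy records, invented for illustration only.
taxonomy = [
    "otu_1\tk__Fungi;p__unidentified",
    "otu_2\tk__Fungi;p__Ascomycota",
]
rep_set = [">otu_1", "ACGT", "TT", ">otu_2", "GGGG"]

for header, seq in filter_fasta(rep_set, unassigned_otu_ids(taxonomy)):
    print(">" + header)
    print(seq)
```

The resulting sequences are the representatives of the OTUs that stayed unidentified at the phylum level, which can then be examined separately.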