QIIME pick_closed_reference_otus output: why not mapped reads?

Skip to first unread message

Dani Haze

Jul 4, 2016, 10:31:07 PM7/4/16
to qiime...@googlegroups.com
Hello everyone, I am fairly new to QIIME and this is the first time I post a question here.

I have mouse 16S data obtained with Illumina MiSeq and I want to detect the different OTUs in each sample. My problem here is that my final OTU table for all my samples is basically all 0, with some random 1 or a 2, when I was expecting relatively big numbers of mapped reads.

I have done the following. I start with 45 samples, each with 2 FASTQ files of paired-end reads. I used cutadapt to trim the adapters, flash to merge the paired-end FASTQ files into 1, and trimmomatic to filter by quality.

Starting from my merged and trimmed FASTQ file per sample (I am attaching one of them), even though my data is already demultiplexed, I use split_libraries_fastq.py to transform the FASTQ file into FASTA format with the sample ID in each FASTA file entry.

     split_libraries_fastq.py -i $FASTQ -o $outDIR1 -m $mapfile --barcode_type 'not-barcoded' --sample_ids $sample -q 0

     where $FASTQ is the FASTQ file input for each sample, $outDIR1 is the output directory for each sample, $sample is each sample ID (AHM041trim in this case), and $mapfile is my mapping file for each sample, which in this case looks like:
          #SampleID    BarcodeSequence    LinkerPrimerSequence    Description
AHM041trim    NA    NA    AHM041trim

Once I obtain my properly formatted seqs.fna file for each sample (I am attaching the one for AHM041trim), I use pick_closed_reference_otus.py to detect the OTUs, using the Silva database as reference (https://www.arb-silva.de/)

     pick_closed_reference_otus.py -i $outDIR1/seqs.fna -o $outDIR2 -r $silvafasta -t $silvatax -p $paramfile --parallel --jobs_to_start=2

     where $outDIR2 is the output directory for OTU picking results for each sample, $silvafasta is the absolute path to the file SILVA_123.1_SSURef_Nr99_tax_silva.fasta and $silvatax is the absolute path to the taxonomy file taxonomy_all_levels.txt (SILVA123_QIIME_release/taxonomy/16S_only/99), both downloaded from https://www.arb-silva.de/download/archive/ My parameters file $paramfile looks like the following and is the same for all samples:
          pick_otus:enable_rev_strand_match True
          pick_otus:similarity 0.97

After this, I successfully obtain a otu_table.biom per sample (I am attaching the one for AHM041trim). I merge all of them into one using merge_otu_tables.py:

     merge_otu_tables.py -i $OTUtables -o $ALLbiom

     where $OTUtables is a string of the absolute paths for each sample otu_table.biom, separated by "," and $
ALLbiom is the final biom file merging all samples.

Once I have my final biom file with all samples (attached), I convert it into txt for better readability:

     biom convert -i $ALLbiom -o $ALLtxt -b --header-key taxonomy

     where $ALLtxt is the final table with all samples in txt format

My final table (attached) has the expected format, but I am puzzled to find all 0s everywhere with random 1s and 2s here and there... Is this how it is supposed to look like? I was expecting relatively big mapped read counts, so my guess is I am doing something wrong somewhere, but I cannot figure out what.

I would really appreciate if I could get some help on this, many thanks!!


Please note the FASTQ and FASTA files where too big to upload, so I am just uploading the first portion of them. Thanks!

EDIT: There seems to be a problem with the Silva files I use... can anybody provide some guidance on which Silva files to use (https://www.arb-silva.de/no_cache/download/archive/)? Many thanks!

zhiying Guo

Jul 5, 2016, 11:21:11 PM7/5/16
to Qiime 1 Forum

Sorry for delayed response.
When I summarize the biom file with command biom summarize-table -i MUX3696_all.biom -o MUX3696_all.summary, I found that the reads number in each sample varied from 2 to 3564. For sample AHM060trim, there are 3564 reads in it after clustering OTU. So I think the process of OTU clustering is not the problem. 

Whether the input sequence file of each sample has few reads? Can you check the reads number of each sample by count_seqs.py? I usually check it after each step of quality 
control like cutadapt, trim and so on. 
BTW, why not merge all reads together to cluster OTU? This will save input times as it needs pick OTU once.


Dani Haze

Jul 6, 2016, 9:47:07 PM7/6/16
to Qiime 1 Forum
Thanks for your response Zhiying,
yes I use fastqc to see the number of reads and quality after the trimming process.

I think I spotted the problem, which is I am not using the appropriate taxonomy mapping file for the reference database. Have you ever used Silva as reference?

I use the reference sequences in the SILVA_123_SSURef_Nr99_tax_silva.fasta file I downloaded from https://www.arb-silva.de/no_cache/download/archive/release_123/Exports/

As a mapping file, I have tried taxonomy/taxonomy_all/99/taxonomy_all_levels.txt and taxonomy/16S_only/99/taxonomy_all_levels.txt in Silva_123_provisional_release.zip, downloaded from https://www.arb-silva.de/no_cache/download/archive/qiime/

I have had no success using those files
, but I was able to recover a custom mapping file a colleague used for a previous version of Silva, and I get much better results. However, I do not know where he got this file, she left and I haven't been able to locate her yet.

Do you know what Silva file to use as a taxonomy mapping file? I also contacted Silva, but they haven't replied...

I am also interested in your suggestion of merging and picking the OTUs just once to save time. How should I do it? Appending the FASTA files and running pick_closed_reference_otus as usual? Will I get the same results?

Thank you so much!

zhiying Guo

Jul 7, 2016, 11:16:12 AM7/7/16
to Qiime 1 Forum
Hello Daniel,

Probably it's really a matter of reference file. The reference sequence file should correspond to taxonomy file (mapping file).
However, according to the different address in your last post, the reference and mapping file are not consistent. Because The mapping file is QIIME release but reference file not.
You have downloaded 'Silva_123_provisional_release.zip', then just unzip it, use the sequence file and mapping file in this folder.
For example,  
if using 'SILVA123_QIIME_release/rep_set/rep_set_16S_only/97/97_otus_16S.fasta' as a sequence reference,
taxonomy should be used 'SILVA123_QIIME_release/taxonomy/16S_only/97/taxonomy_7_levels.txt' 
So that, the sequence and taxonomy are consistent. They are both for 16S and non-redundancy level of 97%.

Besides, you can directly use Greegene 13_8 (QIIME default database) as reference. 

Please let me know if more I can do
Reply all
Reply to author
0 new messages