>50% Reads Unassigned 16S Illumina PE Data

Cindy

Aug 7, 2016, 8:50:08 PM

to Qiime 1 Forum

Hi Qiime Forum!

I am using Illumina MiSeq 600-cycle (2x300) paired-end data for the 16S gene, regions V3 and V4. These are environmental samples, so I do not expect all of my reads to be assigned to the genus or species level. I am concerned, though, because when I use all of my reads, including reads that are not joined, about 50% of my OTUs are unassigned at any level.

When I removed all of the unjoined reads, my taxonomy plots had much higher assignment rates (only 5-10% unassigned), but my counts per sample dropped significantly. In my test data, for example, they dropped from 61,424 to 1,825 for sample A and from 65,243 to 1,886 for sample B (see below). When I also removed chimeric sequences from the data with unjoined reads removed (using usearch61), there was not a huge change in reads assigned to the genus level (A: 1,825 to 1,559; B: 1,886 to 1,650; see below). On my real data, in some cases only 250 reads were used per sample, whereas the same sample had over 100,000 reads when unassigned reads were included.

I'm wondering if there is something I can change in my workflow to better accommodate my data, or if perhaps I am using something incorrectly. I had originally processed the same data using the Illumina 16S BaseSpace app, and for the same samples significantly more reads were assigned to the genus level (A: 120,000; B: 160,431; these could include unpaired data, I'm not sure).

I tried using SILVA but haven't gotten my parameter file to work properly; the script is killed every time. I think this is a separate issue, and I'm not trying to address it here unless you think SILVA would give me significantly better results. This is my first analysis using Qiime, so any input would be greatly appreciated.

Here is what I have been doing:

Including unjoined reads:

multiple_join_pair_ends.py -i $PWD/IlluminaOutput/ -o $PWD/PairedEndData
multiple_split_libraries_fastq.py -i $PWD/PairedEndData/ -o $PWD/SplitLib/ --remove_filepath_in_name --include_input_dir_path
pick_de_novo_otus.py -a -O 7 -i $PWD/SplitLib/non_chimeric_seqs.fasta -o $PWD/OTUS/
summarize_taxa_through_plots.py -i $PWD/OTUS/otu_table.biom -m map.txt -o $PWD/SumTaxa
biom convert -i $PWD/OTUS/otu_table.biom -o $PWD/OTUS/otu_table_tabseparated.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"
biom summarize-table -i $PWD/OTUS/otu_table_tabseparated.txt -o $PWD/OTUS/summarized_OTU_table.txt

Getting rid of unjoined reads (plus steps ** for removing chimeric reads):
multiple_join_pair_ends.py -i $PWD/IlluminaOutput/ -o $PWD/PairedEndData
find $PWD/PairedEndData/ -name "fastqjoin.un*" -print -exec mv {} Remove_Unjoined/ \;
multiple_split_libraries_fastq.py -i $PWD/PairedEndData/ -o $PWD/SplitLib_RMUnjoin/ --remove_filepath_in_name --include_input_dir_path
**identify_chimeric_seqs.py -m usearch61 -i $PWD/SplitLib_RMUnjoin/seqs.fna --suppress_usearch61_ref -o $PWD/Chimeras_forRMUnJoinSplitLib/
**filter_fasta.py -f $PWD/SplitLib_RMUnjoin/seqs.fna -s $PWD/Chimeras_forRMUnJoinSplitLib/chimeras.txt -n -o
pick_de_novo_otus.py -a -O 7 -i $PWD/SplitLib_RMUnjoin/non_chimeric_seqs.fasta -o $PWD/OTUS_nonchimRMUnjoin/
summarize_taxa_through_plots.py -i $PWD/OTUS_nonchimRMUnjoin/otu_table.biom -m map.txt -o $PWD/SumTaxa_nonchimRMUnjoin
biom convert -i OTUS_nonchimRMUnjoin/otu_table.biom -o OTUS_nonchimRMUnjoin/otu_table_tabseparated.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"
biom summarize-table -i OTUS_nonchimRMUnjoin/otu_table_tabseparated.txt -o OTUS_nonchimRMUnjoin/summarized_OTU_table.txt

Here is my config info:
System information
==================
         Platform:      linux2
   Python version:      2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:42:40) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Python executable:      /home/envs/qiime1/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
https://github.com/biocore/qiime-default-reference/releases/tag/0.1.3

Dependency versions
===================
          QIIME library version:        1.9.1
           QIIME script version:        1.9.1
qiime-default-reference version:        0.1.3
                  NumPy version:        1.10.4
                  SciPy version:        0.17.1
                 pandas version:        0.18.1
             matplotlib version:        1.4.3
            biom-format version:        2.1.5
                   h5py version:        2.6.0 (HDF5 version: 1.8.16)
                   qcli version:        0.1.1
                   pyqi version:        0.3.2
             scikit-bio version:        0.2.3
                 PyNAST version:        1.2.2
                Emperor version:        0.9.51
                burrito version:        0.9.1
       burrito-fillings version:        0.1.1
              sortmerna version:        SortMeRNA version 2.0, 29/11/2014
              sumaclust version:        SUMACLUST Version 1.0.00
                  swarm version:        Swarm 1.2.19 [Mar 1 2016 23:41:10]
                          gdata:        Installed.

QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
http://qiime.org/install/qiime_config.html
http://qiime.org/tutorials/parallel_qiime.html

                     blastmat_dir:      None
      pick_otus_reference_seqs_fp:      /home/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                         sc_queue:      all.q
      topiaryexplorer_project_dir:      None
     pynast_template_alignment_fp:      /home/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta
                  cluster_jobs_fp:      start_parallel_jobs.py
pynast_template_alignment_blastdb:      None
assign_taxonomy_reference_seqs_fp:      /home/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                     torque_queue:      friendlyq
                    jobs_to_start:      7
                       slurm_time:      None
            denoiser_min_per_core:      50
assign_taxonomy_id_to_taxonomy_fp:      /home/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
                         temp_dir:      /tmp/
                     slurm_memory:      None
                      slurm_queue:      None
                      blastall_fp:      blastall
                 seconds_to_sleep:      1

-------------------------------------------------------------------------------------------
Here is my count data from each analysis
OTUS_with_joinPE_and_unjoined/summarized_otu_table.txt
Num samples: 2
Num observations: 106416
Total count: 126667
Table density (fraction of non-zero values): 0.506

Counts/sample summary:
Min: 61424.0
Max: 65243.0
Median: 63333.500
Mean: 63333.500
Std. dev.: 1909.500
Sample Metadata Categories: None provided
Observation Metadata Categories: ConsensusLineage

Counts/sample detail:
A: 61424.0
B: 65243.0
---------------------------------------------------------
Removed_Unjoined_denovOTU/summarized_OTU_table.txt
Num samples: 3
Num observations: 56469
Total count: 67068
Table density (fraction of non-zero values): 0.337

Counts/sample summary:
Min: 1825.0
Max: 63357.0
Median: 1886.000
Mean: 22356.000
Std. dev.: 28992.096
Sample Metadata Categories: None provided
Observation Metadata Categories: ConsensusLineage

Counts/sample detail:
A: 1825.0
B: 1886.0
UnJoin: 63357.0
---------------------------------------------------------
Removed_Unjoined_denovOTU_nonchimericseqs/summarized_OTU_table.txt
Num samples: 3
Num observations: 52084
Total count: 62026
Table density (fraction of non-zero values): 0.337

Counts/sample summary:
Min: 1559.0
Max: 58817.0
Median: 1650.000
Mean: 20675.333
Std. dev.: 26970.257
Sample Metadata Categories: None provided
Observation Metadata Categories: ConsensusLineage

Counts/sample detail:
A: 1559.0
B: 1650.0
UnJoin: 58817.0
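
To put the three summaries above side by side, here is a minimal sketch that computes how many reads survive each step for samples A and B. It uses only the per-sample counts listed above; the dictionary keys and the `percent_retained` helper are names made up for this illustration.

```python
# Per-sample counts copied from the three summaries above.
counts = {
    "A": {"joined_plus_unjoined": 61424, "joined_only": 1825, "joined_nonchimeric": 1559},
    "B": {"joined_plus_unjoined": 65243, "joined_only": 1886, "joined_nonchimeric": 1650},
}


def percent_retained(before, after):
    """Percentage of `before` reads that are still present in `after`."""
    return 100.0 * after / before


for sample in sorted(counts):
    c = counts[sample]
    joined = percent_retained(c["joined_plus_unjoined"], c["joined_only"])
    clean = percent_retained(c["joined_only"], c["joined_nonchimeric"])
    print("%s: joined-only keeps %.2f%% of all reads; "
          "chimera filtering keeps %.1f%% of the joined reads" % (sample, joined, clean))
```

Roughly 3% of the reads per sample are kept once unjoined reads are dropped, so the joining step, not the chimera filter, accounts for most of the loss.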

Thank you for any input!!

Jai Ram Rideout

Aug 8, 2016, 3:36:11 PM

to Qiime 1 Forum

Hi Cindy,

I followed up with a developer for advice; we'll keep you updated!

Best,

Jai

TonyWalters

Aug 8, 2016, 3:57:43 PM

to Qiime 1 Forum

Using "merged" unstitched and stitched data could be questionable, and I would go one of two routes:
1. Alter the parameters for the stitching step to increase the number of stitched reads, and use only those for downstream steps.
2. Use only the R1 reads rather than stitched + unjoined data.
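
When tuning the stitching step, it may help to first estimate how much overlap the read pairs can have at all. The sketch below is only back-of-the-envelope arithmetic: the amplicon lengths are hypothetical values chosen for illustration (the real V3-V4 length depends on the primers and the organisms), and it assumes both reads keep their full 300 bases, which quality trimming would reduce.

```python
def expected_overlap(read_length, amplicon_length):
    """Bases covered by both reads of a pair, assuming untrimmed reads that
    start at opposite ends of the amplicon."""
    overlap = 2 * read_length - amplicon_length
    # The overlap cannot be negative or longer than the amplicon itself.
    return max(0, min(amplicon_length, overlap))


# Hypothetical amplicon lengths (illustration only, not taken from the data).
for amplicon_length in (420, 460, 500):
    print("amplicon %d bp -> up to %d bp of overlap with 2x300 reads"
          % (amplicon_length, expected_overlap(300, amplicon_length)))
```

If the estimated overlap is large but few reads join, low-quality read ends are a more likely cause than amplicon length, and relaxing the joiner's allowed mismatch rate or trimming read ends before joining may be worth trying.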

As for the taxonomic assignments, you may need to check whether reads contain non-16S sequence, as this could interfere with taxonomic assignment (the stitched reads are perhaps long enough to overcome the issue). For example, you could find OTUs that are unclassified in your tab-delimited OTU table, find the associated sequence in the rep_set.fna file using grep -A 1 X rep_set.fna (where X is the OTU ID), and BLAST it at NCBI to see whether there are overhangs at the ends that aren't hitting reference 16S sequences (or whether non-16S data are being hit, which could imply non-target amplification or bad reads).
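
The lookup described above can be done for all unclassified OTUs at once. This is a minimal sketch and not part of QIIME. It assumes that the lineage column is named "ConsensusLineage" (as in the biom convert command earlier in the thread), that unclassified OTUs carry the literal label "Unassigned", and that each rep_set.fna header starts with the OTU ID. The function names and the tiny in-memory example data are made up for illustration.

```python
import csv


def unassigned_otu_ids(table_lines, lineage_column="ConsensusLineage", label="Unassigned"):
    """Return IDs of OTUs whose lineage column equals `label` (assumed label)."""
    header = None
    ids = []
    for row in csv.reader(table_lines, delimiter="\t"):
        if not row:
            continue
        if row[0].startswith("#"):
            # The "#OTU ID ..." comment line carries the column names.
            if lineage_column in row:
                header = row
            continue
        if header is None:
            raise ValueError("no header line containing %r" % lineage_column)
        column = header.index(lineage_column)
        if len(row) > column and row[column].strip() == label:
            ids.append(row[0])
    return ids


def fasta_records(fasta_lines):
    """Yield (record_id, sequence); record_id is the first word of the header."""
    record_id, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            chunks = []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Hypothetical example data standing in for the OTU table and rep_set.fna.
table_text = ("# Constructed from biom file\n"
              "#OTU ID\tA\tB\tConsensusLineage\n"
              "denovo0\t3.0\t1.0\tk__Bacteria; p__Proteobacteria\n"
              "denovo1\t0.0\t2.0\tUnassigned\n")
fasta_text = ">denovo0 A_12\nACGTACGT\n>denovo1 B_7\nTTGACCA\n"

wanted = set(unassigned_otu_ids(table_text.splitlines()))
for otu_id, sequence in fasta_records(fasta_text.splitlines()):
    if otu_id in wanted:
        print(">%s\n%s" % (otu_id, sequence))
```

With real files, the two `splitlines()` inputs would be replaced by open file handles. Matching on the exact OTU ID also avoids a pitfall of a plain grep, where an ID such as denovo1 would also match denovo10.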
