Should I be concerned about "Unassigned;Other"/unmapped(i.e. o______) otus?

matt4...@gmail.com

Jul 3, 2017, 9:23:54 PM

to Qiime 1 Forum

Hi all, newbie qiime user here.

I am looking at cecal contents of mice fed different diets (control vs high fibre). I'm wondering whether I have done everything correctly because:

1) At the phylum level (L2), 14% (range 3-29%) of the reads map to "Unassigned;Other" (as opposed to "k__Bacteria;p__Bacteroidetes")...is this % unreasonably high? Does this indicate I should have filtered the results to exclude whatever the unassigned reads are?

2) When I look at the species (L6) level it looks like I have lots of otus (44 out of 154 total otus) that don't appear to be "fully mapped" (apologies for poor terminology)

e.g. k__Bacteria;Other;Other;Other;Other;Other

k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Bifidobacteriales;f__Bifidobacteriaceae;Other

k__Bacteria;p__Armatimonadetes;c__[Fimbriimonadia];o__[Fimbriimonadales];f__;g__

k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;Other;Other

k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__;g__

k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__S24-7;g__ (mean 31% of reads map to this OTU)

k__Bacteria;p__Planctomycetes;c__C6;o__d113;f__;g__

A lot of these are quite low in relative abundance, e.g. less than 0.01% of reads map to these otus. Is there a level of cutoff that these low unknown otus should be excluded? I am unsure of how to interpret situations where the family and genus levels have been left blank...does this mean I cannot look down to the genus level for these otus?

Here's the workflow I've been using:

We used paired end Illumina reads of the V3-V5 region. Trimmed using Skewer (skewer -Q 10 -t `nproc`).

$ multiple_join_paired_ends.py -i dbdb_fastq/trimmed -o dbdb_paired -p dbdb_paired/pairpara.txt

pairpara.txt:

join_paired_ends:perc_max_diff 80

join_paired_ends:min_overlap 25

$ multiple_split_libraries_fastq.py -i dbdb_paired -o dbdb_SplitLib --include_input_dir_path --remove_filepath_in_name

$ pick_open_reference_otus.py -i SplitLib/seqs.fna -o otus

# default database would be greengenes...should I try against another database?

$ cp otus/otu_table_mc2_w_tax_no_pynast_failures.biom output.biom

$ core_diversity_analyses.py -o core -i output.biom -m Map_File.txt -t otus/rep_set.tre -e 10569

Thanks in advance for your help,
Matt

Colin Brislawn

Jul 4, 2017, 4:29:02 PM

to Qiime 1 Forum

Hello Matt,

Thanks for posting your workflow. Taxonomy results could change based on processing steps, so workflow is key.

1) At the phylum level (L2), 14% (range 3-29%) of the reads map to "Unassigned;Other"

Because this mouse microbiome is relatively well characterized, this number sounds a little high. Because you are using 'no_pynast_failures' and default settings, we know that all remaining OTUs are > 70% similar to something in greengenes. But only getting assignments to the kingdom level is a little strange.

Also, is that 14% of all your reads, or 14% of your OTUs? Having lots of low quality OTUs without good taxonomy is common. Having lots of unknown reads is more unusual.

2) When I look at the species (L6) level it looks like I have lots of otus (44 out of 154 total otus) that don't appear to be "fully mapped" (apologies for poor terminology)

I know what you mean! I would describe these as 'OTUs that are only assigned down to the class/order/family level'. This is normal. Keep in mind that your amplicon region is a few hundred base pairs, so any microbes which are similar in this V3-V5 region will not have a definite taxonomy.

Is there a level of cutoff that these low unknown otus should be excluded?

There is no single rule about this. Some people take out things that appear at less than 0.01% (or 0.0001). I keep all my OTUs in the table.
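For readers who do want to apply a cutoff, the idea can be sketched in a few lines of Python. This is only an illustration with made-up counts, not QIIME's own filtering code; the function name `filter_low_abundance` and the default fraction of 0.0001 (i.e. 0.01% of all reads) are assumptions chosen for the example.

```python
def filter_low_abundance(otu_counts, min_fraction=0.0001):
    """Keep OTUs whose share of all reads is at least min_fraction."""
    total = sum(otu_counts.values())
    return {otu: n for otu, n in otu_counts.items() if n / total >= min_fraction}


# Made-up read counts, for illustration only.
otu_counts = {"OTU_1": 310000, "OTU_2": 689950, "OTU_3": 50}

# OTU_3 holds 50 of 1,000,000 reads (0.005%), which is below the 0.01% cutoff.
kept = filter_low_abundance(otu_counts)
print(sorted(kept))
```

Whether to filter at all remains a judgment call; the sketch only shows what a given threshold would remove.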

does this mean I cannot look down to the genus level for these otus?

Unfortunately, that's correct...

k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__S24-7;g__ (mean 31% of reads map to this OTU)

Ah yes, S24-7. The mysterious mouse microbe!

I'm not sure if these answers have been very helpful, but I hope this was able to address your questions.

Colin

matt4...@gmail.com

Jul 5, 2017, 9:33:36 PM

to Qiime 1 Forum

Thanks for your reply Colin, your guidance has been very helpful.

1) At the phylum level (L2), 14% (range 3-29%) of the reads map to "Unassigned;Other"
Because this mouse microbiome is relatively well characterized, this number sounds a little high. Because you are using 'no_pynast_failures' and default settings, we know that all remaining OTUs are > 70% similar to something in greengenes. But only getting assignments to the kingdom level is a little strange.
Also, is that 14% of all your reads, or 14% of your OTUs? Having lots of low quality OTUs without good taxonomy is common. Having lots of unknown reads is more unusual.

I believe it's 14% of reads. I've attached a screenshot of the taxa output at the phylum level (where you can see the "Unassigned;Other" in Red). As I proceed further down the taxonomic rank, the relative percentage values stay the same and map to "Unassigned;Other;Other", etc.

My concern is that having a high % of unassigned reads in some samples will 'push down' the relative percentages of other bacteria (the flip side of this is that if I filter out the unassigned reads, then those that had high % unassigned reads may have the other bacteria relative abundance inflated). Worryingly the % of unassigned reads is higher in some groups than others, meaning which option I choose could have a significant impact on results (see bar graph).

I'm still uncertain which approach would be best, and any advice would be greatly appreciated.

Many thanks,

Matt

Capture.PNG

Capture2.PNG

Colin Brislawn

Jul 5, 2017, 10:25:34 PM

to Qiime 1 Forum

Hello Matt,

Thanks for attaching those bar graphs. Yep, that looks like 14% to me.

You said

My concern is that having a high % of unassigned reads in some samples will 'push down' the relative percentages of other bacteria (the flip side of this is that if I filter out the unassigned reads, then those that had high % unassigned reads may have the other bacteria relative abundance inflated). Worryingly the % of unassigned reads is higher in some groups than others, meaning which option I choose could have a significant impact on results (see bar graph)

I think you are right on all counts. Large numbers of unassigned reads will suppress counts of other bacteria, and normalizing for this between different samples is very hard.
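To make the effect concrete, here is a small Python sketch using invented counts for two samples, one with 3% and one with 29% unassigned reads (the two ends of the range reported above). The helper `relative_abundance` and the sample values are hypothetical and are not taken from the actual data.

```python
def relative_abundance(counts, exclude=()):
    """Return each taxon's fraction of the reads left after dropping `exclude`."""
    kept = {taxon: n for taxon, n in counts.items() if taxon not in exclude}
    total = sum(kept.values())
    return {taxon: n / total for taxon, n in kept.items()}


# Invented counts per 1000 reads, for illustration only.
sample_a = {"Bacteroidetes": 600, "Firmicutes": 370, "Unassigned": 30}
sample_b = {"Bacteroidetes": 426, "Firmicutes": 284, "Unassigned": 290}

for name, sample in (("A", sample_a), ("B", sample_b)):
    with_unassigned = relative_abundance(sample)["Bacteroidetes"]
    without_unassigned = relative_abundance(sample, exclude=("Unassigned",))["Bacteroidetes"]
    print(name, round(with_unassigned, 3), round(without_unassigned, 3))
```

With the unassigned reads included, Bacteroidetes looks much lower in sample B (about 43% versus 60%); with them excluded, the two samples are nearly identical (60% versus about 62%). Which picture is closer to the truth depends on what the unassigned reads actually are.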

So, what should we do?

My first thought is to do quality control before OTU picking (instead of trying to explain extra reads after taxonomy assignment). Basically, I think all these reads coming from unassigned OTUs are very low quality. During your split_libraries step, you can pass a more stringent quality control threshold to remove low quality reads. The default threshold in QIIME is very permissive; a more stringent threshold should pass fewer, better reads, which may cluster into OTUs with known taxonomy. You can set a higher Q score cutoff using a parameter file:

http://qiime.org/scripts/multiple_split_libraries_fastq.html

http://qiime.org/scripts/split_libraries_fastq.html

So your parameter file could look something like this:

split_libraries_fastq:phred_quality_threshold 19

So basically, base calls with Q scores under 20 would be treated as low quality, and reads containing too many of them would be truncated or removed.
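As a rough mental model of what a stricter threshold does, the Python sketch below truncates a read at the first long run of low-quality base calls and discards it if too little remains. It is a simplified illustration, not the actual split_libraries_fastq.py code; the function name `truncate_read` and the `max_bad_run` and `min_fraction` values are assumptions made for the example.

```python
def truncate_read(quals, threshold=19, max_bad_run=3, min_fraction=0.75):
    """Return the kept read length, or None if the read would be discarded.

    A base call counts as low quality when its Phred score is <= threshold.
    The read is cut where the first run of more than max_bad_run low-quality
    calls begins, then dropped if less than min_fraction of it is left.
    """
    bad_run = 0
    kept_length = len(quals)
    for i, q in enumerate(quals):
        if q <= threshold:
            bad_run += 1
            if bad_run > max_bad_run:
                kept_length = i - max_bad_run  # index where the bad run began
                break
        else:
            bad_run = 0
    if kept_length < min_fraction * len(quals):
        return None
    return kept_length


# Made-up quality strings: a good read with a poor tail, and a mostly poor read.
good_with_bad_tail = [30] * 90 + [10] * 10
mostly_bad = [30] * 50 + [10] * 50

print(truncate_read(good_with_bad_tail, threshold=3))   # lenient threshold
print(truncate_read(good_with_bad_tail, threshold=19))  # stricter threshold
print(truncate_read(mostly_bad, threshold=19))
```

Under this model a stricter threshold can only shorten or remove reads, never add them, so read counts should stay the same or go down when the threshold is raised.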

Let me know if this helps. There are other ways to remove low quality reads and also to improve taxonomy assignment, if this method does not work well.

Colin

matt4...@gmail.com

Jul 6, 2017, 8:11:48 PM

to Qiime 1 Forum

Hi Colin,

Thanks again for your help. I've re-run the analysis and it looks very similar to what I was getting before (see screenshot of taxa summary).

My initial thought is that I haven't done the quality score cutoff parameter file correctly (which I've named splitlibparam.txt - see screenshot of that too). I'm running multiple_split_libraries_fastq.py so the text that I placed in the param file is "multiple_split_libraries_fastq:phred_quality_threshold 19 " - have I done something wrong at this point?

If workflow wise everything seems okay, would you recommend redoing and increasing the Quality threshold?

Thanks in advance,

Matt

WORKFLOW: Redoing analysis with quality threshold.

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq$ cat > splitlibparam.txt

multiple_split_libraries_fastq:phred_quality_threshold 19

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq$ multiple_split_libraries_fastq.py -i dbdb_paired2 -o REDONE/SplitLib --include_input_dir_path --remove_filepath_in_name -p splitlibparam.txt

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE$ pick_open_reference_otus.py -i SplitLib/seqs.fna -o otus

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE$ cp otus/otu_table_mc2_w_tax_no_pynast_failures.biom output.biom

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE$ biom summarize_table -i output.biom -o tablesummary.txt

# Note: lowest sample count is 10566

# doing core_diversity analysis:

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE$ core_diversity_analyses.py -o core -i output.biom -m Map.txt -t otus/rep_set.tre -e 10566

TaxaREDONE.PNG

paramfile.PNG

Colin Brislawn

Jul 6, 2017, 8:33:35 PM

to Qiime 1 Forum

Hello Matt,

When passing a parameter file to a workflow script, you should list the name of the internal script called by the workflow script.

In this case, the workflow script is multiple_split_libraries_fastq.py, and the internal script is split_libraries_fastq.py.

So your parameter file should have a line like,

split_libraries_fastq:phred_quality_threshold 19

Try that. It should pass the setting of 19 (up from the default of 3) and that should change the results.
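The naming rule can be checked mechanically. The Python sketch below flags parameter-file lines that use a workflow script's name where the internal script's name is expected. It is a hypothetical helper, not part of QIIME; the mapping covers only the two workflow scripts mentioned in this thread.

```python
# Workflow scripts named in this thread and the internal scripts they call.
WORKFLOW_TO_INTERNAL = {
    "multiple_split_libraries_fastq": "split_libraries_fastq",
    "multiple_join_paired_ends": "join_paired_ends",
}


def check_param_line(line):
    """Return 'ok' or a hint when a line is prefixed with a workflow script name."""
    key, _value = line.split(None, 1)        # "script:option" and its value
    script, _option = key.split(":", 1)
    if script in WORKFLOW_TO_INTERNAL:
        return "use '%s:' instead of '%s:'" % (WORKFLOW_TO_INTERNAL[script], script)
    return "ok"


print(check_param_line("multiple_split_libraries_fastq:phred_quality_threshold 19"))
print(check_param_line("split_libraries_fastq:phred_quality_threshold 19"))
```

The first line is flagged and the second passes, matching the correction described above.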

Let me know what you find,

Colin

matt4...@gmail.com

Jul 9, 2017, 8:39:38 PM

to Qiime 1 Forum

Hi Colin,

Thank you very much for your help. I've redone the analysis using the parameter file exactly as you've suggested (attached, in case I've messed something up); however, I'm still getting a high proportion of Unassigned;Other OTUs at the phylum level (screenshot attached).

I'd greatly appreciate any suggestions re: what to try next (i.e. should I increase the quality threshold?)

Many thanks,

Matt

Workflow:

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq$ multiple_split_libraries_fastq.py -i dbdb_paired2 -o REDONE3/SplitLib --include_input_dir_path --remove_filepath_in_name -p param.txt

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE3$ pick_open_reference_otus.py -i SplitLib/seqs.fna -o otus

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE3$ cp otus/otu_table_mc2_w_tax_no_pynast_failures.biom output.biom

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE3$ biom summarize_table -i output.biom -o tablesummary.txt

# Note: lowest sample count is 10760

# doing core_diversity analysis:

qiime@qiime-190-virtual-box:~/Desktop/dbdb_fastq/REDONE$ core_diversity_analyses.py -o core -i output.biom -m Map.txt -t otus/rep_set.tre -e 10760

Capture - redoing param file.PNG

param.txt

Colin Brislawn

Jul 9, 2017, 9:21:13 PM

to Qiime 1 Forum, Se Jin Song

Thanks for all this detail Matt,

I think you are doing everything right, but those counts are still high. I'm not sure what's going on. Maybe it's something with the Illumina machine, such as PhiX reads or Illumina adapters sneaking into your data.

Or maybe it's still not working for some reason. I noticed this from your most recent script:

Note: lowest sample count is 10760 (with q == 19)

But your original scripts returned

Note: lowest sample count is 10566 (with default of q == 3)

With more strict quality filters, I would expect this number to go down, not up. So that's strange.

Se Jin has been monitoring the forums this week. She's a member of the Knight Lab, and may have other suggestions that I'm missing.
