"unclassified" species

Sergey

Dec 11, 2012, 8:37:33 AM
to metaphl...@googlegroups.com
What does it mean when MetaPhlAn reports some taxonomic units as "unclassified"? For example:
s__Blattabacterium_unclassified 0.02393. This sometimes happens at the genus and species levels, for example in samples from ocean microbial communities.

Sergey

Nicola Segata

Dec 11, 2012, 9:24:00 AM
to metaphl...@googlegroups.com
Hi Sergey,
  s__Blattabacterium_unclassified means that an organism in the genus g__Blattabacterium was found, but it likely does not belong to any known, named species of Blattabacterium. In the specific case of Blattabacterium, it seems no species names are available at all! The same applies to the other species and taxonomic levels. Does it make sense?

thanks
Nicola
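
The naming convention described in the reply above can be sketched in a few lines of Python. This is only an illustration, not part of MetaPhlAn: the function name `parse_clade` is hypothetical, and the sketch assumes that a full lineage, if present, is separated by `|` and that the level prefix is a single letter followed by `__`.

```python
UNCLASSIFIED_SUFFIX = "_unclassified"


def parse_clade(label):
    """Split a clade label into (level, name, parent).

    `parent` is the enclosing clade for "<parent>_unclassified" entries
    and None for named clades.
    """
    # Assumption: a full lineage is '|'-separated; keep only its last element.
    leaf = label.strip().split("|")[-1]
    level, sep, name = leaf.partition("__")
    if not sep:  # no "x__" prefix, e.g. a bare "unclassified" line
        level, name = None, leaf
    parent = None
    if name.endswith(UNCLASSIFIED_SUFFIX):
        parent = name[: -len(UNCLASSIFIED_SUFFIX)]
    return level, name, parent


for label in ("s__Blattabacterium_unclassified", "g__Blattabacterium"):
    level, name, parent = parse_clade(label)
    if parent is None:
        print(f"{label}: named clade at level '{level}'")
    else:
        print(f"{label}: level '{level}', no known named clade within {parent}")
```

Here `s__Blattabacterium_unclassified` is read as a species-level entry whose enclosing genus is Blattabacterium, while `g__Blattabacterium` is the genus itself.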

J Sung

Feb 27, 2013, 1:37:09 AM
to metaphl...@googlegroups.com
Hi Nicola,

I ran MetaPhlAn (using bowtie2db) on a few human gut microbiome FASTQ samples from a study by Qin et al. (Nature 2012, A metagenome-wide association study of gut microbiota in type 2 diabetes). I have two questions that follow on from this thread:

1) When I used your recommended "--bt2_ps very-sensitive" or "--bt2_ps sensitive" options, the output file containing the relative abundances shows me a single line of "unclassified 100.0". I was very surprised to see this. Can you please provide any comments on how this may have happened, and possibly any recommendations?

(I'm pretty sure all my files are installed correctly, and I have made sure of this while I was getting to know the program with your example files posted online, e.g. LC1.fna)

2) I next used the default "--bt2_ps" option, i.e. "very-sensitive-local", and I got relative abundances covering a wide range of taxa (closer to what you would expect). However, when I look at the relative abundances at the species level, the most abundant species came out as "s__bacteroides_unclassified" at 30-40% in my samples. Is there a way in MetaPhlAn to get around this, such as forcing the program to find the known species most closely related to this "unclassified" one? Or am I stuck with this "unclassified" species as the most dominant at the species level? I would be most grateful for any comments and recommendations.

Are you surprised by these results, or can these things likely happen? Also, please let me know if you suspect any kind of technical error on my end, and I'll check, just in case.

Thank you.

Jaeyun

Nicola Segata

Feb 27, 2013, 4:28:17 AM
to metaphl...@googlegroups.com
Hi Jaeyun,
  the problem is that the gut microbiome samples from Qin et al. are stored with the two paired ends merged into a single read. You therefore need to split the ends before running MetaPhlAn; otherwise only a tiny fraction of the reads are actually mapped.

Using the SRA Toolkit, you need to specify the "--split-spot" option. I also recommend using the 'sensitive' mapping policy, and if you want to avoid storing the uncompressed file on disk, you can pipe the uncompressed reads to MetaPhlAn directly. Practically speaking, here is the command line I would use:
$ fastq-dump -Z sample_name.sra --split-spot | metaphlan.py --input_type multifastq --bowtie2out sample_name.bt2out --bt2_ps sensitive --bowtie2db bowtie2db/mpa  > sample_name.out.txt

With the "pipe" strategy above, Bowtie2 cannot use multithreading. But if you need to profile several samples, you can launch multiple samples in parallel.

many thanks
Nicola
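
One way to launch several samples in parallel, as suggested above, is a small Python wrapper around the same command line. This is a minimal sketch under stated assumptions: the helper names `build_command` and `profile_samples` are hypothetical, each sample is assumed to exist as `<sample>.sra` in the working directory, and `fastq-dump` and `metaphlan.py` are assumed to be on the `PATH`. The flags are exactly those from the command quoted in the reply.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(sample, db="bowtie2db/mpa"):
    """Build the fastq-dump | metaphlan.py pipeline for one sample name."""
    return (
        f"fastq-dump -Z {shlex.quote(sample + '.sra')} --split-spot | "
        f"metaphlan.py --input_type multifastq "
        f"--bowtie2out {shlex.quote(sample + '.bt2out')} --bt2_ps sensitive "
        f"--bowtie2db {shlex.quote(db)} > {shlex.quote(sample + '.out.txt')}"
    )


def profile_samples(samples, max_parallel=4):
    """Run one pipeline per sample, at most `max_parallel` at a time.

    Returns {sample: exit status of the shell pipeline}.
    """
    def run(sample):
        # shell=True is needed because the command contains a pipe and a redirect.
        return subprocess.run(build_command(sample), shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(zip(samples, pool.map(run, samples)))
```

Calling `profile_samples(["sample_a", "sample_b"])` (hypothetical sample names) would start both pipelines concurrently. Threads are sufficient here because the work is done by the external processes, and each pipeline still runs Bowtie2 single-threaded.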

Flo

Oct 30, 2016, 10:26:09 PM
to MetaPhlAn-users
Hello,

I have a similar question.
I am working on different microbiomes (saliva, ocular and skin) and I have an issue with the taxonomic assignment of the metaphlan2 pipeline at the species level.
The output contains a lot of species of the form genusname_unclassified.

I know that when metaphlan2 can classify some reads at the species level but not all of them, the others are pooled as genus_unclassified. In my case, however, I only have genus and genus_unclassified, with the same proportion. This means that metaphlan could not classify the reads at the species level, so I am quite confused about why the output mentions a species classification.

I am using cutadapt to trim the sequences (-q20, -m 50) and bowtie2 to remove the reads that match the human genome. I do not stitch the reads because only 60% of them overlap, so I just concatenate them before using metaphlan2.

Do you have any idea why so many reads are unassigned at the species level, and why metaphlan reports them this way? I would expect it to report only the genus in these cases.
How could I improve the quality of my analysis?

Example (attached): within the Acidobacteria family there are two genera identified, Granulicella and Terriglobus. So I understand that there are some Acidobacteria_unclassified genera. But within the genera Granulicella and Terriglobus there are no identified species, so I do not understand why the report contains s__Granulicella_unclassified and g__Terriglobus_unclassified.


Thanks a lot!


Florentin
data.txt

Nicola Segata

Oct 31, 2016, 5:18:01 AM
to Flo, MetaPhlAn-users
Hi Florentin,
  if you have "s__genusname_unclassified", this refers to an "unknown" species in the genus. If its abundance is the same as that of "g__genusname", it means that no other known species in the genus was detected. Note, however, that "s__genusname_unclassified" and "g__genusname" are two different taxonomic levels. It is true that when no known species are identified in a genus we could avoid reporting "s__genusname_unclassified" altogether, but we prefer to report it for consistency when post-processing the results. Does it make sense?

One detail: when you mention that you concatenate the reads, you mean that you concatenate the files, right? Not that the two ends of the reads are joined, otherwise this would be a problem.

I hope this helps
cheers
Nicola
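
The rule in the reply above (equal abundance for "g__genusname" and "s__genusname_unclassified" means no known species was detected in that genus) can be checked with a short script. This is a sketch, not MetaPhlAn code: the function names are hypothetical, the profile is assumed to be two whitespace-separated columns (clade, abundance) with optional `#` comment lines, and the numbers in the example are invented for illustration.

```python
def read_profile(lines):
    """Return {leaf clade label: abundance} from profile lines."""
    profile = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Assumption: full lineages are '|'-separated; keep the last element.
        profile[fields[0].split("|")[-1]] = float(fields[1])
    return profile


def genera_without_known_species(profile, tol=1e-6):
    """Genera whose whole abundance sits in 's__<genus>_unclassified'."""
    result = []
    for clade, abundance in profile.items():
        if not clade.startswith("g__"):
            continue
        genus = clade[len("g__"):]
        unclassified = profile.get(f"s__{genus}_unclassified")
        if unclassified is not None and abs(unclassified - abundance) <= tol:
            result.append(genus)
    return sorted(result)


# Invented abundances, for illustration only.
example = """\
g__Granulicella\t1.5
s__Granulicella_unclassified\t1.5
g__Terriglobus\t0.5
s__Terriglobus_unclassified\t0.5
g__Bacteroides\t40.0
s__Bacteroides_vulgatus\t10.0
s__Bacteroides_unclassified\t30.0
""".splitlines()

print(genera_without_known_species(read_profile(example)))
```

In this made-up profile, Bacteroides is not listed because part of its abundance is assigned to a named species.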



Flo

Oct 31, 2016, 11:30:59 PM
to MetaPhlAn-users, florentinc...@gmail.com, nicola...@unitn.it
Hi Nicola,

Thanks for your answer.
Yes, I meant that I concatenate the R1 and R2 files into one single file.

I personally find it confusing to report those unassigned species, because when I import the data into R and extract the species (s__), the truly unassigned species (those for which g__genusname and s__genusname_unclassified have the same relative abundance) are extracted too and taken into account in downstream diversity measures (Shannon, richness) or Bray-Curtis community dissimilarities.
It might be more informative to remove those entries (where g__genusname and s__genusname_unclassified have the same relative abundance), as they are not defined at this taxonomic level.

Does it make sense?
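
To make the concern concrete, here is a rough Python sketch (not the R workflow mentioned above) of how richness and the Shannon index change when the "_unclassified" species entries are kept or dropped. The function names are hypothetical, the abundances are invented, and the Shannon index is computed with the natural logarithm after renormalizing the retained entries to sum to 1.

```python
import math


def shannon(abundances):
    """Shannon index (natural log) of a list of non-negative abundances."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def species_entries(profile, keep_unclassified=True):
    """Select species-level (s__) entries, optionally dropping *_unclassified."""
    return {
        clade: value
        for clade, value in profile.items()
        if clade.startswith("s__")
        and (keep_unclassified or not clade.endswith("_unclassified"))
    }


# Invented abundances, for illustration only.
profile = {
    "g__Bacteroides": 60.0,
    "s__Bacteroides_vulgatus": 60.0,
    "g__Granulicella": 25.0,
    "s__Granulicella_unclassified": 25.0,
    "g__Terriglobus": 15.0,
    "s__Terriglobus_unclassified": 15.0,
}

for keep in (True, False):
    species = species_entries(profile, keep_unclassified=keep)
    richness = len(species)
    diversity = shannon(list(species.values()))
    print(f"keep_unclassified={keep}: richness={richness}, shannon={diversity:.3f}")
```

With the placeholders kept, each "_unclassified" entry counts as one species in both measures, which is the effect described above.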

Nicola Segata

Nov 2, 2016, 3:38:01 AM
to Flo, MetaPhlAn-users
Hi Florentin,
 yes, it makes sense. For other analyses, though, we look at a single taxonomic level only (e.g. the species level), and if we don't add the "unclassified" entries, in some cases the sum of the abundances is very far from 100%, which causes several downstream problems. Maybe you could use a simple script to remove the "unclassified" clades before proceeding with your analysis?

Nicola
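
A simple script of the kind suggested above might look like the following. It is a sketch under assumptions rather than an official tool: the function names are hypothetical, the profile is assumed to have the clade in the first whitespace-separated column and the abundance in the second, and the example values are invented. It also prints the species-level total before and after filtering, which illustrates why the "unclassified" entries are reported in the first place.

```python
def drop_unclassified(lines):
    """Return the profile lines without clades ending in '_unclassified'."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            clade = stripped.split()[0]
            # Assumption: full lineages are '|'-separated; test the last element.
            if clade.split("|")[-1].endswith("_unclassified"):
                continue
        kept.append(line)  # comments, blank lines and named clades are kept
    return kept


def level_total(lines, prefix="s__"):
    """Sum the abundances of all clades at one level (default: species)."""
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0].split("|")[-1].startswith(prefix):
            total += float(fields[1])
    return total


# Invented abundances, for illustration only.
lines = [
    "# hypothetical header line\n",
    "g__Bacteroides\t60.0\n",
    "s__Bacteroides_vulgatus\t60.0\n",
    "g__Granulicella\t40.0\n",
    "s__Granulicella_unclassified\t40.0\n",
]

filtered = drop_unclassified(lines)
print("species-level total before:", level_total(lines))
print("species-level total after: ", level_total(filtered))
```

After filtering, the species-level abundances no longer sum to 100%, so any analysis that needs proportions would have to renormalize or work at the genus level for those clades.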

Flo

Nov 22, 2016, 12:49:14 AM
to MetaPhlAn-users, florentinc...@gmail.com, nicola...@unitn.it
Dear Nicola,

Yes that makes sense. Thank you!

Flo