strainPhlAn2 vs. strainPhlAn and the large number of unclassified strains

XG Yang

Dec 9, 2019, 10:55:13 AM

to MetaPhlAn-users

Hello,

All the strains identified by metaphlan/strainphlan (run through biobakery workflows) on the workflows tutorial data are labeled as unclassified, even without using marker_in_clade 0.01. I understand that this could be due, in part, to sub-sampling of the tutorial data for demonstration purposes, but when I run metaphlan/strainphlan on more than 300 full human gut microbiome samples of my own, ~70% of the identified strains, on average, are labeled as "unclassified".

- I wonder if this is a typical behavior.

- Is there any way of improving strain identification e.g., by modifying the default parameters of strainPhlAn?

- How is strainPhlAn2 different from strainPhlAn? Does strainPhlAn2 have improved strain profiling capabilities that can possibly help with this issue (workflows currently supports strainPhlAn only)?

- Is there any way of having metaphlan run both strainPhlAn and panPhlAn and combine the results at the end to improve strain profiling?

Another strainPhlAn question: Is there a convenient way of converting the RefSeq assembly accession ids (GCF_###) that strainPhlAn reports for the identified strains to the actual strain names (e.g., instead of GCF_000173975, report Anaerobutyricum hallii DSM 3353)? Accession ids (GCF_###) are not very useful when presenting the data to biologists.

Aitor Blanco-Miguez

Dec 9, 2019, 12:36:07 PM

to MetaPhlAn-users

Hi Yang,
Thanks for getting in touch.

- Currently, MetaPhlAn2 strain-level profiling is only available in some particular cases, which is why most of the strains identified by the tool are labelled as unclassified.
To identify (rather than profile) known strains in your analysis, you need to add them as reference genomes together with your samples when StrainPhlAn is executed.

- StrainPhlAn version 2 is not a stable version and is still under development, so its use is not recommended. In any case, the behavior of the tool will be the same for strain labelling.

To answer your last question: changing the file names of your reference genomes before they are processed by StrainPhlAn will solve your problem, e.g. from GCF_000173975.fna to Anaerobutyricum_hallii_DSM_3353.fna. However, if you mean the strain labels reported by the MetaPhlAn profiling, there is no quick method to do it, but if you are really interested, it is possible to modify the database following this tutorial: https://bitbucket.org/biobakery/metaphlan2/src/2.9/#markdown-header-customizing-the-database
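For a larger set of reference genomes, the renaming described above could be scripted. The following is only a minimal sketch, not part of StrainPhlAn: the function name rename_reference_genomes, the .fna suffix default, and the accession-to-name mapping are assumptions for illustration, and the mapping itself has to be supplied by you (for example, compiled by hand or from NCBI assembly metadata).

```python
from pathlib import Path


def rename_reference_genomes(genome_dir, accession_to_name, suffix=".fna"):
    """Rename <accession><suffix> files to <strain_name><suffix>.

    accession_to_name is a user-supplied dict (an assumption of this sketch),
    e.g. {"GCF_000173975": "Anaerobutyricum hallii DSM 3353"}.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(genome_dir).glob(f"*{suffix}")):
        accession = path.name[: -len(suffix)]
        name = accession_to_name.get(accession)
        if name is None:
            continue  # no known strain name: leave the file untouched
        # Replace spaces and other awkward characters with underscores.
        safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
        target = path.with_name(safe + suffix)
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        path.rename(target)
        renamed.append(target)
    return renamed


# Example call (the directory name is hypothetical):
# rename_reference_genomes(
#     "reference_genomes",
#     {"GCF_000173975": "Anaerobutyricum hallii DSM 3353"},
# )
```

Files whose accession is missing from the mapping keep their original names, so the script can be re-run as the mapping grows.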

Best,

Aitor

XG Yang

Dec 9, 2019, 1:16:26 PM

to MetaPhlAn-users

Thanks so much for the feedback, Aitor! Any ideas on my question regarding integrating strainphlan and panphlan to improve strain profiling? Additionally, are there any other good alternative methods for strain profiling?

Aitor Blanco-Miguez

Dec 10, 2019, 9:24:34 AM

to MetaPhlAn-users

Hi again,

It is possible to integrate the results from StrainPhlAn and PanPhlAn in order to get more robust results. It is a bit tricky, but you can do it by comparing the normalised pairwise distances of the phylogenetic tree generated by StrainPhlAn against a pairwise distance matrix computed from the gene presence-absence matrix of PanPhlAn. However, I don't think this will be the solution you are looking for.
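As a rough illustration of the comparison described above, here is a minimal sketch under several assumptions: the pairwise tree distances have already been extracted from the StrainPhlAn tree into a dict keyed by sorted sample pairs, the PanPhlAn matrix has been loaded as one 0/1 list per sample, Jaccard distance is used for the gene profiles, and agreement is summarised with a plain Pearson correlation. None of these choices comes from the tools themselves, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def jaccard_distance(a, b):
    """Jaccard distance between two gene presence/absence profiles (0/1 lists)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either vector has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def compare_distances(tree_dist, gene_profiles):
    """Compare tree distances with gene-content distances over all sample pairs.

    tree_dist: {(s1, s2): distance} with s1 < s2 (assumed pre-extracted from the tree).
    gene_profiles: {sample: [0/1, ...]} (assumed pre-loaded from the PanPhlAn matrix).
    Returns (normalised tree distances, Jaccard distances, Pearson r).
    """
    pairs = list(combinations(sorted(gene_profiles), 2))
    tree = [tree_dist[p] for p in pairs]
    longest = max(tree)
    # Normalise by the largest distance so values fall in [0, 1].
    tree_norm = [d / longest for d in tree] if longest else tree
    gene = [jaccard_distance(gene_profiles[a], gene_profiles[b]) for a, b in pairs]
    return tree_norm, gene, pearson(tree_norm, gene)
```

A high correlation would suggest that the SNP-based and gene-content-based views agree on which samples carry similar strains. Because the pairwise distances are not independent, a permutation-based Mantel test would be a more rigorous way to judge significance than the raw correlation.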

Alternatively, you can try metaMLST (http://segatalab.cibio.unitn.it/tools/metamlst/) if the species you are interested in are available in the tool. Maybe this would fit your case better.

I hope this helps.

Best.

Aitor
