Re: ChocoPhlAn how-to?


Curtis Huttenhower

Jun 22, 2015, 11:12:38 AM
to Sheri Simmons, metaphl...@googlegroups.com
Hi all - I was just chatting with Sheri and wanted to check to see whether someone could help her out with this?  My understanding is that it's possible with MetaPhlAn2 but not necessarily straightforward (and I don't think we've documented the process yet?)

Thanks a bunch -
Curtis

On Wed, May 20, 2015 at 11:18 AM, Sheri Simmons <sher...@gmail.com> wrote:
Hi all
I've read through the group archives on the issue of creating a custom clade-specific marker database using ChocoPhlAn. So far I am still unclear on how to actually do this. I've read through the instructions for MetaPhlAn2 on Bitbucket but can't find a specific reference to ChocoPhlAn. I am likely missing something obvious, and wonder if someone can point me to the correct place?

Sheri


Nicola Segata

Jun 22, 2015, 12:28:44 PM
to Curtis Huttenhower, Sheri Simmons, metaphl...@googlegroups.com
Hi all,
 sorry for the delay in replying.

With ChocoPhlAn the issue is twofold: we have to continuously update the genome retrieval procedure (RepoPhlAn) to reflect changes in the NCBI FTP data structure, and the marker identification algorithm is set up for high-performance computing facilities. As a consequence, the code is not yet general enough to be easily usable by MetaPhlAn2 users. We are working on this (and I'm aware I said that several months ago as well :).

In the meantime, we do have an easy way to include additional user-supplied markers in the MetaPhlAn2 database. This has been implemented and described by Tin here:
With the above procedure you can add markers to the DB, but the markers themselves have to be identified by the user.

For finding markers, the basic strategy is to cluster the genes of the genomes of interest with usearch at an identity threshold of 80%. Then only the gene clusters specific to the species of interest should be retained, and these should be further screened by removing genes with hits outside the species of interest (one can do this even by blasting against the NCBI NT database). A single representative of each retained cluster is what we call a marker. In addition to the sequence, you need some additional information, such as the taxonomy of the corresponding clade and the length of the marker. With this information you have everything you need for adding your markers to the system as described at the link above.
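The filtering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ChocoPhlAn code: it assumes the clustering (e.g. with usearch at 80% identity) and the external search (e.g. BLAST against NT) have already been run and summarized into plain dictionaries. The function name, the dictionary layouts, and the choice of the alphabetically first gene as cluster representative are all hypothetical.

```python
from collections import defaultdict


def select_candidate_markers(gene_to_cluster, gene_to_species,
                             target_species, external_hits):
    """Return {cluster_id: representative_gene} for candidate markers.

    gene_to_cluster: gene id -> cluster id (hypothetical summary of the
                     clustering output)
    gene_to_species: gene id -> species label of the source genome
    external_hits:   gene id -> number of hits outside the target species
                     (hypothetical summary of a BLAST-like search)
    """
    cluster_species = defaultdict(set)
    cluster_genes = defaultdict(list)
    for gene, cluster in gene_to_cluster.items():
        cluster_species[cluster].add(gene_to_species[gene])
        cluster_genes[cluster].append(gene)

    markers = {}
    for cluster, species in cluster_species.items():
        # Keep only clusters whose genes all come from the target species.
        if species != {target_species}:
            continue
        # Assumption: pick a deterministic but otherwise arbitrary representative.
        representative = sorted(cluster_genes[cluster])[0]
        # Drop representatives that hit anything outside the target species.
        if external_hits.get(representative, 0) > 0:
            continue
        markers[cluster] = representative
    return markers


# Toy input: c1 is shared with another species, c3 has external hits.
example_clusters = {"a1": "c1", "a2": "c1", "b1": "c1",
                    "a3": "c2", "a4": "c2", "a5": "c3"}
example_species = {"a1": "s__A", "a2": "s__A", "a3": "s__A",
                   "a4": "s__A", "a5": "s__A", "b1": "s__B"}
example_hits = {"a5": 2}
print(select_candidate_markers(example_clusters, example_species,
                               "s__A", example_hits))
```

In a real run the surviving representatives would still need their sequence, length, and clade taxonomy collected before they can be added to the database.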

I hope this helps, please let me know if you have other questions or comments.
many thanks
Nicola






Sheri Simmons

Jun 22, 2015, 9:13:22 PM
to Nicola Segata, Curtis Huttenhower, metaphl...@googlegroups.com
Thanks very much for these instructions. I assume that I can extract the correct format for taxonomy by looking at the existing DB but are there any specific watchouts on taxonomy formatting, or on marker gene header/name formatting? And could you point me to a definition of the marker score?

Sheri

Nicola Segata

Jun 23, 2015, 2:49:37 PM
to Sheri Simmons, Curtis Huttenhower, metaphl...@googlegroups.com
Hi Sheri,
 correct, you can look at the other taxonomic labels to get an idea of how to specify the labels for new clades. Formally, every level has to start with its prefix ("k__", "p__", and so on), and you cannot introduce any intermediate levels.
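A quick sanity check for new labels could look like the sketch below. It relies on assumptions that are not stated in this thread: that levels are joined with "|", that the full prefix order is k, p, c, o, f, g, s with an optional trailing t (strain), and that a label may stop early but may not skip a level. The function name is hypothetical; comparing against labels in the existing DB remains the authoritative check.

```python
# Assumed prefix order; "t" (strain) is treated as an optional last level.
LEVEL_PREFIXES = ["k", "p", "c", "o", "f", "g", "s", "t"]


def is_valid_taxonomy(label, sep="|"):
    """Check that each level carries the expected prefix, in order."""
    parts = label.split(sep)
    if len(parts) > len(LEVEL_PREFIXES):
        return False  # more levels than the fixed hierarchy allows
    for prefix, part in zip(LEVEL_PREFIXES, parts):
        tag = prefix + "__"
        # Each level must start with its tag and have a non-empty name.
        if not part.startswith(tag) or len(part) == len(tag):
            return False
    return True


print(is_valid_taxonomy("k__Bacteria|p__Firmicutes|c__Bacilli"))
print(is_valid_taxonomy("k__Bacteria|c__Bacilli"))
```

The second call skips the phylum level, so the check rejects it.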

The core score is defined as follows:

import scipy.stats as st
# Binomial CDF at ok: P(X <= ok) for X ~ Binomial(tot, pr)
core_score = 1.0 - st.binom.sf(ok, tot, pr)

where ok is the number of taxa in the clade with the marker, tot is the total number of taxa in the clade, and pr is the success_rate (i.e. 1 minus the estimated probability of not seeing a marker in a taxon due to problems with the assembly, gene calling, etc.). We usually set pr to 0.95. The uniqueness score is defined as the inverse of the number of external hits, normalized to the [0, 1] interval. The marker score is basically the combination (by multiplication) of the two, with some additional special cases for clades with unclear taxonomy or a high probability of genome mislabelling.
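To make the definitions above concrete, here is a minimal standard-library sketch. The core score uses the identity 1 - sf(ok) = cdf(ok), so it computes the same quantity as the SciPy expression without needing SciPy. The uniqueness formula 1 / (1 + external_hits) is only an assumption about how the "inverse, normalized to [0, 1]" is realized, and the special cases for unclear taxonomy or mislabelled genomes are not modelled. All function names are hypothetical.

```python
from math import comb


def core_score(ok, tot, pr=0.95):
    """Binomial CDF P(X <= ok), X ~ Binomial(tot, pr); equals 1 - sf(ok)."""
    return sum(comb(tot, i) * pr**i * (1.0 - pr)**(tot - i)
               for i in range(ok + 1))


def uniqueness_score(external_hits):
    """Assumed normalization: 1.0 with no external hits, decaying toward 0."""
    return 1.0 / (1.0 + external_hits)


def marker_score(ok, tot, external_hits, pr=0.95):
    """Product of the two scores; special cases from the post are omitted."""
    return core_score(ok, tot, pr) * uniqueness_score(external_hits)


# A marker found in all 10 genomes of a clade, with no external hits,
# versus one found in 8 of 10 genomes with 3 external hits.
print(marker_score(10, 10, 0))
print(marker_score(8, 10, 3))
```

Under these assumptions both factors lie in [0, 1], so their product does as well.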

I hope it helps!
thanks
Nicola

Liyang Diao

Sep 2, 2016, 2:00:00 PM
to MetaPhlAn-users, sher...@gmail.com, chut...@hsph.harvard.edu, nicola...@unitn.it
Hi Nicola,

I realize that this response was posted a while ago, but I was looking at the MetaPhlAn2 database (db_v20) and found that the marker scores are highly correlated with (though often not equal to) the length of the external hits. If I understand your response above correctly, the marker scores should be between 0 and 1. Could you please let me know if I'm missing something?

Thank you so much,
Liyang