Hi all,
sorry for the delay in replying.
With ChocoPhlAn, the issue is twofold: we have to continuously update the genome retrieval procedure (RepoPhlAn) to reflect changes in the NCBI FTP data structure, and the marker identification algorithm is set up for high-performance computing facilities. As a consequence, the code is not yet general enough to be easily usable by MetaPhlAn2 users. We are working on this (and I'm aware I said that several months ago as well :).
In the meantime, we do have an easy way to add user-supplied markers to the MetaPhlAn2 database. This has been implemented and described by Tin here:
With the above procedure you can add markers to the DB, but the markers themselves have to be identified by the users.
For finding markers, the basic strategy is to cluster the genes of the genomes of interest with usearch at an identity threshold of 80%. Then only genes specific to the species of interest should be retained, and these are further screened by removing genes with hits outside the species of interest (one can do this even by BLASTing against the NCBI NT database). A single representative of each retained cluster is what we call a marker. In addition to the sequence, you need some additional information, such as the taxonomy of the corresponding clade and the length of the marker. With this information you have everything you need for adding your markers to the system as described at the link above.
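To make the filtering step more concrete, here is a rough sketch in Python of how the clusters could be screened once the clustering and the outside-species search have already been run. This is not the ChocoPhlAn code. The record types, the function name, the toy sequences and the taxonomy string are all made up for illustration. Two choices are my own assumptions: a cluster is dropped as soon as any of its genes has an outside hit, and the longest member is used as the representative.

```python
from collections import namedtuple

# Hypothetical record types, used only for this illustration.
Gene = namedtuple("Gene", ["gene_id", "species", "sequence"])
Marker = namedtuple("Marker", ["gene_id", "taxonomy", "length", "sequence"])


def pick_markers(clusters, target_species, taxonomy, genes_with_outside_hits):
    """Return one Marker per cluster that is specific to target_species.

    clusters: dict mapping a cluster id to a list of Gene records
        (e.g. parsed from the output of an 80% identity clustering).
    genes_with_outside_hits: set of gene ids that hit sequences outside
        the target species in a separate search (e.g. BLAST against NT).
    """
    markers = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        # Keep only clusters whose genes all come from the target species.
        if any(g.species != target_species for g in members):
            continue
        # Assumption: drop the whole cluster if any member has an outside hit.
        if any(g.gene_id in genes_with_outside_hits for g in members):
            continue
        # Assumption: take the longest member as the cluster representative.
        rep = max(members, key=lambda g: len(g.sequence))
        markers.append(
            Marker(rep.gene_id, taxonomy, len(rep.sequence), rep.sequence)
        )
    return markers


# Toy input: three clusters, only the first one should survive.
clusters = {
    "c1": [Gene("g1", "s__A", "ATGAAA"), Gene("g2", "s__A", "ATGAAATTT")],
    "c2": [Gene("g3", "s__A", "ATGCCC"), Gene("g4", "s__B", "ATGCCG")],
    "c3": [Gene("g5", "s__A", "ATGGGG")],
}
# Hypothetical taxonomy label for the clade the markers belong to.
taxonomy = "k__Bacteria|s__A"
markers = pick_markers(clusters, "s__A", taxonomy, {"g5"})

for m in markers:
    print(m.gene_id, m.taxonomy, m.length)
```

In the toy data, c2 is dropped because it mixes two species and c3 is dropped because g5 has an outside hit, so only the representative of c1 is left, together with the taxonomy and length that you need when adding it to the DB.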
I hope this helps. Please let me know if you have other questions or comments.
many thanks
Nicola