About the ChocoPhlAn pangenome database

Fix Ace

May 24, 2016, 9:54:48 AM
to humann...@googlegroups.com
Hello, there,

I am a new user of HUMAnN. Just wondering where I would be able to find the references for the ChocoPhlAn pangenome database?

Thanks.

A.

Eric Franzosa

May 27, 2016, 11:18:49 AM
to humann...@googlegroups.com
Greetings!

The ChocoPhlAn pangenomes were built by clustering coding sequences from NCBI genomes. If you look at the headers in an individual ChocoPhlAn file, you'll find that the original NCBI headers / accession numbers are still included. Perhaps these would be useful to you for getting back to the original genomes?

Example:
>gi|554770365|gb|ACIN03000016.1|:34743-35330|additional|fields|added|for|humann2
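For anyone who wants to pull the NCBI identifiers out programmatically, here is a minimal sketch. It assumes every header follows the pipe-delimited layout of the example above (gi number, database tag, accession, then a `:start-end` coordinate field); the function name and the handling of a leading `c` for complement-strand coordinates are assumptions, not something confirmed in this thread.

```python
def parse_chocophlan_header(header):
    """Split a ChocoPhlAn FASTA header into NCBI identifiers and coordinates.

    Assumed layout (taken from the single example in this thread):
    >gi|<gi number>|<db>|<accession>|:<start>-<end>|<extra fields...>
    """
    fields = header.strip().lstrip(">").split("|")
    gi, db, accession = fields[1], fields[2], fields[3]

    coords = fields[4].lstrip(":")
    # Assumption: a leading "c" marks a complement-strand feature.
    complement = coords.startswith("c")
    start, end = coords.lstrip("c").split("-")

    return {
        "gi": gi,
        "db": db,
        "accession": accession,
        "start": int(start),
        "end": int(end),
        "complement": complement,
        "extra": fields[5:],  # annotation fields appended for HUMAnN2
    }


example = ">gi|554770365|gb|ACIN03000016.1|:34743-35330|additional|fields|added|for|humann2"
record = parse_chocophlan_header(example)
print(record["accession"], record["start"], record["end"])
```

The accession (here `ACIN03000016.1`) is the piece you would search for at NCBI to get back to the source genome record.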

Thanks,
Eric


capricy gao

Jun 2, 2016, 10:04:44 AM
to HUMAnN Users
Thanks for your post. The additional fields include UniRef90 and UniRef50. Does that mean they were derived by clustering with UniRef90 and UniRef50?

My understanding is that UniRef90 and UniRef50 are both peptide databases; however, ChocoPhlAn contains nucleotide sequences. How did the clustering work in that case?

Thanks a lot!

C.

Eric Franzosa

Jun 2, 2016, 10:16:48 AM
to humann...@googlegroups.com
We annotated ChocoPhlAn against UniRef50/90 by translating the coding sequences and performing a protein-level search. If a protein's best hit to a UniRef90 centroid had >90% identity, then we annotated it to that centroid (likewise for a >50% identity best hit and UniRef50).
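The assignment rule described above can be sketched roughly as follows. This is only an illustration of the identity thresholds mentioned in this post, not the actual annotation pipeline: the function name, the input format, and the example IDs are hypothetical, and any additional criteria (such as alignment coverage) are not covered here.

```python
def annotate_best_hit(best_hit, min_identity):
    """Return the centroid ID if the best hit passes the identity threshold.

    best_hit: (centroid_id, percent_identity) tuple, or None if no hit.
    min_identity: 90 for UniRef90, 50 for UniRef50 (thresholds from the post).
    """
    if best_hit is None:
        return None
    centroid_id, percent_identity = best_hit
    # Strictly greater than the threshold, as stated in the post.
    return centroid_id if percent_identity > min_identity else None


# Hypothetical best hits for one translated coding sequence.
hit_90 = ("UniRef90_EXAMPLE1", 87.5)  # below the 90% cutoff
hit_50 = ("UniRef50_EXAMPLE2", 87.5)  # above the 50% cutoff

annotation = {
    "UniRef90": annotate_best_hit(hit_90, 90),
    "UniRef50": annotate_best_hit(hit_50, 50),
}
print(annotation)
```

In this made-up case the sequence would carry a UniRef50 annotation but no UniRef90 annotation.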

Thanks,
Eric


capricy gao

Jun 3, 2016, 10:36:56 AM
to HUMAnN Users
Does this mean that the ChocoPhlAn database contains all the bacterial CDSs, based on UniRef annotation?

Thanks.

C.

Eric Franzosa

Jun 3, 2016, 10:42:21 AM
to humann...@googlegroups.com
The CDSs are based on the original (depositor-supplied) genome annotations from NCBI. We then mapped these sequences to UniRef ourselves using the same thresholds UniProt uses for building UniRef. For a future release, we are considering redoing all of the gene calling ourselves so that it's more consistent across genomes.

Thanks,
Eric


hxz...@gmail.com

Jan 18, 2017, 4:55:33 PM
to HUMAnN Users
Sorry to bump an old thread.

There are 4187 pangenomes in the ChocoPhlAn database. Some of my colleagues say that 4187 is too few for them to find precise hits for their novel metagenomes. What should I say to convince them otherwise?

Husen

Eric Franzosa

Jan 18, 2017, 5:09:50 PM
to humann...@googlegroups.com
Hmm... Not sure I completely understand the question? Your colleagues are expecting more microbial species to be represented? 4K remains right around the number in the NCBI "representative genomes" collection:


And of course, if a community member is _not_ represented in ChocoPhlAn, its reads can still be mapped by translated search to UniRef50/90, after which approximate taxonomy can be inferred with humann2_infer_taxonomy.

Thanks,
Eric


hxz116

Jan 18, 2017, 6:01:48 PM
to humann...@googlegroups.com
Thanks, Eric. This is very helpful.

Husen Zhang

Jan 19, 2017, 10:20:20 AM
to humann...@googlegroups.com
Hi Eric,
I guess I am confused about the number of bacterial genomes in
ChocoPhlAn, which is around 4K as you mentioned, and the 13K bacterial
genomes mentioned on the MetaPhlAn2 Bitbucket page:

https://bitbucket.org/biobakery/metaphlan2#markdown-header-description

I admit that I haven't read the manual. Where can I download the 13K
bacterial genomes?

Thanks,

Husen

Eric Franzosa

Jan 19, 2017, 11:11:30 AM
to humann...@googlegroups.com
Hi Husen,

To clarify, those same 13K genomes are all represented in ChocoPhlAn. The 13K genomes vs 4K pangenomes difference is due to species that have multiple sequenced isolate genomes. In those cases, the coding sequences from a given species across isolate genomes are clustered into a _single_ pangenome.
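As a toy illustration of why the genome count and the pangenome count differ, here is a small sketch. The genome IDs, species names, and sequences are all made up, and exact-match deduplication is used only as a stand-in for real sequence clustering; it is not how ChocoPhlAn was actually built.

```python
from collections import defaultdict

# Hypothetical isolate genomes: (genome_id, species, coding sequences)
genomes = [
    ("genome_A1", "Species A", ["ATGAAA", "ATGCCC"]),
    ("genome_A2", "Species A", ["ATGAAA", "ATGGGG"]),
    ("genome_B1", "Species B", ["ATGTTT"]),
]


def build_pangenomes(genomes):
    """Pool coding sequences from all isolates of a species into one set."""
    pooled = defaultdict(set)
    for _genome_id, species, cds_list in genomes:
        # Exact-match deduplication stands in for real sequence clustering.
        pooled[species].update(cds_list)
    return {species: sorted(seqs) for species, seqs in pooled.items()}


pangenomes = build_pangenomes(genomes)
print(len(genomes), "genomes ->", len(pangenomes), "pangenomes")
```

Three isolate genomes collapse into two pangenomes here because two of the isolates belong to the same species; the shared gene appears only once in that species' pangenome.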

Unfortunately we don't currently have the raw genomes hosted for download since they aren't directly used by any of the bioBakery tools. I could provide a list of NCBI accession numbers if it would be useful?

Thanks,
Eric


heathe...@gmail.com

Oct 8, 2018, 1:42:54 AM
to HUMAnN Users
Hi Eric,

I was using one of the pangenomes, and some of the UniRef IDs are obsolete/redundant. Is there a way to map those to their current UniRef IDs?

Thanks,
Heather

Eric Franzosa

Oct 10, 2018, 1:12:50 PM
to humann...@googlegroups.com
Not directly (at least not that I know of). You can always strip off the "UniRef90_" prefix and look up the remaining accession code directly in UniProt. Sometimes these are also retired, but it should point you to a more recent entry for the same sequence.
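A minimal sketch of the prefix-stripping step, in case it helps. The function name is hypothetical and `P12345` is only an illustrative accession; the check for a `UPI` prefix reflects the fact that some UniRef clusters are named after UniParc records rather than UniProtKB entries, in which case the lookup would go to UniParc instead.

```python
def split_uniref_id(uniref_id):
    """Split a UniRef cluster ID into (cluster level, accession, source database)."""
    level, sep, accession = uniref_id.partition("_")
    if not sep or level not in ("UniRef100", "UniRef90", "UniRef50") or not accession:
        raise ValueError("not a UniRef cluster ID: %r" % uniref_id)
    # Accessions starting with "UPI" are UniParc identifiers.
    source = "UniParc" if accession.startswith("UPI") else "UniProtKB"
    return level, accession, source


for uniref_id in ["UniRef90_P12345", "UniRef50_UPI0000000001"]:
    level, accession, source = split_uniref_id(uniref_id)
    print(uniref_id, "->", accession, "(look up in", source + ")")
```

The accession that comes out is what you would paste into the UniProt search box.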

Thanks,
Eric

