About the ChocoPhlAn pangenome database

Fix Ace

May 24, 2016, 9:54:48 AM

to humann...@googlegroups.com

Hello, there,

I am a new user of HUMAnN. Just wondering where I can find the references for the ChocoPhlAn pangenome database?

Thanks.

A.

Eric Franzosa

May 27, 2016, 11:18:49 AM

to humann...@googlegroups.com

Greetings!

The ChocoPhlAn pangenomes were built by clustering coding sequences from NCBI genomes. If you look at the headers in an individual ChocoPhlAn file, you'll find that the original NCBI headers / accession numbers are still included. Perhaps these would be useful to you for getting back to the original genomes?

Example:

>gi|554770365|gb|ACIN03000016.1|:34743-35330|additional|fields|added|for|humann2
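A minimal Python sketch of pulling the NCBI accession back out of a header like the one above. It assumes the first five pipe-separated fields always follow the layout in this example (gi, GI number, database tag, accession, coordinates) and that a reverse-strand range might carry a leading "c"; the function name is hypothetical and not part of HUMAnN2.

```python
import re

def parse_chocophlan_header(header):
    """Split a ChocoPhlAn-style FASTA header into its NCBI-derived parts.

    Assumes the layout shown in the example header:
    >gi|<gi number>|<db>|<accession>|:<first>-<second>|<extra fields...>
    """
    fields = header.lstrip(">").split("|")
    # The coordinate field looks like ":34743-35330"; an optional leading
    # "c" (complement strand) is tolerated as an assumption.
    match = re.fullmatch(r":c?(\d+)-(\d+)", fields[4])
    if match is None:
        raise ValueError("unexpected coordinate field: " + fields[4])
    return {
        "gi": fields[1],
        "accession": fields[3],
        "first": int(match.group(1)),
        "second": int(match.group(2)),
        "extra": fields[5:],  # fields added for humann2
    }

example = ">gi|554770365|gb|ACIN03000016.1|:34743-35330|additional|fields|added|for|humann2"
parsed = parse_chocophlan_header(example)
print(parsed["accession"], parsed["first"], parsed["second"])
```

The accession (here ACIN03000016.1) is the piece to look up at NCBI to get back to the source genome record.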

Thanks,

Eric

capricy gao

Jun 2, 2016, 10:04:44 AM

to HUMAnN Users

Thanks for your post. The additional fields include UniRef90 and UniRef50 entries. Does that mean the pangenomes were derived by clustering with UniRef90 and UniRef50?

My understanding is that UniRef90 and UniRef50 are both peptide databases; however, ChocoPhlAn contains nucleotide sequences. How did the clustering work in that case?

Thanks a lot!

C.

Eric Franzosa

Jun 2, 2016, 10:16:48 AM

to humann...@googlegroups.com

We annotated ChocoPhlAn against UniRef50/90 by translating the coding sequences and performing a protein-level search. If a protein's best hit to a UniRef90 centroid had >90% identity, then we annotated it to that centroid (likewise for a >50% identity best hit and UniRef50).
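The thresholding rule described above can be sketched in Python as follows. This is only an illustration of the rule, not the pipeline that was actually run: the translation and protein search steps are assumed to have happened already, and the gene IDs, centroid IDs, and identity values are made up.

```python
def annotate_by_identity(best_hits, min_identity):
    """Keep a gene's best-hit centroid only if identity exceeds the cutoff.

    best_hits: dict of gene ID -> (centroid ID, percent identity of best hit).
    """
    return {
        gene: centroid
        for gene, (centroid, identity) in best_hits.items()
        if identity > min_identity
    }

# Hypothetical best hits from a protein-level search of translated CDSs.
uniref90_best_hits = {
    "gene_1": ("UniRef90_X1", 97.2),
    "gene_2": ("UniRef90_X2", 72.0),
}
uniref50_best_hits = {
    "gene_1": ("UniRef50_Y1", 97.2),
    "gene_2": ("UniRef50_Y2", 72.0),
}

uniref90_annotations = annotate_by_identity(uniref90_best_hits, 90.0)
uniref50_annotations = annotate_by_identity(uniref50_best_hits, 50.0)
print(uniref90_annotations)
print(uniref50_annotations)
```

In this toy input, gene_2 falls below the 90% cutoff, so it receives only a UniRef50 annotation.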

Thanks,

Eric

capricy gao

Jun 3, 2016, 10:36:56 AM

to HUMAnN Users

Does this mean that the ChocoPhlAn database contains all the bacterial CDSs, selected based on UniRef annotation?

Thanks.

C.

Eric Franzosa

Jun 3, 2016, 10:42:21 AM

to humann...@googlegroups.com

The CDSs are based on the original (depositor-supplied) genome annotations from NCBI. We then mapped these sequences to UniRef ourselves using the same thresholds UniProt uses for building UniRef. For a future release, we are considering redoing all of the gene calling ourselves so that it's more consistent across genomes.

Thanks,

Eric

hxz...@gmail.com

Jan 18, 2017, 4:55:33 PM

to HUMAnN Users

Sorry to bump an old thread.

There are 4187 pangenomes in the ChocoPhlAn database. Some of my colleagues say that 4187 is too few for them to find precise hits for their novel metagenomes. What should I say to convince them otherwise?

Husen

Eric Franzosa

Jan 18, 2017, 5:09:50 PM

to humann...@googlegroups.com

Hmm... Not sure I completely understand the question? Your colleagues are expecting more microbial species to be represented? 4K remains right around the number in the NCBI "representative genomes" collection:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

And of course, if a community member is _not_ represented in ChocoPhlAn, its reads can still be mapped by translated search to UniRef50/90, after which approximate taxonomy can be inferred with humann2_infer_taxonomy.

Thanks,

Eric

hxz116

Jan 18, 2017, 6:01:48 PM

to humann...@googlegroups.com

Thanks, Eric. This is very helpful.

Husen Zhang

Jan 19, 2017, 10:20:20 AM

to humann...@googlegroups.com

Hi Eric,
I guess I am confused about the number of bacterial genomes in
ChocoPhlAn, which is around 4k as you mentioned, and the 13k bacterial
genomes mentioned on the metaphlan2 bitbucket page:

https://bitbucket.org/biobakery/metaphlan2#markdown-header-description

I admit that I haven't read the manual... Where can I download the 13k
bacterial genomes?

Thanks,

Husen

Eric Franzosa

Jan 19, 2017, 11:11:30 AM

to humann...@googlegroups.com

Hi Husen,

To clarify, those same 13K genomes are all represented in ChocoPhlAn. The 13K genomes vs 4K pangenomes difference is due to species that have multiple sequenced isolate genomes. In those cases, the coding sequences from a given species across isolate genomes are clustered into a _single_ pangenome.
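As a toy Python illustration of why the genome count exceeds the pangenome count: grouping isolate genomes by species yields one pangenome per species. The genome names and species assignments below are invented, and the sketch only covers the grouping step; the real build also clusters the coding sequences within each group.

```python
from collections import defaultdict

def group_into_pangenomes(genome_to_species):
    """Group isolate genomes by species: one group per future pangenome."""
    pangenomes = defaultdict(list)
    for genome, species in genome_to_species.items():
        pangenomes[species].append(genome)
    return dict(pangenomes)

# Hypothetical isolates; names are illustrative only.
genome_to_species = {
    "isolate_1": "Escherichia coli",
    "isolate_2": "Escherichia coli",
    "isolate_3": "Escherichia coli",
    "isolate_4": "Bacteroides fragilis",
    "isolate_5": "Akkermansia muciniphila",
}

pangenomes = group_into_pangenomes(genome_to_species)
print(len(genome_to_species), "genomes ->", len(pangenomes), "pangenomes")
```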

Unfortunately we don't currently have the raw genomes hosted for download since they aren't directly used by any of the bioBakery tools. I could provide a list of NCBI accession numbers if it would be useful?

Thanks,

Eric

heathe...@gmail.com

Oct 8, 2018, 1:42:54 AM

to HUMAnN Users

Hi Eric,

I was using one of the pangenomes, and some of the UniRef IDs are obsolete/redundant. Is there a way to map those to their current UniRef IDs?

Thanks,
Heather

Eric Franzosa

Oct 10, 2018, 1:12:50 PM

to humann...@googlegroups.com

Not directly (at least not that I know of). You can always strip off the "UniRef90_" prefix and look up the remaining accession code directly in UniProt. Sometimes these accessions are also retired, but the retired entry should point you to a more recent entry for the same sequence.
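A small Python sketch of the prefix-stripping step, for anyone with a long list of IDs. The function name and the example ID are hypothetical; handling "UniRef50_" the same way is an assumption, and the UniProt lookup itself is left as a manual step.

```python
def strip_uniref_prefix(uniref_id):
    """Return the accession that remains after a UniRef90_/UniRef50_ prefix."""
    for prefix in ("UniRef90_", "UniRef50_"):
        if uniref_id.startswith(prefix):
            return uniref_id[len(prefix):]
    # No recognized prefix: hand the ID back unchanged.
    return uniref_id

# "UniRef90_P12345" is a made-up ID used only to show the string handling.
accession = strip_uniref_prefix("UniRef90_P12345")
print(accession)
```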