Constructing custom nucleotide database from Prodigal output.

Armin Anocic

Jul 24, 2019, 7:21:58 AM
to HUMAnN Users
Hi,

I'm analyzing metagenomic samples from a water treatment pipeline for functional annotation. The initial HUMAnN2 run maps 30-50% of the reads. I am going to do a second run on the reads left unmapped by the first run, mapping them against my own assemblies of the samples. I will use the predicted proteins (Prodigal) from my assemblies as a custom nucleotide reference database in HUMAnN2, and for this I'm annotating the proteins with DIAMOND against the UniRef databases from HUMAnN2. As you did with the ChocoPhlAn pangenomes, a predicted protein will be annotated with a UniRef only if the hit meets the criteria: >80% coverage of both hit and query, and 50% or 90% sequence identity.
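The filtering criteria above can be sketched in Python as follows. This is only a minimal sketch, not the procedure used for ChocoPhlAn: it assumes protein-vs-protein DIAMOND hits written in tabular format with exactly the columns listed in `COLUMNS` (a hypothetical layout that has to match the fields requested from DIAMOND), and the function names are made up for illustration.

```python
import csv

# Assumed column order of the DIAMOND tabular output (hypothetical layout;
# it must match the fields requested when running DIAMOND).
COLUMNS = ["qseqid", "sseqid", "pident", "qstart", "qend", "qlen",
           "sstart", "send", "slen", "bitscore"]


def passes(row, min_ident, min_cov=0.80):
    """True if a hit has >min_cov coverage of query and subject and >=min_ident % identity."""
    q_cov = (int(row["qend"]) - int(row["qstart"]) + 1) / int(row["qlen"])
    s_cov = (int(row["send"]) - int(row["sstart"]) + 1) / int(row["slen"])
    return float(row["pident"]) >= min_ident and q_cov > min_cov and s_cov > min_cov


def best_hits(lines, min_ident, min_cov=0.80):
    """Map each query to its highest-bitscore hit among those passing the filter.

    `lines` is any iterable of tab-delimited text lines (e.g. an open file).
    """
    best = {}
    for row in csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t"):
        if not passes(row, min_ident, min_cov):
            continue
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (row["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}
```

Calling `best_hits(open_file, 90.0)` would apply the UniRef90-style threshold and `best_hits(open_file, 50.0)` the UniRef50-style one. Note that this keeps only the single best-scoring hit per query, which is exactly the choice questioned further down.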

I have several questions regarding this annotation. 
1. The UniRef databases (and all others) within HUMAnN2 are outdated (2016). Is there an update coming up?
2. Did you allow gaps in the DIAMOND BLAST alignments for ChocoPhlAn? To my knowledge the UniRef clustering algorithm does not, and for some queries the top hits differ by less than 0.1% in bitscore.
3. When multiple significant UniRef hits come up, most of the time they share the same UniRef function. But do they contribute to the same pathways? 

My dilemma: Suppose I choose the UniRef with the highest bitscore, while another hit has only a slightly lower bitscore. Both share the same function but cover different species. The best hit is not included in a pathway and the second hit is. I would then miss valuable information in the calculation of pathway abundance by blindly choosing the best hit. Is this scenario possible?

My apologies for writing a thesis instead of asking a question. I would greatly appreciate any help or input.
Thanks in advance! 

Kind regards,
Armin

Eric Franzosa

Jul 29, 2019, 4:40:11 PM
to humann...@googlegroups.com
Hi Armin,

To answer your questions:

1) We are currently testing updated HUMAnN2 databases (all of them!) that match the newly released MetaPhlAn 2.9. I will make an announcement in this group when they are ready for public consumption.

2) There are definitely approximations in our annotations vs. how UniProt defines the UniRef clusters. For example, UniProt seeds UniRef clusters with their longest members, but the sequences they include as representatives are instead the best-annotated members (and these are what we end up using for pangenome annotation since they can be downloaded en masse). The upcoming databases bypass these steps by pulling pangenome annotations directly from UniProt. That said, I've compared the two approaches and they tend to agree very well.

In case it's useful as inspiration, we use this script for (pan)genome annotation in our group, but it's not an officially supported part of HUMAnN2:


3) The scenario you propose is definitely possible, though in most cases if a protein could map very well to two different UniRef90s I suspect it would inherit similar, less-specific annotations (e.g. EC memberships) from either. While this is not something we did when annotating the ChocoPhlAn pangenomes, you could in principle use the HUMAnN2 mapping from ECs to UniRef90s to favor a UniRef90 assignment with an EC annotation over one without an EC annotation if both were highly homologous to your query. I'd be curious to know how often this extra logic assigns an otherwise-missed EC to a protein.
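The extra logic described in 3) could look roughly like the sketch below. It is not part of HUMAnN2 and was not used for the ChocoPhlAn pangenomes. The function names, the 1% bitscore tolerance, and the assumed layout of the EC mapping (tab-delimited, EC number in the first column, UniRef90 IDs in the remaining columns) are all assumptions made for illustration.

```python
def load_ec_annotated(lines):
    """Collect all UniRef90 IDs that appear under any EC.

    Assumes tab-delimited lines: EC number first, UniRef90 IDs after it
    (an assumed layout; check it against the actual mapping file).
    """
    annotated = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        annotated.update(f for f in fields[1:] if f)
    return annotated


def choose_hit(hits, ec_annotated, rel_tol=0.01):
    """Pick a UniRef90 for one query, preferring EC-annotated near-ties.

    hits: list of (uniref_id, bitscore) tuples for a single query.
    rel_tol: hits within this fraction of the top bitscore count as
             "highly homologous" alternatives (hypothetical threshold).
    """
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    top_score = ranked[0][1]
    for uniref_id, score in ranked:
        if score < top_score * (1 - rel_tol):
            break  # remaining hits are too far below the best one
        if uniref_id in ec_annotated:
            return uniref_id
    return ranked[0][0]  # no near-tied hit has an EC: keep the best hit
```

Counting how often `choose_hit` returns something other than the top-scoring hit would answer the question of how often this logic assigns an otherwise-missed EC.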

Thanks,
Eric



--
You received this message because you are subscribed to the Google Groups "HUMAnN Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to humann-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/humann-users/9411aff3-5336-47f0-8b74-b956a851f99c%40googlegroups.com.

Keith

Oct 2, 2019, 4:16:16 PM
to HUMAnN Users
Great to hear there is an updated database coming! Do you have an approximate timeline for when to expect it?

I would also like to hear more details on constructing custom databases. I know there is a script, humann2/tools/build_custom_database.py, but that doesn't seem to be the whole story. There are a lot of files to update, from the contents of the database directory (genomes in the chocophlan directory, UniRef90 IDs in the uniref directory) to the tables in the data/pathways/ directory included with the Python module, which map UniRef90 IDs to MetaCyc reactions and MetaCyc reactions to MetaCyc pathways.

Thanks,
Keith

Eric Franzosa

Oct 8, 2019, 3:07:25 PM
to humann...@googlegroups.com
I recommend checking out the sections starting with "Custom..." in the HUMAnN2 manual:


There are indeed a lot of files to update for a complete HUMAnN2 ecosystem. Part of what we're doing right now is developing a system to export as much as possible from a clone of the UniProt database without as much post-hoc ID mapping and annotation needed. Once we are confident in that approach (i.e. we know it's as accurate as the original, less automated approach) we will 1) make the new databases available and 2) be able to update them more regularly going forward.

Thanks,
Eric


