I'm analyzing metagenomic samples from a water treatment pipeline for functional annotation. The initial HUMAnN2 run maps 30-50% of the reads. I am going to do a second run on the reads left unmapped by the first run, mapping them against my own assemblies of the samples. I will use the predicted proteins (Prodigal) from my assemblies as a custom nucleotide reference database in HUMAnN2, and to that end I'm annotating the proteins using DIAMOND against the UniRef databases from HUMAnN2. As you did with the ChocoPhlAn pangenomes, a predicted protein will be annotated with a UniRef ID only if the hit meets the criteria: >80% coverage of both hit and query, and 50% or 90% sequence identity (for UniRef50 or UniRef90, respectively).
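To make the criteria above concrete, a minimal sketch of such a filter could look like the following. It assumes DIAMOND tabular output written with `--outfmt 6 qseqid sseqid pident qstart qend qlen sstart send slen bitscore`; the column order, the helper names (`Hit`, `parse_hit`, `passes_criteria`) and the example IDs are my own assumptions for illustration, not anything taken from HUMAnN2. Whether the thresholds should be strict or inclusive is also an assumption here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One DIAMOND hit in the assumed column order (hypothetical layout)."""
    query: str
    subject: str
    pident: float
    qstart: int
    qend: int
    qlen: int
    sstart: int
    send: int
    slen: int
    bitscore: float


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated line of the assumed DIAMOND output."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]),
               int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]),
               float(f[9]))


def passes_criteria(hit: Hit, min_identity: float,
                    min_coverage: float = 0.80) -> bool:
    """Require >80% coverage of both query and subject, plus an identity floor.

    min_identity would be 50.0 for UniRef50 and 90.0 for UniRef90.
    Assumption: coverage is strict (>), identity is inclusive (>=).
    """
    query_cov = (hit.qend - hit.qstart + 1) / hit.qlen
    subject_cov = (hit.send - hit.sstart + 1) / hit.slen
    return (hit.pident >= min_identity
            and query_cov > min_coverage
            and subject_cov > min_coverage)


# Toy lines with made-up IDs and values, only to exercise the filter.
example_lines = [
    "gene_1\tUniRef90_X\t93.5\t1\t290\t300\t5\t295\t310\t540.0",
    "gene_2\tUniRef90_Y\t95.0\t1\t150\t300\t1\t150\t160\t280.0",
    "gene_3\tUniRef90_Z\t62.0\t1\t300\t300\t1\t300\t300\t310.0",
]
kept = [h for h in map(parse_hit, example_lines) if passes_criteria(h, 90.0)]
```

In this toy input only `gene_1` survives the UniRef90 filter: `gene_2` covers half of the query, and `gene_3` falls below 90% identity (it would pass with `min_identity=50.0`).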
I have several questions regarding this annotation.
1. The UniRef databases (and all others) within HUMAnN2 are outdated (2016). Is there an update coming up?
2. Did you allow gaps in the DIAMOND blast alignment of ChocoPhlAn? I ask because, to my knowledge, the UniRef clustering algorithm does not, and for some queries the hits differ by less than 0.1% in bitscore.
3. When multiple significant UniRef hits come up, most of the time they share the same UniRef function. But do they contribute to the same pathways?
My dilemma: suppose I choose the UniRef with the highest bitscore, and a second hit has only a slightly lower bitscore. Both share the same function but cover different species. The best hit is not included in a pathway, whereas the second hit is. If I blindly choose the best hit, I would then miss valuable information in the calculation of pathway abundance. Is this scenario possible?
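The scenario above can be sketched in a few lines. This is only an illustration under my own assumptions: the 0.1% tolerance, the function names and the set of pathway-linked UniRef IDs are hypothetical, and this is not a description of how HUMAnN2 resolves ties.

```python
def near_best_hits(hits, tolerance=0.001):
    """Return all (subject, bitscore) pairs within `tolerance` of the top bitscore.

    `hits` is a list of (subject_id, bitscore) tuples for ONE query.
    tolerance=0.001 corresponds to a 0.1% relative difference (assumed value).
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    return [(subj, score) for subj, score in hits
            if score >= best * (1 - tolerance)]


def pathway_linked_candidates(hits, pathway_ids, tolerance=0.001):
    """Among the near-best hits, keep those present in a pathway mapping.

    `pathway_ids` is a hypothetical set of UniRef IDs known to map to a pathway.
    """
    return [subj for subj, _ in near_best_hits(hits, tolerance)
            if subj in pathway_ids]


# Made-up example: the top hit is not pathway-linked, the runner-up is.
hits_for_gene = [
    ("UniRef90_A", 500.0),
    ("UniRef90_B", 499.8),
    ("UniRef90_C", 450.0),
]
pathway_ids = {"UniRef90_B"}

best_only = max(hits_for_gene, key=lambda h: h[1])[0]
candidates = pathway_linked_candidates(hits_for_gene, pathway_ids)
```

Here `best_only` is `UniRef90_A`, which carries no pathway link in the toy mapping, while `candidates` still contains `UniRef90_B`. That is exactly the information a strict best-hit choice would discard.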
My apologies for writing a thesis instead of asking a question. I would greatly appreciate any help or input.
Thanks in advance!