Translated Query Coverage Threshold

Aditya Bandla

unread,
Feb 13, 2017, 11:51:14 PM2/13/17
to HUMAnN Users
Hi,

I am currently processing soil metagenomes (n = 18) using HUMAnN2 and was wondering how the translated search coverage is calculated.

Previously, I had run a standalone comprehensive search against NCBI-nr using DIAMOND. With HUMAnN2, I end up with a similar number of alignments. However, only 49,507 of 393,295 total proteins pass the default coverage threshold (50.0), i.e. only 12.58% pass the filter.

I am concerned by this drastic drop in the number of proteins after the filter. I could lower the threshold value, but I am not sure about the cons of doing so.

In our comprehensive search pipeline using DIAMOND, we usually limit our filters to the e-value (<0.001) and bit score (>50) and parse the resulting tables directly for downstream statistics.

Advice on modifying this filter (the translated coverage threshold) would be much appreciated.

Best,
Aditya

Aditya Bandla

unread,
Feb 14, 2017, 7:31:18 AM2/14/17
to HUMAnN Users
Hi,

Update: I just consulted the developer of DIAMOND. The default behaviour is to filter alignments based only on the e-value (e < 0.001) and to avoid filtering on identity threshold, query coverage, and subject coverage (all of which are set in the HUMAnN2 workflow).

a) Following the DIAMOND workflow, I want to first filter my alignments by e-value (which can be modified in HUMAnN2) and then by bit score (>50). In DIAMOND, it's not possible to supply both thresholds simultaneously, so bit-score filtering is usually done after the alignment step. Can this be done in HUMAnN2?
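For reference, this is a sketch of the kind of post-alignment bit-score filter I mean, applied over DIAMOND's default 12-column tabular (--outfmt 6) output (toy rows shown):

```python
# Default DIAMOND/BLAST 12-column tabular output (--outfmt 6);
# e-value is column 11, bit score column 12 (1-based).
EVALUE_COL, BITSCORE_COL = 10, 11

def filter_hits(rows, max_evalue=1e-3, min_bitscore=50.0):
    """Keep alignments passing both an e-value and a bit-score cutoff."""
    kept = []
    for row in rows:
        if float(row[EVALUE_COL]) < max_evalue and float(row[BITSCORE_COL]) > min_bitscore:
            kept.append(row)
    return kept

# Toy rows: (query, subject, pident, length, mismatch, gapopen,
#            qstart, qend, sstart, send, evalue, bitscore)
rows = [
    ["r1", "p1", "92.1", "33", "2", "0", "1", "99", "5", "37", "1e-10", "72.4"],
    ["r2", "p2", "55.0", "30", "13", "0", "1", "90", "1", "30", "0.01", "38.1"],
]
print(len(filter_hits(rows)))  # only the first row passes both cutoffs
```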

b) Are the UniRef90-annotated fasta files available for download instead of the .dmnd files? The .dmnd files available for download aren't optimal given the hardware available on my side (based on benchmarking I have done).

c) I have access to the KEGG repository; are there any scripts available to make it compatible with the HUMAnN2 pipeline?

Best,
Aditya

Eric Franzosa

unread,
Feb 14, 2017, 1:26:03 PM2/14/17
to humann...@googlegroups.com
Hi Aditya,

In reply to your first email, the coverage threshold is the fraction of sites in a protein that must be hit by a read to consider that protein "detected." Without such a filter, it is very easy for proteins with local homology to a protein from your sample to recruit reads spuriously, resulting in false positives. The coverage filter sacrifices a bit of sensitivity for very low-abundance proteins (present but not passing the coverage threshold) in exchange for a dramatic increase in specificity.
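To make that concrete, here is a toy sketch (illustrative only, not HUMAnN2's actual implementation) of how a subject-coverage filter behaves:

```python
def subject_coverage(intervals, protein_length):
    """Percent of a protein's positions hit by at least one read.
    intervals: list of (sstart, send) pairs, 1-based inclusive."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length

# A 100-aa protein hit only over residues 10-40 has 31% coverage,
# failing the default 50.0% threshold: likely a spurious local hit.
hits = [(10, 40)]
print(subject_coverage(hits, 100))          # 31.0
print(subject_coverage(hits, 100) >= 50.0)  # False
```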

Note that E-value and bit score are tightly related, so I don't think it will be critical to threshold on both of them (see: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html). In addition, we found in testing that the DIAMOND default E-value of 0.001 was too conservative when searching short reads (~100 nts) for remote homology (50% amino acid identity), hence the HUMAnN2 default being different.
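For intuition, the Karlin-Altschul statistics behind that link tie the two quantities together: given an effective search space, a bit-score cutoff maps directly onto an E-value cutoff. A small illustration (the search-space numbers here are hypothetical):

```python
def evalue_from_bitscore(bitscore, search_space):
    """Karlin-Altschul relation: E = m*n * 2^(-S'), where S' is the
    bit score and m*n the effective search space (query length x
    database length)."""
    return search_space * 2.0 ** (-bitscore)

# Hypothetical numbers: a 100-residue query vs. a 2e10-residue database.
space = 100 * 2e10
# A bit score of 50 then corresponds to E ~ 1.8e-3 in this search space,
# which is why thresholding on both E < 0.001 and S > 50 is largely
# redundant.
print(evalue_from_bitscore(50, space))
```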

We don't currently have the UniRef fasta files hosted, but I can look into doing that. Can you clarify what aspect of the currently provided databases is not optimal for your system?

Unfortunately we don't have a system for quickly plugging HUMAnN2 into modern KEGG (our KEGG support is currently limited to legacy support for HUMAnN1 and UniRef-KO associations, which are derived en masse from UniProt). I can refer you to the HUMAnN2 manual sections for creating custom databases, which begin here:


Hope this helps!

Thanks,
Eric


Eric Franzosa

unread,
Feb 15, 2017, 10:59:07 AM2/15/17
to humann...@googlegroups.com
Hi Aditya,

My best guess is that this has less to do with the translated search coverage threshold and more to do with the % identity threshold: 90%+ for valid alignments to UniRef90. That number may be too stringent for a soil community. You can manually tune the threshold with "--identity-threshold <#>". It will be automatically set to 50% if you use UniRef50 as a protein family system rather than UniRef90. Note that, if you saved your temp outputs, you can 1) delete the 3 main HUMAnN2 outputs and 2) run with --resume (changing settings like "--identity-threshold") and HUMAnN2 will recompute gene and pathway abundance without having to remap your reads.
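Sketching that delete-and-resume workflow (file and sample names here are illustrative, assuming the usual <sample>_genefamilies / _pathabundance / _pathcoverage output naming):

```python
import os

def resume_command(sample, outdir, identity_threshold):
    """Delete the three main HUMAnN2 outputs, then build the command to
    re-run with --resume so the existing read mappings are reused."""
    for suffix in ("genefamilies", "pathabundance", "pathcoverage"):
        path = os.path.join(outdir, f"{sample}_{suffix}.tsv")
        if os.path.exists(path):
            os.remove(path)
    return ["humann2", "--input", f"{sample}.fastq", "--output", outdir,
            "--resume", "--identity-threshold", str(identity_threshold)]

cmd = resume_command("C1F1A1", "out", 50.0)
print(" ".join(cmd))
```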

To help calibrate this, I'd recommend examining the typical % identity numbers you got from your earlier NR DIAMOND runs (since UniRef90 is basically a clustered version of NR). If the best hits are typically below 90% identity, this would support the hypothesis that the HUMAnN2 default is too stringent for your runs.
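One way to pull those numbers out of a tabular DIAMOND output (a sketch; toy alignment lines shown, columns in the default 12-column order):

```python
def best_hit_identity(lines):
    """Best (highest bit score) hit per query from 12-column tabular
    alignments; returns query -> percent identity of that best hit."""
    best = {}  # query -> (bitscore, pident)
    for line in lines:
        f = line.split("\t")
        query, pident, bitscore = f[0], float(f[2]), float(f[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, pident)
    return {q: pident for q, (score, pident) in best.items()}

lines = [
    "r1\tp1\t85.0\t30\t4\t0\t1\t90\t1\t30\t1e-9\t70.1",
    "r1\tp2\t95.0\t30\t1\t0\t1\t90\t5\t34\t1e-12\t82.3",
]
idents = best_hit_identity(lines)
# Fraction of reads whose best hit falls below the UniRef90 cutoff:
below_90 = sum(p < 90.0 for p in idents.values()) / len(idents)
print(idents["r1"], below_90)
```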

Thanks,
Eric



On Tue, Feb 14, 2017 at 3:47 PM, Aditya Bandla <aditya...@gmail.com> wrote:
Hi Eric

Thank you for the detailed reply.

Based on my first run using UniRef90, 99% of my reads remain unaligned after running the full HUMAnN2 pipeline. Upon inspecting the log file (attached for your reference), I could see that a fairly large number of alignments hadn't passed the default thresholds set for the translated alignment step. Thus, I am a bit confused as to which parameters need to be relaxed to increase the number of reads that align.

For the DIAMOND step, using 22 cores, I was able to optimize the block size and index chunks to max out the memory available (while aligning against NR). With the currently available .dmnd files, I guess the block size is too small and hence it's taking more time than expected. So, if the fasta files are available, I can rebuild the databases and run with a larger block size.
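For concreteness, the rebuild-and-rerun I have in mind looks roughly like this (a sketch; file names are hypothetical, and -b/--block-size and -c/--index-chunks are DIAMOND options governing memory use):

```python
def diamond_commands(fasta, db_name, reads, block_size=8.0,
                     index_chunks=1, threads=22):
    """Build the commands to create a DIAMOND database from a fasta file
    and run blastx with a larger block size (-b) and fewer index
    chunks (-c), trading more RAM for speed."""
    makedb = ["diamond", "makedb", "--in", fasta, "-d", db_name]
    blastx = ["diamond", "blastx", "-d", db_name, "-q", reads,
              "-b", str(block_size), "-c", str(index_chunks),
              "-p", str(threads), "--outfmt", "6"]
    return makedb, blastx

makedb, blastx = diamond_commands("uniref90.fasta", "uniref90",
                                  "sample.fastq")
print(" ".join(blastx))
```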

Thanks again

Best,
Aditya

Eric Franzosa

unread,
Feb 16, 2017, 3:17:05 PM2/16/17
to humann...@googlegroups.com
Hi Aditya,

Can you forward the HUMAnN2 command you're using to run with relaxed alignment criteria? I'm curious to see what's being turned off in this process.

I too am surprised that you're only seeing ~1/3 alignment against UniRef vs. ~100% to NR. Aligning less stringently to a clustered database ought to map as many (and likely more) reads as stringent alignment to an unclustered database. For that reason I would not expect mapping to UniRef100 to improve your mapping relative to UniRef90/50.

Also, were you using DIAMOND's sensitive mode in your NR search? You can invoke this in HUMAnN2 with "--search-mode uniref50" (this also sets the % identity threshold to 50%, as both are recommended for searches vs. UniRef50).

Thanks,
Eric



On Thu, Feb 16, 2017 at 10:44 AM, Aditya Bandla <aditya...@gmail.com> wrote:
Hi Eric

Thanks for the suggestions! I have been playing around a bit with the parameters and the different UniRef databases. Even with all alignment parameters set to zero except the e-value threshold, at most 34% of my reads seem to align. This is quite low compared with the almost 95% of reads that aligned against NR using DIAMOND previously, with the same parameters.

Do you think aligning against UniRef100 might help in my case instead? If so, are there any scripts to format the UniRef100 database with the gene length annotations?

At the moment, I could proceed with my results based on alignments against NR, but this would not be quantitative due to the absence of gene-length normalisation; hence I am looking to make my data more quantitative.

Best,
Aditya

Aditya Bandla

unread,
Feb 16, 2017, 4:53:42 PM2/16/17
to HUMAnN Users
Hi Eric

Following is the command

humann2 --input C1F1A1_CGGCTATG-TATAGCCT_R1-R2.fastq --output $HOME/C1F1A1/ --evalue 0.001 --translated-query-coverage-threshold 0 --translated-subject-coverage-threshold 0 --identity-threshold 0 --memory-use maximum --threads 22


I found that the identity threshold and the subject coverage threshold were getting rid of most alignments. Since my read lengths span 100-250 bp, I am not sure whether it is still prudent to set the coverage thresholds.


I was using the default 'fast' mode for DIAMOND. I have now downloaded the latest release of UniRef90 and am giving it a try with HUMAnN2; I will update here on how that goes.


Best,
Aditya

Aditya Bandla

unread,
Feb 17, 2017, 3:01:31 PM2/17/17
to HUMAnN Users
Hi Eric

Update: After running the file against the latest UniRef90 database, close to 99% of reads aligned. However, close to 67% were thrown away due to "small query coverage", in spite of setting the translated query coverage threshold to 0.0.

I find this quite strange; I was expecting no filtering to be done given this value. Am I missing something?

Best,
Aditya

Eric Franzosa

unread,
Feb 17, 2017, 3:28:01 PM2/17/17
to humann...@googlegroups.com, Lauren McIver
Hi Aditya,

Indeed, that does sound strange! Can you send me and Lauren (cc'ed) the log files for 1) this run and 2) your previous run with the relaxed alignment criteria? Feel free to reply to us individually if there's anything in the log files you don't want shared with the public group.

Thanks,
Eric

