I am currently processing a soil metagenome (n = 18) using HUMAnN2. I was wondering how the translated search coverage is calculated?
Previously, I had run a standalone comprehensive search against NCBI-nr using DIAMOND. With HUMAnN2, I end up with a similar number of alignments; however, only 49,507 of 393,295 total proteins (12.58%) pass the default coverage threshold (50.0).
I am concerned by this drastic drop in the number of proteins after the filter. I could lower the threshold, but I am not sure about the downsides of doing so.
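For context on what a query-coverage filter does, here is a minimal sketch of how per-alignment coverage could be computed from DIAMOND tabular output. This is illustrative only, not HUMAnN2's internal code, and it assumes blast6 columns extended with `qlen` (the toy data and 50% cutoff are assumptions):

```shell
# Sketch: per-alignment query coverage from DIAMOND tabular output.
# Assumes blast6 columns extended with query length as column 13:
#   qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen
# Two toy alignments for illustration (not real data):
printf 'read1\tprot1\t92.0\t50\t4\t0\t1\t150\t1\t50\t1e-20\t110\t150\n' >  toy.tsv
printf 'read2\tprot2\t88.0\t20\t2\t0\t1\t60\t1\t20\t1e-05\t55\t300\n'   >> toy.tsv

# Query coverage = 100 * aligned query span / query length;
# count how many alignments reach a 50% threshold.
awk -F'\t' '{
    cov = 100 * ($8 - $7 + 1) / $13
    if (cov >= 50.0) pass++
    total++
} END { printf "%d of %d alignments pass 50%% query coverage\n", pass, total }' toy.tsv
```

Lowering the threshold admits shorter, more fragmentary matches (more sensitivity, more spurious hits); raising it keeps only near-full-length alignments.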
In our comprehensive search pipeline using DIAMOND, we usually limit our filters to the e-value (< 0.001) and bit score (> 50) and parse the resulting tables directly for downstream statistics.
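For comparison, the e-value/bit-score pre-filter described above can be expressed as a one-line awk pass over standard blast6 output (e-value in column 11, bit score in column 12); the toy hits below are assumptions for illustration:

```shell
# Sketch: keep only hits with e-value < 0.001 and bit score > 50,
# applied to standard blast6 output. Toy data, not real alignments.
printf 'read1\tprot1\t92.0\t50\t4\t0\t1\t150\t1\t50\t1e-20\t110\n' >  hits.tsv
printf 'read2\tprot2\t88.0\t20\t2\t0\t1\t60\t1\t20\t0.05\t45\n'    >> hits.tsv

awk -F'\t' '$11 < 0.001 && $12 > 50' hits.tsv > hits.filtered.tsv
awk 'END { print NR }' hits.filtered.tsv   # → 1 surviving hit
```

Note that this filter is per-alignment and ignores coverage entirely, which is why it retains far more hits than a coverage-thresholded search.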
Advice on modifying this filter (the translated coverage threshold) would be much appreciated.
Best,
Aditya
Hi Eric,

Thank you for the detailed reply. Based on my first run, using UniRef90, 99% of my reads remain unaligned after running the full HUMAnN2 pipeline. Upon inspecting the log file (attached for your reference), I could see that a fairly large number of alignments had not passed the default thresholds set for the translated alignment step. Thus, I am a bit confused as to which parameters need to be toned down to increase the number of reads that align.

For the DIAMOND step, using 22 cores, I was able to optimize the block size and index chunks to max out the available memory (while aligning against NR). With the currently available .dmnd files, I guess the block size is too small and hence it is taking more time than expected. If the FASTA files are available, I can reformat the .dmnd files with a new block size.

Thanks again,

Best,
Aditya
Hi Eric,

Thanks for the suggestions! I have been playing around with the parameters and the different UniRef databases. Even with all alignment parameters set to zero except the e-value threshold, at most only 34% of my reads align. This is quite low compared with the almost 95% of reads that aligned against NR using DIAMOND previously, with the same parameters.

Do you think aligning against UniRef100 might help in my case instead? If so, are there any scripts to format the UniRef100 database with the gene length annotations?

At the moment I could proceed with my results based on alignments against NR, but these would not be quantitative due to the absence of gene length normalisation, and hence I am looking to make my data more quantitative.

Best,
Aditya
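To illustrate why raw NR hit counts are not quantitative without gene lengths, here is a minimal reads-per-kilobase (RPK) sketch, the style of length normalisation HUMAnN2 applies. The gene names, counts, and lengths below are toy values:

```shell
# Sketch: reads-per-kilobase normalisation. Two toy genes with equal raw
# counts but very different lengths get very different RPK values.
# Columns: gene, mapped reads, gene length (bp). Toy data only.
printf 'geneA\t120\t1500\n' >  counts.tsv
printf 'geneB\t120\t300\n'  >> counts.tsv

# RPK = reads / (gene length in kb)
awk -F'\t' '{ printf "%s\t%.1f\n", $1, $2 / ($3 / 1000) }' counts.tsv
# → geneA  80.0
# → geneB  400.0
```

Without the length column, geneA and geneB would look equally abundant, which is exactly the bias length normalisation corrects.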
humann2 --input C1F1A1_CGGCTATG-TATAGCCT_R1-R2.fastq --output $HOME/C1F1A1/ --evalue 0.001 --translated-query-coverage-threshold 0 --translated-subject-coverage-threshold 0 --identity-threshold 0 --memory-use maximum --threads 22
I found that the identity threshold and the subject-coverage threshold were removing most of the alignments. Since my read lengths span 100-250 bp, I am not sure whether it is prudent to apply the coverage thresholds at all.
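A quick back-of-envelope calculation shows why short reads struggle with a subject-coverage cutoff; the 150 bp read and 300 aa protein below are assumed illustrative values:

```shell
# Back-of-envelope: a 150 bp read translates to at most 150/3 = 50 aa.
# Against a 300 aa protein, subject coverage is at most
# 100 * 50 / 300, far below a 50% default threshold.
# (Illustrative read and protein lengths, not from my data.)
awk 'BEGIN { printf "%.1f%%\n", 100 * (150 / 3) / 300 }'
# → 16.7%
```

So with 100-250 bp reads, a per-read subject-coverage filter will discard most alignments to average-length proteins regardless of alignment quality.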
I was using the default 'fast' mode for DIAMOND. I have now downloaded the latest release of UniRef90 and am giving it a try with HUMAnN2; I will update here on how it goes.
Best,
Aditya