Unaligned reads rate for nucleotide vs translated search?


Nick

Oct 3, 2019, 10:11:24 PM
to HUMAnN Users
For my analysis, I ran both nucleotide and translated protein search. 


Unaligned reads after nucleotide alignment: 35.8506 %

...

Unaligned reads after translated alignment: 31.1693 %


Is the 31.1693 % relative to the reads left unaligned after the nucleotide step (i.e., 31.1693 % of the 35.8506 %), or is it 31.1693 % of total reads?

If it's the latter, it means that I only gained ~4% new alignments for a ton of compute time. Is this to be expected? 

Eric Franzosa

Oct 4, 2019, 11:11:38 AM
to humann...@googlegroups.com
Both percentages are relative to the initial total number of reads. The default translated search against UniRef90 is still pretty stringent to avoid false positives. To try to map more reads, you can relax the translated search parameters yourself, or have HUMAnN2 do that automatically by using UniRef50 instead of UniRef90.

Thanks,
Eric
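
For reference, the arithmetic behind the two log lines can be sketched as follows. This is only an illustration using the percentages quoted above; the function name `translated_gain` is made up for this example and is not part of HUMAnN2.

```python
def translated_gain(unaligned_after_nuc_pct, unaligned_after_trans_pct):
    """Return (gain as % of total reads, gain as % of reads left after the nucleotide step).

    Both inputs are percentages of the initial total number of reads,
    as reported in the HUMAnN2 log.
    """
    # Extra reads aligned by the translated step, in percentage points of all reads.
    gained_of_total = unaligned_after_nuc_pct - unaligned_after_trans_pct
    # The same gain, expressed relative to what the nucleotide step left unaligned.
    gained_of_leftover = 100.0 * gained_of_total / unaligned_after_nuc_pct
    return gained_of_total, gained_of_leftover


of_total, of_leftover = translated_gain(35.8506, 31.1693)
print(f"Translated search aligned an extra {of_total:.2f}% of all reads")
print(f"That is {of_leftover:.1f}% of the reads the nucleotide step left unaligned")
```

So the translated step picked up roughly 4.7% of all reads, which is about 13% of the reads that were still unaligned going into it.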



--
You received this message because you are subscribed to the Google Groups "HUMAnN Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to humann-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/humann-users/ab09535e-169d-42b8-a88e-2c3dc81bdb86%40googlegroups.com.

Nick

Oct 5, 2019, 4:55:46 AM
to HUMAnN Users
Hi Eric,

Thanks, that makes sense. I also ran a nucleotide-only search to compare against the nucleotide+translated search and found that the major gene families and pathways are pretty much the same. This is to be expected since, as per the wiki, most of the gene families in my gut sample will have been found in the nucleotide step. So I wonder if it's even worth doing a translated search, since all I'm gaining is ~4% more aligned reads, which (probably) are not that important anyway!



Nick

Oct 5, 2019, 5:28:44 AM
to HUMAnN Users
Another question: in your paper you mentioned that you tested the accuracy of humann2's profiling of gene families and pathways, using a synthetic gut community as the positive control. How did you calculate the ground truth for the gene family and pathway abundances, and how would I produce a similar truth set from known references? The idea would be to sequence a positive control like the ZymoBIOMICS mock community and then run those reads through humann2. But I'd need to know the ground truth of the mock community before I can assess how accurate humann2 is.

Eric Franzosa

Oct 8, 2019, 3:17:38 PM
to humann...@googlegroups.com
The ground truth was based on the gene content of the genomes in the community, weighted by their abundance. So, for example, in a species A + species B community where each species has a homolog of gene family X, and A is twice as abundant as B, the expected output would look something like...

X    3
X|A  2
X|B  1

Here abundance is in units of coverage or RPK (reads per kilobase) to compensate for differences in genome/gene length. Notably, this procedure does not account for non-uniform read sampling/sequencing along the length of a genome, which (in our experiments) explained ~0.1 units of Bray-Curtis distance between the expected and observed gene abundance profiles. Adding this read-level resolution to the gold standard is a lot more complicated, since it requires tracing individual reads back to their genes of origin.
 
Thanks,
Eric
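
A rough sketch of how such a truth table could be built from known references, and then compared to an observed profile, might look like the following. It is an illustration only, not the code used in the paper: the function names, the input dictionaries, and the choice to multiply by per-genome copy number are all assumptions made for this example.

```python
from collections import defaultdict


def expected_gene_table(gene_content, species_abundance):
    """Build expected gene family abundances for a mock community.

    gene_content:      species -> {gene family: copies in that genome} (assumed input)
    species_abundance: species -> relative abundance, e.g. genome coverage (assumed input)
    Returns (community totals per family, per-species stratified values).
    """
    totals = defaultdict(float)
    stratified = {}
    for species, families in gene_content.items():
        for family, copies in families.items():
            # Each genome contributes its gene copies scaled by how abundant it is.
            value = copies * species_abundance[species]
            stratified[(family, species)] = value
            totals[family] += value
    return dict(totals), stratified


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two feature -> abundance dictionaries."""
    keys = set(expected) | set(observed)
    num = sum(abs(expected.get(k, 0.0) - observed.get(k, 0.0)) for k in keys)
    den = sum(expected.get(k, 0.0) + observed.get(k, 0.0) for k in keys)
    return num / den if den else 0.0


# The two-species example from the message: A is twice as abundant as B,
# and each genome carries one homolog of gene family X.
totals, stratified = expected_gene_table(
    {"A": {"X": 1}, "B": {"X": 1}},
    {"A": 2.0, "B": 1.0},
)
for family, value in totals.items():
    print(f"{family}\t{value:g}")
for (family, species), value in stratified.items():
    print(f"{family}|{species}\t{value:g}")
```

In practice both profiles would need to be normalized to the same scale (for example, relative abundance) before computing the Bray-Curtis distance, and the gene family assignments for each reference genome would have to come from annotating those genomes against the same database used in the HUMAnN2 run.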


