High frequency of UNMAPPED or UNINTEGRATED READS

1,282 views
Skip to first unread message

Matthe...@yahoo.com

unread,
Mar 24, 2017, 12:51:36 PM3/24/17
to HUMAnN Users
Hi,

I sequenced fecal samples from children in the SE United States, quality filtered them, and ran them through HUMAnN2. No matter how I alter the default options, the total of the unmapped and unintegrated reads is at least 90%, on both the genefamilies and pathabundance tables. This was with both UniRef50 and UniRef90 on the default options. Subsequently using UniRef50, I lowered the translated_subject_coverage, translated_query_coverage, and percent_identity to 30 or 40, and have also increased evalue to 10, with minimal effect on the number of UNMAPPED / UNINTEGRATED reads. I would have expected that for human fecal habitats, even in a geographic region that was not included in the HMP studies, this would be higher. Any thoughts on this might reflect?

Thanks.

bi.my...@gmail.com

unread,
Mar 27, 2017, 4:22:35 AM3/27/17
to HUMAnN Users
Hello, I also have this problem. May someone kindly advise on this issue? Thank you.

Eric Franzosa

unread,
Mar 27, 2017, 11:40:26 AM3/27/17
to humann...@googlegroups.com
Indeed, those numbers are surprisingly low. For HMP stool HUMAnN2 tends to map ~60% of reads during pangenome search and a further ~20% (~80% total) after translated search. Some things to check:

* Have you inspected the filtered reads? I have occasionally seen filtering systems that mask bad reads / nucleotides with Ns rather than dropping/trimming them. The masked reads then align very poorly and can inflate the UNMAPPED rate, although I'd also be very surprised if so many reads were having quality issues.

* How do the MetaPhlAn2 profiles for these samples look? For human samples you ought to be getting a fair amount of mapping to detected pangenomes before even falling back to translated search.

* If you BLAST (online) some of the reads that didn't map in HUMAnN2, what do they hit, and what do their alignment profiles look like? This check can reveal (e.g.) a failure to remove adapters OR adapter read-through (the latter is often not caught by explicit adapter trimming).

* You could align a subset of your reads to one of the HUMAnN2 UniRef databases with DIAMOND (i.e. external to HUMAnN2, and hence ignoring all of its filtering / post-processing). That might help to distinguish an issue/interaction with HUMAnN2 from an issue with the reads. 

Hopefully one of these will shed some light on the underlying issue.

Thanks,
Eric


Matthe...@yahoo.com

unread,
Mar 29, 2017, 9:02:19 AM3/29/17
to HUMAnN Users
Hi Eric,

Thanks for your reply. First, I need to apologize in that part of my initial post was partially inaccurate. UNMAPPED reads constituted, as I mentioned, 30 - 40% of the reads depending on the parameters used. Thus, 60 - 70% of the reads could be assigned to a gene. However, in the pathway (but not the genefamilies), UNINTEGRATED reads constituted another 50 - 70%, such that only 5 - 10% of the initial reads could be assigned to a pathway. As to your suggestions for troubleshooting:

-- I confirmed that there were no ambiguous nucleotides in the input fasta files.
-- I inspected the output for four of my samples. The *_bowtie2_unaligned.fa ranged from 30 - 60% of the original sequences. The *diamond_unaligned.fa ranged from 20 - 50% of the original sequences. 
-- I BLASTed the first five sequences from the *diamond_unaligned.fa file. Three of those resulted in no matches on the nucleotide BLAST, with a fourth matching B vulgatus with a negative BLASTx output. The fifth had 92% identity to R. torques, with minimal BLASTx matches.
-- I ran DIAMOND on the *diamond_unaligned.fa file, using the code as indicated in the log file. Somewhat to my surprise, 2.9 million queries aligned, out of approximately 6.7 million input sequences.

My take home messages from this are that there probably are not a whole lot of false negatives among the sequences that failed to align during the HUMAnN2 run -- maybe a minority of them. I am curious what your take is on the finding that the vast majority of sequences that did align with Bowtie2 or DIAMOND failed to map to a pathway. Do you think this is a problem with Metacyc, or was that a fairly typical finding with the original HUMAnN that defaulted to KEGG? 

Regards,

Matt   

Eric Franzosa

unread,
Mar 29, 2017, 7:23:33 PM3/29/17
to humann...@googlegroups.com
Hi Matt,

Thanks for the clarification and for including the details of the checks you performed. Sounds like your gene-level alignment rates are very consistent with what we get for human stool. It is _not_ surprising in my experience to have a large proportion of UNMAPPED/UNINTEGRATED in the pathways file. A lot of reads that map to _some_ sequence are not part of a known pathway in MetaCyc, for example (which inflates UNINTEGRATED).

Because of the way that pathway abundance is computed, the values of UNMAPPED and UNINTEGRATED are also not as easy to compare to "MAPPED" as they are in the gene families file (as outlined here in the manual). We include these values primarily for use as statistical covariates in case you wanted to (say) control for % UNMAPPED when associating pathway abundance with a phenotype.

Thanks,
Eric


Matthe...@yahoo.com

unread,
Mar 30, 2017, 12:53:55 PM3/30/17
to HUMAnN Users
Thank you very much, Eric; this has been quite helpful.

Matt
Reply all
Reply to author
Forward
0 new messages