High % of unaligned reads?

justinw...@gmail.com

Jul 12, 2017, 11:27:58 AM

to HUMAnN Users

Hello HUMAnN Users,

I have a quick question about my HUMAnN2 annotation output. Here are the steps I ran before HUMAnN2:

1. Trimmomatic trimming of reads (sliding window 4:28).
2. KneadData removal of rRNA reads.
3. HUMAnN2 was then run.

My concern is with the .log output: for most samples (fecal), the "Unaligned reads after translated alignment" value ranges from 90 to 98% (most are above 95%). Is that unusually high? Should I be concerned about my library given the low percentage of annotated reads (roughly 5%, I suppose)?

Thanks,
Justin

Eric Franzosa

Jul 12, 2017, 2:48:06 PM

to humann...@googlegroups.com

Hi Justin,

Those pre-processing steps look fine to me.

The mapping rates you're seeing are definitely on the low side. Are these _human_ fecal samples you're analyzing? For those we tend to see mapping rates around 50-75%.
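
To compare a run against that range, the unaligned percentages can be pulled out of the .log file and converted to mapping rates (100 minus the unaligned percentage). Below is a minimal Python sketch. The exact line layout it matches ("Unaligned reads after translated alignment: 95.25 %") is an assumption based on the message quoted above, and the function names are made up for illustration, so adjust the pattern if your log differs.

```python
import re

# Assumed log line shape (check against your own .log file):
#   "... Unaligned reads after translated alignment: 95.25 %"
PATTERN = re.compile(r"Unaligned reads after (\w+) alignment:\s*([0-9.]+)\s*%")


def unaligned_rates(log_text):
    """Return {stage: percent_unaligned} for every matching line in the log text."""
    return {stage: float(pct) for stage, pct in PATTERN.findall(log_text)}


def mapping_rate(rates, stage="translated"):
    """Percent of reads aligned by the end of the given stage."""
    return 100.0 - rates[stage]
```

For example, reading the log with open(path).read() and passing the text to unaligned_rates gives the per-stage values; a translated-stage value of 95.25 corresponds to a mapping rate of 4.75%.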

Some troubleshooting steps when you have really low alignment rates:

1) Manually inspect the post-QC reads that are actually being fed into HUMAnN2. Do they look reasonable? For example, reads with lots of masked regions (NNNNNN) will often fail to align.

2) Relax the thresholds for mapping during translated search, especially percent identity and database sequence coverage. This can be particularly useful for analyzing a poorly characterized community.
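
As a complement to eyeballing the reads in step 1, here is a rough Python sketch that summarizes how many reads in a FASTQ file are dominated by N bases. It assumes plain 4-line FASTQ records. The 10% cutoff, the 10,000-read limit, and the function name are arbitrary illustrative choices, not HUMAnN2 settings.

```python
from itertools import islice


def summarize_fastq(lines, max_reads=10000, n_cutoff=0.1):
    """Summarize the first max_reads records of a 4-line-per-record FASTQ.

    lines is any iterable of text lines (an open file handle works).
    A read is flagged if it is empty or if more than n_cutoff of its
    bases are N.
    """
    total = flagged = total_length = 0
    for i, line in enumerate(islice(lines, 4 * max_reads)):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        seq = line.strip().upper()
        total += 1
        total_length += len(seq)
        if not seq or seq.count("N") / len(seq) > n_cutoff:
            flagged += 1
    return {
        "reads": total,
        "mean_length": total_length / total if total else 0.0,
        "flagged_fraction": flagged / total if total else 0.0,
    }
```

Usage would be along the lines of "with open(path) as handle: print(summarize_fastq(handle))", where path is the post-QC file given to HUMAnN2. The mean length is reported as well, since it is another quick indicator of whether the reads look reasonable after trimming.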

Any other details about your data and HUMAnN2 run (e.g. version #) would be helpful.

Thanks,

Eric

Justin Wright

Jul 12, 2017, 3:01:12 PM

to humann...@googlegroups.com

HUMAnN2 version: 0.9.9

Database: uniref90_annotated.1.1.dmnd

I should also mention that this is a metatranscriptome dataset, but we have seen better annotation rates with such datasets in the past.

I ran humann2 with "standard settings" (humann2 -i inputdata -o output annotations -threads 32)

Thanks,

Justin

Eric Franzosa

Jul 14, 2017, 1:16:55 PM

to humann...@googlegroups.com

Another useful troubleshooting step in these cases is to grab a handful of the unmapped reads and manually BLAST them online to see what they hit (if anything). Sometimes this can reveal occult sources of contamination. Although you noted that you computationally depleted rRNAs, abundant residual rRNA reads would definitely explain this, as HUMAnN2's databases are all protein-centric.
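
One way to grab that handful of reads is sketched below in Python: it draws a small random sample from a FASTA file of unmapped reads and formats it for pasting into the BLAST web form. Where the unmapped reads live is an assumption (point it at whichever file holds them in your run), and the function names are hypothetical. Reservoir sampling is used so the whole file never has to be held in memory.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_reads(lines, k=10, seed=0):
    """Reservoir-sample up to k records in a single pass over the file."""
    rng = random.Random(seed)
    picked = []
    for i, record in enumerate(read_fasta(lines)):
        if i < k:
            picked.append(record)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                picked[j] = record
    return picked


def to_fasta(records):
    """Format (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Usage would be along the lines of "with open(path) as handle: print(to_fasta(sample_reads(handle, k=10)))". If most of the sampled reads hit rRNA or an unexpected organism, that points to incomplete depletion or contamination rather than a problem with HUMAnN2 itself.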