unaligned bowtie reads

Mohammed S. Alarawi

unread,

Jan 11, 2017, 12:20:17 AM1/11/17

to humann...@googlegroups.com

Dear group members

Greetings to you all,

I would like to share some output for one samples of NGS data I am trying to analyze and this is a stomach WGS data from illumina. Trying to predict the functional pathway and gene families using humann2 is east however I endup with 99% of reads unmapped.
any advise? I think the number is high.

Mohammed

Running metaphlan2.py ........

Found g__Butyrivibrio.s__Butyrivibrio_unclassified : 53.01% of mapped reads
Found g__Prevotella.s__Prevotella_ruminicola : 13.28% of mapped reads
Found g__Betaretrovirus.s__Jaagsiekte_sheep_retrovirus : 10.19% of mapped reads
Found g__Ruminococcus.s__Ruminococcus_flavefaciens : 7.11% of mapped reads
Found g__Prevotella.s__Prevotella_bryantii : 6.73% of mapped reads
Found g__Treponema.s__Treponema_bryantii : 5.29% of mapped reads
Found g__Fibrobacter.s__Fibrobacter_succinogenes : 2.20% of mapped reads
Found g__Methanobrevibacter.s__Methanobrevibacter_unclassified : 2.19% of mapped reads

Total species selected from prescreen: 8

Selected species explain 100.00% of predicted community composition

Creating custom ChocoPhlAn database ........

Running bowtie2-build ........

Running bowtie2 ........

Total bugs from nucleotide alignment: 6
g__Treponema.s__Treponema_bryantii: 3423 hits
g__Betaretrovirus.s__Jaagsiekte_sheep_retrovirus: 1 hits
g__Fibrobacter.s__Fibrobacter_succinogenes: 2999 hits
g__Prevotella.s__Prevotella_ruminicola: 7190 hits
g__Prevotella.s__Prevotella_bryantii: 2398 hits
g__Ruminococcus.s__Ruminococcus_flavefaciens: 1316 hits

Total gene families from nucleotide alignment: 4697

Unaligned reads after nucleotide alignment: 99.7046691171 %

Running diamond ........

Aligning to reference database: uniref90.ec_filtered.1.1.dmnd

Total bugs after translated alignment: 7
g__Treponema.s__Treponema_bryantii: 3423 hits
g__Betaretrovirus.s__Jaagsiekte_sheep_retrovirus: 1 hits
g__Fibrobacter.s__Fibrobacter_succinogenes: 2999 hits
g__Prevotella.s__Prevotella_ruminicola: 7190 hits
unclassified: 74379 hits
g__Prevotella.s__Prevotella_bryantii: 2398 hits
g__Ruminococcus.s__Ruminococcus_flavefaciens: 1316 hits

Total gene families after translated alignment: 7535

Unaligned reads after translated alignment: 98.5272965001 %

Computing gene families ...

Computing pathways abundance and coverage ...

This message and its contents including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email.

Eric Franzosa

unread,

Jan 12, 2017, 12:28:46 PM1/12/17

to humann...@googlegroups.com

Hi Mohammed,

Have these samples been previously depleted for host (human?) genomic reads? Most of those won't map to bug pangenomes or isolated coding sequences, and will therefore inflate the % of unmapped reads. Since these data are from stomach, another possibility would be non-microbial, non-host DNA derived from the diet (e.g. plant matter), which could similarly inflate unmapped reads.

Another possibility is that the species in your community are too diverged to map to reference pangenomes or to proteins at 90% identity, in which case switching to the UniRef50 might generate more protein-level hits? I think the former explanation is more likely though.

Thanks,

Eric

silvi...@gmail.com

unread,

May 23, 2017, 11:43:34 AM5/23/17

to HUMAnN Users

Hello,

I am new to Humann2 and looking for an explanation for a similar problem. I am running Humann2 using Metaphlan2 output and my data is a shotgun sequencing analysis of DNA extracted from rat faecal samples. Much like Mohammed, I notice that when running Humann2, after the bowtie2 step I have a high % of unmapped reads, but my samples cannot contain any human DNA. Should I worry about it? Should I use a different database (currently I am using the UniRef90 one). Also, while I have 93% of unmapped reads, at the metaphlan step it's stated that the bugs identified by metaphlan explain the 100% predicted community composition. Does this mean that, even if the bowtie2 step doesn't align a good percentage of reads, I still have all the reads I need to thoroughly analyze the community composition?

Please find below the log and the command I'm running:

humann2 --input 170427_A19_IonXpress_010.fastq --output 170427_A19_fastq --gap-fill on --taxonomic-profile A19_rel.txt

Output files will be written to: /home/bioc1484/Documents/170427_Pilot/170427_A19_fastq
Found g__Parabacteroides.s__Parabacteroides_goldsteinii : 26.36% of mapped reads
Found g__Lactococcus.s__Lactococcus_lactis : 22.97% of mapped reads
Found g__Akkermansia.s__Akkermansia_muciniphila : 20.84% of mapped reads
Found g__Subdoligranulum.s__Subdoligranulum_unclassified : 10.04% of mapped reads
Found g__Bacteroides.s__Bacteroides_uniformis : 7.45% of mapped reads
Found g__Lactobacillus.s__Lactobacillus_murinus : 3.38% of mapped reads
Found g__Peptostreptococcaceae_noname.s__Peptostreptococcaceae_noname_unclassified : 2.85% of mapped reads
Found g__Blautia.s__Ruminococcus_torques : 1.75% of mapped reads
Found g__Clostridium.s__Clostridium_symbiosum : 0.71% of mapped reads
Found g__Dorea.s__Dorea_unclassified : 0.68% of mapped reads
Found g__Butyrivibrio.s__Butyrivibrio_unclassified : 0.58% of mapped reads
Found g__Peptostreptococcaceae_noname.s__Clostridium_glycolicum : 0.54% of mapped reads
Found g__Oscillibacter.s__Oscillibacter_unclassified : 0.53% of mapped reads
Found g__Eggerthella.s__Eggerthella_unclassified : 0.47% of mapped reads
Found g__Clostridium.s__Clostridium_hathewayi : 0.18% of mapped reads
Found g__Adlercreutzia.s__Adlercreutzia_equolifaciens : 0.18% of mapped reads
Found g__Ruminococcus.s__Ruminococcus_flavefaciens : 0.09% of mapped reads
Found g__Anaerotruncus.s__Anaerotruncus_colihominis : 0.07% of mapped reads
Found g__Flavonifractor.s__Flavonifractor_plautii : 0.07% of mapped reads
Found g__Lachnospiraceae_noname.s__Lachnospiraceae_bacterium_3_1_46FAA : 0.07% of mapped reads
Found g__Anaerotruncus.s__Anaerotruncus_unclassified : 0.06% of mapped reads
Found g__Roseburia.s__Roseburia_unclassified : 0.06% of mapped reads
Found g__Clostridium.s__Clostridium_asparagiforme : 0.04% of mapped reads
Found g__Eubacterium.s__Eubacterium_plexicaudatum : 0.02% of mapped reads
Found g__Bifidobacterium.s__Bifidobacterium_pseudolongum : 0.01% of mapped reads

Total species selected from prescreen: 25

Selected species explain 100.00% of predicted community composition

Creating custom ChocoPhlAn database ........

Running bowtie2-build ........

Running bowtie2 ........

Total bugs from nucleotide alignment: 17
g__Akkermansia.s__Akkermansia_muciniphila: 79346 hits
g__Lactococcus.s__Lactococcus_lactis: 120221 hits
g__Blautia.s__Ruminococcus_torques: 5543 hits
g__Parabacteroides.s__Parabacteroides_goldsteinii: 260069 hits
g__Bacteroides.s__Bacteroides_uniformis: 79711 hits
g__Eubacterium.s__Eubacterium_plexicaudatum: 29348 hits
g__Peptostreptococcaceae_noname.s__Clostridium_glycolicum: 4923 hits
g__Anaerotruncus.s__Anaerotruncus_colihominis: 5237 hits
g__Clostridium.s__Clostridium_hathewayi: 4780 hits
g__Clostridium.s__Clostridium_asparagiforme: 1705 hits
g__Lactobacillus.s__Lactobacillus_murinus: 11265 hits
g__Adlercreutzia.s__Adlercreutzia_equolifaciens: 2972 hits
g__Flavonifractor.s__Flavonifractor_plautii: 10271 hits
g__Clostridium.s__Clostridium_symbiosum: 13702 hits
g__Lachnospiraceae_noname.s__Lachnospiraceae_bacterium_3_1_46FAA: 3826 hits
g__Bifidobacterium.s__Bifidobacterium_pseudolongum: 641 hits
g__Ruminococcus.s__Ruminococcus_flavefaciens: 314 hits

Total gene families from nucleotide alignment: 28011

Unaligned reads after nucleotide alignment: 93.3538774989 %

Eric Franzosa

unread,

May 23, 2017, 12:06:06 PM5/23/17

to humann...@googlegroups.com

Hi Silvia,

What sort of alignment are you seeing after translated search? This actually looks like a pretty good run, and it may just be that you sample contains a lot of species that aren't represented in ChocoPhlAn. While I would still recommend depleting your sample for host (in this case, rat) genomic reads, I don't think their presence explains the reduced mapping rate.

Switching from UniRef90 to 50 will not increase the mapping to pangenomes, but it _is_ better for detecting remote homology in translated search, which can be useful for a poorly characterized community.

The message "Selected species explain 100.00% of predicted community composition" is a bit confusing in a vacuum: it means that, of the species detected by MetaPhlAn2 (which sum to 100% by definition), the subset selected by HUMAnN2 cover that entire 100%. The message would be different if you raised HUMAnN2's default abundance threshold (0.01%) to something more stringent (e.g. 1%), in which case the selected species might not cover the full 100% predicted by MetaPhlAn2.

To translate this into "% of sample reads explained" you really do need to look at the final mapping results. In this case ~7% of your sample reads are explained by the detected/selected species.

Thanks,

Eric

silvi...@gmail.com

unread,

May 23, 2017, 1:09:07 PM5/23/17

to HUMAnN Users

Hello Eric,

Thank you for your swift reply.
After translated alignment I still see 90% unaligned reads. Not sure how to proceed now, would it be ok to just process all the other samples and finish the analysis? Or should I try to look more into this problem? I would assume ChocoPhlAn would be one of the richest databases available if it comes as a default option for Humann2. If my species aren't there, where would they be annotated? I highly doubt my samples would contain 90% completely new and never before seen species.

Sorry, I hope I am not wasting your time but I am really confused.

Best,

Silvia

Eric Franzosa

unread,

May 24, 2017, 10:45:17 AM5/24/17

to humann...@googlegroups.com

Hmmm, 90% unmapped after translated search is rather low. I would definitely recommend trying to understand what's happening in the context of one (or a few) samples before doing a large analysis. If it's really an issue of remote homology, re-running on UniRef50 or with a relaxed percent identity threshold should lower that number considerably. If it's _still_ ~90%, I would start to wonder if there's a technical issue at play.

As an example, we had a recent dataset where many reads contained partial adapters at the 3' end from read-through, and it was limiting the ability of reads to map throughout the HUMAnN2 workflow. That sort of thing becomes very obvious if you run your reads through FastQC.

Thanks,

Eric

Silvia Raineri

unread,

May 24, 2017, 11:09:16 AM5/24/17

to humann...@googlegroups.com

Dear Eric,

many thanks for your feedback. I did run fastQC beforehand and it did not find any adapters in the sequences, but I might have made a mistake so I will try running it again on a couple of samples. Furthermore, I am now trying humann2 with the UniRef50 database with the same sample I posted here. There is something that has just occurred to me: I am using the UniRef90 EC-filtered database, should I try the complete one?

Best,

Silvia

Eric Franzosa

unread,

May 24, 2017, 11:14:08 AM5/24/17

to humann...@googlegroups.com

Ah, ok - The EC-filtered database is fine for reconstructing pathways, but it won't give an accurate estimate of the total reads in the sample that can be explained by reference databases (since only ~10% of UniRefs have an EC annotation). Switching to the full database will probably also provide a dramatic improvement in the % of reads that map during translated search (it won't change the % of reads that map to pangenomes). The full database takes about 2x longer to search despite containing ~10x the sequences.

Thanks,

Eric

Silvia Raineri

unread,

May 24, 2017, 12:36:45 PM5/24/17

to humann...@googlegroups.com

Thanks Eric, it's much clearer now. I will try running humann2 with the full ref90 database and in the meantime check again on fastqc for any adapter sequences, Hopefully, this will solve the problem.

Best,

Silvia

Silvia Raineri

unread,

May 25, 2017, 4:29:24 AM5/25/17

to humann...@googlegroups.com

Hello again Eric,

I have now finished running the script against the full Ref90 database, this time the percentage has dropped from 90% to 82%, it doesn't look good enough...I also checked on FASTQC but it doesn't signal any overrepresented sequences. I will try running against the 50Ref database and see what happens. I was wondering, what's the average percentage of aligned translated reads one would be happy to see?

Thanks,

Silvia

Eric Franzosa

unread,

Jun 5, 2017, 1:24:01 PM6/5/17

to humann...@googlegroups.com

Hi Silvia,

Sorry for the delayed reply - I was traveling last week. Feel free to update us on how this is going.

In answer to your question, I would be surprised if <50% of reads mapped to UniRef50 (i.e. with 50% identity to a known protein). That would suggest that the sample contains a LOT of novel protein sequence diversity.

Thanks,

Eric

Silvia Raineri

unread,

Jun 13, 2017, 5:47:48 AM6/13/17

to HUMAnN Users

Hello Eric,

after checking the samples with FastQC, stripping them from host-derived (rat) reads with bowtie2 and running against the UniRef50 database, samples have between 40 and 60% unmapped translated reads. I know it doesn't sound ideal, but I decided to proceed with the analysis as I don't have a better solution yet. Not sure whether this is due to contaminant DNA from the food.

Best,

Silvia

Eric Franzosa

unread,

Jun 13, 2017, 10:30:06 AM6/13/17

to humann...@googlegroups.com

Hi Silvia,

That actually sounds a lot more reasonable, i.e. ~50% of reads explained. I think you should be fine going forward with your analysis given that mapping rate.