Unmapped read percentages don't make sense

5 views
Skip to first unread message

Abhimanyu Lele

unread,
Aug 27, 2025, 12:16:12 PM (9 days ago) Aug 27
to Stacks
Hello everyone, 
Apologies for a long message. In brief, I am trying to understand a mismatch in the number of reads bwa is able to map and the number gstacks can put into a catalog. 

I am analyzing sequence data from a set of ~200 3RAD libraries. I aligned trimmed and decloned reads to a reference genome from a congener using bwa. Examining the aligned read counts shows that bwa was able to align >90% of the reads, approximately what I would expect. However, the percentage of reads that gstacks is able to build catalogs from is highly variable, and in some cases below 20% (to be specific, this is "kept_frac" in gstacks.log.distribs). This is not behavior I have seen before. Can anyone help me understand this? I can move forward with my analyses with the data quantity I have, but I want to be reasonably confident the data are not problematic.

Things I have tried so far: 
1) BLAST queries of random sequences. I have found no evidence of contamination yet. 
2) de-novo assembly. While some samples assemble poorly, there was only a weak correlation between the number of de-novo loci and the percentage of kept reads. 
3) The correlation between the raw read count and the percentage of kept reads is also very low: some samples with >500k decloned reads had less than 20% of those built into a catalog. 
4) A different reference genome in the same genus: the results don't change. 
5) Testing for batch effects: I see no evidence that the poorly mapping samples come from a single pool, or a single extraction batch.

I should note that the samples in question are tissues from salvaged birds, in some cases from roadkill. However, I would expect poor tissue quality to affect mapping percentages as well. 

Any assistance would be appreciated,
Best, 
Abhimanyu
Reply all
Reply to author
Forward
0 new messages