Apologies for the long message. In brief, I am trying to understand a mismatch between the number of reads bwa is able to map and the number that gstacks incorporates into the catalog.
I am analyzing sequence data from a set of ~200 3RAD libraries. I aligned trimmed and decloned reads to a reference genome from a congener using bwa. The alignment statistics show that bwa mapped >90% of the reads, approximately what I would expect. However, the percentage of reads that gstacks retains when building the catalog is highly variable, and in some cases below 20% (specifically, this is "kept_frac" in gstacks.log.distribs). This is not behavior I have seen before. Can anyone help me understand it? I can move forward with my analyses with the data quantity I have, but I want to be reasonably confident the data are not problematic.
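To make the comparison concrete, here is a rough sketch of how the two numbers can be pulled side by side per sample. This is not my exact pipeline: the BAM directory is a placeholder, and the kept_frac table in gstacks.log.distribs is located by its header line rather than by section name, in case the section label differs between Stacks versions.

```python
#!/usr/bin/env python3
"""Rough sketch: per-sample bwa mapping rate vs. gstacks kept_frac.

Assumptions (adjust to your own setup):
  - sorted BAMs are in ./aligned/<sample>.bam and samtools is on PATH
  - gstacks.log.distribs contains a tab-delimited table whose header
    includes a 'kept_frac' column (located by header, not section name)
"""
import glob
import os
import re
import subprocess

def gstacks_kept_frac(path):
    """Return {sample: kept_frac} parsed from gstacks.log.distribs."""
    kept, header = {}, None
    with open(path) as fh:
        for raw in fh:
            line = raw.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            if header is None:
                # Assume the per-sample BAM stats header names a kept_frac column.
                if "kept_frac" in line and not line.startswith(("BEGIN", "END")):
                    header = line.split("\t")
                    frac_col = header.index("kept_frac")
                continue
            if line.startswith("END"):
                break
            fields = line.split("\t")
            kept[fields[0]] = float(fields[frac_col])  # first column = sample name
    return kept

def flagstat_mapped_pct(bam):
    """Overall 'mapped (xx.xx% ...)' figure reported by samtools flagstat."""
    out = subprocess.run(["samtools", "flagstat", bam],
                         capture_output=True, text=True, check=True).stdout
    return float(re.search(r"\bmapped \(([\d.]+)%", out).group(1))

if __name__ == "__main__":
    kept = gstacks_kept_frac("gstacks.log.distribs")
    for bam in sorted(glob.glob("aligned/*.bam")):
        sample = os.path.basename(bam)[: -len(".bam")]
        print(f"{sample}\tbwa_mapped_pct={flagstat_mapped_pct(bam):.1f}\t"
              f"kept_frac={kept.get(sample, float('nan')):.3f}")
```

If the section name is known, the stacks-dist-extract utility bundled with Stacks will pull the same table directly; locating it by header here is just defensive.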
Things I have tried so far:
1) BLAST queries of randomly selected reads. So far I have found no evidence of contamination.
2) De novo assembly. While some samples assembled poorly, there was only a weak correlation between the number of de novo loci and the percentage of kept reads.
3) The correlation between the raw read count and the percentage of kept reads is also very weak: some samples with >500k decloned reads had fewer than 20% of those built into the catalog (a sketch of this check follows the list).
4) Mapping against a different reference genome from the same genus: the results did not change.
5) Testing for batch effects: I see no evidence that the samples with low kept_frac come from a single library pool or a single extraction batch.
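For what it's worth, the read-count check in item 3 (and the de novo comparison in item 2) can be summarized with something like the sketch below. The input file kept_vs_reads.tsv is hypothetical: a tab-separated table with one row per sample and columns sample, decloned_reads, and kept_frac, assembled from the demultiplexing counts and gstacks.log.distribs; it is not a file any Stacks program writes directly.

```python
#!/usr/bin/env python3
"""Rough sketch: correlation between decloned read count and gstacks kept_frac.

Assumes a hypothetical table 'kept_vs_reads.tsv' (tab-separated, header
'sample  decloned_reads  kept_frac', one row per sample). Requires scipy.
"""
import csv
from scipy import stats

with open("kept_vs_reads.tsv") as fh:
    rows = list(csv.DictReader(fh, delimiter="\t"))

reads = [int(r["decloned_reads"]) for r in rows]
kept = [float(r["kept_frac"]) for r in rows]

# Spearman (rank) correlation is the safer summary when library sizes are
# right-skewed across ~200 samples; Pearson is shown for comparison.
pearson_r, pearson_p = stats.pearsonr(reads, kept)
spearman_rho, spearman_p = stats.spearmanr(reads, kept)
print(f"n = {len(rows)} samples")
print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")

# List the samples driving the concern: plenty of decloned reads but a low
# kept fraction (thresholds are arbitrary, for illustration only).
for r in rows:
    if int(r["decloned_reads"]) > 500_000 and float(r["kept_frac"]) < 0.20:
        print("high-input / low-kept sample:", r["sample"])
```

Given how variable library sizes are, the rank-based statistic is probably the more informative of the two here.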
I should note that the samples in question are tissues from salvaged birds, in some cases from roadkill. However, I would expect poor tissue quality to depress the bwa mapping percentages as well, and that is not what I am seeing.