Apologies for the long message. In brief, I am trying to understand a mismatch between the number of reads bwa is able to map and the number that gstacks incorporates into the catalog.
I am analyzing sequence data from a set of ~200 3RAD libraries. I aligned trimmed and decloned reads to a reference genome from a congener using bwa. The alignment statistics show that bwa mapped >90% of the reads, approximately what I would expect. However, the percentage of reads that gstacks retains when building the catalog is highly variable, and in some cases below 20% (specifically, this is "kept_frac" in gstacks.log.distribs). This is not behavior I have seen before. Can anyone help me understand it? I can move forward with my analyses with the data quantity I have, but I want to be reasonably confident the data are not problematic.
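To make the comparison concrete, here is a rough sketch of how the two numbers can be pulled side by side per sample. This is not my exact pipeline: the BAM directory is a placeholder, and the kept_frac table in gstacks.log.distribs is located by its header line rather than by section name, in case the section label differs between Stacks versions.

```python
#!/usr/bin/env python3
"""Rough sketch: per-sample bwa mapping rate vs. gstacks kept_frac.

Assumptions (adjust to your own setup):
  - sorted BAMs are in ./aligned/<sample>.bam and samtools is on PATH
  - gstacks.log.distribs contains a tab-delimited table whose header
    includes a 'kept_frac' column (located by header, not section name)
"""
import glob
import os
import re
import subprocess

def gstacks_kept_frac(path):
    """Return {sample: kept_frac} parsed from gstacks.log.distribs."""
    kept, header = {}, None
    with open(path) as fh:
        for raw in fh:
            line = raw.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            if header is None:
                # Assume the per-sample BAM stats header names a kept_frac column.
                if "kept_frac" in line and not line.startswith(("BEGIN", "END")):
                    header = line.split("\t")
                    frac_col = header.index("kept_frac")
                continue
            if line.startswith("END"):
                break
            fields = line.split("\t")
            kept[fields[0]] = float(fields[frac_col])  # first column = sample name
    return kept

def flagstat_mapped_pct(bam):
    """Overall 'mapped (xx.xx% ...)' figure reported by samtools flagstat."""
    out = subprocess.run(["samtools", "flagstat", bam],
                         capture_output=True, text=True, check=True).stdout
    return float(re.search(r"\bmapped \(([\d.]+)%", out).group(1))

if __name__ == "__main__":
    kept = gstacks_kept_frac("gstacks.log.distribs")
    for bam in sorted(glob.glob("aligned/*.bam")):
        sample = os.path.basename(bam)[: -len(".bam")]
        print(f"{sample}\tbwa_mapped_pct={flagstat_mapped_pct(bam):.1f}\t"
              f"kept_frac={kept.get(sample, float('nan')):.3f}")
```

If the section name is known, the stacks-dist-extract utility bundled with Stacks will pull the same table directly; locating it by header here is just defensive.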
Things I have tried so far:
1) BLAST queries of randomly selected reads. So far I have found no evidence of contamination.
2) De novo assembly. While some samples assembled poorly, there was only a weak correlation between the number of de novo loci and the percentage of kept reads.
3) The correlation between the raw read count and the percentage of kept reads is also very weak: some samples with >500k decloned reads had fewer than 20% of those built into the catalog (a sketch of this check follows the list).
4) Mapping against a different reference genome from the same genus: the results did not change.
5) Testing for batch effects: I see no evidence that the samples with low kept_frac come from a single library pool or a single extraction batch.
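For what it's worth, the read-count check in item 3 (and the de novo comparison in item 2) can be summarized with something like the sketch below. The input file kept_vs_reads.tsv is hypothetical: a tab-separated table with one row per sample and columns sample, decloned_reads, and kept_frac, assembled from the demultiplexing counts and gstacks.log.distribs; it is not a file any Stacks program writes directly.

```python
#!/usr/bin/env python3
"""Rough sketch: correlation between decloned read count and gstacks kept_frac.

Assumes a hypothetical table 'kept_vs_reads.tsv' (tab-separated, header
'sample  decloned_reads  kept_frac', one row per sample). Requires scipy.
"""
import csv
from scipy import stats

with open("kept_vs_reads.tsv") as fh:
    rows = list(csv.DictReader(fh, delimiter="\t"))

reads = [int(r["decloned_reads"]) for r in rows]
kept = [float(r["kept_frac"]) for r in rows]

# Spearman (rank) correlation is the safer summary when library sizes are
# right-skewed across ~200 samples; Pearson is shown for comparison.
pearson_r, pearson_p = stats.pearsonr(reads, kept)
spearman_rho, spearman_p = stats.spearmanr(reads, kept)
print(f"n = {len(rows)} samples")
print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")

# List the samples driving the concern: plenty of decloned reads but a low
# kept fraction (thresholds are arbitrary, for illustration only).
for r in rows:
    if int(r["decloned_reads"]) > 500_000 and float(r["kept_frac"]) < 0.20:
        print("high-input / low-kept sample:", r["sample"])
```

Given how variable library sizes are, the rank-based statistic is probably the more informative of the two here.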
I should note that the samples in question are tissues from salvaged birds, in some cases from roadkill. However, I would expect poor tissue quality to depress the bwa mapping percentages as well, and that is not what I am seeing.