Hi Julian, and thank you for your reply.
My data is indeed paired-end.
Regarding the trimmomatic use, I did HEADCROP:15 CROP:110 , so I cut the first 15 sites (the 5 cutsite bases plus 10, out of an abundance of caution), and then also the last 26 which had very good, but slightly lower, quality. This is partially inspired by NGI's own quality control pipeline.
The IGV screenshots I showed are from the raw bams (but targetting sites I identified after running populations).
Between my earlier messages and now, I've been running alternate pipeline on 4 samples:
- one where I do not use trimmomatic, but pass the raw reads through process radtags with the options --renz-1 ecoRI --renz-2 ecoRI -r -c -q , then do the mapping and variant calling
- one where I use the bams from my previous pipeline (with first and last bases cut off), but only the same 4 samples for SNP calling with gstacks
in both cases, I also applied the populations filters I had applied at that point on my whole data (--min-maf 0.05 --min-gt-depth 10 --max-obs-het 0.7 --min-samples-overall 0.5) because I would anyways.
The results are as follows: with process-radtags checking for cutsite, I lose most of my raw reads (this is also where I should mention that, mistakenly, I omitted to also add --score-limit 30 to my process radtags command, as I did for the trimmomatic reads, so I would lose even possibly even more): from process radtags log:
BEGIN per_file_raw_read_counts
File Retained Reads Low Quality Poly-G Barcode Not Found RAD cutsite Not Found Total
P21411_194_S186_L003_R1_001.fastq.gz 2402524 3881 384 0 6884291 9291080
END per_file_raw_read_counts
BEGIN total_raw_read_counts
Total Sequences 9291080
Barcode Not Found 0 0.0%
Low Quality 3881 0.0%
Poly-G Runs 384 0.0%
RAD Cutsite Not Found 6884291 74.1%
Retained Reads 2402524 25.9%
Properly Paired 398491 8.6%
END total_raw_read_counts
this is not too surprising to me as my cutsite sequences are really quite low quality (see attached image from multiqc report), so process radtags fails to recover many of them (and even moreso considering I'm also asking it to check on the paired-end side). Accordingly my gstacks result show a quarter of the coverage as well.
The number of SNP identified in the resulting VCFs are similar, but not the number of loci:
for the reads with cutsites: 4542 SNP on 11415 loci, of which 2513 SNP are covered by two loci and 2029 by a single one.
for the trimmomatic reads: 3885 SNP on 24190 loci, of which 2 are covered by 3 loci, 1621 by two, and 2260 by a single one.
I clearly have more SNP per loci when keeping the cutsites since I have similar SNP numbers but double the loci. This is surely due to the fact that my radtags with cutsites are 151 bp long rather than 110 for the trimmomatic reads.
Unless you expressly recommend against it (because, say, Gstacks works much better when a cutsite is present), I think I would keep my trimmomatic-cut-reads-based gstacks and populations result, but only whitelist SNPs that are covered by a single locus, to move forward with filtering (on max depth, missing data, and --write-random-snp). But if you say I should do othewise, I will certainly defer to you!
I am quite convinced, that one decently big part of the reason I get so many overlapping loci is that EcoRI cutsites are really frequent in the butterfly genome (it was the default enzyme used by NGI, hence the choice). I found several such cases (see the attached example, taken from the reads-with-cutsites run, but also the ex-3loci image in my previous message. However because in some cases I noticed the genotype calls are different on each loci, I'd rather remove these multi-loci SNP altogether.
Sorry for the novel-length reply, but I really appreciate your replies and insights
Best,
Patrick