Same chromosome position found in multiple locus IDs - populations.snps.vcf


Ethan Baldwin

Feb 17, 2025, 6:14:14 PM
to Stacks
Hi all,

I am analyzing some GBS data (two-enzyme, similar to ddRAD) with Stacks in reference mode. Looking at the VCF file, there are sometimes multiple locus IDs sharing the same genomic coordinates. For example:

grep -v ^# populations.snps.vcf | grep 2849996 | cut -f1-5

Chr01   2849996 2844:49:-       T       A
Chr01   2849996 2852:53:-       T       A
Chr01   2849996 2854:54:-       T       A
Chr01   2849996 2855:55:-       A       T
Chr01   2849996 2856:56:-       A       T

I'm not showing the genotypes here, but they are wildly different per sample! Looking at the reads mapped to this region in a genome browser, I don't see why Stacks thinks these are different loci, unless soft clipping near the beginning of reads could make it think they start at different positions?

Command for mapping:

bwa mem -t 12 $reference ${sample}.1.fq.gz ${sample}.2.fq.gz | samtools view -O BAM | samtools sort --threads 12 -o $sample.psitt.bam

Commands for stacks:

gstacks -I $OUTDIR/bams -M $POPMAP_ALL -t 16 --details -O $GSTACKS_DIR

populations -P $GSTACKS_DIR --popmap $POPMAP_CROSS -t 16 -O $OUTDIR/populations_merged --vcf

Any help would be appreciated!

Thanks,
Ethan

Angel Rivera-Colón

Feb 19, 2025, 7:55:45 PM
to Stacks
Hi Ethan,

This behavior happens because there are multiple catalog loci overlapping that position in the reference. You can see this in the ID field (3rd column of the VCF), which contains the Stacks catalog locus ID, the column of the variant site within that locus, and the strand. As recommended in section 7.2.2 of the manual (link), we advise using the --ordered-export flag when exporting a VCF from populations with reference-aligned data. Among other ordering guarantees, it exports a given genomic coordinate only once. Also, note that you are not applying any filters to your catalog in populations, only exporting the VCF. I suspect you are seeing that many loci at the site because many of them are spurious "junk" loci present in only a single sample. In my experience, applying even baseline missing-data filters removes a large portion of these. Whatever true overlaps remain are then handled by --ordered-export.
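To make the deduplication concrete, here is a minimal Python sketch of the effect --ordered-export has on the exported records: keep only the first record seen per genomic coordinate. This is an illustration of the idea, not the actual populations implementation, and the toy records are hypothetical.

```python
# Sketch of the coordinate deduplication that --ordered-export performs on a
# VCF export: keep only the first record seen per (CHROM, POS).
def dedupe_vcf_lines(lines):
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        chrom, pos = line.split("\t")[:2]
        if (chrom, pos) not in seen:
            seen.add((chrom, pos))
            yield line

# Toy records mirroring the output above (genotype columns omitted).
vcf = [
    "##fileformat=VCFv4.2",
    "Chr01\t2849996\t2844:49:-\tT\tA",
    "Chr01\t2849996\t2852:53:-\tT\tA",
    "Chr01\t2850100\t2860:10:+\tA\tG",
]
for rec in dedupe_vcf_lines(vcf):
    print(rec)
```

Only the 2844:49:- record survives at position 2849996; whichever catalog locus appears first becomes the representative.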

Thanks,
Angel

kwojt...@gmail.com

Mar 5, 2025, 12:27:14 PM
to Stacks
Hi Angel,

I ran into the same thing that Ethan did and took your advice and re-ran populations with the --ordered-export flag. However, when compared with the original VCF (the non-ordered-export one), it looks like populations kept the very first instance of each genomic coordinate it came across, which in my case meant keeping the spurious junk loci with high amounts of missingness. When you compare the two VCFs, it is obvious which SNPs sharing the same coordinates are spurious junk and which are not, and I am losing the SNPs that are real and genotyped across most individuals. Is there a way to tell populations to use --ordered-export and only keep the locus that has the least missing data?
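For what it's worth, the selection I'm describing could be done post hoc on the plain (non --ordered-export) VCF; a rough Python sketch, with made-up records, since to my knowledge populations has no such option:

```python
# Among VCF records sharing a genomic coordinate, keep the one with the
# fewest missing genotypes ("./."). Genotype columns start at field 10.
# Illustration only; not a populations feature.
def least_missing_per_site(records):
    best = {}  # (chrom, pos) -> (n_missing, record)
    for rec in records:
        fields = rec.split("\t")
        key = (fields[0], fields[1])
        n_missing = sum(gt.split(":")[0] in ("./.", ".") for gt in fields[9:])
        if key not in best or n_missing < best[key][0]:
            best[key] = (n_missing, rec)
    return [rec for _, rec in best.values()]

# Two loci at the same position: a junk one (mostly missing) and a real one.
records = [
    "Chr01\t100\t10:5:+\tA\tT\t.\tPASS\t.\tGT\t./.\t./.\t0/1",
    "Chr01\t100\t12:7:-\tA\tT\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
]
print(least_missing_per_site(records))  # keeps only the 12:7:- record
```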

Thank you,
Kris

Angel Rivera-Colón

Mar 6, 2025, 6:31:53 PM
to Stacks
Hi Kris,

Could you confirm the populations command you used? Did you apply any missing data filters (e.g., -R, -r, -p) when doing this comparison? In theory, the junk loci should be removed first by these filters. The overlap is then checked for the remaining sites, keeping the first one present in the catalog as the representative. In my experience, this should properly handle cases like these.

Thanks,
Angel

kwojt...@gmail.com

Mar 7, 2025, 9:28:13 AM
to Stacks
Hi Angel,

My populations command is below, and my problem is almost certainly that I didn't use any filters. The data I'm working with is from amplicons, not RAD-seq, and others have recommended using Stacks to call genotypes instead of a custom pipeline we were provided. Since we have a list of target SNPs, my plan was to apply no filters until after narrowing the data down to only the target positions, so that I could look at capture rates and how they vary with different filtering strategies; that's where I'm running into the issues above. Looking at the data, the junk loci have incredibly high missingness (>90% in almost every case). So I could apply a high missingness filter in populations to get rid of those loci first, and that should do it.

Thank you for your help.

populations \
-P ~/gstacks/Panel_1 \
-O ~/populations/Panel_1 \
-M ~/scripts/Panel_1/Pop_Map.tsv \
--vcf

-Kris

Angel Rivera-Colón

Mar 8, 2025, 3:53:52 PM
to Stacks
Hi Kris,

Yes, there aren't any filters being applied, so populations is exporting the "raw" catalog, which is always very messy. The catalog contains all the data from all samples, including many of those spurious junk loci that are only seen in a single sample at low coverage. My advice is to always apply some missing-data filter in populations, even a very lenient one. Like you mentioned, you could try something like -R 10 (i.e., requiring a locus to be present in at least 10% of samples overall). That should remove most of those junk loci.
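The criterion behind such a filter is simple enough to sketch; here it is applied post hoc to a single hypothetical VCF record (this is the idea of -R/--min-samples-overall, not populations' internal logic):

```python
# Drop a VCF record genotyped in fewer than min_frac of samples: the
# criterion a lenient -R/--min-samples-overall filter applies per locus.
def passes_min_samples(vcf_record, min_frac=0.10):
    fields = vcf_record.split("\t")
    gts = [gt.split(":")[0] for gt in fields[9:]]
    n_called = sum(gt not in ("./.", ".") for gt in gts)
    return n_called / len(gts) >= min_frac

# A "junk" locus missing in every sample vs. a well-genotyped one.
rec_junk = "Chr01\t100\t10:5:+\tA\tT\t.\tPASS\t.\tGT\t./.\t./.\t./.\t./."
rec_good = "Chr01\t200\t12:7:-\tA\tT\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\t./."
print(passes_min_samples(rec_junk), passes_min_samples(rec_good))  # False True
```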

Angel

Claude Patrick Millet

Nov 20, 2025, 2:19:07 PM
to Stacks
Hello all, 
I am faced with the same problem, but in my case I did do filtering.
Basically I aligned with bwa mem (default params) | samtools view -b -h | samtools sort
then did gstacks  --var-alpha 0.01 --gt-alpha 0.01 --min-mapq 30
then populations --min-maf 0.05 --min-gt-depth 10 --max-obs-het 0.7 --min-samples-overall 0.5 --vcf
(admittedly I did not use the --ordered-export flag, but note that I asked for loci to be present in half of my samples!)
And if I run the same grep command as the original poster (E. Baldwin) on my VCF (the one filtered as described above):
 grep -v ^# populations.snps.vcf | grep 4751600 | cut -f1-5
ENA|LR989985|LR989985.1 4751600 87009:263:+     C       T
ENA|LR989985|LR989985.1 4751600 87022:64:-      C       T
ENA|LR989985|LR989985.1 4751600 87023:65:-      C       T
ENA|LR989985|LR989985.1 4751600 87024:66:-      C       T

I am curious as to where this might come from. I am worried that if I just ask populations to export one SNP per position, I may be including spurious SNPs in my final dataset, despite asking gstacks to be quite stringent with SNP and genotype calling. I should also add, in case it's relevant, that an idiosyncrasy of my data is that despite being RADseq (EcoRI-cut with no random shearing, so both paired reads start with the cutsite, although I cropped it out with Trimmomatic due to the lower quality of the cutsite bases), I have very high depths (mean around 400-500) due to my organism's (butterflies) small genome and the sequencing center using a more powerful sequencing machine than planned to avoid waiting time.

Any ideas?
Thanks
Patrick M

Claude Patrick Millet

Nov 21, 2025, 6:32:07 AM
to Stacks
To give a bit more information, these duplications seem to be very prevalent in my data. For instance, the VCF I mentioned in my previous message had 70,468 SNPs, and after running the same command with --ordered-export, I get 42,799. I'm attaching the log.distribs files for the commands with --ordered-export (ordex) and without.
From the populations output (the same for both commands):
"
Removed 888292 loci that did not pass sample/population constraints from 903099 loci.
Kept 14807 loci, composed of 4485192 sites; 2043304 of those sites were filtered, 70468 variant sites remained.
    1430521 genomic sites, of which 1155272 were covered by multiple loci (80.8%).
Filtered 7481245 genotypes that fell below the minimum genotype depth threshold.
Mean genotyped sites per locus: 182.02bp (stderr 0.39).
"

I am just wondering if this is normal, as I can't quite picture that so many loci would be overlapping (unless it is an artifact of my high read depths?). Although the ordered export is an easy fix, I wouldn't want to carry on without understanding the data and having decent confidence in it...
I'm pulling my hair out on this!!! 
Thank you all
Patrick M

populations_ordex.log.distribs
populations.log.distribs

Angel Rivera-Colón

Nov 21, 2025, 7:21:08 PM
to Stacks
Hi Patrick,

Export-wise, the --ordered-export flag should handle the issue in VCF. As mentioned before, we advise using it whenever exporting VCF from a reference-aligned catalog.

For the reason behind the very high proportion of overlapping sites, I am curious about your library design. I am not familiar with this library configuration (having the single enzyme flank the reads on both ends), but do you have an idea of the density of EcoRI cutsites in your genome? If the recognition sequence is very common, you might have many loci in tandem with some level of overlap among them. Since you have very high depths, you might have enough reads to span all the possible combinations of loci. You could check this using an in silico digest and/or a simulation to get an expected number of loci.
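As a quick sanity check on cutsite density, a rough in silico digest can be a few lines of Python; a sketch with a toy FASTA (real tools like SimRAD or RADinitio also model fragment size selection, which this ignores):

```python
# Count EcoRI recognition sites (GAATTC) per FASTA record. The site is
# palindromic, so one scan of the forward strand suffices.
def count_cutsites(fasta_text, motif="GAATTC"):
    counts, name, seq = {}, None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if name is not None:
                counts[name] = "".join(seq).upper().count(motif)
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line.strip())
    if name is not None:
        counts[name] = "".join(seq).upper().count(motif)
    return counts

fasta = ">chr1\nACGAATTCGGGAATTCTT\n>chr2\nTTTTGAATTCAA\n"
print(count_cutsites(fasta))  # {'chr1': 2, 'chr2': 1}
```

Dividing total genome length by the total count gives the average spacing between cutsites, which is the number to compare against your read length.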

Hope this helps,
Angel

Claude Patrick Millet

Nov 24, 2025, 6:34:04 AM
to Stacks
Hi!
And thank you Angel for your reply! 
I don't have a good sense of the density of EcoRI cutsites. I tried doing simulations with SimRAD (RADinitio is not currently installed on the HPC I have access to), and using the full genome it found 96,600 sites (the genome is 384 Mb), but I don't know how many of those fall within the size range selected during library prep.
I wasn't involved with the library prep, but from what I understand it's the standard way the sequencing center (NGI Sweden's SciLifeLab) does RAD, and it's apparently akin to ezRAD. They keep the cutsite sticky ends on both sides of the fragments, but only select those that fall within the desired size range (420-720 bp). I had bad sequencing quality on the cutsite bases (due to the low complexity tied to all starting sequences being the same), which the sequencing center expected, and because of this I used Trimmomatic's HEADCROP. I ended up removing the first 15 bases (including the 5 from the cutsite) to be safe.

These multi-locus SNPs are definitely odd, and they are extremely numerous in my data (more than half!). I had a look at some of them in IGV and it seems they are loci that do not start at the same position. Some of these stacks do start, as expected, 10 bases after a cutsite (and I suppose overlaps due to closely occurring cutsites must be frequent, as I found one while looking around in IGV; see the attached examples, where I've circled the cutsite motif in red). In other cases, the different stacks seem to be due to a lack of a consistent starting position for the reads, I assume because of the absence of a cutsite to "anchor" them. But I noticed that sometimes the genotype call for a particular SNP differs depending on the locus (I suspect the example with 5 stacks in the attached image is representative of such a case).
So I think, to be safe, I will keep only the SNPs that are supported by a single locus, even though in doing so I'd be discarding half to 2/3 of the identified SNPs.

I'm going to run process_radtags on the raw reads from a few samples using "--renz-1 ecoRI --renz-2 ecoRI -q -c -r" to see if keeping the cutsite (although I will discard 60-70% of my reads...) reduces the high number of SNPs that occur on several loci. Although I don't know if some of these odd sites are repeats/transposable elements that I'm better off discarding anyway...
I'd love to hear your thoughts!
ex_3loci_withadjacentcutsites.png
ex_5loci.png
ex_4loci.png
ex_2loci.png

Catchen, Julian

Nov 24, 2025, 11:13:31 AM
to stacks...@googlegroups.com
Hi Patrick,

I would not trim the 5’ end off of your sequences. The software can account for a certain amount of sequencing error, both in process_radtags, which will correct cutsite remnants, and the genotyping algorithm, which can handle some error in the sequences as well. Stacks relies on having the cutsite remnant to line up all the reads at a particular location. Did you trim a fixed amount of sequence, or did you allow the software to variably trim depending on the quality scores of the sequences? 

Did you also trim the 3’ end of the reads? Finally, did you do paired-end sequencing?

It is hard to tell exactly how SciLifeLab’s protocol will leave the resulting DNA molecules in the sequencing library. However, since it is the same cutsite on both ends of the molecule, and you may have trimmed reads, reads from the same locus may have originated from the positive strand of the DNA or the negative strand of the DNA (assuming you did PE sequencing); the software builds loci from the aligned reads on the positive strand of the reference genome, so trimming could have created staggered start sites on either end.

You could look at raw reads from a single locus, after aligning with BWA and without any processing at all, to see what the raw locus looks like, where the cutsites are, and what remnants remain at each cutsite location.

At a minimum, I would run the data without trimming and compare it to your current set. Once you have that, you can ask if you see the same type of overlapping loci. 

Julian

Claude Patrick Millet

Nov 24, 2025, 12:28:32 PM
to Stacks
Hi Julian, and thank you for your reply.
My data is indeed paired-end. 
Regarding the Trimmomatic use, I did HEADCROP:15 CROP:110, so I cut the first 15 bases (the 5 cutsite bases plus 10, out of an abundance of caution), and then also the last 26, which had very good, but slightly lower, quality. This is partially inspired by NGI's own quality-control pipeline.
The IGV screenshots I showed are from the raw BAMs (but targeting sites I identified after running populations).
Between my earlier messages and now, I've been running alternate pipelines on 4 samples:
- one where I do not use Trimmomatic, but pass the raw reads through process_radtags with the options --renz-1 ecoRI --renz-2 ecoRI -r -c -q, then do the mapping and variant calling
- one where I use the BAMs from my previous pipeline (with first and last bases cut off), but only the same 4 samples for SNP calling with gstacks
In both cases, I also applied the populations filters I had applied at that point to my whole data (--min-maf 0.05 --min-gt-depth 10 --max-obs-het 0.7 --min-samples-overall 0.5), because I would anyway.
The results are as follows: with process_radtags checking for the cutsite, I lose most of my raw reads (this is also where I should mention that I mistakenly omitted --score-limit 30 from my process_radtags command, as I had used for the Trimmomatic reads, so I might lose even more). From the process_radtags log:
BEGIN per_file_raw_read_counts
File    Retained Reads  Low Quality     Poly-G  Barcode Not Found       RAD cutsite Not Found   Total
P21411_194_S186_L003_R1_001.fastq.gz    2402524 3881    384     0       6884291 9291080
END per_file_raw_read_counts

BEGIN total_raw_read_counts
Total Sequences 9291080
Barcode Not Found       0       0.0%
Low Quality     3881    0.0%
Poly-G Runs     384     0.0%
RAD Cutsite Not Found   6884291 74.1%
Retained Reads  2402524 25.9%
Properly Paired 398491  8.6%
END total_raw_read_counts

This is not too surprising to me, as my cutsite sequences are really quite low quality (see the attached image from the MultiQC report), so process_radtags fails to recover many of them (even more so considering I'm also asking it to check the paired-end side). Accordingly, my gstacks results show a quarter of the coverage as well.
The number of SNPs identified in the resulting VCFs is similar, but not the number of loci:
for the reads with cutsites: 4542 SNPs on 11415 loci, of which 2513 SNPs are covered by two loci and 2029 by a single one.
for the Trimmomatic reads: 3885 SNPs on 24190 loci, of which 2 are covered by 3 loci, 1621 by two, and 2260 by a single one.
I clearly have more SNPs per locus when keeping the cutsites, since I have similar SNP numbers but half the loci. This is surely because my radtags with cutsites are 151 bp long rather than 110 bp for the Trimmomatic reads.

Unless you expressly recommend against it (because, say, gstacks works much better when a cutsite is present), I think I will keep my Trimmomatic-cut-reads-based gstacks and populations results, but only whitelist SNPs that are covered by a single locus, to move forward with filtering (on max depth, missing data, and --write-random-snp). But if you say I should do otherwise, I will certainly defer to you!
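For reference, building that single-locus whitelist from the VCF can be sketched like this (toy records; I'm assuming the whitelist format of one locus ID per line, optionally followed by a tab and the SNP column, as populations -W expects):

```python
# Build a whitelist of SNPs whose genomic coordinate is covered by exactly
# one catalog locus, parsed from the third VCF column ("locus:column:strand").
from collections import Counter

def single_locus_whitelist(vcf_body_lines):
    coord_counts = Counter()
    entries = []
    for line in vcf_body_lines:
        chrom, pos, snp_id = line.split("\t")[:3]
        locus, col = snp_id.split(":")[:2]
        coord_counts[(chrom, pos)] += 1
        entries.append(((chrom, pos), locus, col))
    return [f"{locus}\t{col}" for coord, locus, col in entries
            if coord_counts[coord] == 1]

body = [
    "Chr01\t100\t10:5:+\tA\tT",   # position covered by two loci -> dropped
    "Chr01\t100\t12:7:-\tA\tT",
    "Chr01\t250\t15:3:+\tC\tG",   # unique position -> whitelisted
]
print(single_locus_whitelist(body))  # one entry: locus 15, column 3
```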

I am quite convinced that one decently big part of the reason I get so many overlapping loci is that EcoRI cutsites are really frequent in the butterfly genome (it was the default enzyme used by NGI, hence the choice). I found several such cases (see the attached example, taken from the reads-with-cutsites run, and also the ex_3loci image in my previous message). However, because in some cases I noticed the genotype calls differ at each locus, I'd rather remove these multi-locus SNPs altogether.
Sorry for the novel-length reply, but I really appreciate your replies and insights
Best,
Patrick
renz2_2loci.png
multiqc per base quality.png

Claude Patrick Millet

Nov 24, 2025, 12:32:06 PM
to Stacks
P.S. I also meant to attach a "per base sequence content" graph from one of my FastQC reports that illustrates, even better than the sequence quality plot, the many errors that are prevalent in the cutsite region and explain why process_radtags can't, well, process my radtags!
per_base_sequence_content_example.png

Claude Patrick Millet

Nov 24, 2025, 12:35:50 PM
to Stacks
(this is the R2 reads, but the same pattern is also prevalent in R1. They both start with the cutsite so they behave the same. And so I cut them the same way using trimmomatic)

Angel Rivera-Colón

Nov 24, 2025, 3:58:36 PM
to Stacks
Hi Patrick,

Aside from any issues with trimming, as Julian mentioned, I also think a lot of the issue is the frequency of the EcoRI sites. You mention finding ~96K cutsites in the 384 Mbp genome. That's ~250 sites per Mbp, or a cutsite every ~4 kb, on average. You will get some areas of the genome with much higher densities, resulting in these overlapping loci. For comparison, for some of our recent datasets on a fish genome we had around 50K cutsites in a 1 Gbp genome, and even then we saw a small, but measurable, proportion of overlap between some loci.

In summary, I think some of this is expected given the library configuration. However, given that it is fairly atypical, it does bring some problems during analysis. Regardless of the exact way you end up processing the reads, I would definitely remove some of those duplicate SNPs. At the very least, they are just duplicated observations of the same variant and might inflate your estimates of variation and/or cause problems in the downstream processing. At worst, they might be of unequal quality if one of the calls happens in those problematic areas with varying coverage.

Thanks,
Angel

Claude Patrick Millet

Nov 25, 2025, 5:18:30 AM
to Stacks
Hi Angel,
Thanks for your reply. My plan is to whitelist only SNPs that occur on a single locus, to err on the side of caution.
My remaining question (and it is a big one, I think) is: do you (and/or Julian) see any big objections to using Stacks on reads that have had their cutsites removed? gstacks seems to have worked pretty well despite the absence of cutsite motifs to anchor the reads.
I would prefer to remove the cutsites (and adjacent base pairs) rather than "waste" all the reads that process_radtags can't recover, but maybe I'm way off base (pun somewhat intended).
Best,
Patrick

Angel Rivera-Colón

Nov 25, 2025, 5:51:54 PM
to Stacks
Hi Patrick,

I haven't tested this issue extensively, particularly with overlaps as extreme as these, so I am not 100% sure. I am always a bit suspicious of the raw reads without the cutsites because we cannot be sure what got sequenced (since we expect the RAD reads to always have the cutsite). Yet, they are aligning to the genome, so we could be somewhat confident that they are reads from the host.

I would say that the best approach is to see what the data says. Generate a handful of different datasets and see what biological inferences you are able to draw. For example, do you get the same patterns of population structure, genetic diversity/divergence, etc., when removing the duplicated sites altogether versus when using just --ordered-export? Also, remember that not all analyses require super high numbers of SNPs, so you might be able to get robust results even with a reduced dataset.

Not the best answer from a technical sense, but this is more or less how I would go testing this out. Hope it helps.

Thanks,
Angel

Claude Patrick Millet

Nov 25, 2025, 6:10:24 PM
to Stacks
Hi Angel,
That's definitely helpful, and I appreciate your taking the time to give your perspective on this. I'm definitely thinking I'd rather have fewer, but more trustworthy, SNPs than a greater number of untrustworthy ones.
In case you (or someone else) have any insights: I also have data from a parasitoid wasp from the same library. It's a similar genome size and I expect similar patterns, but this time I won't have a reference genome, so I'll have to do de novo calling... I'm thinking of trying the integrated approach with a reference genome from the same subfamily, because I'm guessing pure de novo will completely ignore these overlaps?
Anyways, thanks again so much!
Patrick

Angel Rivera-Colón

Nov 25, 2025, 7:09:45 PM
to Stacks
Correct. We can only see these overlaps because of the reference. Normally, I wouldn't worry too much about them when running the data de novo, since they are not super common in most cases. However, since we expect them to be common in this case, I would definitely take a look. Also, it might be good to mention this to the folks preparing the libraries; it seems this approach is not ideal when the cutsite is too common in the genome.

Angel