Same chromosome position found in multiple locus IDs - populations.snps.vcf

89 views
Skip to first unread message

Ethan Baldwin

unread,
Feb 17, 2025, 6:14:14 PMFeb 17
to Stacks
Hi all,

I am analyzing some GBS data (two enzyme, similar to ddRAD) using stacks in reference mode. Looking at the VCF file, there are sometimes multiple Locus IDs sharing genomic coordinates. For example:

grep -v ^# populations.snps.vcf | grep 2849996 | cut -f1-5

Chr01   2849996 2844:49:-       T       A
Chr01   2849996 2852:53:-       T       A
Chr01   2849996 2854:54:-       T       A
Chr01   2849996 2855:55:-       A       T
Chr01   2849996 2856:56:-       A       T

I'm not showing the genotypes here, but they are wildly different per sample! Looking at a genome browser of the reads mapped to this area, I don't see why stacks thinks they are different loci, unless maybe soft clipping around the beginning of reads could make it think they start at different locations?

Command for mapping:

bwa mem -t 12 $reference ${sample}.1.fq.gz ${sample}.2.fq.gz | samtools view -O BAM | samtools sort --threads 12 -o $sample.psitt.bam

Commands for stacks:

gstacks -I $OUTDIR/bams -M $POPMAP_ALL -t 16 --details -O $GSTACKS_DIR

populations -P $GSTACKS_DIR --popmap $POPMAP_CROSS -t 16 -O $OUTDIR/populations_merged --vcf

Any help would be appreciated!

Thanks,
Ethan

Angel Rivera-Colón

unread,
Feb 19, 2025, 7:55:45 PMFeb 19
to Stacks
Hi Ethan,

This behavior happens because there are multiple catalog loci overlapping that position in the reference. You can see this in the ID field (3rd column in the VCF), which contains the Stacks catalog ID and variant site column. As recommended in section 7.2.2 of the manual (link), we advise to use the --ordered-export flag when exporting a VCF in populations when using reference-aligned data. Among other sorting things, it will export a given genomic coordinate only once. Also, note that you are not applying any filters to your catalog in populations, only exporting the VCF. I suspect that you are seeing that many loci at the site since many of them are spurious "junk" loci present in a single sample. In my experience, applying even baseline missing data filters would remove a large portion of these. Whatever true overlaps remain would then be handled by the --ordered-export.

Thanks,
Angel

kwojt...@gmail.com

unread,
Mar 5, 2025, 12:27:14 PMMar 5
to Stacks
Hi Angel,

I ran into the same thing that Ethan did and took your advice and re-ran populations with ordered-export flag, however when compared with the original vcf (the non ordered-export one) it looks like populations kept the very first instance of each genomic coordinate it came across which in my case meant keeping the spurious junk loci with high amounts of missingness. It is obvious when you compare the 2 vcf's which SNPs sharing the same coordinates are spurious junk and which are not and I am losing the SNPs that are real and that are genotyped across most individuals. Is there a way to tell populations to use ordered-export and only keep the locus that has the least missing data?

Thank you,
Kris

Angel Rivera-Colón

unread,
Mar 6, 2025, 6:31:53 PMMar 6
to Stacks
Hi Kris,

Could you confirm the populations command you used? Did you apply any missing data filters (e.g., -R, -r, -p) when doing this comparison? In theory, the junk loci should be removed first by these filters. The overlap is then checked for the remaining sites, keeping the first one present in the catalog as the representative. In my experience, this should properly handle cases like these.

Thanks,
Angel

kwojt...@gmail.com

unread,
Mar 7, 2025, 9:28:13 AMMar 7
to Stacks
Hi Angel,

My populations command is below, and my problem is almost certainly because I didn't use any filters. The data I'm working with is from amplicons, not RAD-Seq, and others have recommended using stacks to call genotypes instead of a custom pipeline we were provided. Since we have a list of target SNPs my plan was to not use any filters until after narrowing the data down to only the target positions so that I could look at capture rates and how they vary with different filtering strategies, and that's where I'm running into the issues above. Looking at the data it looks like the junk loci have incredibly high missingness (>90% in almost every case). So I could apply a missingness filter in the populations and set it high to get rid of those loci first and that should do it. 

Thank you for your help.

populations \
-P ~/gstacks/Panel_1 \
-O ~/populations/Panel_1 \
-M ~/scripts/Panel_1/Pop_Map.tsv \
--vcf

-Kris

Angel Rivera-Colón

unread,
Mar 8, 2025, 3:53:52 PMMar 8
to Stacks
Hi Kris,

Yes, there aren't any filters being applied, so populations is exporting the "raw" catalog, which is always very messy. The catalog contains all the data from all samples, including many of those spurious junk loci that are always seen in a single sample at low coverage. My advice is to always apply some missing data filter in populations, even if very lenient. Like you mentioned, you could try something like -R 10 or something along those lines. That should remove most of those junk loci.

Angel
Reply all
Reply to author
Forward
0 new messages