Setting the reference and alternative alleles when using gstacks/populations

14 views
Skip to first unread message

akfra...@gmail.com

unread,
Sep 29, 2025, 9:29:17 AMSep 29
to Stacks
Hi all,

I am trying to genotype many thousands of samples and running into some memory issues (even after breaking everything up by chromosome). To resolve this, I run gstacks with everyone, but break up my runs of populations by batches of 500 (randomly selected without replacement from popmap file). 

I want everything to be genotyped together so as not to run into the ref/alt allele flipping issues, which seem like they shouldn't occur given that gstacks is using a reference assembly, but it does. Even when doing this, and running it in reference mode, populations runs using these batched lists does not produce consistent ref/alt allele assignments. 

So, to try to further resolve this, I run bcftools +fixref to try to reassign to ref/alt alleles using the reference assembly that they were aligned to and genotyped with. However, even still, ~20% of my loci remain unresolved. Is there a way around this in gstacks or populations? Also, can someone explain to me why this ref/alt allele behavior persists even in reference mode?

For example:

bcftools +fixref -Oz -o ${OTHDIR}/populations.vcf.gz ${OTHDIR}/populations.snps.vcf.gz -- \
-d -f ${REFDIR}/${REF}.fna -m flip

Output looks like the following:

Writing to /tmp/bcftools.ewPhf8
Merging 1 temporary files
Done

Cleaning
# SC, guessed strand convention
SC      TOP-compatible  0
SC      BOT-compatible  0
# ST, substitution types
ST      A>C     4667    7.7%
ST      A>G     5535    9.1%
ST      A>T     3989    6.5%
ST      C>A     6114    10.0%
ST      C>G     2218    3.6%
ST      C>T     8000    13.1%
ST      G>A     8147    13.4%
ST      G>C     2137    3.5%
ST      G>T     6078    10.0%
ST      T>A     3951    6.5%
ST      T>C     5629    9.2%
ST      T>G     4518    7.4%
# NS, Number of sites:
NS      total           60983
NS      ref match       48488   79.5%
NS      ref mismatch    12495   20.5%
NS      flipped         35      0.1%
NS      swapped         9995    16.4%
NS      flip+swap       37      0.1%
NS      unresolved      12295   20.2%
NS      fixed pos       0       0.0%
NS      errors          0
NS      skipped         0
NS      non-ACGT        0
NS      non-SNP         0
NS      non-biallelic   0


Thanks!

Catchen, Julian

unread,
Sep 29, 2025, 9:32:51 AMSep 29
to stacks...@googlegroups.com

Hi,

 

The gstacks program does not use the reference genome to call SNPs and their associated alleles, it only uses it to place the aligned reads. Since Stacks is designed around non-model organisms, it is never assumes that the “reference” allele has any special value other than it was one of two alleles from the single individual that was used to sequence the reference genome. Instead, gstacks only uses the aligned data to call SNPs and set the ref/alt alleles.

Reply all
Reply to author
Forward
0 new messages