Hi all,
I am trying to genotype many thousands of samples and running into some memory issues (even after breaking everything up by chromosome). To resolve this, I run gstacks with everyone, but break up my runs of populations by batches of 500 (randomly selected without replacement from popmap file).
I want everything to be genotyped together so as not to run into the ref/alt allele flipping issues, which seem like they shouldn't occur given that gstacks is using a reference assembly, but it does. Even when doing this, and running it in reference mode, populations runs using these batched lists does not produce consistent ref/alt allele assignments.
So, to try to further resolve this, I run bcftools +fixref to try to reassign to ref/alt alleles using the reference assembly that they were aligned to and genotyped with. However, even still, ~20% of my loci remain unresolved. Is there a way around this in gstacks or populations? Also, can someone explain to me why this ref/alt allele behavior persists even in reference mode?
For example:
bcftools +fixref -Oz -o ${OTHDIR}/populations.vcf.gz ${OTHDIR}/populations.snps.vcf.gz -- \
-d -f ${REFDIR}/${REF}.fna -m flip
Output looks like the following:
Writing to /tmp/bcftools.ewPhf8
Merging 1 temporary files
Done
Cleaning
# SC, guessed strand convention
SC TOP-compatible 0
SC BOT-compatible 0
# ST, substitution types
ST A>C 4667 7.7%
ST A>G 5535 9.1%
ST A>T 3989 6.5%
ST C>A 6114 10.0%
ST C>G 2218 3.6%
ST C>T 8000 13.1%
ST G>A 8147 13.4%
ST G>C 2137 3.5%
ST G>T 6078 10.0%
ST T>A 3951 6.5%
ST T>C 5629 9.2%
ST T>G 4518 7.4%
# NS, Number of sites:
NS total 60983
NS ref match 48488 79.5%
NS ref mismatch 12495 20.5%
NS flipped 35 0.1%
NS swapped 9995 16.4%
NS flip+swap 37 0.1%
NS unresolved 12295 20.2%
NS fixed pos 0 0.0%
NS errors 0
NS skipped 0
NS non-ACGT 0
NS non-SNP 0
NS non-biallelic 0
Thanks!