

Hi Tash,
This is not a problem that should be “fixed”, and changing the size of the smoothing window won’t change how the underlying RAD loci do or do not overlap. If you have two RAD loci that overlap in the genome (such that some nucleotide positions are covered by two separate RAD loci), only one of them can be considered when looking at smoothed statistics, and the populations program will choose one arbitrarily. In the Fst and other smoothed output files, these excess loci are automatically removed. In the sumstats file, which contains smoothed pi, we don’t remove them by default, so that you know they exist. The way I would “fix” this would be to use grep or awk to remove those loci/lines with -1 values before plotting, since they are not part of the smoothed dataset. Be careful not to mistake the p-value column in the same file (which will also be -1 prior to bootstrapping) for the smoothed pi value.
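As a rough sketch of that filtering step (in Python rather than grep or awk), the function below keeps header lines and drops data rows whose smoothed pi is -1. It assumes the sumstats file is tab-separated, that header lines start with "#", and that the smoothed pi column is named "Smoothed Pi"; check the header of your own file before relying on that name. The function name and the sample rows are made up for illustration.

```python
def drop_unsmoothed(lines, column="Smoothed Pi"):
    """Yield header lines and the data rows whose `column` value is not -1."""
    col_idx = None
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            continue  # skip blank lines
        if text.startswith("#"):
            # Header/comment line: look for the column name (assumed, see above).
            names = text.lstrip("# ").split("\t")
            if column in names:
                col_idx = names.index(column)
            yield line
            continue
        if col_idx is None:
            raise ValueError(f"no header line naming column {column!r}")
        # Filter on the named column only, so a -1 in the p-value column
        # (expected before bootstrapping) does not cause a row to be dropped.
        if float(text.split("\t")[col_idx]) != -1.0:
            yield line


# Made-up rows for illustration only; a real file has many more columns.
sample = [
    "# Locus ID\tChr\tBP\tPi\tSmoothed Pi\tSmoothed Pi P-value\n",
    "1\tLG1\t100\t0.2\t0.15\t-1\n",
    "2\tLG1\t120\t0.3\t-1\t-1\n",
    "3\tLG1\t900\t0.0\t0.11\t-1\n",
]
for row in drop_unsmoothed(sample):
    print(row, end="")
```

On a real file you would pass an open file handle instead of the list, e.g. `drop_unsmoothed(open("populations.sumstats.tsv"))`, and write the surviving lines to a new file for plotting.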
Best,
julian
From: stacks...@googlegroups.com <stacks...@googlegroups.com> on behalf of Tash Ramsden <natasha....@gmail.com>
Date: Monday, June 27, 2022 at 8:17 AM
To: Stacks <stacks...@googlegroups.com>
Subject: [stacks] Smoothed-pi values and sigma parameter
Hello everyone,
I'm running populations to calculate nucleotide diversity across the genome and want to generate smoothed stats with bootstrap resampling.
At the moment I am getting lots of values of smoothed pi as -1. I can see in the manual that this "indicates that a particular locus was not included in the smoothing operation (likely because it was overlapped by a separate RAD locus that was included)."
I'm not sure though how to overcome this? My thinking is that I need to adjust the sigma value I am using to change the sliding window size. However I'm not sure what to base this choice on and am struggling to find justifications?
I have tried computing the stats with a much bigger window size (e.g. 1000000, wondering whether there need to be more SNPs per window) and a much smaller one (80000); however, this seems to make no difference and I am still getting -1s for smoothed-pi. Obviously I would like to make an informed decision about what sigma should be, if there are considerations I should be making.
I've also tried adjusting the filtering of my data, to either include all SNPs with fairly loose filtering, or to only retain a single SNP per locus and remove those with MAF<0.05. My thinking is that I should be including more SNPs for calculating stats like pi/Tajima's D since they are calculated on a site/SNP basis...? Either way none of the changes that I've tried have meant that I don't get lots of -1 values.
The genome I'm working with is about 300,000,000 bp long, and I'm getting ~19,500 SNPs in total (~3900 with --write-single-snp and --min-maf 0.05).
Any advice on how I should be choosing a sliding window size would be greatly appreciated. And on how to not get -1s for my smoothed-pi values.
Many thanks in advance,
Tash
I've attached an image of an example of pi across one linkage group with the smoothed values plotted in green (the blue dotted line is just the mean pi):

And another of the same section but with --write-single-snp and --min-maf 0.05:

Hi Tash,
Here, the values of smoothed pi are being influenced by the fixed sites in your data set (as opposed to, say, Fst). That is, the fixed sites in your RAD loci, not including any other sites outside of your RAD loci. If you look at populations.sumstats_summary.tsv you should see the average for Pi taken from the variant sites only as well as an average taken from all sites, which should match what your plot is showing.
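To see why the two averages differ, here is a small arithmetic sketch. A fixed site contributes a pi of 0, so counting fixed sites leaves the sum unchanged but enlarges the denominator. The numbers and the function name are invented for illustration, and this is a plain mean, not the kernel-weighted smoothing that populations applies.

```python
def mean_pi(variant_pi, n_fixed_sites=0):
    """Mean pi over the variant sites plus `n_fixed_sites` fixed sites (pi = 0)."""
    n_sites = len(variant_pi) + n_fixed_sites
    if n_sites == 0:
        raise ValueError("no sites to average over")
    return sum(variant_pi) / n_sites


# Hypothetical values: 3 variant sites within RAD loci holding 300 sites in total.
variant_pi = [0.2, 0.3, 0.1]
print(mean_pi(variant_pi))       # average over variant sites only
print(mean_pi(variant_pi, 297))  # average over all sites in the same loci
```

The second value is two orders of magnitude smaller than the first, which is the kind of gap to expect between the variant-only and all-sites averages.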
