Smoothed-pi values and sigma parameter

132 views
Skip to first unread message

Tash Ramsden

unread,
Jun 27, 2022, 9:17:49 AM6/27/22
to Stacks
Hello everyone,

I'm running populations to calculate nucleotide diversity across the genome and want to generate smoothed stats with bootstrap resampling. 

At the moment I am getting lots of values of smoothed pi as -1. I can see in the manual that this "indicates that a particular locus was not included in the smoothing operation (likely because it was overlapped by a separate RAD locus that was included)."

I'm not sure though how to overcome this? My thinking is that I need to adjust the sigma value I am using to change the sliding window size. However I'm not sure what to base this choice on and am struggling to find justifications? 

I have tried computing the stats with a much bigger window size (eg 1000000 wondering whether there need to be more SNPs per window) and much smaller (80000 ) however this seems to make no difference and I am still getting -1s for smoothed-pi; and obviously I would like to make an informed decision about what sigma should be if there are considerations I should be making. 

I've also tried adjusting the filtering of my data, to either include all SNPs with fairly loose filtering, or to only retain a single SNP per locus and remove those with MAF<0.05. My thinking is that I should be including more SNPs for calculating stats like pi/Tajima's D since they are calculated on a site/SNP basis...? Either way none of the changes that I've tried have meant that I don't get lots of -1 values.

The genome I'm working with is about 300,000,000 BP long, I'm getting ~19,500 SNPs in total (~3900 with --write-single-snp and --min-maf 0.05). 

Any advice on how I should be choosing a sliding window size would be greatly appreciated. And on how to not get -1s for my smoothed-pi values.

Many thanks in advance,
Tash


I've attached an image of an example of pi across one linkage group with the smoothed values plotted in green (the blue dotted line is just the mean pi):
smoothed-1_all_snps.png
And another of the same section but with --write-single-snp and --min-maf 0.05:
smoothed-1.png

eg of populations call:

populations \
--in-path $path \
--popmap $popmap \
--threads 5 \
--min-samples-per-pop 0.5 \
--smooth \
--bootstrap-archive \
--sigma 1000000 `#big sigma!` \
--vcf \
--out-path ../data/populations_all_snps/rud_h_smooth_big_sigma

and then again with --bootstrap  and --bootstrap-reps 1000

Catchen, Julian

unread,
Jun 28, 2022, 4:51:35 PM6/28/22
to stacks...@googlegroups.com

Hi Tash,

 

This is not a problem that should be “fixed” and changing the size of the smoothing window won’t change how the underlying RAD loci do or do not overlap. If you have two RAD loci that overlap in the genome (such that some nucleotide positions are covered by two, separate RAD loci), only one of them can be considered when looking at smoothed statistics – and the populations program will choose one arbitrarily. In the Fst and other smoothed output files, these excess loci are automatically removed. In the sumstats file, which contains smoothed pi, we don’t remove them by default, so that you know they exist. The way I would “fix” this would be to use grep or awk to remove those loci/lines with -1 values before plotting, since they are not part of the smoothed dataset. Be careful that you are not misinterpreting the p-value column (which will also be -1 prior to bootstrapping) in the same file as the smoothed pi value.

 

Best,

 

julian

 

 

 

From: stacks...@googlegroups.com <stacks...@googlegroups.com> on behalf of Tash Ramsden <natasha....@gmail.com>
Date: Monday, June 27, 2022 at 8:17 AM
To: Stacks <stacks...@googlegroups.com>
Subject: [stacks] Smoothed-pi values and sigma parameter

Hello everyone,

 

I'm running populations to calculate nucleotide diversity across the genome and want to generate smoothed stats with bootstrap resampling. 

 

At the moment I am getting lots of values of smoothed pi as -1. I can see in the manual that this "indicates that a particular locus was not included in the smoothing operation (likely because it was overlapped by a separate RAD locus that was included)."

 

I'm not sure though how to overcome this? My thinking is that I need to adjust the sigma value I am using to change the sliding window size. However I'm not sure what to base this choice on and am struggling to find justifications? 

 

I have tried computing the stats with a much bigger window size (eg 1000000 wondering whether there need to be more SNPs per window) and much smaller (80000 ) however this seems to make no difference and I am still getting -1s for smoothed-pi; and obviously I would like to make an informed decision about what sigma should be if there are considerations I should be making. 

 

I've also tried adjusting the filtering of my data, to either include all SNPs with fairly loose filtering, or to only retain a single SNP per locus and remove those with MAF<0.05. My thinking is that I should be including more SNPs for calculating stats like pi/Tajima's D since they are calculated on a site/SNP basis...? Either way none of the changes that I've tried have meant that I don't get lots of -1 values.

 

The genome I'm working with is about 300,000,000 BP long, I'm getting ~19,500 SNPs in total (~3900 with --write-single-snp and --min-maf 0.05). 

 

Any advice on how I should be choosing a sliding window size would be greatly appreciated. And on how to not get -1s for my smoothed-pi values.

 

Many thanks in advance,

Tash

 

 

I've attached an image of an example of pi across one linkage group with the smoothed values plotted in green (the blue dotted line is just the mean pi):

And another of the same section but with --write-single-snp and --min-maf 0.05:

Tash Ramsden

unread,
Jun 29, 2022, 6:10:17 AM6/29/22
to Stacks
Hi Julian,

Thank you so much for you help, that makes a lot more sense now - I've seen the error of my ways!

The values for the smoothed-pi are still all close to zero though which doesn't entirely seem as though it is summarising the individual values correctly? (It's definitely not the p-values though). Do you know why this could be/whether this is an error or my mis-interpretation!? I would have imagined that even if the populations programme was managing to arbitrarily select only the loci with values close to zero, that the smoothed values would be dragged up by the pi values from surrounding loci? Any advice on this would be wonderful!

I've attached another plot with the -1s excluded (again it shows one linkage group, the blue dotted line is the mean, the green line is smoothed pi, and the blue and red stars/blobs are p values < 0.05/0.01 respectively).
e.g. the highest value of smoothed-pi in this linkage group is 0.00388 which is being assigned at a locus with pi=0.49196.

Thank you again for getting back to me and for your help, 
Tash

without-1.png

Catchen, Julian

unread,
Jun 30, 2022, 2:14:11 PM6/30/22
to stacks...@googlegroups.com

Hi Tash,

 

Here, the values of smoothed pi are being influenced by the fixed sites in your data set (as opposed to say, Fst). That is, the fixed sites in your RAD loci, not including any other sites outside of your RAD loci. If you look at populations.sumstats_summary.tsv you should see the average for Pi taken from the variant sites only as well as an average taken from all sites, which should match what your plot if showing.

Image removed by sender.

And another of the same section but with --write-single-snp and --min-maf 0.05:

Image removed by sender.

Tash Ramsden

unread,
Jun 30, 2022, 2:26:40 PM6/30/22
to Stacks
Hi Julian,

That makes perfect sense thank you so much for clearing that up!

So actually for plotting this I should be focusing on the smoothed pi values rather than the specific pi values at the variant sites (like in the (rough - first attempt) plot below). The clouds are clearing!

And so I think I'm also right in thinking then that I should be including all the SNPs in my dataset rather than just, for example, a single SNP per locus (I guess that in this case the SNPs not included would be counted as non-variant, which would be incorrect)?

Thank you so much for all the clarifications, I'm very grateful!
Tash

only_smoothed.png

Reply all
Reply to author
Forward
0 new messages