negative Fst values?

Kelly

Apr 24, 2014, 11:20:57 AM

to stacks...@googlegroups.com

Hello Stacks group,

I've run into some strange outputs from the populations program. I'm using Stacks 1.06, as this is the version installed on our computing cluster. I have two populations that I have assembled de novo. I used the default settings for denovo_map.pl since this was a first-pass, preliminary analysis. After de novo assembly, I ran the populations program to get descriptive statistics on my two populations: populations -b 1 -P /path/to/denovo_output -p 2 -r 0.75. I'm mostly interested in Fst, and a large number of my Fst values are reported as negative. The confidence intervals for the Fst estimate are positive, do not contain the reported Fst, and often extend above 1.

Fst: -0.0.25

Fisher's P: 0.00046

CI lower: 1.84

CI upper: 11.83

I observe negative Fst values for 4822 of the 8199 loci found in both populations. In contrast, the AMOVA Fst values are all positive, between 0 and 1. I've been looking at the Fst calculation in the Stacks supplemental materials to gain some intuition for when the calculation may fail. The only thing that comes to mind would be the presence of more than two alleles in the population, which I think is unlikely, especially for so many loci (and even then I'm not 100% sure it would cause this problem). Does anyone have intuition for what might be causing this? Could I have introduced some problems in data quality upstream by not taking more care to specify some additional denovo_map.pl parameters?

Thanks!

Kelly

Julian Catchen

Apr 24, 2014, 2:51:38 PM

to stacks...@googlegroups.com, kellyan...@gmail.com

Hi Kelly,

Almost all Fst calculations are susceptible to negative values in
certain instances. However, our original Fst implementation is more
susceptible than other methods, typically when you have extreme
differences in sample sizes. For this reason, we implemented the AMOVA
Fst and our smoothing/bootstrapping algorithms now rely on this method
which is much less likely to give negative values. In other words, we
consider the AMOVA Fst to be the best measure to use and have kept the
previous implementation just for historical purposes.
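To see how a negative value can arise, here is a minimal sketch of a
pi-based Fst of the form Fst = 1 - sum_j(C(n_j,2) * pi_j) / (pi_all *
sum_j C(n_j,2)). That this matches the original Stacks calculation is an
assumption based on my reading of the supplemental formula, not a copy
of the Stacks source. The allele counts are hypothetical, chosen to
mirror 33 and 70 diploid individuals. When the two populations have
identical allele frequencies, the bias-corrected within-population
diversity slightly exceeds the pooled diversity, so the estimate dips
below zero.

```python
from math import comb


def pi(allele_counts):
    """Nucleotide diversity at one site: 1 - sum_i C(n_i, 2) / C(n, 2)."""
    n = sum(allele_counts)
    return 1 - sum(comb(c, 2) for c in allele_counts) / comb(n, 2)


def fst(pop_counts):
    """Pi-based Fst for one biallelic site (assumed form, see text above).

    pop_counts holds one (allele1, allele2) count tuple per population.
    """
    pooled = [sum(col) for col in zip(*pop_counts)]
    # Each population is weighted by its number of allele pairs, C(n_j, 2).
    weights = [comb(sum(counts), 2) for counts in pop_counts]
    within = sum(w * pi(c) for w, c in zip(weights, pop_counts))
    return 1 - within / (pi(pooled) * sum(weights))


# Hypothetical counts: 66 alleles in pop1, 140 alleles in pop2.
same_freq = fst([(33, 33), (70, 70)])  # identical allele frequencies
diff_freq = fst([(60, 6), (70, 70)])   # clearly different frequencies

print(f"identical frequencies: Fst = {same_freq:.5f}")
print(f"different frequencies: Fst = {diff_freq:.5f}")
```

With identical frequencies the result is slightly negative; with the
second set of counts it is positive. A small negative value is therefore
best read as "no detectable differentiation" rather than as an error.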

Just for the record, we do not calculate SNP-based Fst values for loci
that have more than two alleles present (in fact, we do not calculate any
SNP-based summary statistics for these loci; they are filtered out).

The p-value, odds ratio and confidence limits are calculated from a 2x2
contingency table of allele counts; they are not calculated from the Fst
value:

     | allele1 | allele2
-----+---------+--------
pop1 | Cnt1    | Cnt2
pop2 | Cnt3    | Cnt4

We use Fisher's exact test with the null hypothesis that the allele
counts in the two populations being compared are the same. A small
p-value indicates that the allele counts are not the same, and the odds
ratio (i.e., the effect size) gives an indication of how different they
are, with the confidence limits around that measure. Because these
limits bracket the odds ratio rather than Fst, they can exceed 1.
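As an illustration of the test described above, here is a minimal
pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table of
allele counts. The counts are hypothetical, and the interval uses the
Woolf (log-scale) approximation, which is an assumption on my part;
Stacks may compute its confidence limits differently.

```python
from math import comb, exp, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x copies of allele1 in pop1,
        # given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with an approximate 95% Woolf interval.

    All four cells must be non-zero; a zero cell makes the ratio undefined.
    """
    odds = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds, exp(log(odds) - z * se), exp(log(odds) + z * se)


# Hypothetical allele counts: rows are pop1 and pop2, columns allele1 and allele2.
cnt1, cnt2, cnt3, cnt4 = 50, 16, 70, 70

p_value = fisher_exact_two_sided(cnt1, cnt2, cnt3, cnt4)
odds, ci_lower, ci_upper = odds_ratio_ci(cnt1, cnt2, cnt3, cnt4)

print(f"Fisher's P: {p_value:.5f}")
print(f"odds ratio: {odds:.3f}  CI: {ci_lower:.2f} to {ci_upper:.2f}")
```

The interval here is on the odds-ratio scale, which has no upper bound
of 1, so an interval lying entirely above 1 is consistent with a
significant difference in allele counts.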

If you want a p-value specifically for your Fst measure, you should use
the bootstrapping feature in the populations program.

You might also consider using haplotype measures of Fst (Phi_st, Fst')
which have recently been implemented in Stacks. These calculations
consider each RAD locus as a haplotype and each haplotype may have one
or more SNPs in it. We have had a good experience with these measures in
our most recent work.

Best,

julian

Kelly

Apr 24, 2014, 3:14:11 PM

to stacks...@googlegroups.com, kellyan...@gmail.com, jcat...@uoregon.edu

Hi Julian,

Thanks very much for getting back to me so quickly and for clarifying the Fst calculation and its interpretation. I do have a substantial difference in population sizes (33 for population 1 and 70 for population 2), and I see now how that is problematic with the original Fst calculation. I'll focus then on the other metrics Stacks calculates that you suggested (AMOVA, etc.).