Fixed, variant, and polymorphic sites


Eun

Nov 7, 2018, 7:51:14 PM
to Stacks
Hi, 

I would appreciate it if someone could provide a clarification/explanation of these basic terms (fixed, variant, and polymorphic sites) in the population.sumstats_summary output file. 

I used Stacks v2.0b to analyze my ddRADseq data on 37 samples from 10 populations. I used the de novo pipeline with the following parameter settings: M = n = 3 and m = 3. The minimum MAF and maximum observed heterozygosity were 0.01 and 0.5, respectively. I chose to retain loci that were present in all individuals (p=10 and r=1) due to the small sample size. Further, I kept one random SNP per locus. 

In the population.sumstats_summary file, I got 100,809 sites (variant and fixed), of which only 298 were variant. The number of polymorphic sites varies for each population, but the greatest number is 103 sites. 

While I understand that I'm using conservative settings, the number of variant sites seems very low compared to the total number of sites (only ~0.3%). 

1) By "sites," is Stacks referring to nucleotide positions? (A dumb question, but I wanted to confirm)
2) If a variant site is defined as "nucleotide positions that are polymorphic in at least one population" then a fixed site is monomorphic in all populations. I have a difficult time imagining that so many sites (99.7%) passed all those filtering steps and are fixed in all ten populations. Can this be due to my parameter settings or just the low variability of the study species? 
3) At which point can I start calling these "sites" SNP loci? If I have 298 variant sites, can I describe them as 298 SNP loci? 

Thank you! 

Eun 
 

Catchen, Julian

Nov 15, 2018, 2:39:46 PM
to stacks...@googlegroups.com, Eun
Hi Eun,

Answers inline below.

Eun wrote on 11/7/18 6:51 PM:
> Hi,
>
> I would appreciate it if someone could provide a clarification/explanation of
> these basic terms (fixed, variant, and polymorphic sites) in the
> population.sumstats_summary output file.
>
> I used Stacks v2.0b to analyze my ddRADseq data on 37 samples from 10
> populations. I used the de novo pipeline with the following parameter
> settings: M = n = 3 and m = 3. The minimum MAF and maximum observed
> heterozygosity were 0.01 and 0.5, respectively. I chose to retain loci
> that were present in all individuals (p=10 and r=1) due to the small
> sample size. Further, I kept one random SNP per locus.
>
> In the population.sumstats_summary file, I got 100,809 sites (variant
> and fixed), of which only 298 were variant. The number of polymorphic
> sites varies for each population, but the greatest number is 103 sites.
>
> While I understand that I'm using conservative settings, the number of
> variant sites seems very low compared to the total number of sites (only
> ~0.3%).
>
> 1) By "sites," is Stacks referring to nucleotide positions? (A dumb
> question, but I wanted to confirm)

Yes, a site is a nucleotide position.

> 2) If a variant site is defined as "nucleotide positions that are
> polymorphic in at least one population" then a fixed site is monomorphic
> in all populations.

Yes, that is correct.

> I have a difficult time imagining that so many sites
> (99.7%) passed all those filtering steps and are fixed in all ten
> populations. Can this be due to my parameter settings or just the low
> variability of the study species?

It could be due to either factor. Before you can rule out parameter
settings, you need to optimize your parameters (see Rochette 2017 if you
haven't already).

Otherwise, before you conclude that there is a very low level of
polymorphism in your species, you should check the basics of your
analysis. What was the depth of coverage for each individual sample? Are
most of your variant sites shared across your populations, or are they
particular to one or two individuals or a single population?

If your analysis was solid, and you explored your parameters, and you
still see low polymorphism, then I would suggest it is real.
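
One way to check whether variant sites are shared or private can be sketched in Python. This is only an illustration: the record layout (site ID, population ID, frequency of the most common allele) and the function names `populations_per_variant_site` and `sharing_histogram` are assumptions made for the example, not the format of any Stacks output file. A site is treated as polymorphic in a population when that frequency is below 1.

```python
from collections import Counter, defaultdict


def populations_per_variant_site(records):
    """Map each variant site to the set of populations where it is polymorphic.

    `records` is an iterable of (site_id, pop_id, major_allele_freq) tuples.
    This layout is hypothetical; adapt it to however you parse your own data.
    """
    pops_by_site = defaultdict(set)
    for site_id, pop_id, major_allele_freq in records:
        # Assumption: a site is polymorphic in a population when the most
        # common allele is not at frequency 1.0 there.
        if major_allele_freq < 1.0:
            pops_by_site[site_id].add(pop_id)
    return dict(pops_by_site)


def sharing_histogram(pops_by_site):
    """Count how many variant sites are polymorphic in 1, 2, 3, ... populations."""
    return Counter(len(pops) for pops in pops_by_site.values())


# Made-up toy records, for illustration only.
toy_records = [
    ("site_1", "popA", 0.75), ("site_1", "popB", 0.50),
    ("site_2", "popA", 1.00), ("site_2", "popB", 0.90),
    ("site_3", "popA", 1.00), ("site_3", "popB", 1.00),
]
shared = populations_per_variant_site(toy_records)
print(sharing_histogram(shared))
```

A histogram dominated by sites that are polymorphic in a single population would point toward private variation, whereas sites polymorphic in many populations would point toward shared variation.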

> 3) At which point can I start calling these "sites" SNP loci? If I
> have 298 variant sites, can I describe them as 298 SNP loci?

It is a matter of opinion, but if the SNP calling model has called a
polymorphic site, then it is a SNP. The gstacks model takes into account
all populations when it makes a call.

>
> Thank you!
>
> Eun

julian

CaffeSospeso

Nov 19, 2018, 11:47:14 AM
to Stacks
Hi Julian and Eun,

I wanted to quickly follow up on this question.

In previous versions of Stacks, the count_fixed_catalog_snps.py script was used to extract the number of variable and fixed sites. This script (as far as I'm aware) has not yet been updated for the new Stacks version.

So, I would say that we can use the information written in the populations.sumstats_summary.tsv file to do the same thing that the count_fixed_catalog_snps.py script was doing. That is, obtain the number of "Fixed Sites" by subtracting the number of "Variant sites" from the total number of "Sites".
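
That subtraction can be sketched in a few lines of Python, using the counts Eun reported earlier in the thread (100,809 sites in total, 298 of them variant). The function name `fixed_site_summary` is made up for this illustration, and the numbers are typed in by hand rather than parsed from the file.

```python
def fixed_site_summary(total_sites, variant_sites):
    """Return (fixed_sites, percent_variant, percent_fixed)."""
    if total_sites <= 0 or not 0 <= variant_sites <= total_sites:
        raise ValueError("expected total_sites > 0 and 0 <= variant_sites <= total_sites")
    # Fixed sites are whatever remains after removing the variant sites.
    fixed_sites = total_sites - variant_sites
    percent_variant = 100.0 * variant_sites / total_sites
    percent_fixed = 100.0 * fixed_sites / total_sites
    return fixed_sites, percent_variant, percent_fixed


# Counts taken from Eun's message; not read from any Stacks output here.
fixed, pct_variant, pct_fixed = fixed_site_summary(100_809, 298)
print(f"fixed sites: {fixed}")
print(f"variant: {pct_variant:.2f}%  fixed: {pct_fixed:.2f}%")
```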

Again, this may be very obvious, but I wanted to be sure.

Thank you in advance.

Gabriele