population Fst file


Giacomo Bernardi

Dec 11, 2013, 3:15:43 PM
to stacks...@googlegroups.com
Dear all,
I have a set of 71 individuals divided into two populations. When I run a structure script, Stacks identifies 1910 SNPs that are used to calculate population parameters. A structure file is generated; it runs fine and has 71 individuals and 1910 loci. The Fst file, however, contains far fewer loci (just over 500). Any ideas on the reasons for the discrepancy? I can generate a similar file by calculating Fst values with Arlequin and finding the identity of each locus on the first line of the structure file, but I'd rather understand what is wrong.
Thanks

Julian Catchen

Dec 11, 2013, 11:49:52 PM
to stacks...@googlegroups.com, bern...@ucsc.edu
Hi Giacomo,

The batch_X.sumstats.tsv contains all the SNPs called for each of the
two populations. For one of these SNPs to be in the Fst file it has to
exist in both populations, but it can exist in either or both
populations and will still be output in the structure file. You should
identify some loci in the structure file missing from the Fst file and
take a look at them in the sumstats file or the web interface. It should
be clear if they should be output in both files.

Best,

julian
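
The comparison Julian suggests can be sketched as a set difference over locus IDs. This is only a minimal sketch, not part of Stacks: it assumes both files are tab-separated, that lines starting with `#` are headers or comments, and that the locus ID sits in the second column (`locus_col=1`), so check those assumptions against your own files. The function name and the sample rows are made up for illustration.

```python
import csv
import io

def locus_ids(tsv_text, locus_col=1):
    """Return the set of values found in one column of tab-separated text.

    Assumption: lines beginning with '#' are headers/comments and are skipped.
    """
    ids = set()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        ids.add(row[locus_col])
    return ids

# Made-up rows standing in for the sumstats and Fst files.
sumstats_text = "# header\n1\t10\tpopA\n1\t10\tpopB\n1\t11\tpopA\n1\t12\tpopB\n"
fst_text = "# header\n1\t10\n"

# Loci that appear in the sumstats rows but not in the Fst rows.
missing_from_fst = sorted(locus_ids(sumstats_text) - locus_ids(fst_text))
print(missing_from_fst)
```

The loci it lists are the ones worth looking up in the sumstats file or the web interface.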

Giacomo Bernardi

Dec 12, 2013, 12:05:07 AM
to stacks...@googlegroups.com, bern...@ucsc.edu, jcat...@uoregon.edu
Dear Julian,
thanks for the reply. Two remarks about this. 1. Values of Fis (not Fst) are given, right? 2. If you ask the structure script to use only the first SNP (as in --write_single_snp), is it possible to get only that one SNP in the resulting file? Right now all SNPs are recorded. Your explanation at least accounts for the discrepancy, which completely baffled me.

Thanks again

Giacomo

Julian Catchen

Dec 12, 2013, 12:11:51 AM
to stacks...@googlegroups.com, bern...@ucsc.edu
Hi Giacomo,

The batch_X.sumstats.tsv file does include Fis values:

http://creskolab.uoregon.edu/stacks/manual/#pfiles

And yes, if you specify --write_single_snp that should be maintained in all output files (but it is always the first SNP that is output).

julian
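
To picture what keeping only the first SNP per locus means, here is a small sketch. It is not how Stacks implements `--write_single_snp`; it assumes "first" means the SNP with the lowest column position within a locus, and the `(locus_id, column)` tuples are invented sample data.

```python
def first_snp_per_locus(snps):
    """Keep, for each locus ID, the SNP with the smallest column position."""
    first = {}
    for locus_id, column in snps:
        if locus_id not in first or column < first[locus_id]:
            first[locus_id] = column
    return sorted(first.items())

# Hypothetical (locus_id, column) pairs; loci 10 and 12 each carry two SNPs.
snps = [(10, 42), (10, 7), (11, 15), (12, 3), (12, 60)]
print(first_snp_per_locus(snps))
```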


Julian Catchen

Dec 12, 2013, 12:32:24 AM
to Giacomo Bernardi, stacks...@googlegroups.com
Hi Giacomo,

I'm not sure what you mean by "recover the values of Fst." Can you be more specific about what you want to do? The populations program should provide both values (Fis and Fst) for every SNP in the dataset, in the batch_X.sumstats.tsv and batch_X.fst_1-2.tsv files, respectively. You can match up loci between the two files based on the locus ID.

julian
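
Matching the two files on locus ID can be sketched as a dictionary lookup. The records below are simplified, invented stand-ins for rows of batch_X.sumstats.tsv and batch_X.fst_1-2.tsv (the real files have more columns), and the function name is hypothetical. The sketch also assumes one SNP per locus; with several SNPs per locus the SNP position would need to be part of the key.

```python
def join_by_locus(fis_rows, fst_rows):
    """Pair per-population Fis values with the Fst value for the same locus ID."""
    fst_by_locus = {locus_id: fst for locus_id, fst in fst_rows}
    joined = {}
    for locus_id, pop_id, fis in fis_rows:
        entry = joined.setdefault(
            locus_id, {"fis": {}, "fst": fst_by_locus.get(locus_id)}
        )
        entry["fis"][pop_id] = fis
    return joined

# Hypothetical records: (locus_id, pop_id, Fis) and (locus_id, Fst).
fis_rows = [(10, 1, 0.05), (10, 2, -0.02), (11, 1, 0.10)]
fst_rows = [(10, 0.31)]

joined = join_by_locus(fis_rows, fst_rows)
print(joined[10])  # locus seen in both populations: Fis per population plus Fst
print(joined[11])  # locus seen in one population only: Fst is None
```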

Giacomo Bernardi wrote:
Dear Julian,
thank you for the clarification. This is what I meant: Fis is given, but there is no way to recover the values of Fst, only Fis, correct?

Thanks 
Giacomo



Giacomo Bernardi

Dec 12, 2013, 12:42:52 AM
to stacks...@googlegroups.com, Giacomo Bernardi, jcat...@uoregon.edu
Sorry for being unclear, Julian. I meant that batch_X.sumstats.tsv gives a column for Fis but no column for Fst, while batch_X.fst_1-2.tsv gives a column for Fst but only for the shared SNPs (if I understand correctly). Hmmm, as I am writing this I finally understand the logic behind how those files are made. Shared SNPs get Fst values and non-shared ones only get Fis. Wow, sorry for being so dense.
But I have some sort of excuse: I was under the impression that the loci used to estimate population structure were only those shared among populations, and therefore that the number given at the first output (in my case 1910), and also used in the structure file, was also going to be the number seen in the batch_X.fst_1-2.tsv file. Clearly this is not the case.

Thanks very much

Giacomo

Giacomo Bernardi

Dec 12, 2013, 12:58:16 AM
to stacks...@googlegroups.com
Julian,
Bingo, now it is working.  Thank you very much!!!!

Giacomo

Julian Catchen

Dec 12, 2013, 1:55:33 AM
to stacks...@googlegroups.com
Hi Giacomo,

Just to add a few last details: the Structure output is produced without respect to any specific population provided in the population map. So, as long as one or more individuals in the overall set of data have genotypes, the locus will be output. We also output the population number in the Structure file for each individual, which allows Structure itself to make decisions about which loci to exclude.

Anyway, a typical analysis will include a number of populations, and in this case a small number of different loci will be missing in each different population, so it gets harder to decide to only use loci shared among populations. Of course, you can use the filters for the populations program to require loci to be in all populations. You can also use a population map that only includes a subset of individuals to get data output for only that subset (say for Structure).

For Fst, each population you add to the analysis will give you another set of Fst files covering all pairwise comparisons between the populations, and only loci present in both populations of a pair will appear in that specific paired comparison.
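
The difference between the two outputs comes down to a union versus pairwise intersections. The toy sketch below illustrates that logic only; the population IDs and locus IDs are invented, the function names are hypothetical, and the real populations program applies further filters.

```python
from itertools import combinations

def structure_loci(loci_by_pop):
    """Loci genotyped in at least one population (union)."""
    return set().union(*loci_by_pop.values())

def pairwise_fst_loci(loci_by_pop):
    """For each pair of populations, the loci present in both (intersection)."""
    return {
        (a, b): loci_by_pop[a] & loci_by_pop[b]
        for a, b in combinations(sorted(loci_by_pop), 2)
    }

# Hypothetical data: population ID -> set of locus IDs with called SNPs.
loci_by_pop = {1: {10, 11, 12}, 2: {10, 12, 13}, 3: {12, 13, 14}}

print(len(structure_loci(loci_by_pop)))
for pair, shared in pairwise_fst_loci(loci_by_pop).items():
    print(pair, sorted(shared))
```

With N populations there are N(N-1)/2 pairwise comparisons, so three populations give three sets of shared loci, each of which can be smaller than the union used for Structure.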