Hi Jason, Kevin, Tomas and all,
When we talk about the two alternative flags to the populations
program, --write_single_snp and --write_random_snp, we are discussing
cases where there is more than one SNP at a single RAD locus. In those
cases the flags cause the program either to choose the first usable
SNP or to pick one usable SNP at random from that locus.
Kevin is correct in saying that --write_single_snp is more repeatable:
different SNPs in a single RAD locus can have different allele
frequencies, depending on the age and background of the SNP, so which
SNP gets chosen will alter the summary statistics.
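To make the repeatability point concrete, here is a minimal Python
sketch of the two selection strategies. It is not how populations is
implemented; the locus names, SNP positions, and function names are all
made up for illustration.

```python
import random

# Hypothetical toy data: locus ID -> positions of the usable SNPs at that locus.
loci = {
    "locus_1": [12, 47, 80],
    "locus_2": [5],
    "locus_3": [33, 61],
}

def pick_first_snp(loci):
    """Mimic the idea behind --write_single_snp: always take the first usable SNP."""
    return {locus: snps[0] for locus, snps in loci.items()}

def pick_random_snp(loci, rng):
    """Mimic the idea behind --write_random_snp: take one usable SNP at random."""
    return {locus: rng.choice(snps) for locus, snps in loci.items()}

# Taking the first SNP gives the same answer on every run.
first_a = pick_first_snp(loci)
first_b = pick_first_snp(loci)
print("first SNP, run 1:", first_a)
print("first SNP, run 2:", first_b)

# Random choice can differ between runs unless the random seed is fixed.
random_runs = [pick_random_snp(loci, random.Random(seed)) for seed in range(20)]
distinct = {tuple(sorted(run.items())) for run in random_runs}
print("distinct random selections over 20 runs:", len(distinct))
```

Loci with only one SNP (locus_2 here) are unaffected by either flag;
the choice only matters where a locus carries several SNPs.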
This is only relevant if you want a single SNP per locus. For most
applications, such as an Fst analysis, you do want to use all the SNPs,
and you get an even stronger signal if you instead use the
haplotype-based summary statistics.
In other cases you don't want all the SNPs, such as when building a SNP
chip or running a STRUCTURE analysis. The STRUCTURE program does not
want SNPs that are in tight linkage, which most SNPs at the same RAD
locus will be. You can make the same argument for choosing SNPs to put
on a SNP chip.
In these cases it makes sense to choose one representative SNP per locus.
Best,
julian
From the STRUCTURE manual:
The structure model assumes that loci are independent within
populations (i.e., not in LD within populations). This assumption is
likely to be violated for sequence data, or data from non-recombining
regions such as Y chromosome or mtDNA.
If you have sequence data or dense SNP data from multiple independent
regions, then structure may actually perform reasonably well despite the
data not completely fitting the model. Roughly speaking, this will
happen provided that there is enough independence across regions that LD
within regions does not dominate the data. When there are enough
independent regions, the main cost of the dependence within regions will
be that structure underestimates the uncertainty in the assignment of
particular individuals.