one SNP per locus vs coding of alleles

Fabio Raposo

Feb 19, 2015, 7:49:54 PM

to structure...@googlegroups.com

Dear colleagues,

I am working with collaborators on a sequence capture dataset of ~650 loci of ~300-400 bp each.

Many of those loci contain more than 1 SNP. Despite the routine protocol of sampling one SNP per locus, we have been doing a few runs coding the entire sequence as an allele (e.g. AGTC as 1, AGGC as 2, and so on). The latter strategy apparently gives us more detailed estimates of population structure than the one-SNP-per-locus approach.
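To make the whole-sequence coding concrete, here is a minimal Python sketch of how the haplotypes at one locus could be turned into integer allele codes. The function name `code_haplotypes`, the use of `N` as the missing-base symbol, and the choice of -9 as the missing-data code are assumptions for illustration, not part of the dataset described above.

```python
from typing import Dict, List


def code_haplotypes(haplotypes: List[str], missing: str = "N") -> List[int]:
    """Map each distinct haplotype string at one locus to an integer (1, 2, ...).

    Codes are assigned in order of first appearance. Any haplotype that
    contains the missing symbol is coded -9 (an assumed missing-data value).
    """
    codes: Dict[str, int] = {}
    coded: List[int] = []
    for hap in haplotypes:
        if missing in hap:
            coded.append(-9)
            continue
        if hap not in codes:
            codes[hap] = len(codes) + 1  # next unused integer code
        coded.append(codes[hap])
    return coded


# Variable sites of one locus, one string per sampled gene copy
locus = ["AGTC", "AGGC", "AGTC", "ANTC", "TGGC"]
print(code_haplotypes(locus))  # [1, 2, 1, -9, 3]
```

Each locus would be coded separately, so the integer 1 at one locus has no relation to the integer 1 at another.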

We just wanted to make sure: Is this practice acceptable? Does this violate the underlying STRUCTURE model?

Thanks very much,

Fabio

Vikram Chhatre

Feb 20, 2015, 9:41:42 AM

to structure-software

If I am understanding this correctly, there is no difference between your two strategies. When you designate a unique sequence as an allele (coded as 1 or 2), that is no different from choosing a random SNP from that sequence and designating it as an allele.

The only difference that could arise between the two strategies is if the latter results in higher polymorphism (more alleles in the population, arising from the various combinations of SNPs within that sequence, as opposed to the alleles of a single SNP). At that point, however, you may have more than two alleles per locus.
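A small sketch of that difference, assuming a toy locus with two variable sites. The helper `allele_counts` and the example haplotypes are hypothetical, made up only to show how the number of alleles can differ between the two codings.

```python
from typing import List, Tuple


def allele_counts(haplotypes: List[str], snp_index: int = 0) -> Tuple[int, int]:
    """Return (alleles seen at one chosen SNP, alleles seen for the whole sequence)."""
    one_snp = len({hap[snp_index] for hap in haplotypes})
    whole_sequence = len(set(haplotypes))
    return one_snp, whole_sequence


# Two variable sites at one locus, five sampled gene copies
haps = ["AT", "AG", "CT", "CG", "AT"]
print(allele_counts(haps))  # (2, 4)
```

With one SNP sampled, the locus has two alleles (A and C); with the whole sequence coded as an allele, the same locus has four.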

Does that make sense or did I misinterpret something?

V

Fabio Raposo

Feb 21, 2015, 7:29:50 AM

to structure...@googlegroups.com

Dear Vikram,

Thanks for your reply - yes, it makes sense.

The reason we are trying this is to use most of the SNPs in the dataset without violating the underlying assumptions. For sequences with more than one SNP, we end up with more alleles per locus, which seems to give us a more detailed picture of the structure (especially considering that we have shallow structure and possibly lots of past gene flow).