Why choose to write the first SNP in a locus versus a random one?


Ella

Apr 21, 2015, 5:07:45 PM
to stacks...@googlegroups.com
Hello,

While this isn't really a technical Stacks question, I'm wondering if anyone has insight into the consequences of choosing the first SNP in a locus, as opposed to a random one, for the genepop and structure outputs?

With thanks,
Ella

Jason

Apr 21, 2015, 9:29:29 PM
to stacks...@googlegroups.com
One potential reason is that the first SNP is likely to have a lower sequencing error rate, given that sequencing error rates increase towards the end of a read.

Emerson, Kevin

Apr 27, 2015, 11:00:57 AM
to stacks...@googlegroups.com
Choosing the first SNP is more repeatable than choosing a random SNP. If you run populations multiple times, you could get slightly different results based on the random SNP chosen within a locus, whereas every time you run it with the first SNP you will get the same data.
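
The repeatability point can be illustrated with a minimal Python sketch. This is not how Stacks implements either flag: the `loci` dictionary, the function names, and the `seed` parameter are all hypothetical, chosen only to contrast a deterministic pick with a random one.

```python
import random

# Hypothetical toy data: locus ID -> positions of the SNPs found at that locus.
loci = {
    "locus_1": [12, 47, 88],
    "locus_2": [5],
    "locus_3": [30, 61],
}

def first_snp(loci):
    """Pick the lowest-position SNP at each locus; the answer never changes."""
    return {locus: min(snps) for locus, snps in loci.items()}

def random_snp(loci, seed=None):
    """Pick one SNP per locus at random; runs differ unless a seed is fixed."""
    rng = random.Random(seed)
    return {locus: rng.choice(snps) for locus, snps in sorted(loci.items())}

# The first-SNP choice is identical on every call.
assert first_snp(loci) == first_snp(loci)

# A random choice is only guaranteed to repeat when the seed is fixed.
assert random_snp(loci, seed=1) == random_snp(loci, seed=1)

print(first_snp(loci))
print(random_snp(loci))
```

Without a fixed seed, two calls to `random_snp` may return different SNPs for `locus_1` and `locus_3`, which is the run-to-run variation described above.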


--
Stacks website: http://creskolab.uoregon.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
------------------
Kevin J Emerson, PhD
Assistant Professor of Biology
Biology Department
St. Mary's College of Maryland
18952 E. Fisher Rd
St. Mary's City, MD 20686-3001

Office: 240 - 895 - 2123, Shaefer Hall 231

---------------------

Jason Boone

Apr 27, 2015, 11:26:23 AM
to stacks...@googlegroups.com
Kevin, if you don't mind my asking: wouldn't the same SNPs show up each time, though? Or are you referring to the case where only one SNP is chosen, at random?

I'm not saying choosing the first SNP is bad; I'm just trying to get a handle on why it's so different from using all SNPs.

Thanks,
Jason



Tomas Hrbek

Apr 27, 2015, 11:52:08 AM
to stacks...@googlegroups.com
Folks,

I agree, it would make more sense to include all the SNPs (taking into
account the increasing error rates towards the 3' end of the read) when
constructing input files for Structure, etc. This is especially
relevant if sequencing is done on an Illumina MiSeq, NextSeq or
HiSeq 2500, or on an Ion Torrent, all of which produce relatively long
reads. This way one would obtain loci with more than just two alleles,
and the information content of these loci would be greater.

Cheers,

Tomas


Julian Catchen

Apr 27, 2015, 12:55:14 PM
to stacks...@googlegroups.com
Hi Jason, Kevin, Tomas and all,

When we talk about the two alternative flags to the populations
program, --write_single_snp and --write_random_snp, we are discussing
cases where there is more than one SNP at a single RAD locus. In these
cases the flags cause the program either to choose the first usable
SNP or to pick one usable SNP at random from that locus.

Kevin is correct in saying that --write_single_snp is more repeatable.
Different SNPs in a single RAD locus can have different allele
frequencies, depending on the age and background of the SNP, so which
SNP is chosen will alter the summary statistics.

This is only relevant if you want a single SNP per locus. For most
applications, such as Fst analysis, you do want to use all the SNPs, and
moreover you get an even stronger signal if you instead use the
haplotype-based summary statistics.

In other cases, such as building a SNP chip or running a STRUCTURE
analysis, you don't want all the SNPs. The STRUCTURE program does not
want SNPs that are in tight linkage, which most SNPs at the same RAD
locus will be. You can make the same argument for choosing SNPs to put
on a SNP chip.

In these cases it makes sense to choose one representative SNP per locus.

Best,

julian


From the STRUCTURE manual:

The structure model assumes that loci are independent within
populations (i.e., not in LD within populations). This assumption is
likely to be violated for sequence data, or data from non-recombining
regions such as Y chromosome or mtDNA.
If you have sequence data or dense SNP data from multiple independent
regions, then structure may actually perform reasonably well despite the
data not completely fitting the model. Roughly speaking, this will
happen provided that there is enough independence across regions that LD
within regions does not dominate the data. When there are enough
independent regions, the main cost of the dependence within regions will
be that structure underestimates the uncertainty in the assignment of
particular individuals.

Tomas Hrbek

Apr 27, 2015, 1:25:22 PM
to stacks...@googlegroups.com
Hi Julian,

I was talking about a case such as this:

Ind1_loc1 GATC
Ind1_loc2 GGTC

Ind2_loc1 GATC
Ind2_loc2 GATT

Selecting --write_single_snp, we would get this Structure output:

Ind1_loc1 2
Ind1_loc2 3

Ind2_loc1 2
Ind2_loc2 2

But ideally what we would want is this:

Ind1_loc1 21
Ind1_loc2 31

Ind2_loc1 21
Ind2_loc2 20

This coding system incorporates all of the information of the
haplotypes, thus increasing statistical power of the data in downstream
analyses. It also avoids treating each SNP as an independent locus.
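
As a rough illustration of the first step behind such a coding, the Python sketch below finds the variable sites among the four example sequences and keeps the nucleotides at those sites together as one string per sequence. The function names are hypothetical, and the sketch deliberately stops at nucleotides rather than guessing at a numeric code.

```python
def variable_sites(haplotypes):
    """Return the column indices at which the aligned haplotypes differ."""
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must all have the same length")
    return [i for i in range(length) if len({h[i] for h in haplotypes}) > 1]

def snp_haplotypes(haplotypes):
    """Keep only the variable sites, joined together per haplotype."""
    sites = variable_sites(haplotypes)
    return ["".join(h[i] for i in sites) for h in haplotypes]

# The four sequences from the example above, in order.
reads = ["GATC", "GGTC", "GATC", "GATT"]

print(variable_sites(reads))   # [1, 3]
print(snp_haplotypes(reads))   # ['AC', 'GC', 'AC', 'AT']
```

Keeping the two SNPs joined per sequence is what lets the locus carry three distinct alleles (AC, GC, AT) instead of two independent biallelic sites.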

Cheers,

tomas
--
---
^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^
Tomas Hrbek, Ph.D.
Departamento de Genética, ICB, LEGAL Phone: +55 92 3305 4233
Universidade Federal do Amazonas e-mail: tomas...@ufam.edu.br
Av. Rodrigo Otavio Ramos 3000 e-mail: hr...@evoamazon.net
Manaus, AM CEP:69077-000
Brasil web: www.evoamazon.net
http://lattes.cnpq.br/4139866243228811
http://orcid.org/0000-0003-3239-7068

><> ><> ><> ><> ><> ><> ><>
_____\|/_______\|/___\|/__\|/______\|/____________\|/___\|/________

Julian Catchen

Apr 27, 2015, 1:41:06 PM
to stacks...@googlegroups.com
Hi Tomas,

I totally agree that what you describe would be ideal: each RAD locus
should be encoded as a set of haplotypes, instead of using individual
SNPs. However, right now each SNP is encoded separately, so the SNP
information you give would not be interpreted in that way; each of
the two SNPs would be treated independently.

Instead we want to encode:

Ind1_loc1 GATC
Ind1_loc2 GGTC

Ind2_loc1 GATC
Ind2_loc2 GATT

as

Ind1_loc1 1
Ind1_loc2 2

Ind2_loc1 1
Ind2_loc2 3

We do this type of encoding for some outputs, such as VCF_haplotypes and
PHASE, but we don't have this type of encoding for STRUCTURE yet. It is
on the list to be added.
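
The haplotype encoding shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Stacks implementation: the function name is hypothetical, and numbering haplotypes in order of first appearance is an assumption that happens to reproduce the codes in the example.

```python
def encode_haplotypes(haplotypes):
    """Give each distinct haplotype an integer code, numbered from 1
    in order of first appearance (an assumed numbering scheme)."""
    codes = {}
    encoded = []
    for hap in haplotypes:
        if hap not in codes:
            codes[hap] = len(codes) + 1
        encoded.append(codes[hap])
    return encoded, codes

# The four sequences from the example above, in order.
encoded, codes = encode_haplotypes(["GATC", "GGTC", "GATC", "GATT"])

print(encoded)  # [1, 2, 1, 3]
print(codes)    # {'GATC': 1, 'GGTC': 2, 'GATT': 3}
```

Each locus then contributes a single multi-allelic column, so the linked SNPs within it are never treated as independent markers.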

The other factor is that you need high-quality data for the
haplotype-based results to be reliable. People with really low
coverage or high error rates will get poor results from haplotype-based
analyses, but can usually still pull out individual SNPs that are
informative.

Best,

julian

Tomas Hrbek

Apr 27, 2015, 4:57:27 PM
to stacks...@googlegroups.com
Hi Julian,

Thanks for the answer. I hope to see this option implemented in the
near future.

Tomas

Ella

May 5, 2015, 4:59:24 PM
to stacks...@googlegroups.com
Thanks everyone for the answers.