Marker LD and Structure analysis

Vikram Chhatre

Nov 11, 2011, 11:25:14 AM

to structure-software

Hello everyone,

Large numbers of genomic markers are now becoming available in most
study systems. Many of you have been doing Structure analysis using
such genome-wide data. A question invariably comes up on this list
about linkage disequilibrium (LD) between markers and how it affects
the inference of population structure.

On more than one occasion, I have suggested estimating the rate of
decay of linkage disequilibrium and, depending on the outcome, thinning
out markers that may be tightly linked from further analysis. As I
just found out from a discussion with Dr. Pritchard, Structure
software can handle modest amounts of LD between markers within a
genomic region *if* a large number of genomic regions (which are not
in LD with each other) are also part of the data set.

In other words, if your markers are distributed throughout the genome,
the LD between markers in a particular region is not worrisome and
thus there is no need to thin out markers, except to decrease
computational time.

On the other hand, if most of your markers belong to a single genomic
region, this presents a contrasting situation and you will need to
thin out markers.
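
For anyone who wants to look at LD decay before deciding whether to thin, here is a minimal Python sketch. It assumes genotypes are coded as 0/1/2 allele counts and uses the squared genotype correlation as the LD measure. The function names, the bin size and the toy data are all made up for illustration and are not part of Structure or any other package.

```python
from itertools import combinations
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treated as 0 here
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_decay(positions, genotypes, bin_size):
    """Mean pairwise r^2 per distance bin, for markers on ONE chromosome.

    positions: base-pair position of each marker
    genotypes: one genotype vector per marker (same individual order)
    Returns {bin_start_distance: mean r^2}.
    """
    bins = {}
    for i, j in combinations(range(len(positions)), 2):
        distance = abs(positions[i] - positions[j])
        r2 = r_squared(genotypes[i], genotypes[j])
        bins.setdefault(distance // bin_size, []).append(r2)
    return {b * bin_size: mean(values) for b, values in sorted(bins.items())}


# Toy data (hypothetical): three markers, six individuals.
positions = [100, 200, 5000]
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to the first marker, so r^2 = 1
    [2, 0, 1, 1, 2, 0],
]
print(ld_decay(positions, genotypes, bin_size=1000))  # {0: 1.0, 4000: 0.25}
```

With real data you would run this per chromosome (or per genomic region) and look at the distance at which the mean r^2 drops to background levels.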

I apologize for any confusion my previous posts may have caused.
Please see below for an excerpt from a discussion with Dr. Pritchard.

---------------------------------------
I believe that Structure is robust to modest amounts of LD. To be
more precise, I think that if you have many markers spread across the
genome, or at least in a large number of genomic regions, then local
LD within regions does not seem to cause serious problems. This is
in sharp contrast to a situation where all the markers are in a single
genomic region, in which case the inferred structure may simply
reflect the topology of the local coalescent tree.

Here is an example where we looked at this using ~2500 snps from ~32
genomic regions:
http://www.nature.com/ng/journal/v38/n11/extref/ng1911-S7.pdf (see p6)

Another side note on LD is that if there is a modest amount of LD in
the data then Structure may overestimate its confidence in individual
assignments (ie the credible regions may be slightly more narrow than
they should be). But I don't think most people actually look closely
at the credible regions, so that's not much of a concern in practice.

---------------------------------------

Thanks and feel free to add to this discussion with comments,
questions or to share your experiences with a particular dataset.

Vikram

Vinod Kumar

Aug 7, 2012, 4:13:41 AM

to structure...@googlegroups.com

Hi Vikram,

I think my question is not directly related to this post, but I would like to learn more from you.

I have generated a 3000-SNP dataset for a diploid plant species, with SNPs selected throughout the chromosomes, and I am doing GWAS. I am running the analysis in TASSEL 3.0, with a Q-matrix generated using STRUCTURE 2.3.4. I have loaded my PED and MAP files into TASSEL along with the Q-matrix and trait files. What is the standard method for doing GWAS? First I filtered the data at MAF 0.02 and generated haplotypes for all the chromosomes. Then I ran GLM and MLM using this chromosome-wise haplotype file. But after haplotyping, the r2 and p values look very high.

Is this the right way to do my analysis, or is something more required?

Vinod

Vikram Chhatre

Aug 7, 2012, 11:13:57 AM

to structure...@googlegroups.com

Vinod -

Have you tried approaches to GWAS other than Tassel?

When you say the p values are looking very high, do you mean highly
significant?

Regarding haplotypes:

1. Did you have a physical map to infer the phase correctly to
construct haplotypes?
2. Or does Tassel use a coalescent-prior-like approach, such as the one
implemented in fastPHASE, to infer haplotypes?
3. When associating haplotypes with traits, is it not necessary to
do a false positive correction?
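
On point 3, a false positive (multiple-testing) correction can be sketched in a few lines of Python. This thread does not say which correction, if any, Tassel applies, so the two methods below (Bonferroni and Benjamini-Hochberg) are shown only as common examples, and the p-values are invented.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four marker-trait tests.
pvals = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is conservative when many markers are tested. Benjamini-Hochberg controls the expected proportion of false positives among the reported hits, which is often the more practical target in an association scan.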

V


Vinod Kumar

Aug 7, 2012, 11:26:58 AM

to structure...@googlegroups.com

Hi Vikram,

Thanks for your reply.

I have not used other approaches for GWAS and do not know much about them. For plant GWAS, I have not found any literature that I can follow directly with my data. I am not sure about your first point, but yes, my MAP file contains the physical position of every SNP across the chromosomes. By high p values I mean not significant: they were numerically high, like 0.12, 0.23, 0.15. I do not understand your third point; everything I have done is described in my last mail. One more question, how can I find which SNP is present in my haplotype because haplotype is just showing the SNP site, not its ID? Please advise, and ask if you need further information that would help me with the GWAS study.

Thanks,

Vinod

Vikram Chhatre

Aug 7, 2012, 11:39:21 AM

to structure...@googlegroups.com

Vinod -

What I meant to ask was how you obtained map positions for the
markers. Is there a physical map available for your study species? If
not, inferring correct phase and constructing haplotypes from
genotypic data is not very straightforward. Software such as fastPHASE
can help you with phase inference.

If you search for GWAS software, there is a lot of information on the
web. There is a lot of literature on GWAS studies in plants,
particularly in crops. Here is a partial list of some software. I
will particularly point out 'PLINK' as an alternative to Tassel.

http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html
http://stephenslab.uchicago.edu/software.html
http://www.broadinstitute.org/scientific-community/software?criteria=GWAS

I am not sure what you mean by the following question. Could you show
us an example?

> how can I find which SNP is present in
> my haplotype because haplotype is just showing the SNP site, not its ID?
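
If "SNP site" here means the physical position of the SNP (an assumption, since no example was posted), the ID can be looked up in the PLINK MAP file, whose four columns are chromosome, SNP ID, genetic distance and base-pair position. A minimal Python sketch, with invented SNP names and positions:

```python
def load_map(lines):
    """Build {(chromosome, bp_position): snp_id} from PLINK .map lines."""
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, _genetic_distance, bp = fields[:4]
        index[(chrom, int(bp))] = snp_id
    return index


# Invented MAP content; with a real file, pass the open file handle instead.
example_map = [
    "1 snp_A 0 10523",
    "1 snp_B 0 48211",
    "2 snp_C 0 7730",
]
index = load_map(example_map)

# Sites reported for one haplotype, as (chromosome, position) pairs.
haplotype_sites = [("1", 10523), ("1", 48211)]
ids = [index.get(site, "unknown") for site in haplotype_sites]
print(ids)  # ['snp_A', 'snp_B']
```

For a file on disk the call would be something like `load_map(open("mydata.map"))`, where `mydata.map` is a placeholder name. If the haplotype output reports marker order rather than base-pair position, the lookup key would need to be the line number in the MAP file instead.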

V

Vinod Kumar

Aug 8, 2012, 1:17:03 AM

to structure...@googlegroups.com

Hi Vikram,
Yes, a physical map is available for my study plant, and we selected SNPs from known chromosomal locations. I am always confused about associating individual SNPs with traits versus associating haplotypes with traits. How are both methods important in GWAS? Yes, your suggestion of other GWAS software is good, and I am trying them out.
I am also preparing a set of example files to help you understand my question better.
Thanks,
Vinod

Vinod Kumar

Aug 11, 2012, 2:31:36 AM

to structure...@googlegroups.com

Hi Vikram,

Do you have PLINK commands for associating quantitative traits with SNP data in plants?

Or any other help with this?

Thanks,

Vinod

William Nicholson

Dec 1, 2016, 9:44:17 AM

to structure-software

The rival software ADMIXTURE's manual includes a "recipe" for thinning the marker set for linkage disequilibrium, which uses the program PLINK. I'm using that method of thinning with some success (I think) with fastSTRUCTURE. However, ADMIXTURE and fastSTRUCTURE do not include linkage disequilibrium in their models at all. I have another, smaller dataset where I'm using STRUCTURE, as I'm interested in the more complex model that includes weak linkage disequilibrium.

The recipe in the ADMIXTURE manual is to thin the markers according to the observed sample correlation coefficients using the "--indep-pairwise" option of PLINK, i.e. "plink --bfile rawData --indep-pairwise 50 10 0.1", which targets for removal each SNP with an R^2 greater than 0.1 within a sliding window (here 50 SNPs wide, advanced 10 SNPs at a time), and then "plink --bfile rawData --extract plink.prune.in --make-bed --out prunedData", which copies the remaining untargeted SNPs to prunedData.bed.

So in principle I could use this to thin the data used by STRUCTURE, when run with the model with some linkage, but I'm not sure what correlation threshold would be suitable to exclude strong linkage disequilibrium.
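
To make the effect of the r^2 threshold concrete, here is a simplified greedy thinning sketch in Python. It is not PLINK's exact algorithm (PLINK slides a window in steps and prunes within it); it only illustrates how a stricter threshold removes more markers. Genotypes are assumed to be 0/1/2 allele counts, and the function names and toy data are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treated as 0 here
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_prune(genotypes, window, threshold):
    """Return the indices of the markers kept after greedy LD thinning.

    Markers are visited in map order. A marker is dropped when its r^2 with
    any already-kept marker at most `window` markers earlier exceeds
    `threshold`.
    """
    kept = []
    for i, g in enumerate(genotypes):
        recent = [k for k in kept if i - k <= window]
        if all(r_squared(g, genotypes[k]) <= threshold for k in recent):
            kept.append(i)
    return kept


# Toy data (hypothetical): four markers in map order, six individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # duplicate of marker 0, r^2 = 1
    [2, 0, 1, 1, 2, 0],  # r^2 = 0.25 with marker 0
    [1, 0, 1, 1, 2, 1],  # r^2 = 0 with marker 0
]
print(greedy_prune(genotypes, window=50, threshold=0.6))  # [0, 2, 3]
print(greedy_prune(genotypes, window=50, threshold=0.1))  # [0, 3]
```

Running the same data at several thresholds and counting the surviving markers is a cheap way to see how much each threshold costs before committing to a long STRUCTURE run.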

William

Vikram Chhatre

Dec 2, 2016, 10:10:02 PM

to structure-software

In all likelihood, you could thin the data with a very stringent threshold and the remaining markers should still give you a pretty robust inference of population structure. I think the best thing to do is to try different subsets to see how different your solutions are.

Perhaps you could apply a two-pronged approach. Thin stringently, then subsample the remaining data set into a few sets using Kevin Emerson's sampling script (included in StrAuto: http://software.popgen.org), and try them all out with STRUCTURE.
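
The subsampling step can be illustrated with a generic Python sketch. This is not the StrAuto sampling script, whose options are not described in this thread; it only shows the idea of drawing a few reproducible random marker subsets. The function name, set sizes and seed are arbitrary.

```python
import random


def subsample_markers(marker_ids, n_sets, set_size, seed=0):
    """Draw `n_sets` random subsets of `set_size` markers each.

    Sampling is without replacement within a set; different sets may overlap.
    Each subset is returned sorted so that the original marker order is kept.
    """
    rng = random.Random(seed)  # fixed seed so the subsets can be regenerated
    return [sorted(rng.sample(marker_ids, set_size)) for _ in range(n_sets)]


# Hypothetical example: 100 thinned markers, three subsets of 20 markers.
marker_ids = list(range(100))
subsets = subsample_markers(marker_ids, n_sets=3, set_size=20, seed=1)
for subset in subsets:
    print(subset)
```

If the clustering solutions from the different subsets agree, that is some reassurance that the inference does not hinge on a particular choice of markers.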

If you provide more specific information about your data set, a more productive discussion could be had.

V


William Nicholson

Dec 4, 2016, 11:28:58 AM

to structure-software

What do you mean by a "very stringent" threshold: one that excludes pretty much all linkage (i.e. like the original ADMIXTURE manual suggestion) or just a little? If I do the former, then what's the advantage of using the more complex model in STRUCTURE that is supposed to deal with weak linkage?

I don't have time to fully describe the dataset right now, but it is from DNA capture arrays, so it is somewhat smaller than a whole genome sequencing dataset (i.e. it's more like a reduced representation or small exome sequencing dataset). It still takes a considerable time to run STRUCTURE on it. Doing multiple runs to try alternative thresholds is not going to help with one of the reasons for doing this, which is to improve the running time with STRUCTURE.

I had also hoped to have a threshold I could use based on the assumptions in the model that includes weak linkage. Reading the relevant paper though, there isn't an obvious value I can use. (I think the implication from the paper on the more complex model in STRUCTURE is that I would have to estimate a value like background linkage disequilibrium, by some unknown method, and use that as my threshold; but I would have to look at the paper again.)

William
