LD Pruning Based on P-value

Nathan Lawlor

May 4, 2017, 11:53:28 AM
to plink2-users
From this post (https://www.biostars.org/p/128923/) I read that when two SNPs are determined to be in LD, "PLINK keeps the SNP with higher minor allele frequency". When performing LD pruning, is it possible to instead have PLINK keep the SNP with the more significant P-value? When I refer to P-values, these would be the P-values as determined from particular genome-wide association studies (GWAS) from the GWAS catalog. 

For example, if rs10 (P-value of 1e-4) is in LD with rs20 (P-value of 1e-2), then rs10 would be kept and rs20 would be removed.

Christopher Chang

May 4, 2017, 11:58:45 AM
to plink2-users
You can make PLINK do this by generating a bogus allele frequency file (make the more significant p-values correspond to allele frequency closer to 0.5), and loading it with --read-freq during your LD-pruning run.
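
A minimal sketch of how such a bogus frequency file could be built. It assumes the column layout of a PLINK 1.9 --freq report (CHR, SNP, A1, A2, MAF, NCHROBS) and takes the alleles from the .bim file. The rank-based mapping, the function names `pseudo_mafs` and `frq_lines`, and the placeholder NCHROBS value are illustrative assumptions, not anything PLINK prescribes; any mapping that gives smaller p-values a pseudo-MAF closer to 0.5 should serve the same purpose.

```python
def pseudo_mafs(pvalues):
    """Map {snp_id: p_value} to {snp_id: pseudo-MAF in (0, 0.5]}.

    The most significant SNP gets 0.5; less significant SNPs get
    progressively smaller values. Ties are broken by input order.
    """
    ranked = sorted(pvalues, key=lambda snp: pvalues[snp])
    n = len(ranked)
    return {snp: 0.5 * (n - rank) / n for rank, snp in enumerate(ranked)}


def frq_lines(bim_rows, pvalues, nchrobs=2):
    """Build tab-delimited lines in a .frq-like layout.

    bim_rows: iterable of (chrom, snp_id, a1, a2) taken from the .bim file.
    nchrobs:  placeholder for the NCHROBS column (an assumption).
    SNPs without a p-value are skipped.
    """
    mafs = pseudo_mafs(pvalues)
    lines = ["\t".join(["CHR", "SNP", "A1", "A2", "MAF", "NCHROBS"])]
    for chrom, snp, a1, a2 in bim_rows:
        if snp in mafs:
            fields = [chrom, snp, a1, a2, f"{mafs[snp]:.6g}", str(nchrobs)]
            lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # Hypothetical alleles and positions, for illustration only.
    example_bim = [("11", "rs10", "A", "G"), ("11", "rs20", "C", "T")]
    example_p = {"rs10": 1e-4, "rs20": 1e-2}
    print("\n".join(frq_lines(example_bim, example_p)))
```

With the rs10/rs20 example from the first post, rs10 receives the larger pseudo-MAF, so it would be the one retained when the pair is in LD.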

Nathan Lawlor

May 4, 2017, 2:54:55 PM
to plink2-users
Thank you for the reply. I noticed the allele frequency report files (.frq) are not fixed-space delimited text files. When I generate a bogus allele frequency file, can I make it a tab-delimited text file? Or will this cause an error with PLINK --read-freq?

Christopher Chang

May 4, 2017, 3:00:35 PM
to plink2-users
Tab-delimited text should work.

Nathan Lawlor

May 4, 2017, 10:03:02 PM
to plink2-users
Thank you for the clarification. I also wanted to confirm that I am doing my analysis correctly. I have my .bed/.bim/.fam files (all.chr.genotypes) and a bogus allele frequency file (Type_2_diabetes.frq), in which I adjusted the MAF column so that SNPs with more significant p-values have a higher MAF. I want to perform pairwise LD pruning on a select list of SNP identifiers (Type_2_diabetes.SNPs.txt). To do this I used the following command:

plink --bfile all.chr.genotypes --indep-pairwise 1000 kb 5 0.2  --extract Type_2_diabetes.SNPs.txt --read-freq Type_2_diabetes.frq --r2 square --out T2D.LD.SNPs

This gives me both prune.out/prune.in files, a log file, and the LD matrix (.ld) file. Upon further investigation of the LD matrix and pruned files, I noticed that certain SNP pairs with very high LD (r-squared > 0.8) were not being pruned/removed.

For example, the SNPs rs5215 and rs5219 were in LD (r-squared = 0.97) and neither one was pruned (both were present in the prune.in file). Is it possible that the command I used is wrong?

Christopher Chang

May 5, 2017, 1:20:29 AM
to plink2-users
You need to split this into two runs.  In the first run, you have --bfile, --indep-pairwise, --read-freq, and --extract on Type_2_diabetes.SNPs.txt.  In the second run, you have --bfile, --r2, and --extract on the prune.in file generated in the first run.

jf

May 5, 2017, 6:45:33 AM
to plink2-users

Just out of curiosity: in your case, what advantage does the "indep-pairwise" option provide over the "clump" procedure, which explicitly performs p-value-aware pruning?

Nathan Lawlor

May 5, 2017, 10:39:44 AM
to plink2-users
The files I get when I run the command without separating it into 2 runs:

T2D.LD.SNPs.log  T2D.LD.SNPs.nosex  T2D.LD.SNPs.prune.in  T2D.LD.SNPs.prune.out


When I separate into 2 runs:

plink --bfile all.chr.genotypes --indep-pairwise 1000 kb 5 0.2  --extract Type_2_diabetes.SNPs.txt --read-freq Type_2_diabetes.frq --out T2D.LD.SNPs
plink --bfile all.chr.genotypes  --extract T2D.LD.SNPs.prune.in --r2 square --out Results

I get the files:

T2D.LD.SNPs.log  T2D.LD.SNPs.nosex  T2D.LD.SNPs.prune.in  T2D.LD.SNPs.prune.out Results.ld  Results.log  Results.nosex

The prune.in/prune.out files contain the same SNPs as they did when I did not separate the commands. The only difference this time is that the LD matrix (Results.ld file) shows r-squared values only for the SNPs in the prune.in file instead of all SNPs in the Type_2_diabetes.SNPs.txt file. So ultimately, separating the commands into 2 runs did not seem to make a difference for me.

Additionally, both the rs5215 and rs5219 SNPs are still not being removed. Within the LD-matrix, only rs5215 and rs5219 are in LD > 0.2, so at least one of these should be removed and present in the prune.out file. 
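
A check like the one described above can be scripted. The sketch below assumes the .ld file written by --r2 square is a plain whitespace-delimited square matrix with no header, and that its rows and columns follow the order of a SNP ID list you supply (for example, the retained SNPs in .bim order). The function name `high_ld_pairs` and the third SNP ID in the usage example are hypothetical.

```python
def high_ld_pairs(matrix_text, snp_ids, threshold=0.2):
    """Return (snp_a, snp_b, r2) for every pair whose r2 exceeds threshold.

    matrix_text: contents of a square r2 matrix, one row per line.
    snp_ids:     SNP identifiers in the same order as the rows/columns.
    """
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    n = len(snp_ids)
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError("matrix shape does not match the SNP ID list")

    pairs = []
    for i in range(n):
        for j in range(i):  # lower triangle only, skipping the diagonal
            r2 = float(rows[i][j])  # "nan" entries never exceed the threshold
            if r2 > threshold:
                pairs.append((snp_ids[j], snp_ids[i], r2))
    return pairs


if __name__ == "__main__":
    # Toy matrix; only the 0.97 value comes from the thread.
    toy = "1 0.97 0.01\n0.97 1 0.05\n0.01 0.05 1\n"
    for a, b, r2 in high_ld_pairs(toy, ["rs5215", "rs5219", "rsX"]):
        print(a, b, r2)
```

If pruning worked as intended, running this on the matrix for the prune.in SNPs with the same threshold used for pruning should return no pairs within the pruning window.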

Christopher Chang

May 5, 2017, 10:55:58 AM
to plink2-users
Try changing "1000kb 5 0.2" to "1000kb 1 0.2"; I remember there was an anomaly with kb-based pruning when the step size wasn't 1.  (plink 2.0 actually requires a step size of 1 in this case.)
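
Putting the advice in this thread together (two separate runs, and a step size of 1 with a kb window), the command lines could be assembled as in the sketch below. It uses only the flags and file names already quoted in this thread; the helper name `pruning_commands` is hypothetical, and the script only prints the commands rather than running them.

```python
def pruning_commands(bfile, snp_list, freq_file, prune_out, ld_out,
                     window="1000kb", step="1", r2="0.2"):
    """Return the argument lists for the pruning run and the LD check run."""
    # Run 1: LD pruning restricted to the SNP list, using the bogus
    # frequency file so that more significant SNPs are preferred.
    prune = ["plink", "--bfile", bfile,
             "--indep-pairwise", window, step, r2,
             "--extract", snp_list,
             "--read-freq", freq_file,
             "--out", prune_out]
    # Run 2: LD matrix for the SNPs that survived pruning.
    check = ["plink", "--bfile", bfile,
             "--extract", prune_out + ".prune.in",
             "--r2", "square",
             "--out", ld_out]
    return prune, check


if __name__ == "__main__":
    commands = pruning_commands("all.chr.genotypes",
                                "Type_2_diabetes.SNPs.txt",
                                "Type_2_diabetes.frq",
                                "T2D.LD.SNPs",
                                "Results")
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` if PLINK is installed and on the PATH.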

Nathan Lawlor

May 5, 2017, 11:14:30 AM
to plink2-users
That seems to have fixed my issue. Thank you very much for your help, Christopher.

And jf, I tried the --clump option and that seems to work as well. Thank you for the suggestion.

Tim Bigdeli

Jun 12, 2017, 1:42:43 PM
to plink2-users
Hi - a quick question re: this post:

Would you please comment on the interpretation of SNP entries in the *.clumped file that have a value of "NONE" in the SP2 field? I had taken this to mean simply that a given index SNP is not in LD with other SNPs passing filtering criteria; however, these SNPs may also appear in the SP2 field of other index SNPs.

Thanks in advance for your comments! t

Christopher Chang

Jun 20, 2017, 2:24:54 PM
to plink2-users
Hi,

PLINK 1's clumping uses a greedy algorithm, where clumps are formed around the variants with the best p-values first, and by default no SNP appears in multiple clumps.  So a SP2=NONE index SNP with a not-that-great p-value may actually be in LD with other nearby SNPs which have already been claimed by another clump.
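
To illustrate the greedy behavior described above, here is a toy sketch. It is not PLINK's implementation: it ignores the p-value thresholds and physical distance window that real clumping applies, and the function name, SNP labels, and r-squared values are made up for the example.

```python
def greedy_clumps(pvalues, r2, r2_threshold=0.5):
    """Form clumps around the best p-values first; each SNP joins one clump.

    pvalues: {snp_id: p_value}
    r2:      {frozenset({snp_a, snp_b}): r_squared}; missing pairs count as 0.
    Returns a list of (index_snp, [member_snps]) in order of formation.
    """
    claimed = set()
    clumps = []
    for index_snp in sorted(pvalues, key=pvalues.get):
        if index_snp in claimed:
            continue  # already belongs to an earlier, more significant clump
        claimed.add(index_snp)
        members = [snp for snp in pvalues
                   if snp not in claimed
                   and r2.get(frozenset((index_snp, snp)), 0.0) >= r2_threshold]
        claimed.update(members)
        clumps.append((index_snp, members))
    return clumps


if __name__ == "__main__":
    p = {"A": 1e-8, "B": 1e-5, "C": 1e-3}
    ld = {frozenset(("A", "B")): 0.9,
          frozenset(("B", "C")): 0.8,
          frozenset(("A", "C")): 0.1}
    for index_snp, members in greedy_clumps(p, ld):
        print(index_snp, members if members else "NONE")
```

In this toy case A claims B first, so C becomes an index SNP with no members (the SP2=NONE situation) even though it is in LD with B. Because C is never claimed by another clump, it also never shows up as a member of one.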

...which means that such an index SNP should not be appearing in the SP2 field of other index SNPs.  Can you post the .log file for a run where you're seeing that happen?