Clumping/Prunning from VCF file for polygenic scoring

1,043 views
Skip to first unread message

Jano

unread,
Jun 6, 2020, 8:40:56 PM6/6/20
to plink2-users
Hi,

This is my first time using Plink or Plink2, and working on a genotupe bioinformatics project.
I have installed v1.9 and v2.0, as well as bcftools and vcftools in my mac, and i understand what the programs do, but I am ot proficient in their use:

This is my problem:
I have a pretty standard VCF file with about 5000000 SNPs and 5000 samples, that I want to use to perform a polygenic score analysis.
The FORMAT for each SNP/Sample pair is GT:GP:DS, and an example would look 0|1:(0.1,0.8,0.1):1.0, where:

GT represents if alleles have the alt(1) or ref(0)
GP is a posterior probability where GP[0] =P(0|0), GP[1]=P(0|1), GP[2]=P(1|1)
DS is dosage and is calculated as 1*GP[1]+2*GP[2]

With this in mind I'd like to clump my SNPs based on a linkage dissociation analysis of the dosage (DS), ideally under the following conditions:

1) I only want to keep SNPs where I am sure what they are i.e either GP[1] > 0.8 or GP[2] >0.8
2) I want to have the whole genome represented so I want to limit my LD analysis to 1000KB sections at a time
3) I only want to keep SNPs that are not correlated so r^2 < 0.05
4) I want the best-possible representative for each clumped group
5) For each representative SNP, I'd like to keep the list that clumped to it, in case they are more interesting biologically, or if I need to dig deeper on a genomic region

Is this something that can be done using plink/plink2?
Does the approach I propose makes sense, or simply pruning would be the way to go?
I read in this forum and biostars that clumping, make more sense than pruning for polygenic scoring.

I'd appreciate all the help I can get, as I said before, this is my first time using plink and although I have gone through all the documentation several times, I am still not sure how to produce the correct association file from my vcf to use it for clumping.

Thanks a lot in advance :)


Jano

الوحالي السيد

unread,
Jun 6, 2020, 8:42:16 PM6/6/20
to Jano, plink2-users
عايز افتح الغرفه الصوتيه

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/67535f97-155f-48ce-bcc8-5fa132f1a1aao%40googlegroups.com.

Christopher Chang

unread,
Jun 7, 2020, 2:39:47 AM6/7/20
to plink2-users
Main steps:
1. Convert the VCF dosage data to plink2-format.
plink2 --vcf <VCF filename> dosage=GP --import-dosage-certainty 0.800001 --out <new plink2 fileset prefix>

2. Use plink2 --glm to generate the raw association statistics.

3. Export hardcalls to plink 1.9.
plink2 --pfile <plink2 fileset prefix> --make-bed --out <new plink 1.x fileset prefix>

4. Use plink 1.9 --clump to clump the association results.

Alejandro Wolf Yadlin

unread,
Jun 7, 2020, 11:57:05 AM6/7/20
to Christopher Chang, plink2-users
Thank you so much!

I appreciate the guidance and the super quick response.
I’ll try the steps as soon as I get to the lab tomorrow. I’ll  post anything interesting in this thread do other plink beginners can take advantage of whatever I learn :)

Thanks,


Jano

On Jun 6, 2020, at 11:39 PM, Christopher Chang <chrch...@gmail.com> wrote:


--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.

Jano

unread,
Jun 10, 2020, 1:44:57 PM6/10/20
to plink2-users
Hi Christopher -
Again, thank you so much for your help.

Sorry this might be a stupid question, but how does glm help me get the association statistics of the SNPs.
On the documentations it mentions that is making associations to the phenotype (my VCF only has genotypes), when all I want is to have the association among the SNPs.
to calculate LD distance. AM I misunderstanding the use of  -glm?

Thanks,

Jano

Christopher Chang

unread,
Jun 10, 2020, 1:47:39 PM6/10/20
to plink2-users
Sorry, I misunderstood what you meant by "association statistics", since polygenic scoring requires a phenotype that you're trying to construct a score for.

plink 1.9's --r/--r2 flag is the generic way to dump SNP-SNP correlation stats.
Message has been deleted

Jano

unread,
Jun 10, 2020, 2:09:28 PM6/10/20
to plink2-users
Sorry i did not explained myself well. I am trying to fit against a phenotype, but outside plink, I have a non linear model i want to try.
I want to use plink to pre-process the data and remove linearly dependent SNPs based on LD within 1000KB to reduce my data set and also remove co-linear SNPs.
is that a clearer description?

Are the steps then:

1. Convert the VCF dosage data to plink2-format.
plink2 --vcf <VCF filename> dosage=GP --import-dosage-certainty 0.800001 --out <new plink2 fileset prefix>

2. Export hardcalls to plink 1.9.
plink2 --pfile <plink2 fileset prefix> --make-bed --out <new plink 1.x fileset prefix>

3. Calculate r2 among SNPS
plink file  <new plink 1.x fileset prefix> --r/--r2 flag --out <assoc.prefix>


4. Use plink 1.9 --clump to clump the association results based on results of r2

5. Get filtered vcf with clumped SNP as input for my non linear model

Does this flow makes sense to you.


Again thanks for all the help -

Jano

Christopher Chang

unread,
Jun 10, 2020, 2:13:47 PM6/10/20
to plink2-users
No.  --clump is specific to --glm's output.

If you just want to remove highly-correlated SNPs based on LD, use --indep or --indep-pairwise.  --r/--r2 is for dumping the raw stats when you want to do something more customized.

Jano

unread,
Jun 10, 2020, 2:30:24 PM6/10/20
to plink2-users
I see -
so pretty much steps 1 and 2 to filter based on dosage

1. Convert the VCF dosage data to plink2-format.
plink2 --vcf <VCF filename> dosage=GP --import-dosage-certainty 0.800001 --out <new plink2 fileset prefix>

2. Export hardcalls to plink 1.9.
plink2 --pfile <plink2 fileset prefix> --make-bed --out <new plink 1.x fileset prefix>

3. Get pruned file based on LD
plink file <new plink 1.x fileset prefix> --prune --indep-pairwise ... --out <prune.perefix>

will plink.pune.out let me know to whihc SNP in prune.in are they related? Is there a way to keep track of that?
Can you filter the original vcf based on the results of pruning to obtain the simplified vcf?
Can Iuse prune.in file to filter?

I really appreciate the help (and patience)
Thanks,

Jano


Christopher Chang

unread,
Jun 10, 2020, 2:33:40 PM6/10/20
to plink2-users


On Wednesday, June 10, 2020 at 11:30:24 AM UTC-7, Jano wrote:
I see -
so pretty much steps 1 and 2 to filter based on dosage

1. Convert the VCF dosage data to plink2-format.
plink2 --vcf <VCF filename> dosage=GP --import-dosage-certainty 0.800001 --out <new plink2 fileset prefix>

2. Export hardcalls to plink 1.9.
plink2 --pfile <plink2 fileset prefix> --make-bed --out <new plink 1.x fileset prefix>

3. Get pruned file based on LD
plink file <new plink 1.x fileset prefix> --prune --indep-pairwise ... --out <prune.perefix>

will plink.pune.out let me know to whihc SNP in prune.in are they related? Is there a way to keep track of that?

No, that sort of thing is what --r/--r2 is for.
 
Can you filter the original vcf based on the results of pruning to obtain the simplified vcf?

Yes, by using --extract on the .prune.in file.
 

Jano

unread,
Jun 10, 2020, 3:23:59 PM6/10/20
to plink2-users
Thanks you I'll try the r2 and and prune commands and see what works best

Thanks a lot!

Jano
Message has been deleted

Jano

unread,
Jun 10, 2020, 4:21:23 PM6/10/20
to plink2-users
Hi,

Last question I hope -

I ran 

plink2 --vcf data.vcf.gz dosage=GP --import-dosage-certainty 0.800001 --out data_plink2
plink2 --pfile data_plink2 --make-bed --out data_plink1

when I try to run

plink --file data_plink1 --prune --indep-pairwise 1000kb 1 0.02 --noweb

or

plink --file data_plink1 --indep-pairwise 1000kb 1 0.02 --noweb


I get the same error

ERROR: No file [ data_plink1.ped ] exists.

Is there a way to make this file?
I don't want to the pruning in the original vcf as it will contain all SNPs and not only the one with GP >0.80001
So i was hoping to prune from the processed files and then use prune.in to extract my SNPs from the original VCF.


Thanks,


Ale
Jano

Christopher Chang

unread,
Jun 10, 2020, 5:05:33 PM6/10/20
to plink2-users
The flag is --bfile, not --file.  Also, you need to install plink 1.9; that error message is from plink 1.07, which has a ~1000x slower implementation of --indep-pairwise and also doesn't support "1000kb" syntax.

Jano

unread,
Jun 10, 2020, 5:22:48 PM6/10/20
to plink2-users
Thanks, I'll make sure that plink is upadted to version 1.9.

In the meantime, after your last message I figured I might be able to concatenate all steps in plink2 and run

plink2 --vcf my_data.vcf.gz dosage=GP --import-dosage-certainty 0.800001 --indep-pairwise 1000kb 1 0.02 --out mydata_prune

Does this makes sense?

Problem now is - when I try to extract

plink2 --vcf mydata.vcf.gz --extract mydata_prune.prune.in --out mydata_reduced

I get 
Error: Basic file conversions do not support regular filter or transform
operations.  Rerun your command with --make-bed/--make-{b}pgen.

Is there a way to get the filter reults in vcf.gz format after extracting only the SNPs in prune.in?

Thanks,

Jano


Christopher Chang

unread,
Jun 10, 2020, 5:27:16 PM6/10/20
to plink2-users
Add "--export vcf bgz vcf-dosage=DS" to your last command line.  (Or vcf-dosage=GP if you have no choice, but try to avoid that because plink2 doesn't actually keep track of the full genotype probability triplet.  If you actually need to keep the triplets for some reason, you can't use plink2 for VCF filtering.)
Message has been deleted

Jano

unread,
Jun 10, 2020, 6:49:25 PM6/10/20
to plink2-users
So I run

plink2 --vcf mydata.vcf.gz --extract mydata_prune.prune.in --export vcf bgz vcf-dosage=DS --out mydata_reduced
I only get a vcf with GT (0|0, 0|1, 1|1)

and  warning Warning: No dosage data present.  DS field will not be exported.
Warning: '_' present in original sample IDs; --export vcf will not be able to
reconstruct them. Consider rerunning with a suitable --export id-delim= value.

but when I look at my vcf header I can see

##source=Minimac3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">

Do i need to change the header to export dosage?

Thanks,

Jano

Christopher Chang

unread,
Jun 10, 2020, 7:13:27 PM6/10/20
to plink2-users
"--vcf mydata.vcf.gz dosage=DS"

Jano

unread,
Jun 10, 2020, 8:00:16 PM6/10/20
to plink2-users
awesome -thanks, feel silly now.
thanks, for all the help

Jano
Reply all
Reply to author
Forward
0 new messages