This is my first time using PLINK or PLINK 2, and working on a genotype bioinformatics project.
I have installed v1.9 and v2.0, as well as bcftools and vcftools, on my Mac, and I understand what the programs do, but I am not proficient in their use.
This is my problem:
I have a pretty standard VCF file with about 5000000 SNPs and 5000 samples, that I want to use to perform a polygenic score analysis.
The FORMAT for each SNP/sample pair is GT:GP:DS, and an example would look like 0|1:(0.1,0.8,0.1):1.0, where:
GT indicates whether each allele is the alt (1) or the ref (0)
GP is a posterior probability where GP[0] = P(0|0), GP[1] = P(0|1), GP[2] = P(1|1)
DS is dosage and is calculated as 1*GP[1]+2*GP[2]
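As a quick sanity check of the field layout described above, here is a minimal Python sketch that splits one GT:GP:DS entry and recomputes the dosage from GP. The function names are made up for illustration, and the parentheses around GP are stripped only because the example in this post is written that way; a real VCF would normally not have them.

```python
def parse_sample_field(field):
    """Split one 'GT:GP:DS' sample entry into its three parts."""
    gt, gp_text, ds_text = field.split(":")
    # Assumption: GP may be wrapped in parentheses, as in the example above.
    gp = [float(x) for x in gp_text.strip("()").split(",")]
    return gt, gp, float(ds_text)


def dosage_from_gp(gp):
    """Expected alt-allele count: 1*GP[1] + 2*GP[2]."""
    return 1.0 * gp[1] + 2.0 * gp[2]


gt, gp, ds = parse_sample_field("0|1:(0.1,0.8,0.1):1.0")
print(gt, gp, ds, dosage_from_gp(gp))
```

If the recomputed value and the stored DS disagree by more than rounding error for many entries, the file probably defines DS or the GP order differently from what is assumed here.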
With this in mind I'd like to clump my SNPs based on a linkage disequilibrium (LD) analysis of the dosage (DS), ideally under the following conditions:
1) I only want to keep SNPs where I am sure of the genotype, i.e. either GP[1] > 0.8 or GP[2] > 0.8
2) I want to have the whole genome represented, so I want to limit my LD analysis to 1000 kb sections at a time
3) I only want to keep SNPs that are not correlated, i.e. r^2 < 0.05
4) I want the best-possible representative for each clumped group
5) For each representative SNP, I'd like to keep the list that clumped to it, in case they are more interesting biologically, or if I need to dig deeper on a genomic region
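To make conditions 2 to 5 concrete, here is a rough Python sketch of greedy clumping on dosages. It is not PLINK's implementation, only an illustration of the idea. It assumes that "best-possible representative" means the SNP with the smallest association p-value, and that those p-values come from a separate association analysis. All SNP IDs, positions, p-values, and dosages below are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no LD information
    return sxy * sxy / (sxx * syy)


def greedy_clump(snps, r2_threshold=0.05, window_bp=1_000_000):
    """Map each index SNP to the list of SNPs clumped under it."""
    by_p = sorted(snps, key=lambda s: s["p"])  # most significant first
    assigned = set()
    clumps = {}
    for index in by_p:
        if index["id"] in assigned:
            continue
        assigned.add(index["id"])
        members = []
        for other in by_p:
            if other["id"] in assigned or other["chrom"] != index["chrom"]:
                continue
            if abs(other["pos"] - index["pos"]) > window_bp:
                continue
            if r_squared(index["dosages"], other["dosages"]) >= r2_threshold:
                members.append(other["id"])
                assigned.add(other["id"])
        clumps[index["id"]] = members
    return clumps


# Hypothetical toy data: snpB tracks snpA closely, snpC does not, snpD is on another chromosome.
snps = [
    {"id": "snpA", "chrom": "1", "pos": 1000, "p": 1e-8, "dosages": [0, 1, 2, 1, 0]},
    {"id": "snpB", "chrom": "1", "pos": 5000, "p": 1e-4, "dosages": [0, 1, 2, 1, 0.1]},
    {"id": "snpC", "chrom": "1", "pos": 9000, "p": 1e-3, "dosages": [1, 0, 1, 2, 1]},
    {"id": "snpD", "chrom": "2", "pos": 1000, "p": 1e-2, "dosages": [0, 1, 2, 1, 0]},
]
clumps = greedy_clump(snps)
print(clumps)
```

The returned dictionary keeps, for every representative SNP, the SNPs that were clumped to it, which is the bookkeeping asked for in condition 5.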
Is this something that can be done using plink/plink2?
Does the approach I propose make sense, or would simple pruning be the way to go?
I read in this forum and on Biostars that clumping makes more sense than pruning for polygenic scoring.
I'd appreciate all the help I can get. As I said before, this is my first time using plink, and although I have gone through all the documentation several times, I am still not sure how to produce the correct association file from my VCF to use for clumping.
Thanks a lot in advance :)
Jano
--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/67535f97-155f-48ce-bcc8-5fa132f1a1aao%40googlegroups.com.
I see, so pretty much steps 1 and 2 filter based on dosage:
1. Convert the VCF dosage data to plink2 format.
   plink2 --vcf <VCF filename> dosage=GP --import-dosage-certainty 0.800001 --out <new plink2 fileset prefix>
2. Export hardcalls to plink 1.9.
   plink2 --pfile <plink2 fileset prefix> --make-bed --out <new plink 1.x fileset prefix>
3. Get a pruned file based on LD.
   plink --bfile <new plink 1.x fileset prefix> --indep-pairwise ... --out <prune prefix>
Will plink.prune.out let me know which SNP in prune.in each removed SNP is related to? Is there a way to keep track of that?
Can you filter the original vcf based on the results of pruning to obtain the simplified vcf?
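One way to picture that last filtering step is the Python sketch below. It keeps the VCF header plus only the records whose ID column is in a set of retained variant IDs. It assumes the retained IDs come from a one-ID-per-line list such as the prune.in file mentioned above, and that the VCF ID column is populated rather than '.'. The function name and the toy records are made up.

```python
def filter_vcf_lines(lines, keep_ids):
    """Yield header lines and the records whose ID (3rd column) is in keep_ids."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        variant_id = line.split("\t", 3)[2]
        if variant_id in keep_ids:
            yield line


# Hypothetical two-record VCF held in memory for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "1\t1000\tsnpA\tA\tG\t.\tPASS\t.\tGT:GP:DS\t0|1:0.1,0.8,0.1:1.0",
    "1\t5000\tsnpB\tC\tT\t.\tPASS\t.\tGT:GP:DS\t0|0:0.9,0.1,0.0:0.1",
]
keep_ids = {"snpA"}  # in practice, read one ID per line from the retained-variant list
filtered = list(filter_vcf_lines(vcf_lines, keep_ids))
print("\n".join(filtered))
```

For a file with millions of variants, the same logic would be applied while streaming the (possibly gzip-compressed) VCF line by line rather than loading it into a list.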