Converting summary statistics back to .bed .bim .fam

Jalil Sharif

Oct 25, 2021, 9:51:01 AM
to plink2-users
Hi,

I have a summary statistics file that was originally in VCF format. I used the R package MungeSumstats to clean it up, and the saved file now has the following header and first row:

SNP    CHR    BP    A1    A2    INFO    BETA    SE    LP    FRQ    N    N_CAS    ID    P
rs2462492    1    54676    C    T    0.3166    -0.0996    0.0656    0.890421    0.3166    2137    1169    rs2462492    0.128700134271648

I know a VCF file can be converted into .bed/.bim/.fam files. However, after the reformatting the file no longer retains its VCF characteristics. Can the non-VCF file with the header above be reformatted into .bed/.bim/.fam files, and if so, how can this be done with PLINK?

thanks


Matthew Maher

Oct 25, 2021, 9:57:43 AM
to plink2-users
bed/bim/fam is for representing individual genotype calls (i.e. one call per SNP per sample). Summary statistics are just that, a summary: they contain no genotype data. I don't know about your source VCF, but the VCF format is designed to hold detailed genotype data, and VCF files usually do. Some VCF files omit it, though, and contain only summary-level information.
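
A quick way to tell which kind of VCF you have is to look at the `#CHROM` header line and the FORMAT column. The sketch below is only an illustration: the function name `vcf_genotype_summary` is made up, the input is assumed to be an uncompressed, tab-delimited VCF opened as text, and only the first data record is inspected.

```python
import io


def vcf_genotype_summary(handle):
    """Return (number of sample columns, whether FORMAT contains GT).

    Hypothetical helper: reads the header and the first data record only.
    """
    n_samples = 0
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # 8 fixed columns, then FORMAT, then one column per sample
            n_samples = max(len(fields) - 9, 0)
            continue
        # First data record: FORMAT is the 9th column when it is present
        has_gt = len(fields) > 8 and "GT" in fields[8].split(":")
        return n_samples, has_gt
    return n_samples, False


# Toy records (not real data): a sites-only VCF and one with two samples.
sites_only = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs1\tA\tG\t.\t.\t.\n"
)
with_genotypes = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\t1/1\n"
)
print(vcf_genotype_summary(io.StringIO(sites_only)))
print(vcf_genotype_summary(io.StringIO(with_genotypes)))
```

If there are no sample columns, or the FORMAT column carries no `GT` field, there are no genotype calls in the file to convert to bed/bim/fam.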

Jalil Sharif

Oct 26, 2021, 11:00:28 AM
to plink2-users

Okay, thank you. Is there any option to develop a genetic risk score for all the variants in a GWAS using a VCF summary statistics file? The --score option would require a separate file, but that file should be extractable from the summary statistics file by cutting out the SNP, A1 and BETA columns.
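
For illustration, cutting those three columns could look like the sketch below. It assumes the summary statistics file is whitespace- or tab-delimited with the header shown earlier in the thread; `score_rows` is a hypothetical helper, not part of PLINK or MungeSumstats. Whether BETA refers to A1 or A2 depends on the convention of the tool that produced the file, so the allele column is a parameter and should be checked before scoring.

```python
def score_rows(lines, columns=("SNP", "A1", "BETA")):
    """Pick the variant ID, allele and weight columns from summary statistics.

    `lines` is any iterable of text lines whose first line is the header.
    """
    rows = iter(lines)
    header = next(rows).split()
    idx = [header.index(name) for name in columns]  # ValueError if a column is missing
    picked = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        picked.append([fields[i] for i in idx])
    return picked


# Header and first row as posted in the thread.
example = [
    "SNP CHR BP A1 A2 INFO BETA SE LP FRQ N N_CAS ID P",
    "rs2462492 1 54676 C T 0.3166 -0.0996 0.0656 0.890421 0.3166 2137 1169 rs2462492 0.128700134271648",
]
for row in score_rows(example):
    print("\t".join(row))
```

Written out with a header line, the three columns could then be passed to something like `plink2 --score scores.tsv 1 2 3 header`; check the PLINK documentation for the exact modifiers in your version.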

Matthew Maher

Oct 27, 2021, 11:09:03 AM
to plink2-users
I'm not sure what you mean by "develop a genetic risk score". Are you asking about:
1. How to determine what SNPs, and with what weights, should be included in a PRS formula?
OR
2. How to actually calculate PRS values for specific samples?

#2 is what PLINK's --score does. But I don't believe you can run #2 directly against a full GWAS stats file and expect meaningful results, because of LD and winner's curse, among other issues. That's why numerous tools and methods are under development to address #1.
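
To make #2 concrete: conceptually, one sample's score is a weighted sum of effect-allele counts. The sketch below is a simplified illustration, not PLINK's implementation; `polygenic_score` and its dictionary inputs are hypothetical, and PLINK's --score has its own conventions for things like missing genotypes and averaging. The function applies the weights exactly as given, with no adjustment for LD or winner's curse, which is why the choice of weights in #1 matters so much.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages for one sample.

    weights: {variant_id: effect size per copy of the effect allele}
    dosages: {variant_id: effect-allele count (0, 1 or 2) for this sample}
    Variants missing from either input are skipped.
    Returns (score, number of variants used).
    """
    shared = [v for v in weights if v in dosages]
    score = sum(weights[v] * dosages[v] for v in shared)
    return score, len(shared)


# Toy numbers, not real effect sizes.
toy_weights = {"rs1": 0.5, "rs2": -0.25}
toy_dosages = {"rs1": 2, "rs2": 1, "rs3": 2}
print(polygenic_score(toy_weights, toy_dosages))
```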

Try working through this:  https://choishingwan.github.io/PRS-Tutorial/

As for GWAS-VCF, I suspect you're referring to this proposed standard, which does seem like a good idea because, yes, GWAS stats are so inconsistently formatted. But AFAIK, that very new proposed standard has not yet been adopted by any of the tools you'll likely need for PRS work (e.g. PLINK2, PRS-CS, LDpred2, PRSice2, ...).

Jalil Sharif

Oct 28, 2021, 12:46:31 PM
to plink2-users
I was trying to develop a PRS, but it kept failing for me, so I am trying to develop a genetic risk score instead. I may try a different approach and revisit the PRS.

Jalil Sharif

Oct 28, 2021, 12:47:53 PM
to plink2-users
Could you also answer question 1? I already know about the PRS tutorial.

Matthew Maher

Oct 28, 2021, 3:15:54 PM
to plink2-users
If you mean an answer to "How to determine what SNPs, and with what weights, should be included in a PRS formula?", I can recommend PRS-CS, which I just had good luck with. Unlike several of the competing tools, it is easy to install and its authors provide the necessary LD matrices. Essentially, it converts GWAS summary stats into a reduced set of LD-adjusted summary stats, which you can then use with PLINK2's --score on your actual samples.