Scoring (PRS) across multiple .bgen-files with PLINK2


Leon Hendrian

Sep 10, 2024, 10:38:37 AM
to plink2-users
Hello,
I have PRS weights (calculated using PRScs, a single txt file for all chromosomes) as well as .bgen-files (one per chromosome) for a study population, and I want to calculate the resulting PRS. I am unsure whether I did it correctly, though, specifically when combining the results obtained per chromosome.

I did the following:

I called plink2 with the --bgen and --score options for each of my bgen files using a command like so:

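# Jupyter/IPython cell: one plink2 run per autosome (chromosomes 1-22)
for i in range(1, 23):
    # --score columns: 2 = variant ID, 4 = allele, 6 = weight
    cmd = f"""plink2 --bgen [path]/bgen_file_{i}.bgen 'ref-first' --rm-dup 'exclude-all' --oxford-single-chr {i} --score [path]prs_coefficients.txt 2 4 6 --out [output-path]"""
    # "!" hands the command to the shell (IPython syntax)
    !{cmd}
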
As output, I obtained 22 .sscore-files with 4 columns:

#IID, ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM, SCORE1_AVG

This confused me a bit, since the documentation at https://www.cog-genomics.org/plink/2.0/formats#sscore lists more columns.

Then, in R code (not shown), I multiplied SCORE1_AVG by ALLELE_CT to get non-averaged sums per study participant and chromosome, and added these sums across chromosomes to obtain the PRS value for each participant. (Simply adding the averages would bias the result towards SNPs on smaller chromosomes, I think.)

Is my thinking here correct? Is "ALLELE_CT" the denominator for the average (and if not, what is)?

I also tried using PLINK1.9, but as per the answer here https://groups.google.com/g/plink2-users/c/iaQn0AC-7SU I think it is not suited for .bgen-files. I also saw that there is a --score-list option, but as I understand the documentation, it is used when one has multiple weight/score-files, not multiple genotype files, correct?

Best,
Leon Hendrian

Christopher Chang

Sep 10, 2024, 8:23:35 PM
to plink2-users
- If there are missing genotypes, and they are mean-imputed by --score (this is the default behavior), ALLELE_CT * SCORE1_AVG isn't quite what you want.  One fix is to add "cols=+scoresums" to the end of your --score flag, and then use the SCORE1_SUM output value.  See https://www.cog-genomics.org/plink/2.0/general_usage#colset for more discussion of how this relates to what you saw at https://www.cog-genomics.org/plink/2.0/formats#sscore .
- You are correct that PLINK 1.9 does not preserve .bgen dosages, and that --score-list is for multiple score files rather than multiple genotype files.
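
A minimal sketch of the combining step, assuming each per-chromosome run had "cols=+scoresums" appended to --score and wrote its own .sscore file. The per-sample SCORE1_SUM values can then simply be added across chromosomes. The function name `sum_sscore_files` is hypothetical, and samples are keyed by IID alone, which assumes IIDs are unique in the dataset.

```python
import csv
from collections import defaultdict


def sum_sscore_files(paths, column="SCORE1_SUM"):
    """Add up one score column per sample across several .sscore files.

    Assumes tab-delimited files whose header contains an IID column
    (written as "#IID" when it is the first column) and the requested
    score column.
    """
    totals = defaultdict(float)
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            # Drop the leading "#" that marks the header line.
            header = [name.lstrip("#") for name in next(reader)]
            iid_idx = header.index("IID")
            col_idx = header.index(column)
            for row in reader:
                if not row:
                    continue  # skip blank lines
                totals[row[iid_idx]] += float(row[col_idx])
    return dict(totals)
```

Usage would look something like `sum_sscore_files([f"out_chr{i}.sscore" for i in range(1, 23)])`, where the `out_chr{i}` naming is only an illustration. It presumes each plink2 run was given a distinct --out prefix.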

Phil Greer

Sep 11, 2024, 6:32:37 AM
to plink2-users
Leon,

If you are working with UK Biobank data, please look into either https://2cjenn.github.io/PRS_Pipeline/ or, if you are working on the UKB-RAP, https://github.com/pjgreer/ukb-rap-tools/tree/main/prs-calc. Both scoring pipelines reduce the larger imputed files down to the variants in the scoring file, convert the result into a single plink file, and then output a single .sscore file. This is much easier than scoring directly from the .bgen files.

-Phil Greer