Dear PRSice team,
I'm trying to run PRSice on UK Biobank data with a binary outcome. My command runs, but the results don't look right and I'm struggling to find the error: the models from PRSice have very poor discrimination and a tiny PRS R2.
After trying to run PRSice across the full range of thresholds, I ran it with --lower and --upper both set to 5e-08 and without regression (--no-regress), just to generate the scores. This ran but seemed to stall whilst processing phenotypes: it sat at “Processing 50%” for ages (see output log below), so I cancelled the job. However, by then it had written a .score file containing scores for all samples except the final one. I ran the regression on these scores myself and got a PRS R2 of 1.21e-05 (Nagelkerke’s, as I believe PRSice uses). I have also manually constructed a GWAS-significant PRS from the same inputs (same SNPs, same weights and covariates, phenotypes from the same phenotype file, and also using average scoring), and that performs decently, with a PRS R2 of 0.005.
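To make clear what I mean by a Nagelkerke PRS R2, here is a minimal sketch of the calculation. It assumes the PRS R2 is the Nagelkerke R2 of the full model (covariates plus PRS) minus that of the covariates-only model, which is my understanding of what PRSice reports rather than something I have confirmed. The function names and the log-likelihood values are hypothetical, for illustration only.

```python
import math


def nagelkerke_r2(ll_intercept, ll_model, n):
    """Nagelkerke pseudo-R2 of a logistic model.

    ll_intercept: log-likelihood of the intercept-only model
    ll_model:     log-likelihood of the fitted model
    n:            number of samples
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_intercept - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_intercept / n)
    return cox_snell / max_cox_snell


def prs_r2(ll_intercept, ll_covariates, ll_full, n):
    """Assumed definition: R2(covariates + PRS) minus R2(covariates only)."""
    full = nagelkerke_r2(ll_intercept, ll_full, n)
    null = nagelkerke_r2(ll_intercept, ll_covariates, n)
    return full - null


# Hypothetical log-likelihoods, not taken from my data.
example = prs_r2(ll_intercept=-5000.0, ll_covariates=-4900.0,
                 ll_full=-4880.0, n=100000)
print(example)
```

The three log-likelihoods would come from fitting the intercept-only, covariates-only and covariates-plus-PRS logistic models on the same set of samples.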
So I think the problem may be in the actual score construction. I also tried flipping the sign of the effect size and swapping the risk allele for the SNPs with negative effect sizes, which made no difference (though I note you addressed this issue below and said it shouldn’t matter).
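As a small illustration of why that flip should not matter, the sketch below computes an average score as the sum of beta times effect-allele dosage divided by twice the number of SNPs (my understanding of average scoring, assuming diploid data and no missing genotypes). Flipping the negative-beta SNPs shifts every sample's score by the same constant, so differences between samples are unchanged. The betas and dosages are made up, and the function names are hypothetical.

```python
def average_score(betas, dosages):
    """Average score for one sample: sum(beta * dosage) / (2 * number of SNPs)."""
    return sum(b * d for b, d in zip(betas, dosages)) / (2 * len(betas))


def flip_negative(betas, dosages):
    """Make every beta positive, counting the other allele for flipped SNPs."""
    flipped_betas = [abs(b) for b in betas]
    flipped_dosages = [2 - d if b < 0 else d for b, d in zip(betas, dosages)]
    return flipped_betas, flipped_dosages


# Made-up weights and effect-allele dosages for two samples.
betas = [0.12, -0.08, 0.05]
sample_a = [0, 1, 2]
sample_b = [2, 2, 1]

raw_diff = average_score(betas, sample_a) - average_score(betas, sample_b)
flip_diff = (average_score(*flip_negative(betas, sample_a))
             - average_score(*flip_negative(betas, sample_b)))

# The two differences agree up to floating-point error.
print(raw_diff, flip_diff)
```

This only holds when the flip is applied consistently to both the weight and the counted allele; counting the wrong allele without changing the sign of the weight would change the ranking of samples.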
My base data is a GWAS meta-analysis output file from the META package. My target data is already clumped, so I am running PRSice with the --no-clump flag. The target data is in bgen v1.1 format (I filtered the UKB bgen v1.2 release and converted it to v1.1 so that I could clump with PLINK 1.9).
I’d be really grateful for any thoughts as to where the issue might be.
Thanks
Sarah
Log file:
PRSice 2.2.11.b (2019-10-16)
https://github.com/choishingwan/PRSice
(C) 2016-2019 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2020-02-12 10:04:16
./PRSice \
--A1 allele_B \
--A2 allele_A \
--all-score \
--bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
--base renamed_baseSNPdata_allchr_targetrsid_v2.txt \
--beta \
--binary-target T \
--bp pos \
--chr chr \
--extract all_SNPs_clumped_R2_0.2.txt \
--interval 5e-05 \
--lower 5e-08 \
--missing mean_impute \
--model add \
--no-clump \
--no-regress \
--out PRSice_results_noclump_r2_0.2/GWASsig_PRSice2.base_noukb.r2_0.2_noregression \
--pheno phenotype_inc_C_withFID_targetonly_20200207.txt \
--pheno-col incident_C \
--pvalue P_value \
--remove exclude_UKBJan2020.txt \
--score avg \
--snp rsid \
--stat BETA \
--target /ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr#,/ukb_target_bgen1.1/targetukb_qcfiltered.sample \
--thread 1 \
--type bgen \
--upper 5e-08
Initializing Genotype file:
/ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr#
(bgen)
With external sample file:
/ukb_target_bgen1.1/targetukb_qcfiltered.sample
Start processing
renamed_baseSNPdata_allchr_targetrsid_v2
==================================================
Only one column detected, will assume only SNP ID is
provided
Base file:
renamed_baseSNPdata_allchr_targetrsid_v2.txt
6856779 variant(s) observed in base file, with:
6315157 variant(s) excluded based on user input
541622 total variant(s) included from base file
Loading Genotype info from target
==================================================
Detected bgen sample file format
We assume the following line is not a header:
4936591 4936591 0 1
(first column isn't FID or IID)
378410 people (174275 male(s), 204040 female(s)) observed
378304 founder(s) included
6315157 variant(s) not found in previous data
541622 variant(s) included
Check Phenotype file:
phenotype_inc_C_withFID_targetonly_20200207.txt
Column Name of Sample ID: FID+IID
Note: If the phenotype file does not contain a header, the
column name will be displayed as the Sample ID which is ok.
There are a total of 1 phenotype to process
Processing the 1 th phenotype
Preparing Output Files
Start Processing
Processing 50.00%
