Problems with score calculation

477 views
Skip to first unread message

SarahB

unread,
Feb 17, 2020, 9:48:02 AM2/17/20
to PRSice

Dear PRSice team,

 

I'm trying to run PRSice on UKBiobank data on a binary outcome. My command runs but the results I'm getting don't look right and I'm struggling to find the error. I'm getting models with very poor discrimination from PRSice and tiny PRS R2.

 

Having tried to run PRSice across the full range of thresholds I ran it with --lower and --upper set to 5e-08 without regression to generate the scores. This ran but seemed to stall whilst processing phenotypes – it sat at “Processing 50%” for ages (see output log below), so I cancelled the job. However I had a .score file with all scores (except for the final sample) outputted, on which I performed the regression myself and got a PRS R2 of 1.21e-05 (Nagelkerke’s, as I believe PRSice uses). I have manually constructed a GWAS-significant PRS using the same data inputs (same SNPs, same weights and covariates, phenotypes from same phenotype file, and also using average scoring) and have decent performance, with PRS R2 of 0.005.

 

So I think the problem may be in the actual score construction. I also tried changing the valence and risk allele for those where there were negative effect sizes which made no difference (though I note you addressed this issue below and said it shouldn’t matter).

 

My base data is a GWAS meta-analysis output file from the META package. I have preclumped target data, so am using PRSice with the --no-clump flag. The input target data is in bgen1.1 format (I filtered and converted the UKB data bgen1.2 release to v1.1 in order to clump using PLINK1.9).

 

I’d be really grateful for any thoughts as to where the issue might be.


Thanks 

Sarah



Log file:

PRSice 2.2.11.b (2019-10-16) 

https://github.com/choishingwan/PRSice

(C) 2016-2019 Shing Wan (Sam) Choi and Paul F. O'Reilly

GNU General Public License v3

If you use PRSice in any published work, please cite:

Choi SW, O'Reilly PF.

PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.

GigaScience 8, no. 7 (July 1, 2019)

2020-02-12 10:04:16

./PRSice \

    --A1 allele_B \

    --A2 allele_A \

    --all-score  \

    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \

    --base renamed_baseSNPdata_allchr_targetrsid_v2.txt \

    --beta  \

    --binary-target T \

    --bp pos \

    --chr chr \

    --extract all_SNPs_clumped_R2_0.2.txt \

    --interval 5e-05 \

    --lower 5e-08 \

    --missing mean_impute \

    --model add \

    --no-clump  \

    --no-regress  \

    --out PRSice_results_noclump_r2_0.2/GWASsig_PRSice2.base_noukb.r2_0.2_noregression \

    --pheno phenotype_inc_C_withFID_targetonly_20200207.txt \

    --pheno-col incident_C \

    --pvalue P_value \

    --remove exclude_UKBJan2020.txt \

    --score avg \

    --snp rsid \

    --stat BETA \

    --target /ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr#,/ukb_target_bgen1.1/targetukb_qcfiltered.sample \

    --thread 1 \

    --type bgen \

    --upper 5e-08


Initializing Genotype file: 

/ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr# 

(bgen) 

With external sample file: 

/ukb_target_bgen1.1/targetukb_qcfiltered.sample 


Start processing 

renamed_baseSNPdata_allchr_targetrsid_v2 

================================================== 


Only one column detected, will assume only SNP ID is 

provided 


Base file: 

renamed_baseSNPdata_allchr_targetrsid_v2.txt 

6856779 variant(s) observed in base file, with: 

6315157 variant(s) excluded based on user input 

541622 total variant(s) included from base file 


Loading Genotype info from target 

================================================== 


Detected bgen sample file format 


We assume the following line is not a header: 

4936591 4936591 0 1 

(first column isn't FID or IID) 


378410 people (174275 male(s), 204040 female(s)) observed 

378304 founder(s) included 


6315157 variant(s) not found in previous data 

541622 variant(s) included 


Check Phenotype file: 

phenotype_inc_C_withFID_targetonly_20200207.txt 

Column Name of Sample ID: FID+IID 

Note: If the phenotype file does not contain a header, the 

column name will be displayed as the Sample ID which is ok. 

There are a total of 1 phenotype to process 


Processing the 1 th phenotype

Preparing Output Files


Start Processing

Processing 50.00%


Sam Choi

unread,
Feb 18, 2020, 12:26:57 PM2/18/20
to PRSice
The stall at 50% is likely PRSice spending a long time in writing the file output, especially when you are not using --fastscore.

How exactly did you calculate your score? PRSice by default use the dosage to calculate the PRS when bgen is used.

SarahB

unread,
Feb 19, 2020, 4:44:53 AM2/19/20
to PRSice
Thanks Sam,

Yes I'm hoping I replicated your average scoring - so I used dosages for the manual calculation - I extracted these from the bgen using qctool v2, weighted each dosage by its beta, summed them and then divided by the total number of SNPs included.

Sarah

Sam Choi

unread,
Feb 19, 2020, 10:59:12 AM2/19/20
to PRSice
that sounded reasonable.

Screen Shot 2020-02-19 at 10.54.55 AM.png

Just in case, the way PRSice calculate the dosage score is

where w is the probability for the j th allele (0 for homo non-effective, 1 for hetero, 2 for homo effective allele)

and then we divide the score by the number of SNP * ploidy which is 2 for human. This should give very similar results as yours.


Have you try to directly run PRSice on your UKBB data with the --allow-inter option? That should complete within a day (including clumping), just so that I can rule out the possibility where there's something wrong with the bgen 1.1 code.

Sam


SarahB

unread,
Feb 19, 2020, 11:53:50 AM2/19/20
to PRSice
Thanks Sam,

No I haven't, I'll give that a try now and get back to you. 

I did start my attempts with PRSice2 by trying to use it to QC the target dataset, clumping and thresholding, but it ran for days without getting any results which is why I decided to QC filter and clump outside of PRSice2. (I hadn't spotted this googlegroup at that point or I'd have emailed you then!!)

Sarah

SarahB

unread,
Feb 20, 2020, 6:06:20 AM2/20/20
to PRSice
While this is running - is there a way to run PRSice2 without performing QC on the target dataset, if you have already QC'd, and go straight to clumping?

Sam Choi

unread,
Feb 20, 2020, 2:43:03 PM2/20/20
to PRSice
Yes, if you don't use the --maf and --geno option, PRSice2 will skip the QC steps. The part where it takes a seemingly longtime before clumping is the construction of the intermediate file, which's enabled by the option --allow-inter. While it does take a while to generate the intermediate file, it drastically speed up the clumping process and therefore we usually recommend this whenever you have sufficient disk space. It will be even faster if you are using hard coded scores as we can reuse the intermediate file for that too

Message has been deleted

SarahB

unread,
Feb 24, 2020, 6:11:17 AM2/24/20
to PRSice
Hi Sam,

My PRSice command with clumping included, but no QC, with the --allow-inter option, is still running after 5 days. The intermediate file has been generated. Though I didn't include --maf and --geno options, it still spent a very long time calculating the MAFs (well over 24 hours - I guess this is the generation of the intermediate file is it?), with the log file stating "Calculate MAF and perform filtering on target SNPs" during this time. The current status in the log file is that it is 57% through clumping. 

I have just over 6.8million SNPs in my base file, which the log file says are included, and ~378000 individuals. I'm running it on 8 cores which is giving me 385384MB RAM, but PRSice is only allocating 938MB to clumping. 

This seems to be far slower than the times you were expecting!  As I said, I also had the same issue with length of time taken when I first ran PRSice, including QC filtering, on the UKB bgen 1.2 files, but had hoped it would be faster this time without the QC steps. 

Thanks
Sarah

PS this is my current command
Rscript PRSice.R --dir . \
--prsice ./PRSice \
--base renamed_baseSNPdata_allchr_targetrsid_v2.txt \
--target /ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr#,/ukb_target_bgen1.1/targetukb_qcfiltered.sample \
--stat BETA \
--beta \
--type bgen \
--binary-target T \
--A1 allele_B \
--A2 allele_A \
--snp rsid \
--chr chr \
--bp pos \
--pvalue P_value \
--pheno phenotype_inc_C_withFID_targetonly_20200207.txt \
--pheno-col incident_C \
--cov epi_covariates_FID.txt \
--cov-col array_type,@PC[1-4],sex,age \
--cov-factor array_type,sex \
--remove exclude_UKBJan2020.txt \
--model add \
--missing MEAN_IMPUTE \
--allow-inter \
--score avg \
--clump-r2 0.2 \
--clump-kb 250 \
--quantile 10 \
--out PRSice_results/PRSice2.r2_0.2_allSNPs_4PCarray_sexage

Sam Choi

unread,
Feb 24, 2020, 9:57:29 AM2/24/20
to PRSice
This is definitely slower than what I expected, I wonder how much that's because of the bgen1.1 vs bgen1.2 or because of machine. Did you use the executable I released or did you compile the software yourself?

I am also running PRSice on UK biobank bgen files and it takes me around 4 hours to complete (without covariates). So I definitely don't expect PRSice to run for 5 days without getting into the regression steps.

SarahB

unread,
Feb 24, 2020, 11:48:50 AM2/24/20
to PRSice
Something wrong in my setup then I think!! 

I'm using the executable, though I've just realised I'm not using the newest release (though I'm only using 2.2.11) - I've just downloaded 2.2.12 and will see if that helps.

Sam Choi

unread,
Feb 24, 2020, 2:13:36 PM2/24/20
to PRSice
If you are using the executable provided, then I guess it is almost as optimized as it could be in terms of compilation (with the exception of the --march=native flag). In that case, it is likely PRSice is I/O bounded by your server / computer and I have yet figure out a way to speed up PRSice in such scenario.

SarahB

unread,
Feb 24, 2020, 4:21:02 PM2/24/20
to PRSice
Can I ask what machine you're running on Sam, and then I will double check the spec of our cluster for comparison?

I've just downloaded the executable for 2.2.12, and running the exact same script as above I now get "Error: execution halted" before anything runs. Does this newer vsn need any additional specifications that the previous didn't?

Sam Choi

unread,
Feb 24, 2020, 5:22:48 PM2/24/20
to PRSice
No, they should be more or less the same. Judging from the error, it seems like there is some memory issue. How much memory did you requested? (typically, for UK biobank bgen data, I will use around 40GB of memory)

Our servers are equipped with 286 Intel 8168 24 core processors at 2.7 GHz and 192 GB of RAM. I just so happened to also run a PRSice analysis on UK Biobank bgen data this morning (residualized BMI) and it took around 7~8 hours for the whole analysis to complete

SarahB

unread,
Feb 28, 2020, 12:10:51 PM2/28/20
to PRSice
Hi Sam, 

Having spoken to our computing team I'm doing a variety of test runs (just using the bgen1.1 for chr20 with just a few subjects included) with different compute configurations to see if I can speed things up. I've taken out the Rscript and just running the executable as I'm not worried about plotting. I have a few queries/issues that have come up from this!

Firstly, I spotted this old thread about the clumping step (https://groups.google.com/forum/#!msg/PRSice/PvyhIN6TcWM/4KbdBc2LAgAJ) and wondered if it still holds true for the newer version? My log file (for just a few subjects) states "385384 MB RAM detected; reserving 1 MB for clumping" so I'm not sure how to speed up the clumping process, nor how yours runs so much faster on the same dataset?

Secondly, I am running the scripts on a pre-QC'd dataset so running it without the --maf and --geno option, but it seems to be doing QC anyway! This is using the vsn 2.2.11b executable. The log file says that it has calculated MAF and performed filtering, and that it has excluded a handful of variants based on this. These are the settings of the analysis according to the log file, which don't mention MAF/geno:

./PRSice \
    --A1 allele_B \
    --A2 allele_A \
    --allow-inter  \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base test_base_chr20.txt \
    --beta  \
    --binary-target T \
    --bp pos \
    --chr chr \
    --clump-kb 250 \
    --clump-p 1.000000 \
    --clump-r2 0.200000 \
    --cov test_covariates.txt \
    --cov-col array_type,@PC[1-4],sex,age \
    --cov-factor array_type,sex \
    --interval 5e-05 \
    --keep test_include_subjects.txt \
    --lower 5e-08 \
    --missing mean_impute \
    --model add \
    --out 1core_test_PRSise2.2.11_chr20 \
    --pheno test_phenotypes.txt \
    --pheno-col incident_CRC \
    --pvalue P_value \
    --score avg \
    --seed 3766426670 \
    --snp rsid \
    --stat BETA \
    --target /ukb_target_bgen1.1/targetukb_qcfiltered_for_clumping_chr20,/ukb_target_bgen1.1/targetukb_qcfiltered.sample \
    --thread 1 \
    --type bgen \
    --upper 0.5

I am using PRSice2.2.11b, because even using this small test dataset, the 2.2.12 executable fails with no error message. I've tried instead to compile 2.2.12 from source - I don't have cmake so used the command "g++ -std=c++11 -O3 -DNDEBUG -march=native -isystem lib -I inc src/*.cpp -lpthread -lz -o PRSice" from your website, but then get an error as follows:

In file included from inc/commander.hpp:23:0,
                 from inc/genotype.hpp:22,
                 from inc/binarygen.hpp:21,
                 from src/binarygen.cpp:17:
inc/storage.hpp:21:23: fatal error: Eigen/Dense: No such file or directory
 #include <Eigen/Dense>



Apologies for the multiple questions but thanks in advance for your help!
Sarah

Sam Choi

unread,
Mar 2, 2020, 1:11:19 PM3/2/20
to PRSice
No worries, it is always a challenge to make the analysis faster for such a large data set. As far as what I know, 2.2.12's compilation might be a bit problematic, causing it to not run on lots of machine. Maybe you can try this experimental build https://www.dropbox.com/s/nl21wyn5a170jx6/PRSice_linux.zip?dl=0

1. The same thing still holds for clumping, it is a single threaded procedure due to how the analysis was performed and so far I haven't figured out a nice way to speed that up

2. I should've updated the message as what PRSice was doing in that step is basically a) Geno filtering b) MAF filtering c) Info filtering d) generating an intermediate file (either one of those). That's why it will always run. As for the MAF filtering, it is likely that the SNP has MAF of 0 or 1, which then will be removed.

3. One of the main problem with PRSice at its current form is that it still reads a lot from the file, which then makes it rather inefficient when using on a server with busy I/O (that happened before on our institute server). In that situation, PRSice can become extremely slow. I am still trying to figure out a way to  handle such situation without consuming too much memory ( because as it currently stands, PRSice are already using like 40GB of memory for bgen data which we really don't want to further increase that value)

I'll try and update you when I have got some other idea. My recent tests do give me some idea as to what can be done to improve the performance of PRSice.
Sam

Reply all
Reply to author
Forward
0 new messages