PRSice 2 error

Dom Byrne

Mar 17, 2021, 5:35:23 AM
to PRSice
Hi,

I've recently been trying to run PRSice using the imputed array genotypes from UKBiobank. I was having trouble with the job not completing within 7 days (the runtime limit on my university cluster). Having seen that newer versions of PRSice support multi-threaded clumping, I upgraded, but now have a new error.

The log file:
PRSice 2.3.3 (2020-08-05) 
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-03-10 10:23:05
/mnt/iusers01/bk01/v45331db/software/PRSice/PRSice_linux \
    --a1 Coded \
    --a2 Non_coded \
    --allow-inter  \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed.txt \
    --beta  \
    --binary-target T \
    --bp Pos \
    --chr Chromosome \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --cov copd_prs_cov.tsv \
    --extract copd_prs_imp_snps.valid \
    --geno 0.02 \
    --ignore-fid  \
    --info 0.8 \
    --interval 5e-05 \
    --keep eids_passing_array_qc.txt \
    --lower 5e-08 \
    --maf 0.01 \
    --num-auto 22 \
    --out copd_prs_imp_snps \
    --pheno copd_prs_pheno.tsv \
    --pvalue P \
    --seed 2375121353 \
    --snp SNP \
    --stat beta \
    --target /mnt/bk01-home01/shared/uk_biobank/GWAS/bgen_files/chr#,/mnt/bk01-home01/shared/uk_biobank/GWAS/project-specific_sample_files/ukb19056_imp_chr1_v3_s487297.sample \
    --thread 4 \
    --type bgen \
    --upper 0.5

Initializing Genotype file: 
/mnt/bk01-home01/shared/uk_biobank/GWAS/bgen_files/chr# 
(bgen) 
With external fam file: 
/mnt/bk01-home01/shared/uk_biobank/GWAS/project-specific_sample_files/ukb19056_imp_chr1_v3_s487297.sample 

Start processing 
Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed 
================================================== 

SNP extraction/exclusion list contains 5 columns, will 
assume first column contains the SNP ID 

Base file: 
Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed.txt 
Header of file is: 
SNP Chromosome Pos Coded Non_coded N Neff Coded_freq beta 
SE P Info 

19814168 variant(s) observed in base file, with: 
2937205 variant(s) excluded based on user input 
16876963 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

487409 people (0 male(s), 0 female(s)) observed 
400993 founder(s) included 

76209143 variant(s) not found in previous data 
9517 variant(s) with mismatch information 
16876963 variant(s) included 

Calculate MAF and perform filtering on target SNPs 
================================================== 

23512 variant(s) excluded based on genotype missingness 
threshold 
10511803 variant(s) excluded based on MAF threshold 
12662 variant(s) excluded based on INFO score threshold 
6328986 variant(s) included 

Phenotype file: copd_prs_pheno.tsv 
Column Name of Sample ID: eid 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is 
expected. 

There are a total of 1 phenotype to process 

Start performing clumping 

The error message:
terminate called recursively
terminate called after throwing an instance of 'Error: 
Execution halted

I also tried running the same analysis using a different (related) quantitative target trait, and got essentially the same log/error.

Best wishes,
Dom

Sam Choi

Mar 17, 2021, 8:29:03 PM
to PRSice
In short: you don't have enough memory to run the new clumping algorithm, which is extremely memory intensive, especially with imputed data (optimizing that is on my to-do list, but I haven't had the time yet).

As for the reason why it took so long:
1. With 6,328,986 variants, this is inherently an expensive operation, and it will take a long time just to read all the SNPs and calculate the PRS.
2. I see that you are using a binary trait with covariates. I am not sure how many covariates you've included, but PRSice is trying to run at least 6,000 logistic regressions on 400k samples, which is extremely slow to start with. One way to speed things up is to residualize the phenotype, though that might not always be correct (it depends on the trait and the case/control ratio).
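To make the residualization idea in point 2 concrete, here is a minimal sketch, assuming it means regressing the phenotype on the covariates once and carrying the residuals forward as the new phenotype. The function name `residualize` and the synthetic data are made up for illustration and are not part of PRSice. For a binary trait, ordinary least squares on a 0/1 outcome is only an approximation, which is the caveat noted above.

```python
import numpy as np

def residualize(pheno, covariates):
    """Return the residuals of pheno after OLS regression on covariates plus an intercept."""
    y = np.asarray(pheno, dtype=float)
    X = np.asarray(covariates, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # allow a single covariate passed as a 1-D array
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Synthetic illustration only: three made-up covariates, not real UK Biobank data.
rng = np.random.default_rng(0)
n = 1000
covs = rng.normal(size=(n, 3))
pheno = covs @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)
resid = residualize(pheno, covs)
```

The residuals have mean zero and are uncorrelated with every covariate by construction, so they could be written out as a quantitative phenotype and analysed without a covariate file.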

So I guess you can try increasing the memory (not PRSice's --memory option, but the memory limit of your server) and see if that helps (though I still think the regression analysis will take a long time). Alternatively, use --fastscore --no-regress to get the scores from PRSice and do the regression downstream. By decoupling the regression from the scoring, you might be able to run the full analysis within your server settings, though I am not too optimistic about that.
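As a rough illustration of "doing the regression downstream", the sketch below fits a logistic regression of case status on a score plus one covariate using Newton-Raphson in NumPy. It is not how PRSice runs its regressions. The function name `fit_logistic` and all the data are hypothetical, and the arrays are simulated because the layout of the score files is not shown in this thread. In practice you would load the scores PRSice wrote out and would probably use an established routine such as R's glm.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson; returns coefficients, intercept first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))  # fitted probabilities
        w = p * (1.0 - p)                           # per-sample weights
        grad = design.T @ (y - p)                   # score vector
        hess = design.T @ (design * w[:, None])     # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated stand-ins for one PRS column, one covariate, and case/control status.
rng = np.random.default_rng(1)
n = 5000
prs = rng.normal(size=n)
cov = rng.normal(size=n)
logit = -1.0 + 0.8 * prs + 0.3 * cov
case = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
beta = fit_logistic(np.column_stack([prs, cov]), case)
```

Here `beta[1]` is the estimated log odds ratio per unit of the score. Repeating the fit for each threshold of interest can be done outside PRSice, for example as separate cluster jobs.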

Sam