PRSice 2 error


Dom Byrne


Mar 17, 2021, 5:35:23 AM

to PRSice

Hi,

I've recently been trying to run PRSice using the imputed array genotypes from UKBiobank. I was having trouble with the job not completing within 7 days (the runtime limit on my university cluster). Having seen that newer versions of PRSice support multi-threaded clumping, I upgraded, but now have a new error.

The log file:

PRSice 2.3.3 (2020-08-05)

https://github.com/choishingwan/PRSice

GNU General Public License v3

If you use PRSice in any published work, please cite:

Choi SW, O'Reilly PF.

PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.

GigaScience 8, no. 7 (July 1, 2019)

2021-03-10 10:23:05

/mnt/iusers01/bk01/v45331db/software/PRSice/PRSice_linux \

--a1 Coded \

--a2 Non_coded \

--allow-inter \

--bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \

--base Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed.txt \

--beta \

--binary-target T \

--bp Pos \

--chr Chromosome \

--clump-kb 250kb \

--clump-p 1.000000 \

--clump-r2 0.100000 \

--cov copd_prs_cov.tsv \

--extract copd_prs_imp_snps.valid \

--geno 0.02 \

--ignore-fid \

--info 0.8 \

--interval 5e-05 \

--keep eids_passing_array_qc.txt \

--lower 5e-08 \

--maf 0.01 \

--num-auto 22 \

--out copd_prs_imp_snps \

--pheno copd_prs_pheno.tsv \

--pvalue P \

--seed 2375121353 \

--snp SNP \

--stat beta \

--target /mnt/bk01-home01/shared/uk_biobank/GWAS/bgen_files/chr#,/mnt/bk01-home01/shared/uk_biobank/GWAS/project-specific_sample_files/ukb19056_imp_chr1_v3_s487297.sample \

--thread 4 \

--type bgen \

--upper 0.5

Initializing Genotype file:

/mnt/bk01-home01/shared/uk_biobank/GWAS/bgen_files/chr#

(bgen)

With external fam file:

/mnt/bk01-home01/shared/uk_biobank/GWAS/project-specific_sample_files/ukb19056_imp_chr1_v3_s487297.sample

Start processing

Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed

==================================================

SNP extraction/exclusion list contains 5 columns, will

assume first column contains the SNP ID

Base file:

Shrine_FEV1_to_FVC_ratio_gwas_sum_stats_processed.txt

Header of file is:

SNP Chromosome Pos Coded Non_coded N Neff Coded_freq beta

SE P Info

19814168 variant(s) observed in base file, with:

2937205 variant(s) excluded based on user input

16876963 total variant(s) included from base file

Loading Genotype info from target

==================================================

487409 people (0 male(s), 0 female(s)) observed

400993 founder(s) included

76209143 variant(s) not found in previous data

9517 variant(s) with mismatch information

16876963 variant(s) included

Calculate MAF and perform filtering on target SNPs

==================================================

23512 variant(s) excluded based on genotype missingness

threshold

10511803 variant(s) excluded based on MAF threshold

12662 variant(s) excluded based on INFO score threshold

6328986 variant(s) included

Phenotype file: copd_prs_pheno.tsv

Column Name of Sample ID: eid

Note: If the phenotype file does not contain a header, the

column name will be displayed as the Sample ID which is

expected.

There are a total of 1 phenotype to process

Start performing clumping

The error message:

terminate called recursively

terminate called after throwing an instance of 'Error:

Execution halted

I also tried running the same analysis using a different (related) quantitative target trait, and got essentially the same log/error.

Best wishes,

Dom

Sam Choi


Mar 17, 2021, 8:29:03 PM

to PRSice

So in short: you don't have enough memory to run the new clumping algorithm, which is extremely memory-intensive, especially with imputed data (optimizing that is on my to-do list, but I haven't had the time yet).

As for why it took so long:
1. With 6,328,986 variants, this is inherently an expensive operation, and it will take a long time just to read all the SNPs and calculate the PRS.

2. I see that you are using a binary trait with covariates. I am not sure how many covariates you've included, but PRSice is trying to run at least 6,000 logistic regressions on 400k samples, which is extremely slow to begin with. One way to speed things up is to residualize the phenotype, though that might not always be correct (it depends on the trait and the case/control ratio).
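
To make "residualize the phenotype" concrete, here is a minimal sketch in Python, assuming a plain ordinary-least-squares fit of the phenotype on the covariates plus an intercept. The function name `residualize` and the synthetic data are hypothetical and only for illustration; this is not what PRSice does internally, and as noted above a linear residualization may not be appropriate for a binary trait.

```python
import numpy as np


def residualize(phenotype, covariates):
    """Return residuals from an OLS fit of phenotype on covariates plus an intercept.

    Hypothetical helper for illustration; not part of PRSice.
    """
    y = np.asarray(phenotype, dtype=float)
    X = np.asarray(covariates, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single covariate as one column
    design = np.column_stack([np.ones(len(y)), X])  # intercept + covariates
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


# Synthetic stand-in data (assumption: 3 covariates, 1000 samples).
rng = np.random.default_rng(0)
covariates = rng.normal(size=(1000, 3))
phenotype = covariates @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=1000)

residuals = residualize(phenotype, covariates)
```

The residuals could then be supplied as the phenotype without covariates, so that each threshold needs only a simple regression rather than a full covariate-adjusted model.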

So I guess you can try increasing the memory (not PRSice's --memory option, but the memory limit of your server) and see if that helps, though I'd expect the regression analysis to still take a long time. Alternatively, use --fastscore --no-regress to get the scores from PRSice and do the regression downstream. By decoupling the regression from the analysis, you might be able to run the full analysis with your server settings, though I am not too optimistic about that.
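
As a rough illustration of doing the regression downstream, the sketch below fits a logistic regression of a binary phenotype on a score and one covariate using Newton's method (IRLS) in NumPy. Everything here is an assumption for illustration: the function name `fit_logistic` is hypothetical, and the synthetic `score` and `covariate` arrays stand in for the scores PRSice writes out and for your own covariate file, whose loading is not shown.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by Newton's method; returns [intercept, coefficients...].

    Hypothetical helper for illustration; not part of PRSice.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))  # fitted probabilities
        weights = p * (1.0 - p)
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * weights[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic stand-in data (assumption: one score column, one covariate).
rng = np.random.default_rng(1)
n = 5000
score = rng.normal(size=n)
covariate = rng.normal(size=n)
logit = -1.0 + 0.8 * score + 0.3 * covariate
case = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

beta = fit_logistic(np.column_stack([score, covariate]), case)
```

In practice a statistics package would also give standard errors and p-values; the point is only that the model can be fitted once per chosen threshold outside PRSice.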