PRS training on PennPRS platform: Step 4 Error


Akwo, Elvis A

Feb 20, 2026, 2:41:52 PM
to pen...@googlegroups.com
Hi,

I am running PRS training on the PennPRS platform and have hit a snag; I would like your help. Could you please share the .bim files of the ancestry-specific references that you use for PRS training? The files I need are:
1kg_hm3_EUR_ref.bim
1kg_hm3_AFR_ref.bim
1kg_hm3_AMR_ref.bim

I keep getting a Step 4 error, "number of columns of matrices must match", and I think it is because the SNP universe in my ancestry-specific summary data differs across the ancestry groups, even though I use the HapMap rsID list provided via your Dropbox.

Thanks,

Elvis Akwo, MD, MS, PhD

Research Scientist

Vanderbilt University Medical Center

Division of Nephrology & Hypertension

Vanderbilt Center for Kidney Disease

1161 21st Ave South, S-3119 MCN

Nashville, TN 37232-2372

P. 336-918-6972

Email: elvis...@vumc.org





Jin, Jin

Feb 21, 2026, 11:41:30 AM
to Akwo, Elvis A, pen...@googlegroups.com

Hi Elvis,

 

Thanks for contacting us! Here is the link to the files you requested: https://www.dropbox.com/scl/fo/21xrx1z3iinsp5vl2j6ew/AEp1BavM3ti8_kRyHu-DP5o?rlkey=87h1wml96a6q53q6u8ybeusgy&st=qkywe2y5&dl=0

 

Please let us know if you need help debugging the issue. It is preferred to have a consistent format of the GWAS summary data files across ancestries. If you are using the online platform, please refer to our tutorial at https://pennprs.gitbook.io/pennprs; if you are using the offline pipeline, please refer to our tutorial at https://github.com/PennPRS/Pipeline/wiki. If you encounter more issues, please feel free to contact us again.

 

Thanks!

Jin

 

--
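You received this message because you are subscribed to the "PennPRS" group on Google Groups.
To unsubscribe from this group and stop receiving emails from it, send an email to pennprs+u...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/pennprs/DM3PR12MB9286A5925F8D6EA9821D02A6FA68A%40DM3PR12MB9286.namprd12.prod.outlook.com
For more options, visit https://groups.google.com/d/optout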

Akwo, Elvis A

Feb 21, 2026, 10:23:24 PM
to Jin, Jin, pen...@googlegroups.com
Hi Jin,

Thanks for your timely reply and for sharing the reference 1kg_hm3_*_ref.bim files!

I am trying to train a multi-ancestry PRS for cystic kidney disease using the PROSPER-pseudo method on the online PennPRS platform, but I keep getting the following error in step 4.1 (Run PROSPER-pseudo by MCCV): Error in {:task 6 failed - "number of columns of matrices must match (see arg 2)".

I've followed all the steps in the online tutorial to process the summary data for the three population groups under study. I filtered the SNPs in my summary data files (EUR, AFR, and AMR) to keep only the 1.2M SNPs present in the reference 1kg_hm3_*_ref.bim files, so that all three files used for the analysis have the same number of SNPs, the same SNP ordering, and the same header info. Yet I still get the same error. Does the PennPRS algorithm perform additional internal filtering (e.g., of ambiguous or palindromic SNPs) that may drop SNPs differentially across ancestry groups, creating different SNP sets in step 4.1 even though the summary data inputs started with the same SNP sets? I have attached the error log file and the R script I use to preprocess my summary data files.
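To make the question about palindromic SNPs concrete, here is a minimal Python sketch of how strand-ambiguous allele pairs (A/T and C/G) are usually flagged. It is an illustration only: whether PennPRS or PROSPER applies such a filter internally is exactly what is being asked, and the function names and toy rows below are hypothetical.

```python
# Illustrative only: this is not PennPRS or PROSPER code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """Return True for palindromic allele pairs (A/T, T/A, C/G, G/C)."""
    a1, a2 = a1.upper(), a2.upper()
    # Multi-base alleles (indels) are not in COMPLEMENT, so they return False.
    return COMPLEMENT.get(a1) == a2


def count_ambiguous(rows) -> int:
    """Count strand-ambiguous SNPs among (snp_id, a1, a2) tuples."""
    return sum(is_strand_ambiguous(a1, a2) for _, a1, a2 in rows)


# Hypothetical toy rows, not taken from any real summary statistics file.
toy_rows = [
    ("rs_toy1", "A", "T"),  # palindromic
    ("rs_toy2", "C", "G"),  # palindromic
    ("rs_toy3", "A", "G"),  # not palindromic
    ("rs_toy4", "t", "c"),  # not palindromic (lowercase is handled)
]
n_ambiguous = count_ambiguous(toy_rows)
```

Running the same count on each ancestry's filtered file would show whether such a filter could remove different numbers of SNPs per ancestry.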

I have three sets of summary statistics for EUR ancestry (MVP, UKBB, and FINNGEN) and one each for AFR and AMR (both from MVP). So, I am running three jobs, each combining one of the EUR sumstats with the MVP AFR and AMR sumstats. All three runs fail at step 4.1 despite the input files having the same number of SNPs and header info.

For example, the sanity check in my Rscript shows the following number of SNPs in each file after the filtering process:

Before final filtering: EUR=1137091 | AFR=1156097 | AMR=1169696
Shared SNPs (EUR∩AFR∩AMR within ref_universe): 1103803
Final rows: EUR=1103803 | AFR=1103803 | AMR=1103803
Identical SNP vectors? EUR==AFR: TRUE | EUR==AMR: TRUE
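The logic behind this sanity check can be sketched in Python roughly as follows. This is not the attached R script: the function names are hypothetical, the SNP IDs are toy values, and real inputs would be read from the .bim and summary statistics files.

```python
def shared_snps(ref_ids, *sumstats_ids):
    """SNP IDs found in the reference and in every sumstats set, in reference order."""
    common = set(ref_ids)
    for ids in sumstats_ids:
        common &= set(ids)
    return [snp for snp in ref_ids if snp in common]


def filter_to(ids, shared):
    """Keep only the shared SNPs from one file, reordered to match `shared`."""
    present = set(ids)
    return [snp for snp in shared if snp in present]


# Hypothetical toy IDs standing in for the reference and the three sumstats files.
ref = ["rs_toy1", "rs_toy2", "rs_toy3", "rs_toy4", "rs_toy5"]
eur = ["rs_toy5", "rs_toy1", "rs_toy2", "rs_toy9"]
afr = ["rs_toy2", "rs_toy1", "rs_toy5", "rs_toy3"]
amr = ["rs_toy1", "rs_toy2", "rs_toy4", "rs_toy5"]

shared = shared_snps(ref, eur, afr, amr)
final = {name: filter_to(ids, shared)
         for name, ids in (("EUR", eur), ("AFR", afr), ("AMR", amr))}

# Same checks as in the log above: equal row counts and identical SNP vectors.
same_length = len({len(v) for v in final.values()}) == 1
identical = final["EUR"] == final["AFR"] == final["AMR"]
```

Comparing full vectors rather than just row counts also checks that the SNP ordering is identical across files.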

Any help you can provide to clarify what I may be doing wrong will be appreciated.

Thanks!

Elvis

PRS_model_training.log
format_sumstats_cystickd_alleleslock_v2.R

Jin, Jin

Feb 22, 2026, 6:17:22 PM
to Akwo, Elvis A, pen...@googlegroups.com

Hi Elvis,

 

I think the issue is not due to the format of the input GWAS summary data; in fact, the files can have different numbers of SNPs across ancestry groups. I just tested the online multi-ancestry training pipeline using public GWAS summary data from the GWAS Catalog, and it ran successfully.

 

I noticed that the estimated heritability is 0.0021. Such a small heritability may cause problems in the original PROSPER algorithm, and it might also indicate that a PRS model for the trait would have minimal power. I can offer some suggestions on this analysis if you can provide more details about the application.

 

Thanks,

Jin

 

Akwo, Elvis A

Feb 23, 2026, 12:37:37 AM
to Jin, Jin, pen...@googlegroups.com
Hi Jin,

Thanks for your insights. We performed a multi-ancestry GWAS of cystic kidney disease using data from five populations, namely: the EUR populations in UKBB and FINNGEN, and the MVP EUR, AFR, and AMR populations.  We want to use the 5 sets of summary statistics to develop the best-predicting PRS for cystic kidney disease in each population using multi-ancestry approaches on the online PennPRS platform.

For each ancestry, the PGS weights would then be used to compute polygenic risk scores for cystic kidney disease among individuals in Vanderbilt's DNA biobank (BioVU), which is linked to de-identified EHRs. The goal is to leverage this PRS to identify two groups of patients:
  1. Discordant resilient individuals with a high PRS for cystic kidney disease (for example, >=90th percentile) but with no clinically observed cystic kidney disease, and
  2. Discordant susceptible individuals with a low PRS for cystic kidney disease (for example, <=10th percentile) but with clinically diagnosed cystic kidney disease.
We would then study these individuals for evidence of either a "second hit" that may explain why low-PRS individuals develop cystic kidney disease, or genetic variants for resilience that protect against cystic kidney disease in high-PRS individuals.
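The two discordant groups described above could be selected along these lines. This is a minimal Python sketch under stated assumptions: the nearest-rank percentile definition, the function names, and the toy records are all illustrative choices, and in practice the cutoffs would presumably be computed within each ancestry group.

```python
import math


def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: smallest score with at least pct% of scores at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def discordant_groups(people, low_pct=10, high_pct=90):
    """Split (person_id, prs, has_disease) records into the two discordant groups."""
    scores = [prs for _, prs, _ in people]
    low_cut = percentile_cutoff(scores, low_pct)
    high_cut = percentile_cutoff(scores, high_pct)
    # High PRS but no diagnosis: "discordant resilient".
    resilient = [pid for pid, prs, dx in people if prs >= high_cut and not dx]
    # Low PRS but diagnosed: "discordant susceptible".
    susceptible = [pid for pid, prs, dx in people if prs <= low_cut and dx]
    return resilient, susceptible


# Hypothetical toy cohort: PRS values 1..10, only the first and last person diagnosed.
toy_people = [(f"p{i:02d}", float(i), i in (1, 10)) for i in range(1, 11)]
resilient, susceptible = discordant_groups(toy_people)
```

In this toy cohort, p09 has a high PRS without a diagnosis and p01 has a low PRS with one, so each group contains a single person.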

I would be grateful for any insights you can provide on the application of PRSes/the PennPRS platform to this line of research.

Also, I wasn't sure I understood correctly. Could you clarify whether it is a feature of the PROSPER algorithm that, if the trait's heritability is low (as in this case), we might get an error about the number of columns in the matrices? And if that is the case, would you suggest we run LD score regression to estimate the trait's SNP heritability (h^2) in each ancestry, and then proceed with PRS training only for traits with heritability estimates above a certain threshold, say >=0.05 in each ancestry?

Thanks,

Elvis



Jin, Jin

Feb 23, 2026, 11:25:39 PM
to Akwo, Elvis A, pen...@googlegroups.com

Hi Elvis,

 

Thanks for the information. Could you give me a rough range for the GWAS sample sizes across these ancestry groups? Multi-ancestry methods may run into issues when GWAS sample sizes differ a lot between ancestry groups.

 

And about your question: “Also, I wasn't sure I understood correctly. Could you clarify whether it is a feature of the PROSPER algorithm that, if the trait's heritability is low (as in this case), we might get an error about the number of columns in the matrices? And if that is the case, would you suggest we run LD score regression to estimate the trait's SNP heritability (h^2) in each ancestry, and then proceed with PRS training only for traits with heritability estimates above a certain threshold, say >=0.05 in each ancestry?”

 

Technically, a low heritability does not necessarily lead to an error from PROSPER. But in some cases, either the single-ancestry analysis step using lassosum2 (an intermediate step in PROSPER that fits a single-ancestry PRS model for each ancestry) or the multi-ancestry analysis step in PROSPER could produce a PRS model with minimal power and lead to errors. This is likely an issue with the original PROSPER algorithm, not the PennPRS pipeline.

 

If it’s possible for you to share the GWAS summary data with me, I’m happy to take a closer look at the issue and give you some suggestions on alternative solutions.

 

Thanks,

Jin
