use pgenlibr in R?


jie huang

Apr 29, 2025, 3:41:24 AM
to plink2-users

Dear Chris:

The PLINK --glm function is wonderful. But now I need to run some more complicated analyses such as Cox regression or LASSO using genome-wide PGEN data.

Such functions are easily accessible in R, for example coxph(). My question is: can I use pgenlibr to read PLINK-formatted genetic data and then feed the data to coxph() in R?


Thanks!

Jie

Christopher Chang

Apr 29, 2025, 1:16:43 PM
to plink2-users
If you're uncomfortable with writing the necessary glue code, there are preexisting Cox and Lasso packages which support PGEN input:

jie huang

Apr 30, 2025, 7:46:16 PM
to plink2-users

Thanks, Chris!  

How about LMM, the type of analyses that could be done by BOLT-LMM or SAIGE?
I hope that they support PGEN format as well.

Whether it is PGEN or BGEN, I feel that a solution is needed for the upcoming pangenome era and the potential deprecation of REF/ALT [A1/A2].

Best regards,
Jie

Christopher Chang

May 1, 2025, 12:32:50 PM
to plink2-users
1. GCTA supports PGEN input, and includes an LMM implementation.
2. Your comment about REF/ALT makes no sense.  As I previously explained (https://groups.google.com/g/plink2-users/c/o8YzKsZeRYs/m/TDraaAe2AAAJ ), and you either forgot or failed to comprehend, current PLINK2 builds already report when there is no real REF allele ("PROVISIONAL_REF?" column in output files, INFO/PR tag in VCF/.pvar).
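
As an illustration of the INFO/PR tag mentioned in point 2, here is a minimal Python sketch that checks whether a VCF/.pvar INFO field carries a PR flag. It assumes the standard VCF convention that INFO entries are separated by semicolons and that flags appear as bare keys. The records, variant IDs, and the helper name `has_provisional_ref` are hypothetical and exist only for this example.

```python
def has_provisional_ref(info):
    """Return True if a VCF/.pvar INFO string contains the PR flag."""
    if info in (".", ""):
        return False  # "." means the INFO field is empty
    # INFO entries are ';'-separated; flags are bare keys, others are KEY=VALUE.
    keys = [entry.split("=", 1)[0] for entry in info.split(";")]
    return "PR" in keys


# Hypothetical .pvar-style records: (CHROM, POS, ID, REF, ALT, INFO)
records = [
    ("1", "123", "var_a", "A", "C,G,T", "."),
    ("1", "456", "var_b", "G", "T", "PR"),
    ("1", "789", "var_c", "C", "A", "AC=5;PR"),
]

# Map each variant ID to whether its REF allele is flagged as provisional.
flags = {rec[2]: has_provisional_ref(rec[5]) for rec in records}
print(flags)
```

Because PR is an ordinary INFO flag, a file annotated this way remains a regular VCF record for other tools, which can simply ignore the tag.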

jie huang

Jun 23, 2025, 8:55:59 PM
to plink2-users

Dear Chris:

I previously tried GCTA for LMM and contacted Yang Jian. Unfortunately, GCTA failed to run LMM on UK Biobank data, because it could not calculate the GRM on such a big dataset.

Regarding the REF/ALT issue, I did notice that PLINK2 outputs the following columns: "REF  ALT  PROVISIONAL_REF?  A1  OMITTED".
However, whether it is called REF or PROVISIONAL_REF, my understanding is that it is still based on binary thinking.
In the future, all 4 possible bases (A/C/G/T) could occur at any base pair. Therefore, it should be quaternary.

You mentioned that there are preexisting Cox and Lasso packages. Could you please let me know which software you recommend?

BTW, my UK Biobank data in pfile format takes 2.7 TB on the server, which costs me money to store.
Is there a way to further compress the data?

Thank you very much!

Best regards,
jie

Chris Chang

Jun 23, 2025, 11:39:34 PM
to jie huang, plink2-users
Please read the beginning of any VCF specification from the last ten years, and then reply to this message with an explanation of how the “quaternary” case you describe (not to mention other, more complicated cases) is ALREADY SUPPORTED.

To view this discussion visit https://groups.google.com/d/msgid/plink2-users/f3ff307a-9de6-4f16-8319-607d52e6e279n%40googlegroups.com.

jie huang

Jun 24, 2025, 7:35:40 PM
to plink2-users
Dear Chris:

Thank you very much! 

I have not fully studied the latest VCF format and specification.
Sorry if I misunderstood or misinterpreted something. 

If 4 people's genotypes are AC, AG, AT, and CG respectively, then VCF or PLINK would write them as below:
#CHROM POS REF ALT  Man1  Man2   Man3   Man4
1   123  A   C       1/1    ?/?    ?/?   ?/? 
1   123  A   G       ?/?    1/1    ?/?    ?/? 
1   123  A   T       ?/?    ?/?     1/1    ?/? 
1   123  C   G       ?/?    ?/?     ?/?    1/1 

What I hope is something like below:
#CHROM POS Man1  Man2   Man3 
1   123  AC   AG   AT  CG

Best regards,
Jie

Chris Chang

Jun 24, 2025, 7:57:10 PM
to jie huang, plink2-users
This isn't about the "latest" VCF specification, it's about things that were defined more than a decade ago and have been supported by plink2 for years.

You need to take a step back and learn the basics.  The example in your latest post has multiple errors.  Even if you read just the first subsection ("1.1 Example") from a VCF specification published anytime in the last 10 years, you should be able to identify something critical you're misunderstanding.

jie huang

Jun 24, 2025, 8:31:22 PM
to plink2-users
OK. It seems that VCF will write these 4 people's genotypes as below:
#CHROM  POS  ID  REF  ALT     QUAL  FILTER  INFO  FORMAT  Man1  Man2  Man3  Man4
1       123  .   A    C,G,T   .     .       .     GT      0/1   0/2   0/3   1/2

This time, it uses A as the REF. Next time, it might use C as the REF for 4 other people.
My main point is that we should forget about REF, because of the pangenome idea.

I think PLINK cannot deal with 0/3 and might still use 4 lines in the .bim file for this hypothetical SNP.

Best regards,
jie
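
To make the multiallelic record in the post above concrete, here is a minimal Python sketch that decodes its GT values back into allele letters. It relies only on the VCF convention that GT index 0 refers to REF and indices 1, 2, 3, ... refer to the comma-separated ALT alleles in order. The helper name `decode_gt` is hypothetical, and the sample names are the ones used in the post.

```python
def decode_gt(ref, alt, gt):
    """Translate a VCF GT string such as '0/3' into allele letters."""
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1..n = ALT alleles in order
    sep = "|" if "|" in gt else "/"   # '|' marks phased calls, '/' unphased
    calls = []
    for idx in gt.split(sep):
        # '.' denotes a missing allele call and is passed through unchanged.
        calls.append("." if idx == "." else alleles[int(idx)])
    return sep.join(calls)


# The single multiallelic record from the post, tab-separated as in a VCF file.
line = "1\t123\t.\tA\tC,G,T\t.\t.\t.\tGT\t0/1\t0/2\t0/3\t1/2"
fields = line.split("\t")
ref, alt = fields[3], fields[4]
samples = ["Man1", "Man2", "Man3", "Man4"]

# GT is the first FORMAT subfield when present, so take the part before ':'.
decoded = {
    name: decode_gt(ref, alt, sample.split(":")[0])
    for name, sample in zip(samples, fields[9:])
}
print(decoded)
# → {'Man1': 'A/C', 'Man2': 'A/G', 'Man3': 'A/T', 'Man4': 'C/G'}
```

One line therefore carries all four genotypes (AC, AG, AT, CG), including the 0/3 and 1/2 calls.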

Chris Chang

Jun 24, 2025, 10:47:06 PM
to jie huang, plink2-users
Without any further help from me, you should be able to answer (i) how plink2 CAN deal with 0/3, (ii) how plink2 can distinguish between REF=A and “no REF” here, and (iii) how plink2’s approach PRESERVES COMPATIBILITY WITH EXISTING VCF-PROCESSING SOFTWARE.
