Fastest ways to run GWAS across a large set of phenotypes

Kexin Huang

Oct 11, 2022, 6:49:30 PM

to plink2-users

Hi Christopher,

I want to run GWAS on 1000 phenotypes for ~500K samples and 500K SNPs. What is the quickest way to do it? I have a server with 250 threads. One option I saw is to feed the entire phenotype file into PLINK; is there any optimization happening under the hood in that case? Alternatively, I could create a separate file for each phenotype, run PLINK on one phenotype at a time, and parallelize the runs with GNU parallel. Or is there any other PLINK functionality that can accelerate this?

Thank you so much for the help!

Kexin

Zuxi Cui

Oct 11, 2022, 8:41:04 PM

to Kexin Huang, plink2-users

Hi Kexin,

In your case, I would write a shell loop that selects a different phenotype on each iteration with "--pheno-col-nums" or "--pheno-name".

You can supply your phenotype file with "--pheno".

Let me know if anyone has better ideas.

Terry
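
A minimal sketch of the loop described above, written in Python rather than shell so that it only builds the command lines and does not run them. The file names (mydata, pheno.txt, covar.txt), the output prefix, and the helper name per_phenotype_commands are hypothetical placeholders; the flags used (--bfile, --pheno, --pheno-name, --covar, --glm, --out) are standard PLINK 2 options. The reply further down the thread suggests that batching phenotypes into one run is usually faster than a loop like this.

```python
import shlex


def per_phenotype_commands(pheno_names, bfile="mydata",
                           pheno_file="pheno.txt", covar_file="covar.txt"):
    """Build one plink2 command line per phenotype (returned, not executed).

    All file names and the output prefix are hypothetical placeholders.
    """
    commands = []
    for name in pheno_names:
        argv = [
            "plink2",
            "--bfile", bfile,          # genotype fileset prefix (assumed)
            "--pheno", pheno_file,     # phenotype table with a header row
            "--pheno-name", name,      # pick a single phenotype column
            "--covar", covar_file,     # covariate table (assumed)
            "--glm",
            "--out", f"gwas_{name}",   # one output prefix per phenotype
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in per_phenotype_commands(["height", "bmi"]):
        print(cmd)
```

Each printed line can be pasted into a shell or handed to a job scheduler.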


Christopher Chang

Oct 12, 2022, 4:13:18 PM

to plink2-users

1. Make sure to use PLINK 2.0 instead of PLINK 1.9 for this. The relative speedup often exceeds 100x, and sometimes even 1000x. (There is no need to use GNU parallel with PLINK 2.0 --glm; PLINK 2 will do its own multithreading.)

2. If some of your quantitative phenotypes have no missing values, you should process them in a single PLINK 2 run. There is a multi-phenotype optimization that can provide up to a ~10x speed multiplier.

3. For case/control phenotypes, the 'cc-residualize' modifier (which performs a single regression on the covariates and applies the resulting offsets to the logistic regressions for all variants, instead of re-fitting the covariates in every single logistic regression) speeds up the calculation substantially at a fairly small accuracy cost.
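
One way to act on points 2 and 3 is to first find which phenotype columns have no missing values, then put those into a single run. The sketch below is only an illustration under stated assumptions: the phenotype table is tab-delimited with a header row, missing values are coded as "NA", "nan", or "-9", and all file names, the output prefix, and both helper names are hypothetical. It builds command strings and does not execute PLINK. I believe 'cc-residualize' is meant to be combined with 'hide-covar', but please confirm against the --glm documentation.

```python
import csv
import io
import shlex

# Assumed missing-value codes; adjust to match your phenotype file.
MISSING = {"NA", "nan", "-9"}


def split_by_missingness(pheno_text, id_cols=("FID", "IID")):
    """Return (complete, incomplete) phenotype column names.

    pheno_text is the content of a tab-delimited table with a header row.
    """
    reader = csv.DictReader(io.StringIO(pheno_text), delimiter="\t")
    pheno_cols = [c for c in reader.fieldnames
                  if c.lstrip("#") not in id_cols]
    has_missing = {c: False for c in pheno_cols}
    for row in reader:
        for c in pheno_cols:
            if row[c] in MISSING:
                has_missing[c] = True
    complete = [c for c in pheno_cols if not has_missing[c]]
    incomplete = [c for c in pheno_cols if has_missing[c]]
    return complete, incomplete


def batched_glm_command(pheno_names, modifiers=("hide-covar",),
                        bfile="mydata", pheno_file="pheno.txt",
                        covar_file="covar.txt", out="gwas_batch",
                        threads=250):
    """Build one plink2 --glm command covering several phenotypes."""
    argv = [
        "plink2",
        "--bfile", bfile,
        "--pheno", pheno_file,
        "--pheno-name", *pheno_names,   # several columns in one run
        "--covar", covar_file,
        "--glm", *modifiers,
        "--threads", str(threads),
        "--out", out,
    ]
    return shlex.join(argv)


if __name__ == "__main__":
    table = ("#FID\tIID\theight\tbmi\n"
             "1\t1\t170\t22.5\n"
             "2\t2\t165\tNA\n")
    complete, incomplete = split_by_missingness(table)
    # Quantitative phenotypes without missing values go into one run.
    print(batched_glm_command(complete))
    # Case/control phenotype (hypothetical name) with cc-residualize.
    print(batched_glm_command(["t2d"],
                              modifiers=("hide-covar", "cc-residualize"),
                              out="gwas_cc"))
```

Phenotypes that land in the incomplete list can still be run, either together or in smaller groups; they just may not benefit from the multi-phenotype optimization mentioned in point 2.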

Kexin Huang

Oct 12, 2022, 5:44:35 PM

to plink2-users

Great, thank you both!

Kexin
