plink2 pca approx question


Lin T

Aug 13, 2018, 12:17:18 PM
to plink2-users
Hi, 
I am running --pca approx with plink2 on GWAS data (300K+ SNPs, 300K+ samples; bgen v1.2 input) and would like to ask a few questions.
1. With --pca approx, how many PCs will I get in the output file by default? Is it 20?
2. I ran --pca approx with the default number of PCs, and it has been running for two days. Which step takes the time and makes it so slow? I am curious.
3. In my case, if I use plink2 (July 17 or newer), can I use --threads to run multiple threads and speed up the PCA calculation?
4. Any other suggestions for speeding up my PCA calculation?
Thank you in advance for your help and time!
Lin

Christopher Chang

Aug 13, 2018, 12:59:40 PM
to plink2-users
1. --pca approx defaults to 10 PCs.  (You can find this out with "plink2 --help pca".)
2. PCA requires a bunch of slow large-matrix linear algebra operations.  What machine and operating system are you running on?  What is the console output so far?
3. plink2 will automatically try to use as many threads as your machine has cores.  --threads is primarily used to *reduce* the number of compute threads when working on a shared machine.
4. Runtime is roughly quadratic in the number of accurate PCs, so if you know you don't need 10 PCs you can speed things up substantially by requesting fewer.  (But if you're in doubt, don't tweak this.)
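
To make points 3 and 4 concrete, here is a minimal sketch of a command line that requests fewer PCs and caps the thread count. The helper name build_pca_cmd, the my_data prefix, the --pfile input, and the values 5 and 8 are all illustrative assumptions, not values from this thread. If runtime really is roughly quadratic in the PC count, asking for 5 PCs instead of 10 would cut that part of the work to about a quarter.

```shell
#!/bin/bash
# Sketch only: print a plink2 PCA command with an explicit PC count and thread cap.
# All argument values below are hypothetical placeholders.

build_pca_cmd() {
  local prefix="$1"     # fileset prefix (assumes .pgen/.psam/.pvar input)
  local n_pcs="$2"      # number of PCs to request (the default is 10)
  local n_threads="$3"  # upper bound on compute threads
  echo "plink2 --pfile ${prefix} --pca ${n_pcs} approx --threads ${n_threads} --out ${prefix}_pca"
}

# Print the command instead of running it, so it can be reviewed first.
build_pca_cmd my_data 5 8
```

The function only prints the command; paste the printed line into a job script to actually run it. Leaving out --threads lets plink2 pick the thread count itself, as described in point 3.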

Lin T

Aug 13, 2018, 3:20:44 PM
to chrch...@gmail.com, plink2...@googlegroups.com
Dear Christopher, 
Thank you very much for your quick response. I submitted the job to a server that has Haswell processors and runs RHEL 6.0. As for the console output, I am not sure what you mean; I can only see the temporary output files listed below. What is the console output, and how can I check it?
My script is below. The temporary output files so far are: ukb_imp_genosnps_thin_British_PCA-temporary.pgen, ukb_imp_genosnps_thin_British_PCA-temporary.psam, ukb_imp_genosnps_thin_British_PCA-temporary.pvar.
code:
#!/bin/bash
#PBS -S /bin/bash
#PBS -l walltime=120:00:00
#PBS -l nodes=1:ppn=28
#PBS -l mem=4gb
#PBS -d /scratch/ltong
module load gcc/6.2.0
module load plink/2.0
plink2 --bgen /group/pierce-lab/UK_Biobank/500K/March2018/imp/LinT_analysis/plink2/imp_genosnps_British_PCA_thin.bgen --pca approx --out ukb_imp_genosnps_thin_British_PCA

thanks, 
Lin


Christopher Chang

Aug 13, 2018, 3:43:28 PM
to plink2-users
Is there any way to check what has been written to standard output by your job on that machine?

If not, you can rerun with the --debug flag added; that will force every line to be flushed to the .log file immediately, so head on that file would be more informative.  This command does take multiple hours to complete on most machines, but it shouldn't take two days unless you have a rather underpowered machine or a lot more than 300k SNPs.  And 4 GB RAM is kind of absurdly low for analyzing a dataset of this size, but it may still be enough for this command.
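
As a rough sketch of the log check, assuming the job was rerun with --debug and the same --out prefix as in the script above: the helper name show_log_tail is hypothetical, and it uses tail rather than head so that the most recently written lines are shown.

```shell
#!/bin/bash
# Sketch only: print the last few lines of a plink2 .log file, if it exists.

show_log_tail() {
  local log_file="$1"      # path to the .log file written next to the --out prefix
  local n_lines="${2:-20}" # how many trailing lines to show (default 20)
  if [ -f "$log_file" ]; then
    tail -n "$n_lines" "$log_file"
  else
    echo "no log file yet: $log_file" >&2
    return 1
  fi
}

# Log name derived from the --out prefix in the job script; adjust as needed.
show_log_tail ukb_imp_genosnps_thin_British_PCA.log 20 || true
```

Run it from the job's working directory while the job is still going. If the memory limit turns out to be the bottleneck, the "#PBS -l mem=4gb" line in the job script is the setting to raise.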


