Plink2 speed on association test with BGEN files


刘莹

Jul 13, 2018, 10:24:39 AM
to plink2-users
I want to run a simple association test on a quantitative phenotype with just one or two covariates using PLINK2 on UKB bgen files. I will exclude lots of SNPs based on set criteria and restrict to a smaller sample size, but I just do that with the --remove and --exclude flags. However, I noticed the speed is still very slow, because it seems PLINK2 still scans and converts all the SNPs from the BGEN file to plink data format. Do you suggest using other tools like BGENIX or QCTOOL to shrink the bgen files first, then run the association test? Or do you have other suggestions that might help increase the speed? (--threads? Sorry, I am not plink-savvy.) For chromosome 22 it took me more than 2 days, given the huge sample size and SNP count in the UKB data, which makes me worry I cannot complete the analysis on chr1 or chr2, since we have restrictions on the running time of a job. I guess another option is to split each chromosome by range into several smaller pieces, but I wonder whether you have other ways to achieve this.

Thanks for any reply!
Ying

Christopher Chang

Jul 13, 2018, 12:01:58 PM
to plink2-users
You are supposed to convert the bgen files to pgen once, and then do all further filtering/analysis with the pgen fileset.

If you are selecting a small subset of samples/variants from the bgen, and don't ever plan to analyze any other parts of the same bgen with PLINK2, it's reasonable to use BGENIX/QCTOOL2/etc. to perform the initial filtering.

刘莹

Jul 19, 2018, 2:11:36 PM
to plink2-users
Thanks so much! 

Now I have a follow-up question. I know how to use --make-pgen in PLINK2 to convert my bgen (v1.2) files to pfiles, after which I can run --freq, --hardy, --glm, etc. Alternatively, I could use --bgen/--sample to work on my original bgen files directly and run --freq, --hardy, --glm from there, but then each step takes much longer, since it has to convert the bgen files to pfiles all over again. So in general, I should run --make-pgen first and do the rest afterwards to save computing time, as you suggested.

My only concern is whether I lose any information (perhaps dosages?) in the --make-pgen step. If so, is there any way to work around it? Do I need to worry about information loss in any of the downstream analyses, since some of it might affect the results, e.g. rare alleles?

Thanks!

Christopher Chang

Jul 19, 2018, 3:17:08 PM
to plink2-users
No, --make-pgen does not lose any information.  In fact, all that --bgen does is autoconvert your data to .pgen format; the rest of plink2 is built around .pgen.

刘莹

Jul 20, 2018, 12:51:26 PM
to plink2-users
Converting bgen to pgen worked fine for the smaller chromosomes, but some of my slurm jobs get killed after hours of running. The error message tells me the problem is not in PLINK2 itself; rather, the memory used exceeded the limit on our cluster:

Writing ./extracted/chr21_tryspeed/chrX_good_pfile.pgen ... 0%/usr/spool/slurm/job28646770/slurm_script: line 21: 51042 Killed                  plink2 --bgen ./extracted/chr21_tryspeed/chrX_good.bgen --sample ukb24487_imp_chrX_v3_s486743.sample --make-pgen --out ./extracted/chr21_tryspeed/chrX_good_pfile
slurmstepd: error: Exceeded step memory limit at some point. 

I wonder how to change my script to cap the memory used by the PLINK steps so I stay under the limit. Can I use --memory or --threads to achieve this?

Also, if I first divide the few very large chromosomes (chr1, chr2) into parts with bgenix or qctool and then run plink2 to convert them to pgen files, I would get pgen.chr1_p1, pgen.chr1_p2, pgen.chr1_p3, and so on. I could certainly run computations on these segments separately, but I am wondering whether I can merge pgen files and run the downstream analyses on the merged file. Thanks.

Christopher Chang

Jul 20, 2018, 12:58:51 PM
to plink2-users
Yes, you should use --memory to limit the amount of memory plink2 uses if your cluster imposes a limit, and --threads to limit the number of compute threads.

.pgen merge is not implemented yet.

刘莹

Jul 27, 2018, 10:09:30 AM
to plink2-users
I have successfully converted all my autosomes from bgen files to pgen files and then ran the basic association test without any problem. Thanks!!

Now I am trying to do the same thing on the UKB chrX file. The bgen-to-pgen conversion was fine, but the association test gave the following error message:

Warning: Skipping --glm regression on phenotype 'intBFP' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

Actually I am using the same phenotype file as for the other chromosomes, just with the samples not in the chrX sample file removed. intBFP is just a continuous phenotype derived from body fat percentage, and it ran fine on the other 22 chromosomes. Do you have any clue what the reason might be, and how I could modify my script or input files?

log file: 
PLINK v2.00a1LM 64-bit Intel (11 Feb 2018)
Options in effect:
  --covar ...../sampleX.sample 
  --covar-name sex
  --glm cols=+a1count,+machr2
  --memory 50000
  --out BFP_chrX
  --pfile .../chrX_good_pfile
  --pheno ..../sampleX.sample
  --pheno-name intBFP
  --remove ....../BFPexcludeX

Hostname: sbcs2
Working directory: ..../GWAS_BFP
Start time: Thu Jul 26 18:50:27 2018

Random number seed: 1532649027
258218 MB RAM detected; reserving 50000 MB for main workspace.
Using up to 72 threads (change this with --threads).
486757 samples (263910 females, 222833 males, 14 ambiguous; 486757 founders)
loaded from ..../chrX_good_pfile.psam.
858805 variants loaded from
....../chrX_good_pfile.pvar.
1 quantitative phenotype loaded (288791 values).
--remove: 288791 samples remaining.
1 covariate loaded from ....../sampleX.sample.
288791 samples (157049 females, 131742 males; 288791 founders) remaining after
main filters.
288791 quantitative phenotype values remaining after main filters.
Warning: Skipping --glm regression on phenotype 'intBFP' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

End time: Thu Jul 26 18:50:44 2018

PS:
I have googled previous posts and saw a similar question. I tried reducing my --pheno file to just ID and intBFP, and my --covar file to just ID and sex, but I still get the same warning and --glm is skipped. The weird thing is that it only happens on chrX; on the autosomes, linear regression runs without issue, even when I have more covariates in the file (possibly correlated, but not all used in each regression).

Thanks for your time. 

Christopher Chang

Jul 27, 2018, 11:36:40 AM
to plink2-users
plink's linear/logistic regressions automatically add a sex covariate on chrX, and this is colliding with the covariate you're adding.  You can use --glm's 'no-x-sex' modifier to disable this behavior.

刘莹

Jul 27, 2018, 12:02:42 PM
to plink2-users
I see! Thanks so much! It is running smoothly for now. 

刘莹

Jul 27, 2018, 7:48:28 PM
to plink2-users
Just a side question. When I convert UKB .bgen files to .pgen files, I only see #CHROM  POS     ID      REF     ALT in the .pvar file. Is there any way I could also see INFO, QUAL, or other fields? I saw in another post that:
Meanwhile, there currently aren't any plink2 commands that add anything to the INFO column; instead, --make-pgen just preserves preexisting entries.  However, you can dump freshly computed mach-r2 values with "--freq cols=+machr2", and then use a short script to place them in the INFO column if that's where you need them.

I could also add +machr2 in the --glm step, but I am not sure whether this reflects the SNP INFO. Is this INFO the same as the info score we can see in the mfi files from the UKB data?

Also, I understand I should be able to filter out low-INFO SNPs before --glm with --exclude-if-info, but since I cannot see an INFO column in the .pvar file, I don't know how to use that filter (I understand I could use the provided mfi file to generate a text list to exclude, though).

Thanks.  

Christopher Chang

Jul 27, 2018, 8:01:06 PM
to plink2-users
There's nothing to see: .bgen files do not have QUAL/FILTER/INFO fields.  So --exclude-if-info won't do anything on a converted .bgen.  These flags only apply to VCF data.

You can use plink2 --mach-r2-filter to filter on the MaCH-r^2 imputation quality statistic; this is related but not identical to IMPUTE2's INFO statistic.  If you want to use the mfi directly, write a script to read the mfi file and dump the SNPs that pass (or fail) your preferred filter to a file, and then use --extract/--exclude.
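The "short script" mentioned above might look like the following sketch in Python. The column layout is an assumption: UKB mfi files typically put the rsID in column 2 and the INFO score in the last (8th) column, so double-check against your own files before using this.

```python
# Hypothetical mfi-filtering sketch: split rsIDs into pass/fail lists by
# INFO score. Assumes whitespace-separated rows with rsID in column 2
# and INFO in column 8 (the usual UKB mfi layout -- verify first).

def split_by_info(mfi_lines, threshold=0.8):
    """Return (passed, failed) lists of rsIDs, comparing INFO to threshold."""
    passed, failed = [], []
    for line in mfi_lines:
        fields = line.split()
        rsid, info = fields[1], float(fields[7])
        (passed if info >= threshold else failed).append(rsid)
    return passed, failed

# Two made-up mfi rows, one passing and one failing an INFO >= 0.8 filter:
demo = [
    "1:100_A_G rs1 100 A G 0.30 G 0.95",
    "1:200_C_T rs2 200 C T 0.10 T 0.42",
]
keep, drop = split_by_info(demo)
# keep == ["rs1"]; write keep to a file for --extract, or drop for --exclude
```

Writing one rsID per line to a text file then plugs directly into --extract or --exclude as suggested.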

刘莹

Jul 27, 2018, 9:23:35 PM
to plink2-users
That's exactly what I thought, and what I did. Thanks.

刘莹

Aug 1, 2018, 4:51:28 PM
to plink2-users
Now I have finished the association tests and want to proceed to post-GWAS processing. I would like to clump SNPs, but I found there is no --clump flag in PLINK2, and plink 1.9 requires a genotype input fileset and does not recognize pfiles. Do you suggest using --indep-pairwise (though I thought that is a pruning step usually done before the association test; I could be wrong)? Or should I further convert my pfiles to bfiles?

[liuy39@sbcs2 try]$ plink --clump BMI_chr22_0400001.intBMI.glm.linear --clump-r2 0.1 --clump-p1 5e-8 --clump-p2 5e-8 --out BMI_chr22_0400001_clump_try
PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_chr22_0400001_clump_try.log.
Options in effect:
  --clump BMI_chr22_0400001.intBMI.glm.linear
  --clump-p1 5e-8
  --clump-p2 5e-8
  --clump-r2 0.1
  --out BMI_chr22_0400001_clump_try

Error: No input dataset.
For more information, try 'plink --help [flag name]' or 'plink --help | more'.

-----------------------

plink --clump BMI_chr22_0400001_try --clump-r2 0.1 --clump-p1 5e-8 --clump-p2 5e-8 --out BMI_chr22_0400001_clump_try --pfile /data1/liuy39/UKB_temp0723/pgencomplete/chr22_all_pfile
PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_chr22_0400001_clump_try.log.
Options in effect:
  --clump BMI_chr22_0400001_try
  --clump-p1 5e-8
  --clump-p2 5e-8
  --clump-r2 0.1
  --out BMI_chr22_0400001_clump_try
  --pfile /data1/liuy39/UKB_temp0723/pgencomplete/chr22_all_pfile

Error: Unrecognized flag (--pfile).  (This is PLINK 1.9, not 2.x.)

Christopher Chang

Aug 1, 2018, 5:08:56 PM
to plink2-users
Export bfiles; they should be good enough for clumping purposes for now.

刘莹

Aug 1, 2018, 5:13:36 PM
to plink2-users
Should I export everything to bfiles, or just the significant SNPs from my association test?
Also, I have heard of using a reference set for clumping. Do you suggest using a reference set from, say, HapMap, or just the individual genotype files I used for the association test?

Boison Solomon

Aug 1, 2018, 5:15:20 PM
to plink2-users

Hello Chang,

I have been using plink for a couple of years. I was wondering whether you would like to add a flag that excludes samples that are too heterozygous or homozygous, for example --hetind. The flag would take two arguments (minimum and maximum sample heterozygosity). Currently, I do this by computing inbreeding coefficients (--het) and then removing excessively homozygous samples via the --remove flag.

This is just a suggestion, and I would be glad if a flag like that were added.

Thanks for the wonderful work you've been doing.

Kind regards

Solomon

Christopher Chang

Aug 1, 2018, 5:25:02 PM
to plink2-users
Only necessary to include the SNPs you're interested in.

It's reasonable to use a reference dataset for clumping (1000 Genomes is a common choice) if your sample size is smaller and the reference dataset contains all the relevant SNPs.
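Collecting "the SNPs you're interested in" can be scripted. A sketch, assuming plink2's default .glm.linear header (which includes ID and P columns; verify the names against your own output):

```python
# Hypothetical sketch: pull genome-wide-significant SNP IDs out of a
# plink2 .glm.linear file, to feed to --extract before --make-bed.
# Assumes the default header line with ID and P columns.

def significant_ids(result_lines, alpha=5e-8):
    header = result_lines[0].lstrip("#").split()
    id_col, p_col = header.index("ID"), header.index("P")
    hits = []
    for line in result_lines[1:]:
        fields = line.split()
        p = fields[p_col]
        if p != "NA" and float(p) < alpha:  # skip failed tests (P == NA)
            hits.append(fields[id_col])
    return hits

# Two made-up result rows, one significant and one not:
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tA1\tTEST\tOBS_CT\tBETA\tSE\tT_STAT\tP",
    "22\t16051249\trs62224609\tT\tC\tC\tADD\t288791\t0.02\t0.004\t5.9\t3.2e-9",
    "22\t16052080\trs4965031\tG\tA\tA\tADD\t288791\t0.01\t0.02\t0.5\t0.62",
]
# significant_ids(demo) keeps only rs62224609
```

The resulting ID list, written one per line, goes to plink2 --extract together with --make-bed to export just those variants.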

刘莹

Aug 1, 2018, 5:47:37 PM
to plink2-users
Is there any difference between converting from bgen to pfile to bfile, versus converting directly from bgen to bfile?

Christopher Chang

Aug 1, 2018, 5:54:40 PM
to plink2-users
There shouldn't be if both conversions are performed with plink 2.0 (though pfile -> bfile is much faster).

To match what plink 1.9 does, add "--import-dosage-certainty 0.9" when importing the bgen.

刘莹

Aug 1, 2018, 5:57:36 PM
to plink2-users
This is very helpful. I'll just use plink2 to convert my existing pfiles (already converted from bgen) to bfiles, to save time.

刘莹

Aug 1, 2018, 6:03:14 PM
to plink2-users
Just to be clear: do I need to add "--import-dosage-certainty 0.9" when I run --make-pgen on the bgen file (bgen-to-pgen conversion)? I remember you mentioned before that the pfile contains all the information from the bgen, so I assume this is only needed when importing the bgen and converting directly to bfile, to match plink 1.9 behavior?

Christopher Chang

Aug 1, 2018, 6:55:34 PM
to plink2-users
No, pgen files do not contain *all* the information in a bgen; they just contain all the information you'd actually use in a normal GWAS analysis.  bgen files contain genotype posterior probability triplets, such as {P(0/0) = 0.2, P(0/1) = 0.52, P(1/1) = 0.28} or {P(0/0) = 0, P(0/1) = 0.92, P(1/1) = 0.08}.  pgen files contain allelic dosages; in both of these cases, dosage(1) = 1.08.  (0.52 + 0.28 * 2, or 0.92 + 0.08 * 2.)

Sometimes, you're better off treating poorly imputed genotype calls (such as {0.2, 0.52, 0.28}) as entirely missing; this is what --import-dosage-certainty is for.  "--import-dosage-certainty 0.9" checks the largest probability in each triplet, and causes a missing call to be saved instead of a dosage when that largest probability is smaller than your threshold; so {0.2, 0.52, 0.28} would be treated as missing and {0, 0.92, 0.08} would not.

Other times, any better-than-random guess is better than a missing entry.  If the mathematical method you're using is going to mean-impute missing values anyway, you may as well save even the least reliable dosages.

When you're unsure, you can always run your analysis twice, once with --import-dosage-certainty and once without.  With good data quality, the results should be nearly identical.
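The arithmetic above can be written down as a short sketch: convert a genotype-probability triplet to an ALT-allele dosage, optionally blanking low-certainty calls the way --import-dosage-certainty does (the function name here is illustrative, not a plink2 API):

```python
# Hypothetical sketch of the dosage computation described above.

def gp_to_dosage(p_hom_ref, p_het, p_hom_alt, certainty=None):
    """Return the ALT dosage, or None (missing) when the largest
    probability falls below the certainty threshold."""
    if certainty is not None and max(p_hom_ref, p_het, p_hom_alt) < certainty:
        return None
    return p_het + 2 * p_hom_alt

# Both triplets from the message above give dosage(1) = 1.08 ...
gp_to_dosage(0.2, 0.52, 0.28)                 # ~ 1.08
gp_to_dosage(0.0, 0.92, 0.08)                 # ~ 1.08
# ... but with a 0.9 certainty threshold, only the second one survives:
gp_to_dosage(0.2, 0.52, 0.28, certainty=0.9)  # -> None (missing)
gp_to_dosage(0.0, 0.92, 0.08, certainty=0.9)  # ~ 1.08
```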

sy06...@gmail.com

Aug 17, 2018, 1:48:59 PM
to plink2-users
hello and thank you for this valuable discussion.


If I understood correctly, for a basic allelic test (--assoc) or any other analysis using allelic counts, we would need to produce a file generating the counts with the 0.9 threshold. Is that correct?

In the documentation about dosage import, I am not sure I understand this part of the last sentence: "PLINK 2 also saves (possibly missing) hardcalls for those commands to use." Which information, at which threshold, does PLINK2 keep for the commands that use allelic counts (for example)? Could you please help?


Many thanks, 

Saliha
PS:  source: https://www.cog-genomics.org/plink/2.0/input#dosage_import  :  "The PLINK 2 binary file format supports allelic dosages, with ~4 decimal place precision. However, some of PLINK 2's commands do not make use of dosage data. Thus, when importing dosage data, PLINK 2 also saves (possibly missing) hardcalls for those commands to use.  " 

Christopher Chang

Aug 17, 2018, 1:53:19 PM
to plink2-users
The .pgen file contains both dosages AND thresholded hardcalls.  So you don't need to create two separate sets of files to work with the data in plink2.

(However, to use a plink 1.9 command that hasn't yet been ported to plink 2.0, you'll need to use --make-bed.)

sy06...@gmail.com

Aug 17, 2018, 2:15:34 PM
to plink2-users
Thank you very much for the answer! Are the thresholded hardcalls kept by .pgen those with GP >= 0.9?

Christopher Chang

Aug 17, 2018, 2:22:30 PM
to plink2-users
It defaults to dosages that are within 0.1 of an integer (so [0, 0.1], [0.9, 1.1], [1.9, 2.0]).  However, you can change this whenever you want with --hard-call-threshold + --make-pgen.  And during import (but not afterward), you can erase all dosages with highest-GP below a threshold by using --import-dosage-certainty.
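The default hard-call rule described here can be sketched as follows (the function name is hypothetical; plink2 stores the dosage alongside the hardcall either way):

```python
# Sketch of the default hard-call rule: a dosage within 0.1 of an
# integer is stored as that hardcall, anything else as missing.

def hardcall(dosage, threshold=0.1):
    nearest = round(dosage)
    if 0 <= nearest <= 2 and abs(dosage - nearest) <= threshold:
        return nearest
    return None  # missing hardcall (the dosage itself is still stored)

hardcall(0.05)  # -> 0
hardcall(1.08)  # -> 1
hardcall(1.5)   # -> None (missing)
hardcall(1.95)  # -> 2
```

Shrinking the threshold toward 0 makes the rule keep only (essentially) integer dosages.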

sy06...@gmail.com

Aug 17, 2018, 3:24:44 PM
to plink2-users
Thank you very much for the clarification. One last point: does this mean that if we use a --hard-call-threshold of 0 + --make-pgen, we fall back to what is equivalent to typed (directly measured) genotypes?
Cheers, 
Saliha

Christopher Chang

Aug 17, 2018, 6:16:18 PM
to plink2-users
Yes, that will treat essentially all non-integer dosages as missing calls.  (Exception: dosages have at least 5 decimal places of precision, and there are numbers like 1.00001 which are extremely close to an integer; plink2 only uses ~4 decimal places so it treats 1.00001 as if it were exactly 1.)

Shicheng Guo

Aug 18, 2018, 8:03:01 PM
to Christopher Chang, plink2...@googlegroups.com
Hi Chris, 

I have 100,000 SNP pairs for which I need to calculate LD (r2 or D'). What's the best way to do this with plink? Note that it is pair by pair, not a double loop over all the SNPs.

Thanks. 

Shicheng

FYI:

rs10000185      rs7672495
rs10000196      rs2867461
rs10000226      rs2867461
rs10000609      rs17630466
rs10000678      rs6448119
rs10000888      rs62324212
rs10001373      rs7660520
rs10002005      rs7660520
rs10002145      rs62324212
rs10002601      rs7660520
rs10002643      rs7660520
rs10003163      rs437943
rs10003482      rs7660520
rs10003783      rs13142500
rs10003784      rs13142500
rs10003809      rs2867461
rs10004019      rs2867461
rs10004264      rs13142500
rs10004910      rs2664035
rs10006432      rs7660520
rs10006717      rs62324212
rs10007259      rs13142500
rs10008032      rs437943
rs10008074      rs1501138
rs1000820       rs11870477
rs1000821       rs11870477
rs10008644      rs1501138
rs10010061      rs62324212
rs10010101      rs437943
rs10011458      rs7672495
rs1001168       rs11870477
rs10011864      rs437943
rs1001191       rs1877030
rs10012259      rs11937061
rs10013032      rs13119723
rs10013495      rs62324212
rs10013749      rs13137105
rs10013803      rs7660520
rs10013924      rs7660520
rs10014066      rs7660520
rs10014110      rs2664035
rs10014628      rs2664035
rs10015055      rs7660520
rs10015763      rs7660520
rs1001604       rs983332
rs10016122      rs7660520
rs10016456      rs2867461
rs10016718      rs62324212
rs10017730      rs6853094
rs10017851      rs62324212
rs10018389      rs437943
rs1001876       rs61996546
rs10018958      rs7660520
rs10019263      rs7660520
rs10019944      rs7660520
rs10020363      rs7672495
rs10020927      rs62324212
rs10021795      rs62324212
rs10022053      rs7660520
rs10023654      rs437943
rs10023754      rs7660520
rs10024901      rs2664035
rs10024930      rs7660520
rs10025062      rs7672495

Christopher Chang

Aug 18, 2018, 8:55:58 PM
to plink2-users
plink doesn't directly support this workflow.  I'd do something like the following:
1. Divide the list into groups of, say, 1000.  (This should be small enough to keep the number of useless --r2 report lines under control, while large enough to limit program startup overhead.)
2. For each group of 1000, extract just the named SNPs from your dataset, run --r2 with a sufficiently large window size (and 'inter-chr' if necessary), and extract just the lines of interest from the report.

A reasonable alternative is to use an R or Python API to operate on your plink-format dataset, since the r2/dprime computation is pretty straightforward.
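The batching in step 1, and a genotype-based version of the "straightforward" r² computation, might be sketched like this. Note this r² is the squared correlation of 0/1/2 genotype counts; plink's --r2 estimates haplotype frequencies by default, so its numbers can differ slightly.

```python
# Hypothetical sketch: batch the SNP-pair list for separate --r2 runs,
# and compute a genotypic r^2 directly for the DIY route.

def chunks(pairs, size=1000):
    """Yield successive batches of SNP pairs."""
    for i in range(0, len(pairs), size):
        yield pairs[i:i + size]

def r2(g1, g2):
    """Squared Pearson correlation of two genotype-count vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (v1 * v2)

pairs = [("rs10000185", "rs7672495"), ("rs10000196", "rs2867461")]
batches = list(chunks(pairs, size=1))  # two batches of one pair each
r2([0, 1, 2, 1], [2, 1, 0, 1])         # -> 1.0 (perfectly anticorrelated)
```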

Ying Liu

Sep 6, 2018, 12:32:55 PM
to plink2-users
Sorry, I am replying to my own old post. For clumping: if I do have a big enough sample size, like the 500k UKB subjects, or at least the ~200k related females for my project, do I still need to involve a reference dataset, or can I just use my own data? I guess the problem would then be the huge data size to process. As for reference data, I would also need to convert it to bfiles for the next steps, correct? Which files are appropriate to use? I assume the latest assembly of the European population from 1000 Genomes for my UKB project? Sorry if these are very basic questions.