Plink2 speed on association test with BGEN files


刘莹

Jul 13, 2018, 10:24:39 AM
to plink2-users
I want to run a simple association test on a quantitative phenotype with just one or two covariates using PLINK2 on UKB bgen files. I will exclude lots of SNPs based on set criteria and restrict to a smaller sample size, but I just do that with the --remove and --exclude flags. However, I noticed the speed is still very slow, because it seems PLINK2 still scans and converts all the SNPs from the BGEN file to plink data format. Do you suggest using other tools like BGENIX or QCTOOL to shrink the bgen files first, then run the association test? Or do you have other suggestions that might help increase the speed? (--threads? Sorry, I am not plink-savvy.) For chromosome 22 it took me more than 2 days, given the huge sample size and SNP count in the UKB data, which makes me worry I cannot complete the analysis on chr1 or chr2, since we have restrictions on the running time of a job. I guess another option is to split each chromosome by range into several smaller pieces, but I wonder whether you have other ways to achieve this.

Thanks for any reply!
Ying

Christopher Chang

Jul 13, 2018, 12:01:58 PM
to plink2-users
You are supposed to convert the bgen files to pgen once, and then do all further filtering/analysis with the pgen fileset.

If you are selecting a small subset of samples/variants from the bgen, and don't ever plan to analyze any other parts of the same bgen with PLINK2, it's reasonable to use BGENIX/QCTOOL2/etc. to perform the initial filtering.

刘莹

Jul 19, 2018, 2:11:36 PM
to plink2-users
Thanks so much! 

Now I have a follow-up question. I know how to use --make-pgen in PLINK2 to convert my bgen (v1.2) files to pfiles, after which I can run --freq, --hardy, --glm, etc. Alternatively, I could use --bgen/--sample to work on my original bgen files directly and run --freq, --hardy, --glm from there, but then each step takes much longer, since it has to convert the bgen files to pfiles all over again. So in general, I should run --make-pgen first and do the rest afterwards to save computing time, as you suggested.

My only concern is whether I lose any information (perhaps dosages?) in the --make-pgen step. If so, is there any way to work around it? Do I need to worry about information loss in any of the downstream analyses, since some of it might affect the results, e.g. rare alleles?

Thanks!

Christopher Chang

Jul 19, 2018, 3:17:08 PM
to plink2-users
No, --make-pgen does not lose any information.  In fact, all that --bgen does is autoconvert your data to .pgen format; the rest of plink2 is built around .pgen.

刘莹

Jul 20, 2018, 12:51:26 PM
to plink2-users
Converting bgen to pgen worked fine for the smaller chromosomes, but some of my slurm jobs get killed after hours of running. The error message tells me the problem is not in PLINK2 itself; rather, the memory used exceeded the limit on our cluster:

Writing ./extracted/chr21_tryspeed/chrX_good_pfile.pgen ... 0%/usr/spool/slurm/job28646770/slurm_script: line 21: 51042 Killed                  plink2 --bgen ./extracted/chr21_tryspeed/chrX_good.bgen --sample ukb24487_imp_chrX_v3_s486743.sample --make-pgen --out ./extracted/chr21_tryspeed/chrX_good_pfile
slurmstepd: error: Exceeded step memory limit at some point. 

I wonder how to change my script to cap the memory used by the PLINK steps so I stay under the limit. Can I use --memory or --threads to achieve this?

Also, if I first divide the few very large chromosomes (chr1, chr2) into parts with bgenix or qctool and then run plink2 to convert them to pgen files, I would get pgen.chr1_p1, pgen.chr1_p2, pgen.chr1_p3, and so on. I could certainly run computations on these segments separately, but I am wondering whether I can merge pgen files and run the downstream analyses on the merged file. Thanks.

Christopher Chang

Jul 20, 2018, 12:58:51 PM
to plink2-users
Yes, you should use --memory to limit the amount of memory plink2 uses if your cluster imposes a limit, and --threads to limit the number of compute threads.

.pgen merge is not implemented yet.

刘莹

Jul 27, 2018, 10:09:30 AM
to plink2-users
I have successfully converted all my autosomes from bgen files to pgen files and then ran the basic association test without any problem. Thanks!!

Now I am trying to do the same thing on the UKB chrX file. The bgen-to-pgen conversion was fine, but the association test gave the following error message:

Warning: Skipping --glm regression on phenotype 'intBFP' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

Actually I am using the same phenotype file as for the other chromosomes, just with the samples not in the chrX sample file removed. intBFP is just a continuous phenotype derived from body fat percentage, and it ran fine on the other 22 chromosomes. Do you have any clue what the reason might be, and how I could modify my script or input files?

log file: 
PLINK v2.00a1LM 64-bit Intel (11 Feb 2018)
Options in effect:
  --covar ...../sampleX.sample 
  --covar-name sex
  --glm cols=+a1count,+machr2
  --memory 50000
  --out BFP_chrX
  --pfile .../chrX_good_pfile
  --pheno ..../sampleX.sample
  --pheno-name intBFP
  --remove ....../BFPexcludeX

Hostname: sbcs2
Working directory: ..../GWAS_BFP
Start time: Thu Jul 26 18:50:27 2018

Random number seed: 1532649027
258218 MB RAM detected; reserving 50000 MB for main workspace.
Using up to 72 threads (change this with --threads).
486757 samples (263910 females, 222833 males, 14 ambiguous; 486757 founders)
loaded from ..../chrX_good_pfile.psam.
858805 variants loaded from
....../chrX_good_pfile.pvar.
1 quantitative phenotype loaded (288791 values).
--remove: 288791 samples remaining.
1 covariate loaded from ....../sampleX.sample.
288791 samples (157049 females, 131742 males; 288791 founders) remaining after
main filters.
288791 quantitative phenotype values remaining after main filters.
Warning: Skipping --glm regression on phenotype 'intBFP' since covariate
correlation matrix could not be inverted. You may want to remove redundant
covariates and try again.

End time: Thu Jul 26 18:50:44 2018

PS:
I have googled previous posts and saw a similar question. I tried reducing my --pheno file to just ID and intBFP, and my --covar file to just ID and sex, but I still get the same warning and --glm is skipped. The weird thing is that it only happens on chrX; on the autosomes, linear regression runs without issue, even when I have more covariates in the file (possibly correlated, but not all used in each regression).

Thanks for your time. 

Christopher Chang

Jul 27, 2018, 11:36:40 AM
to plink2-users
plink's linear/logistic regressions automatically add a sex covariate on chrX, and this is colliding with the covariate you're adding.  You can use --glm's 'no-x-sex' modifier to disable this behavior.

刘莹

Jul 27, 2018, 12:02:42 PM
to plink2-users
I see! Thanks so much! It is running smoothly for now. 

刘莹

Jul 27, 2018, 7:48:28 PM
to plink2-users
Just a side question. When I convert UKB .bgen files to .pgen files, I only see #CHROM  POS     ID      REF     ALT in the .pvar file. Is there any way I could also see INFO, QUAL, or other fields? I saw in another post that:
Meanwhile, there currently aren't any plink2 commands that add anything to the INFO column; instead, --make-pgen just preserves preexisting entries.  However, you can dump freshly computed mach-r2 values with "--freq cols=+machr2", and then use a short script to place them in the INFO column if that's where you need them.

I could also add +machr2 in the --glm step, but I am not sure whether this reflects the SNP INFO. Is this INFO the same as the info score we can see in the mfi files from the UKB data?

Also, I understand I should be able to filter out low-INFO SNPs before --glm with --exclude-if-info, but since I cannot see an INFO column in the .pvar file, I don't know how to use that filter (I understand I could use the provided mfi file to generate a text list to exclude, though).

Thanks.  

Christopher Chang

Jul 27, 2018, 8:01:06 PM
to plink2-users
There's nothing to see: .bgen files do not have QUAL/FILTER/INFO fields.  So --exclude-if-info won't do anything on a converted .bgen.  These flags only apply to VCF data.

You can use plink2 --mach-r2-filter to filter on the MaCH-r^2 imputation quality statistic; this is related but not identical to IMPUTE2's INFO statistic.  If you want to use the mfi directly, write a script to read the mfi file and dump the SNPs that pass (or fail) your preferred filter to a file, and then use --extract/--exclude.
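The "short script" mentioned above might look like the following sketch in Python. The column layout is an assumption: UKB mfi files typically put the rsID in column 2 and the INFO score in the last (8th) column, so double-check against your own files before using this.

```python
# Hypothetical mfi-filtering sketch: split rsIDs into pass/fail lists by
# INFO score. Assumes whitespace-separated rows with rsID in column 2
# and INFO in column 8 (the usual UKB mfi layout -- verify first).

def split_by_info(mfi_lines, threshold=0.8):
    """Return (passed, failed) lists of rsIDs, comparing INFO to threshold."""
    passed, failed = [], []
    for line in mfi_lines:
        fields = line.split()
        rsid, info = fields[1], float(fields[7])
        (passed if info >= threshold else failed).append(rsid)
    return passed, failed

# Two made-up mfi rows, one passing and one failing an INFO >= 0.8 filter:
demo = [
    "1:100_A_G rs1 100 A G 0.30 G 0.95",
    "1:200_C_T rs2 200 C T 0.10 T 0.42",
]
keep, drop = split_by_info(demo)
# keep == ["rs1"]; write keep to a file for --extract, or drop for --exclude
```

Writing one rsID per line to a text file then plugs directly into --extract or --exclude as suggested.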

刘莹

Jul 27, 2018, 9:23:35 PM
to plink2-users
That's exactly what I thought, and what I did. Thanks.

刘莹

Aug 1, 2018, 4:51:28 PM
to plink2-users
Now I have finished the association tests and want to proceed to post-GWAS processing. I would like to clump SNPs, but I found there is no --clump flag in PLINK2, and plink 1.9 requires a genotype input fileset and does not recognize pfiles. Do you suggest using --indep-pairwise (though I thought that is a pruning step usually done before the association test; I could be wrong)? Or should I further convert my pfiles to bfiles?

[liuy39@sbcs2 try]$ plink --clump BMI_chr22_0400001.intBMI.glm.linear --clump-r2 0.1 --clump-p1 5e-8 --clump-p2 5e-8 --out BMI_chr22_0400001_clump_try
PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_chr22_0400001_clump_try.log.
Options in effect:
  --clump BMI_chr22_0400001.intBMI.glm.linear
  --clump-p1 5e-8
  --clump-p2 5e-8
  --clump-r2 0.1
  --out BMI_chr22_0400001_clump_try

Error: No input dataset.
For more information, try 'plink --help [flag name]' or 'plink --help | more'.

-----------------------

plink --clump BMI_chr22_0400001_try --clump-r2 0.1 --clump-p1 5e-8 --clump-p2 5e-8 --out BMI_chr22_0400001_clump_try --pfile /data1/liuy39/UKB_temp0723/pgencomplete/chr22_all_pfile
PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_chr22_0400001_clump_try.log.
Options in effect:
  --clump BMI_chr22_0400001_try
  --clump-p1 5e-8
  --clump-p2 5e-8
  --clump-r2 0.1
  --out BMI_chr22_0400001_clump_try
  --pfile /data1/liuy39/UKB_temp0723/pgencomplete/chr22_all_pfile

Error: Unrecognized flag (--pfile).  (This is PLINK 1.9, not 2.x.)

Christopher Chang

Aug 1, 2018, 5:08:56 PM
to plink2-users
Export bfiles; they should be good enough for clumping purposes for now.

刘莹

Aug 1, 2018, 5:13:36 PM
to plink2-users
Should I export everything to bfiles, or just the significant SNPs from my association test?
Also, I have heard of using a reference set for clumping. Do you suggest using a reference set from, say, HapMap, or just the individual genotype files I used for the association test?

Boison Solomon

Aug 1, 2018, 5:15:20 PM
to plink2-users

Hello Chang,

I have been using plink for a couple of years. I was wondering whether you would like to add a flag that excludes samples that are too heterozygous or homozygous, for example --hetind. The flag would take two arguments (minimum and maximum sample heterozygosity). Currently, I do this by computing inbreeding coefficients (--het) and then removing excessively homozygous samples via the --remove flag.

This is just a suggestion, and I would be glad if a flag like that were added.

Thanks for the wonderful work you've been doing.

Kind regards

Solomon

Christopher Chang

Aug 1, 2018, 5:25:02 PM
to plink2-users
Only necessary to include the SNPs you're interested in.

It's reasonable to use a reference dataset for clumping (1000 Genomes is a common choice) if your sample size is smaller and the reference dataset contains all the relevant SNPs.
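Collecting "the SNPs you're interested in" can be scripted. A sketch, assuming plink2's default .glm.linear header (which includes ID and P columns; verify the names against your own output):

```python
# Hypothetical sketch: pull genome-wide-significant SNP IDs out of a
# plink2 .glm.linear file, to feed to --extract before --make-bed.
# Assumes the default header line with ID and P columns.

def significant_ids(result_lines, alpha=5e-8):
    header = result_lines[0].lstrip("#").split()
    id_col, p_col = header.index("ID"), header.index("P")
    hits = []
    for line in result_lines[1:]:
        fields = line.split()
        p = fields[p_col]
        if p != "NA" and float(p) < alpha:  # skip failed tests (P == NA)
            hits.append(fields[id_col])
    return hits

# Two made-up result rows, one significant and one not:
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tA1\tTEST\tOBS_CT\tBETA\tSE\tT_STAT\tP",
    "22\t16051249\trs62224609\tT\tC\tC\tADD\t288791\t0.02\t0.004\t5.9\t3.2e-9",
    "22\t16052080\trs4965031\tG\tA\tA\tADD\t288791\t0.01\t0.02\t0.5\t0.62",
]
# significant_ids(demo) keeps only rs62224609
```

The resulting ID list, written one per line, goes to plink2 --extract together with --make-bed to export just those variants.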

刘莹

Aug 1, 2018, 5:47:37 PM
to plink2-users
Is there any difference between converting from bgen to pfile to bfile, versus converting directly from bgen to bfile?

Christopher Chang

Aug 1, 2018, 5:54:40 PM
to plink2-users
There shouldn't be if both conversions are performed with plink 2.0 (though pfile -> bfile is much faster).

To match what plink 1.9 does, add "--import-dosage-certainty 0.9" when importing the bgen.

刘莹

Aug 1, 2018, 5:57:36 PM
to plink2-users
This is very helpful. I'll just use plink2 to convert my existing pfiles (already converted from bgen) to bfiles, to save time.

刘莹

Aug 1, 2018, 6:03:14 PM
to plink2-users
Just to be clear: do I need to add "--import-dosage-certainty 0.9" when I run --make-pgen on the bgen file (bgen-to-pgen conversion)? I remember you mentioned before that the pfile contains all the information from the bgen, so I assume this is only needed when importing the bgen and converting directly to bfile, to match plink 1.9 behavior?

Christopher Chang

Aug 1, 2018, 6:55:34 PM
to plink2-users
No, pgen files do not contain *all* the information in a bgen; they just contain all the information you'd actually use in a normal GWAS analysis.  bgen files contain genotype posterior probability triplets, such as {P(0/0) = 0.2, P(0/1) = 0.52, P(1/1) = 0.28} or {P(0/0) = 0, P(0/1) = 0.92, P(1/1) = 0.08}.  pgen files contain allelic dosages; in both of these cases, dosage(1) = 1.08.  (0.52 + 0.28 * 2, or 0.92 + 0.08 * 2.)

Sometimes, you're better off treating poorly imputed genotype calls (such as {0.2, 0.52, 0.28}) as entirely missing; this is what --import-dosage-certainty is for.  "--import-dosage-certainty 0.9" checks the largest probability in each triplet, and causes a missing call to be saved instead of a dosage when that largest probability is smaller than your threshold; so {0.2, 0.52, 0.28} would be treated as missing and {0, 0.92, 0.08} would not.

Other times, any better-than-random guess is better than a missing entry.  If the mathematical method you're using is going to mean-impute missing values anyway, you may as well save even the least reliable dosages.

When you're unsure, you can always run your analysis twice, once with --import-dosage-certainty and once without.  With good data quality, the results should be nearly identical.
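The arithmetic above can be written down as a short sketch: convert a genotype-probability triplet to an ALT-allele dosage, optionally blanking low-certainty calls the way --import-dosage-certainty does (the function name here is illustrative, not a plink2 API):

```python
# Hypothetical sketch of the dosage computation described above.

def gp_to_dosage(p_hom_ref, p_het, p_hom_alt, certainty=None):
    """Return the ALT dosage, or None (missing) when the largest
    probability falls below the certainty threshold."""
    if certainty is not None and max(p_hom_ref, p_het, p_hom_alt) < certainty:
        return None
    return p_het + 2 * p_hom_alt

# Both triplets from the message above give dosage(1) = 1.08 ...
gp_to_dosage(0.2, 0.52, 0.28)                 # ~ 1.08
gp_to_dosage(0.0, 0.92, 0.08)                 # ~ 1.08
# ... but with a 0.9 certainty threshold, only the second one survives:
gp_to_dosage(0.2, 0.52, 0.28, certainty=0.9)  # -> None (missing)
gp_to_dosage(0.0, 0.92, 0.08, certainty=0.9)  # ~ 1.08
```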

sy06...@gmail.com

Aug 17, 2018, 1:48:59 PM
to plink2-users
hello and thank you for this valuable discussion.


If I understood correctly, for a basic allelic test (--assoc) or any other analysis using allelic counts, we would need to produce a file generating the counts with the 0.9 threshold. Is that correct?

In the documentation about dosage import, I am not sure I understand this part of the last sentence: "PLINK 2 also saves (possibly missing) hardcalls for those commands to use." Which information, at which threshold, does PLINK2 keep for the commands that use allelic counts (for example)? Could you please help?


Many thanks, 

Saliha
PS:  source: https://www.cog-genomics.org/plink/2.0/input#dosage_import  :  "The PLINK 2 binary file format supports allelic dosages, with ~4 decimal place precision. However, some of PLINK 2's commands do not make use of dosage data. Thus, when importing dosage data, PLINK 2 also saves (possibly missing) hardcalls for those commands to use.  " 

Christopher Chang

Aug 17, 2018, 1:53:19 PM
to plink2-users
The .pgen file contains both dosages AND thresholded hardcalls.  So you don't need to create two separate sets of files to work with the data in plink2.

(However, to use a plink 1.9 command that hasn't yet been ported to plink 2.0, you'll need to use --make-bed.)

sy06...@gmail.com

Aug 17, 2018, 2:15:34 PM
to plink2-users
Thank you very much for the answer! Are the thresholded hardcalls kept by .pgen those with GP >= 0.9?

Christopher Chang

Aug 17, 2018, 2:22:30 PM
to plink2-users
It defaults to dosages that are within 0.1 of an integer (so [0, 0.1], [0.9, 1.1], [1.9, 2.0]).  However, you can change this whenever you want with --hard-call-threshold + --make-pgen.  And during import (but not afterward), you can erase all dosages with highest-GP below a threshold by using --import-dosage-certainty.
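The default hard-call rule described here can be sketched as follows (the function name is hypothetical; plink2 stores the dosage alongside the hardcall either way):

```python
# Sketch of the default hard-call rule: a dosage within 0.1 of an
# integer is stored as that hardcall, anything else as missing.

def hardcall(dosage, threshold=0.1):
    nearest = round(dosage)
    if 0 <= nearest <= 2 and abs(dosage - nearest) <= threshold:
        return nearest
    return None  # missing hardcall (the dosage itself is still stored)

hardcall(0.05)  # -> 0
hardcall(1.08)  # -> 1
hardcall(1.5)   # -> None (missing)
hardcall(1.95)  # -> 2
```

Shrinking the threshold toward 0 makes the rule keep only (essentially) integer dosages.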

sy06...@gmail.com

Aug 17, 2018, 3:24:44 PM
to plink2-users
Thank you very much for the clarification. One last point: does this mean that if we use a --hard-call-threshold of 0 + --make-pgen, we fall back to what is equivalent to typed (directly measured) genotypes?
Cheers, 
Saliha

Christopher Chang

Aug 17, 2018, 6:16:18 PM
to plink2-users
Yes, that will treat essentially all non-integer dosages as missing calls.  (Exception: dosages have at least 5 decimal places of precision, and there are numbers like 1.00001 which are extremely close to an integer; plink2 only uses ~4 decimal places so it treats 1.00001 as if it were exactly 1.)

Shicheng Guo

Aug 18, 2018, 8:03:01 PM
to Christopher Chang, plink2...@googlegroups.com
Hi Chris, 

I have 100,000 SNP pairs for which I need to calculate LD (r2 or D'). What's the best way to do this with plink? Note that it is pair by pair, not a double loop over all the SNPs.

Thanks. 

Shicheng

FYI:

rs10000185      rs7672495
rs10000196      rs2867461
rs10000226      rs2867461
rs10000609      rs17630466
rs10000678      rs6448119
rs10000888      rs62324212
rs10001373      rs7660520
rs10002005      rs7660520
rs10002145      rs62324212
rs10002601      rs7660520
rs10002643      rs7660520
rs10003163      rs437943
rs10003482      rs7660520
rs10003783      rs13142500
rs10003784      rs13142500
rs10003809      rs2867461
rs10004019      rs2867461
rs10004264      rs13142500
rs10004910      rs2664035
rs10006432      rs7660520
rs10006717      rs62324212
rs10007259      rs13142500
rs10008032      rs437943
rs10008074      rs1501138
rs1000820       rs11870477
rs1000821       rs11870477
rs10008644      rs1501138
rs10010061      rs62324212
rs10010101      rs437943
rs10011458      rs7672495
rs1001168       rs11870477
rs10011864      rs437943
rs1001191       rs1877030
rs10012259      rs11937061
rs10013032      rs13119723
rs10013495      rs62324212
rs10013749      rs13137105
rs10013803      rs7660520
rs10013924      rs7660520
rs10014066      rs7660520
rs10014110      rs2664035
rs10014628      rs2664035
rs10015055      rs7660520
rs10015763      rs7660520
rs1001604       rs983332
rs10016122      rs7660520
rs10016456      rs2867461
rs10016718      rs62324212
rs10017730      rs6853094
rs10017851      rs62324212
rs10018389      rs437943
rs1001876       rs61996546
rs10018958      rs7660520
rs10019263      rs7660520
rs10019944      rs7660520
rs10020363      rs7672495
rs10020927      rs62324212
rs10021795      rs62324212
rs10022053      rs7660520
rs10023654      rs437943
rs10023754      rs7660520
rs10024901      rs2664035
rs10024930      rs7660520
rs10025062      rs7672495

Christopher Chang

Aug 18, 2018, 8:55:58 PM
to plink2-users
plink doesn't directly support this workflow.  I'd do something like the following:
1. Divide the list into groups of, say, 1000.  (This should be small enough to keep the number of useless --r2 report lines under control, while large enough to limit program startup overhead.)
2. For each group of 1000, extract just the named SNPs from your dataset, run --r2 with a sufficiently large window size (and 'inter-chr' if necessary), and extract just the lines of interest from the report.

A reasonable alternative is to use an R or Python API to operate on your plink-format dataset, since the r2/dprime computation is pretty straightforward.
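The batching in step 1, and a genotype-based version of the "straightforward" r² computation, might be sketched like this. Note this r² is the squared correlation of 0/1/2 genotype counts; plink's --r2 estimates haplotype frequencies by default, so its numbers can differ slightly.

```python
# Hypothetical sketch: batch the SNP-pair list for separate --r2 runs,
# and compute a genotypic r^2 directly for the DIY route.

def chunks(pairs, size=1000):
    """Yield successive batches of SNP pairs."""
    for i in range(0, len(pairs), size):
        yield pairs[i:i + size]

def r2(g1, g2):
    """Squared Pearson correlation of two genotype-count vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (v1 * v2)

pairs = [("rs10000185", "rs7672495"), ("rs10000196", "rs2867461")]
batches = list(chunks(pairs, size=1))  # two batches of one pair each
r2([0, 1, 2, 1], [2, 1, 0, 1])         # -> 1.0 (perfectly anticorrelated)
```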

Ying Liu

Sep 6, 2018, 12:32:55 PM
to plink2-users
Sorry, I am replying to my own old post. For clumping: if I do have a big enough sample size, like the 500k UKB subjects, or at least the ~200k related females for my project, do I still need to involve a reference dataset, or can I just use my own data? I guess the problem would then be the huge data size to process. As for reference data, I would also need to convert it to bfiles for the next steps, correct? Which files are appropriate to use? I assume the latest assembly of the European population from 1000 Genomes for my UKB project? Sorry if these are very basic questions.