pmerge UK biobank chromosomes

Jalil Sharif

Oct 28, 2021, 12:58:26 PM

to plink2-users

Hi,

I would like to merge all chromosomal bgen files from the UK biobank using pmerge.

I am using this command:

plink2 \
--bgen /path/to/file/ukb_imp_chr1.bgen ref-first \
--sample /path/to/file/ukb_imp_chr1.sample \
--pmerge-list /path/to/file/merge.list \
--write-snplist \
--export bgen-1.3 \
--zst-level 22 \
--out /path/to/file/ukb_imp_allchr

merge.list contains the following file names:

/path/to/file/ukb_imp_chr1

/path/to/file/ukb_imp_chr2

/path/to/file/ukb_imp_chr3

...

/path/to/file/ukb_imp_chr21

/path/to/file/ukb_imp_chr22

/path/to/file/ukb_imp_chrX

/path/to/file/ukb_imp_chrXY

I want to clarify if I should still have

/path/to/file/ukb_imp_chr1 in the merge.list or start from the second chromosome?

e.g.

/path/to/file/ukb_imp_chr2

/path/to/file/ukb_imp_chr3

...

/path/to/file/ukb_imp_chr21

/path/to/file/ukb_imp_chr22

/path/to/file/ukb_imp_chrX

/path/to/file/ukb_imp_chrXY

thanks

Christopher Chang

Oct 29, 2021, 4:32:05 AM

to plink2-users

0. "--zst-level 22" is normally a bad idea, unless you're very short on disk space. The default compression level creates slightly larger files, but is a lot faster.

1. --pmerge-list requires .bed or .pgen input. You cannot use it with .bgen files.

Jalil Sharif

Oct 29, 2021, 8:06:14 AM

to plink2-users

I also have .bed files, but they are much bigger than the .bgen files. Is it possible to use .bed as input and write .bgen or .pgen as output, since those files are smaller?

E.g. like this?

plink2 \
--bfile /path/to/file/ukb_imp_chr1 \
--pmerge-list /path/to/file/merge.list \
--maf 0.01 \
--hwe 1e-6 \
--write-snplist \
--make-pgen \
--out /path/to/file/ukb_imp_allchr

Jalil Sharif

Oct 29, 2021, 11:08:23 AM

to plink2-users

After running the command above, I only get a .psam file, along with this output:

--pmerge-list: Merged .psam written to
/rds/general/user/js4120/ephemeral/merged_bgen/ukb_imp_allchr-merge.psam .

Christopher Chang

Oct 29, 2021, 1:11:52 PM

to plink2-users

1. The point is that you should convert all your .bgen files to .pgen before using --pmerge-list.
2. Please post the full .log file from the run where you only got a .psam file.

Jalil Sharif

Oct 30, 2021, 3:16:30 PM

to plink2-users

Never mind. My command was wrong.

It is now:

for i in $(seq 1 24)
do
    /rds/general/user/js4120/home/bin/plink2 \
    --bgen /rds/general/user/js4120/home/analysis_files/ukb_imp_chr${i}.bgen ref-first \
    --sample /rds/general/user/js4120/home/analysis_files/ukb_imp_chr1.sample \
    --maf 0.01 \
    --hwe 1e-6 \
    --make-pgen \
    --out /rds/general/user/js4120/ephemeral/merged_bgen/ukb_imp_chr${i}.QC
done

I am letting it run and will write back if any errors occur.
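For reference, the merge list for the converted filesets does not have to be typed by hand. A minimal sketch, assuming one fileset prefix per line (the layout of merge.list earlier in this thread) and the .QC output prefixes from the loop above. write_merge_list and its arguments are hypothetical names, and the chromosome labels passed to it have to match the files that actually exist:

```shell
# Hypothetical helper: write one fileset prefix per line, one per chromosome label.
# Usage: write_merge_list OUT_FILE PREFIX SUFFIX LABEL...
write_merge_list() {
    list_file=$1
    prefix=$2
    suffix=$3
    shift 3
    : > "$list_file"    # create the list file, or empty it if it already exists
    for chr in "$@"; do
        printf '%s%s%s\n' "$prefix" "$chr" "$suffix" >> "$list_file"
    done
}

# Example call (placeholder paths, not run here):
# write_merge_list merge.list /path/to/file/ukb_imp_chr .QC $(seq 1 22) X XY
```

The example call lists every chromosome, on the assumption that --pmerge-list is run without a separate main fileset.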

Jalil Sharif

Nov 8, 2021, 7:40:01 AM

to plink2-users

Hi Chris,

I was able to convert the files to pgen, but I am now getting the following error:

The biallelic variants with ID 'rs6657544' at position 1:1186665 in
/path/to/file/merged_bgen/ukb_imp_chr1.pvar appear to be
the components of a 'split' multiallelic variant; if so, it must be 'joined'
(with e.g. "bcftools norm -m") before a correct merge can occur. If you are
SURE that your data does not contain any same-position same-ID variant groups
that should be joined, you can suppress this error with
--multiallelics-already-joined.

Can bcftools be used directly on .pvar files?

The following command didn't work for me:

bcftools norm -m /path/to/file/merged_bgen/ukb_imp_chr1.pvar

Christopher Chang

Nov 8, 2021, 11:57:33 AM

to plink2-users

Use "--export bcf" to generate a BCF that bcftools can read, and --bcf to re-import the result.

If your .psam file contains sex/pedigree/phenotype information about your samples, that isn't exported to the BCF; but you can work around this by using --psam with --bcf when re-importing.
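The round trip described above could be sketched roughly as follows. This is an illustration, not a tested pipeline: join_multiallelics, run, and the ".tmp" intermediate file names are hypothetical, the "-m +any" choice is one possible setting for bcftools norm, and the re-import assumes the samples in the BCF still match the original .psam (no sample-removing filter applied in between). Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Run a command, or only print it when DRY_RUN=1 (handy for checking paths first).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: join split multiallelic variants for one .pgen fileset.
# $1 = input fileset prefix, $2 = output fileset prefix
join_multiallelics() {
    src=$1
    dst=$2
    # 1. Export the fileset to BCF so that bcftools can read it.
    run plink2 --pfile "$src" --export bcf --out "$src.tmp" || return 1
    # 2. Join same-position biallelic records into multiallelic records.
    run bcftools norm -m +any -Ob -o "$src.tmp.joined.bcf" "$src.tmp.bcf" || return 1
    # 3. Re-import, taking sex/pedigree/phenotype columns from the original .psam.
    run plink2 --bcf "$src.tmp.joined.bcf" --psam "$src.psam" --make-pgen --out "$dst"
}

# Example call (placeholder paths, not run here):
# DRY_RUN=1 join_multiallelics /path/to/file/ukb_imp_chr1 /path/to/file/ukb_imp_chr1.joined
```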

Jalil Sharif

Nov 23, 2021, 10:32:30 AM

to plink2-users

Hi Chris,

I ran the following sequence of commands:

for i in $(seq 1 22)
do
    plink2 \
    --bgen /path/to/file/ukb_imp_chr$i.bgen ref-first \
    --sample /path/to/file/ukb_imp_chr$i.sample \
    --mind 0.01 \
    --geno 0.01 \
    --maf 0.01 \
    --hwe 1e-6 \
    --export bcf \
    --out /path/to/file/ukb_imp_chr$i
done

Followed by this command to normalise multi-allelic variants:

for i in $(seq 1 22)
do
    bcftools \
    norm -m +any /path/to/file/ukb_imp_chr$i.bcf \
    -Ob -o /path/to/file/ukb_imp_chr$i.normalised.bcf
done

The aim was then to concatenate the per-chromosome files using:

bcftools concat /path/to/file/ukb_imp_chr1.normalised.bcf \
/path/to/file/ukb_imp_chr2.normalised.bcf \
/path/to/file/ukb_imp_chr3.normalised.bcf \
/path/to/file/ukb_imp_chr4.normalised.bcf \
/path/to/file/ukb_imp_chr5.normalised.bcf \
/path/to/file/ukb_imp_chr6.normalised.bcf \
/path/to/file/ukb_imp_chr7.normalised.bcf \
/path/to/file/ukb_imp_chr8.normalised.bcf \
/path/to/file/ukb_imp_chr9.normalised.bcf \
/path/to/file/ukb_imp_chr10.normalised.bcf \
/path/to/file/ukb_imp_chr11.normalised.bcf \
/path/to/file/ukb_imp_chr12.normalised.bcf \
/path/to/file/ukb_imp_chr13.normalised.bcf \
/path/to/file/ukb_imp_chr14.normalised.bcf \
/path/to/file/ukb_imp_chr15.normalised.bcf \
/path/to/file/ukb_imp_chr16.normalised.bcf \
/path/to/file/ukb_imp_chr17.normalised.bcf \
/path/to/file/ukb_imp_chr18.normalised.bcf \
/path/to/file/ukb_imp_chr19.normalised.bcf \
/path/to/file/ukb_imp_chr20.normalised.bcf \
/path/to/file/ukb_imp_chr21.normalised.bcf \
/path/to/file/ukb_imp_chr22.normalised.bcf -Ob -o /path/to/file/allchromosomes.bcf.gz

The script fails at the concat step with the following error:

Checking the headers and starting positions of 3 files
Different number of samples in /path/to/file/ukb_imp_chr2.normalised.bcf. Perhaps "bcftools merge" is what you are looking for?

But I am not trying to merge different samples; all the UK Biobank imputed files contain data for all samples.

Alternatively, I tried converting the normalised.bcf files to .pgen, but this also failed.

Example:

PLINK v2.00a3LM 64-bit Intel (11 Oct 2021) www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /path/to/file/ukb_imp_chr1.log.
Options in effect:
--bcf /path/to/file/ukb_imp_chr1.normalised.bcf
--make-pgen
--out /path/to/file/ukb_imp_chr1
--psam /path/to/file/ukb_imp_chr1.psam

Start time: Mon Nov 22 10:28:43 2021
386941 MiB RAM detected; reserving 193470 MiB for main workspace.
Using up to 24 threads (change this with --threads).
Error: Mismatched IDs between --bcf file and
/rds/general/user/js4120/home/analysis_files/ukb_imp_chr1.psam.
End time: Mon Nov 22 10:28:43 2021

Any advice?

Christopher Chang

Nov 23, 2021, 12:01:44 PM

to plink2-users

You have to postpone the --mind filter, because that removes samples.
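Because --mind is evaluated separately in each per-chromosome run, it can drop a different set of samples from each file, which would explain the "Different number of samples" message. One way to confirm this before running bcftools concat is to compare sample counts across the files. A sketch, assuming bcftools is on the PATH; sample_counts and all_counts_equal are hypothetical helper names:

```shell
# Print "file<TAB>sample count" for each BCF given as an argument.
# "bcftools query -l" lists the sample IDs stored in a VCF/BCF file.
sample_counts() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(bcftools query -l "$f" | wc -l | tr -d ' ')"
    done
}

# Read "file<TAB>count" lines on stdin; report every line whose count differs
# from the first line, and return non-zero if any such line exists.
all_counts_equal() {
    awk -F'\t' '
        NR == 1  { n = $2 }
        $2 != n  { bad = 1; print "mismatch: " $0 }
        END      { exit bad + 0 }
    '
}

# Example call (placeholder paths, not run here):
# sample_counts /path/to/file/ukb_imp_chr*.normalised.bcf | all_counts_equal
```

If the counts differ, dropping --mind from the per-chromosome export and applying it in a later step, once the chromosomes have been combined, keeps every file on the same sample set.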

Jalil Sharif

Jan 11, 2022, 4:47:27 AM

to plink2-users

I tried concatenating via bcftools, but the result gave me an error during clumping with plink 1.9.

Instead, I have now removed all multiallelic variants with "--rm-dup exclude-all" and exported as pgen.

For plink 1.9 I need either a bed or a bgen file, and I would prefer bgen. Is this the correct command?

plink2 --pmerge-list merge.list --export bgen-1.3 --threads 40 --out chr_merged

Christopher Chang

Jan 11, 2022, 1:50:48 PM

to plink2-users

bgen is a terrible idea here, since plink 1.9 will just waste a lot of time converting to a bed temporary file at the beginning of each run. Just use --make-bed.
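Applied to the command from the previous post, that suggestion would look roughly like the sketch below. merge_to_bed and the PLINK2 variable are hypothetical names; the variable only exists so that the command can be printed (PLINK2=echo) before it is run for real.

```shell
# Hypothetical wrapper: merge the filesets named in a list and write .bed/.bim/.fam.
# $1 = merge list, $2 = output prefix
merge_to_bed() {
    # PLINK2 defaults to "plink2"; set PLINK2=echo to print the arguments instead.
    "${PLINK2:-plink2}" --pmerge-list "$1" --make-bed --out "$2"
}

# Example call (not run here):
# merge_to_bed merge.list chr_merged
```

plink 1.9 can then load the result directly with --bfile chr_merged.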
