Loading bcf files with --bcf in plink2 fails

574 views
Skip to first unread message

freeseek

unread,
Jul 1, 2021, 6:50:15 PM7/1/21
to plink2-users
I am having issue converting bcf files with plink2.

I generate a toy VCF:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Phased genotypes\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM"
echo -e "chr1\t1\trs123\tC\tT\t.\t.\t.\tGT\t0/0") > file.vcf



Then I try converting a binary VCF version of it:

$ bcftools view --no-version -Ob file.vcf -o file.bcf && plink2 --bcf file.bcf --make-bed
PLINK v2.00a3LM AVX2 Intel (8 Jun 2021)        www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bcf file.bcf
  --make-bed

Start time: Thu Jul  1 18:47:15 2021
7448 MiB RAM detected; reserving 3724 MiB for main workspace.
Using up to 8 compute threads.
--bcf: 1 variant scanned.

Error: Out of memory.  The --memory flag may be helpful.
Failed allocation size: 4295098432
End time: Thu Jul  1 18:47:15 2021


It looks like plink2 wants to allocate an inordinate amount of memory.



If I use --vcf instead of --bcf:

$ bcftools view --no-version -Oz file.vcf -o file.vcf.gz && plink2 --vcf file.vcf.gz --make-bed
PLINK v2.00a3LM AVX2 Intel (8 Jun 2021)        www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --make-bed
  --vcf file.vcf.gz

Start time: Thu Jul  1 18:48:53 2021
7448 MiB RAM detected; reserving 3724 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 1 variant scanned.
--vcf: plink2-temporary.pgen + plink2-temporary.pvar.zst +
plink2-temporary.psam written.
1 sample (0 females, 0 males, 1 ambiguous; 1 founder) loaded from
plink2-temporary.psam.
1 variant loaded from plink2-temporary.pvar.zst.
Note: No phenotype data present.
Writing plink2.fam ... done.
Writing plink2.bim ... done.
Writing plink2.bed ... done.
End time: Thu Jul  1 18:48:53 2021


It works just fine.

Giulio

Christopher Chang

unread,
Jul 1, 2021, 9:15:46 PM7/1/21
to plink2-users
Thanks for reporting this.  There was an integer underflow in the INFO memory-required calculation; bugfix is posted.

freeseek

unread,
Jul 2, 2021, 10:30:20 AM7/2/21
to plink2-users
Yes, that fixed it. Thank you. Tangentially related, while I can use PLINK 1.9 to work on a VCF binary stream:

$ bcftools view --no-version -Ob --compression-level 0 file.vcf | plink --bcf /dev/stdin --make-bed
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/

(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --bcf /dev/stdin
  --make-bed

7448 MB RAM detected; reserving 3724 MB for main workspace.
--bcf: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam written.
1 variant loaded from .bim file.
1 person (0 males, 0 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to plink.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1 founder and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is exactly 1.
1 variant and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to plink.bed + plink.bim + plink.fam ... done.




While with PLINK 2.0 this is not possible anymore:

$ bcftools view --no-version -Ob --compression-level 0 file.vcf | plink2 --bcf /dev/stdin --make-bed
PLINK v2.00a3LM AVX2 Intel (1 Jul 2021)        www.cog-genomics.org/plink/2.0/

(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bcf /dev/stdin
  --make-bed

Start time: Fri Jul  2 10:20:25 2021

7448 MiB RAM detected; reserving 3724 MiB for main workspace.
Using up to 8 compute threads.
--bcf: 1 variant scanned.
End time: Fri Jul  2 10:20:25 2021

No output is created, my guess being that's because PLINK 2.0 wants to read the file twice and this is not possible when the file is a stream.

I am writing a pipeline where I have to merge many VCFs and then convert the output to PLINK format and this leaves me with three workarounds:

1) Ask users (or the Docker running the pipeline) to have both PLINK 1.9 and PLINK 2.0 installed to perform this step (and other PLINK 2.0 steps down the line)
2) Generate a large mostly temporary uncompressed VCF on disk that I then convert to PLINK format with PLINK 2.0, increasing the disk requirements of the task
3) Generate a compressed VCF on disk that I then convert with PLINK 2.0, increasing the CPU requirements

Any reason PLINK 2.0 needs to read the VCF file twice while PLINK 1.9 does not?

Chris Chang

unread,
Jul 2, 2021, 10:54:46 AM7/2/21
to freeseek, plink2-users
The reason is that PLINK 2.0 converts much more information in the VCF.

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/6f6d03e5-fc56-4514-8cbc-3513dbd14b4en%40googlegroups.com.

Matthew Maher

unread,
Jul 2, 2021, 12:30:16 PM7/2/21
to Chris Chang, freeseek, plink2-users
Just out of curiosity, can I please ask for a bit of elucidation on that?   does PLINK2 make multiple passes through a VCF as freeseek suggests (and if so, why)?  I'd just like to ensure I understand what PLINK2 is doing.  

I have the same use case as freeseek described and I basically do his option #3 (because #2 creates HUGE files, even with Bcf).  I can't do his #1 as I need to --make-pgen rather than --make-bed.

Thanks again for PLINK(2) - awesome tools. 


Chris Chang

unread,
Jul 2, 2021, 6:07:40 PM7/2/21
to Matthew Maher, freeseek, plink2-users
The PLINK 1.x .bed file format is a simple binary matrix with no compression.  You can convert to it efficiently in a single pass, and then efficiently access variants in the middle of the file without any sort of index, since every row has the same (unnecessarily long) length.

.pgen files have variable-length rows that can store several types of VCF genotype data, and an embedded index/header.  VCF/BCF headers do not consistently declare the number of variants, or whether phased genotypes are present, or whether there are multiallelic variants, etc.  So there is a scanning pass to determine this information; then the embedded index and the internal data structures can be properly sized for the main multithreaded conversion pass.

Matthew Maher

unread,
Jul 2, 2021, 8:50:38 PM7/2/21
to Chris Chang, freeseek, plink2-users
I get it now - thanks for the explanation.
Reply all
Reply to author
Forward
0 new messages