bug in plink2 --bcf import - file ok in plink1.9 and bcftools


Gabriel Doctor

Dec 4, 2025, 7:12:28 PM
to plink2-users
Hi Christopher, 
I have been using minimac4.1.6 to impute and chose to output a bcf file (this is not via the TOPMED/MIS server but my own setup).
The output bcf file can be read by bcftools and plink1.9, but not by plink2.

If I use bcftools just to read the bcf and write it straight back to bcf, the resulting file still doesn't work in plink2, with the same error.
If I create a vcf from it with bcftools, plink2 reads this.
If I then back-convert this vcf to bcf in bcftools, plink2 can read the resulting bcftovcftobcf file... Note that it has the same HDS:DS fields as far as I can see.

Selecting dosage=HDS or dosage=DS made no difference.

Perhaps this is all a quirk of the minimac4.1 output, but just FYI.

I.e.:
ORIG=chr22_1_32000000.identicals.dose.bcf
./plink2 --bcf $ORIG --make-pgen --out ORIG
#Error: Variant record #1 of --bcf file is malformed.

$bcftools view $ORIG -Ov -o  ORIG.bcftovcf.vcf
plink2 --vcf ORIG.bcftovcf.vcf --make-pgen --out ORIG.bcftovcf
# vcf loads

$bcftools view $ORIG -Ob -o bcfdirectobcf.bcf
plink2 --bcf bcfdirectobcf.bcf --make-pgen --out bcfdirectobcf  
#Error: Variant record #1 of --bcf file is malformed.

$bcftools view ORIG.bcftovcf.vcf -Ob -o vcfbacktobcf.bcf
plink2 --bcf vcfbacktobcf.bcf --make-pgen --out vcfbacktobcf
# this bcf --> vcf --> bcf round trip loads without error!

$bcftools view  $ORIG | head -n19 > originalheader.txt
$bcftools view  vcfbacktobcf.bcf | head -n21  > reconvertedheader.txt
diff originalheader.txt reconvertedheader.txt
# reports only the expected header differences (2 additional bcftools lines), even though the output includes the first variant line. Explicitly:
tail -n1 originalheader.txt
tail -n1 reconvertedheader.txt
# these look the same
htsfile "$ORIG"
# file.bcf:   BCF version 2.2 compressed variant calling data
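
The head -n19 / head -n21 comparison above depends on knowing each header's exact length. A minimal sketch of a less fragile version follows, using bcftools view -h (header only) and -H (records only). The function name compare_variant_files is hypothetical, not part of bcftools or plink2. Note that a clean diff of the text output only shows that the decoded content matches; it says nothing about how each file encodes that content in binary BCF.

```shell
# Hypothetical helper: compare header and first record of two VCF/BCF files.
# Returns 0 only if both the headers and the first records are identical as text.
compare_variant_files() {
    tmpdir=$(mktemp -d) || return 1

    # -h prints only the header; -H prints only the records.
    bcftools view -h "$1" > "$tmpdir/a.header"
    bcftools view -h "$2" > "$tmpdir/b.header"
    bcftools view -H "$1" | head -n1 > "$tmpdir/a.first"
    bcftools view -H "$2" | head -n1 > "$tmpdir/b.first"

    header_status=0
    diff "$tmpdir/a.header" "$tmpdir/b.header" || header_status=1
    record_status=0
    diff "$tmpdir/a.first" "$tmpdir/b.first" || record_status=1

    rm -rf "$tmpdir"
    [ "$header_status" -eq 0 ] && [ "$record_status" -eq 0 ]
}

# Example call with the files from this post:
#   compare_variant_files "$ORIG" vcfbacktobcf.bcf
```

Header differences such as extra ##bcftools_view lines will still be reported by the first diff, so a non-zero return is expected for a round-tripped file; the second diff is the one that checks the variant line.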



PLINK v2.0.0-a.7LM AVX2 Intel (28 Nov 2025)
Options in effect:
  --bcf file.bcf
  --make-pgen
  --out ORIG

Hostname: job-J4jzzj8Jfk46Y6g0b358X4Jv
Working directory: /home/dnanexus
Start time: Thu Dec  4 23:44:19 2025

Random number seed: 1764891859
7816 MiB RAM detected, ~6539 available; reserving 3908 MiB for main workspace.
Using up to 2 compute threads.
Error: Variant record #1 of --bcf file is malformed.

End time: Thu Dec  4 23:44:19 2025  


singlevariant.bcf

Chris Chang

Dec 5, 2025, 10:28:15 AM
to Gabriel Doctor, plink2-users
This was due to irregular encoding of the IMPUTED flag; today's development build should be able to read the file.


Gabriel Doctor

Dec 5, 2025, 10:31:56 AM
to Chris Chang, plink2-users
Thank you!
Could you comment on whether import into plink2 is quicker from compressed bcf or from vcf.gz, if one can choose? Does it depend on any properties of the data?


Best wishes
Gabriel


Chris Chang

Dec 5, 2025, 10:37:51 AM
to Gabriel Doctor, plink2-users
BCF took ~40% less time in the 1000 Genomes chr 1 test I just ran.
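
For anyone wanting to repeat this kind of comparison on their own data, here is a rough sketch of a timing wrapper. The function name time_import, the timing_ output prefix, and the input file names in the comments are hypothetical; only --bcf, --vcf, --make-pgen and --out are taken from the commands earlier in this thread. It measures whole-second wall-clock time only, and results will depend on disk caching and thread count.

```shell
# Hypothetical helper: time one plink2 import and report wall-clock seconds.
# Usage: time_import LABEL <plink2 input arguments...>
time_import() {
    label="$1"
    shift
    start=$(date +%s)
    status=0
    # Screen output is captured per run; plink2 also writes its own .log at the --out prefix.
    plink2 "$@" --make-pgen --out "timing_$label" > "timing_$label.stdout" 2>&1 || status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)"
    return "$status"
}

# Example calls (input file names are placeholders):
#   time_import bcf   --bcf input.bcf
#   time_import vcfgz --vcf input.vcf.gz
```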

Gabriel Doctor

Dec 7, 2025, 2:48:42 PM
to plink2-users
Thank you. I have run a similar test on a WGS dataset with ~450k samples. Sharing this as possibly of general interest.
Using <15mb RAM with 4 threads,
for a 1 Mb region with ~200,000 variants of phased genotype data (no other fields), just import and --make-pgen:

region selection using tabix took 46m33s
plink2 --vcf vcf.gz: wall clock 44m18s, user time 63 min

region selection and conversion of vcf.gz > bcf took 1h00m25s
plink2 --bcf: wall clock **6m41s**, user time 11 min

Results were similar for another 1 Mb region, and times scaled down accordingly for smaller regions. My impression is also that the plink2 import variant-scanning step is rate-limited, so more CPU does not shorten it, whereas the variant conversion step clearly benefits from more CPU and RAM.

Overall in this test:
--vcf converting ~0.02 Mb/min,
--bcf converting ~0.15 Mb/min, i.e. about 6.6 times faster by wall clock (7.5 times if the rounded rates are divided).
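
The throughput figures above can be rechecked from the reported wall-clock times with a few lines of shell. This is only a sketch: the function names throughput and speedup are made up for illustration, and the region size is assumed to be exactly 1 Mb.

```shell
# Throughput in Mb/min for a region of REGION_MB imported in MIN minutes SEC seconds.
# Usage: throughput REGION_MB MIN SEC
throughput() {
    awk -v mb="$1" -v m="$2" -v s="$3" 'BEGIN { printf "%.3f\n", mb / (m + s / 60) }'
}

# Ratio of the slower wall-clock time (MIN1 SEC1) to the faster one (MIN2 SEC2).
# Usage: speedup MIN1 SEC1 MIN2 SEC2
speedup() {
    awk -v m1="$1" -v s1="$2" -v m2="$3" -v s2="$4" \
        'BEGIN { printf "%.1f\n", (m1 + s1 / 60) / (m2 + s2 / 60) }'
}

echo "--vcf: $(throughput 1 44 18) Mb/min"   # 44m18s wall clock
echo "--bcf: $(throughput 1 6 41) Mb/min"    # 6m41s wall clock
echo "speedup: $(speedup 44 18 6 41)x"
```

This gives about 0.023 and 0.150 Mb/min and a wall-clock ratio of about 6.6. Dividing the rates after rounding them to 0.02 and 0.15 gives 7.5 instead, so the two figures differ only by rounding.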


Best wishes

Gabriel 
