--recode VCF : Null (binary 0) character in VCF headers


Scott Gordon

Oct 22, 2024, 11:34:44 PM
to plink2-users
Hi

At some point PLINK 1.90 started adding an extraneous null (binary 0) character (\0 or ^@) at the front of the 'contig' row of the VCF header when using --recode vcf. I think this began with a build from the last few weeks, as I didn't have the problem before; I am using PLINK 1.90 b7.6.

Could someone please investigate and correct this?

[Note: there is a workaround, namely passing the whole VCF file through 'sed' to strip the character (sed '{s/\x0//gi}'), but that is hardly ideal, particularly for large files.]
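
For anyone who would rather not pipe the file through sed, the same workaround can be sketched in Python. This is only a rough sketch under some assumptions: the function name strip_nul_bytes and the chunk size are made up for illustration, and the output is written as ordinary gzip rather than BGZF, so it would need recompressing with bgzip if a downstream tool requires BGZF blocks.

```python
import gzip


def strip_nul_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy a gzip/BGZF-compressed file, dropping every NUL (0x00) byte.

    Hypothetical helper, not part of PLINK. Returns the number of NUL
    bytes that were removed.
    """
    removed = 0
    # gzip.open reads BGZF files too, since BGZF is a series of gzip members.
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            # NOTE: output is plain gzip, not BGZF (an assumption of this sketch).
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

It streams the file in fixed-size chunks, so memory use stays flat for large files, though it still has to decompress and recompress everything, just like the sed approach.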

It is causing some programs (e.g. the Michigan Imputation Server) to decide that the file ends at that point, resulting in error messages about a missing header row, program failures, etc.

It is hard to see how I could have done anything to cause such a strange bug.

It may or may not be visible depending on how you view the file. The following is the output of the standard Linux 'zless' command (I've cut it off before the start of the sample ID list, as there is no reason to show it):

##fileformat=VCFv4.2
##fileDate=20241022
##source=PLINKv1.9
^@##contig=<ID=8,length=146294107>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT 
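
Since the character may not show up in every viewer, a small check along these lines could confirm which header lines are affected. Again this is just a sketch: find_nul_header_lines is a made-up name, and it assumes the file can be read with Python's gzip module (which should hold for bgzipped output).

```python
import gzip


def find_nul_header_lines(vcf_gz_path):
    """Return (line_number, offset_in_line) for header lines containing NUL.

    Hypothetical helper for illustration; scanning stops at the first
    line that is not part of the '#'-prefixed VCF header.
    """
    hits = []
    with gzip.open(vcf_gz_path, "rb") as fh:
        for line_no, line in enumerate(fh, start=1):
            # A leading NUL would hide the '#', so ignore NULs when
            # deciding whether this is still a header line.
            if not line.lstrip(b"\x00").startswith(b"#"):
                break
            pos = line.find(b"\x00")
            if pos != -1:
                hits.append((line_no, pos))
    return hits
```

Only the header is read, so this returns quickly even on a large file.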

This resulted from the following log :

PLINK v1.9.0-b.7.6 64-bit (13 Oct 2024)
Options in effect:
  --bfile Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_REFALTassignedHRC_chrposnames
  --keep-allele-order
  --out Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_VCFrecode
  --recode vcf bgz
  --threads 1

Hostname: hpcnode073.adqimr.ad.lan
Working directory: /mnt/lustre/working/lab_nickm/scottG/Twinning_Nigeria_H3Africa_Sep2024/ImputationServer_withinbatch
Start time: Tue Oct 22 16:33:44 2024

Random number seed: 1729578824
515270 MB RAM detected; reserving 257635 MB for main workspace.
102978 variants loaded from .bim file.
1296 people (0 males, 1296 females) loaded from .fam.
Using 1 thread.
Before main variant filters, 1296 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998739.
102978 variants and 1296 people pass filters and QC.
Note: No phenotypes present.
Warning: Underscore(s) present in sample IDs.
--recode vcf bgz to
Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_VCFrecode.vcf.gz
... done.
Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file.  (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or
represent a breakend.)

End time: Tue Oct 22 16:34:07 2024

Christopher Chang

Oct 22, 2024, 11:53:48 PM
to plink2-users
Eeek, the version header line was not updated correctly on Oct 11; thanks for reporting the problem.  Bugfix is posted.