Hi
At some point (I think in a build from the last few weeks, as I didn't have the problem before; I am using PLINK 1.90 b7.6), PLINK 1.90 started adding an extraneous null character (binary 0, i.e. \0 or ^@) at the front of the 'contig' row of the VCF header when using --recode vcf.
Could someone please investigate and correct this?
[Note that there is a workaround: passing the whole VCF file through 'sed' to strip the character (sed '{s/\x0//gi}'), but that is hardly ideal, particularly for large files.]
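For anyone who would rather not run 'sed' over the decompressed text, the same workaround can be sketched in Python, streaming the compressed file in chunks. This is only a minimal sketch: the function name strip_nul_bytes and the file names are made up for illustration. It also assumes that plain gzip output is acceptable. Python's gzip module can read the BGZF output of --recode vcf bgz, but it writes ordinary gzip, so the result would need recompressing with bgzip if a downstream tool insists on BGZF.

```python
import gzip
import os
import tempfile


def strip_nul_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy a gzip-compressed file, dropping every NUL (0x00) byte.

    Returns the number of NUL bytes that were removed.
    """
    removed = 0
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed


# Small self-check on a two-line header fragment (file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.vcf.gz")
    dst = os.path.join(tmp, "out.vcf.gz")
    with gzip.open(src, "wb") as fh:
        fh.write(b"##source=PLINKv1.9\n\x00##contig=<ID=8,length=146294107>\n")
    n_removed = strip_nul_bytes(src, dst)
    with gzip.open(dst, "rb") as fh:
        cleaned = fh.read()

print(n_removed, cleaned)
```

Reading fixed-size chunks keeps memory use flat regardless of file size, but the whole file still has to be decompressed and recompressed, so this is no cheaper than the 'sed' route.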
It is causing some programs (e.g. the Michigan Imputation Server) to decide that the file ends at that point, which leads to error messages about a missing header row, program failures, etc.
It is hard to see how I could have done anything to cause such a strange bug.
It may or may not be visible, depending on how you view the file. The following is from the standard Linux 'zless' command (I've cut it off before the start of the ID list, as there is no reason to show it):
##fileformat=VCFv4.2
##fileDate=20241022
##source=PLINKv1.9
^@##contig=<ID=8,length=146294107>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
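Since not every viewer displays the character, a short script can confirm whether a given file is affected and on which header line. The following is a rough sketch, not part of PLINK: the function name find_nul_in_header is made up, and the sample record line in the self-check is invented purely to mark the end of the header.

```python
import gzip
import os
import tempfile


def find_nul_in_header(path):
    """Return (line_number, offset_in_line) for each NUL byte in the VCF header.

    Line numbers are 1-based. Scanning stops at the first line that is not a
    header line, i.e. does not start with '#' once leading NULs are ignored.
    """
    hits = []
    with gzip.open(path, "rb") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.lstrip(b"\x00").startswith(b"#"):
                break
            hits.extend((line_no, i) for i, byte in enumerate(line) if byte == 0)
    return hits


# Self-check on a header resembling the one above; the last line is a made-up record.
sample = (
    b"##fileformat=VCFv4.2\n"
    b"##fileDate=20241022\n"
    b"##source=PLINKv1.9\n"
    b"\x00##contig=<ID=8,length=146294107>\n"
    b"#CHROM\tPOS\tID\n"
    b"8\t100\texample\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.vcf.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(sample)
    nul_hits = find_nul_in_header(path)

print(nul_hits)
```

Only the header is scanned, so this returns quickly even on large files.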
The header above resulted from the run with the following log:
PLINK v1.9.0-b.7.6 64-bit (13 Oct 2024)
Options in effect:
--bfile Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_REFALTassignedHRC_chrposnames
--keep-allele-order
--out Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_VCFrecode
--recode vcf bgz
--threads 1
Hostname: hpcnode073.adqimr.ad.lan
Working directory: /mnt/lustre/working/lab_nickm/scottG/Twinning_Nigeria_H3Africa_Sep2024/ImputationServer_withinbatch
Start time: Tue Oct 22 16:33:44 2024
Random number seed: 1729578824
515270 MB RAM detected; reserving 257635 MB for main workspace.
102978 variants loaded from .bim file.
1296 people (0 males, 1296 females) loaded from .fam.
Using 1 thread.
Before main variant filters, 1296 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998739.
102978 variants and 1296 people pass filters and QC.
Note: No phenotypes present.
Warning: Underscore(s) present in sample IDs.
--recode vcf bgz to
Twinning_Nigeria_H3Africa_ImputationServerUpload_22Oct2024_chr8_VCFrecode.vcf.gz
... done.
Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file. (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or
represent a breakend.)
End time: Tue Oct 22 16:34:07 2024