Feature request: split vcf before import


Gabriel Doctor

Nov 23, 2025, 1:14:37 PM
to plink2-users
Dear Chris,
I was wondering if it is possible to add a feature which can filter a vcf by range before import: i.e. only a section of markers within a given range from the whole length of a vcf (based on chrom position, not names) is converted to plink binary. 
 
Use case: I find that for many operations, working with the vcf in plink2 is much quicker than working directly with bcftools. However, plink2 converts the whole vcf/bcf into pgen before applying a range filter, so having to pre-filter by range with bcftools adds a lot of time. In some cases, a feature of the vcf (e.g. rare long allele names) means it is not actually possible to import the whole-chromosome file at all without prior bcftools processing, which adds even more time.

I freely admit to not understanding enough about vcf/bcf architecture to know if this is even possible, but I thought I'd ask. You managed to do something very nifty for pre-processing of sample IDs with vcf_subset.

Best wishes

Gabriel 

Chris Chang

Nov 24, 2025, 4:15:15 PM
to Gabriel Doctor, plink2-users
bcftools should be able to do this efficiently.  Does your BCF have an index?  It also helps if bcftools is compiled with libdeflate.


Gabriel Doctor

Nov 25, 2025, 4:07:40 AM
to plink2-users
Hi Chris, 
Unfortunately the UK Biobank population vcf data (e.g. for phased data) is stored as indexed vcf.gz, not bcf files. The fileset I am working with is tabix-indexed.
As an indication of the problem I ran a test: it takes *one hour* to process a 1MB range with a simple command, using 4 cores for de/compression. I ran it twice, once mounting the file and once downloading it to the worker (see below for the time printout). This is much slower than plink2, obviously. However, loading the whole chromosome into plink2 fails because of occasional long variant IDs in every chromosome.
 
A minimal solution would just be something that 'cut' the compressed vcf, e.g. by line index (not even having to perform a region look-up), and one could then simply cat the header from the original vcf to make a valid file.
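To illustrate the shape of what I mean, here is a toy demo on a tiny synthetic file (plain gzip only; a real version would need to re-compress with bgzip so the output stays tabix/bcftools-compatible):

```shell
# Toy stand-in for a big vcf.gz: 2 header lines + 4 data records.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > header.txt
{ cat header.txt
  printf '1\t100\t.\tA\tG\n1\t200\t.\tC\tT\n1\t300\t.\tG\tA\n1\t400\t.\tT\tC\n'
} | gzip > toy.vcf.gz

# "Cut" records 2-3 by line index (no region lookup), then cat the header back on.
{ cat header.txt
  gzip -dc toy.vcf.gz | grep -v '^#' | sed -n '2,3p'
} | gzip > slice.vcf.gz

# The slice is a valid standalone VCF with 2 data records.
gzip -dc slice.vcf.gz | grep -vc '^#'
```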

Best wishes

Gabriel 

```
        Command being timed: "bcftools annotate --set-id . --remove INFO -r chr22:34000000-35000000 -Ob -o chr22_34000000-35000000.forplink.bcf --threads 4 ukb30108_c22_b0_v1.vcf.gz"
        User time (seconds): 5304.29
        System time (seconds): 142.28
        Percent of CPU this job got: 146%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:01:48
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 227304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 55
        Minor (reclaiming a frame) page faults: 56311
        Voluntary context switches: 17477829
        Involuntary context switches: 7387397
        Swaps: 0
        File system inputs: 2829376
        File system outputs: 1852192
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```

Chris Chang

Nov 25, 2025, 6:40:37 AM
to Gabriel Doctor, plink2-users
1. What is the output of "bcftools --version"?
2. Please provide a full .log file capturing the plink2 whole-chromosome import error you're referring to.

Gabriel Doctor

Nov 25, 2025, 7:14:46 AM
to plink2-users
  (Thought I had responded but can't see it)

1. I do not think my installation included libdeflate. I followed the instructions here (https://raw.githubusercontent.com/samtools/bcftools/develop/INSTALL) when I first installed onto the worker, and then subsequently used the dnanexus 'snapshot' of that worker image for all subsequent workers (so no recompiling).

bcftools --version
bcftools 1.21-79-gcef68bc8
Using htslib 1.21-29-g243e97ec
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


2. Below is an example logfile. I am working with UKB phased WGS files for 500k samples. Variant IDs in the chrom.vcf.gz files are given as e.g. chr22:10553755:G:A;chr22:10553755:G:C
I got the same failure message, at different lines, for all tested chromosomes 14-22. Hence my workaround of using bcftools annotate to update the ID before importing - this works but adds a very rate-limiting step.

PLINK v2.0.0-a.6.28LM AVX2 Intel (23 Nov 2025)
Options in effect:
--exclude-snps .
--mac 2
--make-pgen multiallelics=-
--memory 30000
--new-id-max-allele-len 2 missing
--out chr18.bcftoplink
--set-all-var-ids @_#_$r_$a
--threads 8
--vcf ukb30108_c18_b0_v1.vcf.gz
Hostname: job-J4X3p0jJfk41JYx3Fp00ZvqB
Working directory: /home/dnanexus
Start time: Mon Nov 24 11:18:52 2025
Random number seed: 1763983132
31718 MiB RAM detected, ~29664 available; reserving 29600 MiB for main
workspace.
Using up to 8 compute threads.
Error: Invalid ID on line 871235 of --vcf file (max 16000 chars).
End time: Mon Nov 24 11:29:37 2025

Chris Chang

Nov 25, 2025, 7:41:57 AM
to Gabriel Doctor, plink2-users
1. bcftools annotate -r is able to efficiently jump to the right place in the file in my testing.  Does the 1 hour processing time go down much if you shrink the window from 1 MB to 100 kb?  What if you change "--remove INFO" to "--remove INFO,FORMAT"?

2. I did not expect official release VCFs to include such long variant IDs.  I will add a flag to the development build soon to ignore them (and write ".") during VCF/BCF import.

Gabriel Doctor

Nov 25, 2025, 12:26:06 PM
to plink2-users
A flag to mark imported VCF SNP IDs as "." would be fantastic.

The output file starts writing quickly, so I think the problem is just that each line is so long; the -r region jumping is working fine.

Using a 100kb segment instead of 1MB took 5:40 (about a tenth of the hour); using --remove INFO,FORMAT didn't change that. I reinstalled the latest bcftools (inside a docker), making sure htslib was built with libdeflate - that also didn't change anything.

I am also surprised the UKB has made these files available as vcf.gz instead of bcf.gz - seems a waste of space and time for end users. 

Thank you for your suggestions in any case!

Chris Chang

Nov 28, 2025, 3:36:19 PM
to Gabriel Doctor, plink2-users
Try the --import-overlong-var-ids flag in today's development build.

Gabriel Doctor

Dec 1, 2025, 7:52:21 AM
to plink2-users
Hi Chris, 
I have tried this now on the UKB data and it works - thank you! Here is the log below, for interest.
I still think a crude "split indexed vcf.gz by record number" feature would be worthwhile if possible, so that conversion can be parallelised without the very slow bcftools I/O.
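By "split by record number" I mean something like this toy sketch (tiny synthetic file, plain gzip; real chunks would need bgzip re-compression, and each chunk could then be imported by a separate plink2 job):

```shell
# Toy input: 2 header lines + 4 data records.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > header.txt
{ cat header.txt; printf '1\t%s\t.\tA\tG\n' 100 200 300 400; } | gzip > toy.vcf.gz

# Split the data records into 2-record chunks, prepending the header to each,
# so the chunks are self-contained VCFs that can be processed in parallel.
gzip -dc toy.vcf.gz | grep -v '^#' | split -l 2 - chunk_
for f in chunk_*; do
  cat header.txt "$f" | gzip > "$f.vcf.gz"
  rm "$f"
done

ls chunk_*.vcf.gz   # two chunk files, each a standalone vcf.gz
```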
Scanning and conversion for chr22 with 8 cores / 32 GB RAM took 4h45m (and was no quicker with double the cores/RAM).
Thank you once again for this great software and your engagement with users. 

PLINK v2.0.0-a.7LM AVX2 Intel (28 Nov 2025)
Options in effect:
  --import-overlong-var-ids skip
  --keep samples_id_to_keep_for_imputation.list
  --make-pgen multiallelics=-
  --memory 30000
  --new-id-max-allele-len 2 missing
  --out chr22.vcftoplink
  --set-all-var-ids @_#_$r_$a
  --threads 8
  --vcf ukb30108_c22_b0_v1.vcf.gz

Hostname: job-J4fB1qjJfk40P358y72bQ125
Working directory: /home/dnanexus
Start time: Sun Nov 30 20:00:41 2025

Random number seed: 1764532841
31718 MiB RAM detected, ~29661 available; reserving 29597 MiB for main
workspace.
Using up to 8 compute threads.
--vcf: 7504305 variants scanned; 18 excluded by --import-overlong-var-ids,
7504287 remaining.
--vcf: chr22.vcftoplink-temporary.pgen + chr22.vcftoplink-temporary.pvar.zst +
chr22.vcftoplink-temporary.psam written.
490541 samples (0 females, 0 males, 490541 ambiguous; 490541 founders) loaded
from chr22.vcftoplink-temporary.psam.
Warning: 459613 variant IDs erased by --set-all-var-ids due to allele code
length.
7504287 variants loaded from chr22.vcftoplink-temporary.pvar.zst.
Note: No phenotype data present.
--keep: 448876 samples remaining.
448876 samples (0 females, 0 males, 448876 ambiguous; 448876 founders)
remaining after main filters.
Writing chr22.vcftoplink.psam ... done.
Writing chr22.vcftoplink.pvar ... done.
Writing chr22.vcftoplink.pgen ... done.
Multiallelic split: 10100109 variants written.

End time: Mon Dec  1 01:17:00 2025
