Missing variants from --extract bed1

Jeff Kim

Nov 12, 2025, 12:40:46 PM
to plink2-users
Hi Chris,

I encountered a strange issue/potential bug related to --extract bed1. I am using UK Biobank WGS data and I wanted to extract coding regions from one of the PLINK2 WGS binary files.

The version is PLINK2 v2.0.0-a.7LM AVX2 Intel (26 Oct 2025). This is the command I ran (log attached):

plink2 --pfile UKB_WGS_TEST \
--extract bed1 gencode.v47.CDS.15bp_flank.bed \
--make-pgen --out UKB_WGS_TEST.CDS


There is a variant of interest at chromosome 1 and basepair position 154602065. This is a multiallelic variant, but the UKB WGS data splits multiallelics into multiple variants like so:

>zgrep -e '154602065' UKB_WGS_TEST.pvar
1 154602065 chr1:154602065:G:A G A 99.9999 PASS
1 154602065 chr1:154602065:G:C G C 99.9998 PASS
1 154602065 chr1:154602065:G:T G T 100 PASS


This is fine with me, but the puzzling part is that the operation with --extract bed1 keeps 2 variants, but removes 1.

>zgrep -e '154602065' UKB_WGS_TEST.CDS.pvar
1 154602065 chr1:154602065:G:A G A 99.9999 PASS
1 154602065 chr1:154602065:G:T G T 100 PASS


The chr1:154602065:G:C variant is missing. This is actually the most common alternate allele of the three. This initially made me think I set a maxMAF threshold or something in my script, but I hadn't.

The BED file, which was generated from a GENCODE GTF file, looks like this:

...
chr1 154571173 154572176
chr1 154575746 154575944
chr1 154607976 154608021
chr1 154601025 154602641
chr1 154598386 154598600
chr1 154597812 154597991
chr1 154597107 154597282
...


I am quite perplexed as to why --extract bed1 would remove 1 variant when all 3 variants share the same chromosome and base-pair position. I am posting here to check whether I am missing something or whether this is a bug.
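
A quick check outside plink2 can confirm that the position really does fall inside one of the BED intervals. The sketch below assumes that bed1 means 1-based coordinates with both endpoints included, that the BED file is plain whitespace-delimited text, and that a leading "chr" can be ignored when matching chromosome codes. The function name bed1_hits is made up for illustration.

```shell
# Hypothetical helper (not part of plink2): print every BED interval that
# contains a given position, treating start and end as 1-based and inclusive
# (the assumed bed1 convention).
# Usage: bed1_hits BEDFILE CHROM POS    (CHROM given without the "chr" prefix)
bed1_hits() {
  awk -v chrom="$2" -v pos="$3" '
    /^(#|track|browser)/ { next }        # skip comment and header lines
    {
      c = $1
      sub(/^chr/, "", c)                 # treat "chr1" and "1" as the same code
      if (c == chrom && $2 + 0 <= pos + 0 && pos + 0 <= $3 + 0) print
    }
  ' "$1"
}
```

For the excerpt above, `bed1_hits gencode.v47.CDS.15bp_flank.bed 1 154602065` prints the `chr1 154601025 154602641` interval, so under these assumptions all three variants at that position should be kept.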

Let me know if you need any additional information. Appreciate all your work on plink!
Jeff
UKB_WGS_TEST.CDS.log

Chris Chang

Nov 12, 2025, 12:52:02 PM
to Jeff Kim, plink2-users
Hmm, this looks like it is a recently-introduced bug in the development build.  The "19858 variants in UKB_WGS_TEST.pvar; 3198 excluded by , 16660 remaining." .log line does not make sense (a reason is supposed to be given for the exclusion).
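
To see whether other logs from the same build show the same symptom, a search along these lines may help. It is only a heuristic sketch: the pattern is taken from the malformed line quoted above, and the function name logs_with_blank_reason is hypothetical.

```shell
# Hypothetical helper: list the plink2 .log files that contain an exclusion
# line with an empty reason, i.e. "excluded by ," with nothing but spaces
# between "by" and the comma.
# Usage: logs_with_blank_reason FILE.log [FILE.log ...]
logs_with_blank_reason() {
  grep -l -E 'excluded by *,' "$@"
}
```

For example, `logs_with_blank_reason *.log` prints the name of each matching log and prints nothing if no log contains such a line.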

Is it possible for you to send just the .pvar and .bed files for debugging purposes?

Jeff Kim

Nov 12, 2025, 1:42:33 PM
to plink2-users
Hi Chris,

Thank you for the quick response. Here they are. The BED file is gzipped here.

Regards,
Jeff
UKB_WGS_TEST.CDS.pvar
UKB_WGS_TEST.pvar
gencode.v47.CDS.15bp_flank.bed.gz

Jeff Kim

Nov 12, 2025, 2:25:22 PM
to plink2-users
And yes, I can confirm that this bug doesn't occur in the alpha 6.26 build. See below.

I will switch to the alpha 6 build for now. Thank you!

Jeff

LOG:
---
PLINK v2.0.0-a.6.26LM AVX2 Intel (26 Oct 2025)      cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang    GNU General Public License v3
Logging to UKB_WGS_TEST.CDS.old_plink.log.
Options in effect:
  --extract bed1 gencode.v47.CDS.15bp_flank.bed
  --make-pgen
  --out UKB_WGS_TEST.CDS.old_plink
  --pfile UKB_WGS_TEST

Start time: Wed Nov 12 18:52:34 2025
31156 MiB RAM detected, ~29088 available; reserving 15578 MiB for main
workspace.
Using up to 16 threads (change this with --threads).
490541 samples (265739 females, 224347 males, 455 ambiguous; 490541 founders)
loaded from UKB_WGS_TEST.psam.
19858 variants loaded from UKB_WGS_TEST.pvar.
Note: No phenotype data present.
--extract bed1: 18333 variants excluded.
1525 variants remaining after main filters.
Writing UKB_WGS_TEST.CDS.old_plink.psam ... done.
Writing UKB_WGS_TEST.CDS.old_plink.pvar ... done.
Writing UKB_WGS_TEST.CDS.old_plink.pgen ... done.
End time: Wed Nov 12 18:52:34 2025
---
>zgrep -e '154602065' UKB_WGS_TEST.CDS.old_plink.pvar
1 154602065 chr1:154602065:G:A G A 99.9999 PASS
1 154602065 chr1:154602065:G:C G C 99.9998 PASS
1 154602065 chr1:154602065:G:T G T 100 PASS





Chris Chang

Nov 12, 2025, 5:01:31 PM
to Jeff Kim, plink2-users
Ok, yes, it's reasonable to use alpha 6 for real work for now.

Meanwhile, I'm having some difficulty replicating the bug.  Two quick questions:
(i) is your result consistently reproducible, or does there seem to be randomness involved?  I.e. if you rerun


plink2 --pfile UKB_WGS_TEST \
--extract bed1 gencode.v47.CDS.15bp_flank.bed \
--make-pgen --out UKB_WGS_TEST.CDS

with the buggy build, do you get the exact same log output?

(ii) do you still get a buggy result if you change the command to

plink2 --pvar UKB_WGS_TEST.pvar \
--extract bed1 gencode.v47.CDS.15bp_flank.bed \
--make-just-pvar --out UKB_WGS_TEST.CDS

?

Jeff Kim

Nov 12, 2025, 6:04:59 PM
to plink2-users
Hi Chris,

The result is consistently reproducible on my DNAnexus instances. It is always the same variant that is missing.

>plink2 --pvar UKB_WGS_TEST.pvar \
--extract bed1 gencode.v47.CDS.15bp_flank.bed \
--make-just-pvar --out UKB_WGS_TEST.CDS.justpvar

>zgrep -e '154602065' UKB_WGS_TEST.CDS.justpvar.pvar
1 154602065 chr1:154602065:G:A G A 99.9999 PASS
1 154602065 chr1:154602065:G:T G T 100 PASS

LOG:
---------------
PLINK v2.0.0-a.7LM AVX2 Intel (26 Oct 2025)         cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang    GNU General Public License v3
Logging to UKB_WGS_TEST.CDS.justpvar.log.
Options in effect:
  --extract bed1 gencode.v47.CDS.15bp_flank.bed
  --make-just-pvar
  --out UKB_WGS_TEST.CDS.justpvar
  --pvar UKB_WGS_TEST.pvar

Start time: Wed Nov 12 22:42:01 2025
31156 MiB RAM detected, ~29306 available; reserving 15578 MiB for main
workspace.
Using up to 16 threads (change this with --threads).
19858 variants in UKB_WGS_TEST.pvar; 3198 excluded by , 16660 remaining.
Note: No phenotype data present.
--extract bed1: 15364 variants excluded.
1296 variants remaining after main filters.
Writing UKB_WGS_TEST.CDS.justpvar.pvar ... done.
End time: Wed Nov 12 22:42:01 2025

---------------

But a weird wrinkle: I just tried it again with the Linux AVX2 AMD compilation, and it seems to be working fine:

>plink2 --pvar UKB_WGS_TEST.pvar \
--extract bed1 gencode.v47.CDS.15bp_flank.bed \
--make-just-pvar --out UKB_WGS_TEST.CDS.justpvar.amd

>zgrep -e '154602065' UKB_WGS_TEST.CDS.justpvar.amd.pvar
1 154602065 chr1:154602065:G:A G A 99.9999 PASS
1 154602065 chr1:154602065:G:C G C 99.9998 PASS
1 154602065 chr1:154602065:G:T G T 100 PASS


LOG:
---------------
PLINK v2.0.0-a.7LM AVX2 AMD (11 Nov 2025)

Options in effect:
  --extract bed1 gencode.v47.CDS.15bp_flank.bed
  --make-just-pvar
  --out UKB_WGS_TEST.CDS.justpvar.amd
  --pvar UKB_WGS_TEST.pvar

Hostname: 6fec6eb15894
Working directory: /opt/notebooks
Start time: Wed Nov 12 22:43:07 2025

Random number seed: 1762987387
31156 MiB RAM detected, ~29303 available; reserving 15578 MiB for main
workspace.
Using up to 16 threads (change this with --threads).
19858 variants loaded from UKB_WGS_TEST.pvar.
Note: No phenotype data present.
--extract bed1: 18333 variants excluded.
1525 variants remaining after main filters.
Writing UKB_WGS_TEST.CDS.justpvar.amd.pvar ... done.

End time: Wed Nov 12 22:43:07 2025
---------------

It also works fine with the Linux 64-bit Intel compilation (the log is essentially the same). The lscpu output shows that the instance runs on an Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz.
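
To list every variant that differs between two builds, not just the one at 154602065, the two output .pvar files can be compared by variant ID. This is a minimal sketch that assumes both files are uncompressed text with the ID in the third column, as in the excerpts above. The function name pvar_missing_ids is hypothetical.

```shell
# Hypothetical helper: print variant IDs (column 3) that appear in the
# reference .pvar but not in the candidate .pvar. Header lines starting
# with '#' are skipped in both files.
# Usage: pvar_missing_ids REFERENCE.pvar CANDIDATE.pvar
pvar_missing_ids() {
  awk '
    /^#/ { next }
    FILENAME == ARGV[1] { seen[$3] = 1; next }   # first file read: candidate IDs
    !($3 in seen) { print $3 }                   # second file read: reference IDs not seen
  ' "$2" "$1"
}
```

For example, `pvar_missing_ids UKB_WGS_TEST.CDS.justpvar.amd.pvar UKB_WGS_TEST.CDS.justpvar.pvar` lists the IDs kept by the AMD build but dropped by the Intel AVX2 build. If the Intel output is a strict subset of the AMD output, the logs above (1525 vs. 1296 variants remaining) suggest 229 IDs.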

If you are having trouble replicating, perhaps it is specific to this hardware setup combined with the Intel AVX2 build. Since it seems relevant, I have attached the full output of lscpu from the DNAnexus instance.

Let me know if you need additional information!

Regards,
Jeff
lscpu.txt

Chris Chang

Nov 12, 2025, 6:53:13 PM
to Jeff Kim, plink2-users
Thanks, I've replicated the bug, will try to post a fix later tonight.

Chris Chang

Nov 12, 2025, 8:10:55 PM
to Jeff Kim, plink2-users
...it turns out that I "accidentally" fixed it yesterday when I refactored out the buggy code for totally unrelated reasons.  The bugfix is now described on the website notes for the current build.

Jeff Kim

Nov 12, 2025, 8:54:54 PM
to plink2-users
Great, thank you!

Jeff