--zst-decompress <pgen> on Windows


Matthew Maher

Jun 20, 2024, 9:33:42 PM
to plink2-users
Trying to help a new user get started with PLINK*, on a Windows machine. We downloaded the 1KG fileset listed in the "Resources" section. It describes how PLINK2 provides a --zst-decompress convenience feature for those who don't have a zst decompressor. We used that, and it appeared to work. But when we tried to actually use the file, PLINK2 reported that the pgen is not valid (see log below).

Upon inspection, I noticed that the decompressed pgen's filesize was approximately double the size of the same decompression operation done on a Linux box.  So I moved the pgen aside (added '.plink2' suffix), downloaded the standalone zstd.exe tool, and re-executed the decompress with that. Now the decompressed filesize is as expected, and the subsequent command now works.

Maybe there's some Windows-specific requirement I'm missing (it's a foreign land to me), or maybe there's a Windows-specific issue with --zst-decompress?

Any info much appreciated.    And thanks for PLINK* - great tools.

Here are the filesizes of the decompressed pgen via zstd and plink2 --zst-decompress:
-a----         6/20/2024   2:09 PM     9530881334 all_hg38.pgen
-a----         6/20/2024   4:27 PM    19309023026 all_hg38.pgen.plink2

PLINK v2.00a5.11 64-bit (26 May 2024)          www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --allow-extra-chr
  --chr 1-22,X,Y
  --freq
  --memory 6000
  --missing
  --pfile all_hg38

Start time: Thu Jun 20 18:10:33 2024
8065 MiB RAM detected; reserving 6000 MiB for main workspace.
Using up to 4 compute threads.
3202 samples (1603 females, 1599 males; 2583 founders) loaded from
all_hg38.psam.
73627150 out of 75193455 variants loaded from all_hg38.pvar.
Error: all_hg38.pgen is not a .pgen file (first two bytes don't match the magic
number).

End time: Thu Jun 20 18:27:58 2024
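
The error above refers to the first two bytes of the file, so inspecting them directly is a quick way to see what the decompression step actually wrote. A minimal sketch, assuming the .pgen magic number is the byte pair 0x6c 0x1b (that value is not stated in this thread); the helper names are hypothetical and are not part of PLINK:

```python
# Assumption: a valid .pgen file begins with the two bytes 0x6c 0x1b.
PGEN_MAGIC = b"\x6c\x1b"


def leading_bytes(path, count=2):
    """Return the first `count` bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return handle.read(count)


def looks_like_pgen(path):
    """True if the file starts with the assumed .pgen magic number."""
    return leading_bytes(path) == PGEN_MAGIC
```

For example, `leading_bytes("all_hg38.pgen").hex()` prints the two bytes in hexadecimal so they can be compared between the zstd.exe output and the plink2 output.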

Christopher Chang

Jun 21, 2024, 10:54:54 AM
to plink2-users
Hmm, I was not able to replicate this with either the AVX2 or 64-bit build on a Windows test machine.
1. (optional) Is there a smaller instance of faulty .zst decompression you can replicate?
2. Assuming you can replicate this at all, would you be able to run a sequence of debug plink2 builds?

Matthew Maher

Jun 21, 2024, 11:47:21 AM
to Christopher Chang, plink2-users
I'll try decompressing files of various sizes to see whether I notice anything illuminating, or at least find a smaller failing case. Stay tuned...

If nothing turns up, debug builds would be no problem.

Thanks



Matthew Maher

Jun 23, 2024, 11:01:08 AM
to plink2-users
First, FWIW, my (admittedly old) Windows test box says it is:  Windows 10 Pro version 22H2 

I reduced the variant set in stages and even with just a single variant, the issue persists - after compressing a pgen with the zstd.exe tool, and then decompressing with PLINK2, the result is a file twice the expected size and deemed not a valid pgen. 

I've attached the fileset (all 3 parts). 

I'm happy to run debug builds if need be.   Thanks for investigating. 


PS C:\Users\Matthew\test> .\plink2 --pfile .\all_hg38_maf01_first1K --snp rs1264112204 --make-pgen --out 1var

PLINK v2.00a5.11 64-bit (26 May 2024)          www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1var.log.
Options in effect:
  --make-pgen
  --out 1var
  --pfile .\all_hg38_maf01_first1K
  --snp rs1264112204

Start time: Sun Jun 23 10:43:23 2024
8065 MiB RAM detected; reserving 4032 MiB for main workspace.

Using up to 4 compute threads.
3202 samples (1603 females, 1599 males; 2583 founders) loaded from
.\all_hg38_maf01_first1K.psam.
1000 variants loaded from .\all_hg38_maf01_first1K.pvar.
2 categorical phenotypes loaded.
--snp: 1 variant remaining.
1 variant remaining after main filters.
Writing 1var.psam ... done.
Writing 1var.pvar ... done.
Writing 1var.pgen ... done.
End time: Sun Jun 23 10:43:23 2024
PS C:\Users\Matthew\test> .\zstd.exe 1var.pgen
1var.pgen            : 85.38%   (   253 B =>    216 B, 1var.pgen.zst)
PS C:\Users\Matthew\test> mv 1var.pgen 1var.pgen.orig
PS C:\Users\Matthew\test> .\plink2.exe --zst-decompress 1var.pgen.zst >  1var.pgen
PS C:\Users\Matthew\test> dir 1var*


    Directory: C:\Users\Matthew\test


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         6/23/2024  10:43 AM            771 1var.log
-a----         6/23/2024  10:44 AM            530 1var.pgen
-a----         6/23/2024  10:43 AM            253 1var.pgen.orig
-a----         6/23/2024  10:43 AM            216 1var.pgen.zst
-a----         6/23/2024  10:43 AM          81022 1var.psam
-a----         6/23/2024  10:43 AM         206542 1var.pvar


PS C:\Users\Matthew\test> .\plink2.exe --pfile 1var --freq

PLINK v2.00a5.11 64-bit (26 May 2024)          www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --freq
  --pfile 1var

Start time: Sun Jun 23 10:45:16 2024
8065 MiB RAM detected; reserving 4032 MiB for main workspace.

Using up to 4 compute threads.
3202 samples (1603 females, 1599 males; 2583 founders) loaded from 1var.psam.
1 variant loaded from 1var.pvar.
Error: 1var.pgen is not a .pgen file (first two bytes don't match the magic
number).

End time: Sun Jun 23 10:45:16 2024
Attachments: 1var.psam, 1var.pvar, 1var.pgen

Christopher Chang

Jun 23, 2024, 11:21:35 AM
to plink2-users
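Okay, the problem is that the output redirected with ">" is being converted to UTF-16 by Windows. I will look for a way to disable this; in the meantime, replacing "> 1var.pgen" with "1var.pgen" (i.e. passing the output filename as a second argument, ".\plink2.exe --zst-decompress 1var.pgen.zst 1var.pgen") should work, and I will modify the documentation accordingly.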
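
To illustrate why a UTF-16 conversion produces both symptoms reported in this thread (roughly doubled size and a magic-number mismatch), here is a rough simulation. It assumes the shell decodes each output byte as one character and writes UTF-16LE with a byte-order mark. Latin-1 stands in for whatever code page the console really uses, and 0x6c 0x1b is an assumed .pgen magic number. The function name is hypothetical.

```python
# Assumptions (not taken from the thread): .pgen magic is 0x6c 0x1b, and the
# redirect writes UTF-16LE text preceded by the byte-order mark 0xff 0xfe.
PGEN_MAGIC = b"\x6c\x1b"
UTF16_LE_BOM = b"\xff\xfe"


def simulate_utf16_redirect(raw):
    """Approximate a binary stream after being re-encoded as UTF-16 text."""
    # Latin-1 maps every byte to exactly one character; the real console
    # code page may differ, so this is only a stand-in.
    text = raw.decode("latin-1")
    return UTF16_LE_BOM + text.encode("utf-16-le")


original = PGEN_MAGIC + bytes(range(8))   # 10 bytes of made-up payload
converted = simulate_utf16_redirect(original)

print(len(original), len(converted))      # 10 22
print(converted[:2] == PGEN_MAGIC)        # False
```

Under these assumptions each byte becomes two, and the file now starts with the byte-order mark instead of the magic number. The real shell may also alter line endings, which would explain why the observed sizes (253 bytes versus 530 bytes) are not exactly in this ratio.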

Matthew Maher

Jun 23, 2024, 11:23:52 AM
to plink2-users
yup - that resolves it.   Thanks!

Juan David Quintero Giraldo

Jun 24, 2024, 10:55:26 AM
to chrch...@gmail.com, plink2...@googlegroups.com