PGEN format specifications


Gloria Benoit

Jun 3, 2025, 8:59:50 AM
to plink2-users
Hi!
I'm currently looking into the PGEN file format, and I'm confused about something. In the PLINK 2 File Format Specification Draft (https://github.com/chrchang/plink-ng/tree/master/pgen_spe), you specify the following:
  • If the variant record is of PLINK 1’s type, the four categories are encoded in a packed array of 2-bit values as 0
    = double-ALT, 1 = missing, 2 = heterozygous REF-ALT, and 3 = homozygous-REF. Otherwise, the 2-bit encoding
    is 0 = homozygous-REF, 1 = heterozygous REF-ALT, 2 = double-ALT, and 3 = missing, and the bottom three bits
    of the variant record type indicates what compression is used on the resulting packed array.
So, as I understand it, each 2-bit value represents one of the four categories, and the bottom three bits of the variant record type indicate the type of compression.
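To make the two mappings in the quoted bullet concrete, here is a minimal Python sketch that unpacks an uncompressed 2-bit array and labels each value. It assumes the values are packed low-order bits first within each byte (four samples per byte, as in PLINK 1 .bed files); the function and table names are hypothetical and not part of any PLINK library, and the compression selected by the record type is not handled here.

```python
# Category tables copied from the quoted spec text, indexed by the 2-bit value.
PLINK1_CODES = ("double-ALT", "missing", "heterozygous REF-ALT", "homozygous-REF")
PLINK2_CODES = ("homozygous-REF", "heterozygous REF-ALT", "double-ALT", "missing")


def unpack_2bit(packed: bytes, sample_ct: int) -> list[int]:
    """Return one 2-bit value (0..3) per sample.

    Assumption: sample i lives in byte i // 4, at bit offset 2 * (i % 4),
    i.e. the first sample occupies the two lowest bits of the first byte.
    """
    values = []
    for i in range(sample_ct):
        byte = packed[i // 4]
        shift = 2 * (i % 4)
        values.append((byte >> shift) & 0b11)
    return values


def decode_categories(packed: bytes, sample_ct: int, plink1: bool = False) -> list[str]:
    """Map each 2-bit value to its category name under the chosen encoding."""
    table = PLINK1_CODES if plink1 else PLINK2_CODES
    return [table[v] for v in unpack_2bit(packed, sample_ct)]


if __name__ == "__main__":
    # 0b11100100 holds the values 0, 1, 2, 3 from the lowest bits upward.
    example = bytes([0b11100100])
    print(decode_categories(example, 4))
    print(decode_categories(example, 4, plink1=True))
```

The same byte yields different categories depending on which table applies, which is why it matters whether a record uses the PLINK 1 or the PLINK 2 encoding.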

However, you also specify the following:
  • Bits 0-3 of the twelfth byte indicate how variant-record types and lengths are stored. Interpreting them as a
    single number in 0..15, these are the meanings:
    – 0: 4 bits per record type, 1 byte per record length.
    – 1: 4 bits per record type, 2 bytes per record length.
    – 2: 4 bits per record type, 3 bytes per record length.
    – 3: 4 bits per record type, 4 bytes per record length.
    – 4: 8 bits per record type, 1 byte per record length.
    – 5: 8 bits per record type, 2 bytes per record length.
    – 6: 8 bits per record type, 3 bytes per record length.
    – 7: 8 bits per record type, 4 bytes per record length.
So the record type can be stored using only 4 bits in some cases.
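For reference, the list above can be sketched as a small lookup in Python. This only covers the values 0 through 7 quoted here; other values are rejected rather than guessed at. The names are hypothetical, and the twelfth header byte is assumed to have already been read as an integer. The last helper simply extracts the bottom three bits of a record type, which the spec text quoted earlier associates with the compression choice.

```python
# (bits per record type, bytes per record length) for the modes quoted above.
STORAGE_MODES = {
    0: (4, 1),
    1: (4, 2),
    2: (4, 3),
    3: (4, 4),
    4: (8, 1),
    5: (8, 2),
    6: (8, 3),
    7: (8, 4),
}


def parse_storage_mode(twelfth_byte: int) -> tuple[int, int]:
    """Interpret bits 0-3 of the twelfth header byte as a storage mode."""
    mode = twelfth_byte & 0x0F  # keep only bits 0-3
    if mode not in STORAGE_MODES:
        # Modes outside 0..7 are not described in the quoted excerpt.
        raise ValueError(f"storage mode {mode} is not covered by this sketch")
    return STORAGE_MODES[mode]


def compression_bits(record_type: int) -> int:
    """Return the bottom three bits of a variant record type."""
    return record_type & 0b111


if __name__ == "__main__":
    type_bits, length_bytes = parse_storage_mode(0x05)
    print(type_bits, length_bytes)
    print(compression_bits(0b1101))
```

Note that a 4-bit record type still has room for its bottom three bits; the 2-bit category values are a separate thing and live in the variant record's packed array, not in the record type.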

It is therefore unclear to me which two bits encode the four categories.

Chris Chang

Jun 3, 2025, 9:26:34 AM
to Gloria Benoit, plink2-users
Unless the entire file uses PLINK 1 encoding (that special case is grandfathered in so that PLINK 2 doesn’t need to perform .bed -> .pgen conversions all the time; see also the PLINK 1 .bed definition), the PLINK 2 category encoding is used, and the record types in the header describe how the packed 2-bit category arrays are compressed.
