I'm currently looking into the PGEN file format, and I'm confused about something. In the PLINK 2 File Format Specification Draft
(https://github.com/chrchang/plink-ng/tree/master/pgen_spe), you specify the following:
- If the variant record is of PLINK 1’s type, the four categories are encoded in a packed array of 2-bit values as 0
= double-ALT, 1 = missing, 2 = heterozygous REF-ALT, and 3 = homozygous-REF. Otherwise, the 2-bit encoding
is 0 = homozygous-REF, 1 = heterozygous REF-ALT, 2 = double-ALT, and 3 = missing, and the bottom three bits
of the variant record type indicates what compression is used on the resulting packed array.
As I understand it, this means that two bits represent one of the four categories, and the bottom three bits of the variant record type give the type of compression.
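To make my reading of the packed array concrete, here is a minimal sketch of how I would unpack it. The function and table names are my own (hypothetical), and I am assuming that the lowest two bits of each byte hold the first sample and that no compression has been applied to the array; neither assumption comes from the passage quoted above.

```python
# Category tables taken from the quoted passage; index = 2-bit value.
PLINK1_CODES = ("double-ALT", "missing", "heterozygous REF-ALT", "homozygous-REF")
OTHER_CODES = ("homozygous-REF", "heterozygous REF-ALT", "double-ALT", "missing")


def unpack_2bit(packed, sample_count):
    """Return the raw 2-bit values (0..3) for the first sample_count samples.

    Assumption: four samples per byte, lowest-order bit pair first.
    """
    values = []
    for i in range(sample_count):
        byte = packed[i // 4]      # each byte holds four 2-bit values
        shift = 2 * (i % 4)        # position of this sample inside the byte
        values.append((byte >> shift) & 0b11)
    return values


def decode_categories(packed, sample_count, plink1_type):
    """Map the raw 2-bit values to category names using the quoted tables."""
    table = PLINK1_CODES if plink1_type else OTHER_CODES
    return [table[v] for v in unpack_2bit(packed, sample_count)]


if __name__ == "__main__":
    # 0xE4 = 0b11100100, i.e. the 2-bit values 0, 1, 2, 3 from the low end.
    print(decode_categories(bytes([0xE4]), 4, plink1_type=False))
    print(decode_categories(bytes([0xE4]), 4, plink1_type=True))
```

If this is right, the two bits per sample live in the variant record's data, not in the record type itself, but that is exactly the part I am unsure about.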
However, you also specify the following:
- Bits 0-3 of the twelfth byte indicate how variant-record types and lengths are stored. Interpreting them as a
single number in 0..15, these are the meanings:
– 0: 4 bits per record type, 1 byte per record length.
– 1: 4 bits per record type, 2 bytes per record length.
– 2: 4 bits per record type, 3 bytes per record length.
– 3: 4 bits per record type, 4 bytes per record length.
– 4: 8 bits per record type, 1 byte per record length.
– 5: 8 bits per record type, 2 bytes per record length.
– 6: 8 bits per record type, 3 bytes per record length.
– 7: 8 bits per record type, 4 bytes per record length.
This means that the record type can be stored using only 4 bits in some cases.
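For reference, this is how I am interpreting the quoted list in code. It is only a sketch: the function name is hypothetical, it takes the value of the twelfth byte as an argument instead of reading a file, and it covers only modes 0 to 7 because those are the ones quoted above.

```python
def record_storage_layout(twelfth_byte):
    """Return (bits per record type, bytes per record length) for modes 0..7."""
    mode = twelfth_byte & 0x0F          # bits 0-3 of the byte
    if mode > 7:
        # Modes 8..15 are not part of the excerpt quoted above.
        raise ValueError("mode %d is not covered by the quoted list" % mode)
    type_bits = 4 if mode < 4 else 8    # modes 0-3: 4 bits, modes 4-7: 8 bits
    length_bytes = (mode & 3) + 1       # 1, 2, 3 or 4 bytes per record length
    return type_bits, length_bytes


if __name__ == "__main__":
    for mode in range(8):
        print(mode, record_storage_layout(mode))
```

So in modes 0 to 3 the whole record type has to fit in 4 bits, three of which are already described as indicating the compression.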
Therefore, it is unclear to me which two bits indicate the four categories.