Background: Downstream analysis tools like regenie segfault when given pgen files that contain bcftools norm'ed (joined) multiallelic sites, and require that the data be denormalized to a pseudo-biallelic format before use. Some of these tools also require a single set of pgen files as input (not allowing, e.g., one set per chromosome). Therefore, being able to merge pgen files with pseudo-biallelic variants is of interest.
To this end, plink2 (April 15, 2022 version) does a fine job of ingesting vcf files that have been processed with bcftools to be "normalized" with multiallelics all on one line, or "denormalized" with multiallelics split into pseudo-biallelic variants on separate lines. In the latter case, the .pvar shows that these pseudo-biallelic sites remain in their expected format after ingestion.
It's much more convenient to deal with plink files than with large concatenated vcf files, so I prefer to perform this conversion on chunks of an otherwise massive vcf. At the end, after all chunks are converted to pseudo-biallelic plink files, I would like to concatenate these files for use in those downstream tools. (My files are guaranteed to have no interval overlaps, and all have the same samples in the same order, etc.)
However, at least in my testing, plink2 cannot concatenate multiple pseudo-biallelic pgen files without modifying them. Naively merging is blocked by the expected error message ("Error: The biallelic variants with ID '.' at position XXX in denorm1.pvar appear to be the components of a 'split' multiallelic variant; if so, it must be 'joined' (with e.g. "bcftools norm -m") before a correct merge can occur. If you are SURE that your data does not contain any same-position same-ID variant groups that should be joined, you can suppress this error with --multiallelics-already-joined.").
If I override the error and pass --multiallelics-already-joined (which of course is not true: these multiallelics are not joined, and that is the point), the merge will work, but at least some multiallelics get re-normalized by plink, showing up with several ALT alleles on the same line. That is, even though these sites existed as pseudo-biallelic in their converted plink files, and no files share the same sites, the multiallelic sites get at least partly re-joined during the merge process.
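For anyone who wants to check this on their own merge output, here is a minimal sketch of how I would count joined versus pseudo-biallelic records in a .pvar. It assumes an uncompressed, tab-delimited .pvar with a #CHROM header line, and it treats a comma in the ALT column as the sign of a joined multiallelic record. The function name and the example records are made up for illustration.

```python
def count_allele_layouts(pvar_lines):
    """Return (biallelic, multiallelic) record counts for .pvar text lines.

    Assumption: the .pvar has a '#CHROM ...' header line naming an ALT column.
    A record counts as multiallelic when its ALT field contains a comma.
    """
    biallelic = multiallelic = 0
    alt_col = None
    for line in pvar_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            alt_col = fields.index("ALT")  # locate ALT from the header
            continue
        if alt_col is None:
            raise ValueError("no #CHROM header line found before the first record")
        if "," in fields[alt_col]:
            multiallelic += 1
        else:
            biallelic += 1
    return biallelic, multiallelic


# Hypothetical records: one site split into two lines, one site joined.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t1000\t.\tA\tC\n",
    "1\t1000\t.\tA\tG\n",
    "1\t2000\t.\tT\tC,G\n",
]
print(count_allele_layouts(example))  # (2, 1)
```

On a real file, pass an open file handle instead of the list, e.g. count_allele_layouts(open("merged.pvar")). A nonzero second count after the merge, when every input .pvar gave zero, would confirm the re-joining described above.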
So far, my workaround is to keep everything in the usual normalized multiallelic configuration until the last possible step, where I merge all plink files, export to vcf, denormalize to pseudo-biallelic, and then re-ingest into plink. But I am wondering if in some future version of plink2 it may be possible to enforce a 'naive' concatenation that doesn't reprocess pseudo-biallelic sites. (Or if it's possible today with a different incantation.)
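To make the workaround concrete, here is a rough sketch of the four steps as command lines, built in Python so they can be printed and inspected before anything is run. The file names (chunks.txt, the "study" prefix) are placeholders, and the exact flag combinations are my assumption of a reasonable invocation, so please check them against your own plink2 and bcftools versions. The merge list is assumed to name the per-chunk filesets that still hold joined multiallelics.

```python
import shlex


def workaround_commands(merge_list, prefix):
    """Build the merge -> export -> split -> re-ingest commands as argv lists.

    merge_list and prefix are placeholder names; nothing is executed here.
    """
    merged = f"{prefix}.merged"
    split_vcf = f"{prefix}.split.vcf.gz"
    return [
        # 1. Merge the per-chunk filesets (multiallelics still joined).
        ["plink2", "--pmerge-list", merge_list, "--out", merged],
        # 2. Export the merged fileset to a bgzipped vcf.
        ["plink2", "--pfile", merged, "--export", "vcf", "bgz", "--out", merged],
        # 3. Split multiallelic records into pseudo-biallelic lines.
        ["bcftools", "norm", "-m", "-any", "-Oz", "-o", split_vcf, f"{merged}.vcf.gz"],
        # 4. Re-ingest the split vcf as the final pgen fileset.
        ["plink2", "--vcf", split_vcf, "--make-pgen", "--out", f"{prefix}.final"],
    ]


for argv in workaround_commands("chunks.txt", "study"):
    print(shlex.join(argv))  # paste into a shell, or hand argv to subprocess.run
```

The cost of this route is the round trip through one large vcf at the end, which is exactly what a naive concatenation would avoid.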
Thanks!