Concatenate plink files with pseudo-biallelic multiallelic sites

James Pirruccello

Apr 17, 2022, 10:12:38 AM
to plink2-users
Background: Downstream analysis tools like regenie segfault when provided with pgen files that contain bcftools norm'ed multiallelic sites, and require that the data be denormalized to a pseudo-biallelic format before use. Some of these tools also require that they operate on a single set of pgen files (not allowing, e.g., one per chromosome). Therefore, being able to merge pgen files with pseudo-biallelic variants is of interest.

To this end, plink2 (April 15, 2022 version) does a fine job of ingesting vcf files that have been processed with bcftools to be "normalized" with multiallelics all on one line, or "denormalized" with multiallelics split into pseudo-biallelic variants on separate lines. In the latter case, the .pvar shows that these pseudo-biallelic sites remain in their expected format after ingestion.

It's much more convenient to deal with plink files than with large concatenated vcf files, so I prefer to perform this conversion on chunks of an otherwise massive vcf. At the end, after all chunks are converted to pseudo-biallelic plink files, I would like to concatenate these files for use in those downstream tools. (My files are guaranteed to have no interval overlaps, and all have the same samples in the same order, etc.)

However, at least in my testing, plink2 cannot concatenate multiple pseudo-biallelic pgen files without modifying them. Naively merging is blocked by the expected error message ("Error: The biallelic variants with ID '.' at position XXX in denorm1.pvar appear to be the components of a 'split' multiallelic variant; if so, it must be 'joined' (with e.g. "bcftools norm -m") before a correct merge can occur. If you are SURE that your data does not contain any same-position same-ID variant groups that should be joined, you can suppress this error with --multiallelics-already-joined.").

If I override the error and pass --multiallelics-already-joined (which of course is not true, these multiallelics are not joined and that is the point), the merge will work but at least some multiallelics get re-normalized by plink, showing up with several variants on the same line. That is, even though these sites existed as pseudo-biallelic in their converted plink files, and no files share the same sites, the multiallelic sites get somewhat normalized during the merge process.

So far, my workaround is to keep everything in the usual normalized multiallelic configuration until the last possible step, where I merge all plink files, export to vcf, denormalize to pseudobiallelic, and then re-ingest into plink. But I am wondering if in some future version of plink2 it may be possible to enforce a 'naive' concatenation that doesn't reprocess pseudo-biallelic sites. (Or if it's possible today with a different incantation.)

Thanks!





Christopher Chang

Apr 17, 2022, 10:27:16 AM
to plink2-users
If you don't want the pieces of a split multiallelic variant to be (incorrectly) merged by --pmerge-list, give those pieces distinct IDs.

I'm very unlikely to add a naive-concatenation special case that has fundamentally different semantics than regular --pmerge-list, since "bcftools concat" exists, and conversion to and from .vcf.gz/.bcf is highly optimized.
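
To illustrate the "distinct IDs" suggestion above, here is a minimal Python sketch that rewrites the ID column of .pvar lines as CHROM:POS:REF:ALT, so that the pieces of a split multiallelic variant no longer share an ID. It assumes a plain-text (not zstd-compressed), tab-delimited .pvar with the default #CHROM, POS, ID, REF, ALT column order; the function name assign_distinct_ids is hypothetical and not part of plink2. plink2's own --set-all-var-ids flag (for example with the template @:#:$r:$a) should achieve the same result without editing the file by hand.

```python
def assign_distinct_ids(pvar_lines):
    """Return .pvar lines with the ID column replaced by CHROM:POS:REF:ALT.

    Assumption: tab-delimited lines whose first five columns are
    CHROM, POS, ID, REF, ALT (the default .pvar layout).
    """
    out = []
    for line in pvar_lines:
        line = line.rstrip("\n")
        # Keep header lines ("##..." and "#CHROM...") and blank lines unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, _old_id, ref, alt = fields[:5]
        # Pseudo-biallelic pieces share CHROM/POS/REF but differ in ALT,
        # so including ALT makes each piece's ID unique.
        fields[2] = f"{chrom}:{pos}:{ref}:{alt}"
        out.append("\t".join(fields))
    return out


# Two pieces of a split multiallelic site, both with the placeholder ID ".".
example = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t.\tA\tC",
    "1\t1000\t.\tA\tG",
]
for row in assign_distinct_ids(example):
    print(row)
```

With IDs like these, the two rows at position 1000 are no longer a same-position same-ID group, which is the condition the merge error message describes.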

Christopher Chang

Apr 17, 2022, 10:39:49 AM
to plink2-users
For what it's worth, plink2 does have a currently-undocumented command for splitting multiallelic variants: "--make-pgen multiallelics=-".  (The intention is to document this when the corresponding join command is also implemented.)  So the supported plink2-only workflow is to merge first, then use this command to split the multiallelic variants.
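
The merge-first, split-last workflow described above could be sketched as two plink2 invocations. The Python snippet below only builds and prints the command lines; it does not run plink2. The file names (chunks.txt, merged, merged_split) and the helper name merge_then_split_commands are hypothetical, and it assumes the merge list contains one pfile prefix per line. The "multiallelics=-" modifier is taken directly from the post above and was undocumented at the time of writing.

```python
import shlex


def merge_then_split_commands(merge_list, merged_prefix, split_prefix):
    """Build the two plink2 command lines: merge first, then split.

    Assumption: merge_list is a text file naming one pfile prefix per line,
    and the per-chunk filesets still hold joined multiallelic variants.
    """
    # Step 1: merge the per-chunk filesets while multiallelics are still joined.
    merge_cmd = ["plink2", "--pmerge-list", merge_list, "--out", merged_prefix]
    # Step 2: split multiallelic variants in the merged fileset.
    split_cmd = [
        "plink2",
        "--pfile", merged_prefix,
        "--make-pgen", "multiallelics=-",
        "--out", split_prefix,
    ]
    return [merge_cmd, split_cmd]


# Print the commands for inspection; pass each list to subprocess.run to execute.
for cmd in merge_then_split_commands("chunks.txt", "merged", "merged_split"):
    print(shlex.join(cmd))
```

Keeping the split as the final step means the intermediate filesets stay in the joined form that --pmerge-list expects.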

James Pirruccello

Apr 17, 2022, 10:56:40 AM
to plink2-users
Thanks! I just tried that and it works perfectly for my needs. This makes it much easier to keep things in the proper format during analysis/processing, and then emit pseudobiallelic sites only when needed by downstream tools.