Troubleshooting multiallelic variant merging issue


Nik Tz

Oct 18, 2023, 6:30:29 AM
to plink2-users
Hello,

I want to recode the IIDs of imputed .bgen data into two different filesets and then merge them (I'm working on eye-level analyses with Regenie). As I'm only interested in dosages, I've converted the data to .pgen using PLINK 2 (ref-first, as this is UK Biobank data):

plink2 --bgen data.bgen ref-first --sample data.sample --update-ids recoded_ids_a.txt --make-pgen --out recoded_file_a

plink2 --bgen data.bgen ref-first --sample data.sample --update-ids recoded_ids_b.txt --make-pgen --out recoded_file_b

However, at the merging step, I run into the following error:
plink2 \
           --pfile recoded_file_a \
           --pmerge \
            recoded_file_b.pgen \
            recoded_file_b.pvar \
            recoded_file_b.psam \
           --out merged_files
The biallelic variants with ID 'x' at position x:x in recoded_file_a.pvar appear to be the components of a 'split' multiallelic variant; if so, it must be 'joined' (with e.g. "bcftools norm -m").

As discussed previously on this forum (https://groups.google.com/g/plink2-users/c/fVF9LGK1A0w): "if I override the error and pass --multiallelics-already-joined (which of course is not true, these multiallelics are not joined and that is the point), the merge will work but at least some multiallelics get re-normalized by plink, showing up with several variants on the same line."

I was hoping to use the undocumented option Chris mentioned for splitting multiallelic variants, --make-pgen multiallelics=-. However, I run into an error:

plink2 --bgen data.bgen ref-first --sample data.sample --update-ids recoded_ids_a.txt --make-pgen multiallelics=- --out recoded_file_a
Error: --bgen accepts at most 3 arguments.

I also tried dropping the multiallelics from my existing recoded pgen files:

plink2 --pfile recoded_file_a --make-pgen multiallelics=- --out modified_recoded_file_a
Error: Multiallelic dosages aren't supported yet.

And yet my log says "Note: All variants are biallelic; nothing to split." Please see the log attached below.

Would appreciate advice on how to get past this step. Is there a workaround in PLINK?

If not, as I'm only interested in biallelic variants, how can I set multiallelic dosages to missing, or remove these SNPs altogether, with PLINK or another tool?

Many thanks, Nik

PLINK v2.00a3.1LM 64-bit Intel (19 May 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to imp_mod_a_21.log.
Options in effect:
  --make-pgen multiallelics=-
  --out imp_mod_a_21
  --pfile recoded_file_a
Start time: x
31629 MiB RAM detected; reserving 15814 MiB for main workspace.
Using up to 4 compute threads.
487409 samples (264222 females, 222938 males, 249 ambiguous; 487409 founders)
loaded from recoded_file_a.psam.
1261158 variants loaded from recoded_file_a.pvar.
Note: No phenotype data present.
Writing modified_recoded_file_a.psam ... done.
Note: All variants are biallelic; nothing to split.
Error: Multiallelic dosages aren't supported yet.
Writing modified_recoded_file_a.pvar ... done.
End time: x

Nik Tz

Oct 18, 2023, 7:05:06 AM
to plink2-users
The context for this is: "Downstream analysis tools like regenie segfault when provided with pgen files that contain bcftools norm'ed multiallelic sites, and require that the data be denormalized to a pseudo-biallelic format before use." https://groups.google.com/g/plink2-users/c/fVF9LGK1A0w

Nik Tz

Oct 18, 2023, 7:17:35 AM
to plink2-users
Also, in the UK Biobank: where multi-allelic variants exist in these data, they have been split into a series of bi-allelic variants. This implies that several variants may share the same genomic position but with different alternative alleles. https://enkre.net/cgi-bin/code/bgen/wiki/?name=BGEN+in+the+UK+Biobank 
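
Given that layout, one way to see which records are affected is to list every variant whose position occurs more than once in the .pvar. This is only a minimal sketch: it assumes an uncompressed .pvar whose first three columns are CHROM, POS and ID, and the helper name list_shared_position_ids and the output file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs of all variants whose CHROM:POS occurs
# more than once in a .pvar file (candidate 'split' multiallelic records).
# Assumes columns 1-3 are CHROM, POS, ID and that the file is not compressed.
list_shared_position_ids() {
  local pvar="$1"
  # The file is read twice: the first pass counts each CHROM:POS,
  # the second pass prints the ID of every record at a repeated position.
  awk '
    /^#/ { next }                          # skip header and meta lines
    NR == FNR { seen[$1 ":" $2]++; next }  # first pass: count positions
    seen[$1 ":" $2] > 1 { print $3 }       # second pass: report IDs
  ' "$pvar" "$pvar" | sort -u
}

# Example usage (file names follow the thread; output names are hypothetical):
#   list_shared_position_ids recoded_file_a.pvar > shared_position_ids.txt
#   plink2 --pfile recoded_file_a --exclude shared_position_ids.txt \
#          --make-pgen --out recoded_file_a_biallelic
```

Note that --exclude works by variant ID, so if the split components share one ID (as the error message suggests), a single entry in the list drops all of them.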

Christopher Chang

Oct 18, 2023, 8:09:40 AM
to plink2-users
This error message was recently changed to the following:
"Error: The biallelic variants with ID 'x' at position x:x in x appear to be the components of a 'split' multiallelic variant; if so, it must be 'joined' (with e.g. "bcftools norm -m") before a correct merge can occur. Alternatively, you can keep the variants separate by assigning them distinct IDs; unless you have very long indels, adding --set-all-var-ids to your merge command is a simple way to do this."

--set-all-var-ids + merge works properly in alpha 6.

If you're not interested in using that, you're on your own.
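
For readers following along, the suggestion above might look like the sketch below. The ID template '@:#:$r:$a' (chromosome:position:ref:alt) is an assumption chosen for illustration and is not specified in this thread; the file names are the ones from the original post, and build_merge_cmd is a hypothetical helper. It only prints the command so that it can be reviewed before being run, and per the reply above it should be tried with alpha 6 or later.

```shell
#!/usr/bin/env bash
# Sketch: compose a --pmerge command that also assigns distinct variant IDs,
# so the split multiallelic components are kept as separate records.
# Assumption: the template '@:#:$r:$a' yields CHROM:POS:REF:ALT style IDs.
# The caveat about very long indels from the reply above still applies.
build_merge_cmd() {
  local first="$1" second="$2" out="$3"
  # The template is kept in single quotes so the shell leaves $r and $a alone.
  printf '%s\n' \
    "plink2 --pfile $first --pmerge $second.pgen $second.pvar $second.psam --set-all-var-ids '@:#:\$r:\$a' --out $out"
}

# Print the command for the two filesets from the original post.
build_merge_cmd recoded_file_a recoded_file_b merged_files
```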

Nik Tz

Oct 20, 2023, 4:16:28 PM
to plink2-users
Thanks very much, Chris. In the end I removed the multiallelic variants and re-ran --pmerge as above on the two filesets (identical other than the individual IDs, which are coded as, e.g., 123456_a in one and 123456_b in the other). However, this only merged the .pvar and .psam files; there is no .pgen output.

The .pgen input filesets are definitely there. Any suggestions please? Thanks in advance for your time. Log attached below.

Downloading files using 8 threads
+ [[ '' == '' ]]
+ eval 'plink2          --pfile "/mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_a_20"          --pmerge          "/mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pgen"          "/mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pvar"          "/mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.psam"          --out imp_merged_20'
++ plink2 --pfile /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_a_20 --pmerge /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pgen /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pvar /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.psam --out imp_merged_20

PLINK v2.00a3.1LM 64-bit Intel (19 May 2022)   www.cog-genomics.org/plink/2.0/
Error: Non-concatenating --pmerge-list is under development.

(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to imp_merged_20.log.
Options in effect:
  --out imp_merged_20
  --pfile /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_a_20
  --pmerge /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pgen /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.pvar /mnt/project/imp-manipulation/recoded-imp-pgens-biallelic/mod_imp_b_20.psam
Start time: Fri Oct 20 17:14:19 2023
63500 MiB RAM detected; reserving 31750 MiB for main workspace.
Using up to 8 compute threads.
--pmerge: 974818 samples present.
--pmerge: Merged .psam written to imp_merged_20.psam .
--pmerge: 2 .pvar files scanned.
End time: Fri Oct 20 17:14:38 2023
+ set +x
uploading file: /home/dnanexus/out/out/imp_merged_20.log -> /imp_merged_20.log
uploading file: /home/dnanexus/out/out/imp_merged_20.psam -> /imp_merged_20.psam
uploading file: /home/dnanexus/out/out/imp_merged_20.pvar -> /imp_merged_20.pvar
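
In the output above, the "Error:" line is interleaved with the version banner and only the .psam and .pvar were uploaded, so the failure is easy to miss. A minimal sketch of a post-run check follows. It assumes the job should stop when any of the three output files is absent or empty, or when the .log contains a line starting with "Error:"; check_pfile_outputs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that a PLINK 2 run produced a full fileset.
# Returns non-zero if <prefix>.pgen/.pvar/.psam is missing or empty,
# or if <prefix>.log exists and contains a line starting with "Error:".
check_pfile_outputs() {
  local prefix="$1" ext status=0
  for ext in pgen pvar psam; do
    if [ ! -s "$prefix.$ext" ]; then
      echo "missing or empty: $prefix.$ext" >&2
      status=1
    fi
  done
  if [ -f "$prefix.log" ] && grep -q '^Error:' "$prefix.log"; then
    echo "error reported in $prefix.log" >&2
    status=1
  fi
  return "$status"
}

# Example usage after the merge step shown in the log:
#   check_pfile_outputs imp_merged_20 || exit 1
```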

Christopher Chang

Oct 21, 2023, 4:10:34 PM
to plink2-users

Nik Tz

Oct 21, 2023, 5:20:40 PM
to plink2-users
Thank you