Merging two data sets in PLINK 2 binary format


Y

Mar 10, 2021, 10:45:02 PM
to plink2-users
Hi there,

I tried to merge two data sets in PLINK 2 binary format (.pgen, .psam, .pvar) through plink2 --pmerge-list, but encountered the following error:
  • --pmerge[-list] is under development.
I read that --pmerge only handles concatenation-like jobs for now. Does this mean the current version of PLINK 2.0 cannot merge different data sets? If so, how can I merge the two data sets without using PLINK?

Additionally, the same variant may have different A1/A2 values in the two data sets. I think this is accounted for in plink1.9 --bmerge so that the merged A1/A2 allele will depend on which allele is major in the merged data. Will this be the case with plink2 --pmerge as well?

Thank you in advance for your help.

Christopher Chang

Mar 11, 2021, 12:59:19 AM
to plink2-users
1. If there are overlapping variants between the two datasets, yes, that part of --pmerge/--pmerge-list is not finished yet.  Continue using plink 1.9 or bcftools for that case for now.
2. --pmerge/--pmerge-list will be able to handle mismatched REF/ALT between different datasets, as long as any possibly-wrong REF is marked as such (.bim REF/ALT alleles are assumed to be possibly-wrong).
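
For the plink 1.9 route mentioned in point 1, a minimal sketch might look like the following. The prefixes data1, data2 and merged_bed, the DRY_RUN switch and the run helper are hypothetical names chosen for illustration. Note that the .bed format stores hard calls only, so dosages are dropped on this route.

```shell
#!/bin/sh
# Sketch only: convert two PLINK 2 filesets to PLINK 1 binary, then merge with plink 1.9.
# Assumed (hypothetical) names: data1, data2, merged_bed, DRY_RUN, run().
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only; 0 = also execute them

run() {
  # Echo the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

plink19_merge() {
  # $1, $2: input .pgen/.pvar/.psam prefixes; $3: output prefix.
  for p in "$1" "$2"; do
    # Hard-call conversion; dosage information is not kept in .bed.
    run plink2 --pfile "$p" --make-bed --out "${p}_bed"
  done
  # plink 1.9 merge of the two .bed/.bim/.fam filesets.
  run plink --bfile "${1}_bed" --bmerge "${2}_bed" --make-bed --out "$3"
}

plink19_merge data1 data2 merged_bed
```

With the default DRY_RUN=1 the script only prints the three commands, so they can be reviewed before running them with DRY_RUN=0.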

Y

Mar 11, 2021, 7:25:29 PM
to plink2-users
Thanks a lot for your reply!

So currently, if I would like to merge the two data sets (data1.pgen, ... and data2.pgen, ...) into PLINK 2.0 binary format and keep the dosage information, shall I do the following?
  1. Convert the data sets to .vcf format.
    • plink2 --pfile data1 --export vcf bgz vcf-dosage=DS-force --out data1
    • plink2 --pfile data2 --export vcf bgz vcf-dosage=DS-force --out data2
  2. Merge the .vcf.gz files using bcftools
    • This step follows this post's advice: https://www.biostars.org/p/307035/#307036
      • Not sure if the normalization is necessary in this case; maybe it will help with the "different REF/ALT alleles across data sets" problem?
    • bcftools norm -m-any data1.vcf.gz | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > data1.norm.bcf
    • bcftools norm -m-any data2.vcf.gz | bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > data2.norm.bcf
    • bcftools index data1.norm.bcf
    • bcftools index data2.norm.bcf
    • bcftools merge -Ob -m none data1.norm.bcf data2.norm.bcf > merged.bcf
  3. Convert the merged .bcf back to PLINK 2 binary format.
    • plink2 --bcf merged.bcf dosage=DS --make-pgen --out merged
I am not sure whether DS-force is the best dosage export mode for avoiding information loss. Also, when importing .bcf files, dosage=DS-force does not seem to be a valid argument, so I used dosage=DS here. Does this seem right to you? Which modes would you advise?
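
For convenience, the three steps above can be collected into one script. This is only a sketch: the commands and flags are copied unchanged from the list (including DS-force and dosage=DS, which are still open questions here), while the DRY_RUN switch, the REF_FASTA variable, and the run and merge_via_bcftools helpers are hypothetical additions.

```shell
#!/bin/sh
# Sketch only: export -> normalize -> merge -> import, as listed in steps 1-3.
# Hypothetical additions: DRY_RUN, REF_FASTA, run(), merge_via_bcftools().
set -eu

DRY_RUN="${DRY_RUN:-1}"                          # 1 = print only; 0 = also execute
REF_FASTA="${REF_FASTA:-human_g1k_v37.fasta}"    # reference used in the post

run() {
  # Takes one command line as a string so that pipes and redirections
  # are kept; it is evaluated only when DRY_RUN=0.
  echo "+ $1"
  if [ "$DRY_RUN" = "0" ]; then
    eval "$1"
  fi
}

merge_via_bcftools() {
  # $1, $2: input .pgen/.pvar/.psam prefixes; $3: output prefix.
  for p in "$1" "$2"; do
    # Step 1: export to bgzipped VCF with a DS field.
    run "plink2 --pfile $p --export vcf bgz vcf-dosage=DS-force --out $p"
    # Step 2a: split multiallelic records, then check REF against the FASTA.
    run "bcftools norm -m-any $p.vcf.gz | bcftools norm -Ob --check-ref w -f $REF_FASTA > $p.norm.bcf"
    run "bcftools index $p.norm.bcf"
  done
  # Step 2b: merge without creating new multiallelic records.
  run "bcftools merge -Ob -m none $1.norm.bcf $2.norm.bcf > $3.bcf"
  # Step 3: import the merged BCF, reading dosages from DS.
  run "plink2 --bcf $3.bcf dosage=DS --make-pgen --out $3"
}

merge_via_bcftools data1 data2 merged
```

By default the script just prints the eight commands in order; setting DRY_RUN=0 runs them, which requires plink2, bcftools and the reference FASTA to be available.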

Sorry for the lengthy post, and thanks again for your help.