PLINK2 merge


Matthew Maher

Sep 2, 2022, 4:56:58 PM
to plink2-users
I know that PLINK2's "full-powered" merge is a future capability and that it is currently limited to "concatenation jobs", which seems to refer to the case where all inputs have the same samples but disjoint sets of variants. Unless I've done something wrong, I don't believe it also applies to the other dimension, i.e. the case where all inputs have the same variants but disjoint sets of samples. Is that correct? If so, are there any known or recommended tricks that might help accomplish that, without having to export to VCF, call bcftools, and re-import the VCF? And if that VCF round trip is the current recommendation, I'm curious whether the PLINK2 Python API (which I haven't tried yet, but I'm game...) has sufficient functionality to support this specific use case (same variants, disjoint samples).

Thanks for any info and thanks for PLINK2

Christopher Chang

Sep 2, 2022, 5:04:03 PM
to plink2-users
I'd expect BCF-round-trip to be quite efficient for this use case.

MGru

Jan 17, 2023, 3:12:52 AM
to plink2-users
Hi, 
Jumping on this thread. What did you mean here by "BCF round trip": a specific function in bcftools, or just exporting with --export / --recode vcf and merging once in that format?

In addition, is there a fast way, other than bcftools, to do the "non-concatenating" merge (different samples, same variants) on large-scale data? Could Plink 2 be used to filter out the indels/multiallelics, then merge the rest with Plink 1.9 and convert to VCF, use Plink 2 to convert the indels/multiallelics to VCF, and then merge via bcftools? Is this what you meant by round trip?

Thanks! 

Christopher Chang

Jan 17, 2023, 12:11:35 PM
to plink2-users
By "BCF round trip", I mean using "--export bcf", followed by performing some operations with bcftools, followed by --bcf to import the result.  (Though in this case you'd need to do a bit of additional work at the end to import sex/pedigree info; normally you can use --psam with --bcf to reuse your old .psam file, when your bcftools operations don't change the set of samples.)
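
The round trip described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names run and bcf_round_trip, the DRY_RUN switch, and the fileset prefixes are all hypothetical, the inputs are assumed to be PLINK 1 filesets, and bcftools merge is left on its default settings (its -m option controls how records sharing a position are combined and may need adjusting for your data).

```shell
#!/bin/bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical sketch of a BCF round trip.
# Usage: bcf_round_trip OUT_PREFIX BATCH_PREFIX [BATCH_PREFIX ...]
bcf_round_trip() {
    local out=$1
    shift
    local bcfs=()
    local prefix
    for prefix in "$@"; do
        # Export each PLINK 1 fileset to BCF, then index it for bcftools merge.
        run plink2 --bfile "$prefix" --export bcf --out "$prefix"
        run bcftools index "$prefix.bcf"
        bcfs+=("$prefix.bcf")
    done
    # Combine the samples of all batches into one compressed BCF.
    run bcftools merge -Ob -o "$out.bcf" "${bcfs[@]}"
    # Re-import; sex/pedigree info still has to be re-attached afterwards.
    run plink2 --bcf "$out.bcf" --make-pgen --out "$out"
}
```

For example, `DRY_RUN=1 bcf_round_trip chr1_merged chr1_batch1 chr1_batch2` prints the six commands without running anything, which is a cheap way to review them before a long job.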

Yes, you can consider plink 1.9 merge for this.  If you're willing to do a small amount of low-level coding, you can also roll your own solution based on plink2's "--export ind-major-bed" command and its ability to re-import sample-major .bed files.  (Hmm, I should benchmark that approach against plink 1.9 and bcftools, and if there's at least a 10x difference, I'll autodetect and implement it in plink 2.0 in the near future.  Sorry about not looking at that earlier, I had previously assumed that bcftools merge was already "good enough" for that case.)
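
For the plink 1.9 route, a minimal sketch might look like the one below. The function names write_merge_list and merge_with_plink19 and all file names are hypothetical; it assumes the plink 1.9 binary is installed as "plink" and relies on --merge-list accepting one binary-fileset prefix per line.

```shell
#!/bin/bash
# Hypothetical helper: write one fileset prefix per line to a merge list.
# Usage: write_merge_list LIST_FILE PREFIX [PREFIX ...]
write_merge_list() {
    local list=$1
    shift
    printf '%s\n' "$@" > "$list"
}

# Hypothetical wrapper: merge the first fileset with every fileset in the list.
# Usage: merge_with_plink19 FIRST_PREFIX LIST_FILE OUT_PREFIX
merge_with_plink19() {
    local first=$1 list=$2 out=$3
    plink --bfile "$first" --merge-list "$list" --make-bed --out "$out"
}
```

A call such as `write_merge_list merge_list.txt chr1_batch{2..25}` followed by `merge_with_plink19 chr1_batch1 merge_list.txt chr1_merged` would cover the 25-batch layout.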

Christopher Chang

Jan 21, 2023, 12:59:08 PM
to plink2-users
Benchmarked this; only a ~5x speed difference, so I won't rush this feature.  Instead, I'll post the shell script I tested, which you should be able to adapt to your use case if it lines up:

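#!/bin/bash
set -exuo pipefail
# Convert every batch to a sample-major ("individual-major") .bed fileset.
for i in {1..25}
do
    plink2 --bfile "chr1_batch${i}" --export ind-major-bed --out "chr1_batch${i}_t"
done
# Start the merged fileset from batch 1 (this keeps its 3-byte .bed header).
cp chr1_batch1_t.bed chr1_merged_t.bed
cp chr1_batch1_t.bim chr1_merged_t.bim
cp chr1_batch1_t.fam chr1_merged_t.fam
for i in {2..25}
do
    # Skip the 3-byte .bed header, then append the per-sample genotype rows.
    gtail -c +4 "chr1_batch${i}_t.bed" >> chr1_merged_t.bed
    # Append the matching sample lines in the same order.
    cat "chr1_batch${i}_t.fam" >> chr1_merged_t.fam
done
# Re-import the sample-major fileset and write a regular variant-major one.
plink2 --bfile chr1_merged_t --make-bed --out chr1_merged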

This script merges "chr1_batch1", "chr1_batch2", ..., "chr1_batch25" PLINK 1-formatted datasets, assuming identical .bim files and disjoint sample IDs.  Note the use of "gtail" (GNU tail); on my Mac test machine, this was much faster than the preinstalled "tail" program.
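
Since the script relies on those two assumptions, a pre-flight check along the following lines may be worth running first. It is a sketch rather than part of the benchmarked script: check_batches and expected_bed_size are hypothetical helper names. The size formula reflects the sample-major .bed layout, a 3-byte header followed by one row of ceil(variants/4) bytes per sample, which is also why "tail -c +4" is enough to strip the header before appending.

```shell
#!/bin/bash
# Hypothetical pre-flight check: identical .bim files, disjoint sample IDs.
# Usage: check_batches PREFIX [PREFIX ...]
check_batches() {
    local first=$1 prefix dup
    for prefix in "$@"; do
        # Byte-for-byte comparison of each variant list against the first one.
        if ! cmp -s "$first.bim" "$prefix.bim"; then
            echo "variant mismatch: $prefix.bim differs from $first.bim" >&2
            return 1
        fi
    done
    # FID and IID are the first two .fam columns; any repeat is a clash.
    dup=$(for prefix in "$@"; do awk '{print $1, $2}' "$prefix.fam"; done | sort | uniq -d)
    if [ -n "$dup" ]; then
        echo "duplicate sample IDs: $dup" >&2
        return 1
    fi
}

# Expected byte size of a sample-major .bed file:
# 3 header bytes plus ceil(n_variants / 4) bytes per sample.
# Usage: expected_bed_size N_VARIANTS N_SAMPLES
expected_bed_size() {
    local n_variants=$1 n_samples=$2
    echo $(( 3 + n_samples * ((n_variants + 3) / 4) ))
}
```

Running `check_batches chr1_batch{1..25}` before the merge, and comparing the size of chr1_merged_t.bed against `expected_bed_size` (with the line counts of the merged .bim and .fam) afterwards, should catch most mistakes in the concatenation.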