Hi.
I'm currently writing a simple VCF-to-BED conversion pipeline on Cromwell. I need to use a mix of PLINK 1.9 and PLINK 2: PLINK 2's multithreading comes in handy, but it doesn't support `--merge-list`.
At the moment I'm testing with the 1000 Genomes (1kg) data, and there's something that isn't quite clear to me.
In the conversion step I specify `--max-alleles 2`:
PLINK v2.00a2LM AVX2 Intel (13 Dec 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to 21.log.
Options in effect:
--allow-extra-chr
--make-bed
--max-alleles 2
--memory 30000
--out 21
--vcf /cromwell_root/thousand_genome/vcf/chrom/ALL.chr21_GRCh38.genotypes.20170504.vcf.gz
--vcf-half-call h
Start time: Wed Jan 15 12:12:00 2020
32167 MiB RAM detected; reserving 30000 MiB for main workspace.
Using up to 32 threads (change this with --threads).
--vcf: 1104028 variants scanned.
--vcf: 21-temporary.pgen + 21-temporary.pvar + 21-temporary.psam written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
21-temporary.psam.
1097776 out of 1104028 variants loaded from 21-temporary.pvar.
Note: No phenotype data present.
1097776 variants remaining after main filters.
Writing 21.fam ... done.
Writing 21.bim ... done.
Writing 21.bed ... done.
End time: Wed Jan 15 12:12:11 2020
The same happens for chromosome 20:
1802302 out of 1811146 variants loaded from 20-temporary.pvar.
Note: No phenotype data present.
1802302 variants remaining after main filters.
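For reference, the conversion step can be reconstructed from the "Options in effect" in the chromosome 21 log. This is only a minimal sketch in Python: the flags, the VCF path and the output prefix come from that log, while the executable name `plink2` and the helper name `plink2_convert_cmd` are my own placeholders.

```python
import shlex


def plink2_convert_cmd(vcf_path, out_prefix, memory_mib=30000):
    """Build the PLINK 2 argument list mirroring the logged options."""
    return [
        "plink2",  # assumed executable name; adjust to the local install
        "--vcf", vcf_path,
        "--vcf-half-call", "h",
        "--max-alleles", "2",
        "--allow-extra-chr",
        "--memory", str(memory_mib),
        "--make-bed",
        "--out", out_prefix,
    ]


cmd = plink2_convert_cmd(
    "/cromwell_root/thousand_genome/vcf/chrom/"
    "ALL.chr21_GRCh38.genotypes.20170504.vcf.gz",
    "21",
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to actually run the step; the sketch only prints it.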
Nevertheless, in the merging step, PLINK 1.9 complains about multiallelic variants:
PLINK v1.90b6.13 64-bit (30 Nov 2019) www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to 1k.log.
Options in effect:
--allow-extra-chr
--make-bed
--memory 30000
--merge-list merge_list.txt
--out 1k
32170 MB RAM detected; reserving 30000 MB for main workspace.
Error: 147 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
1k-merge.missnp.
(Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
alleles probably remain in your data. If LD between nearby SNPs is high,
--flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
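The merge step that produced the log above looks roughly like the following sketch. The flags are the ones listed under "Options in effect"; the executable name `plink`, the helper names, and the contents of `merge_list.txt` (one fileset prefix per line, here `20` and `21`) are assumptions for illustration, since the actual list file is not shown.

```python
def merge_list_text(prefixes):
    """One fileset prefix per line; PLINK 1.9 expands each to .bed/.bim/.fam."""
    return "".join(f"{prefix}\n" for prefix in prefixes)


def plink19_merge_cmd(merge_list, out_prefix, memory_mb=30000):
    """Build the PLINK 1.9 argument list mirroring the logged options."""
    return [
        "plink",  # assumed executable name for PLINK 1.9
        "--merge-list", merge_list,
        "--allow-extra-chr",
        "--memory", str(memory_mb),
        "--make-bed",
        "--out", out_prefix,
    ]


# Hypothetical list contents: the two per-chromosome filesets shown earlier.
list_text = merge_list_text(["20", "21"])
merge_cmd = plink19_merge_cmd("merge_list.txt", "1k")
print(list_text, end="")
print(" ".join(merge_cmd))
```

Writing `list_text` to `merge_list.txt` and running `merge_cmd` from the directory that holds the filesets would reproduce the call.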
However, even if I add the `--biallelic-only strict` flag, I still get the same error:
PLINK v1.90b6.13 64-bit (30 Nov 2019) www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to 1k.log.
Options in effect:
--allow-extra-chr
--biallelic-only strict
--make-bed
--memory 30000
--merge-list merge_list.txt
--out 1k
32170 MB RAM detected; reserving 30000 MB for main workspace.
Error: 147 variants with 3+ alleles present.
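In case it helps with diagnosis, here is a minimal sketch of how the flagged variants could be inspected: take the IDs listed in `1k-merge.missnp` and collect every allele code recorded for them across the `.bim` files. It assumes the standard six-column `.bim` layout (chromosome, variant ID, cM position, bp position, allele 1, allele 2) and treats `0` and `.` as missing-allele codes. The rows below are made-up toy input, not taken from the 1kg files, and the function names are hypothetical.

```python
from collections import defaultdict

MISSING = {"0", "."}  # assumed missing-allele codes


def parse_bim(lines):
    """Yield (chrom, variant_id, bp, a1, a2) from .bim lines."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, var_id, _cm, bp, a1, a2 = line.split()
        yield chrom, var_id, int(bp), a1, a2


def alleles_per_id(rows, wanted_ids):
    """Return IDs from wanted_ids that carry three or more distinct alleles."""
    seen = defaultdict(set)
    for _chrom, var_id, _bp, a1, a2 in rows:
        if var_id in wanted_ids:
            seen[var_id].update({a1, a2} - MISSING)
    return {vid: sorted(al) for vid, al in seen.items() if len(al) >= 3}


# Toy input only: one ID appearing twice with different allele pairs.
toy_bim_lines = [
    "21 toy_id 0 1000 A G",
    "21 toy_id 0 1000 T G",
    "21 other_id 0 2000 C T",
]
toy_missnp_ids = {"toy_id"}

result = alleles_per_id(parse_bim(toy_bim_lines), toy_missnp_ids)
print(result)
```

For the real data, the lines of `20.bim` and `21.bim` would be fed to `parse_bim` (for example chained with `itertools.chain`), and the ID set would be read from `1k-merge.missnp`, one ID per line.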
What am I missing, and what is the best way to proceed?
Thanks