Converting VCF to PLINK format: biallelic SNP sites with possible sequencing error in reference


Jeroen Huyghe

May 22, 2015, 2:43:13 PM
to plink2...@googlegroups.com
Hi,

Not sure whether this issue has been reported previously. Because of occasional probable sequencing errors in the human reference sequence, some biallelic SNPs have two alleles listed in the ALT field of a VCF file, while the allele in the REF field is never observed. For those sites, the genotype fields in the VCF are always one of 1/1, 1/2 and 2/2; allele 0 never appears. When converting a VCF to PLINK format, PLINK always sets the A1 allele for these SNPs to the REF allele and A2 to the major allele, and sets every genotype involving the remaining allele to missing. Is there a way to change this behavior? This affects hundreds to thousands of variants in large whole-genome sequencing datasets.
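To make the pattern concrete, here is a minimal Python sketch that flags such records. It is only an illustration: the function name and the example record are made up, and it assumes a tab-delimited VCF data line with GT as the first FORMAT key.

```python
def ref_never_observed(vcf_line):
    """Return True for a two-ALT record in which no sample carries allele 0."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no sample columns, nothing to inspect
    if len(fields[4].split(",")) != 2:
        return False  # only records with exactly two ALT alleles are of interest
    # Assumption: GT is the first FORMAT key (the VCF spec requires this when GT is present).
    if fields[8].split(":")[0] != "GT":
        return False
    observed = set()
    for sample in fields[9:]:
        gt = sample.split(":")[0]
        # Accept both unphased (/) and phased (|) separators; skip missing calls.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                observed.add(allele)
    return bool(observed) and "0" not in observed


# Hypothetical record: REF=A is never called, only ALT alleles C and G are.
record = "1\t1000\trs_example\tA\tC,G\t.\tPASS\t.\tGT\t1/1\t1/2\t2/2"
print(ref_never_observed(record))  # True
```

Running this over the data lines of a VCF gives a quick count of how many sites are affected.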

Thanks!
Jeroen

Christopher Chang

May 22, 2015, 4:01:42 PM
to plink2...@googlegroups.com, jeroen....@gmail.com
No, this cannot be changed in PLINK 1.9.  Its support for the VCF ref/alt distinction is weak enough as it is; I do not want to add an option which worsens the situation.

You should either redefine the reference in the VCF file, or just throw out the variants in question (the latter can be done with --biallelic-only).
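One possible reading of "redefine the reference" is to promote the first ALT allele to REF and renumber the genotypes. The Python sketch below shows that idea only; it is not a PLINK feature, the function name is hypothetical, and it assumes SNP records with exactly two ALT alleles and GT as the first FORMAT key. It rewrites GT alone, so INFO values and any per-allele FORMAT fields would be left inconsistent, and the new REF no longer matches the reference FASTA.

```python
def promote_first_alt(vcf_line):
    """Rewrite a 'REF  ALT1,ALT2' record as 'ALT1  ALT2', recoding GT 1->0 and 2->1."""
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) != 2 or fields[8].split(":")[0] != "GT":
        raise ValueError("expected exactly two ALT alleles and GT as first FORMAT key")
    recode = {"1": "0", "2": "1", ".": "."}  # allele 0 is deliberately absent
    new_samples = []
    for sample in fields[9:]:
        parts = sample.split(":")
        sep = "|" if "|" in parts[0] else "/"
        try:
            parts[0] = sep.join(recode[a] for a in parts[0].split(sep))
        except KeyError:
            # Refuse to drop a REF allele that some sample actually carries.
            raise ValueError("allele 0 is observed; REF cannot be dropped") from None
        new_samples.append(":".join(parts))
    fields[3], fields[4] = alts[0], alts[1]
    return "\t".join(fields[:9] + new_samples)


# Hypothetical record in which REF=A is never called.
record = "1\t1000\trs_example\tA\tC,G\t.\tPASS\t.\tGT\t1/1\t1/2\t2/2"
print(promote_first_alt(record).split("\t")[3:5])  # ['C', 'G']
```

After this rewrite the record is an ordinary biallelic site with genotypes 0/0, 0/1 and 1/1.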

Jeroen Huyghe

May 22, 2015, 6:21:58 PM
to plink2...@googlegroups.com, jeroen....@gmail.com
Thanks for your reply!

Best,
Jeroen

freeseek

May 26, 2015, 3:57:38 PM
to plink2...@googlegroups.com
Jeroen, this may or may not be relevant to you, but GRCh38 fixed many of these sites where, as you describe, the reference allele is never observed. While not a full solution, moving to GRCh38 would probably take care of most of them. The other option would be to split these multi-allelic sites so that they all become biallelic, which is the format PLINK can handle. See here for more information: http://bit.ly/1sgOcuP

Jeroen Huyghe

May 27, 2015, 1:41:05 AM
to freeseek, plink2...@googlegroups.com
Thanks! Great blog post, but splitting multi-allelic sites as described there will not be very useful here. Redefining the reference allele would be useful but could cause problems in downstream analyses, so I prefer to remove these variants altogether or, better, keep them in a separate file. Unfortunately, it will likely take a while before people transition to GRCh38.
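For the "separate file" option, something along these lines would do. This is a rough Python sketch with a made-up function name; it sets aside every record with more than one ALT allele, not only those where REF is unobserved, and it assumes plain tab-delimited VCF lines.

```python
def split_by_alt_count(vcf_lines):
    """Return (biallelic, multi_alt) line lists; header lines are copied to both."""
    biallelic, multi_alt = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep both outputs valid VCFs by duplicating the header.
            biallelic.append(line)
            multi_alt.append(line)
        elif "," in line.split("\t")[4]:
            multi_alt.append(line)  # ALT column lists more than one allele
        else:
            biallelic.append(line)
    return biallelic, multi_alt


# Hypothetical three-line input: one header line and two records.
lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t1000\trs_a\tA\tC,G\t.\tPASS\t.\tGT\t1/2\n",
    "1\t2000\trs_b\tT\tC\t.\tPASS\t.\tGT\t0/1\n",
]
kept, set_aside = split_by_alt_count(lines)
print(len(kept), len(set_aside))  # 2 2
```

Each list can then be written out with `writelines`, so the biallelic file goes to PLINK and the other one is kept for later.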

Jeroen
