Transfer plink format to vcf format: Warning: At least one VCF allele code violates the official sp


Shicheng Guo

Apr 4, 2018, 12:54:26 AM
to plink2-users
Hi All, 

I want to convert PLINK ped/map files to VCF so that I can do haplotype phasing later.

$ plink --bfile SH --keep SH.input --recode vcf --out SH

However, after the command, I get the warning: 

Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file.  (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or
represent a breakend.)

What happened here? How can I avoid this warning?

Thanks. 




PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to S_Hebbring_Unr.Guo.log.
Options in effect:
  --bfile S_Hebbring_Unr
  --keep S_Hebbring_Unr.Guo.Schroid.7913.input
  --out S_Hebbring_Unr.Guo
  --recode vcf

32112 MB RAM detected; reserving 16056 MB for main workspace.
550601 variants loaded from .bim file.
8648 people (4050 males, 4592 females, 6 ambiguous) loaded from .fam.
Ambiguous sex IDs written to S_Hebbring_Unr.Guo.nosex .
--keep: 7913 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 7913 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 3901 het. haploid genotypes present (see S_Hebbring_Unr.Guo.hh ); many
commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.993333.
550601 variants and 7913 people pass filters and QC.
Note: No phenotypes present.
--recode vcf to S_Hebbring_Unr.Guo.vcf ... done.
Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file.  (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or
represent a breakend.)

Christopher Chang

Apr 4, 2018, 10:26:05 AM
to plink2-users
This says some of your alleles are coded in a VCF-incompatible way; you’ll need to find the nonconforming allele codes and choose a different representation.

What’s the output of “head SH.bim”?
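
One way to locate the nonconforming codes is to scan the two allele columns (5 and 6) of the .bim file against the rules quoted in the warning. The following is a minimal sketch, not a PLINK feature: the function name find_bad_alleles is hypothetical, the breakend check is only a rough test for square brackets, and treating '0' and '.' as acceptable is an assumption based on PLINK using '0' as a placeholder for an allele that is absent from the data.

```shell
# List variants whose allele codes would not be valid in a VCF.
# Prints: variant ID, allele 1, allele 2.
find_bad_alleles() {
  awk '
    # Return 1 if allele code "a" matches one of the forms named in the warning.
    function ok(a) {
      if (a == "0" || a == ".") return 1             # assumption: missing-allele placeholder
      if (a == "*") return 1                         # isolated star
      if (substr(a, 1, 1) == "<") return 1           # symbolic allele such as <DEL>
      if (index(a, "[") || index(a, "]")) return 1   # rough check for breakend notation
      return (a ~ /^[ACGTNacgtn]+$/)                 # plain base string
    }
    !ok($5) || !ok($6) { print $2, $5, $6 }
  ' "$1"
}
```

Calling it as "find_bad_alleles SH.bim" would print one line per offending variant; empty output suggests that the allele codes are not the cause.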

Isabel Hostettler

Jun 12, 2018, 7:20:43 AM
to plink2-users
I have the exact same problem. What are conforming and nonconforming allele codes, respectively, and what different representation should they be given instead without making the data even more wrong? Is there a command to find those?

I get the error message when running, e.g.:

./plink --bfile plink.postqc.Whiteonly.ch1 --recode vcf bgz --out White_chr1


Thanks a lot for any help.

Isabel

Isabel Hostettler

Jun 12, 2018, 7:28:13 AM
to plink2-users
PS: Below is the output of: head plink.postqc.Whiteonly.ch1.bim


dyn901-227:plink isimee$ head plink.postqc.Whiteonly.ch1.bim
1 GSA-rs114420996 0 58814 A G
1 GSA-rs9283150 0 565508 G A
1 GSA-1:726912 0 726912 G A
1 GSA-rs116587930 0 727841 A G
1 rs3131972 0 752721 T C
1 rs12567639 0 756268 0 A
1 GSA-rs114525117 0 759036 A G
1 rs12127425 0 794332 A G
1 GSA-rs79373928 0 801536 G T
1 rs28444699 0 830181 0 A
dyn901-227:plink isimee$


On Wednesday, April 4, 2018 at 05:54:26 UTC+1, Shicheng Guo wrote:

Isabel Hostettler

Jun 12, 2018, 7:48:06 AM
to plink2-users
Update: I have used:

./plink --bfile plink.postqc.Whiteonly.ch1 --recode vcf-iid bgz --out White_ch1_TEST


But it doesn't work either; I get the same warning again. Attached is a larger extract from the file plink.postqc.Whiteonly.ch.bim. Could the seq-1 lines or the MVH lines be a problem?


On Wednesday, April 4, 2018 at 05:54:26 UTC+1, Shicheng Guo wrote:
Screen Shot 2018-06-12 at 12.46.16.png

Isabel Hostettler

Jun 12, 2018, 7:49:05 AM
to plink2-users

dyn901-227:plink isimee$ ./plink --bfile plink.postqc.Whiteonly.ch1 --recode vcf-iid bgz --out White_ch1_TEST
PLINK v1.90b5.3 64-bit (21 Feb 2018)           www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to White_ch1_TEST.log.
Options in effect:
  --bfile plink.postqc.Whiteonly.ch1
  --out White_ch1_TEST
  --recode vcf-iid bgz

8192 MB RAM detected; reserving 4096 MB for main workspace.
51349 variants loaded from .bim file.
2104 people (1146 males, 958 females) loaded from .fam.
2104 phenotype values loaded from .fam.
Using up to 4 threads (change this with --threads).
Before main variant filters, 2104 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995335.
51349 variants and 2104 people pass filters and QC.
Among remaining phenotypes, 872 are cases and 1232 are controls.
--recode vcf-iid bgz to White_ch1_TEST.vcf.gz ... done.
Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file.  (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or
represent a breakend.)


On Wednesday, April 4, 2018 at 05:54:26 UTC+1, Shicheng Guo wrote:

Christopher Chang

Jun 12, 2018, 9:05:05 AM
to plink2-users
The lines with D/I allele codes are the problem. One way to filter them out is “--snps-only just-acgt”.
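
For example, that filter can be applied in the same run as the conversion. This is a minimal sketch: the function name recode_snps_only and the output prefix in the usage line are hypothetical, and the PLINK variable is only a convenience for pointing at a local binary such as ./plink. Note that this drops indels and other non-ACGT variants from the output rather than recoding them.

```shell
# Convert a binary fileset to VCF, keeping only variants whose alleles are
# single A/C/G/T characters (or the missing code).
# $1 = input fileset prefix, $2 = output prefix.
# Set PLINK (e.g. PLINK=./plink) if the binary is not on PATH.
recode_snps_only() {
  "${PLINK:-plink}" --bfile "$1" --snps-only just-acgt --recode vcf --out "$2"
}
```

A call would look like "recode_snps_only SH SH_snps".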

Fedik Rahimov

Jan 3, 2020, 10:51:58 AM
to plink2-users
Hi Chris

I am getting a similar error.

My ped files represent SNP alleles as "A T G C" and "-" for indels. --make-bed converts everything correctly to binary files.

When I run "plink --bfile [filename prefix] --recode vcf --out [VCF prefix]", I get the same error.

On closer inspection, I notice that variants with missing genotypes coded as "0" in the original ped file are entirely skipped in the VCF file, but indels coded as "-" are correctly converted.

I tried "--snps-only just-acgt" and the error message did disappear, but variants with missing genotypes still were not converted to the VCF file.

Is there a modifier for --recode to handle missing genotypes? Since "0" is the default value for missing genotypes in PLINK files, I thought --recode would handle these by default as well. But I am losing a lot of good variants during conversion to VCF because a few samples have missing genotypes.

Thank you,

Fedik 

Christopher Chang

Jan 3, 2020, 11:06:09 AM
to plink2-users
I'm a bit confused by the question here; can you post a short example of what's going wrong, with .log file and VCF included?

Fedik Rahimov

Jan 3, 2020, 11:24:21 AM
to plink2-users
I am sorry for the confusion. I was using the filtered binary file, where variants with missing genotypes had already been filtered out. When I converted the original ped file to VCF, the variants with missing genotypes were converted as well. All good; I could delete the post.
That being said, I am attaching the log file, but I still cannot figure out which issues with my variants lead to this error message. I do not have "I" or "D" for indels. As I mentioned in my post, indels are represented as "-" or "TT", and SNPs have the typical "A, T, G, C". What are some other VCF-incompatible variant representations that can give this error message? I am really sorry that I cannot post the VCF file due to company policies.

Screen Shot 2020-01-03 at 10.14.58 AM.png

Christopher Chang

Jan 3, 2020, 11:29:58 AM
to plink2-users
indel='-' is not part of the VCF specification; you'd need to either spell them out or switch to codes like "<INS>" and "<DEL>".
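
A rough sketch of the code-switching route: build a file for PLINK's --update-alleles (columns: variant ID, old allele 1, old allele 2, new allele 1, new allele 2) from the .bim file. The function name make_allele_update is hypothetical, and the mapping of "-" and "D" to "<DEL>" and of "I" to "<INS>" is an assumption that has to be checked against how the upstream data defines those codes. Spelling the alleles out instead needs the reference sequence and is not attempted here.

```shell
# Write --update-alleles lines for variants that use "-", "D", or "I".
# Output columns: variant ID, old allele 1, old allele 2, new allele 1, new allele 2.
make_allele_update() {
  awk '
    function recode(a) {
      if (a == "-" || a == "D") return "<DEL>"   # assumed mapping
      if (a == "I") return "<INS>"               # assumed mapping
      return a                                   # leave every other code untouched
    }
    {
      n1 = recode($5); n2 = recode($6)
      # Only variants that actually change are written out.
      if (n1 != $5 || n2 != $6) print $2, $5, $6, n1, n2
    }
  ' "$1"
}
```

The result could then be applied with something like "make_allele_update SH.bim > SH.update" followed by "plink --bfile SH --update-alleles SH.update --make-bed --out SH_recoded" (file names are placeholders). Symbolic alleles are only defined for the ALT column in the VCF specification, so it is worth checking which allele ends up as REF before handing the file to a phasing tool.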

Fedik Rahimov

Jan 3, 2020, 11:34:27 AM
to plink2-users
That must be the reason. I will switch the codes. Thank you for your help. Much appreciated.

Sander W. van der Laan

Mar 14, 2024, 12:42:46 PM
to plink2-users
I have this issue too. 

What does '0' or '-' in the `.bim` file mean?

Thanks

Christopher Chang

Mar 14, 2024, 7:56:19 PM
to plink2-users
'0' should imply an allele that never appears in the dataset (analogous to VCF ALT='.').  As for '-', you will need to look upstream; plink does not introduce that allele code.
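
To see which variants carry the '0' placeholder, the allele columns of the .bim file can be checked directly. A minimal sketch; the function name list_zero_alleles is hypothetical:

```shell
# Print variant ID and both allele codes for every .bim line where
# either allele column (5 or 6) is the "0" placeholder.
list_zero_alleles() {
  awk '$5 == "0" || $6 == "0" { print $2, $5, $6 }' "$1"
}
```

Replacing "0" with "-" in the comparison gives the same listing for the '-' code.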