question of "--keep-allele-order"

3,052 views
Skip to first unread message

Yiyi Ma

unread,
Aug 31, 2015, 12:01:51 PM8/31/15
to plink2-users
Hi plink 2 developers,
Could you help me to clarify some of my confusions of the argument of "--keep-allele-order" in the plink 2.0 version?

I am planning to convert the vcf file to the plink file using plink2.0 version. My exact version should be PLINK v1.90b3g.

The code I provide is as follows:
plink --vcf test.vcf.gz --keep-allele-order --double-id --set-missing-var-ids @:#\$1,\$2 --make-bed --out test

My thought for the argument of "--keep-allele-order" is to keep the ref allele in the vcf as the A1 in plink while the alt allele in the vcf file as the A2 in the plink file. In this case, I can obviate the automatically assignment of the A1/A2 in the plink file to minor/major allele.

However, this is not the case for the role of "--keep-allele-order", in which the output bim file still get different A1/A2 allele designation as the original vcf file.

Can you let me know what is the exact function of the "--keep-allele-order" and what should we do to get the right allele designation for our future analysis using plink?

Thank you!

Best,
Yiyi



Christopher Chang

unread,
Aug 31, 2015, 12:12:05 PM8/31/15
to plink2-users
--vcf + --keep-allele-order normally keeps the ref allele as A2, not A1.  Is there any reason you need it to be A1 instead?

(If you do need to force it to A1, take a look at the --a1-allele flag.)

Yiyi Ma

unread,
Aug 31, 2015, 12:51:00 PM8/31/15
to plink2-users
Thank you Christopher for your answer. It is OK if all the ref alleles in vcf file is set to A1 consistently. It does not matter whether it is A1 or A2, as long as they are consistent. But this is the not case right now after I used "--keep-allele-order" argument. Some ref are set to A2 while some others are set to A1 in the output bim file. It seems that the "--keep-allele-order" does not do the job to force the ref/alt allele in the vcf to the plink file. Is it correct? Or has I missed any other arguments during the first conversion?

Thanks!
Yiyi

Christopher Chang

unread,
Aug 31, 2015, 12:57:39 PM8/31/15
to plink2-users
Can you send me an example VCF (and the .log file for your run) where this behaves inconsistently?  Ref should always be set to A2 when you use --vcf + --keep-allele-order.

Janete Chung

unread,
Sep 10, 2015, 4:52:47 AM9/10/15
to plink2-users
Hi Chirstopher,

I am also having this problem with some of my variants. The variants that has high MAF the ref allele stay at A1 position.  I am running plink v1.90b3v 64-bit on window.

plink --vcf exomeGATK.vcf  --no-sex --no-parents --no-fid --keep-allele-order --make-bed --out exomeGATK

PLINK v1.90b3v 64-bit (15 Jul 2015)
Options in effect:
  --keep-allele-order
  --make-bed
  --no-fid
  --no-parents
  --no-sex
  --out exomeGATK.vcf
  --vcf exomeGATK.vcf

Hostname: CELEGANS-JC
Working directory: C:\Data_shared_with_VM\plink1.9
Start time: Thu Sep 10 10:52:57 2015

Random number seed: 1441875177
5959 MB RAM detected; reserving 2979 MB for main workspace.
--vcf: exomeGATK.vcf-temporary.bed + exomeGATK.vcf-temporary.bim +
exomeGATK.vcf-temporary.fam written.
324623 variants loaded from .bim file.
192 people (0 males, 0 females, 192 ambiguous) loaded from .fam.
Ambiguous sex IDs written to exomeGATK.vcf.nosex .
192 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 192 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.200238.
324623 variants and 192 people pass filters and QC.
Phenotype data is quantitative.
--make-bed to exomeGATK.vcf.bed + exomeGATK.vcf.bim + exomeGATK.vcf.fam ...
done.

End time: Thu Sep 10 10:53:07 2015

Best

Janete

Christopher Chang

unread,
Sep 10, 2015, 10:35:40 AM9/10/15
to plink2-users
Hi,

Can you send me a dataset to replicate this issue with?  Thanks.

Abel Chang

unread,
Mar 2, 2016, 2:29:42 AM3/2/16
to plink2-users
Mr. Chang,

I have to say, I do need the ref allele as A1. Do you know how to solve this problem?

在 2015年9月1日星期二 UTC+8上午12:12:05,Christopher Chang写道:

Christopher Chang

unread,
Mar 2, 2016, 3:58:43 AM3/2/16
to plink2-users
I stated that the --a1-allele flag lets you control the A1 allele.

Abel Chang

unread,
Mar 2, 2016, 4:02:23 AM3/2/16
to plink2-users
I just want to convert VCF files to BED/BIM/FAM format, while keeping REF and ALT unchanged. So, of course there are a lot of users like me need it to be A1

If I use --a1-allele, plink will throw out this:  Error: Duplicate ID '.' ,because most of the IDs in my VCF files are '.'

If I synthesize REF and ALT to make a unique IDs and use them to replace '.', then plink will throw out this: Error: Variant names are limited to 16000 characters.

You might say "you'll need to limit the length of the variant IDs", but actually, the long IDs are synthesized by REF and ALT to avoid duplicate ID.  If I limit the length of the variant IDs, plink will throw out this:  Error: Duplicate ID '.' . In other words, plink is in a dilemma.

Now, you know how ridiculous the situation is !


If I limit the length of the variant IDs, plink will throw out this:  Error: Duplicate ID '.' . In other words, it seems that plink is in a dilemma.


在 2015年9月1日星期二 UTC+8上午12:12:05,Christopher Chang写道:
--vcf + --keep-allele-order normally keeps the ref allele as A2, not A1.  Is there any reason you need it to be A1 instead?

Christopher Chang

unread,
Mar 2, 2016, 4:07:21 AM3/2/16
to plink2-users
Again, you are not synthesizing the variant IDs properly.  You take e.g. the first 23 letters of the REF and ALT, not the entire allele code (which could be millions of characters and makes the ID totally unusable).  This should be done with the VCF, before trying to import to plink.

(It is actually possible to do it with plink 1.9's --set-missing-var-ids flag, but it's a slightly unforgiving procedure.)

Abel Chang

unread,
Mar 2, 2016, 4:13:19 AM3/2/16
to plink2-users
Yes, I understand. But the variants have absolutely the same chromosome and position, because they are split from one multi-allelic variant.
If I limit the length of IDs to 23 letters, then 'duplicate' again......

在 2016年3月2日星期三 UTC+8下午5:07:21,Christopher Chang写道:

Abel Chang

unread,
Mar 3, 2016, 4:23:37 AM3/3/16
to plink2-users
Hello, everybody! I have solved this problem by writing a simple python script. I would like to share my solution here.

Problem
How can plink be used to convert VCF files to BED/BIM/FAM format, while keeping REF and ALT unchanged ?

Solution
1. Synthesize a new ID for each variant by CHR:POS:REF:ALT
Because in a VCF file, a lot of IDs are just '.' and will cause plink to throw out error like this: Error: Duplicate ID '.' 


2. Calculate MD5 of each synthesized ID and replace the 'ID' column of the VCF files with MD5
Plink cannot deal with very long IDs. So MD5 of IDs is a perfect alternate, because MD5 is either short and unique(as long as the synthesized ID is uniq). If you directly use the synthesized ID, then plink will throw out this error if the synthesized ID is too long
Error: Variant names are limited to 16000 characters.

Here is my python script for your reference.
import md5
...
new_id=chr+':'+pos+':'+ref+':'+alt
md5_id=md5.new(new_id).hexdigest()
...
You may want to truncate REF and ALT to short strings to avoid the error above. But you cannot guarantee the uniqueness of truncated IDs, because some variants are split from a multi-allelic variant and they have the same CHR and POS. And, if the truncated IDs are not uniqe, plink will throw out this error again: Error: Duplicate ID ... 


3. Add three parameters to plink and run it(input.vcf is the VCF file you want to convert):
--vcf input.vcf
--keep-allele-order
--a1-allele input.vcf 4 3 '#'


4. Restore the original IDs after conversion.
You need to record the corresponding relation between each MD5 and i'ts original ID in step 2, and restore the original IDs after conversion. I will not give my script here because it's not hard to implement.

By the way, I suggest the author of plink to modify plink's codes and add an appropriate parameter to avoid this problem, because my solution is just for emergency usage after all.

Frankly speaking, it's kind of ridiculous to reverse REF and ALT in my view. I admit there may be strong reasons for doing so, but there are also a lot of users like me simply want to convert VCF files with REF and ALT unchanged.


在 2016年3月2日星期三 UTC+8下午5:02:23,Abel Chang写道:

Christopher Chang

unread,
Mar 3, 2016, 10:06:42 AM3/3/16
to plink2-users
Yes, MD5 is a good solution when even ~20 REF/ALT bases is not enough.

I have already stated that REF/ALT will not be reversed by PLINK 2.0.  The issue is that numerous existing scripts depend on PLINK 1's automatic determination of major/minor alleles (note that formats like .ped and .tped don't even specify REF vs. ALT at all), so I could not justify breaking backward compatibility in 1.9.

Ciaran O'Flynn

unread,
Jul 11, 2017, 12:30:27 PM7/11/17
to plink2-users
Has a solution been resolved for this yet? 

I am going through exactly the same problem. 

Christopher Chang

unread,
Jul 11, 2017, 1:58:36 PM7/11/17
to plink2-users
PLINK 2.0 keeps track of REF and ALT when importing VCF files.
Reply all
Reply to author
Forward
0 new messages