Handling duplicate variant IDs

4,796 views
Skip to first unread message

Colm O'Dushlaine

unread,
Feb 10, 2014, 2:01:03 PM2/10/14
to plink2...@googlegroups.com
Loving plink1.9!

Quick question, if I do the following to keep either the SNP id or a positional id (if no SNP id), is there a simple way I can keep both indels and SNPs? Getting duplicate markers. I know I can remove these, but if an indel overlaps a SNP I actually probably want to keep these I think.

plink --bfile myfile --set-missing-var-ids ^:#   ...

Thanks,

Colm



freeseek

unread,
Feb 10, 2014, 2:36:49 PM2/10/14
to plink2...@googlegroups.com
GATK will genotype SNPs and indels as different variants, even if they have the same position in the VCF file. I think it is due to the internal behavior of GATK due to the fact that SNPs with the same position have the same dbSNP rsID, but not SNP and indels with the same position. It would be nice if it was possible to have a code to indicate if a variant is a SNP or an indel, so that different names are assigned for each variant.

Christopher Chang

unread,
Feb 10, 2014, 6:15:23 PM2/10/14
to plink2...@googlegroups.com
Hmm.  If this can only happen with one SNP and one indel at the same position, I'll address this by allowing you to specify two template strings instead of one (one for SNPs and one for indels).  Do you need something more general than that (i.e. something analogous to a file being saved as "file.txt(1)" if "file.txt" already exists)?

freeseek

unread,
Feb 10, 2014, 6:28:16 PM2/10/14
to plink2...@googlegroups.com
I think on a first approximation being able to define something as a SNP or an INDEL would be effective for most case scenarios. However, I do not believe the VCF specification forbids a variant to be a mix of SNP and INDEL (but GATK never generates that), so you might want to have a template string for the mixed scenario as well (I mean something were, for example, REF is A and ALT is AC,G). I think it would be also useful if this was mentioned in the help, as most people (my previous self included) are unaware of this issue until they see the errors in the plink files.

Christopher Chang

unread,
Feb 10, 2014, 6:50:30 PM2/10/14
to plink2...@googlegroups.com
Okay, I'll just make a SNP vs. non-SNP distinction for now.  Mixed SNP/indel variants have to wait until v2.0 adds support for more than two alleles per site, of course.

Colm O'Dushlaine

unread,
Feb 11, 2014, 2:38:52 PM2/11/14
to plink2...@googlegroups.com
That's perfect, thanks! Let me know when you get a chance to implement it. Many thanks for the prompt reply
.

Christopher Chang

unread,
Feb 11, 2014, 4:59:43 PM2/11/14
to plink2...@googlegroups.com
It's in the 11 February builds (--set-missing-snp-ids, --set-missing-nonsnp-ids).  --set-missing-var-ids has also had its behavior changed: it errors out if there are two consecutive unnamed variants with the same position.

Colm O'Dushlaine

unread,
Feb 11, 2014, 6:22:23 PM2/11/14
to plink2...@googlegroups.com
Works like a dream, THANK YOU!

freeseek

unread,
Apr 13, 2014, 1:23:36 AM4/13/14
to plink2...@googlegroups.com
It seems to me that BCF files with missing variant IDs are processed in a differential way from their VCF counterparts. Here an example:

(echo "##fileformat=VCFv4.1"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t69428\t.\tT\tG\t.\t.\t.\tGT\t0/0") > plink.vcf

If I run:
plink --vcf plink.vcf --set-missing-var-ids @:#[b37] --make-bed
I get:
--make-bed to plink.bed + .bim + .fam... done.

But if I use --bcf instead of --vcf as follows:
bcftools view -Ob test.vcf | plink --bcf /dev/stdin --set-missing-var-ids @:#[b37] --make-bed
I get:
Error: Missing variant ID.

Somehow plink in bcf mode refuses to process the site if the variant ID is missing, but only if the format is BCF.

Christopher Chang

unread,
Apr 13, 2014, 1:53:05 AM4/13/14
to plink2...@googlegroups.com
Just added a patch for this to the 13 April development build; let me know if you still have any problems.

Genomeo Dev

unread,
Oct 24, 2014, 5:43:40 AM10/24/14
to plink2...@googlegroups.com
Hi,

Is it possible to alter the behaviour of --set-missing-var-ids so that it can take a general expression such as @:#a1:a2 which would insure that new variant ID is unique? I am doing some clumping using 1000 genomes VCF files which have lots of missing IDs, and I largely trust the quality of the variant calling for that (hence I am ignoring such issue for now) and just want to go ahead and do some clumping for a list of variants with P-values from GWAS.

Thanks!

G.

Christopher Chang

unread,
Oct 24, 2014, 1:25:13 PM10/24/14
to plink2...@googlegroups.com
Have you tried using --set-missing-snp-ids and --set-missing-nonsnp-ids in combination?  Let me know if that is no longer sufficient for 1000 Genomes data.

freeseek

unread,
Oct 27, 2014, 11:18:13 AM10/27/14
to plink2...@googlegroups.com
I think what Genomeo Dev is referring to is variants that have the same position and are all SNPs or all indels and can only be distinguished by the alternate allele. I have also started to face the same issue. To handle multi-allelic sites, I now split them first with bcftools and then convert them using plink2, but the naming becomes an issue as there is no way using plink2 to assign different names to variants after they are split.

Christopher Chang

unread,
Oct 27, 2014, 1:18:48 PM10/27/14
to plink2...@googlegroups.com
Oh, that's how triallelic SNPs and the like are now represented?  Okay, I'll add support for this later today.  "$0" will probably represent the ref/A2 allele, and "$1" the alt/A1 allele (this should be more forward-compatible than $2/$1); I hope nobody needs dollar signs in their variant names?

Christopher Chang

unread,
Oct 28, 2014, 12:40:03 AM10/28/14
to plink2...@googlegroups.com
Well, I've looked at this more closely, and concluded that it doesn't really belong in PLINK 1.9, for two reasons:

1. PLINK 1 does not automatically retain REF/ALT allele assignments; instead, it normally checks what is the major allele *in the immediate dataset*, and swaps the A1/A2 allele assignments when appropriate.  This makes it way too easy to end up with different names for the same variant if you're trying to include the alt allele in the variant name.

2. In contrast, a short sed/awk-using shell script acting directly on the VCF can do this in a consistent manner.

I will make sure PLINK 2.0 does have distinct "reference allele" and "major allele" concepts, and extend --set-missing-var-ids there.

freeseek

unread,
Oct 29, 2014, 11:57:26 AM10/29/14
to plink2...@googlegroups.com
However, even if PLINK does not automatically retain REF/ALT allele assignments, the naming takes place coming from a VCF file, so there would never be a confusion here, as the VCF always has REF first and ALT second. I just wouldn't use the "@:#a1:a2" syntax but rather the "@:#ref:alt" syntax. As of now I do use a short sed/awk shell script to assign names, but this ends up being the bottleneck of the conversion for very large BCF files.

Christopher Chang

unread,
Oct 29, 2014, 1:26:25 PM10/29/14
to plink2...@googlegroups.com
Since --set-missing-var-ids is not a VCF/BCF-specific operation, the order of operations is

1. convert VCF/BCF to PLINK binary format
2. set missing IDs

This keeps complexity under control (the --set-missing-var-ids logic is all in one place), so it's the way almost everything is done in PLINK, but it has the no-REF/ALT drawback.

Since this is actually a performance bottleneck for you, I can add a 1.9-only --bcf-set-missing-var-ids flag which handles REF/ALT and acts during the BCF -> PLINK conversion.  (To avoid future script breakage, this will become an alias for --set-missing-var-ids in PLINK 2.0.)

Christopher Chang

unread,
Oct 29, 2014, 1:50:06 PM10/29/14
to plink2...@googlegroups.com
One warning: this won't work on very long indels (there's a hard variant ID length limit of ~16k characters, and a soft limit of ~80 if you don't want horrible memory consumption).  Do you want a fallback naming scheme for them, or do you already have a way to avoid that problem?

freeseek

unread,
Oct 29, 2014, 2:00:18 PM10/29/14
to plink2...@googlegroups.com
I did not know that. Interesting. How do these alleles get currently converted for A1 and A2? Maybe using whatever they are currently converted to would be okay.

Christopher Chang

unread,
Oct 29, 2014, 2:56:26 PM10/29/14
to plink2...@googlegroups.com
A1 and A2 *are* permitted to be millions of characters; it's only the variant ID that is limited to ~16k characters.  Since PLINK errors out on longer variant IDs, it sounds like the datasets you've processed with your current naming script don't have any very long indels.

freeseek

unread,
Oct 29, 2014, 4:00:20 PM10/29/14
to plink2...@googlegroups.com
Don't worry about my particular case too much. I don't really need this option at the moment. I was just trying to hint for solutions to a problem that might be common to others.

freeseek

unread,
Nov 19, 2014, 6:16:55 PM11/19/14
to plink2...@googlegroups.com
Another easy solution is to have the two alleles A1 and A2 sorted in alphabetical order, so that no matter which allele is the minor allele, the assigned name will be the same, and there would be no issues when merging two plink files coming from VCF files. If something like this gets implemented, I would love to include it in best practices to covert VCF files to plink files using bcftools and plink.

Christopher Chang

unread,
Nov 19, 2014, 6:26:43 PM11/19/14
to plink2...@googlegroups.com
Great idea, sorting does solve the problem.

Okay, I'll actually implement this now, and to address the long indel issue, I'll limit the number of allele name characters used in the new variant ID to, say, 16 by default (while making this value adjustable).


On Wednesday, November 19, 2014 3:16:55 PM UTC-8, freeseek wrote:

Christopher Chang

unread,
Nov 20, 2014, 3:19:54 AM11/20/14
to plink2...@googlegroups.com
This is implemented in the 20 Nov development build.  (I've also removed the --set-missing-snp-ids and --set-missing-nonsnp-ids flags, since they're rendered obsolete by the improved --set-missing-var-ids.)

freeseek

unread,
Nov 24, 2014, 11:51:51 AM11/24/14
to plink2...@googlegroups.com
It turns out that this does not work. There is an intrinsic design issue with plink and indels. Suppose you have two variants in a VCF file like these:
22 16404838 . G GA
22 16404838 . GA G
These are two different variants: one is an insertion and one is a deletion. However, the "--set-missing-var-ids @:#[b37]\$1,\$2" command will attempt to assign the same name to both. The issue is even more deeper than related to how to assign names. For indels, it is sometimes not possible to understand which allele is the reference allele, as either could match the reference. This is a typical issue for homopolymer indels.

Christopher Chang

unread,
Nov 24, 2014, 1:13:57 PM11/24/14
to plink2...@googlegroups.com
Those are legitimately different variants?!

Well, since plink now errors out and gives the line number here, the user can assign variant IDs and perhaps even tweak the allele names manually.  If you have a recommendation as to how that should be done, I'll add it to the documentation.

freeseek

unread,
Nov 24, 2014, 1:25:33 PM11/24/14
to plink2...@googlegroups.com
On Monday, November 24, 2014 1:13:57 PM UTC-5, Christopher Chang wrote:
Those are legitimately different variants?!

Yes, they are indeed different variants as is impossible to determine from the alleles alone if the reference is the deletion allele or the reference allele. There is nothing in my view that can be done once the data is moved in plink format and this bit of information is lost. The way indels are defined in plink is just non-specific. Here is what I would do to circumvent the issue with the VCF and assign names before the conversion to plink:
awk -F"\t" -v OFS="\t" '$3=="." {$3="chr"$1"_"$2"_b37_"$4"_"$5} {print}'
It is unfortunate though, as this causes quite a bottleneck in handling the VCF and forces the use of the non-binary format which is quite slower.

freeseek

unread,
Nov 25, 2014, 4:50:55 PM11/25/14
to plink2...@googlegroups.com
Okay, I promise I will not keep further posting on this forum thread. Here is a potential "best practice" guide for how to convert VCF files to plink format in case anybody needs to do this: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html

freeseek

unread,
Jan 8, 2015, 2:59:46 PM1/8/15
to plink2...@googlegroups.com
I actually don't think sorting is such a great idea anymore. Would it be possible to revert to the previous behavior? If I convert from VCF to plink using the previous behavior and keeping the "--keep-allele-order" flag on, wouldn't that guarantee me that $1 corresponds to the alternate allele and $2 corresponds to the reference allele? I wanted to try this myself by removing this code from plink_data.c:
if (strcmp(bufptr4, bufptr5) <= 0) {
  memcpy(insert_buf[2], bufptr4, uii);
  insert_buf_len[2] = uii;
  memcpy(insert_buf[3], bufptr5, ujj);
  insert_buf_len[3] = ujj;
} else {
But I was not able to compile plink myself on my Ubuntu machine. I tried the following:
cd plink-ng/
./plink_first_compile
And I got the following error output:
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../liblapack.so when searching for -llapack
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../liblapack.a when searching for -llapack
/usr/bin/ld: skipping incompatible /usr/lib/liblapack.so when searching for -llapack
/usr/bin/ld: skipping incompatible /usr/lib/liblapack.a when searching for -llapack
/usr/bin/ld: cannot find -llapack
/usr/bin/ld: cannot find -lcblas
/usr/bin/ld: cannot find -latlas
But I don't understand as I have already these libraries installed on my machine.

Christopher Chang

unread,
Jan 8, 2015, 3:36:34 PM1/8/15
to plink2...@googlegroups.com
When cloning from GitHub, you probably want to copy Makefile.std over Makefile, and you may need to make a few other edits depending on how your system is configured.

A non-sorting --set-missing-var-ids does not belong in PLINK 1.9, since using it correctly is barely any simpler than writing a Unix shell script for the job, and it's too easy to use it incorrectly.  I will consider removing the option of including allele codes in the IDs; this will mostly depend on whether the procedure generates any nonunique ID pairs for 1000 Genomes phase 1.
Reply all
Reply to author
Forward
0 new messages