vcf to plink format with missing rsIDs


Yang Luo

Jan 29, 2014, 5:39:30 AM
to plink2...@googlegroups.com
Hi,

I'm converting a VCF file to PLINK binary format using the following command:

plink --vcf test.vcf.gz --biallelic-only --make-bed --out test

However, the ID field in test.vcf.gz is missing:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER
1       1       .       C       A       501     PASS
1       50      .       C       T       415     PASS
1       100     .       C       T       999     PASS
1       302     .       T       TA      999     PASS
1       308     .       C       T       43.80   FAIL
1       400     .       T       C       102     PASS

The resulting test.bim file gives:

1       .       0       1       A       C
1       .       0       50      T       C
1       .       0       100     T       C
1       .       0       302     TA      T
1       .       0       308     T       C
1       .       0       400     C       T


Is there a feature in PLINK 1.90 that can detect missing IDs in the VCF file and output 'CHR-POS' as the default variant ID? My VCFs are quite big, so it would be great if I could keep intermediate file manipulation to a minimum. I imagine this would be quite similar to the --id-delim flag used when converting sample IDs to FIDs and IIDs. It would be very useful when using a VCF as direct input and performing tasks such as pruning, so that the output isn't a list of variant IDs all equal to '.'.

Many thanks,

Yang

Christopher Chang

Jan 29, 2014, 5:50:09 AM
to plink2...@googlegroups.com
I'll try to add this tomorrow.  How about the following:

--vcf-missing-var-id [template string]

The template string is expected to have one '^' and one '@'; the '^' is replaced with the chromosome string as it appears in the VCF file, and the '@' is replaced with the bp position.  For example, "--vcf-missing-var-id ^-@" corresponds to CHR-POS, while "--vcf-missing-var-id ^:@" corresponds to CHR:POS.
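
As a rough illustration of the proposed template semantics (not PLINK's actual implementation), the substitution could be sketched in Python as follows. The function name expand_template is hypothetical; '^' stands for the chromosome string and '@' for the bp position, as described above.

```python
def expand_template(template: str, chrom: str, pos: int) -> str:
    """Hypothetical helper: build a variant ID from a '^'/'@' template."""
    # The proposal expects exactly one '^' and one '@' in the template.
    if template.count("^") != 1 or template.count("@") != 1:
        raise ValueError("template requires exactly one '^' and one '@'")
    # Substitute the position first: it is numeric, so it cannot introduce
    # a stray '^', and a chromosome string containing '@' is left untouched.
    return template.replace("@", str(pos)).replace("^", chrom)


if __name__ == "__main__":
    print(expand_template("^-@", "1", 302))  # CHR-POS style
    print(expand_template("^:@", "1", 302))  # CHR:POS style
```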

Yang Luo

Jan 29, 2014, 8:11:41 AM
to plink2...@googlegroups.com
That would be awesome. Thanks,

Yang

Christopher Chang

Jan 30, 2014, 5:01:20 AM
to plink2...@googlegroups.com
This is now implemented as --set-missing-var-ids, since VCF imports are not the only situation where it comes up.  The bp coordinate is now designated by '#' instead of '@', since it's easy for me to imagine someone wanting to include an '@' in the ID, but I can change it to something else if '#' also interferes with existing pipelines.
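
For readers who want to see the effect on a .bim file, here is a minimal Python sketch of the behaviour described in this thread, using the '^' and '#' placeholders. It is not PLINK's code: the helper name fill_missing_ids is hypothetical, and treating only '.' as a missing ID is an assumption.

```python
def fill_missing_ids(bim_lines, template="^:#"):
    """Hypothetical helper: replace '.' variant IDs in .bim-style rows."""
    if template.count("^") != 1 or template.count("#") != 1:
        raise ValueError("template requires exactly one '^' and one '#'")
    out = []
    for line in bim_lines:
        fields = line.split()
        # .bim columns: chromosome, variant ID, cM position,
        # bp position, allele 1, allele 2
        chrom, var_id, bp = fields[0], fields[1], fields[3]
        if var_id == ".":  # assumption: '.' is the only missing-ID marker
            fields[1] = template.replace("#", bp).replace("^", chrom)
        out.append("\t".join(fields))
    return out


if __name__ == "__main__":
    rows = [
        "1      .       0       302     TA      T",
        "1      .       0       400     C       T",
    ]
    for row in fill_missing_ids(rows):
        print(row)
```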

DaveK

Feb 13, 2014, 6:33:51 AM
to plink2...@googlegroups.com
Hi guys,

I'm getting an error with this. Here's what I put in...

plink --vcf genomes.vcf.gz --set-missing-snp-ids ^:#[b37] --memory 7000

..and here's what I get...

Error: the set-missing-snp-ids template string requires exactly one '^' and one '#'.

There's definitely only one of each of ^ and # in the argument.

Christopher Chang

Feb 13, 2014, 6:43:26 AM
to plink2...@googlegroups.com
Argh, I didn't realize that I had made a Windows-unfriendly choice: ^ in the Windows command prompt acts like \ does in Unix.  I'll try to find an alternative that doesn't confuse anyone.

In the meantime, if you change the parameter to ^^:#[b37] it'll work for now.
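
To see why the doubled caret helps, here is a deliberately simplified Python model of how the Windows command prompt treats '^' in an argument: outside double quotes, the caret is dropped and the character after it is taken literally. The function name cmd_unescape_carets is hypothetical, and the model ignores the rest of cmd.exe's parsing rules (line continuation, %VAR% expansion, and so on).

```python
def cmd_unescape_carets(arg: str) -> str:
    """Simplified model of caret handling in the Windows command prompt."""
    out = []
    in_quotes = False
    i = 0
    while i < len(arg):
        ch = arg[i]
        if ch == '"':
            # Carets inside double quotes are kept literally.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "^" and not in_quotes:
            # Drop the caret; the next character (if any) is literal.
            i += 1
            if i < len(arg):
                out.append(arg[i])
        else:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Under this model the single caret is consumed before the program
    # sees the template, while the doubled caret leaves one behind.
    print(cmd_unescape_carets("^:#[b37]"))
    print(cmd_unescape_carets("^^:#[b37]"))
```

Under this model, the template from the failing command reaches the program without any '^', which is consistent with the error message above.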