PLINK and 1000 Genomes VCF files

John Gennari

Nov 23, 2021, 12:51:20 AM
to plink2-users
Greetings. I've been using PLINK v1.9 to get frequencies of SNPs in the 1000 Genomes vcf files. E.g. "PLINK --freqx --vcf <filename.vcf.gz> --snp --rs429358 --out APOE-freq"

This used to work on the older "phase 3" files from 1000 Genomes, but I note that they've recently used a new reference genome (I think), and all of the vcf files are now labeled "v5b", whereas they used to be called "v5a". But maybe that's not related. 

Can anyone help with the command above? E.g. the "filename" above for chr 19 would be:
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

Thanks in advance for any help.

-John G.

Christopher Chang

Nov 23, 2021, 12:00:46 PM
to plink2-users
Remove the dashes in front of "rs429358".

John Gennari

Nov 23, 2021, 12:29:21 PM
to plink2-users
Whoops, I mistyped. No, I didn't have any dashes in front of "rs429358". The error message I get is:

Proj3>plink --freqx --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz --snp rs429358 --out newchr19
PLINK v1.90b6.17 64-bit (28 Apr 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to newchr19.log.
Options in effect:
  --freqx
  --out newchr19
  --snp rs429358
  --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

8000 MB RAM detected; reserving 4000 MB for main workspace.
--vcf: newchr19-temporary.bed + newchr19-temporary.bim + newchr19-temporary.fam
written.
Error: --snp variant 'rs429358' not found.

Christopher Chang

Nov 23, 2021, 1:22:21 PM
to plink2-users
Okay, I took a look at the v5b files.  Unlike the v5a files, rsIDs were not filled in; if you want to use them, you have to fill them in yourself.  (They also happen to be incorrectly double-gzipped.)
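
Both observations can be checked on a downloaded file before handing it to PLINK. What follows is a minimal sketch using only the Python standard library. The helper names (`is_double_gzipped`, `open_vcf_text`, `count_missing_ids`) are made up for illustration and are not part of PLINK; the sketch assumes a tab-delimited VCF whose third column is the ID field, with "." meaning "no ID".

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_double_gzipped(path):
    """Return True if decompressing `path` once yields another gzip stream."""
    with gzip.open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_vcf_text(path):
    """Open a gzipped VCF as text, peeling off one extra gzip layer if present."""
    if is_double_gzipped(path):
        # gzip.open accepts an already-open binary file object,
        # so the two layers can simply be nested.
        return gzip.open(gzip.open(path, "rb"), "rt")
    return gzip.open(path, "rt")


def count_missing_ids(path, limit=None):
    """Count data lines whose ID column is '.'.

    Returns (missing, total). `limit` optionally caps the number of
    data lines read, which is handy for a quick look at a large file.
    """
    missing = total = 0
    with open_vcf_text(path) as fh:
        for line in fh:
            if line.startswith("#"):  # meta-information and header lines
                continue
            fields = line.split("\t", 3)  # only CHROM, POS, ID are needed
            total += 1
            if fields[2] == ".":
                missing += 1
            if limit is not None and total >= limit:
                break
    return missing, total
```

Calling `count_missing_ids(path, limit=10000)` on a file gives a quick estimate of how many variants lack an ID. If nearly all of them do, a `--snp rsXXXX` lookup cannot succeed until the IDs are filled in.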

John Gennari

Nov 23, 2021, 3:14:27 PM
to plink2-users
Thanks so much! 
But what happened to the v5a files? Can I still get these? I can't find them anywhere on the 1000 Genomes site.

And how might I "fill in" the rsIDs with the v5b files? Is that a matter of looking up a particular spot and variation on the chromosome and matching it against some rsID resource? Ugh. Sounds challenging to me...

-John G.

Christopher Chang

Nov 23, 2021, 3:21:51 PM
to plink2-users
1. I don't know whether the 1000 Genomes site still has them, but https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 has plink2-formatted v5a files.  (You will need to use PLINK 2.0 to read these.)
2. It takes a bit of scripting work, but you can fill them in by (i) downloading a suitable file from dbSNP (https://ftp.ncbi.nih.gov/snp/ ), and then (ii) postprocessing both that file and your main dataset into forms that allow --update-ids to do its job.
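
The postprocessing in step 2 could look roughly like the Python sketch below. It is one possible approach under several assumptions, not a method stated in this thread:

- The main dataset's variants already carry positional IDs of the form CHROM:POS:REF:ALT (for example, assigned with PLINK 2.0's `--set-all-var-ids @:#:$r:$a`).
- The dbSNP VCF uses the same genome build and the same chromosome names as the main dataset. Some dbSNP releases use RefSeq contig names instead, which would need translating first.
- The output is a two-column "old ID, new ID" text file. The PLINK documentation lists `--update-name` as the flag that renames variants from such a file, while `--update-ids` renames samples, so it is worth checking which flag your version expects.

All function names here are hypothetical helpers.

```python
def positional_id(chrom, pos, ref, alt):
    """Build the CHROM:POS:REF:ALT ID assumed to be used in the main dataset."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def read_bim_ids(bim_lines):
    """Collect variant IDs (2nd column) from the lines of a .bim file."""
    return {line.split()[1] for line in bim_lines if line.strip()}


def build_rsid_map(dbsnp_lines, wanted_ids):
    """Map positional IDs found in `wanted_ids` to dbSNP rsIDs.

    `dbsnp_lines` is any iterable of VCF text lines. Multi-allelic dbSNP
    records are split per ALT allele. Positional IDs that match more than
    one distinct rsID are dropped rather than guessed at.
    """
    mapping = {}
    ambiguous = set()
    for line in dbsnp_lines:
        if line.startswith("#"):
            continue
        chrom, pos, rsid, ref, alts = line.rstrip("\n").split("\t")[:5]
        if not rsid.startswith("rs"):
            continue
        for alt in alts.split(","):
            key = positional_id(chrom, pos, ref, alt)
            if key not in wanted_ids:
                continue
            if key in mapping and mapping[key] != rsid:
                ambiguous.add(key)
            else:
                mapping[key] = rsid
    for key in ambiguous:
        del mapping[key]
    return mapping


def write_update_file(mapping, out):
    """Write 'old_id<TAB>new_id' lines, sorted by old ID, to a text stream."""
    for old_id, new_id in sorted(mapping.items()):
        out.write(f"{old_id}\t{new_id}\n")
```

Restricting the map to IDs that actually occur in the dataset keeps memory use manageable, since a full dbSNP release is far larger than a single-chromosome 1000 Genomes file. Variants with no dbSNP match simply keep their positional ID.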