Conversion minimac3-VCF to Oxford-gen PLINK2alpha

671 views
Skip to first unread message

Sander W. van der Laan

unread,
Aug 20, 2018, 5:59:29 AM8/20/18
to plink2-users
Hi,

I have imputed data using the Michigan Imputation Server. These look like this:

##fileformat=VCFv4.1
##filedate=2018.2.26
##source=Minimac3
##contig=<ID=22>
##FILTER=<ID=GENOTYPED,Description="Marker was genotyped AND imputed">
##FILTER=<ID=GENOTYPED_ONLY,Description="Marker was genotyped but NOT imputed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0277 0065 0177
22 16050115 22:16050115 G A . PASS AF=0.00639;MAF=0.00639;R2=0.00000 GT:DS:GP 0|0:0.013:0.987,0.013,0.000 0|0:0.013:0.987,0.013,0.000 0|0:0.013:0.987,0.013,0.000
22 16050213 22:16050213 C T . PASS AF=0.00759;MAF=0.00759;R2=0.00000 GT:DS:GP 0|0:0.015:0.985,0.015,0.000 0|0:0.015:0.985,0.015,0.000 0|0:0.015:0.985,0.015,0.000
22 16050568 22:16050568 C A . PASS AF=0.00040;MAF=0.00040;R2=0.00000 GT:DS:GP 0|0:0.001:0.999,0.001,0.000 0|0:0.001:0.999,0.001,0.000 0|0:0.001:0.999,0.001,0.000
22 16050607 22:16050607 G A . PASS AF=0.00100;MAF=0.00100;R2=0.00000 GT:DS:GP 0|0:0.002:0.998,0.002,0.000 0|0:0.002:0.998,0.002,0.000 0|0:0.002:0.998,0.002,0.000
22 16050627 22:16050627 G T . PASS AF=0.00040;MAF=0.00040;R2=0.00000 GT:DS:GP 0|0:0.001:0.999,0.001,0.000 0|0:0.001:0.999,0.001,0.000 0|0:0.001:0.999,0.001,0.000
22 16050654 22:16050654 A <CN0> . PASS AF=0.00180;MAF=0.00180;R2=0.00000 GT:DS:GP 0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000 0|0:0.004:0.996,0.004,0.000


I have converted these to .gen files so as to be able to work with QCTOOL/BGENIX/SNPTEST. Here is the log:

PLINK v2.00a2 64-bit (9 Jul 2018)
Options in effect:
  --export oxford
  --out cohort1.1kgp3.chr22
  --vcf cohort1.1kgp3.chr22.vcf.gz dosage=GP

Hostname: DLA001
Working directory: /Volumes/WDMyBook8Tb/impute_hrc/MICHIMP/COHORT1_MICHIMP_1000Gp3
Start time: Tue Jul 10 19:12:58 2018

Random number seed: 1531242778
65536 MiB RAM detected; reserving 32768 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 652195 variants scanned.
--vcf: cohort1.1kgp3.chr22-temporary.pgen + cohort1.1kgp3.chr22-temporary.pvar +
cohort1.1kgp3.chr22-temporary.psam written.
657 samples (0 females, 0 males, 657 ambiguous; 657 founders) loaded from
cohort1.1kgp3.chr22-temporary.psam.
652195 variants loaded from cohort1.1kgp3.chr22-temporary.pvar.
Note: No phenotype data present.
Writing cohort1.1kgp3.chr22.gen ... done.
Writing cohort1.1kgp3.chr22.sample ... done.

End time: Tue Jul 10 19:14:02 2018

This I the outputed .gen file:
22 22:16050115 16050115 A G 0 0.987 0.013 0 0.987 0.013
22 22:16050213 16050213 T C 0 0.985 0.015 0 0.985 0.015
22 22:16050568 16050568 A C 0 0.999 0.001 0 0.999 0.001
22 22:16050607 16050607 A G 0 0.998 0.002 0 0.998 0.002
22 22:16050627 16050627 T G 0 0.999 0.001 0 0.999 0.001
22 22:16050654 16050654 <CN0> A 0 0.996 0.004 0 0.996 0.004
22 22:16050654 16050654 <CN2> A 0 0.966 0.034 0 0.966 0.034
22 22:16050654 16050654 <CN3> A 0 0.775 0.225 0 0.775 0.225
22 22:16050654 16050654 <CN4> A 0 0.992 0.008 0 0.992 0.008
22 22:16050678 16050678 T C 0 0.999 0.001 0 0.999 0.001

I noticed that this doesn't conform to the format as described here: http://www.stats.ox.ac.uk/%7Emarchini/software/gwas/file_format.html

Am I missing something or is this an omission?

Thanks

Sander

Christopher Chang

unread,
Aug 20, 2018, 10:51:40 AM8/20/18
to plink2-users
--vcf dosage=GP is broken; will post a fix today.

Christopher Chang

unread,
Aug 20, 2018, 12:57:37 PM8/20/18
to plink2-users
"--vcf dosage=GP" bugfix is now on GitHub; will post updated binaries tonight, and add an automated test for this case.  The internal flag which specified GP instead of DS-style dosage stopped being set properly on 7 May, so any "--vcf dosage=GP" runs with a build from the last three months will need to be repeated; very sorry about the inconvenience.

Aside from this bug, though, the .gen file should be as conformant as any other .gen files in the wild, since due to the lack of a dedicated chromosome column and the redundancy between "SNP ID" and "RS ID", it's common to use the SNP ID column to store chromosome codes; can you clarify what the problem is?

A few other notes:
- If you care about the actual genotype probabilities, rather than just the dosages, you should not use plink2 to convert the GP field to .gen/.bgen-format.  All versions of plink collapse genotype probabilities down to dosages, even when performing simple data conversion commands like yours; the only additional thing you can do with genotype probability input is --import-dosage-certainty filtering.  For most GWAS studies, this collapsing is generally accepted to be the right thing to do: see e.g. Zheng J, Li Y, Abecasis G, Scheet P (2011) A Comparison of Approaches to Account for Uncertainty in Analysis of Imputed Genotypes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3143715/ ).  But if you're working in the large-effect, low-imputation-accuracy setting where plink's behavior is suboptimal, you'll need to use another program to convert VCFs to .gen/.bgen.
(While I can write a one-off tool that performs this conversion in a lossless manner at a comparable speed to "plink2 --vcf --export {b}gen", it's more appropriate for someone who writes other genotype-probability-handling software to be responsible for this.  I'm happy to explain all details of plink2's VCF-parser and .gen/.bgen exporters to the developer(s) who want to do this.)
- In the usual case where dosages are good enough, you'll probably want to switch to "--export bgen-1.2" when dealing with large datasets.  But only after the kinks have been wrung out of the pipeline, of course; easier to see when things are going wrong with good old text files.

On Monday, August 20, 2018 at 2:59:29 AM UTC-7, Sander W. van der Laan wrote:

Sander W. van der Laan

unread,
Aug 21, 2018, 7:22:10 AM8/21/18
to plink2-users
Hi,

Thanks for the quick fix. Great work!

In the end I want a .gen file which I can use with SNPTEST/QCTOOL etc. Dosages are probably fine indeed, but: all the other data was generated using IMPUTE2, and well, call me a stickler-for-details, I'd rather have all data in the end in the same format - thus genotype probabilities. In other words: I need a minimac3 to Oxford-style converter that handles genotype probabilities. 

So, do I understand correctly: when I use Oxford-style data (i.e. imputed using IMPUTE2) PLINK will handle the probabilities as dosages? 

Best,

Sander

Christopher Chang

unread,
Aug 21, 2018, 11:47:53 AM8/21/18
to plink2-users
That is correct.  The probabilities are immediately collapsed down to a dosage after --import-dosage-certainty filtering, and then expanded to a minimum-uncertainty triple of genotype probabilities that correspond to the correct dosage during .gen export.  So if you're using plink2 for this conversion at all, you may as well use "--vcf dosage=DS".  I went ahead and changed plink2 yesterday to actually error out if you try to use "--vcf dosage=GP" here, to make this more obvious.

You need to use another program for the conversion if you want to keep all the GP data.
Reply all
Reply to author
Forward
0 new messages